# CLAUDE.md

Project context file for **AI Data Analyst**, a local analytics environment with access to your organization's internal data.

## Quick Status

| Property | Value |
|----------|-------|
| **Project Type** | AI Data Analyst |
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
| **Data Source** | {ssh_alias} server ({server_host}) |
| **Data Format** | Parquet files in `server/parquet/` |
| **Analyst** | {username} |

---

## CRITICAL: Always Start Here

### 1. Sync Data When Starting

**MANDATORY: Run the sync automatically in any of these situations:**
- This is a new session (the first interaction today)
- The session was started on a previous day
- The data may be stale (it is updated on the server several times a day)
- The user explicitly requests fresh data

```bash
bash server/scripts/sync_data.sh
```

The sync updates the data, scripts, documentation, and this `CLAUDE.md` file.
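
One possible way to judge whether the local data may be stale is to look at the newest modification time under `server/parquet/`. The sketch below is illustrative only: the helper names and the 12-hour threshold are assumptions, not part of the sync tooling, and it assumes the sync refreshes file modification times, which this file does not confirm.

```python
import time
from pathlib import Path


def newest_parquet_age_hours(parquet_dir="server/parquet", now=None):
    """Return hours since the newest .parquet file was modified, or None if none exist."""
    root = Path(parquet_dir)
    if not root.is_dir():
        return None
    files = list(root.rglob("*.parquet"))
    if not files:
        return None
    newest = max(f.stat().st_mtime for f in files)
    now = time.time() if now is None else now
    return (now - newest) / 3600.0


def should_sync(age_hours, max_age_hours=12.0):
    """Hypothetical rule: sync when no data is present or it is older than the threshold."""
    return age_hours is None or age_hours > max_age_hours
```

When in doubt, running `bash server/scripts/sync_data.sh` anyway is the safer choice; the check is only a hint.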

### 2. Read Schema Documentation Before Writing SQL

**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**

#### For table structure (columns, types, descriptions):

```bash
# ALWAYS read this FIRST before querying tables
cat server/docs/schema.yml
```

- **NEVER use `DESCRIBE`, `SHOW COLUMNS`, or similar commands to discover table structure**
- **NEVER guess column names**
- `schema.yml` contains all column names, types, descriptions, and primary keys

#### For table relationships (joins, foreign keys):

```bash
# Read this for understanding relationships between tables
cat server/docs/data_description.md
```

- Contains primary/foreign keys, sync strategies, and table descriptions
- Essential for writing correct JOIN queries

#### For additional dataset schemas (if available):

```bash
# Check for additional dataset schemas
ls server/docs/datasets/ 2>/dev/null
```
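
The documentation checks above can also be done from Python. This is a minimal sketch using only the paths named in this file; the function name `docs_inventory` and the shape of its return value are hypothetical.

```python
from pathlib import Path

# File names taken from this document; everything else here is illustrative.
REQUIRED_DOCS = ("schema.yml", "data_description.md")


def docs_inventory(docs_dir="server/docs"):
    """Report which required docs are missing and list any additional dataset docs."""
    docs = Path(docs_dir)
    missing = [name for name in REQUIRED_DOCS if not (docs / name).is_file()]
    datasets_dir = docs / "datasets"
    # datasets/ is optional, so an absent directory simply yields an empty list.
    datasets = sorted(p.name for p in datasets_dir.iterdir()) if datasets_dir.is_dir() else []
    return {"missing": missing, "datasets": datasets}
```

If `missing` is non-empty, a sync is probably needed before any SQL is written.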

### 3. Read Metrics Definitions (if available)

**Before calculating ANY business metric, check for metric definitions:**

```bash
# Check if metrics index exists
cat server/docs/metrics/metrics.yml 2>/dev/null

# Or list available metric files
ls server/docs/metrics/ 2>/dev/null
```

If metric definitions exist, always read the specific metric file before calculating.
Do not calculate metrics from memory; the formulas contain critical details.

---

## Directory Structure

```
project_root/
├── server/                         # READ-ONLY - synced from server
│   ├── docs/                       # Documentation
│   │   ├── data_description.md     # Table relationships and descriptions
│   │   ├── schema.yml              # Table schemas and column definitions
│   │   ├── metrics/                # Metric definitions (if available)
│   │   └── datasets/               # Additional dataset docs (if available)
│   ├── scripts/                    # Helper scripts (sync_data.sh, setup_views.sh)
│   ├── examples/                   # Example scripts (if available)
│   └── parquet/                    # Synced parquet data files
│
├── user/                           # YOUR WORKSPACE - never overwritten
│   ├── duckdb/                     # DuckDB database (analytics.duckdb)
│   ├── artifacts/                  # Analysis outputs, charts, exports
│   └── scripts/                    # Your custom scripts
│
├── .claude/                        # Claude Code config
├── .venv/                          # Python virtual environment
├── CLAUDE.md                       # This file (auto-updated from server)
└── CLAUDE.local.md                 # Your personal notes (never overwritten)
```

**Never modify files in `server/` - they are overwritten on every sync.**
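
To keep outputs out of `server/`, scripts can route every write through a small guard. The sketch below is one way to do that, assuming outputs belong under `user/artifacts/` as shown in the tree above; the helper name `artifact_path` is hypothetical.

```python
from pathlib import Path


def artifact_path(filename, project_root="."):
    """Return a path under user/artifacts/, refusing anything that escapes it."""
    root = Path(project_root).resolve()
    artifacts = root / "user" / "artifacts"
    target = (artifacts / filename).resolve()
    # Reject names such as "../../server/docs/schema.yml" that climb out of artifacts/.
    if artifacts not in target.parents:
        raise ValueError(f"{filename!r} resolves outside user/artifacts/")
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

For example, `artifact_path("charts/revenue.png")` returns a writable location and creates `user/artifacts/charts/` if needed.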

---

## Essential Commands

```bash
# Data freshness and sync
bash server/scripts/sync_data.sh            # Sync latest data from server

# DuckDB management
bash server/scripts/setup_views.sh          # Recreate DuckDB views

# Python environment
source .venv/bin/activate                   # Activate venv (macOS/Linux)
source .venv/Scripts/activate               # Activate venv (Windows, Git Bash)
```

---

## Quick Start

### List all tables

```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
tables = con.execute("SHOW TABLES;").fetchall()
for table in tables:
    print(table[0])
con.close()
```

### Query data

```bash
# Read schema first, then query
cat server/docs/schema.yml
```

```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
# Write your query based on schema.yml column definitions
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
print(result)
con.close()
```

---

## Startup Checklist

When starting a new session:

1. **Sync latest data**
   ```bash
   bash server/scripts/sync_data.sh
   ```

2. **Verify database exists**
   ```bash
   ls -lh user/duckdb/analytics.duckdb
   ```

You're ready to analyze!
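
The database check in step 2 can also be scripted. This is a minimal sketch, assuming the database path given in Quick Status; the function name and the three status strings are illustrative choices.

```python
from pathlib import Path


def database_status(db_path="user/duckdb/analytics.duckdb"):
    """Return 'missing', 'empty', or 'ok' for the DuckDB file."""
    path = Path(db_path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    return "ok"
```

A result other than `'ok'` suggests re-running the sync before any analysis.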

---

## Important Reminders

- Always read `server/docs/schema.yml` before writing SQL queries
- Always read `server/docs/data_description.md` for table relationships and joins
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
- Use DuckDB views, not direct parquet file reads
- Never modify files in `server/` - they're read-only

---

## Reporting Issues

Report issues to your platform team or the project's issue tracker.

Include:
- Error messages or unexpected behavior
- Steps to reproduce
- Output of `bash server/scripts/sync_data.sh`
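
To attach the sync output to a report, the command can be run from Python with its output captured. The sketch below uses only the standard library; the helper name `capture_output` is hypothetical.

```python
import subprocess


def capture_output(command):
    """Run a command and return its exit code with stdout and stderr combined."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave errors with normal output
        text=True,
        check=False,  # a failing sync is exactly what we want to capture
    )
    return result.returncode, result.stdout
```

A call such as `code, log = capture_output(["bash", "server/scripts/sync_data.sh"])` gives the exit code and the full log text, which can then be pasted into the issue.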