agnes-the-ai-analyst/docs/setup/claude_md_template.txt

# CLAUDE.md

Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data.

## Quick Status

| Property | Value |
|----------|-------|
| **Project Type** | AI Data Analyst |
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
| **Data Source** | {ssh_alias} server ({server_host}) |
| **Data Format** | Parquet files in `server/parquet/` |
| **Analyst** | {username} |

---

## CRITICAL: Always Start Here

### 1. Sync Data When Starting

**MANDATORY: Automatically run sync in these situations:**
- This is a new session (first interaction today)
- The session is from a previous day or older
- Data may be stale (updated multiple times daily on server)
- The user explicitly requests fresh data

```bash
bash server/scripts/sync_data.sh
```

This updates data, scripts, documentation, and CLAUDE.md.

### 2. Read Schema Documentation Before Writing SQL

**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**

#### For table structure (columns, types, descriptions):

```bash
# ALWAYS read this FIRST before querying tables
cat server/docs/schema.yml
```

- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
- **NEVER guess column names**
- schema.yml contains: all column names, types, descriptions, primary keys

#### For table relationships (joins, foreign keys):

```bash
# Read this for understanding relationships between tables
cat server/docs/data_description.md
```

- Contains primary/foreign keys, sync strategies, and table descriptions
- Essential for writing correct JOIN queries

#### For additional dataset schemas (if available):

```bash
# Check for additional dataset schemas
ls server/docs/datasets/ 2>/dev/null
```

### 3. Read Metrics Definitions (if available)

**Before calculating ANY business metric, check for metric definitions:**

```bash
# Check if metrics index exists
cat server/docs/metrics/metrics.yml 2>/dev/null

# Or list available metric files
ls server/docs/metrics/ 2>/dev/null
```

If metric definitions exist, always read the specific metric file before calculating.
Do not calculate metrics from memory - the formulas contain critical details.

---

## Directory Structure

```
project_root/
├── server/                         # READ-ONLY - synced from server
│   ├── docs/                       # Documentation
│   │   ├── data_description.md     # Table relationships and descriptions
│   │   ├── schema.yml              # Table schemas and column definitions
│   │   ├── metrics/                # Metric definitions (if available)
│   │   └── datasets/               # Additional dataset docs (if available)
│   ├── scripts/                    # Helper scripts (sync_data.sh, setup_views.sh)
│   ├── examples/                   # Example scripts (if available)
│   └── parquet/                    # Synced parquet data files
│
├── user/                           # YOUR WORKSPACE - never overwritten
│   ├── duckdb/                     # DuckDB database (analytics.duckdb)
│   ├── artifacts/                  # Analysis outputs, charts, exports
│   └── scripts/                    # Your custom scripts
│
├── .claude/                        # Claude Code config
├── .venv/                          # Python virtual environment
├── CLAUDE.md                       # This file (auto-updated from server)
└── CLAUDE.local.md                 # Your personal notes (never overwritten)
```

**Never modify files in `server/` - they are overwritten on every sync.**

---

## Essential Commands

```bash
# Data freshness and sync
bash server/scripts/sync_data.sh            # Sync latest data from server

# DuckDB management
bash server/scripts/setup_views.sh          # Recreate DuckDB views

# Python environment
source .venv/bin/activate                   # Activate venv (macOS/Linux)
.venv/Scripts/activate                      # Activate venv (Windows)
```

---

## Quick Start

### List all tables

```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
tables = con.execute("SHOW TABLES;").fetchall()
for table in tables:
    print(table[0])
con.close()
```

### Query data

```bash
# Read schema first, then query
cat server/docs/schema.yml
```

```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
# Write your query based on schema.yml column definitions
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
print(result)
con.close()
```

---

## Startup Checklist

When starting a new session:

1. **Sync latest data**
   ```bash
   bash server/scripts/sync_data.sh
   ```

2. **Verify database exists**
   ```bash
   ls -lh user/duckdb/analytics.duckdb
   ```

You're ready to analyze!

---

## Important Reminders

- Always read `server/docs/schema.yml` before writing SQL queries
- Always read `server/docs/data_description.md` for table relationships and joins
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
- Use DuckDB views, not direct parquet file reads
- Never modify files in `server/` - they're read-only

---

## Remote Queries (BigQuery)

Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
These tables have `query_mode: "remote"` in `server/docs/data_description.md`.

### How to recognize remote tables

Before writing any query, read `server/docs/data_description.md`. Each table has:
- `query_mode: "local"` -- available as a local DuckDB view (query normally)
- `query_mode: "remote"` -- NOT in local DuckDB, must use remote query protocol below
- `query_mode: "hybrid"` -- local view exists AND can query BQ for live data

### Remote table metadata in data_description.md

Remote tables include metadata to help you write safe queries:

- **`volume`** -- rows_per_day, unique entities per day (tells you table size)
- **`columns`** -- column names, types, value distributions
- **`dimension_profile`** -- cardinality per dimension with value distributions
- **`query_result_estimates`** -- expected row counts after GROUP BY combinations
- **`join_keys`** -- how to join with other tables

**ALWAYS read these sections before writing a remote query.** Use `query_result_estimates`
to predict how many rows your query will return. The server has limited RAM -- keep BQ
sub-query results under 500K rows.

### Two-phase query protocol

Remote queries run **on the server** via SSH (server has DuckDB + Parquet + BigQuery access).
You write two SQL statements:

1. **BQ sub-query** (`--register-bq "alias=SQL"`) -- runs on BigQuery, result registered in DuckDB as a view.
   This MUST contain WHERE and/or GROUP BY to reduce the result set. Never SELECT * from a remote table.
2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
   Can JOIN local tables with registered BQ results.

### Command format (JSON via stdin -- ALWAYS use this)

**IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
backticks, quotes, and parentheses in SQL. Write a heredoc with the JSON query spec:

```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
  "register_bq": {
    "ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
  },
  "sql": "SELECT ... FROM ALIAS JOIN local_table ...",
  "format": "table"
}
QUERY
```

The `<<'QUERY'` heredoc passes SQL **literally** -- no escaping needed for backticks,
single quotes, parentheses, or any other special characters.

**JSON fields:**
- `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
- `"register_bq"` (optional) -- Object mapping alias names to BigQuery SQL queries
- `"format"` (optional) -- `"table"`, `"csv"`, `"json"`, or `"parquet"` (default: `"table"`)
- `"output"` (optional) -- File path for parquet/csv/json output
- `"max_rows"` (optional) -- Override max result rows

### Example 1: Remote-only query (aggregated data)

```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
  "register_bq": {
    "agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
  },
  "sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
  "format": "table"
}
QUERY
```

### Example 2: JOIN local + remote

```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
  "register_bq": {
    "remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
  },
  "sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
  "format": "table"
}
QUERY
```

### Example 3: Download result as Parquet for local analysis

```bash
# 1. Run query, save as Parquet on server
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
  "register_bq": {
    "remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
  },
  "sql": "SELECT ... FROM local_table JOIN remote_data ...",
  "format": "parquet",
  "output": "/tmp/remote_query/analysis.parquet"
}
QUERY

# 2. Download to local machine
scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/

# 3. Register in local DuckDB for further analysis
python3 -c "
import duckdb
conn = duckdb.connect('user/duckdb/analytics.duckdb')
conn.execute(\"CREATE OR REPLACE VIEW analysis AS SELECT * FROM read_parquet('user/parquet/analysis.parquet')\")
print('View created:', conn.execute('SELECT COUNT(*) FROM analysis').fetchone()[0], 'rows')
conn.close()
"
```

### How to estimate result sizes

Before writing a BQ sub-query, check `dimension_profile` and `query_result_estimates`
in `server/docs/data_description.md`.

**Rule of thumb:** rows = (estimate per day from query_result_estimates) * (number of days in WHERE clause).
If that exceeds 100K rows, add more aggregation or tighter date filters.

### Safety rules

1. **NEVER** run `SELECT * FROM remote_table` without WHERE + GROUP BY
2. **ALWAYS** check `dimension_profile` before writing BQ sub-queries
3. **ALWAYS** include date range in WHERE clause
4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
5. If the query might take > 60 seconds, use nohup pattern:
   ```bash
   # Write query to temp file, then run via nohup
   cat <<'QUERY' | ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &'
   {"register_bq": {"data": "SELECT ..."}, "sql": "SELECT ...", "format": "parquet", "output": "/tmp/remote_query/result.parquet"}
   QUERY
   ssh {ssh_alias} 'tail -5 /tmp/rq.log'  # check progress
   scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
   ```

---

## Reporting Issues

Report issues to your platform team or the project's issue tracker.

Include:
- Error messages or unexpected behavior
- Steps to reproduce
- Output of `bash server/scripts/sync_data.sh`