agnes-the-ai-analyst/docs/setup/claude_md_template.txt
Petr 67df4acd73 Add --stdin JSON mode to avoid shell escaping nightmare
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).

New --stdin mode reads query spec as JSON from stdin via heredoc:
  cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
  {"register_bq": {"alias": "SELECT ... FROM \`table\` ..."}, "sql": "..."}
  QUERY

Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.

Updated claude_md_template.txt to use --stdin as the primary method.
2026-03-21 12:15:50 +01:00


# CLAUDE.md
Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data.
## Quick Status
| Property | Value |
|----------|-------|
| **Project Type** | AI Data Analyst |
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
| **Data Source** | {ssh_alias} server ({server_host}) |
| **Data Format** | Parquet files in `server/parquet/` |
| **Analyst** | {username} |
---
## CRITICAL: Always Start Here
### 1. Sync Data When Starting
**MANDATORY: Automatically run sync in these situations:**
- This is a new session (first interaction today)
- The session is from a previous day or older
- Data may be stale (updated multiple times daily on server)
- The user explicitly requests fresh data
```bash
bash server/scripts/sync_data.sh
```
This updates data, scripts, documentation, and CLAUDE.md.
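The date-based triggers above can be sketched as a small check. This is an illustration only: the function name `needs_sync` and the idea of passing in a last-sync timestamp are assumptions, since nothing here says `sync_data.sh` records one. When in doubt about staleness, syncing is still the safe default.

```python
from datetime import datetime
from typing import Optional


def needs_sync(
    last_sync: Optional[datetime],
    now: datetime,
    user_requested: bool = False,
) -> bool:
    """Return True when one of the sync triggers listed above applies."""
    if user_requested:
        return True  # the user explicitly asked for fresh data
    if last_sync is None:
        return True  # no known sync yet: treat as a new session
    # Last sync happened on a previous calendar day
    return last_sync.date() < now.date()


# Hypothetical timestamps for illustration
print(needs_sync(datetime(2026, 3, 20, 23, 0), datetime(2026, 3, 21, 9, 0)))
print(needs_sync(datetime(2026, 3, 21, 8, 0), datetime(2026, 3, 21, 9, 0)))
```

The first call returns `True` (previous day), the second `False` (same day, no explicit request).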
### 2. Read Schema Documentation Before Writing SQL
**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**
#### For table structure (columns, types, descriptions):
```bash
# ALWAYS read this FIRST before querying tables
cat server/docs/schema.yml
```
- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
- **NEVER guess column names**
- schema.yml contains: all column names, types, descriptions, primary keys
#### For table relationships (joins, foreign keys):
```bash
# Read this for understanding relationships between tables
cat server/docs/data_description.md
```
- Contains primary/foreign keys, sync strategies, and table descriptions
- Essential for writing correct JOIN queries
#### For additional dataset schemas (if available):
```bash
# Check for additional dataset schemas
ls server/docs/datasets/ 2>/dev/null
```
### 3. Read Metrics Definitions (if available)
**Before calculating ANY business metric, check for metric definitions:**
```bash
# Check if metrics index exists
cat server/docs/metrics/metrics.yml 2>/dev/null
# Or list available metric files
ls server/docs/metrics/ 2>/dev/null
```
If metric definitions exist, always read the specific metric file before calculating.
Do not calculate metrics from memory - the formulas contain critical details.
---
## Directory Structure
```
project_root/
├── server/                      # READ-ONLY - synced from server
│   ├── docs/                    # Documentation
│   │   ├── data_description.md  # Table relationships and descriptions
│   │   ├── schema.yml           # Table schemas and column definitions
│   │   ├── metrics/             # Metric definitions (if available)
│   │   └── datasets/            # Additional dataset docs (if available)
│   ├── scripts/                 # Helper scripts (sync_data.sh, setup_views.sh)
│   ├── examples/                # Example scripts (if available)
│   └── parquet/                 # Synced parquet data files
├── user/                        # YOUR WORKSPACE - never overwritten
│   ├── duckdb/                  # DuckDB database (analytics.duckdb)
│   ├── artifacts/               # Analysis outputs, charts, exports
│   └── scripts/                 # Your custom scripts
├── .claude/                     # Claude Code config
├── .venv/                       # Python virtual environment
├── CLAUDE.md                    # This file (auto-updated from server)
└── CLAUDE.local.md              # Your personal notes (never overwritten)
```
**Never modify files in `server/` - they are overwritten on every sync.**
---
## Essential Commands
```bash
# Data freshness and sync
bash server/scripts/sync_data.sh # Sync latest data from server
# DuckDB management
bash server/scripts/setup_views.sh # Recreate DuckDB views
# Python environment
source .venv/bin/activate # Activate venv (macOS/Linux)
.venv/Scripts/activate # Activate venv (Windows)
```
---
## Quick Start
### List all tables
```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
tables = con.execute("SHOW TABLES;").fetchall()
for table in tables:
    print(table[0])
con.close()
```
### Query data
```bash
# Read schema first, then query
cat server/docs/schema.yml
```
```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
# Write your query based on schema.yml column definitions
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
print(result)
con.close()
```
---
## Startup Checklist
When starting a new session:
1. **Sync latest data**
```bash
bash server/scripts/sync_data.sh
```
2. **Verify database exists**
```bash
ls -lh user/duckdb/analytics.duckdb
```
You're ready to analyze!
---
## Important Reminders
- Always read `server/docs/schema.yml` before writing SQL queries
- Always read `server/docs/data_description.md` for table relationships and joins
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
- Use DuckDB views, not direct parquet file reads
- Never modify files in `server/` - they're read-only
---
## Remote Queries (BigQuery)
Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
These tables have `query_mode: "remote"` in `server/docs/data_description.md`.
### How to recognize remote tables
Before writing any query, read `server/docs/data_description.md`. Each table has:
- `query_mode: "local"` -- available as a local DuckDB view (query normally)
- `query_mode: "remote"` -- NOT in local DuckDB, must use remote query protocol below
- `query_mode: "hybrid"` -- a local DuckDB view exists, and BigQuery can also be queried for live data
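The three modes can be read as a routing decision. The sketch below is only an illustration: the function name `query_route` and the route labels are hypothetical, and the `query_mode` value is assumed to have been read from `server/docs/data_description.md` beforehand.

```python
def query_route(query_mode: str) -> str:
    """Map a table's query_mode to where its query should run."""
    routes = {
        "local": "local_duckdb",  # query the local DuckDB view
        "remote": "remote_protocol",  # must go through the remote query protocol
        "hybrid": "local_duckdb_or_remote_protocol",  # either, depending on freshness needs
    }
    try:
        return routes[query_mode]
    except KeyError:
        raise ValueError(f"unknown query_mode: {query_mode!r}") from None


print(query_route("remote"))
```

Raising on an unknown value, rather than defaulting to local, avoids silently querying a view that does not exist.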
### Remote table metadata in data_description.md
Remote tables include metadata to help you write safe queries:
- **`volume`** -- rows_per_day, unique entities per day (tells you table size)
- **`columns`** -- column names, types, value distributions
- **`dimension_profile`** -- cardinality per dimension with value distributions
- **`query_result_estimates`** -- expected row counts after GROUP BY combinations
- **`join_keys`** -- how to join with other tables
**ALWAYS read these sections before writing a remote query.** Use `query_result_estimates`
to predict how many rows your query will return. The server has limited RAM -- keep BQ
sub-query results under 500K rows.
### Two-phase query protocol
Remote queries run **on the server** via SSH (server has DuckDB + Parquet + BigQuery access).
You write two SQL statements:
1. **BQ sub-query** (`--register-bq "alias=SQL"`) -- runs on BigQuery, result registered in DuckDB as a view.
This MUST contain WHERE and/or GROUP BY to reduce the result set. Never SELECT * from a remote table.
2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
Can JOIN local tables with registered BQ results.
### Command format (JSON via stdin -- ALWAYS use this)
**IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
backticks, quotes, and parentheses in SQL. Write a heredoc with the JSON query spec:
```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
"register_bq": {
"ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
},
"sql": "SELECT ... FROM ALIAS JOIN local_table ...",
"format": "table"
}
QUERY
```
The `<<'QUERY'` heredoc passes SQL **literally** -- no escaping needed for backticks,
single quotes, parentheses, or any other special characters.
**JSON fields:**
- `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
- `"register_bq"` (optional) -- Object mapping alias names to BigQuery SQL queries
- `"format"` (optional) -- `"table"`, `"csv"`, `"json"`, or `"parquet"` (default: `"table"`)
- `"output"` (optional) -- File path for parquet/csv/json output
- `"max_rows"` (optional) -- Override max result rows
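When the query spec is assembled programmatically, building it as a dictionary and serializing it with `json.dumps` keeps the quoting correct. The following is a minimal sketch, not part of `remote_query.sh`: the helper name `build_query_spec` and its parameters are assumptions that mirror the JSON fields listed above, and the table and column names are placeholders.

```python
import json
from typing import Dict, Optional

ALLOWED_FORMATS = {"table", "csv", "json", "parquet"}


def build_query_spec(
    sql: str,
    register_bq: Optional[Dict[str, str]] = None,
    fmt: str = "table",
    output: Optional[str] = None,
    max_rows: Optional[int] = None,
) -> str:
    """Serialize a query spec using only the JSON fields listed above."""
    if not sql.strip():
        raise ValueError('"sql" is required')
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    spec = {"sql": sql, "format": fmt}
    if register_bq:
        spec["register_bq"] = dict(register_bq)
    if output is not None:
        spec["output"] = output
    if max_rows is not None:
        spec["max_rows"] = max_rows
    return json.dumps(spec, indent=2)


spec_text = build_query_spec(
    sql="SELECT * FROM agg_data ORDER BY date_col",
    register_bq={
        # Placeholder table and columns; backticks need no escaping in JSON
        "agg_data": "SELECT date_col, SUM(metric) AS total FROM `project.dataset.table` "
        "WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1"
    },
)
print(spec_text)
```

The resulting string can be saved to a file and piped into the same `ssh ... --stdin` command, or passed from Python with `subprocess.run([...], input=spec_text, text=True)`. In both cases the SQL travels on stdin and is never parsed by a shell.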
### Example 1: Remote-only query (aggregated data)
```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
"register_bq": {
"agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
},
"sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
"format": "table"
}
QUERY
```
### Example 2: JOIN local + remote
```bash
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
"register_bq": {
"remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
},
"sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
"format": "table"
}
QUERY
```
### Example 3: Download result as Parquet for local analysis
```bash
# 1. Run query, save as Parquet on server
cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
{
"register_bq": {
"remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
},
"sql": "SELECT ... FROM local_table JOIN remote_data ...",
"format": "parquet",
"output": "/tmp/remote_query/analysis.parquet"
}
QUERY
# 2. Download to local machine (user/parquet/ is not part of the default layout, so create it first)
mkdir -p user/parquet
scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
# 3. Register in local DuckDB for further analysis
python3 -c "
import duckdb
conn = duckdb.connect('user/duckdb/analytics.duckdb')
conn.execute(\"CREATE OR REPLACE VIEW analysis AS SELECT * FROM read_parquet('user/parquet/analysis.parquet')\")
print('View created:', conn.execute('SELECT COUNT(*) FROM analysis').fetchone()[0], 'rows')
conn.close()
"
```
### How to estimate result sizes
Before writing a BQ sub-query, check `dimension_profile` and `query_result_estimates`
in `server/docs/data_description.md`.
**Rule of thumb:** rows = (estimate per day from query_result_estimates) * (number of days in WHERE clause).
If that exceeds 100K rows, add more aggregation or tighter date filters.
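The rule of thumb is plain multiplication. Here is a short worked sketch: the figure of 4,000 rows per day is hypothetical and stands in for a value read from `query_result_estimates`, and the helper names are illustrative.

```python
ROW_THRESHOLD = 100_000  # threshold from the rule of thumb above


def estimate_rows(rows_per_day: int, days: int) -> int:
    """Per-day estimate multiplied by the number of days in the WHERE clause."""
    return rows_per_day * days


def needs_tightening(rows_per_day: int, days: int, threshold: int = ROW_THRESHOLD) -> bool:
    """True when the estimate exceeds the threshold."""
    return estimate_rows(rows_per_day, days) > threshold


# Hypothetical figure: 4,000 rows per day for the chosen GROUP BY combination
print(estimate_rows(4_000, 30), needs_tightening(4_000, 30))
print(estimate_rows(4_000, 7), needs_tightening(4_000, 7))
```

With these figures, a 30-day window gives 120,000 rows and should be tightened, while a 7-day window gives 28,000 rows and fits.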
### Safety rules
1. **NEVER** run `SELECT *` against a remote table; every BQ sub-query must reduce the result set with WHERE and/or GROUP BY
2. **ALWAYS** check `dimension_profile` before writing BQ sub-queries
3. **ALWAYS** include date range in WHERE clause
4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
5. If the query might take > 60 seconds, use nohup pattern:
```bash
# Write query to temp file, then run via nohup
cat <<'QUERY' | ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &'
{"register_bq": {"data": "SELECT ..."}, "sql": "SELECT ...", "format": "parquet", "output": "/tmp/remote_query/result.parquet"}
QUERY
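ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress; repeat until the job has finished
mkdir -p user/parquet # target directory is not part of the default layout
scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/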
```
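Some of the safety rules can be pre-checked mechanically before a BQ sub-query is sent. The sketch below is a rough heuristic: `check_bq_subquery` is a hypothetical helper, it uses regular expressions rather than a SQL parser, and it cannot verify that the WHERE clause actually contains a date range or that the result stays under the row limits.

```python
import re
from typing import List


def check_bq_subquery(sql: str) -> List[str]:
    """Return a list of safety-rule warnings for a BQ sub-query (empty if none found)."""
    problems = []
    if re.search(r"\bSELECT\s+\*", sql, re.IGNORECASE):
        problems.append("avoid SELECT * on a remote table")
    if not re.search(r"\bWHERE\b", sql, re.IGNORECASE):
        problems.append("missing WHERE clause (include a date range)")
    if not re.search(r"\bGROUP\s+BY\b", sql, re.IGNORECASE):
        problems.append("no GROUP BY: confirm the result stays small")
    return problems


# Placeholder table and column names
good = (
    "SELECT date_col, SUM(metric) AS total FROM `project.dataset.table` "
    "WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1"
)
bad = "SELECT * FROM `project.dataset.table`"

print(check_bq_subquery(good))
print(check_bq_subquery(bad))
```

An empty list means only that no obvious violation was found; `dimension_profile` and `query_result_estimates` still need to be checked by hand.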
---
## Reporting Issues
Report issues to your platform team or the project's issue tracker.
Include:
- Error messages or unexpected behavior
- Steps to reproduce
- Output of `bash server/scripts/sync_data.sh`