# CLAUDE.md Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data. ## Quick Status | Property | Value | |----------|-------| | **Project Type** | AI Data Analyst | | **Database** | DuckDB at `user/duckdb/analytics.duckdb` | | **Data Source** | {ssh_alias} server ({server_host}) | | **Data Format** | Parquet files in `server/parquet/` | | **Analyst** | {username} | --- ## CRITICAL: Always Start Here ### 1. Sync Data When Starting **MANDATORY: Automatically run sync in these situations:** - This is a new session (first interaction today) - The session is from a previous day or older - Data may be stale (updated multiple times daily on server) - The user explicitly requests fresh data ```bash bash server/scripts/sync_data.sh ``` This updates data, scripts, documentation, and CLAUDE.md. ### 2. Read Schema Documentation Before Writing SQL **MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:** #### For table structure (columns, types, descriptions): ```bash # ALWAYS read this FIRST before querying tables cat server/docs/schema.yml ``` - **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands** - **NEVER guess column names** - schema.yml contains: all column names, types, descriptions, primary keys #### For table relationships (joins, foreign keys): ```bash # Read this for understanding relationships between tables cat server/docs/data_description.md ``` - Contains primary/foreign keys, sync strategies, and table descriptions - Essential for writing correct JOIN queries #### For additional dataset schemas (if available): ```bash # Check for additional dataset schemas ls server/docs/datasets/ 2>/dev/null ``` ### 3. Read Metrics Definitions (if available) **Before calculating ANY business metric, check for metric definitions:** ```bash # Check if metrics index exists cat server/docs/metrics/metrics.yml 2>/dev/null # Or list available metric files ls server/docs/metrics/ 2>/dev/null ``` If metric definitions exist, always read the specific metric file before calculating. Do not calculate metrics from memory - the formulas contain critical details. --- ## Directory Structure ``` project_root/ ├── server/ # READ-ONLY - synced from server │ ├── docs/ # Documentation │ │ ├── data_description.md # Table relationships and descriptions │ │ ├── schema.yml # Table schemas and column definitions │ │ ├── metrics/ # Metric definitions (if available) │ │ └── datasets/ # Additional dataset docs (if available) │ ├── scripts/ # Helper scripts (sync_data.sh, setup_views.sh) │ ├── examples/ # Example scripts (if available) │ └── parquet/ # Synced parquet data files │ ├── user/ # YOUR WORKSPACE - never overwritten │ ├── duckdb/ # DuckDB database (analytics.duckdb) │ ├── artifacts/ # Analysis outputs, charts, exports │ └── scripts/ # Your custom scripts │ ├── .claude/ # Claude Code config ├── .venv/ # Python virtual environment ├── CLAUDE.md # This file (auto-updated from server) └── CLAUDE.local.md # Your personal notes (never overwritten) ``` **Never modify files in `server/` - they are overwritten on every sync.** --- ## Essential Commands ```bash # Data freshness and sync bash server/scripts/sync_data.sh # Sync latest data from server # DuckDB management bash server/scripts/setup_views.sh # Recreate DuckDB views # Python environment source .venv/bin/activate # Activate venv (macOS/Linux) .venv/Scripts/activate # Activate venv (Windows) ``` --- ## Quick Start ### List all tables ```python import duckdb con = duckdb.connect('user/duckdb/analytics.duckdb') tables = con.execute("SHOW TABLES;").fetchall() for table in tables: print(table[0]) con.close() ``` ### Query data ```bash # Read schema first, then query cat server/docs/schema.yml ``` ```python import duckdb con = duckdb.connect('user/duckdb/analytics.duckdb') # Write your query based on schema.yml column definitions result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf() print(result) con.close() ``` --- ## Startup Checklist When starting a new session: 1. **Sync latest data** ```bash bash server/scripts/sync_data.sh ``` 2. **Verify database exists** ```bash ls -lh user/duckdb/analytics.duckdb ``` You're ready to analyze! --- ## Important Reminders - Always read `server/docs/schema.yml` before writing SQL queries - Always read `server/docs/data_description.md` for table relationships and joins - Check `server/docs/metrics/` for metric definitions before calculating business metrics - Use DuckDB views, not direct parquet file reads - Never modify files in `server/` - they're read-only --- ## Remote Queries (BigQuery) Some tables are too large for local Parquet sync and are queried remotely via BigQuery. These tables have `query_mode: "remote"` in `server/docs/data_description.md`. **IMPORTANT: When remote tables exist, proactively offer hybrid analyses that combine local and remote data.** For example, if the user asks for a business overview, suggest joining local order data with remote traffic data to show a complete picture (conversion funnels, revenue per visitor, etc.). Don't wait for the user to ask -- hybrid insights are more valuable than single-source analysis. ### How to recognize remote tables Before writing any query, read `server/docs/data_description.md`. Each table has: - `query_mode: "local"` -- available as a local DuckDB view (query normally) - `query_mode: "remote"` -- NOT in local DuckDB, must use remote query protocol below - `query_mode: "hybrid"` -- local view exists AND can query BQ for live data ### Remote table metadata in data_description.md Remote tables include metadata to help you write safe queries: - **`volume`** -- rows_per_day, unique entities per day (tells you table size) - **`columns`** -- column names, types, value distributions - **`dimension_profile`** -- cardinality per dimension with value distributions - **`query_result_estimates`** -- expected row counts after GROUP BY combinations - **`join_keys`** -- how to join with other tables **ALWAYS read these sections before writing a remote query.** Use `query_result_estimates` to predict how many rows your query will return. The server has limited RAM -- keep BQ sub-query results under 500K rows. ### Two-phase query protocol Remote queries run **on the server** via SSH (server has DuckDB + Parquet + BigQuery access). You write two SQL statements: 1. **BQ sub-query** (`--register-bq "alias=SQL"`) -- runs on BigQuery, result registered in DuckDB as a view. This MUST contain WHERE and/or GROUP BY to reduce the result set. Never SELECT * from a remote table. 2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready. Can JOIN local tables with registered BQ results. ### Command format (JSON file via stdin) **IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with backticks, quotes, and parentheses in SQL. Use the Write tool to create a JSON query spec file, then pipe it to SSH via stdin redirect: **Step 1:** Use the Write tool to create a JSON file (e.g., `user/scripts/rq_query.json`): ```json { "register_bq": { "ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..." }, "sql": "SELECT ... FROM ALIAS JOIN local_table ...", "format": "table" } ``` **Step 2:** Run the query via SSH with stdin redirect: ```bash ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json ``` **NEVER use `cat <= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2" }, "sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col", "format": "table" } ``` Then run: ```bash ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json ``` ### Example 2: JOIN local + remote Write to `user/scripts/rq_query.json`: ```json { "register_bq": { "remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2" }, "sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2", "format": "table" } ``` Then run: ```bash ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json ``` ### Example 3: Download result as Parquet for local analysis Write to `user/scripts/rq_query.json`: ```json { "register_bq": { "remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..." }, "sql": "SELECT ... FROM local_table JOIN remote_data ...", "format": "parquet", "output": "/tmp/remote_query/analysis.parquet" } ``` Then run: ```bash # 1. Run query on server ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json # 2. Download to local machine scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/ # 3. Register in local DuckDB for further analysis python3 -c " import duckdb conn = duckdb.connect('user/duckdb/analytics.duckdb') conn.execute(\"CREATE OR REPLACE VIEW analysis AS SELECT * FROM read_parquet('user/parquet/analysis.parquet')\") print('View created:', conn.execute('SELECT COUNT(*) FROM analysis').fetchone()[0], 'rows') conn.close() " ``` ### How to estimate result sizes Before writing a BQ sub-query, check `dimension_profile` and `query_result_estimates` in `server/docs/data_description.md`. **Rule of thumb:** rows = (estimate per day from query_result_estimates) * (number of days in WHERE clause). If that exceeds 100K rows, add more aggregation or tighter date filters. ### Safety rules 1. **NEVER** run `SELECT * FROM remote_table` without WHERE + GROUP BY 2. **ALWAYS** check `dimension_profile` before writing BQ sub-queries 3. **ALWAYS** include date range in WHERE clause 4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result 5. If the query might take > 60 seconds, use nohup pattern: ```bash # Write query spec to user/scripts/rq_query.json first, then: ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &' < user/scripts/rq_query.json ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/ ``` --- ## Reporting Issues Report issues to your platform team or the project's issue tracker. Include: - Error messages or unexpected behavior - Steps to reproduce - Output of `bash server/scripts/sync_data.sh`