Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.
311 lines
10 KiB
Text
311 lines
10 KiB
Text
# CLAUDE.md
|
|
|
|
Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data.
|
|
|
|
## Quick Status
|
|
|
|
| Property | Value |
|
|
|----------|-------|
|
|
| **Project Type** | AI Data Analyst |
|
|
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
|
|
| **Data Source** | {ssh_alias} server ({server_host}) |
|
|
| **Data Format** | Parquet files in `server/parquet/` |
|
|
| **Analyst** | {username} |
|
|
|
|
---
|
|
|
|
## CRITICAL: Always Start Here
|
|
|
|
### 1. Sync Data When Starting
|
|
|
|
**MANDATORY: Automatically run sync in these situations:**
|
|
- This is a new session (first interaction today)
|
|
- The session is from a previous day or older
|
|
- Data may be stale (updated multiple times daily on server)
|
|
- The user explicitly requests fresh data
|
|
|
|
```bash
|
|
bash server/scripts/sync_data.sh
|
|
```
|
|
|
|
This updates data, scripts, documentation, and CLAUDE.md.
|
|
|
|
### 2. Read Schema Documentation Before Writing SQL
|
|
|
|
**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**
|
|
|
|
#### For table structure (columns, types, descriptions):
|
|
|
|
```bash
|
|
# ALWAYS read this FIRST before querying tables
|
|
cat server/docs/schema.yml
|
|
```
|
|
|
|
- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
|
|
- **NEVER guess column names**
|
|
- schema.yml contains: all column names, types, descriptions, primary keys
|
|
|
|
#### For table relationships (joins, foreign keys):
|
|
|
|
```bash
|
|
# Read this for understanding relationships between tables
|
|
cat server/docs/data_description.md
|
|
```
|
|
|
|
- Contains primary/foreign keys, sync strategies, and table descriptions
|
|
- Essential for writing correct JOIN queries
|
|
|
|
#### For additional dataset schemas (if available):
|
|
|
|
```bash
|
|
# Check for additional dataset schemas
|
|
ls server/docs/datasets/ 2>/dev/null
|
|
```
|
|
|
|
### 3. Read Metrics Definitions (if available)
|
|
|
|
**Before calculating ANY business metric, check for metric definitions:**
|
|
|
|
```bash
|
|
# Check if metrics index exists
|
|
cat server/docs/metrics/metrics.yml 2>/dev/null
|
|
|
|
# Or list available metric files
|
|
ls server/docs/metrics/ 2>/dev/null
|
|
```
|
|
|
|
If metric definitions exist, always read the specific metric file before calculating.
|
|
Do not calculate metrics from memory - the formulas contain critical details.
|
|
|
|
---
|
|
|
|
## Directory Structure
|
|
|
|
```
|
|
project_root/
|
|
├── server/ # READ-ONLY - synced from server
|
|
│ ├── docs/ # Documentation
|
|
│ │ ├── data_description.md # Table relationships and descriptions
|
|
│ │ ├── schema.yml # Table schemas and column definitions
|
|
│ │ ├── metrics/ # Metric definitions (if available)
|
|
│ │ └── datasets/ # Additional dataset docs (if available)
|
|
│ ├── scripts/ # Helper scripts (sync_data.sh, setup_views.sh)
|
|
│ ├── examples/ # Example scripts (if available)
|
|
│ └── parquet/ # Synced parquet data files
|
|
│
|
|
├── user/ # YOUR WORKSPACE - never overwritten
|
|
│ ├── duckdb/ # DuckDB database (analytics.duckdb)
|
|
│ ├── artifacts/ # Analysis outputs, charts, exports
|
|
│ └── scripts/ # Your custom scripts
|
|
│
|
|
├── .claude/ # Claude Code config
|
|
├── .venv/ # Python virtual environment
|
|
├── CLAUDE.md # This file (auto-updated from server)
|
|
└── CLAUDE.local.md # Your personal notes (never overwritten)
|
|
```
|
|
|
|
**Never modify files in `server/` - they are overwritten on every sync.**
|
|
|
|
---
|
|
|
|
## Essential Commands
|
|
|
|
```bash
|
|
# Data freshness and sync
|
|
bash server/scripts/sync_data.sh # Sync latest data from server
|
|
|
|
# DuckDB management
|
|
bash server/scripts/setup_views.sh # Recreate DuckDB views
|
|
|
|
# Python environment
|
|
source .venv/bin/activate # Activate venv (macOS/Linux)
|
|
.venv/Scripts/activate # Activate venv (Windows)
|
|
```
|
|
|
|
---
|
|
|
|
## Quick Start
|
|
|
|
### List all tables
|
|
|
|
```python
|
|
import duckdb
|
|
con = duckdb.connect('user/duckdb/analytics.duckdb')
|
|
tables = con.execute("SHOW TABLES;").fetchall()
|
|
for table in tables:
|
|
print(table[0])
|
|
con.close()
|
|
```
|
|
|
|
### Query data
|
|
|
|
```bash
|
|
# Read schema first, then query
|
|
cat server/docs/schema.yml
|
|
```
|
|
|
|
```python
|
|
import duckdb
|
|
con = duckdb.connect('user/duckdb/analytics.duckdb')
|
|
# Write your query based on schema.yml column definitions
|
|
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
|
|
print(result)
|
|
con.close()
|
|
```
|
|
|
|
---
|
|
|
|
## Startup Checklist
|
|
|
|
When starting a new session:
|
|
|
|
1. **Sync latest data**
|
|
```bash
|
|
bash server/scripts/sync_data.sh
|
|
```
|
|
|
|
2. **Verify database exists**
|
|
```bash
|
|
ls -lh user/duckdb/analytics.duckdb
|
|
```
|
|
|
|
You're ready to analyze!
|
|
|
|
---
|
|
|
|
## Important Reminders
|
|
|
|
- Always read `server/docs/schema.yml` before writing SQL queries
|
|
- Always read `server/docs/data_description.md` for table relationships and joins
|
|
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
|
|
- Use DuckDB views, not direct parquet file reads
|
|
- Never modify files in `server/` - they're read-only
|
|
|
|
---
|
|
|
|
## Remote Queries (BigQuery)
|
|
|
|
Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
|
|
These tables have `query_mode: "remote"` in `server/docs/data_description.md`.
|
|
|
|
### How to recognize remote tables
|
|
|
|
Before writing any query, read `server/docs/data_description.md`. Each table has:
|
|
- `query_mode: "local"` -- available as a local DuckDB view (query normally)
|
|
- `query_mode: "remote"` -- NOT in local DuckDB, must use remote query protocol below
|
|
- `query_mode: "hybrid"` -- local view exists AND can query BQ for live data
|
|
|
|
### Remote table metadata in data_description.md
|
|
|
|
Remote tables include metadata to help you write safe queries:
|
|
|
|
- **`volume`** -- rows_per_day, unique entities per day (tells you table size)
|
|
- **`columns`** -- column names, types, value distributions
|
|
- **`dimension_profile`** -- cardinality per dimension with value distributions
|
|
- **`query_result_estimates`** -- expected row counts after GROUP BY combinations
|
|
- **`join_keys`** -- how to join with other tables
|
|
|
|
**ALWAYS read these sections before writing a remote query.** Use `query_result_estimates`
|
|
to predict how many rows your query will return. The server has limited RAM -- keep BQ
|
|
sub-query results under 500K rows.
|
|
|
|
### Two-phase query protocol
|
|
|
|
Remote queries run **on the server** via SSH (server has DuckDB + Parquet + BigQuery access).
|
|
You write two SQL statements:
|
|
|
|
1. **BQ sub-query** (`--register-bq "alias=SQL"`) -- runs on BigQuery, result registered in DuckDB as a view.
|
|
This MUST contain WHERE and/or GROUP BY to reduce the result set. Never SELECT * from a remote table.
|
|
2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
|
|
Can JOIN local tables with registered BQ results.
|
|
|
|
### Command format
|
|
|
|
```bash
|
|
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh \
|
|
--register-bq "ALIAS=BQ_SQL_QUERY" \
|
|
--sql "DUCKDB_SQL_QUERY" \
|
|
--format table'
|
|
```
|
|
|
|
The wrapper script (`remote_query.sh`) handles environment setup automatically
|
|
(PYTHONPATH, CONFIG_DIR, .env loading). All arguments are passed to `python -m src.remote_query`.
|
|
|
|
Arguments:
|
|
- `--register-bq "alias=SQL"` -- Register a BQ query result as DuckDB view (repeatable for multiple remote tables)
|
|
- `--sql "SQL"` -- The final DuckDB query (can reference local views + registered BQ aliases)
|
|
- `--format table|csv|json|parquet` -- Output format (default: table)
|
|
- `--output /path/file` -- Output file for parquet/csv/json
|
|
- `--max-rows N` -- Override max result rows
|
|
|
|
### Example 1: Remote-only query (aggregated data)
|
|
|
|
```bash
|
|
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh \
|
|
--register-bq "agg_data=SELECT date_col, dim_col, SUM(metric) as total FROM \`project.dataset.table\` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2" \
|
|
--sql "SELECT * FROM agg_data ORDER BY date_col, dim_col" \
|
|
--format table'
|
|
```
|
|
|
|
### Example 2: JOIN local + remote
|
|
|
|
```bash
|
|
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh \
|
|
--register-bq "remote_data=SELECT date_col, dim_col, SUM(metric) as total FROM \`project.dataset.table\` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2" \
|
|
--sql "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2" \
|
|
--format table'
|
|
```
|
|
|
|
### Example 3: Download result as Parquet for local analysis
|
|
|
|
```bash
|
|
# 1. Run query, save as Parquet on server
|
|
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh \
|
|
--register-bq "remote_data=SELECT ... GROUP BY ..." \
|
|
--sql "SELECT ... JOIN ..." \
|
|
--format parquet --output /tmp/remote_query/analysis.parquet'
|
|
|
|
# 2. Download to local machine
|
|
scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
|
|
|
|
# 3. Register in local DuckDB for further analysis
|
|
python3 -c "
|
|
import duckdb
|
|
conn = duckdb.connect('user/duckdb/analytics.duckdb')
|
|
conn.execute(\"CREATE OR REPLACE VIEW analysis AS SELECT * FROM read_parquet('user/parquet/analysis.parquet')\")
|
|
print('View created:', conn.execute('SELECT COUNT(*) FROM analysis').fetchone()[0], 'rows')
|
|
conn.close()
|
|
"
|
|
```
|
|
|
|
### How to estimate result sizes
|
|
|
|
Before writing a BQ sub-query, check `dimension_profile` and `query_result_estimates`
|
|
in `server/docs/data_description.md`.
|
|
|
|
**Rule of thumb:** rows = (estimate per day from query_result_estimates) * (number of days in WHERE clause).
|
|
If that exceeds 100K rows, add more aggregation or tighter date filters.
|
|
|
|
### Safety rules
|
|
|
|
1. **NEVER** run `SELECT * FROM remote_table` without WHERE + GROUP BY
|
|
2. **ALWAYS** check `dimension_profile` before writing BQ sub-queries
|
|
3. **ALWAYS** include date range in WHERE clause
|
|
4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
|
|
5. If the query might take > 60 seconds, use nohup pattern:
|
|
```bash
|
|
ssh {ssh_alias} 'nohup bash ~/server/scripts/remote_query.sh --register-bq "..." --sql "..." --format parquet --output /tmp/remote_query/result.parquet > /tmp/rq.log 2>&1 &'
|
|
ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress
|
|
scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
|
|
```
|
|
|
|
---
|
|
|
|
## Reporting Issues
|
|
|
|
Report issues to your platform team or the project's issue tracker.
|
|
|
|
Include:
|
|
- Error messages or unexpected behavior
|
|
- Steps to reproduce
|
|
- Output of `bash server/scripts/sync_data.sh`
|