agnes-the-ai-analyst/docs/setup/claude_md_template.txt
Petr 2237334b05 Make CLAUDE.md template generic and instance-aware
- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
  corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
2026-03-14 23:57:58 +01:00


# CLAUDE.md
Project context file for **AI Data Analyst**, a local analytics environment with access to your organization's internal data.
## Quick Status
| Property | Value |
|----------|-------|
| **Project Type** | AI Data Analyst |
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
| **Data Source** | {ssh_alias} server ({server_host}) |
| **Data Format** | Parquet files in `server/parquet/` |
| **Analyst** | {username} |
---
## CRITICAL: Always Start Here
### 1. Sync Data When Starting
**MANDATORY: Automatically run sync in these situations:**
- This is a new session (first interaction today)
- The session was started on a previous day
- Data may be stale (the server updates it multiple times a day)
- The user explicitly requests fresh data
```bash
bash server/scripts/sync_data.sh
```
This updates data, scripts, documentation, and CLAUDE.md.
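The conditions above can be approximated with a small check. The following is a minimal sketch, not part of the synced scripts: the helper name `needs_sync` is hypothetical, and it assumes the modification time of the newest file under `server/parquet/` reflects the last sync. If the sync preserves server-side timestamps, this heuristic will over-report staleness, so when in doubt simply run the sync.

```python
from datetime import date, datetime
from pathlib import Path


def needs_sync(parquet_dir="server/parquet", today=None):
    """Return True if no parquet file was modified today (assumed sync marker)."""
    today = today or date.today()
    root = Path(parquet_dir)
    if not root.is_dir():
        return True  # nothing has been synced yet
    # Assumption: file mtimes approximate the time of the last sync.
    mtimes = [p.stat().st_mtime for p in root.rglob("*.parquet")]
    if not mtimes:
        return True
    last_sync = datetime.fromtimestamp(max(mtimes)).date()
    return last_sync < today
```

Calling `needs_sync()` from the project root returns `True` when the sync script should be run before any analysis.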
### 2. Read Schema Documentation Before Writing SQL
**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**
#### For table structure (columns, types, descriptions):
```bash
# ALWAYS read this FIRST before querying tables
cat server/docs/schema.yml
```
- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
- **NEVER guess column names**
- schema.yml contains: all column names, types, descriptions, primary keys
#### For table relationships (joins, foreign keys):
```bash
# Read this for understanding relationships between tables
cat server/docs/data_description.md
```
- Contains primary/foreign keys, sync strategies, and table descriptions
- Essential for writing correct JOIN queries
#### For additional dataset schemas (if available):
```bash
# Check for additional dataset schemas
ls server/docs/datasets/ 2>/dev/null
```
### 3. Read Metrics Definitions (if available)
**Before calculating ANY business metric, check for metric definitions:**
```bash
# Check if metrics index exists
cat server/docs/metrics/metrics.yml 2>/dev/null
# Or list available metric files
ls server/docs/metrics/ 2>/dev/null
```
If metric definitions exist, always read the specific metric file before calculating.
Do not calculate metrics from memory - the formulas contain critical details.
---
## Directory Structure
```
project_root/
├── server/                      # READ-ONLY - synced from server
│   ├── docs/                    # Documentation
│   │   ├── data_description.md  # Table relationships and descriptions
│   │   ├── schema.yml           # Table schemas and column definitions
│   │   ├── metrics/             # Metric definitions (if available)
│   │   └── datasets/            # Additional dataset docs (if available)
│   ├── scripts/                 # Helper scripts (sync_data.sh, setup_views.sh)
│   ├── examples/                # Example scripts (if available)
│   └── parquet/                 # Synced parquet data files
├── user/                        # YOUR WORKSPACE - never overwritten
│   ├── duckdb/                  # DuckDB database (analytics.duckdb)
│   ├── artifacts/               # Analysis outputs, charts, exports
│   └── scripts/                 # Your custom scripts
├── .claude/                     # Claude Code config
├── .venv/                       # Python virtual environment
├── CLAUDE.md                    # This file (auto-updated from server)
└── CLAUDE.local.md              # Your personal notes (never overwritten)
```
**Never modify files in `server/` - they are overwritten on every sync.**
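One way to respect this rule in custom scripts is to route every output path through a small guard. This is a minimal sketch, assuming outputs belong in `user/artifacts/` as shown in the tree above; the helper name `artifact_path` is hypothetical and not provided by the synced scripts.

```python
from pathlib import Path


def artifact_path(name, project_root="."):
    """Return a path under user/artifacts/, refusing anything that escapes it."""
    base = (Path(project_root) / "user" / "artifacts").resolve()
    target = (base / name).resolve()
    # Reject names such as "../../server/docs/schema.yml" that leave the workspace.
    if base not in target.parents:
        raise ValueError(f"{name!r} resolves outside {base}")
    # Create intermediate folders so the caller can write immediately.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

For example, `artifact_path("exports/report.csv")` returns a writable location inside the workspace, while a name that points into `server/` raises `ValueError`.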
---
## Essential Commands
```bash
# Data freshness and sync
bash server/scripts/sync_data.sh # Sync latest data from server
# DuckDB management
bash server/scripts/setup_views.sh # Recreate DuckDB views
# Python environment
source .venv/bin/activate # Activate venv (macOS/Linux)
source .venv/Scripts/activate # Activate venv (Windows, Git Bash)
```
---
## Quick Start
### List all tables
```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
tables = con.execute("SHOW TABLES;").fetchall()
for table in tables:
    print(table[0])
con.close()
```
### Query data
```bash
# Read schema first, then query
cat server/docs/schema.yml
```
```python
import duckdb
con = duckdb.connect('user/duckdb/analytics.duckdb')
# Write your query based on schema.yml column definitions
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
print(result)
con.close()
```
---
## Startup Checklist
When starting a new session:
1. **Sync latest data**
```bash
bash server/scripts/sync_data.sh
```
2. **Verify database exists**
```bash
ls -lh user/duckdb/analytics.duckdb
```
You're ready to analyze!
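The same checks can be run from Python. This is a minimal sketch using only the standard library; `check_workspace` is a hypothetical helper, and the paths are the ones listed in Quick Status and the directory structure.

```python
from pathlib import Path


def check_workspace(project_root="."):
    """Return a list of problems found; an empty list means the workspace looks ready."""
    root = Path(project_root)
    problems = []

    # The DuckDB file should exist and be non-empty after a successful sync.
    db = root / "user" / "duckdb" / "analytics.duckdb"
    if not db.is_file():
        problems.append(f"missing database: {db}")
    elif db.stat().st_size == 0:
        problems.append(f"empty database file: {db}")

    # schema.yml must be readable before any SQL is written.
    schema = root / "server" / "docs" / "schema.yml"
    if not schema.is_file():
        problems.append(f"missing schema docs: {schema}")

    return problems
```

If the returned list is not empty, rerunning `bash server/scripts/sync_data.sh` is the first thing to try.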
---
## Important Reminders
- Always read `server/docs/schema.yml` before writing SQL queries
- Always read `server/docs/data_description.md` for table relationships and joins
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
- Use DuckDB views, not direct parquet file reads
- Never modify files in `server/` - they're read-only
---
## Reporting Issues
Report issues to your platform team or the project's issue tracker.
Include:
- Error messages or unexpected behavior
- Steps to reproduce
- Output of `bash server/scripts/sync_data.sh`
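
To attach the sync output to a report, it can help to save it to a file. This is a minimal sketch, not part of the synced scripts: the function name `capture_sync_output` and the log location `user/artifacts/sync_output.log` are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path


def capture_sync_output(cmd=("bash", "server/scripts/sync_data.sh"),
                        log_path="user/artifacts/sync_output.log"):
    """Run the command and save its combined output and exit code to a log file."""
    # Merge stderr into stdout so error messages appear in order.
    result = subprocess.run(list(cmd), stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    log = Path(log_path)
    log.parent.mkdir(parents=True, exist_ok=True)
    log.write_text(
        f"$ {' '.join(cmd)}\nexit code: {result.returncode}\n\n{result.stdout}"
    )
    return result.returncode, log
```

The log is written under `user/`, so it is not overwritten by the next sync.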