- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
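
The substitution step described above can be sketched as follows. This is an illustration in Python only: the real `sync_data.sh` is a shell script whose implementation is not shown here, and the sketch assumes `.sync_connection` holds simple `key=value` lines, which is a guess at the format. The function names `parse_connection` and `fill_placeholders` are hypothetical.

```python
def parse_connection(text):
    """Parse key=value lines into a dict, skipping blanks and # comments.

    The key=value layout is an assumption about .sync_connection.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def fill_placeholders(text, values):
    """Replace each known {name} placeholder; unknown ones are left untouched."""
    for name, value in values.items():
        text = text.replace("{" + name + "}", value)
    return text
```

With a connection file containing `ssh_alias` and `server_host` entries, `fill_placeholders("{ssh_alias} server ({server_host})", values)` would fill both names and leave any placeholder without a matching key, such as `{webapp_url}`, as it is.
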
193 lines
5.3 KiB
Text
# CLAUDE.md

Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data.

## Quick Status

| Property | Value |
|----------|-------|
| **Project Type** | AI Data Analyst |
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
| **Data Source** | {ssh_alias} server ({server_host}) |
| **Data Format** | Parquet files in `server/parquet/` |
| **Analyst** | {username} |

---

## CRITICAL: Always Start Here

### 1. Sync Data When Starting

**MANDATORY: Automatically run sync in these situations:**
- This is a new session (first interaction today)
- The session is from a previous day or older
- Data may be stale (updated multiple times daily on server)
- The user explicitly requests fresh data

```bash
bash server/scripts/sync_data.sh
```

This updates data, scripts, documentation, and CLAUDE.md.

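
The sync conditions listed above can be sketched as a small decision helper. This is a minimal sketch, not part of the project's scripts: the function name `needs_sync` is hypothetical, and the four-hour `max_age` threshold is an assumption standing in for "data may be stale", since the source only says the server updates several times a day.

```python
from datetime import datetime, timedelta


def needs_sync(last_sync, now, user_requested=False, max_age=timedelta(hours=4)):
    """Return True when any of the documented sync conditions applies.

    last_sync: datetime of the previous sync, or None if there was none.
    max_age:   assumed staleness threshold (not specified in the source).
    """
    if user_requested or last_sync is None:
        return True  # explicit request, or a brand-new session
    if last_sync.date() < now.date():
        return True  # last sync happened on a previous day
    return now - last_sync > max_age  # same day, but possibly stale
```

When the helper returns `True`, the next step is the same `bash server/scripts/sync_data.sh` command shown above.
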
### 2. Read Schema Documentation Before Writing SQL

**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**

#### For table structure (columns, types, descriptions):

```bash
# ALWAYS read this FIRST before querying tables
cat server/docs/schema.yml
```

- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
- **NEVER guess column names**
- schema.yml contains: all column names, types, descriptions, primary keys

#### For table relationships (joins, foreign keys):

```bash
# Read this for understanding relationships between tables
cat server/docs/data_description.md
```

- Contains primary/foreign keys, sync strategies, and table descriptions
- Essential for writing correct JOIN queries

#### For additional dataset schemas (if available):

```bash
# Check for additional dataset schemas
ls server/docs/datasets/ 2>/dev/null
```

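
For scripted checks, the "if available" lookup above can be mirrored in Python. This is a minimal sketch using only the standard library; `list_optional_docs` is a hypothetical helper name, and like `ls ... 2>/dev/null` it stays silent when the directory does not exist.

```python
from pathlib import Path


def list_optional_docs(directory):
    """Return the sorted entry names in directory, or [] if it is missing."""
    path = Path(directory)
    if not path.is_dir():
        return []  # optional docs are simply not present on this instance
    return sorted(entry.name for entry in path.iterdir())
```

Calling `list_optional_docs("server/docs/datasets")` returns an empty list on instances that ship no additional dataset docs, so callers can skip that step without handling an error.
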
### 3. Read Metrics Definitions (if available)

**Before calculating ANY business metric, check for metric definitions:**

```bash
# Check if metrics index exists
cat server/docs/metrics/metrics.yml 2>/dev/null

# Or list available metric files
ls server/docs/metrics/ 2>/dev/null
```

If metric definitions exist, always read the specific metric file before calculating.
Do not calculate metrics from memory - the formulas contain critical details.

---

## Directory Structure

```
project_root/
├── server/                      # READ-ONLY - synced from server
│   ├── docs/                    # Documentation
│   │   ├── data_description.md  # Table relationships and descriptions
│   │   ├── schema.yml           # Table schemas and column definitions
│   │   ├── metrics/             # Metric definitions (if available)
│   │   └── datasets/            # Additional dataset docs (if available)
│   ├── scripts/                 # Helper scripts (sync_data.sh, setup_views.sh)
│   ├── examples/                # Example scripts (if available)
│   └── parquet/                 # Synced parquet data files
│
├── user/                        # YOUR WORKSPACE - never overwritten
│   ├── duckdb/                  # DuckDB database (analytics.duckdb)
│   ├── artifacts/               # Analysis outputs, charts, exports
│   └── scripts/                 # Your custom scripts
│
├── .claude/                     # Claude Code config
├── .venv/                       # Python virtual environment
├── CLAUDE.md                    # This file (auto-updated from server)
└── CLAUDE.local.md              # Your personal notes (never overwritten)
```

**Never modify files in `server/` - they are overwritten on every sync.**

---

## Essential Commands

```bash
# Data freshness and sync
bash server/scripts/sync_data.sh      # Sync latest data from server

# DuckDB management
bash server/scripts/setup_views.sh    # Recreate DuckDB views

# Python environment
source .venv/bin/activate             # Activate venv (macOS/Linux)
source .venv/Scripts/activate         # Activate venv (Windows, Git Bash)
```

---

## Quick Start

### List all tables

```python
import duckdb

# Open the local analytics database
con = duckdb.connect('user/duckdb/analytics.duckdb')
tables = con.execute("SHOW TABLES;").fetchall()
for table in tables:
    print(table[0])  # each row is a one-element tuple holding the name
con.close()
```

### Query data

```bash
# Read schema first, then query
cat server/docs/schema.yml
```

```python
import duckdb

con = duckdb.connect('user/duckdb/analytics.duckdb')
# Write your query based on schema.yml column definitions
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
print(result)
con.close()
```

---

## Startup Checklist

When starting a new session:

1. **Sync latest data**
   ```bash
   bash server/scripts/sync_data.sh
   ```

2. **Verify database exists**
   ```bash
   ls -lh user/duckdb/analytics.duckdb
   ```

You're ready to analyze!

---

## Important Reminders

- Always read `server/docs/schema.yml` before writing SQL queries
- Always read `server/docs/data_description.md` for table relationships and joins
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
- Use DuckDB views, not direct parquet file reads
- Never modify files in `server/` - they're read-only

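
To make the "views, not direct parquet reads" reminder concrete, the sketch below builds the kind of statement that exposes a parquet file as a DuckDB view. It is an illustration under assumptions: what `setup_views.sh` actually runs is not shown in this file, `view_statement` is a hypothetical helper, and `orders.parquet` in the comment is a made-up file name. `read_parquet` and `CREATE OR REPLACE VIEW` are standard DuckDB SQL.

```python
from pathlib import Path


def view_statement(parquet_file):
    """Build a CREATE VIEW statement named after the parquet file's stem."""
    path = Path(parquet_file)
    name = path.stem.replace('"', '""')              # escape identifier quotes
    location = path.as_posix().replace("'", "''")    # escape string quotes
    return (
        f'CREATE OR REPLACE VIEW "{name}" '
        f"AS SELECT * FROM read_parquet('{location}')"
    )


# Example with a hypothetical file: view_statement("server/parquet/orders.parquet")
# yields a statement that could be passed to con.execute(...) on a DuckDB
# connection; later queries would then select from the view "orders".
```
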
---

## Reporting Issues

Report issues to your platform team or the project's issue tracker.

Include:
- Error messages or unexpected behavior
- Steps to reproduce
- Output of `bash server/scripts/sync_data.sh`
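
The three items above can be gathered into one text block before filing a report. The following is a minimal sketch using only the Python standard library; `capture_output` and `format_report` are hypothetical helper names, and the report layout is an assumption rather than a format required by any issue tracker.

```python
import subprocess


def capture_output(command):
    """Run a command and return its exit code plus combined stdout/stderr."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages in the same stream
        text=True,
    )
    return result.returncode, result.stdout


def format_report(description, steps, command):
    """Assemble the problem description, repro steps, and command output."""
    code, output = capture_output(command)
    lines = [f"Problem: {description}", "Steps to reproduce:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(steps, start=1)]
    lines += [
        f"Command: {' '.join(command)} (exit code {code})",
        "Output:",
        output.rstrip(),
    ]
    return "\n".join(lines)
```

Passing `["bash", "server/scripts/sync_data.sh"]` as `command` captures the sync output for the report; note that this runs a real sync as a side effect.
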