Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE, docs/architecture, docs/sample-data, docs/agent-workspace-prompt, docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery), the corporate-memory service README, the jira connector README + backfill scripts, the deploy skill, and test docstrings. Replaces `da sync` → `agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` → `agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` → `agnes snapshot create`, plus the matching docker-compose admin invocations. Vendor-specific `/opt/data-analyst/` install paths in jira backfill / consistency scripts and operator docs are replaced with the placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override that lets a deployment inject its actual install path without a code change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md. CHANGELOG `### Internal` entry summarizes the audit and reaffirms the intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`) that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade and override-detection logic.
# Sample Data Generator

Generate realistic synthetic e-commerce and marketing data for demo, testing, and development without connecting a real data source adapter.

## Quick Start

```bash
# Install dependency
pip install faker

# Generate small dataset (default)
python scripts/generate_sample_data.py --size s --output data/sample

# List available sizes
python scripts/generate_sample_data.py --list-sizes
```

## Data Model

9 interrelated tables covering the full e-commerce funnel:

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  campaigns   │     │  customers   │     │   products   │
│   CMP-0001   │     │   C-000001   │     │   P-00001    │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       ▼                    ▼                    │
┌──────────────┐     ┌──────────────┐            │
│ web_sessions │     │  web_leads   │            │
│  S-00000001  │     │   L-000001   │            │
└──────────────┘     └──────────────┘            │
                            │                    │
                            ▼                    ▼
                     ┌──────────────┐     ┌──────────────┐
                     │    orders    │────▶│ order_items  │
                     │ ORD-0000001  │     │ OI-00000001  │
                     └──────┬───────┘     └──────────────┘
                            │
                     ┌──────┴───────┐
                     ▼              ▼
              ┌──────────────┐ ┌──────────────┐
              │   payments   │ │   support    │
              │ PAY-0000001  │ │   tickets    │
              └──────────────┘ │  TKT-000001  │
                               └──────────────┘
```

### Table Reference

| Table | Key Columns | Foreign Keys |
|-------|-------------|--------------|
| **customers** | customer_id, email, segment, country, registration_date | - |
| **products** | product_id, name, category, price, cost | - |
| **campaigns** | campaign_id, channel, budget, spend, impressions, clicks | - |
| **web_sessions** | session_id, started_at, duration_seconds, device_type | customer_id?, campaign_id? |
| **web_leads** | lead_id, source, status, converted_at | customer_id?, campaign_id? |
| **orders** | order_id, status, total_amount, channel | customer_id |
| **order_items** | order_item_id, quantity, unit_price, line_total | order_id, product_id |
| **payments** | payment_id, amount, method, status | order_id, customer_id |
| **support_tickets** | ticket_id, category, priority, satisfaction_score | customer_id, order_id? |

`?` = nullable (not every record has a value)

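The foreign keys listed above can be spot-checked after generation. The following is a minimal sketch, not part of the generator itself. It assumes each table is written as `<table>.csv` in the output directory and that nullable keys appear as empty strings; both are assumptions, and the helper names `load_rows` and `find_orphans` are hypothetical.

```python
import csv
from pathlib import Path


def load_rows(csv_dir, table):
    """Read one table, assuming it was written as <csv_dir>/<table>.csv."""
    path = Path(csv_dir) / f"{table}.csv"
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def find_orphans(child_rows, parent_rows, key, nullable=False):
    """Return child rows whose `key` value has no match in the parent table.

    An empty value is accepted only when the foreign key is nullable
    (marked with `?` in the table reference).
    """
    parent_keys = {row[key] for row in parent_rows}
    orphans = []
    for row in child_rows:
        value = row.get(key, "")
        if value == "":
            if not nullable:
                orphans.append(row)
        elif value not in parent_keys:
            orphans.append(row)
    return orphans
```

For example, `find_orphans(load_rows("data/sample", "orders"), load_rows("data/sample", "customers"), "customer_id")` should return an empty list if every order points at an existing customer. For `support_tickets.order_id`, pass `nullable=True`.
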
### Customer Segments

- **b2c** (60%): Individual consumers, smaller order values
- **b2b_small** (25%): Small business buyers, moderate volumes
- **b2b_enterprise** (15%): Large buyers, high quantities, invoice payments

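To confirm that a generated dataset roughly follows these shares, the `segment` column of the customers table can be tallied. This is a sketch under assumptions: rows are dictionaries such as those produced by `csv.DictReader`, and the 5-point tolerance is an arbitrary choice for illustration, not a guarantee made by the generator. The function names are hypothetical.

```python
from collections import Counter

# Shares as documented in the list above.
EXPECTED_SHARES = {"b2c": 0.60, "b2b_small": 0.25, "b2b_enterprise": 0.15}


def segment_shares(customer_rows):
    """Return the observed fraction of customers in each segment."""
    counts = Counter(row["segment"] for row in customer_rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {segment: n / total for segment, n in counts.items()}


def shares_match(customer_rows, tolerance=0.05):
    """True if every documented segment is within `tolerance` of its share."""
    observed = segment_shares(customer_rows)
    return all(
        abs(observed.get(segment, 0.0) - share) <= tolerance
        for segment, share in EXPECTED_SHARES.items()
    )
```

Small presets such as `xs` (50 customers) will deviate more from the nominal shares than larger ones, so a wider tolerance may be appropriate there.
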
### Product Categories

Electronics, Clothing, Home & Garden, Sports & Outdoors, Books & Media, Beauty & Health

Each category has distinct price ranges and cost margins for realistic profitability analysis.

## Size Presets

| Size | Customers | Products | Sessions | Orders | Tickets | ~CSV | ~Time |
|------|-----------|----------|----------|--------|---------|------|-------|
| **xs** | 50 | 30 | 500 | 100 | 30 | 1 MB | <1s |
| **s** | 500 | 100 | 10K | 2K | 500 | 15 MB | <1s |
| **m** | 5,000 | 300 | 100K | 20K | 5K | 150 MB | ~7s |
| **l** | 50,000 | 1,000 | 1M | 200K | 50K | 1.5 GB | ~3min |

- **xs** - local development, quick iteration
- **s** - unit/integration testing, CI
- **m** - realistic demo, performance testing
- **l** - stress testing, production-like volumes

## CLI Options

```
python scripts/generate_sample_data.py [OPTIONS]

  --size {xs,s,m,l}   Data size preset (default: s)
  --output PATH       Output directory (default: data/sample)
  --seed INT          Random seed for reproducibility (default: 42)
  --list-sizes        Show presets and exit
```

## Convert to Parquet

After generating CSVs, convert to Parquet for analytical use:

```bash
# Requires pandas plus a Parquet engine (pyarrow or fastparquet)
python -c "
import pandas as pd
from pathlib import Path

csv_dir = Path('data/sample')
parquet_dir = Path('data/sample/parquet')
parquet_dir.mkdir(exist_ok=True)

for f in sorted(csv_dir.glob('*.csv')):
    df = pd.read_csv(f)
    out = parquet_dir / f'{f.stem}.parquet'
    df.to_parquet(out, index=False)
    print(f' {f.stem}: {len(df):,} rows -> {out}')
"
```

## Load into DuckDB

```bash
python -c "
import duckdb
from pathlib import Path

db = duckdb.connect('data/sample/analytics.duckdb')
parquet_dir = Path('data/sample/parquet')

for f in sorted(parquet_dir.glob('*.parquet')):
    table = f.stem
    db.execute(f'CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet(\"{f}\")')
    count = db.execute(f'SELECT count(*) FROM {table}').fetchone()[0]
    print(f' {table}: {count:,} rows')

db.close()
print('Database: data/sample/analytics.duckdb')
"
```

## Built-in Analytical Patterns

The generator creates data with discoverable patterns for realistic analysis:

- **Seasonality**: Q4 traffic and orders ~2x higher than Q1
- **Growth trend**: 50% increase in activity over the time period
- **Channel effectiveness**: paid_search has highest click-through rates
- **Customer lifetime**: Pareto distribution (20% of customers generate 80% of orders)
- **Segment differences**: B2B enterprise has 3-5x higher order values
- **Product mix**: Electronics = high revenue / lower margin, Books = low revenue / high margin
- **Support correlation**: 60% of tickets linked to specific orders

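As one example of discovering such a pattern, the Pareto claim can be measured from the orders table. The sketch below is illustrative only: it assumes order rows are dictionaries with a `customer_id` field (as listed in the table reference), and the function name `top_customer_share` is hypothetical.

```python
import math
from collections import Counter


def top_customer_share(order_rows, top_fraction=0.2, customer_count=None):
    """Share of all orders placed by the most active `top_fraction` of customers.

    `customer_count` is the total number of customers. When omitted, only
    customers with at least one order are counted, which understates the
    concentration if many customers never ordered.
    """
    per_customer = Counter(row["customer_id"] for row in order_rows)
    if not per_customer:
        return 0.0
    population = customer_count or len(per_customer)
    # round() guards against float noise such as 15 * 0.2 != 3.0 exactly
    top_n = max(1, math.ceil(round(population * top_fraction, 9)))
    top_orders = sum(n for _, n in per_customer.most_common(top_n))
    return top_orders / sum(per_customer.values())
```

A result near 0.8 for `top_fraction=0.2` would be consistent with the 80/20 pattern described above; the exact value is expected to vary with the size preset and seed.
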
## Reproducibility

The same `--seed` always produces identical output. The default seed is 42.

```bash
# These two commands produce the same files
python scripts/generate_sample_data.py --size s --seed 42 --output run1
python scripts/generate_sample_data.py --size s --seed 42 --output run2
diff -r run1 run2  # no differences
```

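Where `diff -r` is unavailable, or when a single value is wanted for a CI assertion, the same comparison can be done by hashing each output directory. This is a minimal sketch using only the standard library; `directory_digest` is a hypothetical helper name, not something the project ships.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """Return one SHA-256 hex digest covering every file under `root`.

    Both the relative path and the bytes of each file feed the hash, so a
    renamed or modified file changes the result.
    """
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large presets do not load fully into memory
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

With the two runs above, `directory_digest("run1") == directory_digest("run2")` should hold if the outputs are byte-identical.
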
## Server Deployment

To use sample data on a deployed server (instead of connecting a data adapter):

```bash
# On the server, from the install directory containing your repo checkout
# and Python venv (paths vary per deployment):
cd <install-dir>/repo

# Generate Parquet files directly using the project's ParquetManager
# (snappy compression, proper column types, metadata embedding)
<install-dir>/.venv/bin/python scripts/generate_sample_data.py \
    --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet
```