Sample Data Generator

Generate realistic synthetic e-commerce and marketing data for demo, testing, and development without connecting a real data source adapter.

Quick Start

# Install dependency
pip install faker

# Generate small dataset (default)
python scripts/generate_sample_data.py --size s --output data/sample

# List available sizes
python scripts/generate_sample_data.py --list-sizes

Data Model

The generator produces nine interrelated tables covering the full e-commerce funnel:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  campaigns   │     │  customers   │     │   products   │
│  CMP-0001    │     │  C-000001    │     │   P-00001    │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       ▼                    ▼                    │
┌──────────────┐     ┌──────────────┐            │
│ web_sessions │     │  web_leads   │            │
│  S-00000001  │     │  L-000001    │            │
└──────────────┘     └──────────────┘            │
                            │                    │
                            ▼                    ▼
                     ┌──────────────┐     ┌──────────────┐
                     │   orders     │────▶│ order_items  │
                     │ ORD-0000001  │     │ OI-00000001  │
                     └──────┬───────┘     └──────────────┘
                            │
                     ┌──────┴───────┐
                     ▼              ▼
              ┌──────────────┐ ┌──────────────┐
              │  payments    │ │   support    │
              │ PAY-0000001  │ │   tickets    │
              └──────────────┘ │ TKT-000001   │
                               └──────────────┘

Table Reference

| Table | Key Columns | Foreign Keys |
| --- | --- | --- |
| customers | customer_id, email, segment, country, registration_date | - |
| products | product_id, name, category, price, cost | - |
| campaigns | campaign_id, channel, budget, spend, impressions, clicks | - |
| web_sessions | session_id, started_at, duration_seconds, device_type | customer_id?, campaign_id? |
| web_leads | lead_id, source, status, converted_at | customer_id?, campaign_id? |
| orders | order_id, status, total_amount, channel | customer_id |
| order_items | order_item_id, quantity, unit_price, line_total | order_id, product_id |
| payments | payment_id, amount, method, status | order_id, customer_id |
| support_tickets | ticket_id, category, priority, satisfaction_score | customer_id, order_id? |

? = nullable (not every record has a value)
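
The foreign keys above can be sanity-checked after generation. The following is a minimal sketch, not part of the generator: the helper names are hypothetical, and it assumes each table is written as `<table>.csv` and that nullable keys appear as empty strings in the CSV output.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file into a list of dicts (one dict per record)."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def find_orphans(child_rows, fk_column, parent_ids):
    """Return child rows whose foreign key is set but has no matching parent.

    Empty values are skipped: they are assumed to be how nullable keys
    (marked with ? in the table) show up in the CSV files.
    """
    return [
        row for row in child_rows
        if row.get(fk_column) and row[fk_column] not in parent_ids
    ]


# Hand-made rows for illustration (not generator output).
customer_ids = {"C-000001", "C-000002"}
tickets = [
    {"ticket_id": "TKT-000001", "customer_id": "C-000001"},
    {"ticket_id": "TKT-000002", "customer_id": ""},          # nullable, skipped
    {"ticket_id": "TKT-000003", "customer_id": "C-000009"},  # no such customer
]
orphans = find_orphans(tickets, "customer_id", customer_ids)
print([row["ticket_id"] for row in orphans])
```

Against real output, `load_rows("data/sample/orders.csv")` and a set built from `customers.csv` would take the place of the inline rows, provided the file naming assumption holds.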

Customer Segments

  • b2c (60%): Individual consumers, smaller order values
  • b2b_small (25%): Small business buyers, moderate volumes
  • b2b_enterprise (15%): Large buyers, high quantities, invoice payments
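
To confirm that a generated dataset roughly follows these shares, the `segment` column of the customers table can be tallied. This is a sketch with hypothetical helper names; the sample values are hand-made, and real output will deviate slightly from the targets because of random sampling.

```python
from collections import Counter

# Target shares taken from the list above.
EXPECTED_SHARES = {"b2c": 0.60, "b2b_small": 0.25, "b2b_enterprise": 0.15}


def segment_shares(segments):
    """Return the observed share of each segment as a fraction of all rows."""
    counts = Counter(segments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def max_deviation(observed, expected=EXPECTED_SHARES):
    """Largest absolute gap between observed and expected shares."""
    return max(abs(observed.get(name, 0.0) - share) for name, share in expected.items())


# Hand-made sample: 12 b2c, 5 b2b_small, 3 b2b_enterprise customers.
segments = ["b2c"] * 12 + ["b2b_small"] * 5 + ["b2b_enterprise"] * 3
shares = segment_shares(segments)
print(shares)
print(f"max deviation from targets: {max_deviation(shares):.2f}")
```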

Product Categories

Electronics, Clothing, Home & Garden, Sports & Outdoors, Books & Media, Beauty & Health

Each category has distinct price ranges and cost margins for realistic profitability analysis.
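
Margins per category follow from the `price` and `cost` columns of the products table. The sketch below computes gross margin as (price - cost) / price over summed values. The function name is hypothetical and the rows are hand-made for illustration, not actual generator ranges.

```python
from collections import defaultdict


def margin_by_category(products):
    """Return gross margin per category: (sum of price - sum of cost) / sum of price."""
    price_sum = defaultdict(float)
    cost_sum = defaultdict(float)
    for row in products:
        # CSV values arrive as strings, so convert explicitly.
        price_sum[row["category"]] += float(row["price"])
        cost_sum[row["category"]] += float(row["cost"])
    return {
        category: (price_sum[category] - cost_sum[category]) / price_sum[category]
        for category in price_sum
        if price_sum[category] > 0
    }


# Illustrative rows only; real price and cost ranges come from the generator.
products = [
    {"category": "Electronics", "price": "100", "cost": "80"},
    {"category": "Electronics", "price": "300", "cost": "240"},
    {"category": "Books & Media", "price": "20", "cost": "8"},
]
for category, margin in margin_by_category(products).items():
    print(f"{category}: {margin:.0%}")
```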

Size Presets

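| Size | Customers | Products | Sessions | Orders | Tickets | Approx. CSV size | Approx. time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xs | 50 | 30 | 500 | 100 | 30 | 1 MB | <1s |
| s | 500 | 100 | 10K | 2K | 500 | 15 MB | <1s |
| m | 5,000 | 300 | 100K | 20K | 5K | 150 MB | ~7s |
| l | 50,000 | 1,000 | 1M | 200K | 50K | 1.5 GB | ~3min |
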
  • xs - local development, quick iteration
  • s - unit/integration testing, CI
  • m - realistic demo, performance testing
  • l - stress testing, production-like volumes
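
When a test needs a minimum number of rows, the preset table can be encoded and searched. This is only a sketch: the row counts are copied from the table above, while `PRESETS` and `smallest_preset` are hypothetical names that do not exist in the generator script.

```python
# Row counts copied from the Size Presets table, ordered from smallest to largest.
PRESETS = {
    "xs": {"customers": 50, "products": 30, "sessions": 500, "orders": 100, "tickets": 30},
    "s": {"customers": 500, "products": 100, "sessions": 10_000, "orders": 2_000, "tickets": 500},
    "m": {"customers": 5_000, "products": 300, "sessions": 100_000, "orders": 20_000, "tickets": 5_000},
    "l": {"customers": 50_000, "products": 1_000, "sessions": 1_000_000, "orders": 200_000, "tickets": 50_000},
}


def smallest_preset(table, min_rows):
    """Return the smallest preset with at least min_rows rows in table, or None."""
    for name, counts in PRESETS.items():  # dicts keep insertion order
        if counts[table] >= min_rows:
            return name
    return None


print(smallest_preset("orders", 1_500))
print(smallest_preset("sessions", 50_000))
```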

CLI Options

python scripts/generate_sample_data.py [OPTIONS]

  --size {xs,s,m,l}   Data size preset (default: s)
  --output PATH        Output directory (default: data/sample)
  --seed INT           Random seed for reproducibility (default: 42)
  --format FORMAT      Output format, e.g. parquet (CSV when omitted)
  --list-sizes         Show presets and exit
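
For test fixtures or CI jobs it can be handy to call the generator from Python. The wrapper below is a sketch that uses only the `--size`, `--output`, and `--seed` options listed above. `build_command` and `generate` are hypothetical helpers, and the script path assumes the current directory is the repository root.

```python
import subprocess
import sys

SIZES = ("xs", "s", "m", "l")


def build_command(size="s", output="data/sample", seed=42):
    """Build the argument list for scripts/generate_sample_data.py."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return [
        sys.executable, "scripts/generate_sample_data.py",
        "--size", size,
        "--output", str(output),
        "--seed", str(seed),
    ]


def generate(size="s", output="data/sample", seed=42):
    """Run the generator and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(size, output, seed), check=True)


# Show the command without running it.
print(" ".join(build_command("xs", "data/tmp", 7)[1:]))
```

Passing the arguments as a list avoids shell quoting issues when the output path contains spaces.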

Convert to Parquet

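After generating CSVs, convert them to Parquet for analytical use. The snippet requires pandas plus a Parquet engine such as pyarrow, neither of which is installed by the Quick Start step: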

python -c "
import pandas as pd
from pathlib import Path

csv_dir = Path('data/sample')
parquet_dir = Path('data/sample/parquet')
parquet_dir.mkdir(exist_ok=True)

for f in sorted(csv_dir.glob('*.csv')):
    df = pd.read_csv(f)
    out = parquet_dir / f'{f.stem}.parquet'
    df.to_parquet(out, index=False)
    print(f'  {f.stem}: {len(df):,} rows -> {out}')
"

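Load into DuckDB

Load the Parquet files into a local DuckDB database, one table per file (requires the duckdb Python package):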

python -c "
import duckdb
from pathlib import Path

db = duckdb.connect('data/sample/analytics.duckdb')
parquet_dir = Path('data/sample/parquet')

for f in sorted(parquet_dir.glob('*.parquet')):
    table = f.stem
    db.execute(f'CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet(\"{f}\")')
    count = db.execute(f'SELECT count(*) FROM {table}').fetchone()[0]
    print(f'  {table}: {count:,} rows')

db.close()
print('Database: data/sample/analytics.duckdb')
"

Built-in Analytical Patterns

The generator creates data with discoverable patterns for realistic analysis:

  • Seasonality: Q4 traffic and orders ~2x higher than Q1
  • Growth trend: 50% increase in activity over the generated time period
  • Channel effectiveness: paid_search has the highest click-through rate
  • Customer lifetime: Pareto distribution (20% of customers generate 80% of orders)
  • Segment differences: B2B enterprise has 3-5x higher order values
  • Product mix: Electronics = high revenue / lower margin, Books = low revenue / high margin
  • Support correlation: 60% of tickets linked to specific orders
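
Two of these patterns can be checked with a few lines of standard-library Python. The sketch below measures order concentration from `orders.customer_id` and the Q4/Q1 ratio from `web_sessions.started_at`. The function names are hypothetical, the timestamps are assumed to be ISO 8601 strings, and the sample values are hand-made rather than generator output.

```python
import math
from collections import Counter
from datetime import datetime


def top_customer_share(order_customer_ids, top_fraction=0.20):
    """Share of orders placed by the most active top_fraction of ordering customers.

    Customers without any order do not appear in the input and are not counted.
    """
    counts = sorted(Counter(order_customer_ids).values(), reverse=True)
    if not counts:
        return 0.0
    top_n = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)


def quarter_ratio(timestamps, numerator=4, denominator=1):
    """Ratio of event counts between two calendar quarters (default Q4 / Q1)."""
    per_quarter = Counter(
        (datetime.fromisoformat(ts).month - 1) // 3 + 1 for ts in timestamps
    )
    if per_quarter[denominator] == 0:
        return None
    return per_quarter[numerator] / per_quarter[denominator]


# Hand-made data: one customer with 16 orders, four customers with 1 order each.
order_customers = ["C-000001"] * 16 + ["C-000002", "C-000003", "C-000004", "C-000005"]
print(f"top 20% share of orders: {top_customer_share(order_customers):.0%}")

sessions = ["2024-02-10 09:00:00", "2024-11-05 12:30:00", "2024-12-24 18:45:00"]
print(f"Q4/Q1 session ratio: {quarter_ratio(sessions)}")
```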

Reproducibility

Running the generator with the same --size and --seed always produces identical output. The default seed is 42.

# These two commands produce the same files
python scripts/generate_sample_data.py --size s --seed 42 --output run1
python scripts/generate_sample_data.py --size s --seed 42 --output run2
diff -r run1 run2  # no differences
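
Where `diff` is unavailable, or a single fingerprint is easier to store in CI, the same comparison can be sketched in Python by hashing every file in an output directory. `directory_digest` is a hypothetical helper, and the demo writes its own small files instead of calling the generator.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root):
    """Return one SHA-256 hex digest covering relative file names and contents."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between name and content
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks so large presets do not need to fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Demo with two hand-written directories standing in for run1 and run2.
with tempfile.TemporaryDirectory() as tmp:
    for run in ("run1", "run2"):
        run_dir = Path(tmp) / run
        run_dir.mkdir()
        (run_dir / "customers.csv").write_text("customer_id\nC-000001\n", encoding="utf-8")
    same = directory_digest(Path(tmp) / "run1") == directory_digest(Path(tmp) / "run2")
    print("identical:", same)
```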

Server Deployment

To use sample data on a deployed server (instead of connecting a data adapter):

# On the server, from the install directory containing your repo checkout
# and Python venv (paths vary per deployment):
cd <install-dir>/repo

# Generate Parquet files directly using the project's ParquetManager
# (snappy compression, proper column types, embedded metadata)
<install-dir>/.venv/bin/python scripts/generate_sample_data.py \
    --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet
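
The mode 2775 combines rwxrwxr-x with the setgid bit. On a directory, setgid makes newly created files inherit the directory's group, here data-ops. The sketch below decodes the octal value using Python's `stat` module. `describe_mode` and `has_setgid` are hypothetical helper names.

```python
import stat


def describe_mode(mode, is_dir=True):
    """Render an octal permission value the way `ls -l` would show it."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | mode)


def has_setgid(mode):
    """Return True if the setgid bit (the leading 2 in 2775) is set."""
    return bool(mode & stat.S_ISGID)


print(describe_mode(0o2775))  # drwxrwsr-x
print(has_setgid(0o2775))     # True
```

To check a real path after running the commands above, `stat.S_IMODE(os.stat(path).st_mode)` yields a value that can be passed to the same helpers.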