agnes-the-ai-analyst/ARCHITECTURE.md
Petr 38b86127ed Branding cleanup: remove Keboola-specific references from docs and config
- server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE
- server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions
- .gitignore: generic comment for data directory
- CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/
- docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern
- connectors/jira/README.md: keboola-analyst -> data-analyst in config paths
- dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding
2026-03-09 12:22:27 +01:00


Architecture

System Overview

Data Source (Keboola / CSV / BigQuery)
      |
      v
+------------------------------------------+
|  Data Broker Server                      |
|                                          |
|  src/data_sync.py                        |
|    -> connectors/*.py (fetch data)       |
|    -> src/parquet_manager.py (convert)   |
|                                          |
|  /data/src_data/parquet/   (output)      |
|  /data/docs/               (synced docs) |
|  /data/scripts/            (helpers)     |
+------------------------------------------+
      | rsync over SSH
      v
+------------------------------------------+
|  Analyst Machine                         |
|                                          |
|  server/parquet/  -> DuckDB views        |
|  user/duckdb/analytics.duckdb            |
|  Claude Code queries DuckDB via SQL      |
+------------------------------------------+

Components

1. Data Sync Engine (src/, connectors/)

Pulls data from the configured source and converts it to Parquet.

| File | Role |
|------|------|
| src/data_sync.py | Orchestration + DataSource ABC (line 149) |
| connectors/keboola/adapter.py | Keboola data source |
| connectors/keboola/client.py | Low-level Keboola API client |
| src/parquet_manager.py | CSV -> typed Parquet conversion |
| src/config.py | Reads data_description.md for table definitions |
| src/profiler.py | Data profiling for catalog UI |
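
The connector contract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in src/data_sync.py: the method name `fetch_table`, the `CONNECTORS` dict, the `register` decorator, the `sync` helper, and the toy `InlineCsvSource` class are all hypothetical names chosen for illustration.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, Type


class DataSource(ABC):
    """Hypothetical shape of the DataSource ABC; the real interface may differ."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> Iterator[dict]:
        """Yield the rows of one table as dictionaries."""


# Hypothetical registry mapping a source name to its connector class.
CONNECTORS: Dict[str, Type[DataSource]] = {}


def register(name: str) -> Callable[[Type[DataSource]], Type[DataSource]]:
    """Class decorator that adds a connector to the registry under `name`."""
    def decorator(cls: Type[DataSource]) -> Type[DataSource]:
        CONNECTORS[name] = cls
        return cls
    return decorator


@register("inline_csv")
class InlineCsvSource(DataSource):
    """Toy connector that serves tables from in-memory CSV text."""

    def __init__(self, tables: Dict[str, str]) -> None:
        self._tables = tables

    def fetch_table(self, table_id: str) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self._tables[table_id]))


def sync(source_name: str, table_ids: Iterable[str], **kwargs) -> Dict[str, list]:
    """Look up the connector by name and pull every requested table."""
    source = CONNECTORS[source_name](**kwargs)
    return {table_id: list(source.fetch_table(table_id)) for table_id in table_ids}
```

With a registry of this kind, adding a source means adding one class under connectors/ and registering it; the orchestration loop does not need to change.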

2. Web Application (webapp/)

Flask app for user onboarding, settings, and data catalog.

| File | Role |
|------|------|
| webapp/app.py | Flask entry point, routes |
| webapp/config.py | Loads instance.yaml, exposes Config to templates |
| webapp/account_service.py | User account details, sync status |
| webapp/templates/ | Jinja2 templates (dashboard, setup, catalog) |

3. Configuration (config/)

| File | Role |
|------|------|
| config/instance.yaml | Main instance config (not committed) |
| config/instance.yaml.example | Template with all options |
| config/loader.py | YAML loader with ${ENV_VAR} interpolation |
| config/.env.template | Secret variable placeholders |
| docs/data_description.md | Table schemas + sync strategies (not committed) |
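
The ${ENV_VAR} interpolation could look roughly like this. It is a sketch, not the code in config/loader.py: the function name `interpolate` and the `env` parameter are hypothetical, and the choice to leave an unresolved reference in place (after logging a warning) is an assumption.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, env=None):
    """Recursively replace ${VAR} references in strings, lists, and dicts.

    Assumption: an unresolved reference is left untouched and logged as a warning.
    """
    env = os.environ if env is None else env
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                logger.warning("Environment variable %s is not set", name)
                return match.group(0)
            return env[name]
        return _ENV_REF.sub(substitute, value)
    if isinstance(value, dict):
        return {key: interpolate(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [interpolate(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Running the substitution over the already-parsed YAML tree, rather than over the raw file text, keeps secrets containing YAML-significant characters from changing the document structure.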

4. Server Infrastructure (server/)

Deployment, systemd services, security.

| File | Role |
|------|------|
| server/setup.sh | Initial server provisioning (groups, users, dirs) |
| server/webapp-setup.sh | Nginx, SSL, systemd for webapp |
| server/deploy.sh | CI/CD deployment script |
| server/sudoers-deploy | Least-privilege sudo rules for deploy user |
| server/sudoers-webapp | Sudo rules for www-data (webapp) |
| server/bin/ | Management scripts (add-analyst, list-analysts, etc.) |

5. Analyst Scripts (scripts/)

Helper scripts synced to analyst machines.

| File | Role |
|------|------|
| scripts/sync_data.sh | Sync data from server via rsync |
| scripts/setup_views.sh | Create DuckDB views over Parquet files |
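
To illustrate what creating views over Parquet files involves, the sketch below generates one DuckDB `CREATE OR REPLACE VIEW` statement per file using DuckDB's `read_parquet` table function. The actual scripts/setup_views.sh is a shell script and may work differently; the function name `view_statements` and the one-flat-file-per-view layout are assumptions.

```python
from pathlib import Path


def view_statements(parquet_dir):
    """Build one CREATE VIEW statement per *.parquet file in parquet_dir.

    Assumption: each file maps to one view named after the file stem.
    """
    statements = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        view_name = path.stem.replace('"', '""')         # escape for a quoted identifier
        file_path = path.as_posix().replace("'", "''")   # escape for a string literal
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet('{file_path}');"
        )
    return statements
```

Because a view stores only the query, re-running the rsync step refreshes the data without the views having to be recreated, as long as the file paths stay the same.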

Config Loading Chain

config/instance.yaml
    |  (loaded by config/loader.py)
    |  (${ENV_VAR} references resolved from .env / environment)
    v
webapp/config.py
    |  (_load_instance_config at module level)
    |  (_get(config, *keys) for safe nested access)
    v
inject_config() context processor
    |  (exposes Config object to all Jinja templates)
    v
{{ config.INSTANCE_NAME }} in templates

Data Flow

1. Admin defines tables in docs/data_description.md
2. src/config.py parses YAML blocks from markdown
3. src/data_sync.py iterates tables, calls adapter
4. Adapter fetches CSV/JSON from source API
5. src/parquet_manager.py converts to typed Parquet
6. Parquet files stored in /data/src_data/parquet/
7. Analyst runs scripts/sync_data.sh (rsync over SSH)
8. scripts/setup_views.sh creates DuckDB views
9. Claude Code queries DuckDB, returns insights
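
Step 2 above, pulling YAML blocks out of a markdown file, might be approached like this. The sketch assumes the table definitions sit in fenced blocks labelled `yaml`; the real src/config.py may use a different convention, and `extract_yaml_blocks` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

# Assumption: table definitions live in fenced blocks labelled "yaml".
_YAML_BLOCK = re.compile(
    rf"^{FENCE}yaml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown_text):
    """Return the raw text of every yaml-labelled fenced block, in document order."""
    return [match.group(1) for match in _YAML_BLOCK.finditer(markdown_text)]
```

Each returned string can then be handed to a YAML parser such as PyYAML's `yaml.safe_load` to obtain the table definition.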

Security Model

  • Groups: data-ops (admins), dataread (analysts), data-private (privileged)
  • Sudoers: Explicit command whitelisting (no wildcards)
  • SSH: Key-based auth only, keys registered via webapp
  • OAuth: Google domain restriction via auth.allowed_domain
  • Secrets: ${ENV_VAR} in YAML, actual values in .env (gitignored)
  • Staging: /tmp/data_analyst_staging with setgid for group ownership

Key Patterns

  • Connector pattern: Dynamic connector registry in src/data_sync.py; connectors/keboola/ serves as the reference implementation
  • Atomic writes: tempfile.mkstemp() + os.fchmod() + os.replace() for JSON state files
  • User home writes: sudo install -o {user} -g {user} for writing to analyst home dirs
  • Config interpolation: ${ENV_VAR} references in YAML are resolved at load time; missing variables are logged as warnings
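
The atomic-write pattern listed above combines three standard-library calls. The sketch below shows one way they fit together; the function name `write_json_atomic`, the default mode of 0o644, and the extra `os.fsync` call are assumptions rather than details taken from the codebase. `os.fchmod` is available on Unix only.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload, mode=0o644):
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0o600; set the final mode before publishing.
            os.fchmod(handle.fileno(), mode)
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # assumption: flush to disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Setting the permissions on the open descriptor, before the rename, means the file never appears at its final path with the restrictive mode that `mkstemp` applies by default.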