agnes-the-ai-analyst/ARCHITECTURE.md
Petr 38b86127ed Branding cleanup: remove Keboola-specific references from docs and config
- server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE
- server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions
- .gitignore: generic comment for data directory
- CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/
- docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern
- connectors/jira/README.md: keboola-analyst -> data-analyst in config paths
- dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding
2026-03-09 12:22:27 +01:00

# Architecture
## System Overview
```
Data Source (Keboola / CSV / BigQuery)
        |
        v
+------------------------------------------+
|  Data Broker Server                      |
|                                          |
|  src/data_sync.py                        |
|    -> connectors/*.py (fetch data)       |
|    -> src/parquet_manager.py (convert)   |
|                                          |
|  /data/src_data/parquet/ (output)        |
|  /data/docs/             (synced docs)   |
|  /data/scripts/          (helpers)       |
+------------------------------------------+
        |  rsync over SSH
        v
+------------------------------------------+
|  Analyst Machine                         |
|                                          |
|  server/parquet/ -> DuckDB views         |
|  user/duckdb/analytics.duckdb            |
|  Claude Code queries DuckDB via SQL      |
+------------------------------------------+
```
## Components
### 1. Data Sync Engine (`src/`)
Pulls data from the configured source and converts it to Parquet.
| File | Role |
|------|------|
| `src/data_sync.py` | Orchestration + `DataSource` ABC (line 149) |
| `connectors/keboola/adapter.py` | Keboola data source |
| `connectors/keboola/client.py` | Low-level Keboola API client |
| `src/parquet_manager.py` | CSV -> typed Parquet conversion |
| `src/config.py` | Reads `data_description.md` for table definitions |
| `src/profiler.py` | Data profiling for catalog UI |
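
As a rough illustration of how a `DataSource` ABC and a connector registry can fit together, here is a minimal sketch. It is not the code in `src/data_sync.py`: the method name `fetch_table`, the `register` decorator, `create_source`, and the toy `InlineSource` connector are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Type


class DataSource(ABC):
    """Hypothetical minimal interface; the real ABC lives in src/data_sync.py."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> Iterator[dict]:
        """Yield the rows of one table as dictionaries."""


# Hypothetical registry mapping a source name to its connector class.
_REGISTRY: Dict[str, Type[DataSource]] = {}


def register(name: str):
    """Class decorator that records a connector under the given name."""
    def decorator(cls: Type[DataSource]) -> Type[DataSource]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("inline")
class InlineSource(DataSource):
    """Toy connector that serves rows from memory instead of calling an API."""

    def __init__(self, tables: Dict[str, list]):
        self._tables = tables

    def fetch_table(self, table_id: str) -> Iterator[dict]:
        yield from self._tables[table_id]


def create_source(name: str, **kwargs) -> DataSource:
    """Look up a connector by name and instantiate it."""
    if name not in _REGISTRY:
        raise ValueError(f"Unknown data source: {name!r}")
    return _REGISTRY[name](**kwargs)
```

With this shape, the orchestration code only depends on the abstract interface, and adding a source means adding one registered class.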
### 2. Web Application (`webapp/`)
Flask app for user onboarding, settings, and the data catalog.
| File | Role |
|------|------|
| `webapp/app.py` | Flask entry point, routes |
| `webapp/config.py` | Loads `instance.yaml`, exposes `Config` to templates |
| `webapp/account_service.py` | User account details, sync status |
| `webapp/templates/` | Jinja2 templates (dashboard, setup, catalog) |
### 3. Configuration (`config/`)
| File | Role |
|------|------|
| `config/instance.yaml` | Main instance config (not committed) |
| `config/instance.yaml.example` | Template with all options |
| `config/loader.py` | YAML loader with `${ENV_VAR}` interpolation |
| `config/.env.template` | Secret variable placeholders |
| `docs/data_description.md` | Table schemas + sync strategies (not committed) |
### 4. Server Infrastructure (`server/`)
Deployment, systemd services, security.
| File | Role |
|------|------|
| `server/setup.sh` | Initial server provisioning (groups, users, dirs) |
| `server/webapp-setup.sh` | Nginx, SSL, systemd for webapp |
| `server/deploy.sh` | CI/CD deployment script |
| `server/sudoers-deploy` | Least-privilege sudo rules for deploy user |
| `server/sudoers-webapp` | Sudo rules for www-data (webapp) |
| `server/bin/` | Management scripts (add-analyst, list-analysts, etc.) |
### 5. Analyst Scripts (`scripts/`)
Helper scripts synced to analyst machines.
| File | Role |
|------|------|
| `scripts/sync_data.sh` | Sync data from server via rsync |
| `scripts/setup_views.sh` | Create DuckDB views over Parquet files |
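
To make the "views over Parquet files" idea concrete, the sketch below generates the kind of SQL such a step could run. It is written in Python for illustration only and is not a translation of `scripts/setup_views.sh`. It assumes one Parquet file per table directly inside the synced directory and uses the file stem as the view name; `view_statements` is a hypothetical helper. `read_parquet` is DuckDB's table function for Parquet files.

```python
from pathlib import Path
from typing import List


def view_statements(parquet_dir: Path) -> List[str]:
    """Build one CREATE VIEW statement per Parquet file in parquet_dir."""
    statements = []
    for path in sorted(parquet_dir.glob("*.parquet")):
        # Escape quotes so unusual file names cannot break the SQL.
        view_name = path.stem.replace('"', '""')
        file_path = path.as_posix().replace("'", "''")
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet('{file_path}');"
        )
    return statements
```

Because the views only reference the files, re-running the data sync refreshes query results without rebuilding the database.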
## Config Loading Chain
```
config/instance.yaml
| (loaded by config/loader.py)
| (${ENV_VAR} references resolved from .env / environment)
v
webapp/config.py
| (_load_instance_config at module level)
| (_get(config, *keys) for safe nested access)
v
inject_config() context processor
| (exposes Config object to all Jinja templates)
v
{{ config.INSTANCE_NAME }} in templates
```
## Data Flow
```
1. Admin defines tables in docs/data_description.md
2. src/config.py parses YAML blocks from markdown
3. src/data_sync.py iterates tables, calls adapter
4. Adapter fetches CSV/JSON from source API
5. src/parquet_manager.py converts to typed Parquet
6. Parquet files stored in /data/src_data/parquet/
7. Analyst runs scripts/sync_data.sh (rsync over SSH)
8. scripts/setup_views.sh creates DuckDB views
9. Claude Code queries DuckDB, returns insights
```
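
Step 2 depends on pulling YAML blocks out of a Markdown file. A minimal sketch of that extraction is shown below, assuming the blocks are fenced code blocks tagged `yaml` or `yml`. The actual format expected by `src/config.py` is not described here, and `extract_yaml_blocks` is a hypothetical name. The sketch returns raw block text and leaves parsing to a YAML library.

```python
import re
from typing import List

# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3

_YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> List[str]:
    """Return the raw text of every fenced yaml block, in document order."""
    return [match.group(1) for match in _YAML_BLOCK.finditer(markdown)]
```

Keeping table definitions inside the Markdown file lets the same document serve as human-readable documentation and as machine-readable sync configuration.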
## Security Model
- **Groups**: `data-ops` (admins), `dataread` (analysts), `data-private` (privileged)
- **Sudoers**: Explicit command whitelisting (no wildcards)
- **SSH**: Key-based auth only, keys registered via webapp
- **OAuth**: Google domain restriction via `auth.allowed_domain`
- **Secrets**: `${ENV_VAR}` in YAML, actual values in `.env` (gitignored)
- **Staging**: `/tmp/data_analyst_staging` with setgid for group ownership
## Key Patterns
- **Connector pattern**: Connectors are registered dynamically in `src/data_sync.py`; `connectors/keboola/` serves as the reference implementation
- **Atomic writes**: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` for JSON state files
- **User home writes**: `sudo install -o {user} -g {user}` for writing to analyst home dirs
- **Config interpolation**: `${ENV_VAR}` in YAML resolved at load time, missing vars logged as warnings
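
The atomic-write pattern listed above can be sketched like this. It uses the same three calls (`tempfile.mkstemp()`, `os.fchmod()`, `os.replace()`), but the function name `write_json_atomic` and the default mode `0o644` are assumptions for the example, not values taken from the codebase. `os.fchmod()` is a Unix call, which matches the server environment described here.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data: dict, mode: int = 0o644) -> None:
    """Write data as JSON so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0600; widen it before it becomes visible.
            os.fchmod(handle.fileno(), mode)
            json.dump(data, handle)
        os.replace(tmp_path, path)  # atomic rename over any existing file
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Setting permissions on the temporary file before the rename means the final path never exists with the wrong mode, even briefly.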