Architecture
System Overview
Data Source (Keboola / CSV / BigQuery)
|
v
+------------------------------------------+
| Data Broker Server |
| |
| src/data_sync.py |
| -> connectors/*.py (fetch data) |
| -> src/parquet_manager.py (convert) |
| |
| /data/src_data/parquet/ (output) |
| /data/docs/ (synced docs) |
| /data/scripts/ (helpers) |
+------------------------------------------+
| rsync over SSH
v
+------------------------------------------+
| Analyst Machine |
| |
| server/parquet/ -> DuckDB views |
| user/duckdb/analytics.duckdb |
| Claude Code queries DuckDB via SQL |
+------------------------------------------+
Components
1. Data Sync Engine (src/)
Pulls data from the configured source and converts it to Parquet.
| File | Role |
|---|---|
| src/data_sync.py | Orchestration + DataSource ABC (line 149) |
| connectors/keboola/adapter.py | Keboola data source |
| connectors/keboola/client.py | Low-level Keboola API client |
| src/parquet_manager.py | CSV -> typed Parquet conversion |
| src/config.py | Reads data_description.md for table definitions |
| src/profiler.py | Data profiling for catalog UI |
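The split between an abstract data source and concrete connectors can be illustrated with a minimal sketch. Everything named here (the `fetch_table` method, the `register` decorator, `create_source`, and the `csv_inline` connector) is a hypothetical stand-in; the actual ABC in src/data_sync.py and the Keboola adapter may look different.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class DataSource(ABC):
    """Hypothetical interface; the real ABC in src/data_sync.py may differ."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> str:
        """Return the contents of one table as CSV text."""


# Maps a source name (as it might appear in config) to its connector class.
_REGISTRY: Dict[str, Type[DataSource]] = {}


def register(name: str) -> Callable[[Type[DataSource]], Type[DataSource]]:
    """Class decorator that adds a connector to the registry under `name`."""
    def decorator(cls: Type[DataSource]) -> Type[DataSource]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("csv_inline")
class InlineCsvSource(DataSource):
    """Toy connector that serves CSV text held in memory (illustration only)."""

    def __init__(self, tables: Dict[str, str]) -> None:
        self._tables = tables

    def fetch_table(self, table_id: str) -> str:
        return self._tables[table_id]


def create_source(name: str, **kwargs) -> DataSource:
    """Look up a registered connector by name and instantiate it."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown data source: {name!r}") from None
    return cls(**kwargs)
```

With a registry like this, the orchestrator only needs the source name from configuration; adding a connector means adding a class, not editing the sync loop.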
2. Web Application (webapp/)
Flask app for user onboarding, settings, and data catalog.
| File | Role |
|---|---|
| webapp/app.py | Flask entry point, routes |
| webapp/config.py | Loads instance.yaml, exposes Config to templates |
| webapp/account_service.py | User account details, sync status |
| webapp/templates/ | Jinja2 templates (dashboard, setup, catalog) |
3. Configuration (config/)
| File | Role |
|---|---|
| config/instance.yaml | Main instance config (not committed) |
| config/instance.yaml.example | Template with all options |
| config/loader.py | YAML loader with ${ENV_VAR} interpolation |
| config/.env.template | Secret variable placeholders |
| docs/data_description.md | Table schemas + sync strategies (not committed) |
4. Server Infrastructure (server/)
Deployment, systemd services, security.
| File | Role |
|---|---|
| server/setup.sh | Initial server provisioning (groups, users, dirs) |
| server/webapp-setup.sh | Nginx, SSL, systemd for webapp |
| server/deploy.sh | CI/CD deployment script |
| server/sudoers-deploy | Least-privilege sudo rules for deploy user |
| server/sudoers-webapp | Sudo rules for www-data (webapp) |
| server/bin/ | Management scripts (add-analyst, list-analysts, etc.) |
5. Analyst Scripts (scripts/)
Helper scripts synced to analyst machines.
| File | Role |
|---|---|
| scripts/sync_data.sh | Sync data from server via rsync |
| scripts/setup_views.sh | Create DuckDB views over Parquet files |
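To make the "views over Parquet files" idea concrete, the sketch below generates one `CREATE OR REPLACE VIEW` statement per Parquet file, using DuckDB's `read_parquet` table function. It is not the contents of scripts/setup_views.sh (which is a shell script); the function names and the one-view-per-file naming rule are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def view_statements(parquet_dir: str) -> List[str]:
    """Build one view definition per *.parquet file in `parquet_dir`.

    Assumption: each view is named after the file stem (orders.parquet -> orders).
    """
    statements = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        # Escape single quotes so the path is a valid SQL string literal.
        literal = path.as_posix().replace("'", "''")
        statements.append(
            f"CREATE OR REPLACE VIEW {quote_ident(path.stem)} AS "
            f"SELECT * FROM read_parquet('{literal}');"
        )
    return statements
```

Because a view stores only the query, re-running the rsync step refreshes the data without the views having to be recreated, as long as file names stay the same.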
Config Loading Chain
config/instance.yaml
| (loaded by config/loader.py)
| (${ENV_VAR} references resolved from .env / environment)
v
webapp/config.py
| (_load_instance_config at module level)
| (_get(config, *keys) for safe nested access)
v
inject_config() context processor
| (exposes Config object to all Jinja templates)
v
{{ config.INSTANCE_NAME }} in templates
Data Flow
1. Admin defines tables in docs/data_description.md
2. src/config.py parses YAML blocks from markdown
3. src/data_sync.py iterates tables, calls adapter
4. Adapter fetches CSV/JSON from source API
5. src/parquet_manager.py converts to typed Parquet
6. Parquet files stored in /data/src_data/parquet/
7. Analyst runs scripts/sync_data.sh (rsync over SSH)
8. scripts/setup_views.sh creates DuckDB views
9. Claude Code queries DuckDB, returns insights
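Step 2 above, pulling YAML blocks out of a Markdown file, can be sketched with the standard library alone. The function name and the assumption that table definitions sit in fenced blocks tagged `yaml` or `yml` are hypothetical; src/config.py may locate the blocks differently.

```python
import re
from typing import List

# Built from single characters so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# A fenced block tagged yaml/yml: opening fence line, body, closing fence line.
_YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> List[str]:
    """Return the raw text of every yaml-tagged fenced block, in document order."""
    return [match.group(1) for match in _YAML_BLOCK.finditer(markdown)]
```

Each returned string could then be handed to a YAML parser (for example PyYAML's `yaml.safe_load`) to obtain the table definition as a dictionary.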
Security Model
- Groups: data-ops (admins), dataread (analysts), data-private (privileged)
- Sudoers: Explicit command whitelisting (no wildcards)
- SSH: Key-based auth only, keys registered via webapp
- OAuth: Google domain restriction via auth.allowed_domain
- Secrets: ${ENV_VAR} in YAML, actual values in .env (gitignored)
- Staging: /tmp/data_analyst_staging with setgid for group ownership
Key Patterns
- Connector pattern: Dynamic connector registry in src/data_sync.py, connectors/keboola/ for reference
- Atomic writes: tempfile.mkstemp() + os.fchmod() + os.replace() for JSON state files
- User home writes: sudo install -o {user} -g {user} for writing to analyst home dirs
- Config interpolation: ${ENV_VAR} in YAML resolved at load time, missing vars logged as warnings
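The atomic-write pattern can be sketched as below. The function name, the default mode, and the extra `os.fsync()` call are assumptions added for illustration; only the `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` sequence comes from the list above. Note that `os.fchmod()` is available on Unix only.

```python
import json
import os
import tempfile


def write_json_atomic(path, data, mode=0o644):
    """Write `data` as JSON so readers see either the old file or the new one."""
    # The temp file must live in the target directory: os.replace() is only
    # atomic when source and destination are on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            # mkstemp creates the file as 0600; set the intended mode before
            # the file becomes visible under its final name.
            os.fchmod(handle.fileno(), mode)
            # Assumption: flush to disk so a crash cannot leave an empty file.
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Setting permissions on the open descriptor, rather than on the final path after the rename, avoids a window in which the state file exists with the wrong mode.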