- server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE
- server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions
- .gitignore: generic comment for data directory
- CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/
- docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern
- connectors/jira/README.md: keboola-analyst -> data-analyst in config paths
- dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding
134 lines
4.7 KiB
Markdown
# Architecture

## System Overview

```
Data Source (Keboola / CSV / BigQuery)
        |
        v
+------------------------------------------+
|  Data Broker Server                      |
|                                          |
|  src/data_sync.py                        |
|    -> connectors/*.py (fetch data)       |
|    -> src/parquet_manager.py (convert)   |
|                                          |
|  /data/src_data/parquet/  (output)       |
|  /data/docs/              (synced docs)  |
|  /data/scripts/           (helpers)      |
+------------------------------------------+
        | rsync over SSH
        v
+------------------------------------------+
|  Analyst Machine                         |
|                                          |
|  server/parquet/ -> DuckDB views         |
|  user/duckdb/analytics.duckdb            |
|  Claude Code queries DuckDB via SQL      |
+------------------------------------------+
```

## Components

### 1. Data Sync Engine (`src/`)

Pulls data from configured source, converts to Parquet.

| File | Role |
|------|------|
| `src/data_sync.py` | Orchestration + `DataSource` ABC (line 149) |
| `connectors/keboola/adapter.py` | Keboola data source |
| `connectors/keboola/client.py` | Low-level Keboola API client |
| `src/parquet_manager.py` | CSV -> typed Parquet conversion |
| `src/config.py` | Reads `data_description.md` for table definitions |
| `src/profiler.py` | Data profiling for catalog UI |

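The relationship between the `DataSource` ABC and the connector modules can be illustrated with a minimal sketch. Everything below is hypothetical: the method name `fetch_table`, the `register` decorator, the `create_source` factory, and the in-memory demo source are assumptions for illustration, not the actual interface in `src/data_sync.py`.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Illustrative stand-in for the ABC in src/data_sync.py (real methods may differ)."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> list:
        """Return the rows of one table as a list of dicts."""


# Hypothetical registry: connector name -> DataSource subclass.
_REGISTRY = {}


def register(name):
    """Class decorator that records a connector under the given name."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("demo")
class InMemorySource(DataSource):
    """Toy connector that serves rows from a dict instead of calling an API."""

    def __init__(self, tables):
        self._tables = tables

    def fetch_table(self, table_id):
        return list(self._tables[table_id])


def create_source(name, **kwargs):
    """Look up a connector by name and instantiate it."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown connector: {name!r}") from None
    return cls(**kwargs)
```

With a layout like this, the orchestration code only needs the connector name from configuration, for example `create_source("demo", tables={"orders": [{"id": 1}]})`, and never imports a concrete connector class directly.
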
### 2. Web Application (`webapp/`)

Flask app for user onboarding, settings, and data catalog.

| File | Role |
|------|------|
| `webapp/app.py` | Flask entry point, routes |
| `webapp/config.py` | Loads `instance.yaml`, exposes `Config` to templates |
| `webapp/account_service.py` | User account details, sync status |
| `webapp/templates/` | Jinja2 templates (dashboard, setup, catalog) |

### 3. Configuration (`config/`)

| File | Role |
|------|------|
| `config/instance.yaml` | Main instance config (not committed) |
| `config/instance.yaml.example` | Template with all options |
| `config/loader.py` | YAML loader with `${ENV_VAR}` interpolation |
| `config/.env.template` | Secret variable placeholders |
| `docs/data_description.md` | Table schemas + sync strategies (not committed) |

### 4. Server Infrastructure (`server/`)

Deployment, systemd services, security.

| File | Role |
|------|------|
| `server/setup.sh` | Initial server provisioning (groups, users, dirs) |
| `server/webapp-setup.sh` | Nginx, SSL, systemd for webapp |
| `server/deploy.sh` | CI/CD deployment script |
| `server/sudoers-deploy` | Least-privilege sudo rules for deploy user |
| `server/sudoers-webapp` | Sudo rules for www-data (webapp) |
| `server/bin/` | Management scripts (add-analyst, list-analysts, etc.) |

### 5. Analyst Scripts (`scripts/`)

Helper scripts synced to analyst machines.

| File | Role |
|------|------|
| `scripts/sync_data.sh` | Sync data from server via rsync |
| `scripts/setup_views.sh` | Create DuckDB views over Parquet files |

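To make "DuckDB views over Parquet files" concrete, here is a rough sketch of the kind of SQL such a step might generate. It is not the contents of `scripts/setup_views.sh`; the function name `view_statements`, the flat directory layout, and the one-view-per-file naming are assumptions. The SQL relies only on DuckDB's `CREATE OR REPLACE VIEW` and `read_parquet`.

```python
from pathlib import Path


def view_statements(parquet_dir):
    """Build one CREATE VIEW statement per *.parquet file in a directory.

    Assumption: files sit directly in parquet_dir and each view is named
    after the file stem (orders.parquet -> view "orders").
    """
    statements = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        view_name = path.stem.replace('"', '""')    # escape identifier quotes
        file_path = str(path).replace("'", "''")    # escape string quotes
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet('{file_path}');"
        )
    return statements
```

The resulting statements could then be executed against `user/duckdb/analytics.duckdb` with the DuckDB CLI or Python client. Because the views only reference the Parquet files, a later rsync that refreshes those files should not require recreating the views.
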
## Config Loading Chain

```
config/instance.yaml
    |  (loaded by config/loader.py)
    |  (${ENV_VAR} references resolved from .env / environment)
    v
webapp/config.py
    |  (_load_instance_config at module level)
    |  (_get(config, *keys) for safe nested access)
    v
inject_config() context processor
    |  (exposes Config object to all Jinja templates)
    v
{{ config.INSTANCE_NAME }} in templates
```

## Data Flow

```
1. Admin defines tables in docs/data_description.md
2. src/config.py parses YAML blocks from markdown
3. src/data_sync.py iterates tables, calls adapter
4. Adapter fetches CSV/JSON from source API
5. src/parquet_manager.py converts to typed Parquet
6. Parquet files stored in /data/src_data/parquet/
7. Analyst runs scripts/sync_data.sh (rsync over SSH)
8. scripts/setup_views.sh creates DuckDB views
9. Claude Code queries DuckDB, returns insights
```

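Step 2 above, pulling YAML blocks out of a Markdown file, might look roughly like the sketch below. The actual format of `docs/data_description.md` and the logic in `src/config.py` are not shown in this document, so the assumption here is that table definitions live in fenced blocks tagged `yaml` or `yml`; the function name `extract_yaml_blocks` is hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# An opening fence tagged yaml/yml, the block body, then a closing fence.
_YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown):
    """Return the raw text of every fenced YAML block, in document order."""
    return [match.group(1) for match in _YAML_BLOCK.finditer(markdown)]
```

Each returned string could then be handed to a YAML parser (for instance PyYAML's `yaml.safe_load`) to obtain the table definitions that the sync loop in step 3 iterates over.
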
## Security Model

- **Groups**: `data-ops` (admins), `dataread` (analysts), `data-private` (privileged)
- **Sudoers**: Explicit command whitelisting (no wildcards)
- **SSH**: Key-based auth only, keys registered via webapp
- **OAuth**: Google domain restriction via `auth.allowed_domain`
- **Secrets**: `${ENV_VAR}` in YAML, actual values in `.env` (gitignored)
- **Staging**: `/tmp/data_analyst_staging` with setgid for group ownership

## Key Patterns

- **Connector pattern**: Dynamic connector registry in `src/data_sync.py`; see `connectors/keboola/` for a reference implementation
- **Atomic writes**: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` for JSON state files
- **User home writes**: `sudo install -o {user} -g {user}` for writing to analyst home dirs
- **Config interpolation**: `${ENV_VAR}` in YAML is resolved at load time; missing variables are logged as warnings

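The atomic-write pattern listed above can be sketched as follows. This is a minimal illustration of combining `tempfile.mkstemp()`, `os.fchmod()`, and `os.replace()`, not the project's actual helper; the function name `atomic_write_json` and the default mode `0o640` are assumptions. `os.fchmod()` is a Unix call, so the sketch targets Linux servers.

```python
import json
import os
import tempfile


def atomic_write_json(path, data, mode=0o640):
    """Write JSON so that readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory (same filesystem) as the
    # target, otherwise os.replace() cannot swap it in atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0600; set the final mode up front.
            os.fchmod(handle.fileno(), mode)
            json.dump(data, handle, indent=2)
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the rename is the only step that touches the target path, a crash mid-write leaves the previous state file intact rather than a truncated one.
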