AI Data Analyst
Open-source data distribution platform for AI analytical systems. Syncs data from various sources, converts to Parquet, and distributes to analysts who use Claude Code for local analysis.
First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
Step 1: Gather Information
Ask the user for:
- Company domain (e.g., "acme.com") - used for Google OAuth
- Data source type: keboola / csv / bigquery (future)
- Instance name (e.g., "Acme Data Analyst")
Step 2: Generate Configuration
- Copy config/instance.yaml.example to config/instance.yaml
- Fill in values from Step 1
- If Keboola: ask for Storage API token, stack URL, project ID
- Create .env from config/.env.template
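The two copy steps can be scripted so that an existing configuration is never overwritten. This is a minimal sketch, assuming it runs from the repository root; `copy_if_missing` is a hypothetical helper and not part of the project.

```python
import shutil
from pathlib import Path


def copy_if_missing(template: Path, target: Path) -> bool:
    """Copy template to target unless target already exists.

    Returns True if a copy was made, False if target was left untouched.
    """
    if target.exists():
        return False
    # Create the parent directory in case it is not there yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


# Example calls (paths taken from the steps above):
#   copy_if_missing(Path("config/instance.yaml.example"), Path("config/instance.yaml"))
#   copy_if_missing(Path("config/.env.template"), Path(".env"))
```

Returning a boolean lets the caller tell the user whether a fresh file was created or an existing one was kept.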
Step 3: Generate Data Description
- If Keboola adapter: use the API to fetch table metadata and generate docs/data_description.md
- If CSV: ask user to describe their data files
- The file defines tables, sync strategies, and schema
Step 4: Server Setup (if deploying)
- Guide VM provisioning (or use existing server)
- Run server/setup.sh on the target VM
- Run server/webapp-setup.sh for the web portal
- Set up CI/CD from .github/workflows/deploy.yml.example
Project Structure
├── src/ # Core data sync engine
│ ├── adapters/ # Data source adapters (Keboola, CSV, etc.)
│ ├── config.py # Configuration from data_description.md
│ ├── data_sync.py # Sync orchestration
│ ├── parquet_manager.py # Parquet file management
│ └── profiler.py # Data profiling
├── webapp/ # Flask web portal (login, dashboard, API)
├── server/ # Server deployment (systemd, scripts)
├── scripts/ # Utility scripts (sync, DuckDB setup)
├── config/ # Configuration templates
│ ├── instance.yaml.example
│ └── data_description.md.example
├── docs/ # Documentation
└── tests/ # Test suite
Architecture
Data Source (Keboola / CSV / BigQuery)
│
▼
┌─────────────────────────────────┐
│ Data Broker Server │
│ ├── /data/src_data/parquet/ │ Converted data
│ ├── /data/docs/ │ Documentation
│ └── /data/scripts/ │ Helper scripts
└─────────────────────────────────┘
│ rsync (via ~/server/ symlinks)
▼
┌─────────────────────────────────┐
│ Analyst (local machine) │
│ ├── ./server/ (read-only) │ parquet, docs, scripts
│ └── ./user/ (workspace) │ duckdb, notifications
└─────────────────────────────────┘
Configuration
Instance-specific config is in config/instance.yaml. See config/instance.yaml.example for all options.
Environment variables go in .env (never committed to git).
Data schema is defined in docs/data_description.md (YAML blocks in markdown).
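To illustrate what "YAML blocks in markdown" means in practice, the sketch below pulls the raw text of every yaml-tagged fenced block out of a markdown string. It is not the project's parser (that lives in src/config.py and may work differently); `extract_yaml_blocks` is a hypothetical name.

```python
import re

# Built this way so the example itself can sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block tagged yaml or yml and captures its body.
YAML_BLOCK_RE = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown_text: str) -> list[str]:
    """Return the raw text of every yaml-tagged fenced block, in order."""
    return [match.group(1) for match in YAML_BLOCK_RE.finditer(markdown_text)]
```

Each returned string could then be handed to a YAML parser, for example PyYAML's `yaml.safe_load`, if that library is installed.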
Development
# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Run webapp locally
flask --app webapp.app run --debug
# Run tests
pytest tests/ -v
# Sync data
python -m src.data_sync
Data Source Adapters
The platform supports pluggable data sources via src/adapters/:
- Keboola (keboola): Syncs from Keboola Storage API
- CSV (csv): Import from local CSV files (planned)
- BigQuery (bigquery): Query from Google BigQuery (planned)
Configure in config/instance.yaml under data_source.type.
Server Management
# Add analyst user
sudo add-analyst username "ssh-rsa AAAA..."
# Add privileged analyst
sudo add-analyst username "ssh-rsa AAAA..." --private
# List analysts
list-analysts
# Server monitoring
uptime && free -h && df -h /data
Returning Users
When reopening the project in Claude Code:
- Sync latest data: bash server/scripts/sync_data.sh
- Verify DuckDB: ls -lh user/duckdb/analytics.duckdb
- Start analyzing with Claude Code
Key Implementation Details
Config Loading Chain
- config/loader.py loads instance.yaml (checks $CONFIG_DIR, then ./config/)
- webapp/config.py calls _load_instance_config() at module level
- _get(config, *keys, default="") traverses nested dicts safely
- inject_config() context processor exposes Config to all Jinja templates
- Templates use {{ config.INSTANCE_NAME }}, {{ config.INSTANCE_SUBTITLE }}, etc.
Adapter Pattern
- Factory: src/adapters/__init__.py -> create_data_source(adapter_type, **kwargs)
- ABC: DataSource class in src/data_sync.py (lines 149-172)
- Keboola: src/adapters/keboola_adapter.py -> thin facade wrapping LocalKeboolaSource
- Core Keboola logic: src/keboola_client.py (788 lines, Keboola Storage API wrapper)
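The factory-plus-ABC arrangement listed above might look roughly like this. Only the names `DataSource` and `create_data_source(adapter_type, **kwargs)` come from the project; the `list_tables` method, the `InMemorySource` adapter, and the registry dict are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DataSource(ABC):
    """Stand-in for the abstract base class in src/data_sync.py."""

    @abstractmethod
    def list_tables(self) -> list[str]:  # hypothetical method
        ...


class InMemorySource(DataSource):
    """Toy adapter used only for this illustration."""

    def __init__(self, tables: Optional[list[str]] = None):
        self._tables = list(tables or [])

    def list_tables(self) -> list[str]:
        return list(self._tables)


# Registry mapping adapter_type strings to classes (hypothetical layout).
_ADAPTERS: dict[str, type[DataSource]] = {"memory": InMemorySource}


def create_data_source(adapter_type: str, **kwargs) -> DataSource:
    """Instantiate the adapter registered under adapter_type."""
    try:
        adapter_cls = _ADAPTERS[adapter_type]
    except KeyError:
        known = ", ".join(sorted(_ADAPTERS))
        raise ValueError(
            f"Unknown adapter type {adapter_type!r}; known: {known}"
        ) from None
    return adapter_cls(**kwargs)
```

With a registry like this, adding a new source means writing one adapter class and registering it under the string used for data_source.type.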
Server Patterns
- Atomic JSON writes: tempfile.mkstemp() + os.fchmod(fd, 0o660) + os.replace()
- User home writes: sudo /usr/bin/install -o {user} -g {user} pattern
- Staging dir: /tmp/data_analyst_staging (deploy.sh creates it with setgid)
- Dev docs: dev_docs/server.md documents all established patterns
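The atomic JSON write pattern combines the three calls named above. The sketch below is one way to put them together on a Unix host; `write_json_atomic` is a hypothetical name, and the `os.fsync` call is an optional addition that the pattern list does not mention.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict, mode: int = 0o660) -> None:
    """Write payload as JSON so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target for
    # os.replace() to be atomic, so create it in the target directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0o600; widen it for group access.
            os.fchmod(handle.fileno(), mode)
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # optional: push data to disk first
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the rename is the last step, a concurrent reader sees either the old file or the complete new one.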
Files NOT to modify (stable infrastructure)
- src/parquet_manager.py - Parquet conversion engine
- connectors/jira/file_lock.py - Advisory file locking
- connectors/jira/incremental_transform.py - Jira monthly Parquet transform
- server/ws_gateway/ - WebSocket notification gateway
Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs