From c8e232e43e76dbaaeda9f573e7b8a691b7d3c25c Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Thu, 9 Apr 2026 18:44:25 +0200 Subject: [PATCH] docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture - CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references, update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to config/.env.template and config/instance.yaml.example - disaster-recovery.md: rewrite for Docker volumes; cover GCP disk snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH - server.md: strip rsync, systemd, nginx, Linux group, and sudo sections; keep Docker Compose operations, log viewing, health checks, sync/admin CLI, and Jira webhook procedures --- dev_docs/disaster-recovery.md | 306 ++--- dev_docs/server.md | 2342 +++------------------------------ docs/CONFIGURATION.md | 58 +- 3 files changed, 347 insertions(+), 2359 deletions(-) diff --git a/dev_docs/disaster-recovery.md b/dev_docs/disaster-recovery.md index 7d4a4e9..ce58c90 100644 --- a/dev_docs/disaster-recovery.md +++ b/dev_docs/disaster-recovery.md @@ -1,161 +1,61 @@ # Disaster Recovery -Recovery procedures for the Data Broker Server (`your-server`). +Recovery procedures for the AI Data Analyst Docker deployment. ## Overview ``` -Disk Layout: - sda (10 GB) / System disk (instance) - EXPENDABLE - sdb (30 GB) /data Data disk - SNAPSHOTTED daily - sdc (30 GB) /home Home disk - SNAPSHOTTED daily +What lives where: + Docker volumes /data DuckDB files, parquet extracts, state + Git repo/ Application code — rebuild from GitHub + .env secrets Recreate from GitHub Secrets / 1Password ``` -**Key principle**: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted. +**Key principle**: the container is disposable. All unique data lives in the `/data` +Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image +and restoring `/data` brings the service back to full operation. -## What Lives Where +## Data Layout -| Location | Content | Recovery Method | -|----------|---------|-----------------| -| sda: `/opt/data-analyst/repo/` | Application code | `git clone` from GitHub | -| sda: `/opt/data-analyst/.venv/` | Python packages | `pip install -r requirements.txt` | -| sda: `/opt/data-analyst/.env` | Application secrets | deploy.sh creates from GitHub secrets | -| sda: `/etc/sudoers.d/` | Permissions | deploy.sh copies from repo | -| sda: `/etc/security/limits.d/` | Resource limits | deploy.sh copies from repo | -| sda: `/etc/nginx/` | Nginx config | deploy.sh or manual copy from repo | -| sda: `/etc/letsencrypt/` | SSL certificate | `certbot` renews automatically | -| sdb: `/data/src_data/parquet/` | Parquet data | Regenerate from Keboola (`update.sh`) or restore snapshot | -| sdb: `/data/notifications/` | Notification state | Restore from snapshot | -| sdb: `/data/docs/`, `/data/scripts/` | Docs & scripts | deploy.sh copies from repo | -| sdc: `/home/*/` | User accounts, SSH keys, workspaces, scripts | Restore from snapshot | +| Path | Content | Backup | +|------|---------|--------| +| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot | +| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start | +| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot | +| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot | -## Scenario A: System Disk Failure (sda dies) +`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()` +on every startup, so it does not need to be backed up separately. -**Impact**: Server is down, but all user data is safe on sdb/sdc. +## Scenario A: Container Crash / Bad Deploy -**Recovery time**: ~30 minutes +**Impact**: Service down, data intact. -### Steps +**Recovery time**: ~2 minutes -1. **Create new VM** (same zone, attach existing disks): - ```bash - # Create new instance with existing disks - gcloud compute instances create your-server \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --machine-type=e2-medium \ - --image-family=debian-12 \ - --image-project=debian-cloud \ - --boot-disk-size=10GB \ - --tags=http-server,https-server +```bash +# Pull latest image and restart +docker compose pull +docker compose up -d - # Attach existing data disks - gcloud compute instances attach-disk your-server \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --disk=data-disk +# Check health +curl https://your-instance.example.com/health +``` - gcloud compute instances attach-disk your-server \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --disk=home-disk - ``` +If a bad image was pushed, roll back to the previous tag: +```bash +docker compose down +# Edit docker-compose.yml to pin the previous image tag +docker compose up -d +``` -2. **SSH in and mount disks**: - ```bash - # Mount data disk - mkdir -p /data - mount /dev/sdb /data +## Scenario B: /data Volume Corruption or Loss - # Mount home disk - mount /dev/sdc /home +**Impact**: All DuckDB state and parquet data lost. - # Add to fstab (get UUIDs with blkid) - echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab - echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab - ``` +**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source) -3. **Install prerequisites**: - ```bash - apt-get update - apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx - ``` - -4. **Recreate deploy user and groups**: - ```bash - # Create groups - groupadd dataread - groupadd data-private - groupadd data-ops - - # Create deploy user - useradd -m -s /bin/bash deploy - usermod -aG data-ops deploy - - # Restore deploy SSH key (generate new one) - sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker' - sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config' - chmod 600 /home/deploy/.ssh/config - - # Add new public key to GitHub as Deploy Key - cat /home/deploy/.ssh/id_ed25519.pub - ``` - -5. **Clone repo and run setup**: - ```bash - mkdir -p /opt/data-analyst - chown deploy:data-ops /opt/data-analyst - sudo -u deploy git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst/repo - git config --global --add safe.directory /opt/data-analyst/repo - /opt/data-analyst/repo/server/setup.sh - ``` - -6. **Restore user accounts from /home**: - ```bash - # Users already exist on home-disk, just recreate /etc/passwd entries - # For each directory in /home (except deploy): - for dir in /home/*/; do - username=$(basename "$dir") - [[ "$username" == "deploy" ]] && continue - # Create user if not exists - if ! id "$username" &>/dev/null; then - useradd -M -d "/home/$username" -s /bin/bash "$username" - usermod -aG dataread "$username" - fi - done - ``` - Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in `server/limits-users.conf` for admin users. - -7. **Trigger deploy via GitHub Actions** (or manually): - ```bash - sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh' - ``` - -8. **Set up SSL certificate**: - ```bash - certbot --nginx -d your-instance.example.com - ``` - -9. **Restore crontab**: - ```bash - sudo -u deploy crontab -e - # Add: - # MAILTO=admin@your-domain.com - # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log - ``` - -10. **Update external IP** if it changed: - - DNS: `your-instance.example.com` A record - - GitHub secrets: `SERVER_HOST` - - SSH configs of all users - -## Scenario B: Data Disk Failure (sdb/data-disk dies) - -**Impact**: Parquet data lost, users unaffected. - -**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (from Keboola) - -### Option 1: Restore from snapshot (faster) +### Option 1: Restore from GCP disk snapshot (faster) ```bash # Find latest snapshot @@ -169,95 +69,99 @@ gcloud compute disks create data-disk \ --source-snapshot=SNAPSHOT_NAME \ --type=pd-balanced -# Attach to VM (may need to stop VM first) +# Attach to VM and mount gcloud compute instances attach-disk your-server \ --project=your-gcp-project \ --zone=europe-north1-a \ --disk=data-disk -# Mount -ssh kids "sudo mount /dev/sdb /data" +# Restart containers +docker compose up -d ``` -### Option 2: Regenerate from Keboola +### Option 2: Regenerate from source ```bash -# Create fresh disk -gcloud compute disks create data-disk \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --size=30GB \ - --type=pd-balanced +# Start with empty /data volume +docker compose up -d -# Attach, format, mount -ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data" - -# Run deploy to recreate directory structure -ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'" - -# Regenerate parquet data from Keboola -ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh" +# Trigger a full sync from the data source +curl -X POST http://localhost:8000/api/sync/trigger +# Or via CLI: +docker compose exec app da sync ``` -## Scenario C: Home Disk Failure (sdc/home-disk dies) +DuckDB extract files and parquet will be repopulated from Keboola / BigQuery. +`system.duckdb` (table registry, users) must be restored from snapshot if +not regenerated — user accounts and table definitions are not recreated by sync. -**Impact**: All user accounts, SSH keys, and personal workspaces lost. +## Scenario C: Complete VM Loss -**Recovery time**: ~10 minutes (from snapshot) +**Recovery time**: ~20 minutes -### Restore from snapshot +1. **Create new VM** (or use managed instance group): + ```bash + gcloud compute instances create your-server \ + --project=your-gcp-project \ + --zone=europe-north1-a \ + --machine-type=e2-medium \ + --image-family=debian-12 \ + --image-project=debian-cloud + ``` -```bash -# Find latest snapshot -gcloud compute snapshots list --project=your-gcp-project \ - --filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5 +2. **Install Docker**: + ```bash + curl -fsSL https://get.docker.com | sh + ``` -# Create new disk from snapshot -gcloud compute disks create home-disk \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --source-snapshot=SNAPSHOT_NAME \ - --type=pd-balanced +3. **Attach and mount the data disk** (or restore from snapshot per Scenario B): + ```bash + gcloud compute instances attach-disk your-server \ + --project=your-gcp-project --zone=europe-north1-a --disk=data-disk + # Add mount to /etc/fstab and mount /data + ``` -# Attach to VM -gcloud compute instances attach-disk your-server \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --disk=home-disk +4. **Clone repo and create .env**: + ```bash + git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst + cd /opt/data-analyst + cp config/.env.template .env + # Fill in secrets from GitHub Secrets / 1Password + ``` -# Mount -ssh kids "sudo mount /dev/sdc /home" -``` +5. **Start the stack**: + ```bash + docker compose up -d + ``` -If no snapshot exists, users must re-register via https://your-instance.example.com. - -## Scenario D: Complete Server Loss (VM + all disks) - -**Recovery time**: ~45 minutes - -1. Follow **Scenario A** steps 1-5 (new VM, prerequisites, deploy user) -2. Restore `data-disk` from snapshot (Scenario B, Option 1) -3. Restore `home-disk` from snapshot (Scenario C) -4. Follow **Scenario A** steps 6-10 (user accounts, deploy, SSL, cron, IP) +6. **Update DNS** if the external IP changed: + - A record for `your-instance.example.com` ## Verification Checklist After any recovery, verify: -- [ ] `ssh kids` works (admin access) -- [ ] `https://your-instance.example.com` loads (webapp) -- [ ] `https://your-instance.example.com/health` returns OK -- [ ] At least one analyst can SSH in -- [ ] `ls /data/src_data/parquet/` shows data -- [ ] `ls /home/` shows user directories -- [ ] `systemctl status webapp` is active -- [ ] `systemctl status notify-bot` is active -- [ ] `sudo crontab -u deploy -l` shows data sync cron +- [ ] `docker compose ps` — all services `Up` +- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}` +- [ ] Login works (Google OAuth or email magic link) +- [ ] At least one table appears in the data catalog +- [ ] `docker compose logs app` — no ERROR lines at startup ## Preventive Measures -- **GCP snapshots**: Daily automatic snapshots of `data-disk` and `home-disk` (14-day retention) -- **Setup script**: `server/setup-snapshot-schedule.sh` configures snapshot policy -- **Limits in git**: `server/limits-users.conf` is version-controlled and deployed automatically -- **All configs in git**: sudoers, nginx, systemd services, management scripts -- **Secrets in GitHub**: `.env` is recreated by deploy.sh from GitHub Actions secrets +- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk + (14-day retention). Configure via: + ```bash + gcloud compute resource-policies create snapshot-schedule daily-backup \ + --project=your-gcp-project \ + --region=europe-north1 \ + --max-retention-days=14 \ + --on-source-disk-delete=keep-auto-snapshots \ + --daily-schedule \ + --start-time=03:00 + gcloud compute disks add-resource-policies data-disk \ + --project=your-gcp-project --zone=europe-north1-a \ + --resource-policies=daily-backup + ``` +- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets +- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy diff --git a/dev_docs/server.md b/dev_docs/server.md index 51c5f7d..b576769 100644 --- a/dev_docs/server.md +++ b/dev_docs/server.md @@ -1,2249 +1,311 @@ -# Data Broker Server +# Server Operations -Central server for distributing data to AI analytical systems. +Operational guide for the AI Data Analyst Docker deployment. ## Basic Information | Parameter | Value | |-----------|-------| -| Name | your-server | | GCP Project | your-gcp-project | | Zone | europe-north1-a | -| Type | e2-medium | +| Machine type | e2-medium | | OS | Debian 12 (bookworm) | | External IP | YOUR_SERVER_IP | -## Hardware +## Docker Compose -| Resource | Size | -|----------|------| -| RAM | 3.8 GB | -| Swap | 2 GB (/mnt/swapfile) | -| System disk (sda) | 10 GB - OS, packages, app (expendable) | -| Data disk (sdb) | 30 GB - /data, pd-balanced (snapshotted) | -| Home disk (sdc) | 30 GB - /home, pd-balanced (snapshotted) | -| Temp disk (sdd) | 100 GB - /tmp, pd-standard (not snapshotted) | - -## Access - -### SSH connection (admin) +### Starting and stopping ```bash -ssh kids +# Start all services (app + scheduler) +docker compose up -d + +# Include optional services (Telegram bot, etc.) +docker compose --profile full up -d + +# Stop all services +docker compose down + +# Restart a single service +docker compose restart app + +# Pull latest images and redeploy +docker compose pull && docker compose up -d ``` -Requires SSH config: -``` -Host kids - HostName YOUR_SERVER_IP - User admin1 - IdentityFile ~/.ssh/google_compute_engine -``` +### Status -Or via gcloud: ```bash -gcloud compute ssh your-server --project=your-gcp-project --zone=europe-north1-a +# List running containers and their state +docker compose ps + +# Resource usage +docker stats +``` + +## Log Viewing + +```bash +# All services, follow +docker compose logs -f + +# Single service +docker compose logs -f app +docker compose logs -f scheduler + +# Last N lines +docker compose logs --tail=100 app + +# Since a timestamp +docker compose logs --since=1h app +``` + +Application logs are written to stdout/stderr and captured by Docker. + +## Health Check + +```bash +# Quick check +curl https://your-instance.example.com/health + +# With response body +curl -s https://your-instance.example.com/health | python3 -m json.tool +``` + +Expected response: +```json +{"status": "ok"} +``` + +The `/health` endpoint also checks DuckDB connectivity and returns `503` if +the database is unavailable. + +## Data Sync + +### Trigger a manual sync + +```bash +# Via API +curl -X POST http://localhost:8000/api/sync/trigger + +# Via CLI inside the container +docker compose exec app da sync + +# Sync a single table +docker compose exec app da sync --table table_name +``` + +### Check sync status + +```bash +curl -s http://localhost:8000/api/sync/status | python3 -m json.tool ``` ## Data Structure ``` -/data/ # Data disk (30 GB, pd-balanced) -├── lost+found/ # System directory -├── src_data/ # Source data (group: dataread, 750) -│ ├── raw/ # Raw data from Keboola (reserved for future use) -│ ├── parquet/ # Converted data (parquet format) -│ │ ├── sales/ # CRM data (in.c-crm bucket) - group: dataread -│ │ └── private/ # Private data - group: data-private -│ ├── metadata/ # Sync state, cache, profiles -│ │ ├── sync_state.json # Per-table sync stats (rows, columns, size) -│ │ └── profiles.json # Data profiler output (mode 644, ~900 KB) -│ └── staging/ # Temporary processing (reserved for future use) -├── docs/ # Documentation (deployed from repo) -│ └── schema.yml # Auto-generated table schemas (from data sync) -├── scripts/ # Helper scripts (deployed from repo) -├── examples/ # Example notification scripts (admin1:data-ops, 755) -│ └── notifications/ # Example notification scripts for analysts -├── notifications/ # Notification data (deploy:data-ops, 2770 setgid) -│ ├── telegram_users.json # username -> {chat_id, linked_at} mapping -│ ├── desktop_users.json # username -> {linked_at} mapping (desktop app link state) -│ ├── pending_codes.json # temporary verification codes -│ └── bot.log # Bot service log -├── auth/ # Password auth data (www-data:data-ops, 2770 setgid) -│ └── users.json # Hashed passwords and metadata -├── corporate-memory/ # Knowledge base data (deploy:data-ops, 2770 setgid) -│ ├── knowledge.json # Collected knowledge items from CLAUDE.local.md files -│ ├── votes.json # User votes on knowledge items -│ └── user_hashes.json # MD5 hashes for change detection -└── user_sessions/ # Session collector data (root:data-ops, 2770 setgid) - └── *.jsonl # User session logs collected every 6 hours - -/run/notify-bot/ # Systemd RuntimeDirectory (mode 0755) -└── bot.sock # Unix socket for send API (mode 0666) - -/tmp/data_analyst_staging/ # Keboola staging directory (root:data-ops, 2770 setgid) -└── *.parquet # Temporary Parquet files during Keboola data load +/data/ # Persistent volume (GCP pd-balanced, snapshotted) +├── state/ +│ └── system.duckdb # Table registry, users, sync state, audit log +├── analytics/ +│ └── server.duckdb # Master analytics DB (rebuilt on startup) +└── extracts/ + └── {source_name}/ + ├── extract.duckdb # Per-source extract DB with views + └── data/ # Parquet files (local sources: Keboola, Jira) + └── *.parquet ``` -### Folder Mapping +`system.duckdb` is the source of truth for configuration. Back it up before +any destructive operation. -Parquet subfolders are mapped from Keboola bucket names in `docs/data_description.md`: - -```yaml -folder_mapping: - in.c-crm: sales # CRM/Salesforce data - in.c-private: private # Private/sensitive data -``` - -This mapping is used by `src/config.py` to determine where to save Parquet files. - -## Access Control - -Three-tier permission model: - -| Role | Groups | Access | -|------|--------|--------| -| **Standard Analyst** | `dataread` | Public data read-only | -| **Privileged Analyst** | `dataread` + `data-private` | Public + private data read-only | -| **Admin** | `sudo` + `google-sudoers` + `dataread` + `data-private` + `data-ops` | Full server access (NOPASSWD) + all data read/write + deployment | - -- **Standard Analyst** - can read public data, sync via rsync, run scripts in their workspace -- **Privileged Analyst** - same as standard + access to private/sensitive data (executives, management) -- **Admin** - server administration, can add/remove users, has sudo privileges, full data access with write permissions, can deploy application updates - -### Data Directory Permissions - -Data in `/data/src_data/` uses ACL for granular access: - -``` -/data/src_data/ owner: admin1, group: data-ops -├── raw/ data-ops: rwx, dataread: r-x -├── parquet/ data-ops: rwx, dataread: r-x -│ └── private/ data-ops: rwx, data-private: r-x -└── staging/ data-ops: rwx, dataread: r-x -``` - -- **Admins (data-ops)**: Full read/write access to prepare data -- **Analysts (dataread)**: Read-only access to consume data -- **Private data (data-private)**: Additional group for sensitive data access - -**Atomic writes and ACL — required pattern:** - -Directories under `/data/` use default ACLs (e.g., `default:group:data-ops:rwx`). Files created with `open()` inherit these correctly. However, `tempfile.mkstemp()` explicitly sets mode `0600`, which overrides the ACL mask to `---` and silently breaks group access for all other services. - -**Always use `os.fchmod()` immediately after `mkstemp()`:** - -```python -fd, tmp_path = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp") -os.fchmod(fd, 0o660) # REQUIRED: restore ACL mask for group access -try: - with os.fdopen(fd, "w") as f: - json.dump(data, f, indent=2) - os.replace(tmp_path, str(target)) -except Exception: - os.unlink(tmp_path) - raise -``` - -Use `0o660` for files accessed by services via data-ops group ACL, `0o644` for world-readable files (e.g., profiler output). See [#203](https://github.com/keboola/agnes-the-ai-analyst/issues/203) for a production incident caused by missing `fchmod`. - -**Per-issue file locking for concurrent writers:** - -When multiple services write to the same JSON file (e.g., SLA poll and webhook handler both updating `/data/src_data/raw/jira/issues/SUPPORT-1234.json`), use advisory file locking to prevent races: - -```python -from connectors.jira.file_lock import issue_json_lock - -with issue_json_lock(issues_dir, issue_key): - # read JSON, modify, atomic write, transform to Parquet - ... -``` - -- Uses `fcntl.flock()` (POSIX advisory, blocking, exclusive) -- Lock files stored in `{issues_dir}/.locks/{issue_key}.lock` -- Different issue keys don't block each other (fine-grained locking) -- The lock must cover the entire read-modify-write **and** the Parquet transform — otherwise another writer could overwrite the JSON between write and transform, causing the transform to read stale data - -Currently used by: -- `connectors/jira/scripts/poll_sla.py` — wraps SLA+status update + `transform_single_issue()` -- `connectors/jira/service.py` — wraps `save_issue()` JSON write + `trigger_incremental_transform()`, and `_handle_deletion()` read-modify-write + transform - -Attachment downloads in `save_issue()` intentionally run **outside** the lock (can take tens of seconds and don't modify JSON). - -## User Management - -Each user has: -- Own Linux account with home directory `/home/username/` -- Server symlinks: `/home/username/server/` (read-only links to `/data/`) -- User workspace: `/home/username/user/` (writable: duckdb, notifications, artifacts, scripts, parquet) -- Notification state: `/home/username/.notifications/{state,logs}` -- SSH key authentication - -### Management Commands +## Admin CLI ```bash -# Add standard analyst (public data only) -sudo add-analyst username "ssh-rsa AAAA... comment" +# List registered tables +docker compose exec app da admin tables list -# Add privileged analyst (public + private data) -sudo add-analyst username "ssh-rsa AAAA... comment" --private +# Register a new table +docker compose exec app da admin tables add -# Add server admin (sudo + all data) -sudo add-admin username "ssh-rsa AAAA... comment" +# User management +docker compose exec app da admin users list -# List all analysts -list-analysts - -# Remove user (interactive) -sudo remove-analyst username - -# Remove user (non-interactive, e.g., via SSH) -sudo remove-analyst username --force +# Query data directly +docker compose exec app da query "SELECT * FROM my_table LIMIT 10" ``` -### Examples - -```bash -# Regular analyst -sudo add-analyst novak "ssh-rsa AAAAB3... jan.novak@example.com" - -# Executive with private data access -sudo add-analyst ceo "ssh-rsa AAAAB3... ceo@example.com" --private - -# Server administrator -sudo add-admin admin2 "ssh-rsa AAAAB3... admin2@example.com" -sudo add-admin admin3 "ssh-ed25519 AAAAC3... admin3@your-domain.com" -``` - -Output for admin: -``` -Admin admin2 created successfully - - Added to group: sudo (server administration) - - Added to group: dataread (public data access) - - Added to group: data-private (private data access) - - Added to group: data-ops (application deployment) - - Added to resource limits (unlimited) - - Workspace: /home/admin2/workspace - - Data link: /home/admin2/data -> /data/src_data -``` - -## SSH Configuration - -- Passwords disabled (SSH keys only) -- Root login disabled -- MaxSessions: 20 (per user) -- MaxStartups: 30:50:100 (rate limiting for DDoS protection) -- ClientAliveInterval: 300s - -## Resource Limits - -Protection against fork bombs and resource abuse. Configuration is version-controlled in `server/limits-users.conf` and deployed automatically by `deploy.sh` to `/etc/security/limits.d/99-users.conf`: - -| Resource | Analysts | Admins | -|----------|----------|--------| -| Max processes (nproc) | 100/150 | unlimited | -| Virtual memory (as) | 4 GB / 6 GB | unlimited | -| File size (fsize) | 2 GB / 4 GB | unlimited | -| Open files (nofile) | 1024/2048 | 65535 | -| Core dumps | disabled | unlimited | - -- **Admins** (`data-ops` group members) are explicitly listed in the limits file with unlimited access -- New admins are automatically added to exceptions by `add-admin` script -- **All other users** get restricted limits via wildcard rule (protection against fork bombs) - -## Data Sync Scripts - -### Server: update.sh - -Syncs data from Keboola to Parquet files. Run via cron 3x daily (6:00, 14:00, 19:00 UTC). - -```bash -cd /opt/data-analyst/repo && ./scripts/update.sh -``` - -**What it does:** -1. Activates virtual environment (supports both local `./.venv` and server `/opt/data-analyst/.venv`) -2. Downloads data from Keboola Storage API, converts to Parquet format in `DATA_DIR/parquet/{folder}/` -3. Generates data profiles (`python -m src.profiler` → `profiles.json`) — non-fatal if it fails - -**Cron setup:** -```bash -sudo crontab -u deploy -e -# Add: -# MAILTO=admin@your-domain.com -# 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log -``` - -### Client: sync_data.sh - -Main sync script for analysts. Syncs docs, scripts, data, and regenerates CLAUDE.md: - -```bash -bash server/scripts/sync_data.sh # Full sync (pull server/ + push user/) -bash server/scripts/sync_data.sh --dry-run # Preview only -bash server/scripts/sync_data.sh --push # Only upload user/ to server -``` - -**What it does:** -1. Syncs `server/docs/`, `server/scripts/`, `server/examples/`, `server/metadata/` from server -2. Regenerates `CLAUDE.md` from latest template (preserves username, never touches `CLAUDE.local.md`) -3. Updates `.claude/settings.json` with project permissions from server -4. Syncs parquet data files to `server/parquet/` (incremental) -5. Uploads `user/` to server (backup + runtime for notifications) -6. Downloads corporate memory rules from `~/.claude_rules/` to `.claude/rules/` -7. Updates sync timestamp on server (`touch ~/server/`) - used by the webapp Account card "Last Sync" display. Each user's `~/server/` directory is per-user, so the timestamp is independent. -8. Reinitializes DuckDB in `user/duckdb/` (core tables via `duckdb_manager.py`, optional dataset views via `sync_jira.sh --views-only` etc.) - -**Note:** Rsync uses `--delete` to remove obsolete files from client (e.g., old monthly partitions when switching to daily). Files are compared by mtime+size (no `--checksum` for better performance). If rsync is not available (Windows without WSL), scp is used as fallback with explicit dotfile handling. - -**CLAUDE.md update mechanism:** -- `CLAUDE.md` is regenerated from `server/docs/setup/claude_md_template.txt` on every sync -- Template is maintained centrally and deployed to server via CI/CD -- User's personal `CLAUDE.local.md` is never overwritten (higher priority in Claude Code) -- New features added to template are automatically delivered to all analysts on next sync - -**Claude Code settings.json:** -- `.claude/settings.json` is copied from `server/docs/setup/claude_settings.json` on every sync -- Contains project-wide permissions (allow/deny/ask rules for tools) -- Protects `server/` directory from accidental modifications by Claude -- Centrally managed - analysts cannot override these permissions locally - -### Client: init.sh + setup_views.sh - -**First time setup (init.sh):** -```bash -./scripts/init.sh -``` -Creates virtual environment, installs dependencies, and creates data folders including `duckdb/`. - -**After rsync (setup_views.sh):** -```bash -bash server/scripts/setup_views.sh -``` -Initializes DuckDB views from synced Parquet files. DuckDB database is created at `user/duckdb/analytics.duckdb`. - -Steps: -1. Activates virtual environment -2. Runs `duckdb_manager.py --reinit` for core Keboola tables (from `data_description.md`) -3. Calls optional dataset scripts with `--views-only` flag: - - If `server/parquet/jira/` exists → `sync_jira.sh --views-only` (creates `jira_issues`, `jira_comments`, `jira_attachments`, `jira_changelog` views) - - Future datasets follow the same pattern (e.g., `sync_github.sh --views-only`) - -**Convention:** Each data source sync script (e.g., `sync_jira.sh`) manages its own DuckDB views. The `--views-only` flag creates/refreshes views without syncing data. This keeps `duckdb_manager.py` focused on core tables while optional datasets are self-contained. - -## Server Purpose - -1. **Sync from Keboola** - periodically pulls data from Keboola Storage -2. **Convert to Parquet** - transforms data to efficient format -3. **Chunking** - splits data by hour for incremental sync -4. **Distribution** - clients pull data via rsync to local machines -5. **On-server analysis** - analysts can run scripts directly on the server - -## Usage Guide - -### User Types - -| Type | Groups | Data Access | Use Case | -|------|--------|-------------|----------| -| **Standard Analyst** | `dataread` | Public data | Regular analysts, data scientists | -| **Privileged Analyst** | `dataread` + `data-private` | Public + private | Executives, management | -| **Admin** | `sudo` + `data-ops` + all data groups | Everything + server + deployment | DevOps, IT team | - -- **Standard analysts** see all company data except sensitive information stored in `private/` -- **Privileged analysts** have access to everything including executive reports and financial details -- **Admins** can manage the server, add/remove users, and have full sudo access - -### What Each User Gets - -Every analyst has their own Linux account with: - -``` -/home/username/ -├── server/ # Symlinks to shared read-only data on /data -│ ├── docs -> /data/docs -│ ├── scripts -> /data/scripts -│ ├── examples -> /data/examples -│ ├── parquet -> /data/src_data/parquet -│ └── metadata -> /data/src_data/metadata -├── user/ # User's OWN writable directories -│ ├── duckdb/ # Per-user DuckDB database -│ │ └── analytics.duckdb -│ ├── notifications/ # Notification scripts (*.py) -│ ├── artifacts/ # Analysis outputs -│ ├── scripts/ # Custom scripts -│ └── parquet/ # Custom parquet files -├── .notifications/ # Notification runner state -│ ├── state/ # Cooldown tracking per script -│ └── logs/ # Runner and cron logs -└── .ssh/authorized_keys # SSH key for authentication -``` - -- **Home directory** (`/home/username/`) - private space for each user -- **Server data** (`~/server/`) - read-only symlinks to shared `/data/` on disk -- **User workspace** (`~/user/`) - writable directories for user's own files -- **DuckDB** (`~/user/duckdb/analytics.duckdb`) - per-user database built from shared parquet - -### Typical Workflow - -**Option A: Local analysis with rsync (recommended)** - -1. Analyst syncs data to their local machine: - ```bash - # Recommended: use the sync script - bash server/scripts/sync_data.sh - - # Or manual rsync - rsync -avz data-analyst:server/parquet/ ./server/parquet/ - ``` - -2. Run analysis locally with Claude Code or other tools -3. Data stays on analyst's machine - they can do whatever they want with it - -**Option B: Server-side analysis** - -1. SSH into the server: - ```bash - ssh username@YOUR_SERVER_IP - ``` - -2. Work in personal workspace: - ```bash - cd ~/user - # Run scripts, analyze data from ~/server/parquet/ - ``` - -3. Copy results back to local machine if needed - -### Data Access Examples - -**Standard analyst (public data only):** -```bash -$ ls ~/server/parquet/ -sales/ products/ customers/ orders/ private/ - -$ ls ~/server/parquet/private/ -ls: cannot open directory 'private/': Permission denied -``` - -**Privileged analyst (public + private):** -```bash -$ ls ~/server/parquet/ -sales/ products/ customers/ orders/ private/ - -$ ls ~/server/parquet/private/ -executive_reports/ financial_details/ board_materials/ -``` - -### Rsync Permissions - -When syncing with rsync: -- Standard analysts will get "Permission denied" errors for `private/` folder (expected) -- Use `--exclude='private/'` to skip it cleanly: - ```bash - rsync -avz --exclude='private/' data-analyst:server/parquet/ ./server/parquet/ - ``` -- Privileged analysts can sync everything including private data - -## Monitoring - -### Cloud Monitoring (GCP) - -**Ops Agent** is installed and reports VM metrics to Cloud Monitoring, including disk space utilization. - -**Installation** (already done): -```bash -curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh -sudo bash add-google-cloud-ops-agent-repo.sh --also-install -``` - -**Check agent status:** -```bash -sudo systemctl status google-cloud-ops-agent -``` - -**Available metrics:** -- `agent.googleapis.com/disk/percent_used` - Disk utilization percentage -- `agent.googleapis.com/memory/percent_used` - Memory utilization -- `agent.googleapis.com/cpu/utilization` - CPU usage -- `agent.googleapis.com/network/traffic` - Network I/O - -**View metrics in GCP Console:** -1. Go to [Cloud Console > Monitoring > Metrics Explorer](https://console.cloud.google.com/monitoring/metrics-explorer) -2. Select resource type: `VM Instance` -3. Select metric: `agent.googleapis.com/disk/percent_used` -4. Filter by device: `/dev/sdb` (data disk) - -**Alert Policy for Disk Space:** - -Alert triggers when `/data` partition exceeds 85% usage for 5 minutes. - -To create the alert policy manually: - -1. Go to [Cloud Console > Monitoring > Alerting](https://console.cloud.google.com/monitoring/alerting) -2. Click **Create Policy** -3. Click **Add Condition**: - - **Resource type**: VM Instance - - **Metric**: `agent.googleapis.com/disk/percent_used` - - **Filter**: `metadata.system_labels.device="/dev/sdb"` AND `metadata.system_labels.state="used"` - - **Threshold**: > 85 - - **Duration**: 5 minutes -4. Click **Next** > **Notifications** (add email/Slack channel) -5. Click **Next** > **Documentation**: - ``` - Disk /data partition is above 85% full. - - Check /data/src_data/ for large files or run cleanup. - - Common causes: - - Keboola data sync (check cron logs) - - bot.log growth (check /data/notifications/bot.log) - - Jira attachments (check /data/src_data/raw/jira/attachments/) - ``` -6. **Name**: "Disk Space Alert - /data partition" -7. Click **Create Policy** - -**Cost:** Free tier (first 150 time series free, this VM uses ~25) - -**Dashboard:** Available in GCP Console > Monitoring > Dashboards > "VM Instances" - -### Local Monitoring - -```bash -# Server status -ssh kids "uptime && free -h && df -h / /data /home" - -# Active users -ssh kids "who" - -# Recent logins -ssh kids "last | head -20" - -# Check disk space for all partitions -ssh kids "df -h" - -# Check disk usage by directory -ssh kids "du -sh /data/*" -``` - -## Backup & Disaster Recovery - -### Disk Layout - -| Disk | Mount | Size | Purpose | Backup | -|------|-------|------|---------|--------| -| `your-server` (sda) | `/` | 10 GB | OS, packages, app | Expendable (rebuild from git) | -| `data-disk` (sdb) | `/data` | 30 GB | Parquet data, docs, scripts | Daily GCP snapshots | -| `home-disk` (sdc) | `/home` | 30 GB | User homes, SSH keys, workspaces | Daily GCP snapshots | -| `tmp-disk` (sdd) | `/tmp` | 100 GB | Temporary files | Expendable (not snapshotted) | - -### Automatic Snapshots - -Both `data-disk` and `home-disk` have daily GCP snapshot schedules with 14-day retention. Setup via `server/setup-snapshot-schedule.sh`. - -```bash -# Check snapshot schedule status -gcloud compute resource-policies describe daily-backup \ - --project=your-gcp-project --region=europe-north1 - -# List existing snapshots -gcloud compute snapshots list --project=your-gcp-project - -# Manual snapshot (if needed) -gcloud compute disks snapshot data-disk home-disk \ - --project=your-gcp-project \ - --zone=europe-north1-a \ - --snapshot-names=data-disk-$(date +%Y%m%d),home-disk-$(date +%Y%m%d) -``` - -### Recovery - -See `disaster-recovery.md` for detailed recovery procedures for each failure scenario. - ## Application Deployment -### Directory Structure +Application is deployed via Docker image. The recommended workflow: -``` -/opt/data-analyst/ # Application directory (group: data-ops) -├── repo/ # Git repository -│ ├── src/ # Python source code -│ ├── scripts/ # Data sync scripts -│ ├── server/ # Server management scripts -│ │ ├── bin/ # add-analyst, notify-runner, notify-scripts, etc. -│ │ └── telegram_bot/ # Telegram bot service -│ ├── webapp/ # Flask web application -│ └── examples/ # Example notification scripts -├── .venv/ # Python virtual environment -├── .env # Webapp env (Google OAuth, secret key) -└── logs/ # Application logs -``` +1. Push changes to the `main` branch +2. CI builds and pushes a new image +3. On the server, pull and restart: + ```bash + cd /opt/data-analyst + docker compose pull + docker compose up -d + ``` -### CI/CD Pipeline +To pin a specific image version, set the tag in `docker-compose.yml` before deploying. -Application is automatically deployed via GitHub Actions when changes are pushed to `main` branch. - -**How it works:** -1. Push to `main` triggers GitHub Actions workflow -2. Action connects to server via SSH as `deploy` user -3. Runs `/opt/data-analyst/repo/server/deploy.sh` -4. Deploy script: - - Pulls latest code from `origin/main` - - Updates server management scripts in `/usr/local/bin/` - - Updates sudoers configurations (`/etc/sudoers.d/`) - - Updates resource limits (`/etc/security/limits.d/99-users.conf`) - - Deploys `notify-runner` and `notify-scripts` to `/usr/local/bin/` - - Creates data directories: - - `/data/notifications/` (notification state) - - `/data/src_data/raw/jira/` (Jira webhook data) - - `/data/auth/` (password auth) - - `/data/corporate-memory/` (knowledge base) - - `/data/user_sessions/` (session logs) - - `/data/examples/` (example scripts) - - `/tmp/data_analyst_staging/` (Keboola staging) - - Deploys systemd units: - - `notify-bot.service` (Telegram bot) - - `ws-gateway.service` (WebSocket gateway) - - `corporate-memory.{service,timer}` (knowledge collector) - - `jira-sla-poll.{service,timer}` (SLA refresh) - - `jira-consistency.{service,timer,timer-deep}` (data integrity monitoring) - - `session-collector.{service,timer}` (session logs) - - Sets ACLs for Jira attachments (dataread group) - - Creates/updates Keboola `.env` file (if secrets provided) - - Sets correct permissions on `/opt/data-analyst/` - - Restarts webapp, notify-bot, ws-gateway services - - Enables/starts timers (if credentials configured) - -**Deploy user permissions:** -The `deploy` user has limited sudo access defined in `/etc/sudoers.d/deploy`: - -**Core Operations:** -- Can copy scripts to `/usr/local/bin/` -- Can update sudoers files in `/etc/sudoers.d/` -- Can manage permissions on `/opt/data-analyst/` -- Can update resource limits in `/etc/security/limits.d/` - -**Service Management:** -- Can restart/reload webapp, nginx services -- Can manage notify-bot, ws-gateway services -- Can manage corporate-memory timer -- Can manage jira-sla-poll timer -- Can manage jira-consistency timers (incremental + deep) -- Can manage session-collector timer -- Can run `systemctl daemon-reload` - -**Data Directories:** -- Can manage `/data/scripts/` (helper scripts for analysts) -- Can manage `/data/docs/` (documentation) -- Can manage `/data/notifications/` (notification state) -- Can manage `/data/examples/` (example scripts) -- Can manage `/data/src_data/raw/jira/` (Jira webhook data) -- Can manage `/data/auth/` (password auth state) -- Can manage `/data/corporate-memory/` (knowledge base) -- Can manage `/data/user_sessions/` (session collector data) -- Can manage `/tmp/data_analyst_staging/` (Keboola staging directory) - -**Special Permissions:** -- Can run `notify-scripts` as any user (list/run notification scripts) -- Can set ACLs on Jira attachments (dataread group access) -- Can create log files in `/opt/data-analyst/logs/` - -**Full sudoers reference:** `server/sudoers-deploy` in repository - -Note: On Debian 12, core utils are in `/usr/bin/` (not `/bin/`). The sudoers file uses full paths like `/usr/bin/cp`, `/usr/bin/chmod`, etc. - -### Initial Setup (one-time) - -**1. Install prerequisites:** -```bash -sudo apt-get update -sudo apt-get install -y git python3.11-venv python3-pip -``` - -**2. Create deploy user and SSH key for GitHub:** -```bash -# Create deploy user -sudo useradd -m -s /bin/bash deploy -sudo groupadd data-ops 2>/dev/null || true -sudo usermod -aG data-ops deploy - -# Generate SSH key for GitHub -sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker' - -# Configure SSH for GitHub -sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config' -sudo chmod 600 /home/deploy/.ssh/config - -# Show public key (add this to GitHub as Deploy Key) -sudo cat /home/deploy/.ssh/id_ed25519.pub -``` - -**3. Add Deploy Key to GitHub:** -- Go to: https://github.com/keboola/agnes-the-ai-analyst/settings/keys -- Click "Add deploy key" -- Title: `data-broker-server` -- Key: (paste public key from previous step) -- Allow write access: NO - -**4. Clone repository and run setup:** -```bash -sudo mkdir -p /opt/data-analyst -sudo chown deploy:data-ops /opt/data-analyst -sudo -u deploy git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst/repo -sudo git config --global --add safe.directory /opt/data-analyst/repo -sudo -u deploy git config --global --add safe.directory /opt/data-analyst/repo -sudo /opt/data-analyst/repo/server/setup.sh -``` - -**5. Add existing admins to data-ops group:** -```bash -sudo usermod -aG data-ops admin1 -sudo usermod -aG data-ops admin2 -sudo usermod -aG data-ops admin3 -``` - -### GitHub Secrets Required - -Set these in GitHub repository settings (Settings > Secrets > Actions): - -| Secret | Value | -|--------|-------| -| `SERVER_HOST` | `YOUR_SERVER_IP` | -| `SERVER_USER` | `deploy` | -| `SERVER_SSH_KEY` | Private SSH key (`/home/deploy/.ssh/id_ed25519`) | -| `TELEGRAM_BOT_TOKEN` | Telegram Bot API token (from @BotFather) | -| `SENDGRID_API_KEY` | SendGrid API key for password auth emails | -| `ALLOWED_EMAILS` | Comma-separated whitelisted emails for password auth | - -### Manual Deployment - -Admins can trigger deployment manually: +### Environment configuration ```bash -# Via GitHub Actions UI (Actions > Deploy to Server > Run workflow) -# Or via SSH: -ssh kids "cd /opt/data-analyst/repo && ./server/deploy.sh" +# Edit .env (never commit this file) +nano /opt/data-analyst/.env + +# Restart app to apply changes +docker compose restart app ``` -### Deployment Logs +See `config/.env.template` for the full variable reference and +`config/instance.yaml.example` for instance configuration. + +## Monitoring + +### GCP Cloud Monitoring + +The VM reports metrics via the Google Cloud Ops Agent: ```bash -# View deployment history -cat /opt/data-analyst/logs/deploy.log - -# Follow live deployment -tail -f /opt/data-analyst/logs/deploy.log +# Check agent status +sudo systemctl status google-cloud-ops-agent ``` -### Troubleshooting CI/CD +Key metrics in GCP Console > Monitoring > Metrics Explorer: +- `agent.googleapis.com/disk/percent_used` — watch `/data` partition +- `agent.googleapis.com/memory/percent_used` +- `agent.googleapis.com/cpu/utilization` -**"sudo: a terminal is required to read the password"** -- Deploy user is missing NOPASSWD sudo permission for a specific command -- Check `/etc/sudoers.d/deploy` exists and has correct permissions (440) -- Verify the command path matches (Debian 12 uses `/usr/bin/`, not `/bin/`) -- **Fix:** Add missing permission to `server/sudoers-deploy` and redeploy: - ```bash - # Edit server/sudoers-deploy in repo - # Add the missing command with full path - deploy ALL=(ALL) NOPASSWD: /usr/bin/command-name args +A disk space alert fires when `/data` exceeds 85% for 5 minutes. - # Commit and push - git add server/sudoers-deploy - git commit -m "Add missing sudo permission" - git push origin main - - # Manually update on server (one-time) - ssh kids "sudo cp /opt/data-analyst/repo/server/sudoers-deploy /etc/sudoers.d/deploy" - ssh kids "sudo chmod 440 /etc/sudoers.d/deploy" - ``` - -**"Permission denied" on .env file** -- Deploy user cannot write directly to files owned by root -- Solution: Use `sudo /usr/bin/tee` instead of direct file write - -**Deploy script changes not taking effect** -- The deploy script pulls new code AFTER it starts running -- Changes to `deploy.sh` itself require manual pull first: - ```bash - ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && git pull'" - ``` - -**Verify sudoers configuration:** -```bash -# Check if sudoers file exists and has correct permissions -ssh kids "ls -la /etc/sudoers.d/deploy" - -# Validate syntax (exit code 0 = OK) -ssh kids "sudo visudo -cf /etc/sudoers.d/deploy && echo 'Syntax OK'" - -# View current sudoers rules -ssh kids "sudo cat /etc/sudoers.d/deploy" -``` - -**Test deploy locally as deploy user:** -```bash -ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'" -``` - -## Web Application (Self-Service Portal) - -A web application at `https://your-instance.example.com` allows team members to create their own analyst accounts via Google SSO. - -### Features - -- Google Sign-In (restricted to `@your-domain.com` emails only) -- Email/password login for external users (whitelisted emails) -- Self-service account creation for new users -- Dashboard showing account info for existing users (2-column layout) -- Dynamic data stats (tables, columns, rows, size) loaded from `sync_state.json` -- Data catalog page with dynamic table listings from `data_description.md` + `sync_state.json` -- Data profiler with per-column statistics, visualizations, and alerts (from `profiles.json`) -- SSH connection instructions -- Claude Code integration hints for AI-assisted setup -- Telegram notification linking -- macOS desktop app linking/unlinking with install instructions - -### User Flow - -1. User visits `https://your-instance.example.com` -2. Signs in with Google (@your-domain.com account) -3. Dashboard shows instructions and form for SSH key -4. User can ask Claude Code to generate SSH key and guide them -5. After pasting SSH key, account is created automatically -6. User syncs data and starts analyzing with Claude Code - -### Dynamic Data Stats - -Dashboard and catalog pages display live data statistics (table count, columns, rows, size). These are loaded dynamically from `sync_state.json` on every page request - no webapp restart needed. - -**Data flow:** -``` -Cron (update.sh) → data_sync.py → /data/src_data/metadata/sync_state.json - ↓ - Flask reads on request → dashboard + catalog templates -``` - -- `sync_state.json` is updated by the data sync process with per-table stats (rows, columns, file size) -- Flask aggregates these into totals for display -- If `sync_state.json` is missing or unreadable, hardcoded fallback values are used -- Catalog page merges `data_description.md` (table names, descriptions, categories) with `sync_state.json` (row counts) - -### Architecture - -``` -Browser -> Nginx (HTTPS/Let's Encrypt) -> Gunicorn -> Flask App - | - v - sudo add-analyst (via sudoers) -``` - -### Setup - -**1. Run webapp setup script:** -```bash -sudo /opt/data-analyst/repo/server/webapp-setup.sh -``` - -**2. Configure Google OAuth:** -- Go to [Google Cloud Console](https://console.cloud.google.com/apis/credentials) -- Create OAuth 2.0 Client ID (Web application) -- Authorized JavaScript origins: `https://your-instance.example.com` -- Authorized redirect URIs: `https://your-instance.example.com/authorize` - -**3. Update environment file:** -```bash -sudo nano /opt/data-analyst/.env - -# Add: -WEBAPP_SECRET_KEY= -GOOGLE_CLIENT_ID= -GOOGLE_CLIENT_SECRET= -``` - -**4. Start/restart webapp:** -```bash -sudo systemctl restart webapp -``` - -### Monitoring +### Local checks ```bash -# Service status -sudo systemctl status webapp -sudo systemctl status nginx +# Disk usage +df -h /data -# Logs -tail -f /opt/data-analyst/logs/webapp-access.log -tail -f /opt/data-analyst/logs/webapp-error.log +# Data directory breakdown +du -sh /data/* -# Test endpoint -curl -I https://your-instance.example.com/health +# Container resource usage +docker stats --no-stream ``` -### Security Notes +## Backup and Disaster Recovery -- Only `@your-domain.com` emails can log in via Google OAuth -- External users can log in via email/password if their email is whitelisted -- Self-service creates **standard analyst** accounts only (no --private flag) -- www-data is member of `data-ops` group (for access to /opt/data-analyst and static files) -- www-data can only run `add-analyst` via sudoers (not add-admin) - configured in `/etc/sudoers.d/webapp` -- HTTPS enforced with Let's Encrypt certificate -- SSH keys are validated before passing to add-analyst script -- Reserved system usernames (root, admin, deploy, etc.) are blocked from registration -- Username collision with existing system accounts shows error and requires admin intervention -- Password auth uses Argon2id hashing (state of the art) with rate limiting (5 attempts/minute) -- Magic links for password setup expire in 24 hours, reset links in 1 hour +The `/data` persistent disk has daily GCP snapshot schedules with 14-day retention. -### Technical Notes - -**Sudoers configuration:** - -The webapp needs sudo access to run `add-analyst` and `notify-scripts`. This is configured via `server/sudoers-webapp` file which is deployed to `/etc/sudoers.d/webapp`: - -``` -www-data ALL=(ALL) NOPASSWD: /usr/local/bin/add-analyst -www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts -``` - -**Absolute paths requirement:** - -Gunicorn runs with a restricted PATH (only `/opt/data-analyst/.venv/bin`). Therefore, all system commands in Python code must use absolute paths: -- `/usr/bin/sudo` (not just `sudo`) -- `/usr/local/bin/add-analyst` -- `/usr/local/bin/notify-scripts` - -This is handled in `webapp/user_service.py` and `services/telegram_bot/runner.py`. - -### Username Generation - -Username is generated from email address: the part before `@` converted to lowercase. - -Examples: -- `John.Doe@your-domain.com` -> `john.doe` -- `john@your-domain.com` -> `john` - -If a username conflicts with a reserved system name or existing non-analyst account, the user sees an error and must contact an admin to create the account manually with a different username. - -### Prerequisites - -**GCP Firewall:** ```bash -# Allow HTTP/HTTPS traffic (required for Let's Encrypt and webapp) -gcloud compute firewall-rules create allow-http-data-broker \ - --project=your-gcp-project \ - --direction=INGRESS \ - --priority=1000 \ - --network=default \ - --action=ALLOW \ - --rules=tcp:80,tcp:443 \ - --source-ranges=0.0.0.0/0 \ - --target-tags=http-server,https-server +# List existing snapshots +gcloud compute snapshots list --project=your-gcp-project \ + --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp -# Add tags to VM -gcloud compute instances add-tags your-server \ +# Create a manual snapshot before risky operations +gcloud compute disks snapshot data-disk \ --project=your-gcp-project \ --zone=europe-north1-a \ - --tags=http-server,https-server + --snapshot-names=data-disk-$(date +%Y%m%d)-manual ``` -**DNS:** -- A record: `your-instance.example.com` -> `YOUR_SERVER_IP` +See `disaster-recovery.md` for full recovery procedures. -## Password Authentication for External Users +## Web Application -External users (investors, partners) who don't have @your-domain.com Google accounts can authenticate using email/password. +The FastAPI app is available at `https://your-instance.example.com`. -### How It Works +- **Google OAuth**: restricted to `allowed_domain` set in `config/instance.yaml` +- **Email magic link**: available out of the box (no external service required) +- **Admin API**: `POST /api/admin/tables/{id}` — register/update tables +- **Sync API**: `POST /api/sync/trigger` — trigger data extraction -1. **Admin adds email to whitelist** (via GitHub Secrets): - - Go to GitHub repo Settings > Secrets > Actions - - Update `ALLOWED_EMAILS` secret (comma-separated list) - - Push any change to trigger deploy, or manually restart webapp +### Google OAuth setup -2. **User visits login page and clicks "Sign in with Email"** - -3. **First-time setup (Sign Up tab):** - - User enters their whitelisted email - - Clicks "Request Access" - - Receives email with setup link (valid 24 hours) - - Sets up password via the link - -4. **Subsequent logins (Sign In tab):** - - User enters email + password - - Same session/dashboard as Google OAuth users - -### Username Generation - -Usernames are derived from email addresses differently for internal vs external users: - -| Email | Username | Type | -|-------|----------|------| -| `john.doe@your-domain.com` | `john.doe` | Internal (Google OAuth) | -| `emily@investor.com` | `emily_investor_com` | External (password auth) | -| `partner@example.org` | `partner_example_org` | External (password auth) | - -This prevents username collisions between internal and external users. - -### Configuration - -**GitHub Secrets (recommended):** - -| Secret | Description | -|--------|-------------| -| `ALLOWED_EMAILS` | Comma-separated list of whitelisted emails | -| `SENDGRID_API_KEY` | SendGrid API key for sending emails | -| `EMAIL_FROM_ADDRESS` | Sender email address (e.g., `noreply@your-domain.com`) | -| `EMAIL_FROM_NAME` | Sender display name (e.g., `Data Analyst Platform`) | - -**Data storage:** - -``` -/data/auth/ # Password auth data (www-data:data-ops, 2770) -└── password_users.json # User records (hashes, tokens, metadata) -``` - -### Security Features - -- **Argon2id** password hashing (most secure algorithm) -- **Rate limiting**: 5 failed attempts per minute per email -- **Single-use tokens**: Setup/reset links invalidate after use -- **Token expiry**: Setup 24h, reset 1h -- **No email enumeration**: Reset endpoint always shows same message -- **Password requirements**: Min 8 chars, uppercase, lowercase, digit - -### Password Reset - -Users can reset their password via "Forgot Password?" link on the Sign In tab. They receive an email with a reset link valid for 1 hour. - -## Telegram Notification Bot - -A Telegram bot (`@YourBot`) allows analysts to receive alerts from their custom notification scripts. - -### Architecture - -``` -Telegram Bot Service (systemd: notify-bot) -├── Telegram polling (handles /start, /test commands) -└── HTTP server on unix socket (/run/notify-bot/bot.sock) - ▲ - │ POST /send, POST /send_photo - │ -notify-runner (user crontab, /usr/local/bin/notify-runner) -└── Executes ~/user/notifications/*.py -``` - -The webapp reads/writes shared JSON files in `/data/notifications/` for user-Telegram linking (verification codes, user mappings). - -### Services - -| Service | User | Description | -|---------|------|-------------| -| `notify-bot` | deploy:data-ops | Telegram polling + send API on unix socket | -| `webapp` | www-data:data-ops | Dashboard with Telegram link/unlink UI | - -### Bot Commands - -| Command | Description | -|---------|-------------| -| `/start` | Link account (or show status if already linked) | -| `/whoami` | Show username and email | -| `/status` | List notification scripts with Run buttons | -| `/test` | Send a demo graphical report | -| `/help` | Show available commands | - -The `/status` command shows inline keyboard buttons to run scripts on demand. Scripts are executed as the owning user via `sudo -u` using the `notify-scripts` helper (see below). - -### Data Files - -``` -/data/notifications/ # deploy:data-ops, mode 2770 (setgid, no others) -├── telegram_users.json # username -> {chat_id, linked_at} -├── desktop_users.json # username -> {linked_at} (desktop app link state) -├── pending_codes.json # code -> {chat_id, created_at} -└── bot.log # Bot service log - -/run/notify-bot/ # systemd RuntimeDirectory (mode 0755) -└── bot.sock # Unix socket for send API (mode 0666) -``` - -The setgid bit (`2770`) ensures all files created in `/data/notifications/` inherit the `data-ops` group, allowing both the bot service (deploy) and webapp (www-data) to read/write them. Analysts have no access to this directory. - -The socket is in `/run/notify-bot/`, a systemd-managed directory with `0755` permissions, so any local user can connect to send notifications. - -### Notification Runner - -Users create Python scripts in `~/user/notifications/` that output JSON to stdout. The `notify-runner` script (installed at `/usr/local/bin/notify-runner`) executes these scripts and sends results via the bot's unix socket. - -Per-user state is stored in `~/.notifications/state/` (cooldown tracking) and logs in `~/.notifications/logs/`. - -Users configure their own crontab: -```bash -crontab -e -# Add: -*/5 * * * * ~/.venv/bin/python /usr/local/bin/notify-runner >> ~/.notifications/logs/cron.log 2>&1 -``` - -### Notify-Scripts Helper - -The `notify-scripts` helper (`/usr/local/bin/notify-scripts`) provides a secure way for services (webapp, Telegram bot) to list and run user notification scripts without needing filesystem access to user home directories. - -**Why it exists:** User home directories are set to `750` permissions. Services like `www-data` and `deploy` cannot traverse `/home/{user}/` to read scripts or state files. The helper runs **as the target user** via `sudo -u`, so it has full access to `~/user/notifications/` and `~/.notifications/state/`. - -**Usage:** -```bash -# List scripts with last_run metadata (returns JSON array) -sudo -u /usr/local/bin/notify-scripts list - -# Run a script and return its JSON output -sudo -u /usr/local/bin/notify-scripts run - -# Get last sync time (returns JSON with elapsed_seconds, elapsed_display) -sudo -u /usr/local/bin/notify-scripts sync-status -``` - -The `sync-status` command reads the mtime of `~/server/` directory. This is updated by `sync_data.sh` via `touch ~/server/` at the end of each sync. Each user has their own `~/server/` directory (containing symlinks to shared `/data/`), so timestamps are per-user. - -**Callers:** -- `services/telegram_bot/status.py` - `/status` command and script list API -- `services/telegram_bot/runner.py` - on-demand script execution (Telegram "Run" button, webapp API) -- `webapp/account_service.py` - Account card "Last Sync" display - -**Sudoers rules:** -``` -# /etc/sudoers.d/webapp -www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts - -# /etc/sudoers.d/deploy -deploy ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts -``` - -### Monitoring - -```bash -# Bot service -sudo systemctl status notify-bot -tail -f /data/notifications/bot.log - -# Linked users -cat /data/notifications/telegram_users.json | python3 -m json.tool - -# Runner logs (per user) -cat ~/.notifications/logs/runner.log -``` - -### Security - -- Bot token is stored centrally in `/opt/data-analyst/repo/.env` (loaded via systemd EnvironmentFile) -- Users never see the token - they communicate via unix socket only -- Socket in `/run/notify-bot/bot.sock` (systemd RuntimeDirectory, mode `0755`), socket itself `0666` -- `/data/notifications/` is `2770` (only deploy + data-ops), no analyst access to logs or user mappings -- Notification scripts run under the user's own account (no sudo) when triggered by crontab -- On-demand runs (via /status button and webapp API) use `sudo -u /usr/local/bin/notify-scripts` -- services never access user home directories directly -- Scripts have a 60-second timeout (enforced by `notify-scripts` helper) -- Verification codes expire after 10 minutes and are single-use - -### Known Issues - -**On-demand script execution security hardening (partially resolved):** -The `notify-scripts` helper replaced direct `sudo -H -u ... /usr/bin/env ...` calls with a single auditable entry point. Services no longer need filesystem access to user home directories (750 permissions are preserved). The bot still requires `NoNewPrivileges=false` and `/tmp` in `ReadWritePaths` for sudo execution. A queue-based approach ([#51](https://github.com/keboola/agnes-the-ai-analyst/issues/51)) could further improve this by having `notify-runner` pick up run requests from a queue instead of the bot calling sudo directly. - -## Data Sync Settings (Web Portal) - -Users can configure which optional datasets to sync via the web portal at `https://your-instance.example.com`. Settings are stored server-side and downloaded by `sync_data.sh` before each sync. - -### Architecture - -``` -┌─────────────────────────────────────┐ -│ Web Portal (Dashboard) │ -│ └── Data Settings widget │ -│ ├── Toggle: Jira (~50 MB) │ -│ └── Toggle: Jira Attachments │ -│ (~500 MB+) │ -└─────────────────────────────────────┘ - │ POST /api/sync-settings - ▼ -┌─────────────────────────────────────┐ -│ Flask API │ -│ ├── Save to sync_settings.json │ -│ └── Write ~/.sync_settings.yaml │ -│ (via sudo install) │ -└─────────────────────────────────────┘ - │ - ▼ -/data/notifications/sync_settings.json ← Central storage (all users) -/home/{user}/.sync_settings.yaml ← Per-user config file - │ - ▼ scp (analyst sync) -┌─────────────────────────────────────┐ -│ sync_data.sh (client) │ -│ ├── Download ~/.sync_settings.yaml │ -│ ├── Read dataset toggles │ -│ └── Conditionally run sync_jira.sh │ -└─────────────────────────────────────┘ -``` - -### Data Files - -| File | Location | Purpose | -|------|----------|---------| -| `sync_settings.json` | `/data/notifications/` | Central storage for all users' settings | -| `.sync_settings.yaml` | `/home/{user}/` | Per-user config file (YAML format) | - -**sync_settings.json format:** -```json -{ - "john.doe": { - "datasets": { - "jira": true, - "jira_attachments": false - }, - "updated_at": "2026-02-03T12:00:00Z" - } -} -``` - -**Per-user .sync_settings.yaml format:** -```yaml -# Data Analyst - Sync Configuration -# Managed by web portal - changes here may be overwritten - -datasets: - jira: true - jira_attachments: false -``` - -### Sudoers Configuration - -The webapp needs sudo to write config files to user home directories. This is configured in `/etc/sudoers.d/webapp-sync`: - -``` -# Allow webapp to install sync settings to user home directories -www-data ALL=(ALL) NOPASSWD: /usr/bin/install -o * -g * -m 644 /tmp/*.yaml /home/*/.sync_settings.yaml -``` - -**Why this approach:** -- Webapp runs as `www-data` which cannot write to `/home/{user}/` -- Using `install` command allows setting ownership in one atomic operation -- Tempfile must be in `/tmp/` (Gunicorn has restricted PATH) -- Target is restricted to `.sync_settings.yaml` only - -### Client Sync Flow - -When `sync_data.sh` runs: - -1. Downloads config from server: - ```bash - scp -q data-analyst:~/.sync_settings.yaml /tmp/.sync_settings_$(id -u).yaml - ``` - -2. If no config exists on server, creates default (jira: false) - -3. Reads config and conditionally runs dataset sync scripts: - ```bash - if grep -qE '^\s*jira:\s*true' "$SYNC_CONFIG_LOCAL"; then - bash sync_jira.sh - fi - ``` - -4. `sync_jira.sh` syncs data AND creates DuckDB views automatically (no separate step needed) - -5. `sync_jira.sh` checks `jira_attachments` setting for attachment sync - -### Available Datasets - -| Dataset | Size | Description | -|---------|------|-------------| -| `jira` | ~50 MB | Support tickets from SUPPORT project (issues, comments, changelog, attachment metadata) | -| `jira_attachments` | ~500 MB+ | Actual attachment files (images, logs, etc.). Requires `jira` to be enabled. | - -### API Endpoints - -| Endpoint | Method | Description | -|----------|--------|-------------| -| `/api/sync-settings` | GET | Get current user's sync settings | -| `/api/sync-settings` | POST | Update settings and regenerate user config | - -### Troubleshooting - -**Settings not being saved to user home:** -- Check `/etc/sudoers.d/webapp-sync` exists -- Verify tempfile is created in `/tmp/` (not other directory) -- Check webapp logs: `tail -f /opt/data-analyst/logs/webapp-error.log` - -**Old scripts on client after sync:** -- `sync_data.sh` downloads scripts from `/data/scripts/` on server -- Ensure `deploy.sh` copies all scripts including `sync_jira.sh` -- If scripts are missing from `/data/scripts/`, run manual deploy or CI/CD +1. Go to [Google Cloud Console](https://console.cloud.google.com/apis/credentials) +2. Create OAuth 2.0 Client ID (Web application) +3. Authorized JavaScript origins: `https://your-instance.example.com` +4. Authorized redirect URIs: `https://your-instance.example.com/auth/google/callback` +5. Add `GOOGLE_CLIENT_ID` and `GOOGLE_CLIENT_SECRET` to `.env` ## Jira Webhook Integration -Receives webhooks from Atlassian Jira to maintain a **real-time** copy of issue data for analysis. - -### Architecture - -``` -Jira Cloud (your-org.atlassian.net) - │ - │ POST /webhooks/jira (HTTPS) - ▼ -┌─────────────────────────────────────┐ -│ Webapp (Flask) │ -│ ├── Verify HMAC signature │ -│ ├── Fetch full issue via REST API │ -│ ├── Save JSON + download attachs │ -│ └── Trigger incremental transform │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────┐ │ -│ │ incremental_jira_transform │ │ -│ │ • Upsert to monthly Parquet │ │ -│ │ • Copy to distribution dir │ │ -│ └─────────────────────────────┘ │ -└─────────────────────────────────────┘ - │ - ▼ rsync (analyst sync) -┌─────────────────────────────────────┐ -│ Analyst (local) │ -│ • Only changed monthly files sync │ -│ • Data available within seconds │ -└─────────────────────────────────────┘ -``` - -### Data Structure - -``` -/data/src_data/ -├── raw/jira/ # Raw Jira data from webhooks -│ ├── issues/ # Individual issue JSON files -│ │ ├── SUPPORT-1234.json -│ │ └── SUPPORT-1235.json -│ ├── attachments/ # Downloaded attachment files -│ │ └── SUPPORT-1234/ -│ │ └── 56340_image.png -│ └── webhook_events/ # Raw webhook payloads (audit) -│ └── 20260203_120000_jira_issue_created.json -│ -└── parquet/jira/ # Transformed data (monthly partitioned) - ├── issues/ - │ ├── 2024-01.parquet - │ └── 2024-02.parquet - ├── comments/ - ├── attachments/ # Metadata only (not binary) - └── changelog/ - -~/server/parquet/jira/ # Distribution directory (symlink or copy) - # This is what analysts sync via rsync -``` - -**Monthly partitioning:** Each issue belongs to the month of its `created_at` date. When an issue is updated, only that month's Parquet file changes. Rsync detects changed files by checksum and only transfers those (~50-100KB per month). +Receives webhooks from Atlassian Jira for real-time issue sync. ### Configuration -Add to `/opt/data-analyst/.env`: - +Add to `.env`: ```bash -# Jira Webhook Integration JIRA_WEBHOOK_SECRET= -JIRA_DOMAIN=your-org.atlassian.net -JIRA_EMAIL=integration-user@your-domain.com -JIRA_API_TOKEN= - -# SLA polling (JSM service account for elapsed_millis refresh) -JIRA_SLA_EMAIL= -JIRA_SLA_API_TOKEN= -JIRA_CLOUD_ID=f0f7a244-4fb4-41f9-b1f0-b79e24a20f11 +JIRA_API_TOKEN= ``` -**Get Jira API token:** -1. Go to https://id.atlassian.com/manage-profile/security/api-tokens -2. Create API token -3. Store in `.env` as `JIRA_API_TOKEN` +Add to `config/instance.yaml`: +```yaml +jira: + domain: "your-org.atlassian.net" + email: "integration-user@your-domain.com" + webhook_secret: "${JIRA_WEBHOOK_SECRET}" + api_token: "${JIRA_API_TOKEN}" +``` -### Jira Webhook Setup +### Jira webhook setup 1. Go to Jira Admin > System > WebHooks 2. Create new webhook: - - **Name**: `Data Analyst Sync` - **URL**: `https://your-instance.example.com/webhooks/jira` - - **Secret**: Same value as `JIRA_WEBHOOK_SECRET` in `.env` - - **JQL Filter**: `project = "Your Project"` (or your project) - - **Events**: - - Issue: created, updated, deleted - - Comment: created, updated - - Attachment: created - - Issue link: created - -### Endpoints - -| Endpoint | Method | Description | -|----------|--------|-------------| -| `/webhooks/jira` | POST | Receive Jira webhooks | -| `/webhooks/jira/health` | GET | Health check (shows config status) | -| `/webhooks/jira/test` | POST | Manual issue fetch (debug mode only) | + - **Secret**: same value as `JIRA_WEBHOOK_SECRET` + - **Events**: Issue created/updated/deleted, Comment created/updated, Attachment created ### Monitoring ```bash -# Check webhook health +# Health check curl https://your-instance.example.com/webhooks/jira/health -# View recent webhook events -ls -la /data/src_data/raw/jira/webhook_events/ | tail -20 - -# Check saved issues -ls /data/src_data/raw/jira/issues/ | wc -l - -# View webapp logs for webhook processing -tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira +# Webhook processing logs +docker compose logs -f app | grep -i jira ``` -### SLA Polling +## Troubleshooting -SLA elapsed values (`first_response_elapsed_millis`, `time_to_resolution_elapsed_millis`) only update when a webhook fires. For idle open tickets, these values go stale. The SLA polling timer refreshes them periodically and self-heals stale status data from missed webhooks. - -| Component | Description | -|-----------|-------------| -| `jira-sla-poll.service` | Oneshot service that polls open tickets for fresh SLA + status data | -| `jira-sla-poll.timer` | Runs every 15 minutes (10min after boot, then every 15min) | -| `connectors/jira/scripts/poll_sla.py` | Reads Parquet to find open issues, fetches SLA + status via cloud API | -| `connectors/jira/file_lock.py` | Per-issue advisory file locking (shared with webhook handler) | - -**How it works:** -1. Reads Parquet issues to find open tickets with SLA data (~49 tickets) -2. For each: fetches fresh SLA **and status** fields via JSM service account (cloud API) -3. Acquires per-issue advisory file lock (prevents concurrent webhook writes) -4. Updates raw JSON atomically (tempfile + `os.fchmod(0o660)` + os.replace) -5. If ticket is resolved in Jira but "open" locally: logs `Self-healing: SUPPORT-XXXX is resolved in Jira` -6. Calls `transform_single_issue()` to update Parquet + distribution (inside lock) -7. Releases lock - -**Monitoring:** -```bash -# Check timer status -systemctl status jira-sla-poll.timer -systemctl list-timers | grep jira - -# View last run logs -journalctl -u jira-sla-poll.service --since "1 hour ago" - -# Manual dry run (count open issues) -cd /opt/data-analyst/repo -/opt/data-analyst/.venv/bin/python -m connectors.jira.scripts.poll_sla --dry-run -``` - -**Requires:** `JIRA_SLA_EMAIL`, `JIRA_SLA_API_TOKEN`, `JIRA_CLOUD_ID` in `.env`. Timer is auto-enabled by `deploy.sh` when `JIRA_SLA_API_TOKEN` is set. - -### Consistency Monitoring - -Automated check every 30 minutes to detect missing Jira issues caused by webhook losses, disk failures, or processing errors. Validates data integrity by comparing three sources: Jira API (ground truth), raw JSON files, and Parquet data. - -| Component | Description | -|-----------|-------------| -| `jira-consistency.service` | Oneshot service that validates data consistency across all sources | -| `jira-consistency.timer` | Runs every 30 minutes (10min after boot) | -| `jira-consistency-deep.timer` | Weekly full history check (Sunday 3 AM) | -| `connectors/jira/scripts/consistency_check.py` | Validation script with auto-backfill capability | - -**How it works:** -1. Queries Jira API for all issue keys (last 30 days by default) -2. Compares with raw JSON files in `/data/src_data/raw/jira/issues/` -3. Compares with Parquet data in `/data/src_data/parquet/jira/issues/` -4. Auto-backfills if 1-10 issues missing (downloads JSON + transforms to Parquet) -5. Alerts (ERROR log) if 11+ issues missing (requires manual investigation) -6. Re-transforms JSON to Parquet for issues with transform lag - -**Grace period:** Ignores issues created in last 5 minutes to avoid false positives from webhook timing windows. - -**Alert levels:** -- **INFO**: 1-5 missing issues, auto-backfilled successfully -- **WARNING**: 6-10 missing issues, auto-backfilled successfully -- **ERROR**: 11+ missing issues, manual review required (no auto-fix) - -**Monitoring:** -```bash -# Check timer status -systemctl status jira-consistency.timer -systemctl list-timers | grep jira - -# View last run logs -journalctl -u jira-consistency.service --since "1 hour ago" - -# Manual check (dry run) -cd /opt/data-analyst/repo -/opt/data-analyst/.venv/bin/python -m connectors.jira.scripts.consistency_check --dry-run --max-age-days 7 - -# Manual check with auto-fix -/opt/data-analyst/.venv/bin/python -m connectors.jira.scripts.consistency_check --auto-fix --max-age-days 30 - -# View consistency report -cat /data/src_data/raw/jira/_consistency_report.json | python3 -m json.tool -``` - -**Manual recovery (if 11+ issues found):** -```bash -# List missing issues from report -jq -r '.discrepancies.missing_in_json[]' /data/src_data/raw/jira/_consistency_report.json - -# Backfill specific issues -cd /opt/data-analyst/repo -/opt/data-analyst/.venv/bin/python -m connectors.jira.scripts.backfill --issue-keys SUPPORT-15307,SUPPORT-15308 - -# Verify in Parquet -/opt/data-analyst/.venv/bin/python -c " -import duckdb -con = duckdb.connect() -result = con.execute(''' - SELECT issue_key, created_at, summary - FROM read_parquet('/data/src_data/parquet/jira/issues/*.parquet') - WHERE issue_key IN ('SUPPORT-15307', 'SUPPORT-15308') -''').fetchall() -for row in result: - print(row) -" -``` - -**Requires:** `JIRA_DOMAIN`, `JIRA_EMAIL`, `JIRA_API_TOKEN` in `.env`. Timers are auto-enabled by `deploy.sh` when Jira credentials are configured. - -### Security - -- Webhooks are verified using HMAC-SHA256 signature -- API token has read-only access to Jira (no write permissions needed) -- Webhook events are logged for audit purposes -- Multiple services write to `/data/src_data/raw/jira/`: webapp (www-data), SLA poll (root), consistency check (root), backfill scripts (admin users) -- Concurrent writes to the same issue JSON are serialized via per-issue advisory file locking (`connectors/jira/file_lock.py`, `fcntl.flock`). Lock files in `issues/.locks/`. See [#203](https://github.com/keboola/agnes-the-ai-analyst/issues/203). - -## Data Profiler - -Generates YData-inspired statistical profiles for all tables in the data catalog, including Jira support tables. Profiles include per-column statistics, type-specific visualizations (histograms, top values, timelines), data quality alerts, and business context (relationships, metrics). Profiles are preserved across runs — if a table fails to profile, its previous valid data is retained. - -### Architecture - -``` -Cron (update.sh, 3x daily) - Step 2: python -m src.data_sync → parquet + sync_state.json + schema.yml - Step 3: python -m src.profiler → profiles.json - │ - ▼ -/data/src_data/metadata/profiles.json (mode 644, admin1:data-ops) - │ - ▼ -Webapp: GET /api/catalog/profile/ - │ - ▼ -Catalog page: profiler modal (Chart.js visualizations) -``` - -### How It Works - -1. **Profiler runs as Step 4 in `scripts/update.sh`** after data sync and metadata generation -2. **Materializes Parquet into DuckDB** — `CREATE TEMP TABLE` loads each table once into DuckDB columnar storage (instead of re-reading Parquet files for every query) -3. **Batch statistics** — base stats (COUNT, COUNT DISTINCT) for all columns in one query; type-specific aggregates (numeric, string, date, boolean) batched per category -4. **Large tables** (>500K rows) are sampled: `USING SAMPLE 500000 ROWS` -5. **Merges metadata** from `data_description.md` (descriptions, foreign keys), `sync_state.json` (row counts, file sizes), and `docs/metrics/*.yml` (business metric mappings) -6. **Writes `profiles.json`** atomically (`tempfile.mkstemp()` + `os.chmod(0o644)` + `os.replace()`) -7. **Preserves existing profiles on failure** — if a table fails to profile, the previous valid profile is retained (marked `_stale: true`) -8. **Profiler failure is non-fatal** — if the entire profiler fails, the update pipeline continues -9. **Jira table relationships** — `issue_key` foreign keys are defined between all Jira tables (comments, attachments, changelog, issuelinks, remote_links → jira_issues), visible in the Relationships tab - -### Output File - -``` -/data/src_data/metadata/profiles.json # ~900 KB for ~29 tables -``` - -**Permissions:** File must be `644` (world-readable) so the webapp (`www-data`) can serve it. The profiler sets `os.chmod(tmp, 0o644)` before `os.replace()` because `mkstemp()` defaults to `600`. - -### Per-Table Profile Structure - -Each table profile contains: - -| Field | Source | Description | -|-------|--------|-------------| -| `row_count`, `column_count` | DuckDB | Table dimensions | -| `file_size_mb` | sync_state.json | Parquet file size on disk | -| `description`, `primary_key` | data_description.md | Business context | -| `avg_completeness` | DuckDB | Average non-null percentage across columns | -| `missing_cells`, `missing_cells_pct` | DuckDB | Total NULL cells count and percentage | -| `duplicate_rows` | DuckDB | `COUNT(*) - COUNT(DISTINCT *)` | -| `date_range` | DuckDB | Earliest/latest date from date columns | -| `variable_types` | DuckDB | Breakdown by type (STRING, NUMERIC, DATE, BOOLEAN) | -| `alerts` | Computed | Auto-detected data quality issues (see below) | -| `related_tables` | data_description.md | Foreign key relationships (outgoing + incoming) | -| `used_by_metrics` | docs/metrics/*.yml | Which business metrics use this table | -| `sample_rows` | DuckDB | First 5 rows for preview | -| `columns` | DuckDB | Per-column detailed statistics | -| `_stale` | Profiler | `true` if this profile is from a previous run (current profiling failed) | - -### Alert System - -Auto-detection of data quality issues, displayed as colored badges: - -| Alert | Condition | Severity | -|-------|-----------|----------| -| `constant` | `unique_count == 1` | warning (yellow) | -| `unique` | `unique_pct == 100%` | info (red) | -| `high_missing` | `missing_pct > 30%` | error (red) | -| `missing` | `missing_pct > 5%` | warning (yellow) | -| `imbalance` | `top_value_pct > 60%` (categorical) | info (blue) | -| `zeros` | `zero_pct > 50%` (numeric) | info (blue) | -| `high_cardinality` | `unique_count > 50` (text) | info (grey) | - -### Type-Specific Column Statistics - -| Column Type | Statistics | Visualization | -|-------------|-----------|---------------| -| **STRING** (low cardinality ≤50) | Top 10 values with counts/percentages | Horizontal bar chart | -| **STRING** (high cardinality >50) | min/max/avg length, sample values | Sample list | -| **NUMERIC** (FLOAT64, INT64, DECIMAL) | min, max, mean, median, p5/p25/p75/p95, stddev, zeros | Histogram (10-20 buckets) | -| **DATE/TIMESTAMP** | earliest, latest, span_days | Timeline histogram (quarterly) | -| **BOOLEAN** | true_count, false_count, true_pct | True/false ratio bar | - -### Webapp Integration - -**API endpoint:** `GET /api/catalog/profile/` (requires login) -- Returns JSON profile for a single table from `profiles.json` -- 404 if profiler hasn't run yet or table not found -- 500 if file unreadable (check permissions) - -**Catalog page:** Click any table row to open profiler modal with tabs: -- **Overview** — dataset statistics + variable type breakdown -- **Variables** — per-column cards with type-specific charts (Chart.js) -- **Alerts** — all detected issues with colored severity badges -- **Missing Values** — horizontal bar chart of completeness per column -- **Relationships** — foreign key links (clickable to open related table's profile) -- **Sample** — first 5 rows in table format - -### Performance - -- **Runtime:** ~1-2 minutes for ~29 tables (optimized from ~8min via TABLE materialization + batch queries) -- **Sampling:** Tables >500K rows use `USING SAMPLE 500000 ROWS` for consistent performance -- **Memory:** In-memory DuckDB with temporary tables (dropped after profiling) -- **Output size:** ~900 KB JSON for ~29 tables (including 6 Jira tables) - -### Files - -| File | Description | -|------|-------------| -| `src/profiler.py` | Profiler engine (~1220 lines) | -| `tests/test_profiler.py` | Unit + integration tests (24 tests) | -| `scripts/update.sh` | Pipeline integration (Step 4) | -| `webapp/app.py` | API route `/api/catalog/profile/` | -| `webapp/templates/catalog.html` | Profiler modal UI + Chart.js | - -### Monitoring +### Container won't start ```bash -# Manual profiler run -ssh kids "cd /opt/data-analyst/repo && source /opt/data-analyst/.venv/bin/activate && python -m src.profiler" - -# Check output -ssh kids "ls -la /data/src_data/metadata/profiles.json" -ssh kids "python3 -c \"import json; d=json.load(open('/data/src_data/metadata/profiles.json')); print(f'Tables: {len(d[\\\"tables\\\"])}')\"" - -# Check update.sh logs (profiler runs as Step 4) -ssh kids "cat /var/log/update.log | grep -A5 'Generating data profiles'" - -# Test API endpoint -curl -s https://your-instance.example.com/api/catalog/profile/company | python3 -m json.tool | head -20 +docker compose logs app | tail -50 +# Look for configuration or DuckDB errors at startup ``` -### Troubleshooting +### DuckDB locked -**"Profile data not available for this table"** -- Profiler hasn't been run yet, or table name doesn't match -- Run manually: `python -m src.profiler` on server -- Note: Since v1.1, profiler preserves old profiles on failure — this should only appear for truly new tables - -**HTTP 500 on `/api/catalog/profile/*`** -- Check file permissions: `ls -la /data/src_data/metadata/profiles.json` — must be `644` -- Fix: `sudo chmod 644 /data/src_data/metadata/profiles.json` -- Root cause: `mkstemp()` creates files with `600`; fixed in profiler.py with `os.chmod(0o644)` - -**Profiler takes too long** -- Normal runtime is ~1-2 minutes; if significantly longer, check which tables are large in profiler logs -- Sampling threshold is 500K rows (configurable in `src/profiler.py` constant `SAMPLE_THRESHOLD`) -- TABLE materialization + batch queries keep it fast; if DuckDB runs out of memory, check server RAM - -**Metrics not showing in profiler** -- Metrics are loaded from `docs/metrics/` directory (split by category: `docs/metrics/*/*.yml`) -- Legacy `docs/metrics.yml` path is still supported but the directory structure takes precedence -- Check that metric files exist: `ls docs/metrics/*/*.yml` - -## Corporate Memory - -A knowledge sharing system that extracts reusable insights from analysts' personal notes (`CLAUDE.local.md`), lets the team vote on them via a webapp, and syncs upvoted items back to each user's Claude Code rules. - -### Architecture - -``` -┌─────────────────────────────────────┐ -│ Analyst Workstations │ -│ ├── CLAUDE.local.md │ ← Personal notes (synced to server) -│ └── .claude/rules/*.md │ ← Synced rules from upvoted items -└─────────────────────────────────────┘ - │ sync_data.sh ▲ sync_data.sh - │ (upload CLAUDE.local.md) │ (download .claude_rules/*) - ▼ │ -┌─────────────────────────────────────┐ │ -│ Server: /home/{user}/ │ │ -│ ├── CLAUDE.local.md │ │ -│ └── .claude_rules/*.md │───┘ -└─────────────────────────────────────┘ - │ corporate-memory.timer (every 30 min) - ▼ -┌─────────────────────────────────────┐ -│ Knowledge Collector (full refresh) │ -│ ├── MD5 hash change detection │ -│ ├── ALL files + existing catalog │ -│ │ → single Claude Haiku 4.5 call │ -│ │ (Structured Outputs) │ -│ ├── Sensitivity check (new items) │ -│ └── Save to knowledge.json │ -└─────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────┐ -│ /data/corporate-memory/ │ -│ ├── knowledge.json │ -│ ├── votes.json │ -│ └── user_hashes.json │ -└─────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────┐ -│ Webapp: /corporate-memory │ -│ ├── Browse, search, filter │ -│ ├── Upvote / downvote items │ -│ └── On vote → regenerate user rules│ -└─────────────────────────────────────┘ -``` - -### How It Works - -#### Collection (server-side, every 30 min) - -1. **Analysts write notes** in `CLAUDE.local.md` during their work with Claude Code -2. **`sync_data.sh`** uploads `CLAUDE.local.md` to `/home/{user}/CLAUDE.local.md` on the server -3. **Collector checks for changes** by comparing MD5 hashes of all users' files against `user_hashes.json` -4. **If any file changed**, collector sends ALL users' files + the existing knowledge catalog to **Claude Haiku 4.5** in a single API call (full refresh approach) -5. **Haiku maps knowledge** to existing catalog items (preserving IDs for vote stability) or creates new items -6. **Sensitivity check** runs only on newly created items (existing items were already checked) -7. **Knowledge base** is updated atomically (`tempfile` + `os.replace`) - -#### Voting and Rules Sync (webapp → analyst) - -1. **Users browse** knowledge at `/corporate-memory` (search, filter by category, sort by score) -2. **Upvoting an item** records the vote in `votes.json` and immediately regenerates the user's rule files -3. **Rule files** are installed to `/home/{server_user}/.claude_rules/{item_id}.md` via the `install-user-rules` sudo helper (see below) -4. **Next `sync_data.sh` run** downloads `.claude_rules/*` to the analyst's `.claude/rules/` directory -5. **Claude Code** automatically reads files from `.claude/rules/` as project context - -There is no threshold - any personal upvote syncs the item to that user's rules. - -#### Rules Installation (sudo helper) - -The webapp runs as `www-data` which cannot write to `/home/{user}/` directories (mode `drwxr-x---`). Rule files are installed using the established **sudo install pattern** (same approach as `sync_settings_service.py` for `.sync_settings.yaml`): - -1. Webapp writes rule `.md` files to a temp directory -2. Calls `sudo -n /usr/local/bin/install-user-rules {username} {tmp_dir}` -3. Helper script creates `/home/{user}/.claude_rules/` (mode 700), removes old `km_*.md` files, installs new files with `/usr/bin/install -o {user} -g {user} -m 600` -4. Webapp cleans up the temp directory - -**Files involved:** -- `server/bin/install-user-rules` → deployed to `/usr/local/bin/install-user-rules` -- `server/sudoers-webapp` → entry: `www-data ALL=(ALL) NOPASSWD: /usr/local/bin/install-user-rules` -- `webapp/corporate_memory_service.py` → `_regenerate_user_rules()` calls the helper via `subprocess.run()` - -### Username Mapping - -The webapp uses email-derived usernames (e.g., `john.doe`) while the server uses Linux home directory names (e.g., `john`). Most users match directly; add overrides when they differ. - -Mapping is in `webapp/corporate_memory_service.py`: -```python -WEBAPP_TO_SERVER_USERNAME = { - "john.doe": "john", -} -``` - -Display names for avatars (initials + tooltip): -```python -USER_DISPLAY_NAMES = { - "john": {"name": "John Doe", "initials": "JD"}, - "jane.smith": {"name": "Jane Smith", "initials": "DD"}, - "mike.brown": {"name": "Mike Brown", "initials": "MM"}, - "tom.davis": {"name": "Tom Davis", "initials": "JM"}, - "alice.wilson": {"name": "Alice Wilson", "initials": "PD"}, -} -``` - -### Data Files - -``` -/data/corporate-memory/ # deploy:data-ops, mode 2770 -├── knowledge.json # Extracted knowledge items + metadata -├── votes.json # Per-user votes {username: {item_id: 1/-1}} -├── user_hashes.json # MD5 hashes for change detection -└── collection.log # Collection run history - -/home/{user}/ -├── CLAUDE.local.md # User's personal notes (source) -└── .claude_rules/ # Generated rule files (mode 700, owner-only) - ├── km_abc123.md # mode 600, owned by user - └── km_def456.md -``` - -**knowledge.json structure:** -```json -{ - "items": { - "km_abc123": { - "id": "km_abc123", - "title": "DuckDB Schema Reference Protocol", - "content": "Always read schema before queries...", - "category": "workflow", - "tags": ["duckdb", "best-practices"], - "source_users": ["john"], - "extracted_at": "2026-02-05T21:54:18Z", - "updated_at": "2026-02-05T21:54:18Z" - } - }, - "metadata": { - "last_collection": "2026-02-05T21:54:18Z", - "total_users": 3 - } -} -``` - -**votes.json structure:** -```json -{ - "john": { - "km_abc123": 1, - "km_def456": -1 - } -} -``` - -### Full Refresh Approach - -The collector uses a **full refresh** strategy to avoid duplicates: - -1. **Change detection**: MD5 hash of each user's `CLAUDE.local.md` is compared against `user_hashes.json` -2. **If no changes**: Skip the API call entirely (saves cost) -3. **If any file changed**: Load ALL user files and the existing catalog -4. **Single Haiku call**: The prompt includes the existing catalog with IDs, so Haiku can: - - Map knowledge to existing items (preserving `existing_id` for vote stability) - - Merge similar knowledge from different users into single items - - Add genuinely new items (assigned new `km_*` IDs) - - Preserve `source_users` from existing items even if a user removed their notes -5. **Sensitivity check**: Only NEW items (without `existing_id`) are checked - existing items passed the check previously - -This approach ensures: -- No duplicates from non-deterministic AI output -- Stable item IDs across runs (votes are preserved) -- Cross-user knowledge merging in a single pass - -### Systemd Services - -| Service | Type | Schedule | Description | -|---------|------|----------|-------------| -| `corporate-memory.service` | oneshot | on-demand | Runs the knowledge collector | -| `corporate-memory.timer` | timer | every 30 min | Triggers the service | - -**Service configuration:** -- Runs as `root` (needed to read `/home/*/CLAUDE.local.md`) -- Group: `data-ops` -- Timeout: 600 seconds (for API calls) -- Security hardening: `ProtectSystem=strict`, `PrivateTmp=true` - -### Configuration - -**Required GitHub Secret:** - -| Secret | Description | -|--------|-------------| -| `ANTHROPIC_API_KEY` | Claude API key for Haiku 4.5 extraction | - -The API key is deployed to `/opt/data-analyst/.env` via CI/CD and loaded by the collector service. - -**Model:** `claude-haiku-4-5-20251001` with Structured Outputs (`output_config.format.json_schema`) - -### Knowledge Categories - -| Category | Description | -|----------|-------------| -| `data_analysis` | DuckDB, Parquet, data processing techniques | -| `api_integration` | API usage, HTTP clients, authentication | -| `debugging` | Error diagnosis, troubleshooting techniques | -| `performance` | Optimization, caching, efficiency improvements | -| `workflow` | Best practices, processes, conventions | -| `infrastructure` | Server, deployment, configuration | -| `business_logic` | Domain knowledge, data relationships | - -### Extraction Process - -The collector uses **Claude Haiku 4.5** with **Structured Outputs** for guaranteed JSON schema compliance: - -1. **Catalog refresh prompt** sends all user files + existing catalog to Haiku -2. **JSON Schema** enforces output format including `existing_id` (string or null) for ID preservation -3. **Sensitivity check** verifies only NEW items are safe to share -4. **ID assignment**: Existing items keep their IDs; new items get `km_{uuid[:8]}` format - -**Filtering rules (in the prompt):** -- EXCLUDE: API keys, tokens, passwords, credentials -- EXCLUDE: Personal preferences, project-specific paths -- EXCLUDE: Basic knowledge any developer would know -- EXCLUDE: Incomplete or unclear notes -- EXCLUDE: Anything referencing specific people negatively - -### Manual Reset - -To recalculate the entire knowledge base from scratch (e.g., after fixing duplicates): +If the app crashes mid-write, DuckDB may hold a write lock: ```bash -# Reset: clears knowledge.json, votes.json, user_hashes.json, and stale .claude_rules -sudo /usr/local/bin/collect-knowledge --reset --verbose +docker compose down +# Wait a few seconds, then: +docker compose up -d ``` -The `--reset` flag: -1. Clears `knowledge.json`, `user_hashes.json`, and `votes.json` -2. Removes stale `.claude_rules/km_*.md` files from all user home directories -3. Runs a fresh collection from all `CLAUDE.local.md` files +DuckDB releases locks when the process exits cleanly. A forced restart resolves +most lock issues. -This is a manual operation, not part of the regular timer schedule. - -### Monitoring +### Sync failing ```bash -# Check timer status -sudo systemctl status corporate-memory.timer +# Check sync logs +docker compose logs app | grep -i "sync\|error\|exception" -# View last collection -sudo journalctl -u corporate-memory -n 50 --no-pager - -# Manual collection run -sudo systemctl start corporate-memory.service - -# Manual run with verbose output (shows API calls, items found) -sudo /usr/local/bin/collect-knowledge --verbose - -# View knowledge base -cat /data/corporate-memory/knowledge.json | python3 -m json.tool - -# Check item count -cat /data/corporate-memory/knowledge.json | python3 -c "import json,sys; d=json.load(sys.stdin); print(f'Items: {len(d.get(\"items\", {}))}')" - -# Check votes -cat /data/corporate-memory/votes.json | python3 -m json.tool - -# Check user hashes (change detection state) -cat /data/corporate-memory/user_hashes.json | python3 -m json.tool - -# View a user's synced rules -ls -la /home/john/.claude_rules/ +# Verify data source credentials in .env +docker compose exec app da admin tables list ``` -### Webapp Integration - -The Corporate Memory page at `/corporate-memory` provides: -- **Dashboard stats**: Total items, contributors, categories, last collection time -- **Knowledge cards**: Title, content, category badge, tags, contributor avatars (initials + tooltip) -- **Voting**: Upvote/downvote buttons per item (instantly updates score, regenerates user rules) -- **Filtering**: By category dropdown, text search (title + content + tags) -- **Sorting**: By score (default), by date, by number of contributors -- **"My Rules" toggle**: Shows only items the current user has upvoted -- **User stats**: Number of votes cast, number of active rules - -**API endpoints:** -- `GET /api/corporate-memory/knowledge` - List items (supports `category`, `search`, `sort`, `page`, `my_rules` params) -- `POST /api/corporate-memory/vote` - Cast vote `{item_id, vote: 1/-1/0}` -- `GET /api/corporate-memory/stats` - Dashboard statistics - -### Security - -- **Root access required**: Collector service runs as root to read `/home/*/CLAUDE.local.md` -- **Sudo helper for rules**: Webapp uses `install-user-rules` via sudo to write to user home dirs (same pattern as `sync_settings_service.py`). Each user's `.claude_rules/` is mode 700, files 600 - users cannot read each other's rules. -- **Sensitivity filtering**: Two-pass check (extraction prompt rules + dedicated sensitivity check on new items) -- **No credentials stored**: Knowledge items are filtered before storage -- **Source attribution**: Items track which users contributed (displayed as avatar initials) -- **Read-only for analysts**: `/data/corporate-memory/` is only writable by data-ops group -- **Atomic writes**: All JSON file updates use `tempfile.mkstemp()` + `os.replace()` to prevent corruption. **Critical:** always call `os.fchmod(fd, 0o660)` (or appropriate mode) immediately after `mkstemp()` — otherwise the default `0600` mode overrides the POSIX ACL mask to `---`, breaking group-based access for other services. See [#203](https://github.com/keboola/agnes-the-ai-analyst/issues/203). - -## Session Collector - -Collects Claude Code session transcripts from analyst home directories and stores them centrally. - -### Architecture - -``` -/home/*/user/sessions/ (per-user session transcripts) - │ - ▼ -session-collector.timer (every 6 hours) - │ - ▼ -/data/user_sessions/ (central storage, root:data-ops, mode 2770) -``` - -### Systemd Services - -| Unit | Type | Schedule | Description | -|------|------|----------|-------------| -| `session-collector.service` | oneshot | on-demand | Runs the session collector | -| `session-collector.timer` | timer | every 6 hours | Triggers the service | - -### Monitoring +### Out of disk space ```bash -sudo systemctl status session-collector.timer -sudo journalctl -u session-collector -n 50 --no-pager +df -h /data +du -sh /data/extracts/* + +# Remove old parquet partitions if needed (check with orchestrator first) +# Trigger a fresh snapshot before any manual cleanup ``` - -### Security - -- **Root access required**: Collector runs as root to read `/home/*/user/sessions/` -- **Central storage**: `/data/user_sessions/` is writable only by data-ops group - -## WebSocket Gateway - -Real-time WebSocket gateway for desktop app notifications and live updates. - -### Architecture - -``` -Desktop App (WebSocket client) - │ - ▼ -ws-gateway.service (deploy:data-ops) - │ - ▼ -/run/ws-gateway/ws.sock (unix socket, mode 0755) -``` - -### Systemd Service - -| Unit | Type | Description | -|------|------|-------------| -| `ws-gateway.service` | simple | WebSocket gateway for desktop clients | - -### Monitoring - -```bash -sudo systemctl status ws-gateway -sudo journalctl -u ws-gateway -n 50 --no-pager -``` - -### Security - -- **JWT authentication**: Desktop clients authenticate via JWT tokens (DESKTOP_JWT_SECRET) -- **Read-only home**: Service has `ProtectHome=read-only` -- **Strict protection**: `ProtectSystem=strict` limits filesystem access - -## Google Cloud Monitoring - -The server uses **Google Cloud Ops Agent** for centralized logging and metrics collection. All logs and metrics are sent to Google Cloud for analysis, alerting, and debugging. - -### What's Collected - -**Logs (Fluent Bit → Cloud Logging):** -- All syslog messages (`/var/log/syslog`, `/var/log/messages`) -- systemd journal logs (including service failures, crashes) -- Application logs (if written to syslog/journal) -- Retention: 30 days (default) - -**Metrics (OpenTelemetry → Cloud Monitoring):** -- CPU utilization (%) -- Memory usage (%) -- Disk usage (%) per device -- Network traffic (bytes sent/received) -- Load average -- Collection interval: 60 seconds -- Retention: 6 weeks (default) - -### Configured Alerts - -Alert notifications are sent to: -- **Admin 1** (admin1@your-domain.com) -- **Admin 2** (admin2@your-domain.com) -- **Admin 3** (admin3@your-domain.com) - -| Alert | Threshold | Duration | Action | -|-------|-----------|----------|--------| -| **High CPU Usage** | >80% | 5 minutes | Check: `ssh kids 'ps aux --sort=-%cpu \| head -20'` | -| **High Memory Usage** | >90% | 5 minutes | Check: `ssh kids 'free -h && ps aux --sort=-%mem \| head -20'` | -| **High Disk Usage** | >85% | 1 minute | Check: `ssh kids 'df -h && du -sh /data/* \| sort -h'` | -| **Health Endpoint Down** | Uptime check fails | 3 minutes | Check: `ssh kids 'systemctl status webapp'` | -| **Health Endpoint Degraded** | /health returns 503 | 2 minutes | Check: `curl https://your-instance.example.com/health` and review service status | -| **Systemd Service/Timer Failures** | Any failure | 1 minute | Check: `ssh kids 'systemctl --failed && journalctl -xe'` | - -### Log-Based Metrics - -Custom metrics derived from logs for trend analysis: - -| Metric | Description | Filter | -|--------|-------------|--------| -| `systemd_service_failures` | Count of systemd service/timer failures | `"Failed with result" OR "failed with result"` | -| `permission_denied_errors` | Count of Permission denied errors | `"Permission denied"` | -| `health_endpoint_degraded` | Count of /health returning 503 | `"/health" AND ("503" OR "degraded")` | - -### Dashboard - -**Server Overview Dashboard:** -- Real-time CPU, Memory, Disk, Network graphs -- Systemd service failures -- Health endpoint status -- URL: https://console.cloud.google.com/monitoring/dashboards/custom/09cdd94b-a0ed-4458-952f-3cca2bd5ba6e?project=your-gcp-project - -### Health Endpoint & Uptime Monitoring - -**Health Endpoint:** https://your-instance.example.com/health - -Returns detailed server status in JSON format: -- **Services**: webapp.service, telegram-bot.service -- **Timers**: jira-consistency.timer, corporate-memory.timer, jira-sla-poll.timer -- **Disk usage**: All partitions (/, /data, /home, /tmp) -- **System load**: 1min, 5min, 15min averages -- **Jira webhook**: Last webhook timestamp and age - -**Response format:** -```json -{ - "status": "healthy", // or "degraded" - "timestamp": "2026-02-13T18:50:33.825333Z", - "services": [{"name": "webapp.service", "status": "active", "healthy": true}], - "timers": [{"name": "jira-consistency.timer", "status": "active", "healthy": true}], - "disk": [ - {"partition": "/", "used_percent": 79.4, "free_gb": 1.98, "healthy": true}, - {"partition": "/data", "used_percent": 39.0, "free_gb": 17.92, "healthy": true} - ], - "load": {"load_1min": 0.58, "load_5min": 1.82, "load_15min": 1.85, "healthy": true}, - "jira_webhook": {"last_webhook_hours_ago": 0.0, "healthy": true} -} -``` - -**HTTP Status Codes:** -- `200 OK` = all checks healthy -- `503 Service Unavailable` = one or more checks failed (status: "degraded") - -**Uptime Check:** -- Monitors /health endpoint from 3 global locations (USA, Europe, Asia-Pacific) -- Check interval: 5 minutes -- Timeout: 10 seconds -- Validates response contains `"status": "healthy"` -- Alert triggered if check fails for 3+ minutes - -### Viewing Logs - -**Cloud Logging Console:** -https://console.cloud.google.com/logs?project=your-gcp-project - -**Useful log queries:** - -``` -# All logs from the server (last 1 hour) -resource.type="gce_instance" -resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef" - -# systemd service failures -resource.type="gce_instance" -resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef" -("Failed with result" OR "Main process exited") - -# Permission denied errors -resource.type="gce_instance" -resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef" -"Permission denied" - -# Webapp errors -resource.type="gce_instance" -resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef" -"gunicorn" AND ("ERROR" OR "WARNING") - -# Jira webhook processing -resource.type="gce_instance" -resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef" -"Received webhook" -``` - -### Viewing Metrics - -**Cloud Monitoring Console:** -https://console.cloud.google.com/monitoring?project=your-gcp-project - -**Metrics Explorer** - Useful metric queries: -- CPU: `compute.googleapis.com/instance/cpu/utilization` -- Memory: `agent.googleapis.com/memory/percent_used` -- Disk: `agent.googleapis.com/disk/percent_used` -- Network: `agent.googleapis.com/network/bytes_sent` / `bytes_recv` - -### Cost - -Google Cloud Monitoring pricing (as of 2026): -- **Logs ingestion**: First 50 GB/month free, then $0.50/GB -- **Metrics ingestion**: First 150 MB/month free, then $0.2580/MB -- **Log storage**: $0.01/GB/month (30-day retention) -- **Typical monthly cost for this server**: ~$5-10 (well within free tier) - -Significantly cheaper than Datadog (~$15-31/host/month). - -### Managing Alerts - -**List alert policies:** -```bash -gcloud alpha monitoring policies list \ - --project=your-gcp-project \ - --format="table(displayName,enabled,conditions[0].conditionThreshold.thresholdValue)" -``` - -**Disable an alert:** -```bash -gcloud alpha monitoring policies update POLICY_ID \ - --project=your-gcp-project \ - --no-enabled -``` - -**Add notification channel:** -```bash -gcloud alpha monitoring channels create \ - --project=your-gcp-project \ - --display-name="New Person" \ - --type=email \ - --channel-labels=email_address=person@your-domain.com -``` - -### Debugging Server Crashes - -When investigating server issues (like the 2026-02-13 systemd-journald crash): - -1. **View logs around the crash time:** - - Go to Cloud Logging Console - - Filter: `resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"` - - Set time range to include the crash - - Look for ERROR/WARNING severity - -2. **Check metrics before the crash:** - - Go to Dashboard or Metrics Explorer - - View CPU/Memory/Disk graphs for the time period - - Look for spikes or anomalies - -3. **Correlate logs with metrics:** - - High CPU spike at 15:20? Check logs from that time - - Memory growth over time? Look for memory leaks in logs - -4. **Export for analysis:** - ```bash - # Export logs to file - gcloud logging read "resource.labels.instance_id=\"656c1763-11a1-49bb-bbc3-9782acf15aef\"" \ - --project=your-gcp-project \ - --limit=1000 \ - --format=json \ - --freshness=1d > server_logs.json - ``` - -### Best Practices - -1. **Structured logging**: Applications should log in JSON format for better searchability -2. **Log levels**: Use appropriate levels (ERROR for problems, INFO for events, DEBUG for details) -3. **Alert fatigue**: Only alert on actionable issues, not informational events -4. **Regular review**: Check dashboard weekly to spot trends before they become problems -5. **Cost monitoring**: If ingestion grows, consider log sampling or exclusion filters diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md index 67cfe28..57a93a5 100644 --- a/docs/CONFIGURATION.md +++ b/docs/CONFIGURATION.md @@ -3,6 +3,7 @@ ## instance.yaml The main configuration file for your AI Data Analyst instance. Located at `config/instance.yaml`. +See `config/instance.yaml.example` for the full annotated template. ### Instance Branding @@ -17,10 +18,11 @@ instance: ```yaml auth: - allowed_domain: "acme.com" # Google OAuth domain restriction + allowed_domain: "acme.com" # Email domain restriction for login ``` -Only emails from this domain can log in via Google OAuth. External users can be added via password auth (requires SendGrid). +Only emails from this domain can log in via Google OAuth or email magic link. +Google OAuth is optional — if not configured, only email magic link auth is available. ### Email @@ -28,9 +30,15 @@ Only emails from this domain can log in via Google OAuth. External users can be email: from_address: "noreply@acme.com" from_name: "Acme Data Analyst" + smtp_host: "${SMTP_HOST}" + smtp_port: 587 + smtp_user: "${SMTP_USER}" + smtp_password: "${SMTP_PASSWORD}" ``` -Used for password auth setup and reset emails. Requires `SENDGRID_API_KEY` in `.env`. +Used for magic link authentication. Without SMTP configured, magic links are shown +directly in the browser (development mode). Compatible with any SMTP relay (Gmail, +Mailgun, SendGrid SMTP, etc.). ### Server @@ -45,6 +53,7 @@ server: ```yaml desktop: jwt_issuer: "acme-analyst" + jwt_secret: "${DESKTOP_JWT_SECRET}" url_scheme: "acme-analyst" ``` @@ -52,22 +61,18 @@ desktop: ```yaml data_source: - type: "keboola" # keboola, csv, bigquery + type: "keboola" # keboola, bigquery, local ``` ### Users ```yaml users: - john.doe: - name: "John Doe" - initials: "JD" - jane.smith: - name: "Jane Smith" - initials: "JS" + admin@acme.com: + display_name: "John Doe" + km_admin: true # Corporate Memory admin (optional) -username_mapping: - john.doe: john # Only if webapp and server names differ +username_mapping: {} # Map webapp email -> server username if different ``` ### Datasets @@ -102,11 +107,15 @@ catalog: ## Environment Variables (.env) +Copy `config/.env.template` to `.env` and fill in values. The template contains +the full variable list with comments. Never commit `.env`. + ### Required | Variable | Description | |----------|-------------| -| `WEBAPP_SECRET_KEY` | Flask session secret | +| `JWT_SECRET_KEY` | FastAPI JWT token secret (generate with `secrets.token_hex(32)`) | +| `SESSION_SECRET` | Session cookie secret (generate with `secrets.token_hex(32)`) | | `GOOGLE_CLIENT_ID` | Google OAuth client ID | | `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | @@ -116,16 +125,29 @@ catalog: |----------|-------------| | `KEBOOLA_STORAGE_TOKEN` | Keboola Storage API token | | `KEBOOLA_STACK_URL` | Keboola stack URL | -| `KEBOOLA_PROJECT_ID` | Keboola project ID | -| `DATA_DIR` | Data directory path | +| `DATA_DIR` | Data directory path (default: `/data` in Docker, `./data` locally) | + +### Data Source (BigQuery) + +| Variable | Description | +|----------|-------------| +| `BIGQUERY_PROJECT` | GCP project for job execution/billing | +| `BIGQUERY_LOCATION` | BigQuery location (e.g., `US`, `us-central1`) | ### Optional | Variable | Description | |----------|-------------| -| `SENDGRID_API_KEY` | For password auth emails | +| `SMTP_HOST` | SMTP relay host for magic link emails | +| `SMTP_PORT` | SMTP port (587 for STARTTLS, 465 for SSL) | +| `SMTP_USER` | SMTP username | +| `SMTP_PASSWORD` | SMTP password | | `TELEGRAM_BOT_TOKEN` | For Telegram notifications | -| `ANTHROPIC_API_KEY` | For Corporate Memory AI | +| `ANTHROPIC_API_KEY` | For Corporate Memory AI (direct Anthropic) | | `LLM_API_KEY` | API key for LLM proxy (LiteLLM, OpenRouter, etc.) | -| `JIRA_WEBHOOK_SECRET` | For Jira integration | +| `JIRA_WEBHOOK_SECRET` | For Jira webhook integration | +| `JIRA_API_TOKEN` | For Jira REST API access | +| `DESKTOP_JWT_SECRET` | Separate secret for desktop app tokens | | `CONFIG_DIR` | Override config directory path | +| `LOG_LEVEL` | Logging level: `debug`, `info`, `warning`, `error` | +| `DOMAIN` | Public hostname for Caddy TLS (production profile) |