agnes-the-ai-analyst/dev_docs/server.md
Petr c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00


Data Broker Server

Central server for distributing data to AI analytical systems.

Basic Information

| Parameter | Value |
|---|---|
| Name | data-broker-for-claude |
| GCP Project | kids-ai-data-analysis |
| Zone | europe-north1-a |
| Type | e2-medium |
| OS | Debian 12 (bookworm) |
| External IP | YOUR_SERVER_IP |

Hardware

| Resource | Size |
|---|---|
| RAM | 3.8 GB |
| Swap | 2 GB (/mnt/swapfile) |
| System disk (sda) | 10 GB - OS, packages, app (expendable) |
| Data disk (sdb) | 30 GB - /data, pd-balanced (snapshotted) |
| Home disk (sdc) | 30 GB - /home, pd-balanced (snapshotted) |
| Temp disk (sdd) | 100 GB - /tmp, pd-standard (not snapshotted) |

Access

SSH connection (admin)

ssh kids

Requires SSH config:

Host kids
  HostName YOUR_SERVER_IP
  User padak
  IdentityFile ~/.ssh/google_compute_engine

Or via gcloud:

gcloud compute ssh data-broker-for-claude --project=kids-ai-data-analysis --zone=europe-north1-a

Data Structure

/data/                      # Data disk (30 GB, pd-balanced)
├── lost+found/             # System directory
├── src_data/               # Source data (group: dataread, 750)
│   ├── raw/                # Raw data from Keboola (reserved for future use)
│   ├── parquet/            # Converted data (parquet format)
│   │   ├── sales/          # CRM data (in.c-crm bucket) - group: dataread
│   │   └── private/        # Private data - group: data-private
│   ├── metadata/           # Sync state, cache, profiles
│   │   ├── sync_state.json # Per-table sync stats (rows, columns, size)
│   │   └── profiles.json   # Data profiler output (mode 644, ~900 KB)
│   └── staging/            # Temporary processing (reserved for future use)
├── docs/                   # Documentation (deployed from repo)
│   └── schema.yml          # Auto-generated table schemas (from data sync)
├── scripts/                # Helper scripts (deployed from repo)
├── examples/               # Example notification scripts (padak:data-ops, 755)
│   └── notifications/      # Example notification scripts for analysts
├── notifications/          # Notification data (deploy:data-ops, 2770 setgid)
│   ├── telegram_users.json # username -> {chat_id, linked_at} mapping
│   ├── desktop_users.json  # username -> {linked_at} mapping (desktop app link state)
│   ├── pending_codes.json  # temporary verification codes
│   └── bot.log             # Bot service log
├── auth/                   # Password auth data (www-data:data-ops, 2770 setgid)
│   └── users.json          # Hashed passwords and metadata
├── corporate-memory/       # Knowledge base data (deploy:data-ops, 2770 setgid)
│   ├── knowledge.json      # Collected knowledge items from CLAUDE.local.md files
│   ├── votes.json          # User votes on knowledge items
│   └── user_hashes.json    # MD5 hashes for change detection
└── user_sessions/          # Session collector data (root:data-ops, 2770 setgid)
    └── *.jsonl             # User session logs collected every 6 hours

/run/notify-bot/                # Systemd RuntimeDirectory (mode 0755)
└── bot.sock                    # Unix socket for send API (mode 0666)

/tmp/keboola_load/              # Keboola staging directory (root:data-ops, 2770 setgid)
└── *.parquet                   # Temporary Parquet files during Keboola data load

Folder Mapping

Parquet subfolders are mapped from Keboola bucket names in docs/data_description.md:

folder_mapping:
  in.c-crm: sales        # CRM/Salesforce data
  in.c-private: private  # Private/sensitive data

This mapping is used by src/config.py to determine where to save Parquet files.
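
As a rough illustration of how such a mapping can be resolved, here is a minimal sketch. It is not the code in src/config.py; the function name parquet_dir_for, the table ID "in.c-crm.company", and the "<bucket>.<table>" ID shape are assumptions made for the example.

```python
from pathlib import Path

# Mapping copied from the folder_mapping example above.
FOLDER_MAPPING = {
    "in.c-crm": "sales",
    "in.c-private": "private",
}


def parquet_dir_for(table_id: str, data_dir: Path) -> Path:
    """Return the Parquet output directory for a table ID.

    Assumption: a table ID has the shape "<bucket>.<table>", where the
    bucket name itself contains one dot (e.g. "in.c-crm").
    """
    bucket, _, table = table_id.rpartition(".")
    if not bucket or not table:
        raise ValueError(f"not a '<bucket>.<table>' ID: {table_id!r}")
    if bucket not in FOLDER_MAPPING:
        raise KeyError(f"bucket {bucket!r} has no folder mapping")
    return data_dir / "parquet" / FOLDER_MAPPING[bucket]
```

Raising on an unmapped bucket, rather than falling back to a default folder, keeps private data from silently landing in a directory readable by the dataread group.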

Access Control

Three-tier permission model:

| Role | Groups | Access |
|---|---|---|
| Standard Analyst | dataread | Public data read-only |
| Privileged Analyst | dataread + data-private | Public + private data read-only |
| Admin | sudo + google-sudoers + dataread + data-private + data-ops | Full server access (NOPASSWD) + all data read/write + deployment |
  • Standard Analyst - can read public data, sync via rsync, run scripts in their workspace
  • Privileged Analyst - same as standard + access to private/sensitive data (executives, management)
  • Admin - server administration, can add/remove users, has sudo privileges, full data access with write permissions, can deploy application updates

Data Directory Permissions

Data in /data/src_data/ uses ACL for granular access:

/data/src_data/          owner: padak, group: data-ops
├── raw/                 data-ops: rwx, dataread: r-x
├── parquet/             data-ops: rwx, dataread: r-x
│   └── private/         data-ops: rwx, data-private: r-x
└── staging/             data-ops: rwx, dataread: r-x
  • Admins (data-ops): Full read/write access to prepare data
  • Analysts (dataread): Read-only access to consume data
  • Private data (data-private): Additional group for sensitive data access

Atomic writes and ACL — required pattern:

Directories under /data/ use default ACLs (e.g., default:group:data-ops:rwx). Files created with open() inherit these correctly. However, tempfile.mkstemp() explicitly sets mode 0600, which overrides the ACL mask to --- and silently breaks group access for all other services.

Always use os.fchmod() immediately after mkstemp():

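import json
import os
import tempfile

# target: pathlib.Path of the destination file; data: JSON-serializable object
fd, tmp_path = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
os.fchmod(fd, 0o660)  # REQUIRED: restore ACL mask for group access
try:
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, str(target))  # atomic rename within the same directory
except Exception:
    os.unlink(tmp_path)  # remove the temp file, then re-raise
    raise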

Use 0o660 for files accessed by services via data-ops group ACL, 0o644 for world-readable files (e.g., profiler output). See #203 for a production incident caused by missing fchmod.
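
One way to make the fchmod step hard to forget is to wrap the pattern in a helper. The sketch below is illustrative only: the name atomic_write_json and its mode parameter are assumptions, not functions from the repository.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(target: Path, data, mode: int = 0o660) -> None:
    """Write JSON to `target` atomically, with an explicit file mode.

    Hypothetical helper: pass mode=0o644 for world-readable output.
    """
    # The temp file must live in the target directory so os.replace()
    # stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            # mkstemp created the file as 0600; widen it before any
            # other process can see it under its final name.
            os.fchmod(f.fileno(), mode)
            json.dump(data, f, indent=2)
        os.replace(tmp_path, str(target))
    except BaseException:
        # Best-effort cleanup; the temp file may already be gone.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

A caller would then write `atomic_write_json(Path("/data/notifications/pending_codes.json"), codes)` and rely on the default 0o660.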

Per-issue file locking for concurrent writers:

When multiple services write to the same JSON file (e.g., SLA poll and webhook handler both updating /data/src_data/raw/jira/issues/SUPPORT-1234.json), use advisory file locking to prevent races:

from src.jira_file_lock import issue_json_lock

with issue_json_lock(issues_dir, issue_key):
    # read JSON, modify, atomic write, transform to Parquet
    ...
  • Uses fcntl.flock() (POSIX advisory, blocking, exclusive)
  • Lock files stored in {issues_dir}/.locks/{issue_key}.lock
  • Different issue keys don't block each other (fine-grained locking)
  • The lock must cover the entire read-modify-write and the Parquet transform — otherwise another writer could overwrite the JSON between write and transform, causing the transform to read stale data

Currently used by:

  • scripts/jira_poll_sla.py — wraps SLA+status update + transform_single_issue()
  • webapp/jira_service.py — wraps save_issue() JSON write + trigger_incremental_transform(), and _handle_deletion() read-modify-write + transform

Attachment downloads in save_issue() intentionally run outside the lock (can take tens of seconds and don't modify JSON).
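
For readers unfamiliar with fcntl.flock(), the following is a minimal sketch of a per-issue lock with the properties listed above. It is not the source of src.jira_file_lock; the name issue_lock_sketch and the 0o660 lock-file mode are assumptions.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def issue_lock_sketch(issues_dir: Path, issue_key: str):
    """Hold an exclusive advisory lock for one issue key (illustrative)."""
    lock_dir = Path(issues_dir) / ".locks"
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{issue_key}.lock"
    # Open (or create) the lock file; its content is never used.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o660)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield lock_path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Because the lock is advisory, it only protects writers that also take it; a process that writes the JSON file without entering the context manager is not blocked.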

User Management

Each user has:

  • Own Linux account with home directory /home/username/
  • Server symlinks: /home/username/server/ (read-only links to /data/)
  • User workspace: /home/username/user/ (writable: duckdb, notifications, artifacts, scripts, parquet)
  • Notification state: /home/username/.notifications/{state,logs}
  • SSH key authentication

Management Commands

# Add standard analyst (public data only)
sudo add-analyst username "ssh-rsa AAAA... comment"

# Add privileged analyst (public + private data)
sudo add-analyst username "ssh-rsa AAAA... comment" --private

# Add server admin (sudo + all data)
sudo add-admin username "ssh-rsa AAAA... comment"

# List all analysts
list-analysts

# Remove user (interactive)
sudo remove-analyst username

# Remove user (non-interactive, e.g., via SSH)
sudo remove-analyst username --force

Examples

# Regular analyst
sudo add-analyst novak "ssh-rsa AAAAB3... jan.novak@example.com"

# Executive with private data access
sudo add-analyst ceo "ssh-rsa AAAAB3... ceo@example.com" --private

# Server administrator
sudo add-admin matejkys "ssh-rsa AAAAB3... matejkys@example.com"
sudo add-admin dasa "ssh-ed25519 AAAAC3... dasa@your-domain.com"

Output for admin:

Admin matejkys created successfully
  - Added to group: sudo (server administration)
  - Added to group: dataread (public data access)
  - Added to group: data-private (private data access)
  - Added to group: data-ops (application deployment)
  - Added to resource limits (unlimited)
  - Workspace: /home/matejkys/workspace
  - Data link: /home/matejkys/data -> /data/src_data

SSH Configuration

  • Passwords disabled (SSH keys only)
  • Root login disabled
  • MaxSessions: 20 (per user)
  • MaxStartups: 30:50:100 (throttles concurrent unauthenticated connections: sshd starts dropping 50% of new ones at 30 pending and refuses all at 100)
  • ClientAliveInterval: 300s

Resource Limits

Protection against fork bombs and resource abuse. Configuration is version-controlled in server/limits-users.conf and deployed automatically by deploy.sh to /etc/security/limits.d/99-users.conf:

| Resource | Analysts | Admins |
|---|---|---|
| Max processes (nproc) | 100/150 | unlimited |
| Virtual memory (as) | 4 GB / 6 GB | unlimited |
| File size (fsize) | 2 GB / 4 GB | unlimited |
| Open files (nofile) | 1024/2048 | 65535 |
| Core dumps | disabled | unlimited |
  • Admins (data-ops group members) are explicitly listed in the limits file with unlimited access
  • New admins are automatically added to exceptions by add-admin script
  • All other users get restricted limits via wildcard rule (protection against fork bombs)
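
To check which limits actually apply to a session, the values can be read from inside a process. This is a small sketch using Python's standard resource module; the helper names are illustrative and nothing here is part of the repository.

```python
import resource

# limits.conf item names from the table above -> resource constants.
LIMITS = {
    "nproc": resource.RLIMIT_NPROC,
    "as": resource.RLIMIT_AS,
    "fsize": resource.RLIMIT_FSIZE,
    "nofile": resource.RLIMIT_NOFILE,
    "core": resource.RLIMIT_CORE,
}


def fmt(value: int) -> str:
    """Render a raw limit, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {name: (soft, hard)} for the calling process.

    Note: getrlimit() reports 'as', 'fsize' and 'core' in bytes,
    whereas limits.conf expresses them in KB.
    """
    return {
        name: tuple(fmt(v) for v in resource.getrlimit(const))
        for name, const in LIMITS.items()
    }
```

Running `print(current_limits())` in an analyst's SSH session and in an admin's session should show the difference between the wildcard rule and the explicit exceptions.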

Data Sync Scripts

Server: update.sh

Syncs data from Keboola to Parquet files. Runs via cron three times a day (6:00, 14:00, 19:00 UTC).

cd /opt/data-analyst/repo && ./scripts/update.sh

What it does:

  1. Activates virtual environment (supports both local ./.venv and server /opt/data-analyst/.venv)
  2. Downloads data from Keboola Storage API, converts to Parquet format in DATA_DIR/parquet/{folder}/
  3. Generates data profiles (python -m src.profiler → profiles.json); a failure in this step is non-fatal

Cron setup:

sudo crontab -u deploy -e
# Add:
# MAILTO=admin@your-domain.com
# 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log

Client: sync_data.sh

Main sync script for analysts. Syncs docs, scripts, data, and regenerates CLAUDE.md:

bash server/scripts/sync_data.sh            # Full sync (pull server/ + push user/)
bash server/scripts/sync_data.sh --dry-run  # Preview only
bash server/scripts/sync_data.sh --push     # Only upload user/ to server

What it does:

  1. Syncs server/docs/, server/scripts/, server/examples/, server/metadata/ from server
  2. Regenerates CLAUDE.md from latest template (preserves username, never touches CLAUDE.local.md)
  3. Updates .claude/settings.json with project permissions from server
  4. Syncs parquet data files to server/parquet/ (incremental)
  5. Uploads user/ to server (backup + runtime for notifications)
  6. Downloads corporate memory rules from ~/.claude_rules/ to .claude/rules/
  7. Updates sync timestamp on server (touch ~/server/) - used by the webapp Account card "Last Sync" display. Each user's ~/server/ directory is per-user, so the timestamp is independent.
  8. Reinitializes DuckDB in user/duckdb/ (core tables via duckdb_manager.py, optional dataset views via sync_jira.sh --views-only etc.)

Note: Rsync uses --delete to remove obsolete files from client (e.g., old monthly partitions when switching to daily). Files are compared by mtime+size (no --checksum for better performance). If rsync is not available (Windows without WSL), scp is used as fallback with explicit dotfile handling.
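
The sync timestamp from step 7 is just the modification time of the per-user ~/server/ directory. A minimal sketch of how a reader of that timestamp could work is shown below; the function name last_sync and the home_root parameter are assumptions, and the webapp's actual code may differ.

```python
from datetime import datetime, timezone
from pathlib import Path


def last_sync(username: str, home_root: Path = Path("/home")):
    """Return the mtime of <home_root>/<username>/server in UTC.

    Returns None when the directory does not exist, e.g. for a user
    who has never been provisioned.
    """
    server_dir = home_root / username / "server"
    try:
        mtime = server_dir.stat().st_mtime
    except FileNotFoundError:
        return None
    return datetime.fromtimestamp(mtime, tz=timezone.utc)
```

Since `touch ~/server/` only changes the directory's own mtime, the value is unaffected by changes inside the symlinked /data/ directories.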

CLAUDE.md update mechanism:

  • CLAUDE.md is regenerated from server/docs/setup/claude_md_template.txt on every sync
  • Template is maintained centrally and deployed to server via CI/CD
  • User's personal CLAUDE.local.md is never overwritten (higher priority in Claude Code)
  • New features added to template are automatically delivered to all analysts on next sync

Claude Code settings.json:

  • .claude/settings.json is copied from server/docs/setup/claude_settings.json on every sync
  • Contains project-wide permissions (allow/deny/ask rules for tools)
  • Protects server/ directory from accidental modifications by Claude
  • Centrally managed - analysts cannot override these permissions locally

Client: init.sh + setup_views.sh

First time setup (init.sh):

./scripts/init.sh

Creates virtual environment, installs dependencies, and creates data folders including duckdb/.

After rsync (setup_views.sh):

bash server/scripts/setup_views.sh

Initializes DuckDB views from synced Parquet files. DuckDB database is created at user/duckdb/analytics.duckdb.

Steps:

  1. Activates virtual environment
  2. Runs duckdb_manager.py --reinit for core Keboola tables (from data_description.md)
  3. Calls optional dataset scripts with --views-only flag:
    • If server/parquet/jira/ exists → sync_jira.sh --views-only (creates jira_issues, jira_comments, jira_attachments, jira_changelog views)
    • Future datasets follow the same pattern (e.g., sync_github.sh --views-only)

Convention: Each data source sync script (e.g., sync_jira.sh) manages its own DuckDB views. The --views-only flag creates/refreshes views without syncing data. This keeps duckdb_manager.py focused on core tables while optional datasets are self-contained.
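
To make the view convention concrete, here is a sketch that builds the SQL a views-only step might run. It only generates statements and does not import DuckDB. The directory layout `<root>/<folder>/<table>/*.parquet` and the function name view_statements are assumptions; the real scripts may name and locate views differently.

```python
from pathlib import Path


def view_statements(parquet_root: Path) -> list[str]:
    """Build one CREATE OR REPLACE VIEW statement per table directory.

    Assumed layout: <parquet_root>/<folder>/<table>/*.parquet.
    Directories without any Parquet file are skipped.
    """
    statements = []
    for table_dir in sorted(parquet_root.glob("*/*")):
        if not table_dir.is_dir() or not any(table_dir.glob("*.parquet")):
            continue
        # Escape quotes for SQL identifiers and string literals.
        name = table_dir.name.replace('"', '""')
        pattern = str(table_dir / "*.parquet").replace("'", "''")
        statements.append(
            f'CREATE OR REPLACE VIEW "{name}" AS '
            f"SELECT * FROM read_parquet('{pattern}');"
        )
    return statements
```

Using CREATE OR REPLACE makes the step idempotent, which matches the idea that views can be refreshed at any time without re-syncing data.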

Server Purpose

  1. Sync from Keboola - periodically pulls data from Keboola Storage
  2. Convert to Parquet - transforms data to efficient format
  3. Chunking - splits data by hour for incremental sync
  4. Distribution - clients pull data via rsync to local machines
  5. On-server analysis - analysts can run scripts directly on the server

Usage Guide

User Types

| Type | Groups | Data Access | Use Case |
|---|---|---|---|
| Standard Analyst | dataread | Public data | Regular analysts, data scientists |
| Privileged Analyst | dataread + data-private | Public + private | Executives, management |
| Admin | sudo + data-ops + all data groups | Everything + server + deployment | DevOps, IT team |
  • Standard analysts see all company data except sensitive information stored in private/
  • Privileged analysts have access to everything including executive reports and financial details
  • Admins can manage the server, add/remove users, and have full sudo access

What Each User Gets

Every analyst has their own Linux account with:

/home/username/
├── server/                         # Symlinks to shared read-only data on /data
│   ├── docs -> /data/docs
│   ├── scripts -> /data/scripts
│   ├── examples -> /data/examples
│   ├── parquet -> /data/src_data/parquet
│   └── metadata -> /data/src_data/metadata
├── user/                           # User's OWN writable directories
│   ├── duckdb/                     # Per-user DuckDB database
│   │   └── analytics.duckdb
│   ├── notifications/              # Notification scripts (*.py)
│   ├── artifacts/                  # Analysis outputs
│   ├── scripts/                    # Custom scripts
│   └── parquet/                    # Custom parquet files
├── .notifications/                 # Notification runner state
│   ├── state/                      # Cooldown tracking per script
│   └── logs/                       # Runner and cron logs
└── .ssh/authorized_keys            # SSH key for authentication
  • Home directory (/home/username/) - private space for each user
  • Server data (~/server/) - read-only symlinks to shared /data/ on disk
  • User workspace (~/user/) - writable directories for user's own files
  • DuckDB (~/user/duckdb/analytics.duckdb) - per-user database built from shared parquet

Typical Workflow

Option A: Local analysis with rsync (recommended)

  1. Analyst syncs data to their local machine:

    # Recommended: use the sync script
    bash server/scripts/sync_data.sh
    
    # Or manual rsync
    rsync -avz data-analyst:server/parquet/ ./server/parquet/
    
  2. Run analysis locally with Claude Code or other tools

  3. Data stays on analyst's machine - they can do whatever they want with it

Option B: Server-side analysis

  1. SSH into the server:

    ssh username@YOUR_SERVER_IP
    
  2. Work in personal workspace:

    cd ~/user
    # Run scripts, analyze data from ~/server/parquet/
    
  3. Copy results back to local machine if needed

Data Access Examples

Standard analyst (public data only):

$ ls ~/server/parquet/
sales/  products/  customers/  orders/  private/

$ ls ~/server/parquet/private/
ls: cannot open directory '/home/username/server/parquet/private/': Permission denied

Privileged analyst (public + private):

$ ls ~/server/parquet/
sales/  products/  customers/  orders/  private/

$ ls ~/server/parquet/private/
executive_reports/  financial_details/  board_materials/

Rsync Permissions

When syncing with rsync:

  • Standard analysts will get "Permission denied" errors for private/ folder (expected)
  • Use --exclude='private/' to skip it cleanly:
    rsync -avz --exclude='private/' data-analyst:server/parquet/ ./server/parquet/
    
  • Privileged analysts can sync everything including private data

Monitoring

Cloud Monitoring (GCP)

Ops Agent is installed and reports VM metrics to Cloud Monitoring, including disk space utilization.

Installation (already done):

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

Check agent status:

sudo systemctl status google-cloud-ops-agent

Available metrics:

  • agent.googleapis.com/disk/percent_used - Disk utilization percentage
  • agent.googleapis.com/memory/percent_used - Memory utilization
  • agent.googleapis.com/cpu/utilization - CPU usage
  • agent.googleapis.com/network/traffic - Network I/O

View metrics in GCP Console:

  1. Go to Cloud Console > Monitoring > Metrics Explorer
  2. Select resource type: VM Instance
  3. Select metric: agent.googleapis.com/disk/percent_used
  4. Filter by device: /dev/sdb (data disk)

Alert Policy for Disk Space:

The alert fires when the /data partition stays above 85% usage for 5 minutes.

To create the alert policy manually:

  1. Go to Cloud Console > Monitoring > Alerting
  2. Click Create Policy
  3. Click Add Condition:
    • Resource type: VM Instance
    • Metric: agent.googleapis.com/disk/percent_used
    • Filter: metadata.system_labels.device="/dev/sdb" AND metadata.system_labels.state="used"
    • Threshold: > 85
    • Duration: 5 minutes
  4. Click Next > Notifications (add email/Slack channel)
  5. Click Next > Documentation:
    Disk /data partition is above 85% full.
    
    Check /data/src_data/ for large files or run cleanup.
    
    Common causes:
    - Keboola data sync (check cron logs)
    - bot.log growth (check /data/notifications/bot.log)
    - Jira attachments (check /data/src_data/raw/jira/attachments/)
    
  6. Name: "Disk Space Alert - /data partition"
  7. Click Create Policy

Cost: Free tier (first 150 time series free, this VM uses ~25)

Dashboard: Available in GCP Console > Monitoring > Dashboards > "VM Instances"

Local Monitoring

# Server status
ssh kids "uptime && free -h && df -h / /data /home"

# Active users
ssh kids "who"

# Recent logins
ssh kids "last | head -20"

# Check disk space for all partitions
ssh kids "df -h"

# Check disk usage by directory
ssh kids "du -sh /data/*"
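
The same 85% threshold used by the Cloud Monitoring alert can be checked locally. The sketch below uses only the standard library; the function names are illustrative and this script is not part of the repository.

```python
import shutil


def percent_used(path: str) -> float:
    """Return used space as a percentage of total for the filesystem at `path`.

    This is used/total, so it can differ slightly from the Use% column
    of df, which excludes blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(paths, threshold: float = 85.0) -> dict:
    """Return {path: percent} for every path above `threshold`."""
    result = {}
    for path in paths:
        pct = percent_used(path)
        if pct > threshold:
            result[path] = pct
    return result
```

On the server, `over_threshold(["/", "/data", "/home", "/tmp"])` returns an empty dict when all four mounts are below the threshold.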

Backup & Disaster Recovery

Disk Layout

| Disk | Mount | Size | Purpose | Backup |
|---|---|---|---|---|
| data-broker-for-claude (sda) | / | 10 GB | OS, packages, app | Expendable (rebuild from git) |
| data-disk (sdb) | /data | 30 GB | Parquet data, docs, scripts | Daily GCP snapshots |
| home-disk (sdc) | /home | 30 GB | User homes, SSH keys, workspaces | Daily GCP snapshots |
| tmp-disk (sdd) | /tmp | 100 GB | Temporary files | Expendable (not snapshotted) |

Automatic Snapshots

Both data-disk and home-disk have daily GCP snapshot schedules with 14-day retention, set up by server/setup-snapshot-schedule.sh.

# Check snapshot schedule status
gcloud compute resource-policies describe daily-backup \
  --project=kids-ai-data-analysis --region=europe-north1

# List existing snapshots
gcloud compute snapshots list --project=kids-ai-data-analysis

# Manual snapshot (if needed)
gcloud compute disks snapshot data-disk home-disk \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --snapshot-names=data-disk-$(date +%Y%m%d),home-disk-$(date +%Y%m%d)

Recovery

See disaster-recovery.md for detailed recovery procedures for each failure scenario.

Application Deployment

Directory Structure

/opt/data-analyst/          # Application directory (group: data-ops)
├── repo/                   # Git repository
│   ├── src/                # Python source code
│   ├── scripts/            # Data sync scripts
│   ├── server/             # Server management scripts
│   │   ├── bin/            # add-analyst, notify-runner, notify-scripts, etc.
│   │   └── telegram_bot/   # Telegram bot service
│   ├── webapp/             # Flask web application
│   └── examples/           # Example notification scripts
├── .venv/                  # Python virtual environment
├── .env                    # Webapp env (Google OAuth, secret key)
└── logs/                   # Application logs

CI/CD Pipeline

The application is deployed automatically via GitHub Actions whenever changes are pushed to the main branch.

How it works:

  1. Push to main triggers GitHub Actions workflow
  2. Action connects to server via SSH as deploy user
  3. Runs /opt/data-analyst/repo/server/deploy.sh
  4. Deploy script:
    • Pulls latest code from origin/main
    • Updates server management scripts in /usr/local/bin/
    • Updates sudoers configurations (/etc/sudoers.d/)
    • Updates resource limits (/etc/security/limits.d/99-users.conf)
    • Deploys notify-runner and notify-scripts to /usr/local/bin/
    • Creates data directories:
      • /data/notifications/ (notification state)
      • /data/src_data/raw/jira/ (Jira webhook data)
      • /data/auth/ (password auth)
      • /data/corporate-memory/ (knowledge base)
      • /data/user_sessions/ (session logs)
      • /data/examples/ (example scripts)
      • /tmp/keboola_load/ (Keboola staging)
    • Deploys systemd units:
      • notify-bot.service (Telegram bot)
      • ws-gateway.service (WebSocket gateway)
      • corporate-memory.{service,timer} (knowledge collector)
      • jira-sla-poll.{service,timer} (SLA refresh)
      • jira-consistency.{service,timer,timer-deep} (data integrity monitoring)
      • session-collector.{service,timer} (session logs)
    • Sets ACLs for Jira attachments (dataread group)
    • Creates/updates Keboola .env file (if secrets provided)
    • Sets correct permissions on /opt/data-analyst/
    • Restarts webapp, notify-bot, ws-gateway services
    • Enables/starts timers (if credentials configured)

Deploy user permissions: The deploy user has limited sudo access defined in /etc/sudoers.d/deploy:

Core Operations:

  • Can copy scripts to /usr/local/bin/
  • Can update sudoers files in /etc/sudoers.d/
  • Can manage permissions on /opt/data-analyst/
  • Can update resource limits in /etc/security/limits.d/

Service Management:

  • Can restart/reload webapp, nginx services
  • Can manage notify-bot, ws-gateway services
  • Can manage corporate-memory timer
  • Can manage jira-sla-poll timer
  • Can manage jira-consistency timers (incremental + deep)
  • Can manage session-collector timer
  • Can run systemctl daemon-reload

Data Directories:

  • Can manage /data/scripts/ (helper scripts for analysts)
  • Can manage /data/docs/ (documentation)
  • Can manage /data/notifications/ (notification state)
  • Can manage /data/examples/ (example scripts)
  • Can manage /data/src_data/raw/jira/ (Jira webhook data)
  • Can manage /data/auth/ (password auth state)
  • Can manage /data/corporate-memory/ (knowledge base)
  • Can manage /data/user_sessions/ (session collector data)
  • Can manage /tmp/keboola_load/ (Keboola staging directory)

Special Permissions:

  • Can run notify-scripts as any user (list/run notification scripts)
  • Can set ACLs on Jira attachments (dataread group access)
  • Can create log files in /opt/data-analyst/logs/

Full sudoers reference: server/sudoers-deploy in repository

Note: On Debian 12, core utilities live in /usr/bin/ (/bin is a symlink to it). sudoers matches commands by literal path, so the file uses full paths such as /usr/bin/cp and /usr/bin/chmod.

Initial Setup (one-time)

1. Install prerequisites:

sudo apt-get update
sudo apt-get install -y git python3.11-venv python3-pip

2. Create deploy user and SSH key for GitHub:

# Create deploy user
sudo useradd -m -s /bin/bash deploy
sudo groupadd data-ops 2>/dev/null || true
sudo usermod -aG data-ops deploy

# Generate SSH key for GitHub
sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'

# Configure SSH for GitHub
sudo -u deploy bash -c 'echo -e "Host github.com\n  IdentityFile ~/.ssh/id_ed25519\n  StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
sudo chmod 600 /home/deploy/.ssh/config

# Show public key (add this to GitHub as Deploy Key)
sudo cat /home/deploy/.ssh/id_ed25519.pub

3. Add the public key printed above to GitHub as a Deploy Key (repository Settings > Deploy keys > Add deploy key).

4. Clone repository and run setup:

sudo mkdir -p /opt/data-analyst
sudo chown deploy:data-ops /opt/data-analyst
sudo -u deploy git clone git@github.com:keboola/internal_ai_data_analyst.git /opt/data-analyst/repo
sudo git config --global --add safe.directory /opt/data-analyst/repo
sudo -u deploy git config --global --add safe.directory /opt/data-analyst/repo
sudo /opt/data-analyst/repo/server/setup.sh

5. Add existing admins to data-ops group:

sudo usermod -aG data-ops padak
sudo usermod -aG data-ops matejkys
sudo usermod -aG data-ops dasa

GitHub Secrets Required

Set these in GitHub repository settings (Settings > Secrets > Actions):

| Secret | Value |
|---|---|
| SERVER_HOST | YOUR_SERVER_IP |
| SERVER_USER | deploy |
| SERVER_SSH_KEY | Private SSH key (/home/deploy/.ssh/id_ed25519) |
| TELEGRAM_BOT_TOKEN | Telegram Bot API token (from @BotFather) |
| SENDGRID_API_KEY | SendGrid API key for password auth emails |
| ALLOWED_EMAILS | Comma-separated whitelisted emails for password auth |

Manual Deployment

Admins can trigger deployment manually:

# Via GitHub Actions UI (Actions > Deploy to Server > Run workflow)
# Or via SSH:
ssh kids "cd /opt/data-analyst/repo && ./server/deploy.sh"

Deployment Logs

# View deployment history
cat /opt/data-analyst/logs/deploy.log

# Follow live deployment
tail -f /opt/data-analyst/logs/deploy.log

Troubleshooting CI/CD

"sudo: a terminal is required to read the password"

  • Deploy user is missing NOPASSWD sudo permission for a specific command
  • Check /etc/sudoers.d/deploy exists and has correct permissions (440)
  • Verify the command path matches (Debian 12 uses /usr/bin/, not /bin/)
  • Fix: Add missing permission to server/sudoers-deploy and redeploy:
    # Edit server/sudoers-deploy in repo
    # Add the missing command with full path
    deploy ALL=(ALL) NOPASSWD: /usr/bin/command-name args
    
    # Commit and push
    git add server/sudoers-deploy
    git commit -m "Add missing sudo permission"
    git push origin main
    
    # Manually update on server (one-time)
    ssh kids "sudo cp /opt/data-analyst/repo/server/sudoers-deploy /etc/sudoers.d/deploy"
    ssh kids "sudo chmod 440 /etc/sudoers.d/deploy"
    

"Permission denied" on .env file

  • Deploy user cannot write directly to files owned by root
  • Solution: Use sudo /usr/bin/tee instead of direct file write

Deploy script changes not taking effect

  • The deploy script pulls new code AFTER it starts running
  • Changes to deploy.sh itself require manual pull first:
    ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && git pull'"
    

Verify sudoers configuration:

# Check if sudoers file exists and has correct permissions
ssh kids "ls -la /etc/sudoers.d/deploy"

# Validate syntax (exit code 0 = OK)
ssh kids "sudo visudo -cf /etc/sudoers.d/deploy && echo 'Syntax OK'"

# View current sudoers rules
ssh kids "sudo cat /etc/sudoers.d/deploy"

Test deploy locally as deploy user:

ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"

Web Application (Self-Service Portal)

A web application at https://your-instance.example.com allows team members to create their own analyst accounts via Google SSO.

Features

  • Google Sign-In (restricted to @your-domain.com emails only)
  • Email/password login for external users (whitelisted emails)
  • Self-service account creation for new users
  • Dashboard showing account info for existing users (2-column layout)
  • Dynamic data stats (tables, columns, rows, size) loaded from sync_state.json
  • Data catalog page with dynamic table listings from data_description.md + sync_state.json
  • Data profiler with per-column statistics, visualizations, and alerts (from profiles.json)
  • SSH connection instructions
  • Claude Code integration hints for AI-assisted setup
  • Telegram notification linking
  • macOS desktop app linking/unlinking with install instructions

User Flow

  1. User visits https://your-instance.example.com
  2. Signs in with Google (@your-domain.com account)
  3. Dashboard shows instructions and form for SSH key
  4. User can ask Claude Code to generate SSH key and guide them
  5. After pasting SSH key, account is created automatically
  6. User syncs data and starts analyzing with Claude Code

Dynamic Data Stats

Dashboard and catalog pages display live data statistics (table count, columns, rows, size). These are loaded dynamically from sync_state.json on every page request - no webapp restart needed.

Data flow:

Cron (update.sh) → data_sync.py → /data/src_data/metadata/sync_state.json
                                                    ↓
                              Flask reads on request → dashboard + catalog templates
  • sync_state.json is updated by the data sync process with per-table stats (rows, columns, file size)
  • Flask aggregates these into totals for display
  • If sync_state.json is missing or unreadable, hardcoded fallback values are used
  • Catalog page merges data_description.md (table names, descriptions, categories) with sync_state.json (row counts)
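
The aggregation and fallback described above can be sketched as follows. The JSON shape (a top-level "tables" object whose entries carry "rows", "columns" and "size_bytes"), the zero fallback values, and the name load_totals are all assumptions for illustration; the real sync_state.json schema is not documented here.

```python
import json
from pathlib import Path

# Hypothetical fallback totals used when the state file is unusable.
FALLBACK = {"tables": 0, "columns": 0, "rows": 0, "size_bytes": 0}


def load_totals(path: Path) -> dict:
    """Aggregate per-table stats into totals, or return the fallback.

    Assumed input shape:
    {"tables": {"<name>": {"rows": int, "columns": int, "size_bytes": int}}}
    """
    try:
        tables = json.loads(path.read_text())["tables"]
        return {
            "tables": len(tables),
            "columns": sum(t["columns"] for t in tables.values()),
            "rows": sum(t["rows"] for t in tables.values()),
            "size_bytes": sum(t["size_bytes"] for t in tables.values()),
        }
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Missing file, invalid JSON, or an unexpected structure.
        return dict(FALLBACK)
```

Reading the file on each request is what makes a webapp restart unnecessary; the cost is one small file read per page view.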

Architecture

Browser -> Nginx (HTTPS/Let's Encrypt) -> Gunicorn -> Flask App
                                                         |
                                                         v
                                              sudo add-analyst (via sudoers)

Setup

1. Run webapp setup script:

sudo /opt/data-analyst/repo/server/webapp-setup.sh

2. Configure Google OAuth:

  • Go to Google Cloud Console
  • Create OAuth 2.0 Client ID (Web application)
  • Authorized JavaScript origins: https://your-instance.example.com
  • Authorized redirect URIs: https://your-instance.example.com/authorize

3. Update environment file:

sudo nano /opt/data-analyst/.env

# Add:
WEBAPP_SECRET_KEY=<generate with: python3 -c "import secrets; print(secrets.token_hex(32))">
GOOGLE_CLIENT_ID=<from Google Console>
GOOGLE_CLIENT_SECRET=<from Google Console>

4. Start/restart webapp:

sudo systemctl restart webapp

Monitoring

# Service status
sudo systemctl status webapp
sudo systemctl status nginx

# Logs
tail -f /opt/data-analyst/logs/webapp-access.log
tail -f /opt/data-analyst/logs/webapp-error.log

# Test endpoint
curl -I https://your-instance.example.com/health

Security Notes

  • Only @your-domain.com emails can log in via Google OAuth
  • External users can log in via email/password if their email is whitelisted
  • Self-service creates standard analyst accounts only (no --private flag)
  • www-data is a member of the data-ops group (for access to /opt/data-analyst and static files)
  • www-data can run only add-analyst and notify-scripts via sudoers (not add-admin) - configured in /etc/sudoers.d/webapp
  • HTTPS enforced with Let's Encrypt certificate
  • SSH keys are validated before passing to add-analyst script
  • Reserved system usernames (root, admin, deploy, etc.) are blocked from registration
  • Username collision with existing system accounts shows error and requires admin intervention
  • Password auth uses Argon2id hashing with rate limiting (5 failed attempts per minute per email)
  • Magic links for password setup expire in 24 hours, reset links in 1 hour
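
The rate limit listed above can be pictured as a sliding window of failed attempts per email. The following is a minimal in-memory sketch under that assumption; the class name FailedLoginLimiter and its methods are illustrative and are not taken from the webapp code.

```python
import time
from collections import defaultdict, deque


class FailedLoginLimiter:
    """Sliding window: at most `max_attempts` failures per `window` seconds per email."""

    def __init__(self, max_attempts=5, window=60.0):
        self.max_attempts = max_attempts
        self.window = window
        self._failures = defaultdict(deque)  # email -> timestamps of recent failures

    def _prune(self, email, now):
        # Drop failures that have fallen out of the window.
        queue = self._failures[email]
        while queue and now - queue[0] >= self.window:
            queue.popleft()
        return queue

    def is_blocked(self, email, now=None):
        now = time.monotonic() if now is None else now
        return len(self._prune(email.lower(), now)) >= self.max_attempts

    def record_failure(self, email, now=None):
        now = time.monotonic() if now is None else now
        self._prune(email.lower(), now).append(now)
```

A login handler would call is_blocked() before verifying the password and record_failure() after a failed verification. State kept in process memory is not shared between Gunicorn workers, so a real deployment would need a shared store.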

Technical Notes

Sudoers configuration:

The webapp needs sudo access to run add-analyst and notify-scripts. This is configured in the server/sudoers-webapp file, which is deployed to /etc/sudoers.d/webapp:

www-data ALL=(ALL) NOPASSWD: /usr/local/bin/add-analyst
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts

Absolute paths requirement:

Gunicorn runs with a restricted PATH (only /opt/data-analyst/.venv/bin). Therefore, all system commands in Python code must use absolute paths:

  • /usr/bin/sudo (not just sudo)
  • /usr/local/bin/add-analyst
  • /usr/local/bin/notify-scripts

This is handled in webapp/user_service.py and server/telegram_bot/runner.py.

Username Generation

The username is generated from the email address: the part before @, converted to lowercase.

Examples:

  • Petr.Simecek@your-domain.com -> petr.simecek
  • john@your-domain.com -> john

If a username conflicts with a reserved system name or existing non-analyst account, the user sees an error and must contact an admin to create the account manually with a different username.

Prerequisites

GCP Firewall:

# Allow HTTP/HTTPS traffic (required for Let's Encrypt and webapp)
gcloud compute firewall-rules create allow-http-data-broker \
  --project=kids-ai-data-analysis \
  --direction=INGRESS \
  --priority=1000 \
  --network=default \
  --action=ALLOW \
  --rules=tcp:80,tcp:443 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=http-server,https-server

# Add tags to VM
gcloud compute instances add-tags data-broker-for-claude \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --tags=http-server,https-server

DNS:

  • A record: your-instance.example.com -> YOUR_SERVER_IP

Password Authentication for External Users

External users (investors, partners) who don't have @your-domain.com Google accounts can authenticate using email/password.

How It Works

  1. Admin adds email to whitelist (via GitHub Secrets):

    • Go to GitHub repo Settings > Secrets > Actions
    • Update ALLOWED_EMAILS secret (comma-separated list)
    • Push any change to trigger deploy, or manually restart webapp
  2. User visits login page and clicks "Sign in with Email"

  3. First-time setup (Sign Up tab):

    • User enters their whitelisted email
    • Clicks "Request Access"
    • Receives email with setup link (valid 24 hours)
    • Sets up password via the link
  4. Subsequent logins (Sign In tab):

    • User enters email + password
    • Same session/dashboard as Google OAuth users

Username Generation

Usernames are derived from email addresses differently for internal vs external users:

Email Username Type
john.doe@your-domain.com john.doe Internal (Google OAuth)
emily@investor.com emily_investor_com External (password auth)
partner@example.org partner_example_org External (password auth)

This prevents username collisions between internal and external users.
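
The mapping in the table can be sketched as a small function. This is an illustration only: the function name derive_username, the reserved-name subset, and the handling of dots in the local part of external addresses (kept unchanged here) are assumptions, not the webapp's actual code.

```python
INTERNAL_DOMAIN = "your-domain.com"  # placeholder domain used throughout this document
RESERVED_USERNAMES = {"root", "admin", "deploy"}  # illustrative subset only


def derive_username(email, internal_domain=INTERNAL_DOMAIN):
    """Return the account name for an email, following the table above."""
    local, _, domain = email.strip().lower().partition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    if domain == internal_domain:
        # Internal (Google OAuth): the part before @.
        username = local
    else:
        # External (password auth): append the domain, dots replaced by underscores.
        username = f"{local}_{domain.replace('.', '_')}"
    if username in RESERVED_USERNAMES:
        raise ValueError(f"reserved username: {username}")
    return username
```

Because external usernames always carry the domain suffix, an external emily@investor.com can never collide with an internal emily@your-domain.com.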

Configuration

GitHub Secrets (recommended):

Secret Description
ALLOWED_EMAILS Comma-separated list of whitelisted emails
SENDGRID_API_KEY SendGrid API key for sending emails
EMAIL_FROM_ADDRESS Sender email address (e.g., noreply@your-domain.com)
EMAIL_FROM_NAME Sender display name (e.g., Data Analyst Platform)

Data storage:

/data/auth/                         # Password auth data (www-data:data-ops, 2770)
└── password_users.json             # User records (hashes, tokens, metadata)

Security Features

  • Argon2id password hashing (memory-hard; the variant recommended by RFC 9106)
  • Rate limiting: 5 failed attempts per minute per email
  • Single-use tokens: Setup/reset links invalidate after use
  • Token expiry: Setup 24h, reset 1h
  • No email enumeration: Reset endpoint always shows same message
  • Password requirements: Min 8 chars, uppercase, lowercase, digit
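
The password requirements above translate into a short validation routine. A minimal sketch, assuming ASCII character classes; the names PASSWORD_RULES and password_problems are hypothetical.

```python
import re

# Each rule pairs a check with the requirement text shown to the user.
PASSWORD_RULES = [
    (lambda p: len(p) >= 8, "at least 8 characters"),
    (lambda p: re.search(r"[A-Z]", p) is not None, "an uppercase letter"),
    (lambda p: re.search(r"[a-z]", p) is not None, "a lowercase letter"),
    (lambda p: re.search(r"[0-9]", p) is not None, "a digit"),
]


def password_problems(password):
    """Return the list of unmet requirements (empty when the password is acceptable)."""
    return [message for check, message in PASSWORD_RULES if not check(password)]
```

Returning every unmet requirement at once lets the setup form report all problems in a single round trip.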

Password Reset

Users can reset their password via "Forgot Password?" link on the Sign In tab. They receive an email with a reset link valid for 1 hour.

Telegram Notification Bot

A Telegram bot (@YourBot) allows analysts to receive alerts from their custom notification scripts.

Architecture

Telegram Bot Service (systemd: notify-bot)
├── Telegram polling (handles /start, /test commands)
└── HTTP server on unix socket (/run/notify-bot/bot.sock)
        ▲
        │ POST /send, POST /send_photo
        │
notify-runner (user crontab, /usr/local/bin/notify-runner)
└── Executes ~/user/notifications/*.py

The webapp reads/writes shared JSON files in /data/notifications/ for user-Telegram linking (verification codes, user mappings).

Services

Service User Description
notify-bot deploy:data-ops Telegram polling + send API on unix socket
webapp www-data:data-ops Dashboard with Telegram link/unlink UI

Bot Commands

Command Description
/start Link account (or show status if already linked)
/whoami Show username and email
/status List notification scripts with Run buttons
/test Send a demo graphical report
/help Show available commands

The /status command shows inline keyboard buttons to run scripts on demand. Scripts are executed as the owning user via sudo -u using the notify-scripts helper (see below).

Data Files

/data/notifications/            # deploy:data-ops, mode 2770 (setgid, no others)
├── telegram_users.json         # username -> {chat_id, linked_at}
├── desktop_users.json          # username -> {linked_at} (desktop app link state)
├── pending_codes.json          # code -> {chat_id, created_at}
└── bot.log                     # Bot service log

/run/notify-bot/                # systemd RuntimeDirectory (mode 0755)
└── bot.sock                    # Unix socket for send API (mode 0666)

The setgid bit (2770) ensures all files created in /data/notifications/ inherit the data-ops group, allowing both the bot service (deploy) and webapp (www-data) to read/write them. Analysts have no access to this directory.

The socket is in /run/notify-bot/, a systemd-managed directory with 0755 permissions, so any local user can connect to send notifications.

Notification Runner

Users create Python scripts in ~/user/notifications/ that output JSON to stdout. The notify-runner script (installed at /usr/local/bin/notify-runner) executes these scripts and sends results via the bot's unix socket.

Per-user state is stored in ~/.notifications/state/ (cooldown tracking) and logs in ~/.notifications/logs/.

Users configure their own crontab:

crontab -e
# Add:
*/5 * * * * ~/.venv/bin/python /usr/local/bin/notify-runner >> ~/.notifications/logs/cron.log 2>&1

Notify-Scripts Helper

The notify-scripts helper (/usr/local/bin/notify-scripts) provides a secure way for services (webapp, Telegram bot) to list and run user notification scripts without needing filesystem access to user home directories.

Why it exists: User home directories are set to 750 permissions. Services like www-data and deploy cannot traverse /home/{user}/ to read scripts or state files. The helper runs as the target user via sudo -u, so it has full access to ~/user/notifications/ and ~/.notifications/state/.

Usage:

# List scripts with last_run metadata (returns JSON array)
sudo -u <username> /usr/local/bin/notify-scripts list

# Run a script and return its JSON output
sudo -u <username> /usr/local/bin/notify-scripts run <script_name.py>

# Get last sync time (returns JSON with elapsed_seconds, elapsed_display)
sudo -u <username> /usr/local/bin/notify-scripts sync-status

The sync-status command reads the mtime of the ~/server/ directory, which sync_data.sh updates via touch ~/server/ at the end of each sync. Each user has their own ~/server/ directory (containing symlinks to shared /data/), so timestamps are per-user.

Callers:

  • server/telegram_bot/status.py - /status command and script list API
  • server/telegram_bot/runner.py - on-demand script execution (Telegram "Run" button, webapp API)
  • webapp/account_service.py - Account card "Last Sync" display

Sudoers rules:

# /etc/sudoers.d/webapp
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts

# /etc/sudoers.d/deploy
deploy ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts

Monitoring

# Bot service
sudo systemctl status notify-bot
tail -f /data/notifications/bot.log

# Linked users
cat /data/notifications/telegram_users.json | python3 -m json.tool

# Runner logs (per user)
cat ~/.notifications/logs/runner.log

Security

  • Bot token is stored centrally in /opt/data-analyst/repo/.env (loaded via systemd EnvironmentFile)
  • Users never see the token - they communicate via unix socket only
  • Socket in /run/notify-bot/bot.sock (systemd RuntimeDirectory, mode 0755), socket itself 0666
  • /data/notifications/ is 2770 (only deploy + data-ops), no analyst access to logs or user mappings
  • Notification scripts run under the user's own account (no sudo) when triggered by crontab
  • On-demand runs (via /status button and webapp API) use sudo -u <user> /usr/local/bin/notify-scripts -- services never access user home directories directly
  • Scripts have a 60-second timeout (enforced by notify-scripts helper)
  • Verification codes expire after 10 minutes and are single-use
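
The expiry and single-use rules for verification codes can be sketched against the pending_codes.json layout (code -> {chat_id, created_at}). This sketch assumes created_at is stored as epoch seconds; the stored format and the function name redeem_code are assumptions.

```python
CODE_TTL_SECONDS = 10 * 60  # codes expire after 10 minutes


def redeem_code(pending, code, now):
    """Consume `code` from the `pending` mapping.

    Returns the linked chat_id if the code exists and has not expired, else None.
    The entry is removed in either case, so a code can never be used twice.
    """
    entry = pending.pop(code, None)
    if entry is None:
        return None
    if now - entry["created_at"] > CODE_TTL_SECONDS:
        return None
    return entry["chat_id"]
```

The caller is expected to load the JSON file, call redeem_code(), and write the mapping back, so that the removal is persisted.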

Known Issues

On-demand script execution security hardening (partially resolved): The notify-scripts helper replaced direct sudo -H -u ... /usr/bin/env ... calls with a single auditable entry point. Services no longer need filesystem access to user home directories (750 permissions are preserved). The bot still requires NoNewPrivileges=false and /tmp in ReadWritePaths for sudo execution. A queue-based approach (#51) could further improve this by having notify-runner pick up run requests from a queue instead of the bot calling sudo directly.

Data Sync Settings (Web Portal)

Users can configure which optional datasets to sync via the web portal at https://your-instance.example.com. Settings are stored server-side and downloaded by sync_data.sh before each sync.

Architecture

┌─────────────────────────────────────┐
│  Web Portal (Dashboard)             │
│  └── Data Settings widget           │
│      ├── Toggle: Jira (~50 MB)      │
│      └── Toggle: Jira Attachments   │
│                (~500 MB+)           │
└─────────────────────────────────────┘
              │ POST /api/sync-settings
              ▼
┌─────────────────────────────────────┐
│  Flask API                          │
│  ├── Save to sync_settings.json     │
│  └── Write ~/.sync_settings.yaml    │
│      (via sudo install)             │
└─────────────────────────────────────┘
              │
              ▼
/data/notifications/sync_settings.json  ← Central storage (all users)
/home/{user}/.sync_settings.yaml        ← Per-user config file
              │
              ▼ scp (analyst sync)
┌─────────────────────────────────────┐
│  sync_data.sh (client)              │
│  ├── Download ~/.sync_settings.yaml │
│  ├── Read dataset toggles           │
│  └── Conditionally run sync_jira.sh │
└─────────────────────────────────────┘

Data Files

File Location Purpose
sync_settings.json /data/notifications/ Central storage for all users' settings
.sync_settings.yaml /home/{user}/ Per-user config file (YAML format)

sync_settings.json format:

{
  "petr.simecek": {
    "datasets": {
      "jira": true,
      "jira_attachments": false
    },
    "updated_at": "2026-02-03T12:00:00Z"
  }
}

Per-user .sync_settings.yaml format:

# Data Analyst - Sync Configuration
# Managed by web portal - changes here may be overwritten

datasets:
  jira: true
  jira_attachments: false
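
To illustrate how one user's entry in sync_settings.json maps to the per-user file, here is a minimal sketch that renders the YAML without a YAML library and checks a toggle with the same pattern the client uses. The function names are hypothetical and the real service may generate the file differently.

```python
import re

HEADER = (
    "# Data Analyst - Sync Configuration\n"
    "# Managed by web portal - changes here may be overwritten\n"
    "\n"
)


def render_sync_settings(datasets):
    """Render a {'jira': True, ...} mapping as the per-user YAML text."""
    body = "".join(
        f"  {name}: {'true' if enabled else 'false'}\n"
        for name, enabled in datasets.items()
    )
    return HEADER + "datasets:\n" + body


def dataset_enabled(yaml_text, name):
    """Mirror of the client-side check: grep -qE '^\\s*<name>:\\s*true'."""
    pattern = rf"^\s*{re.escape(name)}:\s*true"
    return re.search(pattern, yaml_text, flags=re.MULTILINE) is not None
```

Since the key must be followed directly by a colon, checking for jira does not accidentally match the jira_attachments line.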

Sudoers Configuration

The webapp needs sudo to write config files to user home directories. This is configured in /etc/sudoers.d/webapp-sync:

# Allow webapp to install sync settings to user home directories
www-data ALL=(ALL) NOPASSWD: /usr/bin/install -o * -g * -m 644 /tmp/*.yaml /home/*/.sync_settings.yaml

Why this approach:

  • Webapp runs as www-data which cannot write to /home/{user}/
  • The install command copies the file and sets its owner, group, and mode in a single invocation
  • Tempfile must be in /tmp/ because the sudoers rule only matches /tmp/*.yaml as the source path
  • Target is restricted to .sync_settings.yaml only

Client Sync Flow

When sync_data.sh runs:

  1. Downloads config from server:

    scp -q data-analyst:~/.sync_settings.yaml /tmp/.sync_settings_$(id -u).yaml
    
  2. If no config exists on server, creates default (jira: false)

  3. Reads config and conditionally runs dataset sync scripts:

    if grep -qE '^\s*jira:\s*true' "$SYNC_CONFIG_LOCAL"; then
        bash sync_jira.sh
    fi
    
  4. sync_jira.sh syncs data AND creates DuckDB views automatically (no separate step needed)

  5. sync_jira.sh checks jira_attachments setting for attachment sync

Available Datasets

Dataset Size Description
jira ~50 MB Support tickets from SUPPORT project (issues, comments, changelog, attachment metadata)
jira_attachments ~500 MB+ Actual attachment files (images, logs, etc.). Requires jira to be enabled.

API Endpoints

Endpoint Method Description
/api/sync-settings GET Get current user's sync settings
/api/sync-settings POST Update settings and regenerate user config

Troubleshooting

Settings not being saved to user home:

  • Check /etc/sudoers.d/webapp-sync exists
  • Verify tempfile is created in /tmp/ (not other directory)
  • Check webapp logs: tail -f /opt/data-analyst/logs/webapp-error.log

Old scripts on client after sync:

  • sync_data.sh downloads scripts from /data/scripts/ on server
  • Ensure deploy.sh copies all scripts including sync_jira.sh
  • If scripts are missing from /data/scripts/, run manual deploy or CI/CD

Jira Webhook Integration

Receives webhooks from Atlassian Jira to maintain a real-time copy of issue data for analysis.

Architecture

Jira Cloud (your-org.atlassian.net)
        │
        │ POST /webhooks/jira (HTTPS)
        ▼
┌─────────────────────────────────────┐
│  Webapp (Flask)                     │
│  ├── Verify HMAC signature          │
│  ├── Fetch full issue via REST API  │
│  ├── Save JSON + download attachs   │
│  └── Trigger incremental transform  │
│            │                        │
│            ▼                        │
│  ┌─────────────────────────────┐    │
│  │ incremental_jira_transform  │    │
│  │ • Upsert to monthly Parquet │    │
│  │ • Copy to distribution dir  │    │
│  └─────────────────────────────┘    │
└─────────────────────────────────────┘
        │
        ▼ rsync (analyst sync)
┌─────────────────────────────────────┐
│  Analyst (local)                    │
│  • Only changed monthly files sync  │
│  • Data available within seconds    │
└─────────────────────────────────────┘

Data Structure

/data/src_data/
├── raw/jira/                  # Raw Jira data from webhooks
│   ├── issues/                # Individual issue JSON files
│   │   ├── SUPPORT-1234.json
│   │   └── SUPPORT-1235.json
│   ├── attachments/           # Downloaded attachment files
│   │   └── SUPPORT-1234/
│   │       └── 56340_image.png
│   └── webhook_events/        # Raw webhook payloads (audit)
│       └── 20260203_120000_jira_issue_created.json
│
└── parquet/jira/              # Transformed data (monthly partitioned)
    ├── issues/
    │   ├── 2024-01.parquet
    │   └── 2024-02.parquet
    ├── comments/
    ├── attachments/           # Metadata only (not binary)
    └── changelog/

~/server/parquet/jira/         # Distribution directory (symlink or copy)
                               # This is what analysts sync via rsync

Monthly partitioning: Each issue belongs to the month of its created_at date. When an issue is updated, only that month's Parquet file changes. Rsync skips unchanged files (by default it compares size and modification time; --checksum compares contents) and transfers only the changed ones (~50-100KB per month).
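
The partition rule can be expressed as a mapping from created_at to a file path. A minimal sketch, assuming created_at is an ISO 8601 timestamp and that the month is taken from the timestamp as written (no timezone conversion); the function name partition_path is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_path(created_at, table="issues", root="/data/src_data/parquet/jira"):
    """Return the monthly Parquet file that holds a row created at `created_at`."""
    created = datetime.fromisoformat(created_at)
    return PurePosixPath(root) / table / f"{created:%Y-%m}.parquet"
```

An update to an issue created in February 2024 therefore rewrites only issues/2024-02.parquet, regardless of when the update happens.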

Configuration

Add to /opt/data-analyst/.env:

# Jira Webhook Integration
JIRA_WEBHOOK_SECRET=<generate with: python3 -c "import secrets; print(secrets.token_hex(32))">
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token from Atlassian account>

# SLA polling (JSM service account for elapsed_millis refresh)
JIRA_SLA_EMAIL=<JSM service account email>
JIRA_SLA_API_TOKEN=<JSM service account API token>
JIRA_CLOUD_ID=f0f7a244-4fb4-41f9-b1f0-b79e24a20f11

Get Jira API token:

  1. Go to https://id.atlassian.com/manage-profile/security/api-tokens
  2. Create API token
  3. Store in .env as JIRA_API_TOKEN

Jira Webhook Setup

  1. Go to Jira Admin > System > WebHooks
  2. Create new webhook:
    • Name: Data Analyst Sync
    • URL: https://your-instance.example.com/webhooks/jira
    • Secret: Same value as JIRA_WEBHOOK_SECRET in .env
    • JQL Filter: project = "Your Project" (substitute your own project name or key)
    • Events:
      • Issue: created, updated, deleted
      • Comment: created, updated
      • Attachment: created
      • Issue link: created

Endpoints

Endpoint Method Description
/webhooks/jira POST Receive Jira webhooks
/webhooks/jira/health GET Health check (shows config status)
/webhooks/jira/test POST Manual issue fetch (debug mode only)

Monitoring

# Check webhook health
curl https://your-instance.example.com/webhooks/jira/health

# View recent webhook events
ls -la /data/src_data/raw/jira/webhook_events/ | tail -20

# Check saved issues
ls /data/src_data/raw/jira/issues/ | wc -l

# View webapp logs for webhook processing
tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira

SLA Polling

SLA elapsed values (first_response_elapsed_millis, time_to_resolution_elapsed_millis) only update when a webhook fires. For idle open tickets, these values go stale. The SLA polling timer refreshes them periodically and self-heals stale status data from missed webhooks.

Component Description
jira-sla-poll.service Oneshot service that polls open tickets for fresh SLA + status data
jira-sla-poll.timer Runs every 15 minutes (10min after boot, then every 15min)
scripts/jira_poll_sla.py Reads Parquet to find open issues, fetches SLA + status via cloud API
src/jira_file_lock.py Per-issue advisory file locking (shared with webhook handler)

How it works:

  1. Reads Parquet issues to find open tickets with SLA data (~49 tickets)
  2. For each: fetches fresh SLA and status fields via JSM service account (cloud API)
  3. Acquires per-issue advisory file lock (prevents concurrent webhook writes)
  4. Updates raw JSON atomically (tempfile + os.fchmod(0o660) + os.replace)
  5. If ticket is resolved in Jira but "open" locally: logs Self-healing: SUPPORT-XXXX is resolved in Jira
  6. Calls transform_single_issue() to update Parquet + distribution (inside lock)
  7. Releases lock
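
Steps 3 to 7 combine an advisory lock with an atomic file replacement. The sketch below shows that pattern with the standard library on Linux. It is not the code in src/jira_file_lock.py; the helper names and the lock file naming (<issue_key>.lock inside issues/.locks/) are assumptions.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def issue_lock(issues_dir, issue_key):
    """Hold an exclusive advisory lock for one issue (blocks until it is free)."""
    lock_dir = os.path.join(issues_dir, ".locks")
    os.makedirs(lock_dir, exist_ok=True)
    with open(os.path.join(lock_dir, f"{issue_key}.lock"), "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def write_issue_atomically(issues_dir, issue_key, issue):
    """Write <issue_key>.json so that readers never observe a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=issues_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(issue, tmp)
            os.fchmod(tmp.fileno(), 0o660)  # mkstemp creates the file as 0600
        # Same directory, hence same filesystem: the rename is atomic.
        os.replace(tmp_path, os.path.join(issues_dir, f"{issue_key}.json"))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A poller following the steps above would wrap both the write and the subsequent transform call in a single `with issue_lock(...)` block, so that a webhook for the same issue waits until both have finished.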

Monitoring:

# Check timer status
systemctl status jira-sla-poll.timer
systemctl list-timers | grep jira

# View last run logs
journalctl -u jira-sla-poll.service --since "1 hour ago"

# Manual dry run (count open issues)
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_poll_sla.py --dry-run

Requires: JIRA_SLA_EMAIL, JIRA_SLA_API_TOKEN, JIRA_CLOUD_ID in .env. Timer is auto-enabled by deploy.sh when JIRA_SLA_API_TOKEN is set.

Consistency Monitoring

Automated check every 30 minutes to detect missing Jira issues caused by webhook losses, disk failures, or processing errors. Validates data integrity by comparing three sources: Jira API (ground truth), raw JSON files, and Parquet data.

Component Description
jira-consistency.service Oneshot service that validates data consistency across all sources
jira-consistency.timer Runs every 30 minutes (10min after boot)
jira-consistency-deep.timer Weekly full history check (Sunday 3 AM)
scripts/jira_consistency_check.py Validation script with auto-backfill capability

How it works:

  1. Queries Jira API for all issue keys (last 30 days by default)
  2. Compares with raw JSON files in /data/src_data/raw/jira/issues/
  3. Compares with Parquet data in /data/src_data/parquet/jira/issues/
  4. Auto-backfills if 1-10 issues missing (downloads JSON + transforms to Parquet)
  5. Alerts (ERROR log) if 11+ issues missing (requires manual investigation)
  6. Re-transforms JSON to Parquet for issues with transform lag

Grace period: Ignores issues created in last 5 minutes to avoid false positives from webhook timing windows.

Alert levels:

  • INFO: 1-5 missing issues, auto-backfilled successfully
  • WARNING: 6-10 missing issues, auto-backfilled successfully
  • ERROR: 11+ missing issues, manual review required (no auto-fix)
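
The thresholds above amount to a small decision function. A minimal sketch; the function name classify_missing and the "OK" result for zero missing issues are assumptions, since the text does not describe that case.

```python
def classify_missing(missing_count):
    """Map a missing-issue count to (log_level, auto_backfill)."""
    if missing_count < 0:
        raise ValueError("missing_count must be non-negative")
    if missing_count == 0:
        return ("OK", False)  # assumption: nothing to report or fix
    if missing_count <= 5:
        return ("INFO", True)
    if missing_count <= 10:
        return ("WARNING", True)
    return ("ERROR", False)  # 11+: manual review, no auto-fix
```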

Monitoring:

# Check timer status
systemctl status jira-consistency.timer
systemctl list-timers | grep jira

# View last run logs
journalctl -u jira-consistency.service --since "1 hour ago"

# Manual check (dry run)
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_consistency_check.py --dry-run --max-age-days 7

# Manual check with auto-fix
/opt/data-analyst/.venv/bin/python scripts/jira_consistency_check.py --auto-fix --max-age-days 30

# View consistency report
cat /data/src_data/raw/jira/_consistency_report.json | python3 -m json.tool

Manual recovery (if 11+ issues found):

# List missing issues from report
jq -r '.discrepancies.missing_in_json[]' /data/src_data/raw/jira/_consistency_report.json

# Backfill specific issues
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_backfill.py --issue-keys SUPPORT-15307,SUPPORT-15308

# Verify in Parquet
/opt/data-analyst/.venv/bin/python -c "
import duckdb
con = duckdb.connect()
result = con.execute('''
  SELECT issue_key, created_at, summary
  FROM read_parquet('/data/src_data/parquet/jira/issues/*.parquet')
  WHERE issue_key IN ('SUPPORT-15307', 'SUPPORT-15308')
''').fetchall()
for row in result:
    print(row)
"

Requires: JIRA_DOMAIN, JIRA_EMAIL, JIRA_API_TOKEN in .env. Timers are auto-enabled by deploy.sh when Jira credentials are configured.

Security

  • Webhooks are verified using HMAC-SHA256 signature
  • API token has read-only access to Jira (no write permissions needed)
  • Webhook events are logged for audit purposes
  • Multiple services write to /data/src_data/raw/jira/: webapp (www-data), SLA poll (root), consistency check (root), backfill scripts (admin users)
  • Concurrent writes to the same issue JSON are serialized via per-issue advisory file locking (src/jira_file_lock.py, fcntl.flock). Lock files in issues/.locks/. See #203.
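
HMAC-SHA256 verification of a webhook body can be sketched as follows. This assumes the signature arrives as sha256=<hex digest> in a request header; the exact header name and format should be checked against the Jira webhook documentation, and verify_webhook_signature is a hypothetical name.

```python
import hashlib
import hmac


def verify_webhook_signature(secret, body, signature_header):
    """Return True if `signature_header` matches HMAC-SHA256(secret, body).

    `body` must be the raw request bytes, before any JSON parsing.
    """
    if not signature_header:
        return False
    scheme, _, received = signature_header.partition("=")
    if scheme != "sha256" or not received:
        return False
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The secret here corresponds to JIRA_WEBHOOK_SECRET; a request that fails the check should be rejected before its payload is read.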

Data Profiler

Generates YData-inspired statistical profiles for all tables in the data catalog, including Jira support tables. Profiles include per-column statistics, type-specific visualizations (histograms, top values, timelines), data quality alerts, and business context (relationships, metrics). Profiles are preserved across runs — if a table fails to profile, its previous valid data is retained.

Architecture

Cron (update.sh, 3x daily)
  Step 2: python -m src.data_sync     → parquet + sync_state.json + schema.yml
  Step 4: python -m src.profiler      → profiles.json
                │
                ▼
/data/src_data/metadata/profiles.json  (mode 644, padak:data-ops)
                │
                ▼
Webapp: GET /api/catalog/profile/<table_name>
                │
                ▼
Catalog page: profiler modal (Chart.js visualizations)

How It Works

  1. Profiler runs as Step 4 in scripts/update.sh after data sync and metadata generation
  2. Materializes Parquet into DuckDB: CREATE TEMP TABLE loads each table once into DuckDB columnar storage (instead of re-reading Parquet files for every query)
  3. Batch statistics — base stats (COUNT, COUNT DISTINCT) for all columns in one query; type-specific aggregates (numeric, string, date, boolean) batched per category
  4. Large tables (>500K rows) are sampled: USING SAMPLE 500000 ROWS
  5. Merges metadata from data_description.md (descriptions, foreign keys), sync_state.json (row counts, file sizes), and docs/metrics/*.yml (business metric mappings)
  6. Writes profiles.json atomically (tempfile.mkstemp() + os.chmod(0o644) + os.replace())
  7. Preserves existing profiles on failure — if a table fails to profile, the previous valid profile is retained (marked _stale: true)
  8. Profiler failure is non-fatal — if the entire profiler fails, the update pipeline continues
  9. Jira table relationships: issue_key foreign keys are defined between all Jira tables (comments, attachments, changelog, issuelinks, remote_links → jira_issues), visible in the Relationships tab
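
The "preserve on failure" behaviour in step 7 can be sketched as a merge of the previous profiles.json content with the results of the current run. The function name merge_profiles and its argument shapes are assumptions; only the _stale marker comes from the description above.

```python
def merge_profiles(previous, fresh, failed):
    """Combine this run's profiles with earlier ones for tables that failed.

    previous: table name -> profile from the last profiles.json
    fresh:    table name -> profile computed in this run
    failed:   names of tables that could not be profiled in this run
    """
    merged = dict(fresh)
    for name in failed:
        if name in previous and name not in merged:
            # Keep the last valid profile, flagged so the UI can show its age.
            merged[name] = {**previous[name], "_stale": True}
    return merged
```

A table that fails on its very first run has no earlier profile to fall back on, so it is simply absent from the output.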

Output File

/data/src_data/metadata/profiles.json   # ~900 KB for ~29 tables

Permissions: File must be 644 (world-readable) so the webapp (www-data) can serve it. The profiler sets os.chmod(tmp, 0o644) before os.replace() because mkstemp() defaults to 600.

Per-Table Profile Structure

Each table profile contains:

Field Source Description
row_count, column_count DuckDB Table dimensions
file_size_mb sync_state.json Parquet file size on disk
description, primary_key data_description.md Business context
avg_completeness DuckDB Average non-null percentage across columns
missing_cells, missing_cells_pct DuckDB Total NULL cells count and percentage
duplicate_rows DuckDB COUNT(*) - COUNT(DISTINCT *)
date_range DuckDB Earliest/latest date from date columns
variable_types DuckDB Breakdown by type (STRING, NUMERIC, DATE, BOOLEAN)
alerts Computed Auto-detected data quality issues (see below)
related_tables data_description.md Foreign key relationships (outgoing + incoming)
used_by_metrics docs/metrics/*.yml Which business metrics use this table
sample_rows DuckDB First 5 rows for preview
columns DuckDB Per-column detailed statistics
_stale Profiler true if this profile is from a previous run (current profiling failed)

Alert System

Auto-detection of data quality issues, displayed as colored badges:

Alert Condition Severity
constant unique_count == 1 warning (yellow)
unique unique_pct == 100% info (red)
high_missing missing_pct > 30% error (red)
missing missing_pct > 5% warning (yellow)
imbalance top_value_pct > 60% (categorical) info (blue)
zeros zero_pct > 50% (numeric) info (blue)
high_cardinality unique_count > 50 (text) info (grey)
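
The conditions in the table can be read as a rule set applied to each column's statistics. A minimal sketch: the input keys (including the kind field used to tell categorical, numeric, and text columns apart) are hypothetical, and treating high_missing as replacing missing rather than adding to it is an assumption.

```python
def detect_alerts(col):
    """Return alert names for one column's stats, using the thresholds above."""
    alerts = []
    if col["unique_count"] == 1:
        alerts.append("constant")
    if col["unique_pct"] == 100.0:
        alerts.append("unique")
    if col["missing_pct"] > 30:
        alerts.append("high_missing")
    elif col["missing_pct"] > 5:
        alerts.append("missing")
    kind = col.get("kind")
    if kind == "categorical" and col.get("top_value_pct", 0) > 60:
        alerts.append("imbalance")
    if kind == "numeric" and col.get("zero_pct", 0) > 50:
        alerts.append("zeros")
    if kind == "text" and col["unique_count"] > 50:
        alerts.append("high_cardinality")
    return alerts
```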

Type-Specific Column Statistics

Column Type | Statistics | Visualization
STRING (low cardinality ≤50) | Top 10 values with counts/percentages | Horizontal bar chart
STRING (high cardinality >50) | min/max/avg length, sample values | Sample list
NUMERIC (FLOAT64, INT64, DECIMAL) | min, max, mean, median, p5/p25/p75/p95, stddev, zeros | Histogram (10-20 buckets)
DATE/TIMESTAMP | earliest, latest, span_days | Timeline histogram (quarterly)
BOOLEAN | true_count, false_count, true_pct | True/false ratio bar

Webapp Integration

API endpoint: GET /api/catalog/profile/<table_name> (requires login)

  • Returns JSON profile for a single table from profiles.json
  • 404 if profiler hasn't run yet or table not found
  • 500 if file unreadable (check permissions)

Catalog page: Click any table row to open profiler modal with tabs:

  • Overview — dataset statistics + variable type breakdown
  • Variables — per-column cards with type-specific charts (Chart.js)
  • Alerts — all detected issues with colored severity badges
  • Missing Values — horizontal bar chart of completeness per column
  • Relationships — foreign key links (clickable to open related table's profile)
  • Sample — first 5 rows in table format

Performance

  • Runtime: ~1-2 minutes for ~29 tables (optimized from ~8min via TABLE materialization + batch queries)
  • Sampling: Tables >500K rows use USING SAMPLE 500000 ROWS for consistent performance
  • Memory: In-memory DuckDB with temporary tables (dropped after profiling)
  • Output size: ~900 KB JSON for ~29 tables (including 6 Jira tables)

Files

File Description
src/profiler.py Profiler engine (~1220 lines)
tests/test_profiler.py Unit + integration tests (24 tests)
scripts/update.sh Pipeline integration (Step 4)
webapp/app.py API route /api/catalog/profile/<table_name>
webapp/templates/catalog.html Profiler modal UI + Chart.js

Monitoring

# Manual profiler run
ssh kids "cd /opt/data-analyst/repo && source /opt/data-analyst/.venv/bin/activate && python -m src.profiler"

# Check output
ssh kids "ls -la /data/src_data/metadata/profiles.json"
ssh kids "python3 -c \"import json; d=json.load(open('/data/src_data/metadata/profiles.json')); print(f'Tables: {len(d[\\\"tables\\\"])}')\""

# Check update.sh logs (profiler runs as Step 4)
ssh kids "cat /var/log/update.log | grep -A5 'Generating data profiles'"

# Test API endpoint
curl -s https://your-instance.example.com/api/catalog/profile/company | python3 -m json.tool | head -20

Troubleshooting

"Profile data not available for this table"

  • Profiler hasn't been run yet, or table name doesn't match
  • Run manually: python -m src.profiler on server
  • Note: Since v1.1, profiler preserves old profiles on failure — this should only appear for truly new tables

HTTP 500 on /api/catalog/profile/*

  • Check file permissions: ls -la /data/src_data/metadata/profiles.json — must be 644
  • Fix: sudo chmod 644 /data/src_data/metadata/profiles.json
  • Root cause: mkstemp() creates files with 600; fixed in profiler.py with os.chmod(0o644)

Profiler takes too long

  • Normal runtime is ~1-2 minutes; if significantly longer, check which tables are large in profiler logs
  • Sampling threshold is 500K rows (configurable in src/profiler.py constant SAMPLE_THRESHOLD)
  • TABLE materialization + batch queries keep it fast; if DuckDB runs out of memory, check server RAM

Metrics not showing in profiler

  • Metrics are loaded from docs/metrics/ directory (split by category: docs/metrics/*/*.yml)
  • Legacy docs/metrics.yml path is still supported but the directory structure takes precedence
  • Check that metric files exist: ls docs/metrics/*/*.yml

Corporate Memory

A knowledge sharing system that extracts reusable insights from analysts' personal notes (CLAUDE.local.md), lets the team vote on them via a webapp, and syncs upvoted items back to each user's Claude Code rules.

Architecture

┌─────────────────────────────────────┐
│  Analyst Workstations               │
│  ├── CLAUDE.local.md                │  ← Personal notes (synced to server)
│  └── .claude/rules/*.md             │  ← Synced rules from upvoted items
└─────────────────────────────────────┘
         │ sync_data.sh                    ▲ sync_data.sh
         │ (upload CLAUDE.local.md)        │ (download .claude_rules/*)
         ▼                                 │
┌─────────────────────────────────────┐   │
│  Server: /home/{user}/              │   │
│  ├── CLAUDE.local.md                │   │
│  └── .claude_rules/*.md             │───┘
└─────────────────────────────────────┘
         │ corporate-memory.timer (every 30 min)
         ▼
┌─────────────────────────────────────┐
│  Knowledge Collector (full refresh) │
│  ├── MD5 hash change detection      │
│  ├── ALL files + existing catalog   │
│  │   → single Claude Haiku 4.5 call │
│  │     (Structured Outputs)         │
│  ├── Sensitivity check (new items)  │
│  └── Save to knowledge.json         │
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│  /data/corporate-memory/            │
│  ├── knowledge.json                 │
│  ├── votes.json                     │
│  └── user_hashes.json               │
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│  Webapp: /corporate-memory          │
│  ├── Browse, search, filter         │
│  ├── Upvote / downvote items        │
│  └── On vote → regenerate user rules│
└─────────────────────────────────────┘

How It Works

Collection (server-side, every 30 min)

  1. Analysts write notes in CLAUDE.local.md during their work with Claude Code
  2. sync_data.sh uploads CLAUDE.local.md to /home/{user}/CLAUDE.local.md on the server
  3. Collector checks for changes by comparing MD5 hashes of all users' files against user_hashes.json
  4. If any file changed, collector sends ALL users' files + the existing knowledge catalog to Claude Haiku 4.5 in a single API call (full refresh approach)
  5. Haiku maps knowledge to existing catalog items (preserving IDs for vote stability) or creates new items
  6. Sensitivity check runs only on newly created items (existing items were already checked)
  7. Knowledge base is updated atomically (tempfile + os.replace)
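
The change-detection step (step 3) can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function names are hypothetical, and treating a removed file as a change is an assumption.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a file's content (change detection only, not security)."""
    return hashlib.md5(data).hexdigest()


def detect_changes(
    current_files: dict[str, bytes],
    stored_hashes: dict[str, str],
) -> tuple[bool, dict[str, str]]:
    """Compare the current per-user hashes against the stored ones.

    current_files maps a username to the bytes of that user's CLAUDE.local.md;
    stored_hashes is the content of user_hashes.json from the previous run.
    Returns (changed, new_hashes). Any added, removed, or modified file counts
    as a change (assumption made for this sketch).
    """
    new_hashes = {user: md5_of(content) for user, content in current_files.items()}
    return new_hashes != stored_hashes, new_hashes
```

When `changed` is false the run can stop before any API call; otherwise `new_hashes` would be written back to user_hashes.json after a successful collection.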

Voting and Rules Sync (webapp → analyst)

  1. Users browse knowledge at /corporate-memory (search, filter by category, sort by score)
  2. Upvoting an item records the vote in votes.json and immediately regenerates the user's rule files
  3. Rule files are installed to /home/{server_user}/.claude_rules/{item_id}.md via the install-user-rules sudo helper (see below)
  4. Next sync_data.sh run downloads .claude_rules/* to the analyst's .claude/rules/ directory
  5. Claude Code automatically reads files from .claude/rules/ as project context

There is no threshold - any personal upvote syncs the item to that user's rules.

Rules Installation (sudo helper)

The webapp runs as www-data which cannot write to /home/{user}/ directories (mode drwxr-x---). Rule files are installed using the established sudo install pattern (same approach as sync_settings_service.py for .sync_settings.yaml):

  1. Webapp writes rule .md files to a temp directory
  2. Calls sudo -n /usr/local/bin/install-user-rules {username} {tmp_dir}
  3. Helper script creates /home/{user}/.claude_rules/ (mode 700), removes old km_*.md files, installs new files with /usr/bin/install -o {user} -g {user} -m 600
  4. Webapp cleans up the temp directory

Files involved:

  • server/bin/install-user-rules → deployed to /usr/local/bin/install-user-rules
  • server/sudoers-webapp → entry: www-data ALL=(ALL) NOPASSWD: /usr/local/bin/install-user-rules
  • webapp/corporate_memory_service.py: _regenerate_user_rules() calls the helper via subprocess.run()
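
A rough sketch of the four installation steps from the webapp side is shown below. It is not the real _regenerate_user_rules(): the function name install_rules, the injectable run parameter (added so the call can be exercised without sudo), and the 30-second timeout are assumptions.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

HELPER = "/usr/local/bin/install-user-rules"  # helper path as documented above


def install_rules(
    server_user: str,
    rules: dict[str, str],
    run: Callable[..., object] = subprocess.run,
) -> list[str]:
    """Write rule files to a temp directory and hand them to the sudo helper.

    rules maps item IDs (e.g. "km_abc123") to Markdown bodies.
    Returns the command that was executed.
    """
    with tempfile.TemporaryDirectory(prefix="rules_") as tmp_dir:
        # Step 1: write the rule .md files to a temp directory.
        for item_id, body in rules.items():
            (Path(tmp_dir) / f"{item_id}.md").write_text(body, encoding="utf-8")
        # Step 2: call the helper; -n makes sudo fail instead of prompting.
        cmd = ["sudo", "-n", HELPER, server_user, tmp_dir]
        # Step 3 happens inside the helper. check=True raises
        # CalledProcessError if the helper exits non-zero.
        run(cmd, check=True, timeout=30)
    # Step 4: TemporaryDirectory removed the temp directory on exit.
    return cmd
```

Passing the command as a list (no shell) keeps the username and path from being interpreted by a shell.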

Username Mapping

The webapp uses email-derived usernames (e.g., petr.simecek) while the server uses Linux home directory names (e.g., petr). Most usernames are identical on both sides; only Petr's differs.

Mapping is in webapp/corporate_memory_service.py:

WEBAPP_TO_SERVER_USERNAME = {
    "petr.simecek": "petr",
}
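
A lookup over this mapping would typically fall back to the webapp username when no entry exists, since most users match. The helper name to_server_username below is hypothetical; only the dictionary comes from the source.

```python
WEBAPP_TO_SERVER_USERNAME = {
    "petr.simecek": "petr",
}


def to_server_username(webapp_username: str) -> str:
    """Return the Linux username for a webapp user.

    Users without an explicit mapping keep their webapp username,
    which matches the "most users match" rule described above.
    """
    return WEBAPP_TO_SERVER_USERNAME.get(webapp_username, webapp_username)
```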

Display names for avatars (initials + tooltip):

USER_DISPLAY_NAMES = {
    "petr": {"name": "Petr Simecek", "initials": "PS"},
    "dasa.damaskova": {"name": "Dasa Damaskova", "initials": "DD"},
    "martin.matejka": {"name": "Martin Matejka", "initials": "MM"},
    "jiri.manas": {"name": "Jiri Manas", "initials": "JM"},
    "pavel.dolezal": {"name": "Pavel Dolezal", "initials": "PD"},
}

Data Files

/data/corporate-memory/               # deploy:data-ops, mode 2770
├── knowledge.json                    # Extracted knowledge items + metadata
├── votes.json                        # Per-user votes {username: {item_id: 1/-1}}
├── user_hashes.json                  # MD5 hashes for change detection
└── collection.log                    # Collection run history

/home/{user}/
├── CLAUDE.local.md                   # User's personal notes (source)
└── .claude_rules/                    # Generated rule files (mode 700, owner-only)
    ├── km_abc123.md                  # mode 600, owned by user
    └── km_def456.md

knowledge.json structure:

{
  "items": {
    "km_abc123": {
      "id": "km_abc123",
      "title": "DuckDB Schema Reference Protocol",
      "content": "Always read schema before queries...",
      "category": "workflow",
      "tags": ["duckdb", "best-practices"],
      "source_users": ["petr"],
      "extracted_at": "2026-02-05T21:54:18Z",
      "updated_at": "2026-02-05T21:54:18Z"
    }
  },
  "metadata": {
    "last_collection": "2026-02-05T21:54:18Z",
    "total_users": 3
  }
}

votes.json structure:

{
  "petr": {
    "km_abc123": 1,
    "km_def456": -1
  }
}
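
Given this structure, an item's score and a user's active rules can be derived in one pass each. The sketch below assumes the score is the plain sum of all votes, which the source does not state explicitly; the function names are illustrative.

```python
def item_scores(votes: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum every user's vote (+1 / -1) per item ID."""
    scores: dict[str, int] = {}
    for user_votes in votes.values():
        for item_id, vote in user_votes.items():
            scores[item_id] = scores.get(item_id, 0) + vote
    return scores


def upvoted_items(votes: dict[str, dict[str, int]], username: str) -> list[str]:
    """Item IDs the user has upvoted, i.e. the items synced to their rules."""
    user_votes = votes.get(username, {})
    return sorted(item_id for item_id, vote in user_votes.items() if vote == 1)
```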

Full Refresh Approach

The collector uses a full refresh strategy to avoid duplicates:

  1. Change detection: MD5 hash of each user's CLAUDE.local.md is compared against user_hashes.json
  2. If no changes: Skip the API call entirely (saves cost)
  3. If any file changed: Load ALL user files and the existing catalog
  4. Single Haiku call: The prompt includes the existing catalog with IDs, so Haiku can:
    • Map knowledge to existing items (preserving existing_id for vote stability)
    • Merge similar knowledge from different users into single items
    • Add genuinely new items (assigned new km_* IDs)
    • Preserve source_users from existing items even if a user removed their notes
  5. Sensitivity check: Only NEW items (without existing_id) are checked - existing items passed the check previously

This approach ensures:

  • No duplicates from non-deterministic AI output
  • Stable item IDs across runs (votes are preserved)
  • Cross-user knowledge merging in a single pass
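
The ID-preservation logic can be illustrated with a small merge function. This is a sketch under assumptions: the shape of the model output (a list of dicts with existing_id, title, content, source_users), the use of uuid4().hex for new IDs, and dropping catalog items the model does not return are all choices made for the example, not confirmed behavior.

```python
import uuid


def apply_refresh(catalog: dict[str, dict], extracted: list[dict]) -> dict[str, dict]:
    """Build the refreshed catalog from the model's structured output."""
    refreshed: dict[str, dict] = {}
    for item in extracted:
        existing_id = item.get("existing_id")
        if existing_id in catalog:
            # Known item: keep its ID so existing votes stay attached.
            item_id = existing_id
            previous_users = catalog[existing_id].get("source_users", [])
        else:
            # New item (existing_id is null or unknown): assign a fresh km_* ID.
            item_id = f"km_{uuid.uuid4().hex[:8]}"
            previous_users = []
        # Keep earlier contributors even if they have since removed their notes.
        users = list(dict.fromkeys(previous_users + item.get("source_users", [])))
        refreshed[item_id] = {
            "id": item_id,
            "title": item["title"],
            "content": item["content"],
            "source_users": users,
        }
    return refreshed
```

Only the items that took the `else` branch would then go through the sensitivity check.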

Systemd Services

| Service | Type | Schedule | Description |
|---|---|---|---|
| corporate-memory.service | oneshot | on-demand | Runs the knowledge collector |
| corporate-memory.timer | timer | every 30 min | Triggers the service |

Service configuration:

  • Runs as root (needed to read /home/*/CLAUDE.local.md)
  • Group: data-ops
  • Timeout: 600 seconds (for API calls)
  • Security hardening: ProtectSystem=strict, PrivateTmp=true

Configuration

Required GitHub Secret:

| Secret | Description |
|---|---|
| ANTHROPIC_API_KEY | Claude API key for Haiku 4.5 extraction |

The API key is deployed to /opt/data-analyst/.env via CI/CD and loaded by the collector service.

Model: claude-haiku-4-5-20251001 with Structured Outputs (output_config.format.json_schema)

Knowledge Categories

| Category | Description |
|---|---|
| data_analysis | DuckDB, Parquet, data processing techniques |
| api_integration | API usage, HTTP clients, authentication |
| debugging | Error diagnosis, troubleshooting techniques |
| performance | Optimization, caching, efficiency improvements |
| workflow | Best practices, processes, conventions |
| infrastructure | Server, deployment, configuration |
| business_logic | Domain knowledge, data relationships |

Extraction Process

The collector uses Claude Haiku 4.5 with Structured Outputs for guaranteed JSON schema compliance:

  1. Catalog refresh prompt sends all user files + existing catalog to Haiku
  2. JSON Schema enforces output format including existing_id (string or null) for ID preservation
  3. Sensitivity check verifies only NEW items are safe to share
  4. ID assignment: Existing items keep their IDs; new items get km_{uuid[:8]} format

Filtering rules (in the prompt):

  • EXCLUDE: API keys, tokens, passwords, credentials
  • EXCLUDE: Personal preferences, project-specific paths
  • EXCLUDE: Basic knowledge any developer would know
  • EXCLUDE: Incomplete or unclear notes
  • EXCLUDE: Anything referencing specific people negatively

Manual Reset

To recalculate the entire knowledge base from scratch (e.g., after fixing duplicates):

# Reset: clears knowledge.json, votes.json, user_hashes.json, and stale .claude_rules
sudo /usr/local/bin/collect-knowledge --reset --verbose

The --reset flag:

  1. Clears knowledge.json, user_hashes.json, and votes.json
  2. Removes stale .claude_rules/km_*.md files from all user home directories
  3. Runs a fresh collection from all CLAUDE.local.md files

This is a manual operation, not part of the regular timer schedule.

Monitoring

# Check timer status
sudo systemctl status corporate-memory.timer

# View last collection
sudo journalctl -u corporate-memory -n 50 --no-pager

# Manual collection run
sudo systemctl start corporate-memory.service

# Manual run with verbose output (shows API calls, items found)
sudo /usr/local/bin/collect-knowledge --verbose

# View knowledge base
cat /data/corporate-memory/knowledge.json | python3 -m json.tool

# Check item count
cat /data/corporate-memory/knowledge.json | python3 -c "import json,sys; d=json.load(sys.stdin); print(f'Items: {len(d.get(\"items\", {}))}')"

# Check votes
cat /data/corporate-memory/votes.json | python3 -m json.tool

# Check user hashes (change detection state)
cat /data/corporate-memory/user_hashes.json | python3 -m json.tool

# View a user's synced rules
ls -la /home/petr/.claude_rules/

Webapp Integration

The Corporate Memory page at /corporate-memory provides:

  • Dashboard stats: Total items, contributors, categories, last collection time
  • Knowledge cards: Title, content, category badge, tags, contributor avatars (initials + tooltip)
  • Voting: Upvote/downvote buttons per item (instantly updates score, regenerates user rules)
  • Filtering: By category dropdown, text search (title + content + tags)
  • Sorting: By score (default), by date, by number of contributors
  • "My Rules" toggle: Shows only items the current user has upvoted
  • User stats: Number of votes cast, number of active rules
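
The filtering and default sorting described above could look roughly like this. The function name query_items and the case-insensitive substring match are assumptions; the searched fields (title, content, tags) and the sort by score follow the list above.

```python
from typing import Optional


def query_items(
    items: dict[str, dict],
    scores: dict[str, int],
    category: Optional[str] = None,
    search: Optional[str] = None,
) -> list[dict]:
    """Filter knowledge items by category and search text, then sort by score."""
    needle = search.lower() if search else None
    selected = []
    for item in items.values():
        if category and item["category"] != category:
            continue
        if needle:
            # Search covers title + content + tags.
            haystack = " ".join([item["title"], item["content"], *item["tags"]]).lower()
            if needle not in haystack:
                continue
        selected.append(item)
    # Highest score first; items without votes count as 0.
    return sorted(selected, key=lambda it: scores.get(it["id"], 0), reverse=True)
```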

API endpoints:

  • GET /api/corporate-memory/knowledge - List items (supports category, search, sort, page, my_rules params)
  • POST /api/corporate-memory/vote - Cast vote {item_id, vote: 1/-1/0}
  • GET /api/corporate-memory/stats - Dashboard statistics

Security

  • Root access required: Collector service runs as root to read /home/*/CLAUDE.local.md
  • Sudo helper for rules: Webapp uses install-user-rules via sudo to write to user home dirs (same pattern as sync_settings_service.py). Each user's .claude_rules/ is mode 700, files 600 - users cannot read each other's rules.
  • Sensitivity filtering: Two-pass check (extraction prompt rules + dedicated sensitivity check on new items)
  • No credentials stored: Knowledge items are filtered before storage
  • Source attribution: Items track which users contributed (displayed as avatar initials)
  • Read-only for analysts: /data/corporate-memory/ is only writable by data-ops group
  • Atomic writes: All JSON file updates use tempfile.mkstemp() + os.replace() to prevent corruption. Critical: always call os.fchmod(fd, 0o660) (or another appropriate mode) immediately after mkstemp(). Otherwise the default 0600 mode sets the POSIX ACL mask to ---, which breaks group-based access for other services. See #203.
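
The atomic-write pattern from the last bullet can be sketched as below. The function name atomic_write_json and the fsync call are additions for illustration; the mkstemp, fchmod, and os.replace sequence is the part described in the source.

```python
import contextlib
import json
import os
import tempfile


def atomic_write_json(path: str, data: object, mode: int = 0o660) -> None:
    """Write JSON to path so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0600; widen it right away so the
            # group bits (and with them the ACL mask) allow group access.
            os.fchmod(handle.fileno(), mode)
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Because the rename is the last step, a crash mid-write leaves the previous file untouched.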

Session Collector

Collects Claude Code session transcripts from analyst home directories and stores them centrally.

Architecture

/home/*/user/sessions/   (per-user session transcripts)
         │
         ▼
session-collector.timer  (every 6 hours)
         │
         ▼
/data/user_sessions/     (central storage, root:data-ops, mode 2770)

Systemd Services

| Unit | Type | Schedule | Description |
|---|---|---|---|
| session-collector.service | oneshot | on-demand | Runs the session collector |
| session-collector.timer | timer | every 6 hours | Triggers the service |

Monitoring

sudo systemctl status session-collector.timer
sudo journalctl -u session-collector -n 50 --no-pager

Security

  • Root access required: Collector runs as root to read /home/*/user/sessions/
  • Central storage: /data/user_sessions/ is writable only by data-ops group

WebSocket Gateway

Real-time WebSocket gateway for desktop app notifications and live updates.

Architecture

Desktop App (WebSocket client)
         │
         ▼
ws-gateway.service  (deploy:data-ops)
         │
         ▼
/run/ws-gateway/ws.sock  (unix socket, mode 0755)

Systemd Service

| Unit | Type | Description |
|---|---|---|
| ws-gateway.service | simple | WebSocket gateway for desktop clients |

Monitoring

sudo systemctl status ws-gateway
sudo journalctl -u ws-gateway -n 50 --no-pager

Security

  • JWT authentication: Desktop clients authenticate via JWT tokens (DESKTOP_JWT_SECRET)
  • Read-only home: Service has ProtectHome=read-only
  • Strict protection: ProtectSystem=strict limits filesystem access

Google Cloud Monitoring

The server uses Google Cloud Ops Agent for centralized logging and metrics collection. All logs and metrics are sent to Google Cloud for analysis, alerting, and debugging.

What's Collected

Logs (Fluent Bit → Cloud Logging):

  • All syslog messages (/var/log/syslog, /var/log/messages)
  • systemd journal logs (including service failures, crashes)
  • Application logs (if written to syslog/journal)
  • Retention: 30 days (default)

Metrics (OpenTelemetry → Cloud Monitoring):

  • CPU utilization (%)
  • Memory usage (%)
  • Disk usage (%) per device
  • Network traffic (bytes sent/received)
  • Load average
  • Collection interval: 60 seconds
  • Retention: 6 weeks (default)

Configured Alerts

Alert notifications are sent to:

| Alert | Threshold | Duration | Action |
|---|---|---|---|
| High CPU Usage | >80% | 5 minutes | Check: ssh kids 'ps aux --sort=-%cpu \| head -20' |
| High Memory Usage | >90% | 5 minutes | Check: ssh kids 'free -h && ps aux --sort=-%mem \| head -20' |
| High Disk Usage | >85% | 1 minute | Check: ssh kids 'df -h && du -sh /data/* \| sort -h' |
| Health Endpoint Down | Uptime check fails | 3 minutes | Check: ssh kids 'systemctl status webapp' |
| Health Endpoint Degraded | /health returns 503 | 2 minutes | Check: curl https://your-instance.example.com/health and review service status |
| Systemd Service/Timer Failures | Any failure | 1 minute | Check: ssh kids 'systemctl --failed && journalctl -xe' |

Log-Based Metrics

Custom metrics derived from logs for trend analysis:

| Metric | Description | Filter |
|---|---|---|
| systemd_service_failures | Count of systemd service/timer failures | "Failed with result" OR "failed with result" |
| permission_denied_errors | Count of Permission denied errors | "Permission denied" |
| health_endpoint_degraded | Count of /health returning 503 | "/health" AND ("503" OR "degraded") |

Dashboard

Server Overview Dashboard:

Health Endpoint & Uptime Monitoring

Health Endpoint: https://your-instance.example.com/health

Returns detailed server status in JSON format:

  • Services: webapp.service, telegram-bot.service
  • Timers: jira-consistency.timer, corporate-memory.timer, jira-sla-poll.timer
  • Disk usage: All partitions (/, /data, /home, /tmp)
  • System load: 1min, 5min, 15min averages
  • Jira webhook: Last webhook timestamp and age

Response format (status is "healthy" or "degraded"):

{
  "status": "healthy",
  "timestamp": "2026-02-13T18:50:33.825333Z",
  "services": [{"name": "webapp.service", "status": "active", "healthy": true}],
  "timers": [{"name": "jira-consistency.timer", "status": "active", "healthy": true}],
  "disk": [
    {"partition": "/", "used_percent": 79.4, "free_gb": 1.98, "healthy": true},
    {"partition": "/data", "used_percent": 39.0, "free_gb": 17.92, "healthy": true}
  ],
  "load": {"load_1min": 0.58, "load_5min": 1.82, "load_15min": 1.85, "healthy": true},
  "jira_webhook": {"last_webhook_hours_ago": 0.0, "healthy": true}
}

HTTP Status Codes:

  • 200 OK = all checks healthy
  • 503 Service Unavailable = one or more checks failed (status: "degraded")
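
The mapping from individual checks to the overall status and HTTP code can be sketched as follows. The function name overall_status is hypothetical, and the real endpoint may aggregate differently; the sketch only assumes that every check carries a healthy flag, as in the response format above.

```python
def overall_status(report: dict) -> tuple[str, int]:
    """Derive ("healthy", 200) or ("degraded", 503) from the per-check flags."""
    checks = []
    # List-valued sections: one entry per service, timer, or partition.
    for key in ("services", "timers", "disk"):
        checks.extend(entry["healthy"] for entry in report.get(key, []))
    # Single-object sections.
    for key in ("load", "jira_webhook"):
        if key in report:
            checks.append(report[key]["healthy"])
    if all(checks):
        return "healthy", 200
    return "degraded", 503
```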

Uptime Check:

  • Monitors /health endpoint from 3 global locations (USA, Europe, Asia-Pacific)
  • Check interval: 5 minutes
  • Timeout: 10 seconds
  • Validates response contains "status": "healthy"
  • Alert triggered if check fails for 3+ minutes

Viewing Logs

Cloud Logging Console: https://console.cloud.google.com/logs?project=kids-ai-data-analysis

Useful log queries:

-- All logs from the server (last 1 hour; set the time range in the console)
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"

-- systemd service failures
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
("Failed with result" OR "Main process exited")

-- Permission denied errors
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"Permission denied"

-- Webapp errors
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"gunicorn" AND ("ERROR" OR "WARNING")

-- Jira webhook processing
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"Received webhook"

Viewing Metrics

Cloud Monitoring Console: https://console.cloud.google.com/monitoring?project=kids-ai-data-analysis

Metrics Explorer - Useful metric queries:

  • CPU: compute.googleapis.com/instance/cpu/utilization
  • Memory: agent.googleapis.com/memory/percent_used
  • Disk: agent.googleapis.com/disk/percent_used
  • Network: agent.googleapis.com/interface/traffic (bytes sent and received, split by the direction label)

Cost

Google Cloud Monitoring pricing (as of 2026):

  • Logs ingestion: First 50 GB/month free, then $0.50/GB
  • Metrics ingestion: First 150 MB/month free, then $0.2580/MB
  • Log storage: $0.01/GB/month for logs kept beyond the default 30-day retention (the first 30 days are included)
  • Typical monthly cost for this server: ~$5-10 (well within free tier)

Significantly cheaper than Datadog (~$15-31/host/month).

Managing Alerts

List alert policies:

gcloud alpha monitoring policies list \
  --project=kids-ai-data-analysis \
  --format="table(displayName,enabled,conditions[0].conditionThreshold.thresholdValue)"

Disable an alert:

gcloud alpha monitoring policies update POLICY_ID \
  --project=kids-ai-data-analysis \
  --no-enabled

Add notification channel:

gcloud alpha monitoring channels create \
  --project=kids-ai-data-analysis \
  --display-name="New Person" \
  --type=email \
  --channel-labels=email_address=person@your-domain.com

Debugging Server Crashes

When investigating server issues (like the 2026-02-13 systemd-journald crash):

  1. View logs around the crash time:

    • Go to Cloud Logging Console
    • Filter: resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
    • Set time range to include the crash
    • Look for ERROR/WARNING severity
  2. Check metrics before the crash:

    • Go to Dashboard or Metrics Explorer
    • View CPU/Memory/Disk graphs for the time period
    • Look for spikes or anomalies
  3. Correlate logs with metrics:

    • High CPU spike at 15:20? Check logs from that time
    • Memory growth over time? Look for memory leaks in logs
  4. Export for analysis:

    # Export logs to file
    gcloud logging read "resource.labels.instance_id=\"656c1763-11a1-49bb-bbc3-9782acf15aef\"" \
      --project=kids-ai-data-analysis \
      --limit=1000 \
      --format=json \
      --freshness=1d > server_logs.json
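
Once exported, the file can be summarized offline. The sketch below assumes server_logs.json is a JSON array of log entries where each entry may carry a severity field; entries without one are counted as DEFAULT. The function names are illustrative.

```python
import json
from collections import Counter


def load_entries(path: str = "server_logs.json") -> list[dict]:
    """Load the exported log entries (a JSON array) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def severity_counts(entries: list[dict]) -> Counter:
    """Count entries per severity; a missing severity is treated as DEFAULT."""
    return Counter(entry.get("severity", "DEFAULT") for entry in entries)
```

For example, `severity_counts(load_entries()).most_common()` lists severities by frequency, which helps decide where to look first.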
    

Best Practices

  1. Structured logging: Applications should log in JSON format for better searchability
  2. Log levels: Use appropriate levels (ERROR for problems, INFO for events, DEBUG for details)
  3. Alert fatigue: Only alert on actionable issues, not informational events
  4. Regular review: Check dashboard weekly to spot trends before they become problems
  5. Cost monitoring: If ingestion grows, consider log sampling or exclusion filters
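
For the structured-logging recommendation, a minimal JSON formatter built on the standard logging module might look like this. The class name, the build_logger helper, and the field names (timestamp, severity, logger, message) are assumptions; whether the logging agent parses these fields depends on its configuration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate plain-text output via the root logger
    return logger
```

A service started by systemd that logs this way to stderr ends up in the journal, one JSON object per line.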