Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.
Data Broker Server
Central server for distributing data to AI analytical systems.
Basic Information
| Parameter | Value |
|---|---|
| Name | data-broker-for-claude |
| GCP Project | kids-ai-data-analysis |
| Zone | europe-north1-a |
| Type | e2-medium |
| OS | Debian 12 (bookworm) |
| External IP | YOUR_SERVER_IP |
Hardware
| Resource | Size |
|---|---|
| RAM | 3.8 GB |
| Swap | 2 GB (/mnt/swapfile) |
| System disk (sda) | 10 GB - OS, packages, app (expendable) |
| Data disk (sdb) | 30 GB - /data, pd-balanced (snapshotted) |
| Home disk (sdc) | 30 GB - /home, pd-balanced (snapshotted) |
| Temp disk (sdd) | 100 GB - /tmp, pd-standard (not snapshotted) |
Access
SSH connection (admin)
ssh kids
Requires SSH config:
Host kids
HostName YOUR_SERVER_IP
User padak
IdentityFile ~/.ssh/google_compute_engine
Or via gcloud:
gcloud compute ssh data-broker-for-claude --project=kids-ai-data-analysis --zone=europe-north1-a
Data Structure
/data/ # Data disk (30 GB, pd-balanced)
├── lost+found/ # System directory
├── src_data/ # Source data (group: dataread, 750)
│ ├── raw/ # Raw data from Keboola (reserved for future use)
│ ├── parquet/ # Converted data (parquet format)
│ │ ├── sales/ # CRM data (in.c-crm bucket) - group: dataread
│ │ └── private/ # Private data - group: data-private
│ ├── metadata/ # Sync state, cache, profiles
│ │ ├── sync_state.json # Per-table sync stats (rows, columns, size)
│ │ └── profiles.json # Data profiler output (mode 644, ~900 KB)
│ └── staging/ # Temporary processing (reserved for future use)
├── docs/ # Documentation (deployed from repo)
│ └── schema.yml # Auto-generated table schemas (from data sync)
├── scripts/ # Helper scripts (deployed from repo)
├── examples/ # Example notification scripts (padak:data-ops, 755)
│ └── notifications/ # Example notification scripts for analysts
├── notifications/ # Notification data (deploy:data-ops, 2770 setgid)
│ ├── telegram_users.json # username -> {chat_id, linked_at} mapping
│ ├── desktop_users.json # username -> {linked_at} mapping (desktop app link state)
│ ├── pending_codes.json # temporary verification codes
│ └── bot.log # Bot service log
├── auth/ # Password auth data (www-data:data-ops, 2770 setgid)
│ └── users.json # Hashed passwords and metadata
├── corporate-memory/ # Knowledge base data (deploy:data-ops, 2770 setgid)
│ ├── knowledge.json # Collected knowledge items from CLAUDE.local.md files
│ ├── votes.json # User votes on knowledge items
│ └── user_hashes.json # MD5 hashes for change detection
└── user_sessions/ # Session collector data (root:data-ops, 2770 setgid)
└── *.jsonl # User session logs collected every 6 hours
/run/notify-bot/ # Systemd RuntimeDirectory (mode 0755)
└── bot.sock # Unix socket for send API (mode 0666)
/tmp/keboola_load/ # Keboola staging directory (root:data-ops, 2770 setgid)
└── *.parquet # Temporary Parquet files during Keboola data load
Folder Mapping
Parquet subfolders are mapped from Keboola bucket names in docs/data_description.md:
folder_mapping:
  in.c-crm: sales          # CRM/Salesforce data
  in.c-private: private    # Private/sensitive data
This mapping is used by src/config.py to determine where to save Parquet files.
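A minimal sketch of how such a lookup could work. The function name `parquet_dir_for` and the assumption that a table ID has the form `<stage>.<bucket>.<table>` are illustrative only; the actual logic in src/config.py may differ.

```python
from pathlib import Path

# Mapping copied from the folder_mapping example above.
FOLDER_MAPPING = {
    "in.c-crm": "sales",
    "in.c-private": "private",
}


def parquet_dir_for(table_id: str, data_dir: Path) -> Path:
    """Return the Parquet folder for a table ID such as 'in.c-crm.company'."""
    # Assumption: the bucket is everything before the last dot of the table ID.
    bucket, _, _table = table_id.rpartition(".")
    try:
        folder = FOLDER_MAPPING[bucket]
    except KeyError:
        raise ValueError(f"No folder mapping for bucket {bucket!r}") from None
    return data_dir / "parquet" / folder


print(parquet_dir_for("in.c-crm.company", Path("/data/src_data")))
```

Raising on an unmapped bucket, rather than falling back to a default folder, avoids silently writing private data into a publicly readable directory.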
Access Control
Three-tier permission model:
| Role | Groups | Access |
|---|---|---|
| Standard Analyst | dataread | Public data read-only |
| Privileged Analyst | dataread + data-private | Public + private data read-only |
| Admin | sudo + google-sudoers + dataread + data-private + data-ops | Full server access (NOPASSWD) + all data read/write + deployment |
- Standard Analyst - can read public data, sync via rsync, run scripts in their workspace
- Privileged Analyst - same as standard + access to private/sensitive data (executives, management)
- Admin - server administration, can add/remove users, has sudo privileges, full data access with write permissions, can deploy application updates
Data Directory Permissions
Data in /data/src_data/ uses ACL for granular access:
/data/src_data/ owner: padak, group: data-ops
├── raw/ data-ops: rwx, dataread: r-x
├── parquet/ data-ops: rwx, dataread: r-x
│ └── private/ data-ops: rwx, data-private: r-x
└── staging/ data-ops: rwx, dataread: r-x
- Admins (data-ops): Full read/write access to prepare data
- Analysts (dataread): Read-only access to consume data
- Private data (data-private): Additional group for sensitive data access
Atomic writes and ACL — required pattern:
Directories under /data/ use default ACLs (e.g., default:group:data-ops:rwx). Files created with open() inherit these correctly. However, tempfile.mkstemp() explicitly sets mode 0600, which overrides the ACL mask to --- and silently breaks group access for all other services.
Always use os.fchmod() immediately after mkstemp():
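import json
import os
import tempfile

# target is a pathlib.Path; data is any JSON-serializable object
fd, tmp_path = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
os.fchmod(fd, 0o660)  # REQUIRED: restore ACL mask for group access
try:
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, str(target))  # atomic rename within the same directory
except Exception:
    os.unlink(tmp_path)  # remove the partial temp file
    raise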
Use 0o660 for files accessed by services via data-ops group ACL, 0o644 for world-readable files (e.g., profiler output). See #203 for a production incident caused by missing fchmod.
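The effect is easy to observe in isolation. The sketch below runs in a scratch directory (without default ACLs, so it only shows the plain mode bits) and prints the permissions before and after `os.fchmod()`.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    fd, tmp_path = tempfile.mkstemp(dir=workdir, suffix=".tmp")
    try:
        # mkstemp() always creates the file as owner-only read/write.
        mode_before = stat.S_IMODE(os.fstat(fd).st_mode)
        # fchmod() is not subject to the umask, so the result is exact.
        os.fchmod(fd, 0o660)
        mode_after = stat.S_IMODE(os.fstat(fd).st_mode)
    finally:
        os.close(fd)
        os.unlink(tmp_path)

print(oct(mode_before), oct(mode_after))
```

On a directory with default ACLs, the group bits shown here double as the ACL mask, which is why the 0600 default cuts off every named group entry.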
Per-issue file locking for concurrent writers:
When multiple services write to the same JSON file (e.g., SLA poll and webhook handler both updating /data/src_data/raw/jira/issues/SUPPORT-1234.json), use advisory file locking to prevent races:
from src.jira_file_lock import issue_json_lock

with issue_json_lock(issues_dir, issue_key):
    # read JSON, modify, atomic write, transform to Parquet
    ...
- Uses fcntl.flock() (POSIX advisory, blocking, exclusive)
- Lock files stored in {issues_dir}/.locks/{issue_key}.lock
- Different issue keys don't block each other (fine-grained locking)
- The lock must cover the entire read-modify-write and the Parquet transform — otherwise another writer could overwrite the JSON between write and transform, causing the transform to read stale data
Currently used by:
- scripts/jira_poll_sla.py: wraps SLA+status update + transform_single_issue()
- webapp/jira_service.py: wraps save_issue() JSON write + trigger_incremental_transform(), and _handle_deletion() read-modify-write + transform
Attachment downloads in save_issue() intentionally run outside the lock (can take tens of seconds and don't modify JSON).
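For readers unfamiliar with the pattern, here is a rough sketch of what a per-issue lock context manager could look like. It is not the code in src/jira_file_lock; the name `issue_lock_sketch` and the 0o660 lock-file mode are assumptions made for illustration.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def issue_lock_sketch(issues_dir: Path, issue_key: str):
    """Hold an exclusive advisory lock for one issue key (hypothetical helper)."""
    lock_dir = Path(issues_dir) / ".locks"
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{issue_key}.lock"
    # The lock file is only a handle for flock(); its contents are never used.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o660)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield lock_path
    finally:
        # Closing the descriptor would also drop the lock; unlock explicitly for clarity.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Because each issue key gets its own lock file, two writers only wait on each other when they touch the same issue.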
User Management
Each user has:
- Own Linux account with home directory /home/username/
- Server symlinks: /home/username/server/ (read-only links to /data/)
- User workspace: /home/username/user/ (writable: duckdb, notifications, artifacts, scripts, parquet)
- Notification state: /home/username/.notifications/{state,logs}
- SSH key authentication
Management Commands
# Add standard analyst (public data only)
sudo add-analyst username "ssh-rsa AAAA... comment"
# Add privileged analyst (public + private data)
sudo add-analyst username "ssh-rsa AAAA... comment" --private
# Add server admin (sudo + all data)
sudo add-admin username "ssh-rsa AAAA... comment"
# List all analysts
list-analysts
# Remove user (interactive)
sudo remove-analyst username
# Remove user (non-interactive, e.g., via SSH)
sudo remove-analyst username --force
Examples
# Regular analyst
sudo add-analyst novak "ssh-rsa AAAAB3... jan.novak@example.com"
# Executive with private data access
sudo add-analyst ceo "ssh-rsa AAAAB3... ceo@example.com" --private
# Server administrator
sudo add-admin matejkys "ssh-rsa AAAAB3... matejkys@example.com"
sudo add-admin dasa "ssh-ed25519 AAAAC3... dasa@your-domain.com"
Output for admin:
Admin matejkys created successfully
- Added to group: sudo (server administration)
- Added to group: dataread (public data access)
- Added to group: data-private (private data access)
- Added to group: data-ops (application deployment)
- Added to resource limits (unlimited)
- Workspace: /home/matejkys/workspace
- Data link: /home/matejkys/data -> /data/src_data
SSH Configuration
- Passwords disabled (SSH keys only)
- Root login disabled
- MaxSessions: 20 (per user)
- MaxStartups: 30:50:100 (rate limiting for DDoS protection)
- ClientAliveInterval: 300s
Resource Limits
Protection against fork bombs and resource abuse. Configuration is version-controlled in server/limits-users.conf and deployed automatically by deploy.sh to /etc/security/limits.d/99-users.conf:
| Resource | Analysts | Admins |
|---|---|---|
| Max processes (nproc) | 100/150 | unlimited |
| Virtual memory (as) | 4 GB / 6 GB | unlimited |
| File size (fsize) | 2 GB / 4 GB | unlimited |
| Open files (nofile) | 1024/2048 | 65535 |
| Core dumps | disabled | unlimited |
- Admins (data-ops group members) are explicitly listed in the limits file with unlimited access
- New admins are automatically added to exceptions by the add-admin script
- All other users get restricted limits via wildcard rule (protection against fork bombs)
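To check which limits actually apply to a session, a user can query them from Python. This sketch assumes the paired values in the table (for example 100/150) are soft/hard limits. Note that `RLIMIT_AS` and `RLIMIT_FSIZE` are reported in bytes here, whereas limits.conf expresses them in KB.

```python
import resource

# Limit names as used in limits.conf, mapped to the matching resource constants.
LIMITS = {
    "nproc": resource.RLIMIT_NPROC,
    "as": resource.RLIMIT_AS,
    "fsize": resource.RLIMIT_FSIZE,
    "nofile": resource.RLIMIT_NOFILE,
    "core": resource.RLIMIT_CORE,
}


def current_limits() -> dict:
    """Return {name: (soft, hard)} for the current process."""
    return {name: resource.getrlimit(const) for name, const in LIMITS.items()}


def fmt(value: int) -> str:
    """Render RLIM_INFINITY as 'unlimited', everything else as a number."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in current_limits().items():
    print(f"{name:7} soft={fmt(soft)} hard={fmt(hard)}")
```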
Data Sync Scripts
Server: update.sh
Syncs data from Keboola to Parquet files. Run via cron 3x daily (6:00, 14:00, 19:00 UTC).
cd /opt/data-analyst/repo && ./scripts/update.sh
What it does:
- Activates virtual environment (supports both local ./.venv and server /opt/data-analyst/.venv)
- Downloads data from Keboola Storage API, converts to Parquet format in DATA_DIR/parquet/{folder}/
- Generates data profiles (python -m src.profiler → profiles.json); non-fatal if it fails
Cron setup:
sudo crontab -u deploy -e
# Add:
# MAILTO=admin@your-domain.com
# 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
Client: sync_data.sh
Main sync script for analysts. Syncs docs, scripts, data, and regenerates CLAUDE.md:
bash server/scripts/sync_data.sh # Full sync (pull server/ + push user/)
bash server/scripts/sync_data.sh --dry-run # Preview only
bash server/scripts/sync_data.sh --push # Only upload user/ to server
What it does:
- Syncs server/docs/, server/scripts/, server/examples/, server/metadata/ from server
- Regenerates CLAUDE.md from latest template (preserves username, never touches CLAUDE.local.md)
- Updates .claude/settings.json with project permissions from server
- Syncs parquet data files to server/parquet/ (incremental)
- Uploads user/ to server (backup + runtime for notifications)
- Downloads corporate memory rules from ~/.claude_rules/ to .claude/rules/
- Updates sync timestamp on server (touch ~/server/) - used by the webapp Account card "Last Sync" display. Each user's ~/server/ directory is per-user, so the timestamp is independent.
- Reinitializes DuckDB in user/duckdb/ (core tables via duckdb_manager.py, optional dataset views via sync_jira.sh --views-only etc.)
Note: Rsync uses --delete to remove obsolete files from client (e.g., old monthly partitions when switching to daily). Files are compared by mtime+size (no --checksum for better performance). If rsync is not available (Windows without WSL), scp is used as fallback with explicit dotfile handling.
CLAUDE.md update mechanism:
- CLAUDE.md is regenerated from server/docs/setup/claude_md_template.txt on every sync
- Template is maintained centrally and deployed to server via CI/CD
- User's personal CLAUDE.local.md is never overwritten (higher priority in Claude Code)
- New features added to template are automatically delivered to all analysts on next sync
Claude Code settings.json:
- .claude/settings.json is copied from server/docs/setup/claude_settings.json on every sync
- Contains project-wide permissions (allow/deny/ask rules for tools)
- Protects server/ directory from accidental modifications by Claude
- Centrally managed - analysts cannot override these permissions locally
Client: init.sh + setup_views.sh
First time setup (init.sh):
./scripts/init.sh
Creates virtual environment, installs dependencies, and creates data folders including duckdb/.
After rsync (setup_views.sh):
bash server/scripts/setup_views.sh
Initializes DuckDB views from synced Parquet files. DuckDB database is created at user/duckdb/analytics.duckdb.
Steps:
- Activates virtual environment
- Runs duckdb_manager.py --reinit for core Keboola tables (from data_description.md)
- Calls optional dataset scripts with --views-only flag:
  - If server/parquet/jira/ exists → sync_jira.sh --views-only (creates jira_issues, jira_comments, jira_attachments, jira_changelog views)
  - Future datasets follow the same pattern (e.g., sync_github.sh --views-only)
Convention: Each data source sync script (e.g., sync_jira.sh) manages its own DuckDB views. The --views-only flag creates/refreshes views without syncing data. This keeps duckdb_manager.py focused on core tables while optional datasets are self-contained.
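As an illustration of what "creating a view" over synced Parquet files amounts to, the sketch below builds the SQL statement only. The helper name `view_sql` and the folder layout in the example are assumptions; the real scripts may generate views differently.

```python
from pathlib import Path


def view_sql(view_name: str, parquet_dir: Path) -> str:
    """Build a CREATE VIEW statement over every Parquet file in one folder."""
    if not view_name.isidentifier():
        raise ValueError(f"Unsafe view name: {view_name!r}")
    # Escape single quotes so the path is a valid SQL string literal.
    pattern = (parquet_dir / "*.parquet").as_posix().replace("'", "''")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{pattern}')"
    )


# Hypothetical folder layout, for illustration only.
print(view_sql("jira_issues", Path("server/parquet/jira/issues")))
```

The resulting string can be passed to a DuckDB connection's `execute()` method. Because a view stores only the query, it picks up new Parquet files after the next sync without being rebuilt.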
Server Purpose
- Sync from Keboola - periodically pulls data from Keboola Storage
- Convert to Parquet - transforms data to efficient format
- Chunking - splits data by hour for incremental sync
- Distribution - clients pull data via rsync to local machines
- On-server analysis - analysts can run scripts directly on the server
Usage Guide
User Types
| Type | Groups | Data Access | Use Case |
|---|---|---|---|
| Standard Analyst | dataread | Public data | Regular analysts, data scientists |
| Privileged Analyst | dataread + data-private | Public + private | Executives, management |
| Admin | sudo + data-ops + all data groups | Everything + server + deployment | DevOps, IT team |
- Standard analysts see all company data except sensitive information stored in private/
- Privileged analysts have access to everything including executive reports and financial details
- Admins can manage the server, add/remove users, and have full sudo access
What Each User Gets
Every analyst has their own Linux account with:
/home/username/
├── server/ # Symlinks to shared read-only data on /data
│ ├── docs -> /data/docs
│ ├── scripts -> /data/scripts
│ ├── examples -> /data/examples
│ ├── parquet -> /data/src_data/parquet
│ └── metadata -> /data/src_data/metadata
├── user/ # User's OWN writable directories
│ ├── duckdb/ # Per-user DuckDB database
│ │ └── analytics.duckdb
│ ├── notifications/ # Notification scripts (*.py)
│ ├── artifacts/ # Analysis outputs
│ ├── scripts/ # Custom scripts
│ └── parquet/ # Custom parquet files
├── .notifications/ # Notification runner state
│ ├── state/ # Cooldown tracking per script
│ └── logs/ # Runner and cron logs
└── .ssh/authorized_keys # SSH key for authentication
- Home directory (/home/username/) - private space for each user
- Server data (~/server/) - read-only symlinks to shared /data/ on disk
- User workspace (~/user/) - writable directories for user's own files
- DuckDB (~/user/duckdb/analytics.duckdb) - per-user database built from shared parquet
Typical Workflow
Option A: Local analysis with rsync (recommended)
- Analyst syncs data to their local machine:
    # Recommended: use the sync script
    bash server/scripts/sync_data.sh
    # Or manual rsync
    rsync -avz data-analyst:server/parquet/ ./server/parquet/
- Run analysis locally with Claude Code or other tools
- Data stays on analyst's machine - they can do whatever they want with it
Option B: Server-side analysis
- SSH into the server:
    ssh username@YOUR_SERVER_IP
- Work in personal workspace:
    cd ~/user
    # Run scripts, analyze data from ~/server/parquet/
- Copy results back to local machine if needed
Data Access Examples
Standard analyst (public data only):
$ ls ~/server/parquet/
sales/ products/ customers/ orders/ private/
$ ls ~/server/parquet/private/
ls: cannot open directory 'private/': Permission denied
Privileged analyst (public + private):
$ ls ~/server/parquet/
sales/ products/ customers/ orders/ private/
$ ls ~/server/parquet/private/
executive_reports/ financial_details/ board_materials/
Rsync Permissions
When syncing with rsync:
- Standard analysts will get "Permission denied" errors for private/ folder (expected)
- Use --exclude='private/' to skip it cleanly:
    rsync -avz --exclude='private/' data-analyst:server/parquet/ ./server/parquet/
- Privileged analysts can sync everything including private data
Monitoring
Cloud Monitoring (GCP)
Ops Agent is installed and reports VM metrics to Cloud Monitoring, including disk space utilization.
Installation (already done):
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
Check agent status:
sudo systemctl status google-cloud-ops-agent
Available metrics:
- agent.googleapis.com/disk/percent_used - Disk utilization percentage
- agent.googleapis.com/memory/percent_used - Memory utilization
- agent.googleapis.com/cpu/utilization - CPU usage
- agent.googleapis.com/network/traffic - Network I/O
View metrics in GCP Console:
- Go to Cloud Console > Monitoring > Metrics Explorer
- Select resource type: VM Instance
- Select metric: agent.googleapis.com/disk/percent_used
- Filter by device: /dev/sdb (data disk)
Alert Policy for Disk Space:
Alert triggers when /data partition exceeds 85% usage for 5 minutes.
To create the alert policy manually:
- Go to Cloud Console > Monitoring > Alerting
- Click Create Policy
- Click Add Condition:
  - Resource type: VM Instance
  - Metric: agent.googleapis.com/disk/percent_used
  - Filter: metadata.system_labels.device="/dev/sdb" AND metadata.system_labels.state="used"
  - Threshold: > 85
  - Duration: 5 minutes
- Click Next > Notifications (add email/Slack channel)
- Click Next > Documentation:
    Disk /data partition is above 85% full.
    Check /data/src_data/ for large files or run cleanup.
    Common causes:
    - Keboola data sync (check cron logs)
    - bot.log growth (check /data/notifications/bot.log)
    - Jira attachments (check /data/src_data/raw/jira/attachments/)
- Name: "Disk Space Alert - /data partition"
- Click Create Policy
Cost: Free tier (first 150 time series free, this VM uses ~25)
Dashboard: Available in GCP Console > Monitoring > Dashboards > "VM Instances"
Local Monitoring
# Server status
ssh kids "uptime && free -h && df -h / /data /home"
# Active users
ssh kids "who"
# Recent logins
ssh kids "last | head -20"
# Check disk space for all partitions
ssh kids "df -h"
# Check disk usage by directory
ssh kids "du -sh /data/*"
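The same 85% threshold used by the Cloud Monitoring alert can be checked directly on the server. This is a minimal sketch using only the standard library; the figure may differ slightly from `df`, which accounts for blocks reserved for root.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # same threshold as the Cloud Monitoring alert


def percent_used(path: str) -> float:
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str, threshold: float = THRESHOLD_PERCENT) -> bool:
    """True when the filesystem is fuller than the alert threshold."""
    return percent_used(path) > threshold


if __name__ == "__main__":
    for mount in ("/", "/data", "/home"):
        try:
            flag = " (ALERT)" if over_threshold(mount) else ""
            print(f"{mount}: {percent_used(mount):.1f}% used{flag}")
        except FileNotFoundError:
            print(f"{mount}: not present on this machine")
```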
Backup & Disaster Recovery
Disk Layout
| Disk | Mount | Size | Purpose | Backup |
|---|---|---|---|---|
| data-broker-for-claude (sda) | / | 10 GB | OS, packages, app | Expendable (rebuild from git) |
| data-disk (sdb) | /data | 30 GB | Parquet data, docs, scripts | Daily GCP snapshots |
| home-disk (sdc) | /home | 30 GB | User homes, SSH keys, workspaces | Daily GCP snapshots |
| tmp-disk (sdd) | /tmp | 100 GB | Temporary files | Expendable (not snapshotted) |
Automatic Snapshots
Both data-disk and home-disk have daily GCP snapshot schedules with 14-day retention. Setup via server/setup-snapshot-schedule.sh.
# Check snapshot schedule status
gcloud compute resource-policies describe daily-backup \
--project=kids-ai-data-analysis --region=europe-north1
# List existing snapshots
gcloud compute snapshots list --project=kids-ai-data-analysis
# Manual snapshot (if needed)
gcloud compute disks snapshot data-disk home-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--snapshot-names=data-disk-$(date +%Y%m%d),home-disk-$(date +%Y%m%d)
Recovery
See disaster-recovery.md for detailed recovery procedures for each failure scenario.
Application Deployment
Directory Structure
/opt/data-analyst/ # Application directory (group: data-ops)
├── repo/ # Git repository
│ ├── src/ # Python source code
│ ├── scripts/ # Data sync scripts
│ ├── server/ # Server management scripts
│ │ ├── bin/ # add-analyst, notify-runner, notify-scripts, etc.
│ │ └── telegram_bot/ # Telegram bot service
│ ├── webapp/ # Flask web application
│ └── examples/ # Example notification scripts
├── .venv/ # Python virtual environment
├── .env # Webapp env (Google OAuth, secret key)
└── logs/ # Application logs
CI/CD Pipeline
Application is automatically deployed via GitHub Actions when changes are pushed to main branch.
How it works:
- Push to main triggers GitHub Actions workflow
- Action connects to server via SSH as deploy user
- Runs /opt/data-analyst/repo/server/deploy.sh
- Deploy script:
  - Pulls latest code from origin/main
  - Updates server management scripts in /usr/local/bin/
  - Updates sudoers configurations (/etc/sudoers.d/)
  - Updates resource limits (/etc/security/limits.d/99-users.conf)
  - Deploys notify-runner and notify-scripts to /usr/local/bin/
  - Creates data directories:
    - /data/notifications/ (notification state)
    - /data/src_data/raw/jira/ (Jira webhook data)
    - /data/auth/ (password auth)
    - /data/corporate-memory/ (knowledge base)
    - /data/user_sessions/ (session logs)
    - /data/examples/ (example scripts)
    - /tmp/keboola_load/ (Keboola staging)
  - Deploys systemd units:
    - notify-bot.service (Telegram bot)
    - ws-gateway.service (WebSocket gateway)
    - corporate-memory.{service,timer} (knowledge collector)
    - jira-sla-poll.{service,timer} (SLA refresh)
    - jira-consistency.{service,timer,timer-deep} (data integrity monitoring)
    - session-collector.{service,timer} (session logs)
  - Sets ACLs for Jira attachments (dataread group)
  - Creates/updates Keboola .env file (if secrets provided)
  - Sets correct permissions on /opt/data-analyst/
  - Restarts webapp, notify-bot, ws-gateway services
  - Enables/starts timers (if credentials configured)
Deploy user permissions:
The deploy user has limited sudo access defined in /etc/sudoers.d/deploy:
Core Operations:
- Can copy scripts to /usr/local/bin/
- Can update sudoers files in /etc/sudoers.d/
- Can manage permissions on /opt/data-analyst/
- Can update resource limits in /etc/security/limits.d/
Service Management:
- Can restart/reload webapp, nginx services
- Can manage notify-bot, ws-gateway services
- Can manage corporate-memory timer
- Can manage jira-sla-poll timer
- Can manage jira-consistency timers (incremental + deep)
- Can manage session-collector timer
- Can run systemctl daemon-reload
Data Directories:
- Can manage /data/scripts/ (helper scripts for analysts)
- Can manage /data/docs/ (documentation)
- Can manage /data/notifications/ (notification state)
- Can manage /data/examples/ (example scripts)
- Can manage /data/src_data/raw/jira/ (Jira webhook data)
- Can manage /data/auth/ (password auth state)
- Can manage /data/corporate-memory/ (knowledge base)
- Can manage /data/user_sessions/ (session collector data)
- Can manage /tmp/keboola_load/ (Keboola staging directory)
Special Permissions:
- Can run notify-scripts as any user (list/run notification scripts)
- Can set ACLs on Jira attachments (dataread group access)
- Can create log files in /opt/data-analyst/logs/
Full sudoers reference: server/sudoers-deploy in repository
Note: On Debian 12, core utils are in /usr/bin/ (not /bin/). The sudoers file uses full paths like /usr/bin/cp, /usr/bin/chmod, etc.
Initial Setup (one-time)
1. Install prerequisites:
sudo apt-get update
sudo apt-get install -y git python3.11-venv python3-pip
2. Create deploy user and SSH key for GitHub:
# Create deploy user
sudo useradd -m -s /bin/bash deploy
sudo groupadd data-ops 2>/dev/null || true
sudo usermod -aG data-ops deploy
# Generate SSH key for GitHub
sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
# Configure SSH for GitHub
sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
sudo chmod 600 /home/deploy/.ssh/config
# Show public key (add this to GitHub as Deploy Key)
sudo cat /home/deploy/.ssh/id_ed25519.pub
3. Add Deploy Key to GitHub:
- Go to: https://github.com/keboola/internal_ai_data_analyst/settings/keys
- Click "Add deploy key"
- Title: data-broker-server
- Key: (paste public key from previous step)
- Allow write access: NO
4. Clone repository and run setup:
sudo mkdir -p /opt/data-analyst
sudo chown deploy:data-ops /opt/data-analyst
sudo -u deploy git clone git@github.com:keboola/internal_ai_data_analyst.git /opt/data-analyst/repo
sudo git config --global --add safe.directory /opt/data-analyst/repo
sudo -u deploy git config --global --add safe.directory /opt/data-analyst/repo
sudo /opt/data-analyst/repo/server/setup.sh
5. Add existing admins to data-ops group:
sudo usermod -aG data-ops padak
sudo usermod -aG data-ops matejkys
sudo usermod -aG data-ops dasa
GitHub Secrets Required
Set these in GitHub repository settings (Settings > Secrets > Actions):
| Secret | Value |
|---|---|
| SERVER_HOST | YOUR_SERVER_IP |
| SERVER_USER | deploy |
| SERVER_SSH_KEY | Private SSH key (/home/deploy/.ssh/id_ed25519) |
| TELEGRAM_BOT_TOKEN | Telegram Bot API token (from @BotFather) |
| SENDGRID_API_KEY | SendGrid API key for password auth emails |
| ALLOWED_EMAILS | Comma-separated whitelisted emails for password auth |
Manual Deployment
Admins can trigger deployment manually:
# Via GitHub Actions UI (Actions > Deploy to Server > Run workflow)
# Or via SSH:
ssh kids "cd /opt/data-analyst/repo && ./server/deploy.sh"
Deployment Logs
# View deployment history
cat /opt/data-analyst/logs/deploy.log
# Follow live deployment
tail -f /opt/data-analyst/logs/deploy.log
Troubleshooting CI/CD
"sudo: a terminal is required to read the password"
- Deploy user is missing NOPASSWD sudo permission for a specific command
- Check /etc/sudoers.d/deploy exists and has correct permissions (440)
- Verify the command path matches (Debian 12 uses /usr/bin/, not /bin/)
- Fix: Add missing permission to server/sudoers-deploy and redeploy:
    # Edit server/sudoers-deploy in repo
    # Add the missing command with full path
    deploy ALL=(ALL) NOPASSWD: /usr/bin/command-name args
    # Commit and push
    git add server/sudoers-deploy
    git commit -m "Add missing sudo permission"
    git push origin main
    # Manually update on server (one-time)
    ssh kids "sudo cp /opt/data-analyst/repo/server/sudoers-deploy /etc/sudoers.d/deploy"
    ssh kids "sudo chmod 440 /etc/sudoers.d/deploy"
"Permission denied" on .env file
- Deploy user cannot write directly to files owned by root
- Solution: Use sudo /usr/bin/tee instead of direct file write
Deploy script changes not taking effect
- The deploy script pulls new code AFTER it starts running
- Changes to deploy.sh itself require manual pull first:
    ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && git pull'"
Verify sudoers configuration:
# Check if sudoers file exists and has correct permissions
ssh kids "ls -la /etc/sudoers.d/deploy"
# Validate syntax (exit code 0 = OK)
ssh kids "sudo visudo -cf /etc/sudoers.d/deploy && echo 'Syntax OK'"
# View current sudoers rules
ssh kids "sudo cat /etc/sudoers.d/deploy"
Test deploy locally as deploy user:
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
Web Application (Self-Service Portal)
A web application at https://your-instance.example.com allows team members to create their own analyst accounts via Google SSO.
Features
- Google Sign-In (restricted to @your-domain.com emails only)
- Email/password login for external users (whitelisted emails)
- Self-service account creation for new users
- Dashboard showing account info for existing users (2-column layout)
- Dynamic data stats (tables, columns, rows, size) loaded from sync_state.json
- Data catalog page with dynamic table listings from data_description.md + sync_state.json
- Data profiler with per-column statistics, visualizations, and alerts (from profiles.json)
- SSH connection instructions
- Claude Code integration hints for AI-assisted setup
- Telegram notification linking
- macOS desktop app linking/unlinking with install instructions
User Flow
- User visits https://your-instance.example.com
- Signs in with Google (@your-domain.com account)
- Dashboard shows instructions and form for SSH key
- User can ask Claude Code to generate SSH key and guide them
- After pasting SSH key, account is created automatically
- User syncs data and starts analyzing with Claude Code
Dynamic Data Stats
Dashboard and catalog pages display live data statistics (table count, columns, rows, size). These are loaded dynamically from sync_state.json on every page request - no webapp restart needed.
Data flow:
Cron (update.sh) → data_sync.py → /data/src_data/metadata/sync_state.json
↓
Flask reads on request → dashboard + catalog templates
- sync_state.json is updated by the data sync process with per-table stats (rows, columns, file size)
- Flask aggregates these into totals for display
- If sync_state.json is missing or unreadable, hardcoded fallback values are used
- Catalog page merges data_description.md (table names, descriptions, categories) with sync_state.json (row counts)
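The aggregation-with-fallback behaviour can be sketched as follows. The JSON layout (`{"tables": {name: {"rows", "columns", "size_bytes"}}}`), the function name `load_totals`, and the zeroed fallback values are all assumptions for illustration; the actual file format and webapp code are not shown in this document.

```python
import json
from pathlib import Path

# Hypothetical fallback values; the real ones are hardcoded in the webapp.
FALLBACK = {"tables": 0, "columns": 0, "rows": 0, "size_bytes": 0}


def load_totals(path: Path) -> dict:
    """Aggregate per-table stats into totals, or return FALLBACK on any read error."""
    try:
        state = json.loads(path.read_text())
        tables = state["tables"]  # assumed layout: {"tables": {name: {...}}}
        return {
            "tables": len(tables),
            "columns": sum(t["columns"] for t in tables.values()),
            "rows": sum(t["rows"] for t in tables.values()),
            "size_bytes": sum(t["size_bytes"] for t in tables.values()),
        }
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Missing file, invalid JSON, or unexpected structure: degrade gracefully.
        return dict(FALLBACK)
```

Reading the file on each request keeps the dashboard current without a webapp restart, at the cost of one small file read per page view.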
Architecture
Browser -> Nginx (HTTPS/Let's Encrypt) -> Gunicorn -> Flask App
|
v
sudo add-analyst (via sudoers)
Setup
1. Run webapp setup script:
sudo /opt/data-analyst/repo/server/webapp-setup.sh
2. Configure Google OAuth:
- Go to Google Cloud Console
- Create OAuth 2.0 Client ID (Web application)
- Authorized JavaScript origins: https://your-instance.example.com
- Authorized redirect URIs: https://your-instance.example.com/authorize
3. Update environment file:
sudo nano /opt/data-analyst/.env
# Add:
WEBAPP_SECRET_KEY=<generate with: python -c "import secrets; print(secrets.token_hex(32))">
GOOGLE_CLIENT_ID=<from Google Console>
GOOGLE_CLIENT_SECRET=<from Google Console>
4. Start/restart webapp:
sudo systemctl restart webapp
Monitoring
# Service status
sudo systemctl status webapp
sudo systemctl status nginx
# Logs
tail -f /opt/data-analyst/logs/webapp-access.log
tail -f /opt/data-analyst/logs/webapp-error.log
# Test endpoint
curl -I https://your-instance.example.com/health
Security Notes
- Only @your-domain.com emails can log in via Google OAuth
- External users can log in via email/password if their email is whitelisted
- Self-service creates standard analyst accounts only (no --private flag)
- www-data is member of data-ops group (for access to /opt/data-analyst and static files)
- www-data can only run add-analyst and notify-scripts via sudoers (not add-admin) - configured in /etc/sudoers.d/webapp
- HTTPS enforced with Let's Encrypt certificate
- SSH keys are validated before passing to add-analyst script
- Reserved system usernames (root, admin, deploy, etc.) are blocked from registration
- Username collision with existing system accounts shows error and requires admin intervention
- Password auth uses Argon2id hashing (the algorithm OWASP recommends for password storage) with rate limiting (5 failed attempts per minute per email)
- Magic links for password setup expire in 24 hours, reset links in 1 hour
Technical Notes
Sudoers configuration:
The webapp needs sudo access to run add-analyst and notify-scripts. This is configured via server/sudoers-webapp file which is deployed to /etc/sudoers.d/webapp:
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/add-analyst
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts
Absolute paths requirement:
Gunicorn runs with a restricted PATH (only /opt/data-analyst/.venv/bin). Therefore, all system commands in Python code must use absolute paths:
- /usr/bin/sudo (not just sudo)
- /usr/local/bin/add-analyst
- /usr/local/bin/notify-scripts
This is handled in webapp/user_service.py and server/telegram_bot/runner.py.
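A sketch of what such a call could look like from Python. The function names are illustrative and not taken from webapp/user_service.py; the `-n` flag (make sudo fail instead of prompting) is an addition assumed here, not something the source specifies.

```python
import subprocess

SUDO = "/usr/bin/sudo"
ADD_ANALYST = "/usr/local/bin/add-analyst"


def add_analyst_command(username: str, ssh_key: str) -> list[str]:
    """Build the argv list with absolute paths; no shell is involved."""
    # Passing a list (not a string) means the SSH key needs no shell quoting.
    return [SUDO, "-n", ADD_ANALYST, username, ssh_key]


def create_analyst(username: str, ssh_key: str) -> subprocess.CompletedProcess:
    """Run add-analyst via sudo and return the completed process."""
    # check=False: the caller inspects returncode and stderr itself.
    return subprocess.run(
        add_analyst_command(username, ssh_key),
        capture_output=True,
        text=True,
        check=False,
    )
```

With a restricted PATH, a bare `sudo` would raise `FileNotFoundError` inside `subprocess.run`, which is why both the wrapper and the target script are given as full paths.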
Username Generation
Username is generated from email address: the part before @ converted to lowercase.
Examples:
- Petr.Simecek@your-domain.com -> petr.simecek
- john@your-domain.com -> john
If a username conflicts with a reserved system name or existing non-analyst account, the user sees an error and must contact an admin to create the account manually with a different username.
Prerequisites
GCP Firewall:
# Allow HTTP/HTTPS traffic (required for Let's Encrypt and webapp)
gcloud compute firewall-rules create allow-http-data-broker \
--project=kids-ai-data-analysis \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=tcp:80,tcp:443 \
--source-ranges=0.0.0.0/0 \
--target-tags=http-server,https-server
# Add tags to VM
gcloud compute instances add-tags data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--tags=http-server,https-server
DNS:
- A record: your-instance.example.com -> YOUR_SERVER_IP
Password Authentication for External Users
External users (investors, partners) who don't have @your-domain.com Google accounts can authenticate using email/password.
How It Works
- Admin adds email to whitelist (via GitHub Secrets):
  - Go to GitHub repo Settings > Secrets > Actions
  - Update ALLOWED_EMAILS secret (comma-separated list)
  - Push any change to trigger deploy, or manually restart webapp
- User visits login page and clicks "Sign in with Email"
- First-time setup (Sign Up tab):
  - User enters their whitelisted email
  - Clicks "Request Access"
  - Receives email with setup link (valid 24 hours)
  - Sets up password via the link
- Subsequent logins (Sign In tab):
  - User enters email + password
  - Same session/dashboard as Google OAuth users
Username Generation
Usernames are derived from email addresses differently for internal vs external users:
| Email | Username | Type |
|---|---|---|
| john.doe@your-domain.com | john.doe | Internal (Google OAuth) |
| emily@investor.com | emily_investor_com | External (password auth) |
| partner@example.org | partner_example_org | External (password auth) |
This prevents username collisions between internal and external users.
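A minimal sketch of the two derivation rules, inferred only from the examples in the table above. The function name `derive_username`, the `INTERNAL_DOMAIN` constant, and the treatment of dots in an external address's local part are assumptions for illustration, not the webapp's actual code.

```python
INTERNAL_DOMAIN = "your-domain.com"  # assumption: placeholder for the internal Google domain


def derive_username(email: str) -> str:
    """Derive a username from an email address (hypothetical helper)."""
    local, sep, domain = email.strip().lower().partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    if domain == INTERNAL_DOMAIN:
        # Internal (Google OAuth): the part before @, lowercased.
        return local
    # External (password auth): keep the domain in the name so that
    # john@partner.com cannot collide with the internal user john.
    return f"{local}_{domain}".replace(".", "_")


if __name__ == "__main__":
    for address in ("John.Doe@your-domain.com", "emily@investor.com"):
        print(address, "->", derive_username(address))
```

Keeping the domain in external usernames is what makes the collision guarantee hold: an internal name never contains the `_<domain>` suffix.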
Configuration
GitHub Secrets (recommended):
| Secret | Description |
|---|---|
| ALLOWED_EMAILS | Comma-separated list of whitelisted emails |
| SENDGRID_API_KEY | SendGrid API key for sending emails |
| EMAIL_FROM_ADDRESS | Sender email address (e.g., noreply@your-domain.com) |
| EMAIL_FROM_NAME | Sender display name (e.g., Data Analyst Platform) |
Data storage:
/data/auth/ # Password auth data (www-data:data-ops, 2770)
└── password_users.json # User records (hashes, tokens, metadata)
Security Features
- Argon2id password hashing (a memory-hard algorithm)
- Rate limiting: 5 failed attempts per minute per email
- Single-use tokens: Setup/reset links invalidate after use
- Token expiry: Setup 24h, reset 1h
- No email enumeration: Reset endpoint always shows same message
- Password requirements: Min 8 chars, uppercase, lowercase, digit
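The password rule and the rate limit listed above can be sketched as follows. This is an illustration under stated assumptions, not the webapp's code: `password_meets_policy`, `FailedLoginLimiter`, and the sliding-window strategy are hypothetical; only the thresholds (8 characters, uppercase, lowercase, digit, 5 failures per minute per email) come from the list.

```python
import time
from collections import defaultdict, deque

MAX_FAILURES = 5        # failed attempts allowed per window
WINDOW_SECONDS = 60.0   # one minute


def password_meets_policy(password: str) -> bool:
    """Min 8 chars with at least one uppercase, one lowercase, and one digit."""
    return (
        len(password) >= 8
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
    )


class FailedLoginLimiter:
    """Sliding-window counter of failed logins, keyed by lowercased email."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = defaultdict(deque)

    def _prune(self, key: str) -> deque:
        # Drop attempts that have fallen out of the one-minute window.
        now = self._clock()
        attempts = self._failures[key]
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()
        return attempts

    def record_failure(self, email: str) -> None:
        self._prune(email.lower()).append(self._clock())

    def is_blocked(self, email: str) -> bool:
        return len(self._prune(email.lower())) >= MAX_FAILURES
```

The clock is injectable so the window can be tested without sleeping. An in-memory counter like this is per-process; with several Gunicorn workers the real limiter would need shared state.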
Password Reset
Users can reset their password via "Forgot Password?" link on the Sign In tab. They receive an email with a reset link valid for 1 hour.
Telegram Notification Bot
A Telegram bot (@YourBot) allows analysts to receive alerts from their custom notification scripts.
Architecture
Telegram Bot Service (systemd: notify-bot)
├── Telegram polling (handles /start, /test commands)
└── HTTP server on unix socket (/run/notify-bot/bot.sock)
▲
│ POST /send, POST /send_photo
│
notify-runner (user crontab, /usr/local/bin/notify-runner)
└── Executes ~/user/notifications/*.py
The webapp reads/writes shared JSON files in /data/notifications/ for user-Telegram linking (verification codes, user mappings).
Services
| Service | User | Description |
|---|---|---|
| notify-bot | deploy:data-ops | Telegram polling + send API on unix socket |
| webapp | www-data:data-ops | Dashboard with Telegram link/unlink UI |
Bot Commands
| Command | Description |
|---|---|
| /start | Link account (or show status if already linked) |
| /whoami | Show username and email |
| /status | List notification scripts with Run buttons |
| /test | Send a demo graphical report |
| /help | Show available commands |
The /status command shows inline keyboard buttons to run scripts on demand. Scripts are executed as the owning user via sudo -u using the notify-scripts helper (see below).
Data Files
/data/notifications/ # deploy:data-ops, mode 2770 (setgid, no others)
├── telegram_users.json # username -> {chat_id, linked_at}
├── desktop_users.json # username -> {linked_at} (desktop app link state)
├── pending_codes.json # code -> {chat_id, created_at}
└── bot.log # Bot service log
/run/notify-bot/ # systemd RuntimeDirectory (mode 0755)
└── bot.sock # Unix socket for send API (mode 0666)
The setgid bit (2770) ensures all files created in /data/notifications/ inherit the data-ops group, allowing both the bot service (deploy) and webapp (www-data) to read/write them. Analysts have no access to this directory.
The socket is in /run/notify-bot/, a systemd-managed directory with 0755 permissions, so any local user can connect to send notifications.
Notification Runner
Users create Python scripts in ~/user/notifications/ that output JSON to stdout. The notify-runner script (installed at /usr/local/bin/notify-runner) executes these scripts and sends results via the bot's unix socket.
Per-user state is stored in ~/.notifications/state/ (cooldown tracking) and logs in ~/.notifications/logs/.
Users configure their own crontab:
crontab -e
# Add:
*/5 * * * * ~/.venv/bin/python /usr/local/bin/notify-runner >> ~/.notifications/logs/cron.log 2>&1
Notify-Scripts Helper
The notify-scripts helper (/usr/local/bin/notify-scripts) provides a secure way for services (webapp, Telegram bot) to list and run user notification scripts without needing filesystem access to user home directories.
Why it exists: User home directories are set to 750 permissions. Services like www-data and deploy cannot traverse /home/{user}/ to read scripts or state files. The helper runs as the target user via sudo -u, so it has full access to ~/user/notifications/ and ~/.notifications/state/.
Usage:
# List scripts with last_run metadata (returns JSON array)
sudo -u <username> /usr/local/bin/notify-scripts list
# Run a script and return its JSON output
sudo -u <username> /usr/local/bin/notify-scripts run <script_name.py>
# Get last sync time (returns JSON with elapsed_seconds, elapsed_display)
sudo -u <username> /usr/local/bin/notify-scripts sync-status
The sync-status command reads the mtime of ~/server/ directory. This is updated by sync_data.sh via touch ~/server/ at the end of each sync. Each user has their own ~/server/ directory (containing symlinks to shared /data/), so timestamps are per-user.
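A caller-side sketch of invoking the helper from Python with absolute paths, as the restricted Gunicorn PATH requires. The names `build_command`, `sync_status`, and `USERNAME_RE`, the 30-second timeout, and the use of `sudo -n` here are assumptions; the real callers may differ.

```python
import json
import re
import subprocess

SUDO = "/usr/bin/sudo"
NOTIFY_SCRIPTS = "/usr/local/bin/notify-scripts"
# Assumption: usernames look like "petr.simecek" or "emily_investor_com".
USERNAME_RE = re.compile(r"^[a-z_][a-z0-9_.-]*$")


def build_command(username: str, *args: str) -> list[str]:
    """Build the argv for running notify-scripts as the target user."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"refusing unsafe username: {username!r}")
    # -n makes sudo fail instead of prompting for a password.
    return [SUDO, "-n", "-u", username, NOTIFY_SCRIPTS, *args]


def sync_status(username: str) -> dict:
    """Run `notify-scripts sync-status` and parse its JSON output."""
    result = subprocess.run(
        build_command(username, "sync-status"),
        capture_output=True,
        text=True,
        timeout=30,   # assumption: caller-side guard, not a documented value
        check=True,   # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

Passing an argument list (no `shell=True`) keeps the username from being interpreted by a shell, and validating it first prevents a value such as `-i` from being read as a sudo option.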
Callers:
- server/telegram_bot/status.py - /status command and script list API
- server/telegram_bot/runner.py - on-demand script execution (Telegram "Run" button, webapp API)
- webapp/account_service.py - Account card "Last Sync" display
Sudoers rules:
# /etc/sudoers.d/webapp
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts
# /etc/sudoers.d/deploy
deploy ALL=(ALL) NOPASSWD: /usr/local/bin/notify-scripts
Monitoring
# Bot service
sudo systemctl status notify-bot
tail -f /data/notifications/bot.log
# Linked users
cat /data/notifications/telegram_users.json | python3 -m json.tool
# Runner logs (per user)
cat ~/.notifications/logs/runner.log
Security
- Bot token is stored centrally in /opt/data-analyst/repo/.env (loaded via systemd EnvironmentFile)
- Users never see the token - they communicate via unix socket only
- Socket in /run/notify-bot/bot.sock (systemd RuntimeDirectory, mode 0755), socket itself 0666
- /data/notifications/ is 2770 (only deploy + data-ops), no analyst access to logs or user mappings
- Notification scripts run under the user's own account (no sudo) when triggered by crontab
- On-demand runs (via /status button and webapp API) use sudo -u <user> /usr/local/bin/notify-scripts -- services never access user home directories directly
- Scripts have a 60-second timeout (enforced by notify-scripts helper)
- Verification codes expire after 10 minutes and are single-use
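The last rule above (10-minute expiry, single use) can be sketched against the pending_codes.json shape shown under Data Files. The function name `consume_code` and the ISO 8601 encoding of `created_at` are assumptions; the bot's actual storage format is not documented here.

```python
from datetime import datetime, timedelta, timezone

CODE_TTL = timedelta(minutes=10)


def consume_code(pending: dict, code: str, now: datetime):
    """Return the chat_id for a valid code, else None.

    The code is removed from `pending` on first lookup, valid or not,
    which is what makes it single-use.
    """
    entry = pending.pop(code, None)
    if entry is None:
        return None  # unknown or already used
    created_at = datetime.fromisoformat(entry["created_at"])
    if now - created_at > CODE_TTL:
        return None  # expired
    return entry["chat_id"]


if __name__ == "__main__":
    codes = {"123456": {"chat_id": 42, "created_at": "2026-02-03T11:55:00+00:00"}}
    at = datetime(2026, 2, 3, 12, 0, tzinfo=timezone.utc)
    print(consume_code(codes, "123456", at))  # first use succeeds
    print(consume_code(codes, "123456", at))  # second use is rejected
```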
Known Issues
On-demand script execution security hardening (partially resolved):
The notify-scripts helper replaced direct sudo -H -u ... /usr/bin/env ... calls with a single auditable entry point. Services no longer need filesystem access to user home directories (750 permissions are preserved). The bot still requires NoNewPrivileges=false and /tmp in ReadWritePaths for sudo execution. A queue-based approach (#51) could further improve this by having notify-runner pick up run requests from a queue instead of the bot calling sudo directly.
Data Sync Settings (Web Portal)
Users can configure which optional datasets to sync via the web portal at https://your-instance.example.com. Settings are stored server-side and downloaded by sync_data.sh before each sync.
Architecture
┌─────────────────────────────────────┐
│ Web Portal (Dashboard) │
│ └── Data Settings widget │
│ ├── Toggle: Jira (~50 MB) │
│ └── Toggle: Jira Attachments │
│ (~500 MB+) │
└─────────────────────────────────────┘
│ POST /api/sync-settings
▼
┌─────────────────────────────────────┐
│ Flask API │
│ ├── Save to sync_settings.json │
│ └── Write ~/.sync_settings.yaml │
│ (via sudo install) │
└─────────────────────────────────────┘
│
▼
/data/notifications/sync_settings.json ← Central storage (all users)
/home/{user}/.sync_settings.yaml ← Per-user config file
│
▼ scp (analyst sync)
┌─────────────────────────────────────┐
│ sync_data.sh (client) │
│ ├── Download ~/.sync_settings.yaml │
│ ├── Read dataset toggles │
│ └── Conditionally run sync_jira.sh │
└─────────────────────────────────────┘
Data Files
| File | Location | Purpose |
|---|---|---|
sync_settings.json |
/data/notifications/ |
Central storage for all users' settings |
.sync_settings.yaml |
/home/{user}/ |
Per-user config file (YAML format) |
sync_settings.json format:
{
"petr.simecek": {
"datasets": {
"jira": true,
"jira_attachments": false
},
"updated_at": "2026-02-03T12:00:00Z"
}
}
Per-user .sync_settings.yaml format:
# Data Analyst - Sync Configuration
# Managed by web portal - changes here may be overwritten
datasets:
jira: true
jira_attachments: false
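As an illustration of how the two formats relate, the sketch below renders a user's `datasets` entry from sync_settings.json into the YAML text above and checks a toggle with the same pattern the client uses in grep. The function names are hypothetical and the two-space indentation is an assumption; the real `sync_settings_service.py` may build the file differently.

```python
import re

HEADER = (
    "# Data Analyst - Sync Configuration\n"
    "# Managed by web portal - changes here may be overwritten\n"
)


def render_sync_settings(datasets: dict) -> str:
    """Render {"jira": True, ...} as the per-user YAML config text."""
    lines = [HEADER + "datasets:"]
    for name, enabled in datasets.items():
        lines.append(f"  {name}: {'true' if enabled else 'false'}")
    return "\n".join(lines) + "\n"


def dataset_enabled(yaml_text: str, name: str) -> bool:
    """Python equivalent of: grep -qE '^\\s*<name>:\\s*true'."""
    pattern = rf"^\s*{re.escape(name)}:\s*true"
    return re.search(pattern, yaml_text, flags=re.MULTILINE) is not None


if __name__ == "__main__":
    text = render_sync_settings({"jira": True, "jira_attachments": False})
    print(text, end="")
    print("jira enabled:", dataset_enabled(text, "jira"))
```

Because the pattern requires a colon directly after the name, checking `jira` does not accidentally match the `jira_attachments` line.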
Sudoers Configuration
The webapp needs sudo to write config files to user home directories. This is configured in /etc/sudoers.d/webapp-sync:
# Allow webapp to install sync settings to user home directories
www-data ALL=(ALL) NOPASSWD: /usr/bin/install -o * -g * -m 644 /tmp/*.yaml /home/*/.sync_settings.yaml
Why this approach:
- Webapp runs as www-data, which cannot write to /home/{user}/
- The install command copies the file and sets owner, group, and mode in a single step
- Tempfile must be in /tmp/ because the sudoers rule only matches /tmp/*.yaml
- Target is restricted to .sync_settings.yaml only
Client Sync Flow
When sync_data.sh runs:
- Downloads config from server:
  scp -q data-analyst:~/.sync_settings.yaml /tmp/.sync_settings_$(id -u).yaml
- If no config exists on server, creates default (jira: false)
- Reads config and conditionally runs dataset sync scripts:
  if grep -qE '^\s*jira:\s*true' "$SYNC_CONFIG_LOCAL"; then
    bash sync_jira.sh
  fi
- sync_jira.sh syncs data AND creates DuckDB views automatically (no separate step needed)
- sync_jira.sh checks jira_attachments setting for attachment sync
Available Datasets
| Dataset | Size | Description |
|---|---|---|
| jira | ~50 MB | Support tickets from SUPPORT project (issues, comments, changelog, attachment metadata) |
| jira_attachments | ~500 MB+ | Actual attachment files (images, logs, etc.). Requires jira to be enabled. |
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/sync-settings | GET | Get current user's sync settings |
| /api/sync-settings | POST | Update settings and regenerate user config |
Troubleshooting
Settings not being saved to user home:
- Check /etc/sudoers.d/webapp-sync exists
- Verify tempfile is created in /tmp/ (not other directory)
- Check webapp logs: tail -f /opt/data-analyst/logs/webapp-error.log
Old scripts on client after sync:
- sync_data.sh downloads scripts from /data/scripts/ on server
- Ensure deploy.sh copies all scripts including sync_jira.sh
- If scripts are missing from /data/scripts/, run manual deploy or CI/CD
Jira Webhook Integration
Receives webhooks from Atlassian Jira to maintain a real-time copy of issue data for analysis.
Architecture
Jira Cloud (your-org.atlassian.net)
│
│ POST /webhooks/jira (HTTPS)
▼
┌─────────────────────────────────────┐
│ Webapp (Flask) │
│ ├── Verify HMAC signature │
│ ├── Fetch full issue via REST API │
│ ├── Save JSON + download attachs │
│ └── Trigger incremental transform │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ incremental_jira_transform │ │
│ │ • Upsert to monthly Parquet │ │
│ │ • Copy to distribution dir │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────┘
│
▼ rsync (analyst sync)
┌─────────────────────────────────────┐
│ Analyst (local) │
│ • Only changed monthly files sync │
│ • Data available within seconds │
└─────────────────────────────────────┘
Data Structure
/data/src_data/
├── raw/jira/ # Raw Jira data from webhooks
│ ├── issues/ # Individual issue JSON files
│ │ ├── SUPPORT-1234.json
│ │ └── SUPPORT-1235.json
│ ├── attachments/ # Downloaded attachment files
│ │ └── SUPPORT-1234/
│ │ └── 56340_image.png
│ └── webhook_events/ # Raw webhook payloads (audit)
│ └── 20260203_120000_jira_issue_created.json
│
└── parquet/jira/ # Transformed data (monthly partitioned)
├── issues/
│ ├── 2024-01.parquet
│ └── 2024-02.parquet
├── comments/
├── attachments/ # Metadata only (not binary)
└── changelog/
~/server/parquet/jira/ # Distribution directory (symlink or copy)
# This is what analysts sync via rsync
Monthly partitioning: Each issue belongs to the month of its created_at date. When an issue is updated, only that month's Parquet file changes, so rsync transfers only the changed monthly files (~50-100 KB each).
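The partition rule can be written as a one-line mapping from an issue's creation timestamp to a file path. This is a sketch: `partition_file` and `PARQUET_ROOT` are hypothetical names, and how the comments, attachments, and changelog tables choose their month is not stated here, so the example keys on the issue's created_at only.

```python
from datetime import datetime
from pathlib import PurePosixPath

PARQUET_ROOT = PurePosixPath("/data/src_data/parquet/jira")


def partition_file(table: str, created_at: datetime) -> PurePosixPath:
    """Return the monthly Parquet file that holds a row created at `created_at`."""
    return PARQUET_ROOT / table / f"{created_at:%Y-%m}.parquet"


if __name__ == "__main__":
    print(partition_file("issues", datetime(2024, 1, 31, 23, 59)))
```

Because the key is the creation month rather than the update time, an issue never moves between files when it is edited.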
Configuration
Add to /opt/data-analyst/.env:
# Jira Webhook Integration
JIRA_WEBHOOK_SECRET=<generate with: python -c "import secrets; print(secrets.token_hex(32))">
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token from Atlassian account>
# SLA polling (JSM service account for elapsed_millis refresh)
JIRA_SLA_EMAIL=<JSM service account email>
JIRA_SLA_API_TOKEN=<JSM service account API token>
JIRA_CLOUD_ID=f0f7a244-4fb4-41f9-b1f0-b79e24a20f11
Get Jira API token:
- Go to https://id.atlassian.com/manage-profile/security/api-tokens
- Create API token
- Store in .env as JIRA_API_TOKEN
Jira Webhook Setup
- Go to Jira Admin > System > WebHooks
- Create new webhook:
  - Name: Data Analyst Sync
  - URL: https://your-instance.example.com/webhooks/jira
  - Secret: Same value as JIRA_WEBHOOK_SECRET in .env
  - JQL Filter: project = "Your Project" (or your project)
  - Events:
    - Issue: created, updated, deleted
    - Comment: created, updated
    - Attachment: created
    - Issue link: created
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /webhooks/jira | POST | Receive Jira webhooks |
| /webhooks/jira/health | GET | Health check (shows config status) |
| /webhooks/jira/test | POST | Manual issue fetch (debug mode only) |
Monitoring
# Check webhook health
curl https://your-instance.example.com/webhooks/jira/health
# View recent webhook events
ls -la /data/src_data/raw/jira/webhook_events/ | tail -20
# Check saved issues
ls /data/src_data/raw/jira/issues/ | wc -l
# View webapp logs for webhook processing
tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira
SLA Polling
SLA elapsed values (first_response_elapsed_millis, time_to_resolution_elapsed_millis) only update when a webhook fires. For idle open tickets, these values go stale. The SLA polling timer refreshes them periodically and self-heals stale status data from missed webhooks.
| Component | Description |
|---|---|
| jira-sla-poll.service | Oneshot service that polls open tickets for fresh SLA + status data |
| jira-sla-poll.timer | Runs every 15 minutes (10min after boot, then every 15min) |
| scripts/jira_poll_sla.py | Reads Parquet to find open issues, fetches SLA + status via cloud API |
| src/jira_file_lock.py | Per-issue advisory file locking (shared with webhook handler) |
How it works:
- Reads Parquet issues to find open tickets with SLA data (~49 tickets)
- For each: fetches fresh SLA and status fields via JSM service account (cloud API)
- Acquires per-issue advisory file lock (prevents concurrent webhook writes)
- Updates raw JSON atomically (tempfile + os.fchmod(0o660) + os.replace)
- If ticket is resolved in Jira but "open" locally: logs Self-healing: SUPPORT-XXXX is resolved in Jira
- Calls transform_single_issue() to update Parquet + distribution (inside lock)
- Releases lock
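The lock-then-replace pattern in the steps above can be sketched as follows. This is not the contents of src/jira_file_lock.py: the function names and the `<issue_key>.lock` file naming are assumptions, and the transform step is omitted. Only the `.locks/` directory, `fcntl.flock`, and the tempfile + `os.fchmod(0o660)` + `os.replace` sequence come from this document.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def issue_lock(issues_dir: Path, issue_key: str):
    """Hold an exclusive advisory lock for one issue while the block runs."""
    lock_dir = issues_dir / ".locks"
    lock_dir.mkdir(parents=True, exist_ok=True)
    with open(lock_dir / f"{issue_key}.lock", "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def write_issue_atomically(issues_dir: Path, issue_key: str, issue: dict) -> Path:
    """Write the issue JSON so readers never observe a partial file."""
    target = issues_dir / f"{issue_key}.json"
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=issues_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(issue, tmp)
            tmp.flush()
            os.fchmod(tmp.fileno(), 0o660)  # mkstemp creates the file as 0600
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def update_issue(issues_dir: Path, issue_key: str, issue: dict) -> Path:
    with issue_lock(issues_dir, issue_key):
        return write_issue_atomically(issues_dir, issue_key, issue)
```

The lock serializes writers (webhook handler versus poller), while the rename protects readers; the two mechanisms solve different problems, so both are needed. `flock` is advisory, so it only helps if every writer takes the same lock.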
Monitoring:
# Check timer status
systemctl status jira-sla-poll.timer
systemctl list-timers | grep jira
# View last run logs
journalctl -u jira-sla-poll.service --since "1 hour ago"
# Manual dry run (count open issues)
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_poll_sla.py --dry-run
Requires: JIRA_SLA_EMAIL, JIRA_SLA_API_TOKEN, JIRA_CLOUD_ID in .env. Timer is auto-enabled by deploy.sh when JIRA_SLA_API_TOKEN is set.
Consistency Monitoring
Automated check every 30 minutes to detect missing Jira issues caused by webhook losses, disk failures, or processing errors. Validates data integrity by comparing three sources: Jira API (ground truth), raw JSON files, and Parquet data.
| Component | Description |
|---|---|
| jira-consistency.service | Oneshot service that validates data consistency across all sources |
| jira-consistency.timer | Runs every 30 minutes (10min after boot) |
| jira-consistency-deep.timer | Weekly full history check (Sunday 3 AM) |
| scripts/jira_consistency_check.py | Validation script with auto-backfill capability |
How it works:
- Queries Jira API for all issue keys (last 30 days by default)
- Compares with raw JSON files in /data/src_data/raw/jira/issues/
- Compares with Parquet data in /data/src_data/parquet/jira/issues/
- Auto-backfills if 1-10 issues missing (downloads JSON + transforms to Parquet)
- Alerts (ERROR log) if 11+ issues missing (requires manual investigation)
- Re-transforms JSON to Parquet for issues with transform lag
Grace period: Ignores issues created in last 5 minutes to avoid false positives from webhook timing windows.
Alert levels:
- INFO: 1-5 missing issues, auto-backfilled successfully
- WARNING: 6-10 missing issues, auto-backfilled successfully
- ERROR: 11+ missing issues, manual review required (no auto-fix)
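The three-way comparison and the alert thresholds above reduce to set differences plus a small classifier. A minimal sketch: `find_discrepancies` and `classify` are hypothetical names, and `missing_in_parquet` is an assumed key for the transform-lag case (only `missing_in_json` appears in the report path used in this document).

```python
def find_discrepancies(jira_keys: set, json_keys: set, parquet_keys: set) -> dict:
    """Compare Jira (ground truth) with raw JSON files and Parquet rows."""
    return {
        # In Jira, but no raw JSON file on disk: needs a backfill.
        "missing_in_json": sorted(jira_keys - json_keys),
        # JSON exists but Parquet lacks the row: transform lag.
        "missing_in_parquet": sorted((jira_keys & json_keys) - parquet_keys),
    }


def classify(missing_count: int):
    """Return (log_level, auto_backfill) for a number of missing issues."""
    if missing_count == 0:
        return None, False
    if missing_count <= 5:
        return "INFO", True
    if missing_count <= 10:
        return "WARNING", True
    return "ERROR", False  # 11+: manual review, no auto-fix


if __name__ == "__main__":
    report = find_discrepancies(
        {"SUPPORT-1", "SUPPORT-2", "SUPPORT-3"},
        {"SUPPORT-1", "SUPPORT-2"},
        {"SUPPORT-1"},
    )
    print(report, classify(len(report["missing_in_json"])))
```

The grace period is not modeled here; in this sketch it would be applied by filtering recently created keys out of `jira_keys` before the comparison.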
Monitoring:
# Check timer status
systemctl status jira-consistency.timer
systemctl list-timers | grep jira
# View last run logs
journalctl -u jira-consistency.service --since "1 hour ago"
# Manual check (dry run)
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_consistency_check.py --dry-run --max-age-days 7
# Manual check with auto-fix
/opt/data-analyst/.venv/bin/python scripts/jira_consistency_check.py --auto-fix --max-age-days 30
# View consistency report
cat /data/src_data/raw/jira/_consistency_report.json | python3 -m json.tool
Manual recovery (if 11+ issues found):
# List missing issues from report
jq -r '.discrepancies.missing_in_json[]' /data/src_data/raw/jira/_consistency_report.json
# Backfill specific issues
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python scripts/jira_backfill.py --issue-keys SUPPORT-15307,SUPPORT-15308
# Verify in Parquet
/opt/data-analyst/.venv/bin/python -c "
import duckdb
con = duckdb.connect()
result = con.execute('''
SELECT issue_key, created_at, summary
FROM read_parquet('/data/src_data/parquet/jira/issues/*.parquet')
WHERE issue_key IN ('SUPPORT-15307', 'SUPPORT-15308')
''').fetchall()
for row in result:
print(row)
"
Requires: JIRA_DOMAIN, JIRA_EMAIL, JIRA_API_TOKEN in .env. Timers are auto-enabled by deploy.sh when Jira credentials are configured.
Security
- Webhooks are verified using HMAC-SHA256 signature
- API token has read-only access to Jira (no write permissions needed)
- Webhook events are logged for audit purposes
- Multiple services write to /data/src_data/raw/jira/: webapp (www-data), SLA poll (root), consistency check (root), backfill scripts (admin users)
- Concurrent writes to the same issue JSON are serialized via per-issue advisory file locking (src/jira_file_lock.py, fcntl.flock). Lock files in issues/.locks/. See #203.
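HMAC-SHA256 verification of a webhook body, as named in the first item above, might look like the sketch below. The `sha256=<hex>` header format and the function names are assumptions; check the header name and encoding your Jira instance actually sends before relying on this.

```python
import hashlib
import hmac


def signature_for(secret: str, body: bytes) -> str:
    """Compute the expected signature for a raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Return True only if the header matches the HMAC of the exact body bytes."""
    if not header_value:
        return False
    expected = signature_for(secret, body)
    # compare_digest runs in constant time, so a mismatch does not reveal
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode(), header_value.strip().encode())


if __name__ == "__main__":
    payload = b'{"webhookEvent": "jira:issue_created"}'
    header = signature_for("example-secret", payload)
    print(verify_signature("example-secret", payload, header))
```

The signature must be computed over the raw bytes as received; re-serializing parsed JSON can change whitespace or key order and break verification.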
Data Profiler
Generates YData-inspired statistical profiles for all tables in the data catalog, including Jira support tables. Profiles include per-column statistics, type-specific visualizations (histograms, top values, timelines), data quality alerts, and business context (relationships, metrics). Profiles are preserved across runs — if a table fails to profile, its previous valid data is retained.
Architecture
Cron (update.sh, 3x daily)
Step 2: python -m src.data_sync → parquet + sync_state.json + schema.yml
Step 4: python -m src.profiler → profiles.json
│
▼
/data/src_data/metadata/profiles.json (mode 644, padak:data-ops)
│
▼
Webapp: GET /api/catalog/profile/<table_name>
│
▼
Catalog page: profiler modal (Chart.js visualizations)
How It Works
- Profiler runs as Step 4 in scripts/update.sh after data sync and metadata generation
- Materializes Parquet into DuckDB: CREATE TEMP TABLE loads each table once into DuckDB columnar storage (instead of re-reading Parquet files for every query)
- Batch statistics: base stats (COUNT, COUNT DISTINCT) for all columns in one query; type-specific aggregates (numeric, string, date, boolean) batched per category
- Large tables (>500K rows) are sampled: USING SAMPLE 500000 ROWS
- Merges metadata from data_description.md (descriptions, foreign keys), sync_state.json (row counts, file sizes), and docs/metrics/*.yml (business metric mappings)
- Writes profiles.json atomically (tempfile.mkstemp() + os.chmod(0o644) + os.replace())
- Preserves existing profiles on failure: if a table fails to profile, the previous valid profile is retained (marked _stale: true)
- Profiler failure is non-fatal: if the entire profiler fails, the update pipeline continues
- Jira table relationships: issue_key foreign keys are defined between all Jira tables (comments, attachments, changelog, issuelinks, remote_links → jira_issues), visible in the Relationships tab
Output File
/data/src_data/metadata/profiles.json # ~900 KB for ~29 tables
Permissions: File must be 644 (world-readable) so the webapp (www-data) can serve it. The profiler sets os.chmod(tmp, 0o644) before os.replace() because mkstemp() defaults to 600.
Per-Table Profile Structure
Each table profile contains:
| Field | Source | Description |
|---|---|---|
| row_count, column_count | DuckDB | Table dimensions |
| file_size_mb | sync_state.json | Parquet file size on disk |
| description, primary_key | data_description.md | Business context |
| avg_completeness | DuckDB | Average non-null percentage across columns |
| missing_cells, missing_cells_pct | DuckDB | Total NULL cells count and percentage |
| duplicate_rows | DuckDB | COUNT(*) - COUNT(DISTINCT *) |
| date_range | DuckDB | Earliest/latest date from date columns |
| variable_types | DuckDB | Breakdown by type (STRING, NUMERIC, DATE, BOOLEAN) |
| alerts | Computed | Auto-detected data quality issues (see below) |
| related_tables | data_description.md | Foreign key relationships (outgoing + incoming) |
| used_by_metrics | docs/metrics/*.yml | Which business metrics use this table |
| sample_rows | DuckDB | First 5 rows for preview |
| columns | DuckDB | Per-column detailed statistics |
| _stale | Profiler | true if this profile is from a previous run (current profiling failed) |
Alert System
Auto-detection of data quality issues, displayed as colored badges:
| Alert | Condition | Severity |
|---|---|---|
| constant | unique_count == 1 | warning (yellow) |
| unique | unique_pct == 100% | info (red) |
| high_missing | missing_pct > 30% | error (red) |
| missing | missing_pct > 5% | warning (yellow) |
| imbalance | top_value_pct > 60% (categorical) | info (blue) |
| zeros | zero_pct > 50% (numeric) | info (blue) |
| high_cardinality | unique_count > 50 (text) | info (grey) |
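The thresholds in the table translate into a small rule function. This is a sketch, not src/profiler.py: the `stats` dictionary keys mirror the condition names in the table, while the function name, the `type` field, treating STRING columns as both "categorical" and "text", and letting high_missing supersede missing are all assumptions.

```python
def column_alerts(stats: dict) -> list:
    """Return (alert, severity) pairs for one column's statistics."""
    alerts = []
    if stats.get("unique_count") == 1:
        alerts.append(("constant", "warning"))
    if stats.get("unique_pct", 0) == 100:
        alerts.append(("unique", "info"))

    missing_pct = stats.get("missing_pct", 0)
    if missing_pct > 30:
        alerts.append(("high_missing", "error"))
    elif missing_pct > 5:  # assumption: only the stronger of the two is reported
        alerts.append(("missing", "warning"))

    kind = stats.get("type")
    if kind == "STRING":
        if stats.get("top_value_pct", 0) > 60:
            alerts.append(("imbalance", "info"))
        if stats.get("unique_count", 0) > 50:
            alerts.append(("high_cardinality", "info"))
    elif kind == "NUMERIC" and stats.get("zero_pct", 0) > 50:
        alerts.append(("zeros", "info"))
    return alerts


if __name__ == "__main__":
    print(column_alerts({"type": "NUMERIC", "unique_count": 10,
                         "missing_pct": 6, "zero_pct": 75}))
```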
Type-Specific Column Statistics
| Column Type | Statistics | Visualization |
|---|---|---|
| STRING (low cardinality ≤50) | Top 10 values with counts/percentages | Horizontal bar chart |
| STRING (high cardinality >50) | min/max/avg length, sample values | Sample list |
| NUMERIC (FLOAT64, INT64, DECIMAL) | min, max, mean, median, p5/p25/p75/p95, stddev, zeros | Histogram (10-20 buckets) |
| DATE/TIMESTAMP | earliest, latest, span_days | Timeline histogram (quarterly) |
| BOOLEAN | true_count, false_count, true_pct | True/false ratio bar |
Webapp Integration
API endpoint: GET /api/catalog/profile/<table_name> (requires login)
- Returns JSON profile for a single table from profiles.json
- 404 if profiler hasn't run yet or table not found
- 500 if file unreadable (check permissions)
Catalog page: Click any table row to open profiler modal with tabs:
- Overview — dataset statistics + variable type breakdown
- Variables — per-column cards with type-specific charts (Chart.js)
- Alerts — all detected issues with colored severity badges
- Missing Values — horizontal bar chart of completeness per column
- Relationships — foreign key links (clickable to open related table's profile)
- Sample — first 5 rows in table format
Performance
- Runtime: ~1-2 minutes for ~29 tables (optimized from ~8min via TABLE materialization + batch queries)
- Sampling: Tables >500K rows use USING SAMPLE 500000 ROWS for consistent performance
- Memory: In-memory DuckDB with temporary tables (dropped after profiling)
- Output size: ~900 KB JSON for ~29 tables (including 6 Jira tables)
Files
| File | Description |
|---|---|
| src/profiler.py | Profiler engine (~1220 lines) |
| tests/test_profiler.py | Unit + integration tests (24 tests) |
| scripts/update.sh | Pipeline integration (Step 4) |
| webapp/app.py | API route /api/catalog/profile/<table_name> |
| webapp/templates/catalog.html | Profiler modal UI + Chart.js |
Monitoring
# Manual profiler run
ssh kids "cd /opt/data-analyst/repo && source /opt/data-analyst/.venv/bin/activate && python -m src.profiler"
# Check output
ssh kids "ls -la /data/src_data/metadata/profiles.json"
ssh kids "python3 -c \"import json; d=json.load(open('/data/src_data/metadata/profiles.json')); print(f'Tables: {len(d[\\\"tables\\\"])}')\""
# Check update.sh logs (profiler runs as Step 4)
ssh kids "cat /var/log/update.log | grep -A5 'Generating data profiles'"
# Test API endpoint
curl -s https://your-instance.example.com/api/catalog/profile/company | python3 -m json.tool | head -20
Troubleshooting
"Profile data not available for this table"
- Profiler hasn't been run yet, or table name doesn't match
- Run manually: python -m src.profiler on server
- Note: Since v1.1, profiler preserves old profiles on failure, so this should only appear for truly new tables
HTTP 500 on /api/catalog/profile/*
- Check file permissions: ls -la /data/src_data/metadata/profiles.json (must be 644)
- Fix: sudo chmod 644 /data/src_data/metadata/profiles.json
- Root cause: mkstemp() creates files with 600; fixed in profiler.py with os.chmod(0o644)
Profiler takes too long
- Normal runtime is ~1-2 minutes; if significantly longer, check which tables are large in profiler logs
- Sampling threshold is 500K rows (configurable in src/profiler.py constant SAMPLE_THRESHOLD)
- TABLE materialization + batch queries keep it fast; if DuckDB runs out of memory, check server RAM
Metrics not showing in profiler
- Metrics are loaded from docs/metrics/ directory (split by category: docs/metrics/*/*.yml)
- Legacy docs/metrics.yml path is still supported but the directory structure takes precedence
- Check that metric files exist: ls docs/metrics/*/*.yml
Corporate Memory
A knowledge sharing system that extracts reusable insights from analysts' personal notes (CLAUDE.local.md), lets the team vote on them via a webapp, and syncs upvoted items back to each user's Claude Code rules.
Architecture
┌─────────────────────────────────────┐
│ Analyst Workstations │
│ ├── CLAUDE.local.md │ ← Personal notes (synced to server)
│ └── .claude/rules/*.md │ ← Synced rules from upvoted items
└─────────────────────────────────────┘
│ sync_data.sh ▲ sync_data.sh
│ (upload CLAUDE.local.md) │ (download .claude_rules/*)
▼ │
┌─────────────────────────────────────┐ │
│ Server: /home/{user}/ │ │
│ ├── CLAUDE.local.md │ │
│ └── .claude_rules/*.md │───┘
└─────────────────────────────────────┘
│ corporate-memory.timer (every 30 min)
▼
┌─────────────────────────────────────┐
│ Knowledge Collector (full refresh) │
│ ├── MD5 hash change detection │
│ ├── ALL files + existing catalog │
│ │ → single Claude Haiku 4.5 call │
│ │ (Structured Outputs) │
│ ├── Sensitivity check (new items) │
│ └── Save to knowledge.json │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ /data/corporate-memory/ │
│ ├── knowledge.json │
│ ├── votes.json │
│ └── user_hashes.json │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Webapp: /corporate-memory │
│ ├── Browse, search, filter │
│ ├── Upvote / downvote items │
│ └── On vote → regenerate user rules│
└─────────────────────────────────────┘
How It Works
Collection (server-side, every 30 min)
- Analysts write notes in CLAUDE.local.md during their work with Claude Code
- sync_data.sh uploads CLAUDE.local.md to /home/{user}/CLAUDE.local.md on the server
- Collector checks for changes by comparing MD5 hashes of all users' files against user_hashes.json
- If any file changed, collector sends ALL users' files + the existing knowledge catalog to Claude Haiku 4.5 in a single API call (full refresh approach)
- Haiku maps knowledge to existing catalog items (preserving IDs for vote stability) or creates new items
- Sensitivity check runs only on newly created items (existing items were already checked)
- Knowledge base is updated atomically (tempfile + os.replace)
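The change-detection step in the list above can be sketched as a hash comparison. The function names are hypothetical, and the `{username: md5_hex}` layout assumed for user_hashes.json is a guess; the collector's real file format is not shown in this document.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    # MD5 is used for change detection only, not for security.
    return hashlib.md5(path.read_bytes(), usedforsecurity=False).hexdigest()


def current_hashes(home_root: Path) -> dict:
    """Map each username to the MD5 of its CLAUDE.local.md, if present."""
    return {
        notes.parent.name: file_md5(notes)
        for notes in sorted(home_root.glob("*/CLAUDE.local.md"))
    }


def changed_users(previous: dict, current: dict) -> list:
    """Users whose notes were added, removed, or modified since the last run."""
    return sorted(
        user
        for user in previous.keys() | current.keys()
        if previous.get(user) != current.get(user)
    )


if __name__ == "__main__":
    now = current_hashes(Path("/home"))
    # An empty result from changed_users() means the API call can be skipped.
    print(changed_users({}, now))
```

Comparing over the union of keys means a deleted notes file also counts as a change, which matters because the refresh is all-or-nothing.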
Voting and Rules Sync (webapp → analyst)
- Users browse knowledge at /corporate-memory (search, filter by category, sort by score)
- Upvoting an item records the vote in votes.json and immediately regenerates the user's rule files
- Rule files are installed to /home/{server_user}/.claude_rules/{item_id}.md via the install-user-rules sudo helper (see below)
- Next sync_data.sh run downloads .claude_rules/* to the analyst's .claude/rules/ directory
- Claude Code automatically reads files from .claude/rules/ as project context
There is no threshold - any personal upvote syncs the item to that user's rules.
Rules Installation (sudo helper)
The webapp runs as www-data which cannot write to /home/{user}/ directories (mode drwxr-x---). Rule files are installed using the established sudo install pattern (same approach as sync_settings_service.py for .sync_settings.yaml):
- Webapp writes rule .md files to a temp directory
- Calls sudo -n /usr/local/bin/install-user-rules {username} {tmp_dir}
- Helper script creates /home/{user}/.claude_rules/ (mode 700), removes old km_*.md files, installs new files with /usr/bin/install -o {user} -g {user} -m 600
- Webapp cleans up the temp directory
Files involved:
- server/bin/install-user-rules → deployed to /usr/local/bin/install-user-rules
- server/sudoers-webapp → entry: www-data ALL=(ALL) NOPASSWD: /usr/local/bin/install-user-rules
- webapp/corporate_memory_service.py → _regenerate_user_rules() calls the helper via subprocess.run()
Username Mapping
The webapp uses email-derived usernames (e.g., petr.simecek) while the server uses Linux home directory names (e.g., petr). Most usernames match; only Petr's differs.
Mapping is in webapp/corporate_memory_service.py:
WEBAPP_TO_SERVER_USERNAME = {
"petr.simecek": "petr",
}
Display names for avatars (initials + tooltip):
USER_DISPLAY_NAMES = {
"petr": {"name": "Petr Simecek", "initials": "PS"},
"dasa.damaskova": {"name": "Dasa Damaskova", "initials": "DD"},
"martin.matejka": {"name": "Martin Matejka", "initials": "MM"},
"jiri.manas": {"name": "Jiri Manas", "initials": "JM"},
"pavel.dolezal": {"name": "Pavel Dolezal", "initials": "PD"},
}
Data Files
/data/corporate-memory/ # deploy:data-ops, mode 2770
├── knowledge.json # Extracted knowledge items + metadata
├── votes.json # Per-user votes {username: {item_id: 1/-1}}
├── user_hashes.json # MD5 hashes for change detection
└── collection.log # Collection run history
/home/{user}/
├── CLAUDE.local.md # User's personal notes (source)
└── .claude_rules/ # Generated rule files (mode 700, owner-only)
├── km_abc123.md # mode 600, owned by user
└── km_def456.md
knowledge.json structure:
{
"items": {
"km_abc123": {
"id": "km_abc123",
"title": "DuckDB Schema Reference Protocol",
"content": "Always read schema before queries...",
"category": "workflow",
"tags": ["duckdb", "best-practices"],
"source_users": ["petr"],
"extracted_at": "2026-02-05T21:54:18Z",
"updated_at": "2026-02-05T21:54:18Z"
}
},
"metadata": {
"last_collection": "2026-02-05T21:54:18Z",
"total_users": 3
}
}
votes.json structure:
{
"petr": {
"km_abc123": 1,
"km_def456": -1
}
}
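To illustrate how the votes.json layout can be consumed, the sketch below derives a per-item score and a user's upvoted items. Treating the score as the plain sum of all votes is an assumption; the function names are hypothetical.

```python
from collections import Counter


def item_scores(votes: dict[str, dict[str, int]]) -> Counter:
    """Sum every user's vote (+1 / -1) per item ID."""
    scores = Counter()
    for user_votes in votes.values():
        for item_id, value in user_votes.items():
            scores[item_id] += value
    return scores


def upvoted_by(votes: dict[str, dict[str, int]], username: str) -> set[str]:
    """Item IDs the given user has upvoted (the basis for "My Rules")."""
    return {item_id for item_id, v in votes.get(username, {}).items() if v == 1}
```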
Full Refresh Approach
The collector uses a full refresh strategy to avoid duplicates:
- Change detection: MD5 hash of each user's CLAUDE.local.md is compared against user_hashes.json
- If no changes: Skip the API call entirely (saves cost)
- If any file changed: Load ALL user files and the existing catalog
- Single Haiku call: The prompt includes the existing catalog with IDs, so Haiku can:
  - Map knowledge to existing items (preserving existing_id for vote stability)
  - Merge similar knowledge from different users into single items
  - Add genuinely new items (assigned new km_* IDs)
  - Preserve source_users from existing items even if a user removed their notes
- Sensitivity check: Only NEW items (without existing_id) are checked; existing items passed the check previously
This approach ensures:
- No duplicates from non-deterministic AI output
- Stable item IDs across runs (votes are preserved)
- Cross-user knowledge merging in a single pass
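The change-detection step can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the file contents are passed in as bytes (the real collector would read them from /home/*/CLAUDE.local.md and persist the hashes in user_hashes.json).

```python
import hashlib


def content_md5(data: bytes) -> str:
    """Hex MD5 digest of a file's contents."""
    return hashlib.md5(data).hexdigest()


def detect_changes(
    contents: dict[str, bytes], previous_hashes: dict[str, str]
) -> tuple[bool, dict[str, str]]:
    """Compare current per-user hashes with the stored ones.

    Returns (changed, current_hashes). Any added, removed, or modified
    user file counts as a change and would trigger a full refresh.
    """
    current = {user: content_md5(data) for user, data in contents.items()}
    return current != previous_hashes, current
```

When changed is False, the collector can skip the API call and leave the catalog untouched.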
Systemd Services
| Service | Type | Schedule | Description |
|---|---|---|---|
| corporate-memory.service | oneshot | on-demand | Runs the knowledge collector |
| corporate-memory.timer | timer | every 30 min | Triggers the service |
Service configuration:
- Runs as root (needed to read /home/*/CLAUDE.local.md)
- Group: data-ops
- Timeout: 600 seconds (for API calls)
- Security hardening: ProtectSystem=strict, PrivateTmp=true
Configuration
Required GitHub Secret:
| Secret | Description |
|---|---|
| ANTHROPIC_API_KEY | Claude API key for Haiku 4.5 extraction |
The API key is deployed to /opt/data-analyst/.env via CI/CD and loaded by the collector service.
Model: claude-haiku-4-5-20251001 with Structured Outputs (output_config.format.json_schema)
Knowledge Categories
| Category | Description |
|---|---|
| data_analysis | DuckDB, Parquet, data processing techniques |
| api_integration | API usage, HTTP clients, authentication |
| debugging | Error diagnosis, troubleshooting techniques |
| performance | Optimization, caching, efficiency improvements |
| workflow | Best practices, processes, conventions |
| infrastructure | Server, deployment, configuration |
| business_logic | Domain knowledge, data relationships |
Extraction Process
The collector uses Claude Haiku 4.5 with Structured Outputs for guaranteed JSON schema compliance:
- Catalog refresh prompt sends all user files + existing catalog to Haiku
- JSON Schema enforces output format including existing_id (string or null) for ID preservation
- Sensitivity check runs only on NEW items to verify they are safe to share
- ID assignment: Existing items keep their IDs; new items get km_{uuid[:8]} format
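The ID assignment step can be sketched like this. Using uuid4 and its hex form for the eight-character suffix is an assumption (the source only states the km_{uuid[:8]} format), as is treating an existing_id that is not in the catalog as a new item; assign_ids is a hypothetical name.

```python
import uuid


def assign_ids(extracted: list[dict], catalog_ids: set[str]) -> list[dict]:
    """Attach an "id" to each extracted item.

    Items whose existing_id matches a catalog entry keep that ID, so votes
    stay attached; everything else gets a fresh km_ ID.
    """
    result = []
    for item in extracted:
        existing = item.get("existing_id")
        if existing in catalog_ids:
            item_id = existing
        else:
            # Assumed: first 8 hex characters of a random UUID.
            item_id = f"km_{uuid.uuid4().hex[:8]}"
        result.append({**item, "id": item_id})
    return result
```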
Filtering rules (in the prompt):
- EXCLUDE: API keys, tokens, passwords, credentials
- EXCLUDE: Personal preferences, project-specific paths
- EXCLUDE: Basic knowledge any developer would know
- EXCLUDE: Incomplete or unclear notes
- EXCLUDE: Anything referencing specific people negatively
Manual Reset
To recalculate the entire knowledge base from scratch (e.g., after fixing duplicates):
# Reset: clears knowledge.json, votes.json, user_hashes.json, and stale .claude_rules
sudo /usr/local/bin/collect-knowledge --reset --verbose
The --reset flag:
- Clears knowledge.json, user_hashes.json, and votes.json
- Removes stale .claude_rules/km_*.md files from all user home directories
- Runs a fresh collection from all CLAUDE.local.md files
This is a manual operation, not part of the regular timer schedule.
Monitoring
# Check timer status
sudo systemctl status corporate-memory.timer
# View last collection
sudo journalctl -u corporate-memory -n 50 --no-pager
# Manual collection run
sudo systemctl start corporate-memory.service
# Manual run with verbose output (shows API calls, items found)
sudo /usr/local/bin/collect-knowledge --verbose
# View knowledge base
cat /data/corporate-memory/knowledge.json | python3 -m json.tool
# Check item count
cat /data/corporate-memory/knowledge.json | python3 -c "import json,sys; d=json.load(sys.stdin); print(f'Items: {len(d.get(\"items\", {}))}')"
# Check votes
cat /data/corporate-memory/votes.json | python3 -m json.tool
# Check user hashes (change detection state)
cat /data/corporate-memory/user_hashes.json | python3 -m json.tool
# View a user's synced rules
ls -la /home/petr/.claude_rules/
Webapp Integration
The Corporate Memory page at /corporate-memory provides:
- Dashboard stats: Total items, contributors, categories, last collection time
- Knowledge cards: Title, content, category badge, tags, contributor avatars (initials + tooltip)
- Voting: Upvote/downvote buttons per item (instantly updates score, regenerates user rules)
- Filtering: By category dropdown, text search (title + content + tags)
- Sorting: By score (default), by date, by number of contributors
- "My Rules" toggle: Shows only items the current user has upvoted
- User stats: Number of votes cast, number of active rules
API endpoints:
- GET /api/corporate-memory/knowledge - List items (supports category, search, sort, page, my_rules params)
- POST /api/corporate-memory/vote - Cast vote {item_id, vote: 1/-1/0}
- GET /api/corporate-memory/stats - Dashboard statistics
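As a rough illustration of calling these endpoints, the sketch below builds the query string for the knowledge listing and the JSON body for a vote. The parameter names and vote values come from the list above; the accepted values for sort and my_rules are not documented here, so any values passed in are assumptions, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

BASE = "/api/corporate-memory"


def knowledge_url(category=None, search=None, sort=None, page=None, my_rules=None) -> str:
    """Build the GET URL, including only the parameters that were set."""
    candidates = {
        "category": category,
        "search": search,
        "sort": sort,
        "page": page,
        "my_rules": my_rules,
    }
    params = {k: v for k, v in candidates.items() if v is not None}
    query = urlencode(params)
    return f"{BASE}/knowledge" + (f"?{query}" if query else "")


def vote_payload(item_id: str, vote: int) -> dict:
    """Build the POST body for /vote; only 1, -1 and 0 are listed as valid."""
    if vote not in (1, -1, 0):
        raise ValueError("vote must be 1, -1 or 0")
    return {"item_id": item_id, "vote": vote}
```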
Security
- Root access required: Collector service runs as root to read /home/*/CLAUDE.local.md
- Sudo helper for rules: Webapp uses install-user-rules via sudo to write to user home dirs (same pattern as sync_settings_service.py). Each user's .claude_rules/ is mode 700, files 600 - users cannot read each other's rules.
- Sensitivity filtering: Two-pass check (extraction prompt rules + dedicated sensitivity check on new items)
- No credentials stored: Knowledge items are filtered before storage
- Source attribution: Items track which users contributed (displayed as avatar initials)
- Read-only for analysts: /data/corporate-memory/ is only writable by data-ops group
- Atomic writes: All JSON file updates use tempfile.mkstemp() + os.replace() to prevent corruption. Critical: always call os.fchmod(fd, 0o660) (or appropriate mode) immediately after mkstemp(). Otherwise the default 0600 mode of the temp file sets the POSIX ACL mask to ---, breaking group-based access for other services. See #203.
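The atomic-write pattern described above can be sketched as follows. The function name atomic_write_json, the indent setting, and the fsync call are illustrative assumptions; the mkstemp() + fchmod() + os.replace() sequence is the part taken from this document.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data, mode: int = 0o660) -> None:
    """Write JSON to path atomically with an explicit file mode."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace() to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            # mkstemp() creates the file as 0600; widen it before any data
            # is written so group access survives the replace.
            os.fchmod(fd, mode)
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no stray temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers either see the old file or the complete new one, never a partially written file.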
Session Collector
Collects Claude Code session transcripts from analyst home directories and stores them centrally.
Architecture
/home/*/user/sessions/ (per-user session transcripts)
│
▼
session-collector.timer (every 6 hours)
│
▼
/data/user_sessions/ (central storage, root:data-ops, mode 2770)
Systemd Services
| Unit | Type | Schedule | Description |
|---|---|---|---|
| session-collector.service | oneshot | on-demand | Runs the session collector |
| session-collector.timer | timer | every 6 hours | Triggers the service |
Monitoring
sudo systemctl status session-collector.timer
sudo journalctl -u session-collector -n 50 --no-pager
Security
- Root access required: Collector runs as root to read /home/*/user/sessions/
- Central storage: /data/user_sessions/ is writable only by data-ops group
WebSocket Gateway
Real-time WebSocket gateway for desktop app notifications and live updates.
Architecture
Desktop App (WebSocket client)
│
▼
ws-gateway.service (deploy:data-ops)
│
▼
/run/ws-gateway/ws.sock (unix socket, mode 0755)
Systemd Service
| Unit | Type | Description |
|---|---|---|
| ws-gateway.service | simple | WebSocket gateway for desktop clients |
Monitoring
sudo systemctl status ws-gateway
sudo journalctl -u ws-gateway -n 50 --no-pager
Security
- JWT authentication: Desktop clients authenticate via JWT tokens (DESKTOP_JWT_SECRET)
- Read-only home: Service has ProtectHome=read-only
- Strict protection: ProtectSystem=strict limits filesystem access
Google Cloud Monitoring
The server uses Google Cloud Ops Agent for centralized logging and metrics collection. All logs and metrics are sent to Google Cloud for analysis, alerting, and debugging.
What's Collected
Logs (Fluent Bit → Cloud Logging):
- All syslog messages (/var/log/syslog, /var/log/messages)
- systemd journal logs (including service failures, crashes)
- Application logs (if written to syslog/journal)
- Retention: 30 days (default)
Metrics (OpenTelemetry → Cloud Monitoring):
- CPU utilization (%)
- Memory usage (%)
- Disk usage (%) per device
- Network traffic (bytes sent/received)
- Load average
- Collection interval: 60 seconds
- Retention: 6 weeks (default)
Configured Alerts
Alert notifications are sent to:
- Admin 1 (admin1@your-domain.com)
- Admin 2 (admin2@your-domain.com)
- Admin 3 (admin3@your-domain.com)
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| High CPU Usage | >80% | 5 minutes | Check: ssh kids 'ps aux --sort=-%cpu \| head -20' |
| High Memory Usage | >90% | 5 minutes | Check: ssh kids 'free -h && ps aux --sort=-%mem \| head -20' |
| High Disk Usage | >85% | 1 minute | Check: ssh kids 'df -h && du -sh /data/* \| sort -h' |
| Health Endpoint Down | Uptime check fails | 3 minutes | Check: ssh kids 'systemctl status webapp' |
| Health Endpoint Degraded | /health returns 503 | 2 minutes | Check: curl https://your-instance.example.com/health and review service status |
| Systemd Service/Timer Failures | Any failure | 1 minute | Check: ssh kids 'systemctl --failed && journalctl -xe' |
Log-Based Metrics
Custom metrics derived from logs for trend analysis:
| Metric | Description | Filter |
|---|---|---|
| systemd_service_failures | Count of systemd service/timer failures | "Failed with result" OR "failed with result" |
| permission_denied_errors | Count of Permission denied errors | "Permission denied" |
| health_endpoint_degraded | Count of /health returning 503 | "/health" AND ("503" OR "degraded") |
Dashboard
Server Overview Dashboard:
- Real-time CPU, Memory, Disk, Network graphs
- Systemd service failures
- Health endpoint status
- URL: https://console.cloud.google.com/monitoring/dashboards/custom/09cdd94b-a0ed-4458-952f-3cca2bd5ba6e?project=kids-ai-data-analysis
Health Endpoint & Uptime Monitoring
Health Endpoint: https://your-instance.example.com/health
Returns detailed server status in JSON format:
- Services: webapp.service, telegram-bot.service
- Timers: jira-consistency.timer, corporate-memory.timer, jira-sla-poll.timer
- Disk usage: All partitions (/, /data, /home, /tmp)
- System load: 1min, 5min, 15min averages
- Jira webhook: Last webhook timestamp and age
Response format:
{
"status": "healthy", // or "degraded"
"timestamp": "2026-02-13T18:50:33.825333Z",
"services": [{"name": "webapp.service", "status": "active", "healthy": true}],
"timers": [{"name": "jira-consistency.timer", "status": "active", "healthy": true}],
"disk": [
{"partition": "/", "used_percent": 79.4, "free_gb": 1.98, "healthy": true},
{"partition": "/data", "used_percent": 39.0, "free_gb": 17.92, "healthy": true}
],
"load": {"load_1min": 0.58, "load_5min": 1.82, "load_15min": 1.85, "healthy": true},
"jira_webhook": {"last_webhook_hours_ago": 0.0, "healthy": true}
}
HTTP Status Codes:
- 200 OK = all checks healthy
- 503 Service Unavailable = one or more checks failed (status: "degraded")
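A small sketch of how a client could interpret the /health payload shown above. The field names follow the response format in this section; the function names and the rule that an entry without a "healthy" field counts as failed are assumptions.

```python
def unhealthy_checks(health: dict) -> list[str]:
    """List the checks in a /health payload that report healthy == False."""
    failed = []
    # List-valued sections: one entry per service, timer, or partition.
    for section in ("services", "timers", "disk"):
        for entry in health.get(section, []):
            if not entry.get("healthy", False):
                label = entry.get("name") or entry.get("partition")
                failed.append(f"{section}:{label}")
    # Single-object sections.
    for section in ("load", "jira_webhook"):
        data = health.get(section)
        if data is not None and not data.get("healthy", False):
            failed.append(section)
    return failed


def expected_http_status(health: dict) -> int:
    """200 when the overall status is healthy, 503 otherwise."""
    return 200 if health.get("status") == "healthy" else 503
```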
Uptime Check:
- Monitors /health endpoint from 3 global locations (USA, Europe, Asia-Pacific)
- Check interval: 5 minutes
- Timeout: 10 seconds
- Validates response contains "status": "healthy"
- Alert triggered if check fails for 3+ minutes
Viewing Logs
Cloud Logging Console: https://console.cloud.google.com/logs?project=kids-ai-data-analysis
Useful log queries:
# All logs from the server (last 1 hour)
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
# systemd service failures
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
("Failed with result" OR "Main process exited")
# Permission denied errors
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"Permission denied"
# Webapp errors
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"gunicorn" AND ("ERROR" OR "WARNING")
# Jira webhook processing
resource.type="gce_instance"
resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
"Received webhook"
Viewing Metrics
Cloud Monitoring Console: https://console.cloud.google.com/monitoring?project=kids-ai-data-analysis
Metrics Explorer - Useful metric queries:
- CPU: compute.googleapis.com/instance/cpu/utilization
- Memory: agent.googleapis.com/memory/percent_used
- Disk: agent.googleapis.com/disk/percent_used
- Network: agent.googleapis.com/network/bytes_sent/bytes_recv
Cost
Google Cloud Monitoring pricing (as of 2026):
- Logs ingestion: First 50 GB/month free, then $0.50/GB
- Metrics ingestion: First 150 MB/month free, then $0.2580/MB
- Log storage: $0.01/GB/month (30-day retention)
- Typical monthly cost for this server: ~$5-10 (well within free tier)
Significantly cheaper than Datadog (~$15-31/host/month).
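The ingestion pricing above can be turned into a quick estimate. This sketch uses only the free-tier limits and per-unit prices listed in this section, ignores log storage, and is not an official pricing calculator; the function name is hypothetical.

```python
LOGS_FREE_GB = 50
LOGS_PRICE_PER_GB = 0.50
METRICS_FREE_MB = 150
METRICS_PRICE_PER_MB = 0.2580


def monthly_ingestion_cost(logs_gb: float, metrics_mb: float) -> float:
    """Estimated monthly ingestion cost in USD (storage not included)."""
    # Only the volume above each free tier is billed.
    logs = max(0.0, logs_gb - LOGS_FREE_GB) * LOGS_PRICE_PER_GB
    metrics = max(0.0, metrics_mb - METRICS_FREE_MB) * METRICS_PRICE_PER_MB
    return round(logs + metrics, 2)
```

For example, 60 GB of logs with metrics inside the free tier comes to 10 GB × $0.50 = $5.00.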
Managing Alerts
List alert policies:
gcloud alpha monitoring policies list \
--project=kids-ai-data-analysis \
--format="table(displayName,enabled,conditions[0].conditionThreshold.thresholdValue)"
Disable an alert:
gcloud alpha monitoring policies update POLICY_ID \
--project=kids-ai-data-analysis \
--no-enabled
Add notification channel:
gcloud alpha monitoring channels create \
--project=kids-ai-data-analysis \
--display-name="New Person" \
--type=email \
--channel-labels=email_address=person@your-domain.com
Debugging Server Crashes
When investigating server issues (like the 2026-02-13 systemd-journald crash):
- View logs around the crash time:
  - Go to Cloud Logging Console
  - Filter: resource.labels.instance_id="656c1763-11a1-49bb-bbc3-9782acf15aef"
  - Set time range to include the crash
  - Look for ERROR/WARNING severity
- Check metrics before the crash:
  - Go to Dashboard or Metrics Explorer
  - View CPU/Memory/Disk graphs for the time period
  - Look for spikes or anomalies
- Correlate logs with metrics:
  - High CPU spike at 15:20? Check logs from that time
  - Memory growth over time? Look for memory leaks in logs
- Export for analysis:
  # Export logs to file
  gcloud logging read "resource.labels.instance_id=\"656c1763-11a1-49bb-bbc3-9782acf15aef\"" \
  --project=kids-ai-data-analysis \
  --limit=1000 \
  --format=json \
  --freshness=1d > server_logs.json
Best Practices
- Structured logging: Applications should log in JSON format for better searchability
- Log levels: Use appropriate levels (ERROR for problems, INFO for events, DEBUG for details)
- Alert fatigue: Only alert on actionable issues, not informational events
- Regular review: Check dashboard weekly to spot trends before they become problems
- Cost monitoring: If ingestion grows, consider log sampling or exclusion filters