H1 - Sanitize dev_docs/ for public release:
- Replace all real employee names with generic placeholders
(padak->admin1, matejkys->admin2, dasa->admin3, petr->john, etc.)
- Replace GCP project ID (kids-ai-data-analysis -> your-gcp-project)
- Replace server hostname (data-broker-for-claude -> your-server)
- Replace real IP address (34.88.8.46 -> YOUR_SERVER_IP)
- Replace internal FQDN with placeholder
- Covers: security.md, server.md, disaster-recovery.md, desktop-app.md,
session_explore.md, plan-rsync-fix.md, draft/*.md
H3 - webapp-setup.sh: validate sudoers syntax BEFORE copying to /etc/sudoers.d
- Prevents broken sudo if syntax is invalid
- Uses install -m 440 for atomic copy with correct permissions
M1 - setup.sh: deploy user created with /usr/sbin/nologin instead of /bin/bash
- CI/CD service account does not need interactive shell
M2 - config/loader.py: warn on missing env vars, validate webapp_secret_key
- _resolve_env_refs now logs warnings for unset ${ENV_VAR} references
- _validate_config checks auth.webapp_secret_key is non-empty
- Prevents Flask signing sessions with empty secret key
All 118 tests pass.
6.2 KiB
Plan: Fix rsync synchronization hang (#197)
Problem
Rsync from GCP server (YOUR_SERVER_IP) hangs after 1-5 minutes. Process exists but has 0% CPU and no network activity. 100% reproducible with ~7000 parquet files.
Root Cause Analysis
The rsync commands in scripts/sync_data.sh use plain rsync -avz without:
- SSH keepalive - No
-e "ssh -o ServerAliveInterval=60"to keep the TCP connection alive through firewall/NAT - I/O timeout - No
--timeout=Nto detect stalled transfers - Retry logic - A single connection drop kills the entire sync
- Partial resume - No
--partial-dirso retries restart file transfers from scratch - Compression skip -
-zcompresses parquet files that are already snappy-compressed (CPU waste)
A stateful firewall or NAT device (GCP Cloud NAT, VPC firewall, or analyst's home router) drops idle TCP connections. Server-side ClientAliveInterval=300s sends keepalive, but the client side sends nothing. The firewall sees idle traffic and silently drops the connection. Neither rsync nor SSH client detects the dead connection.
Note: GCP VPC firewall default idle timeout is 600s (10 min). The 1-5 minute hang suggests a more aggressive timeout on Cloud NAT or analyst's local network.
Implementation Plan
Step 1: Create diagnostic tests (tests/test_sync_data.py)
Real end-to-end tests that download data to gitignored data/ folder with detailed logging to pinpoint exactly where and why the hang occurs.
Live diagnostic tests (require server access, marked @pytest.mark.live):
| Test | Purpose |
|---|---|
test_ssh_connectivity |
Basic SSH connection test with timeout |
test_rsync_small_directory |
Sync docs (~few files) to verify baseline works |
test_rsync_with_keepalive |
Sync parquet with SSH keepalive options |
test_rsync_with_timeout |
Sync parquet with --timeout=60 to detect hang quickly |
test_rsync_per_subdirectory |
Sync each parquet subdir separately to isolate which triggers the hang |
test_rsync_without_compression |
Sync parquet without -z to test if compression causes stall |
Each test logs to data/sync_diagnostics/ with timestamps, exit codes, bytes transferred, duration.
Static regression tests (run in CI without server access):
| Test | Purpose |
|---|---|
test_rsync_commands_use_ssh_keepalive |
Every rsync in scripts has ServerAliveInterval |
test_rsync_commands_have_timeout |
Every rsync has --timeout=N |
test_parquet_rsync_does_not_compress |
Parquet rsync uses -av not -avz |
test_sync_scripts_have_retry_logic |
Scripts define retry wrapper |
Step 2: Run diagnostic tests and analyze results
Before implementing any fix, run the live diagnostic tests to confirm the root cause:
pytest tests/test_sync_data.py -v -k "live" --tb=short 2>&1 | tee data/sync_diagnostics/test_run.log
Expected outcomes to validate:
test_ssh_connectivitypasses → SSH connection itself is finetest_rsync_small_directorypasses → small syncs work, problem is scale/duration-relatedtest_rsync_with_timeoutfails with timeout → confirms connection drop (rsync exits instead of hanging forever)test_rsync_with_keepalivepasses → confirms keepalive fixes the issuetest_rsync_per_subdirectory→ identifies if specific directory triggers the hang or if it's purely time-basedtest_rsync_without_compression→ reveals if-zcontributes to the stall
Decision gate: Based on results, confirm which combination of fixes is needed before proceeding to Step 3. If keepalive alone fixes it, the other changes are still applied as defense-in-depth.
Step 3: Fix scripts/sync_data.sh
Add reliability wrapper after argument parsing:
# --- Rsync reliability settings (Issue #197) ---
RSYNC_SSH_OPTS='ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -o ConnectTimeout=30'
RSYNC_TIMEOUT=300
RSYNC_MAX_RETRIES=3
RSYNC_RETRY_DELAY=5
rsync_reliable() {
local attempt=1
local delay=$RSYNC_RETRY_DELAY
while [[ $attempt -le $RSYNC_MAX_RETRIES ]]; do
if rsync -e "$RSYNC_SSH_OPTS" --timeout="$RSYNC_TIMEOUT" \
--partial-dir=.rsync-partial "$@"; then
return 0
fi
local exit_code=$?
if [[ $attempt -lt $RSYNC_MAX_RETRIES ]]; then
echo " Rsync failed (exit $exit_code), retrying in ${delay}s ($attempt/$RSYNC_MAX_RETRIES)..."
sleep "$delay"
delay=$((delay * 2))
fi
attempt=$((attempt + 1))
done
echo " ERROR: Rsync failed after $RSYNC_MAX_RETRIES attempts"
return 1
}
Replace all rsync invocations to use rsync_reliable() wrapper:
- Drop
-zfor parquet transfers (already snappy-compressed) - Keep
-zfor text transfers (docs, scripts, metadata) --partial-dir=.rsync-partialenables resume on retry (Gemini review recommendation)
Step 4: Fix scripts/sync_jira.sh
Same rsync_reliable() wrapper and fixes.
Step 5: Update CI
Add tests/test_sync_data.py to .github/workflows/deploy-guard.yml (static regression tests only, live diagnostics excluded via marker).
Step 6: Verify fix end-to-end
Run full sync and confirm no hanging:
bash scripts/sync_data.sh --dry-run # script syntax OK
bash scripts/sync_data.sh # full sync completes
Files to Modify
| File | Action | Description |
|---|---|---|
tests/test_sync_data.py |
CREATE | Diagnostic + regression tests |
scripts/sync_data.sh |
EDIT | Add reliability wrapper, fix all rsync calls |
scripts/sync_jira.sh |
EDIT | Same fixes |
.github/workflows/deploy-guard.yml |
EDIT | Add static tests to CI |
Review Notes
Plan reviewed by Google Gemini (2026-02-17). Key feedback incorporated:
--partial-dir=.rsync-partialadded to wrapper for efficient retry resume- Root cause confirmed as most likely firewall/NAT idle timeout (textbook signature)
ServerAliveInterval=60confirmed as correct value (low overhead, safe interval)--timeout=300confirmed as reasonable (long enough to avoid false positives)- Drop
-zconfirmed correct (compressing snappy data wastes CPU) - Future optimization:
--files-fromapproach for large file sets if needed