- google_compute_resource_policy.daily_backup: daily snapshot at 02:00,
30-day retention, labels (app=agnes, customer=<name>)
- google_compute_disk_resource_policy_attachment.data_backup: attach policy
to each data disk (prod + dev)
- google_monitoring_uptime_check_config.health: per-VM /api/health uptime
check every 60s, 10s timeout
- google_monitoring_alert_policy.health_failure: alert when uptime check
fails for > 5 min
New opt-out: enable_monitoring = false (default true)
New opt-in: notification_channel_ids = [...] to wire alerts to email/Slack
Module API unchanged; existing customers pick up backups + monitoring on
next module upgrade. TF provider requirement unchanged.
The CI smoke test failed because docker-compose.prod.yml forced a bind mount
to /data on the host — which doesn't exist on GitHub runners.
Split the bind mount into docker-compose.host-mount.yml, which is only
composed by the VM startup script (/data exists there, mounted from the
persistent disk). CI continues to use the default named volume.
Module startup script + auto-upgrade cron now compose all three:
-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml
Watchtower container has Docker API mismatch (client 1.25 vs daemon 1.54+)
that can't be worked around without upstream fix. Simple cron job does the
same thing more reliably:
- Every 5 min: docker compose pull + detect digest change + up -d if changed
- Logs to /var/log/agnes-auto-upgrade.log
This removes the watchtower container and a Docker daemon dependency.
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch