chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88 , wave 1) (#94 )

* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module that is
   the customer-facing API surface in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.

2026-04-27 20:24:34 +02:00

16 KiB

Raw Blame History

Security audit — Agnes AI Data Analyst

Date: 2026-04-22 Branch audited: main at commit cbb7733 Method: parallel review passes over four scope areas — (1) secrets/SQLi/authz/SSRF, (2) auth flows & route wiring, (3) templates & UI wiring/XSS, (4) data layer & config & dead code. Findings deduped across the passes, severities adjusted to real-world exploitability.

Known issues already in flight are marked with their tracking links so we do not re-open them.

Top 5 — fix first

1. `[CRITICAL]` Script-API sandbox escape → RCE for the `analyst` role

File: app/api/scripts.py:116–180
Required role: Role.ANALYST (not admin)

Trigger: POST /api/scripts/run with body:

().__class__.__base__.__subclasses__()[N].__init__.__globals__["system"]("id")

Why existing guards miss it: the AST walker and the string allowlist block direct import os / exec, but neither stops attribute traversal through Python's class hierarchy (__class__ → __base__ → __subclasses__() → __globals__).
Impact: arbitrary OS commands under the FastAPI process uid. Gives access to DATA_DIR (DuckDB files, cached parquet), .jwt_secret, env vars, and any credentials mounted into the container.
Fix (minimum): add to the string-pattern blocklist: __subclasses__, __globals__, __mro__, __bases__, __class__, __dict__, __code__. In the AST walker, reject any ast.Attribute whose attr starts and ends with __.
Fix (correct): do not run untrusted Python in-process. Either drop server-side script execution entirely, or run the sandbox in nsjail/gVisor/Pyodide in isolation, or gate the endpoint behind Role.ADMIN if it must stay.
Confidence: broken (verified analytically).

2. `[HIGH]` `/auth/password/reset` endpoint missing — "Forgot Password?" returns 404

Template reference: app/web/templates/login_email.html:47 — <form method="POST" action="/auth/password/reset">
URL map: app/web/router.py:119 — "password_auth.reset_request": "/auth/password/reset"
Backend: app/auth/providers/password.py only registers /login, /login/web, /setup. No /reset handler is wired.
Related dead code: templates password_reset.html and password_setup.html exist but no route renders them — indicates an abandoned reset flow.
Tracking: tracked upstream in the auth-CLI repo
Confidence: broken.

3. `[HIGH]` No rate limiting on any auth endpoint

Files: app/auth/providers/password.py:36, app/auth/providers/email.py:53, app/auth/router.py:58, app/main.py (middleware).
Evidence: grep -rn "slowapi\|limiter\|throttle\|ratelimit" — zero hits in app/.
Impact:
- Password brute-force against POST /auth/password/login and POST /auth/token.
- Email bombing on POST /auth/email/send-link — attacker floods SMTP/SendGrid quota by looping with random recipients.
- Enumeration of bootstrap state via POST /auth/bootstrap.
Fix: add slowapi with @limiter.limit("10/minute") on the four endpoints above, get_remote_address as key with proxy-aware client-IP extraction (project already has _client_ip helper in app/auth/dependencies.py).
Confidence: broken.

4. `[HIGH]` Open-redirect bypass via backslash in `safe_next_path`

Files: app/auth/_common.py:10-24, app/auth/providers/password.py:95-96, app/web/router.py:218-219.
Trigger: https://agnes/login?next=/\evil.com
Why the current check fails: Python sees /\ (not //) and the guard startswith("//") does not fire. Every major browser normalizes \ to / in URL paths, so Location: /\evil.com resolves as //evil.com — a cross-origin redirect.
Impact: phishing — attacker crafts a link on the victim's Agnes URL, lands them on a lookalike after login.
Fix:
```
if len(candidate) > 1 and candidate[1] in ("/", "\\"):
    return default
```
Existing tests (tests/test_web_ui.py:270-296) cover //evil.example/ but not /\evil.com — add the case.
Confidence: broken.

5. `[HIGH]` SSRF in `/api/admin/configure` — `keboola_url` accepted as-is

File: app/api/admin.py:163–282; the URL is passed straight to KeboolaClient.test_connection() which issues a GET request.
Trigger: a compromised (or insider-threat) admin sends {"keboola_url": "http://169.254.169.254/latest/meta-data/"} or http://localhost:5432/.
Impact: server as SSRF proxy to the private network — GCP/AWS instance metadata service (IAM tokens), internal databases, LAN services.
Why a domain allowlist is wrong: keboola_url is the URL of the Keboola stack the Agnes instance connects to, not the Agnes host. Valid values include connection.keboola.com, connection.eu-central-1.keboola.com, connection.europe-west3.gcp.keboola.com, plus potentially self-hosted private Keboola stacks. A .keboola.com suffix check would break legitimate deployments.

Correct fix: enforce https:// scheme, then resolve the hostname and reject any result in a private / loopback / link-local / reserved block. Still allows any public HTTPS host.

from urllib.parse import urlparse
import ipaddress, socket

def _validate_stack_url(url: str) -> str | None:
    try:
        p = urlparse(url)
    except Exception:
        return "not a valid URL"
    if p.scheme != "https":
        return "must use https"
    if not p.hostname:
        return "missing hostname"
    try:
        infos = socket.getaddrinfo(p.hostname, None)
    except socket.gaierror:
        return f"cannot resolve {p.hostname}"
    for fam, _, _, _, sa in infos:
        try:
            ip = ipaddress.ip_address(sa[0])
        except ValueError:
            continue
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return f"{p.hostname} resolves to a private/loopback address ({ip})"
    return None

Residual risk: DNS rebinding (hostname resolves to a public IP at validation time, then to loopback at request time). Out of scope for this fix — would need an outbound egress proxy with an IP-level ACL.
Confidence: broken.

Second tier (HIGH)

#	Category	File:line	Summary	Confidence
6	race	`app/auth/providers/email.py:106-128`	`_consume_token` read-validate-clear is not atomic. Two concurrent clicks on the same magic link can both issue JWTs. Fix: `UPDATE users SET reset_token=NULL WHERE id=? AND reset_token=?` and check `rowcount == 1` before issuing the JWT.	broken
7	bootstrap	`app/auth/router.py:103-158`	Check is "any user with `password_hash`". If a seed admin exists without a password (e.g. created by `LOCAL_DEV_MODE=1` then redeployed without it, or a `SEED_ADMIN_EMAIL` without `SEED_ADMIN_PASSWORD`), `/auth/bootstrap` stays open and any caller can register a new admin account. Fix: check "any user at all" or require an explicit admin token for bootstrap.	broken
8	cookie	`app/main.py:61`	`SessionMiddleware(secret_key=...)` without `max_age` → OAuth session cookie expires only when the browser closes. Also `https_only` is not forced in production. Fix: `max_age=3600` and `https_only=bool(os.environ.get("DOMAIN"))`.	broken
9	sqli-adjacent	`connectors/keboola/extractor.py:104-106, 128`	`CREATE OR REPLACE VIEW "{table_name}" AS SELECT * FROM kbc."{bucket}"."{source_table}"` — f-string DDL with identifier quotes. Inputs come from `table_registry` (admin-controlled), but relying on quote-escaping for DDL is fragile. Fix: validate each identifier against `^[a-zA-Z0-9_]{1,63}$` before interpolation.	suspicious
10	datetime	`connectors/keboola/client.py:183`	Same offset-naive vs. offset-aware bug as the one we just fixed in `app/auth/providers/email.py`. `datetime.now() - cached_time` crashes or treats every cache entry as stale, causing needless API calls. Fix: use `datetime.now(timezone.utc)` on both sides.	broken

Third tier (MEDIUM)

#	Category	File:line	Summary
11	timing	`email.py:113`, `password.py:117`, `dependencies.py:148`	Token / `token_hash` compared with `!=` instead of `hmac.compare_digest`. 32-byte entropy mitigates the brute side, but constant-time comparison is a zero-cost hardening.
12	xss-latent	`app/web/templates/_theme.html:12`	`{{ var }}: {{ val }};` inside a `<style>` block. Jinja auto-escape does not escape for CSS context. Today the values come from `instance.yaml` (FS-write), so risk is low — but it escalates to HIGH if a `/api/admin/theme` endpoint is ever added. Validate against `^[a-zA-Z0-9#()., %-]{1,50}$`.
13	config-drift	`app/instance_config.py:21-23`	`_instance_config` is cached in a module-level global with no invalidation — changes via `/api/admin/configure` or direct file edits do not take effect until the process restarts. Fix: file-mtime watch, or explicit `reload=True` kwarg on `load_instance_config`.
14	rbac-inconsistency	`app/auth/dependencies.py:203-210`	`require_admin` does an exact-match `role == "admin"` check, while `require_role(Role.ADMIN)` respects the hierarchy. Risk: if a higher role is ever added, `require_admin` rejects it. Fix: replace all call sites with `require_role(Role.ADMIN)`, delete `require_admin`.
15	sqli-defense	`src/repositories/knowledge.py:49-58`	`update(fields)` builds `SET` clause from `fields.keys()` without an allowlist (unlike `users.py` and `metrics.py`). Today the API layer filters to a safe set of keys, but one future caller with user-controlled `fields` turns this into SQL column injection. Add an `ALLOWED_KNOWLEDGE_FIELDS` set in the repository layer.
16	silent-fail	`services/scheduler/__main__.py:36-48`	If `SCHEDULER_API_TOKEN` is missing and auto-fetch fails, the scheduler proceeds with an empty token. API calls then 401 silently and no sync runs. Fix: `logger.error()` + non-zero exit.
17	dos-input	`src/repositories/audit.py:38-49`	`limit` is a user-supplied int passed straight into SQL `LIMIT`. `limit=1_000_000` → OOM on large audit tables. Fix: `limit = min(max(limit, 1), 1000)`.
18	pat-audit-only	`app/auth/dependencies.py:151-173`	New-IP use of a PAT logs to audit but neither blocks the request nor notifies the owner. A leaked PAT keeps working until someone reviews the audit log. Consider auto-revoke on new IP for tokens older than N days, or at least a notification hook.
19	cookie-inconsistency	`google.py:115` vs `password.py:93`, `email.py:156`	Three different rules for `secure=`: Google uses `TESTING != 1`, the others use `DOMAIN != ""`. Unify on one helper (`_is_secure_context()` keyed on `DOMAIN`).

Fourth tier (LOW) — debt, not incidents

#	Category	File:line	Summary
20	path-traversal	`app/api/upload.py:77`	Target path is `md_dir / f"{user_email}.md"`. `user_email` comes from the JWT (trusted after authn) but is not sanitized — theoretical exploit requires being able to register an account with `../` in the email, which depends on the OAuth provider / domain filter. `upload_session` and `upload_artifact` correctly use `Path(raw).name`. Fix: use `user["id"]` as the filename, or `re.sub(r"[^a-zA-Z0-9@._-]", "_", email)`.
21	info-disclosure	`app/api/users.py:209`	`POST /api/users/{id}/reset-password` returns `{"reset_token": "..."}` in plaintext. Admin-only, but if the response is logged (Nginx access log, load balancer, audit proxy) the token is captured. Return `"email_sent": true` and email the token, or truncate the token in the log path.
22	dead-code	`app/web/templates/password_reset.html`, `password_setup.html`, `login_magic_link_sent.html`	Templates exist, no Python route renders them. Delete or finish the flow they belong to (password reset, magic-link "check your email" confirmation).
23	empty-schedule	`src/scheduler.py:92-99`	Empty `schedule` string silently returns "not due" — tables with misconfigured schedules never sync, no warning. Log at `WARNING` when a schedule doesn't match any known pattern.
24	silent-excepts	`src/db.py:254, 286, 309, 325, 368, 382, 489, 546`, `src/orchestrator.py:121`	`except Exception: pass`. Some are legitimate (optional `CHECKPOINT` on read-only DBs), others mask migration failures. Triage each; log at least at `DEBUG`.

Verified non-issues (do not re-open)

Several patterns looked scary at first glance but are correctly defended:

UserRepository.update() SQL injection — one audit pass flagged this as CRITICAL, but the allowed = {...} allowlist at src/repositories/users.py:50-56 drops any column not in the set before the SET clause is built. Safe.
Orchestrator DETACH / ATTACH — identifiers are validated through _SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$") in src/orchestrator.py before interpolation.
CORS wildcard + credentials — default CORS_ORIGINS is localhost:3000,localhost:8000, not *. Starlette's CORSMiddleware also refuses to combine * with allow_credentials=True.
yaml.load unsafe variant — not present anywhere. Every YAML parser uses yaml.safe_load.
Hardcoded secrets in the repo — no real secrets in .py/.yaml/.md. config/deploy.yml references env vars (${KAMAL_REGISTRY_PASSWORD}). Git history has no .env leak.
LOCAL_DEV_MODE reachable via HTTP — the flag is read from os.environ only; no request header or query parameter toggles it.
IDOR on /auth/tokens/{token_id} — handler checks row["user_id"] != user["id"] before returning or modifying the token. Correct.
Jira webhook signature — connectors/jira/webhook.py uses hmac.compare_digest. Correct.

Coverage gaps — what this audit did not check

Area	Why skipped
`connectors/jira/transform.py`, `incremental_transform.py`	Scope + complex transformation logic
`services/corporate_memory/`, `services/telegram_bot/`, `services/ws_gateway/*`	Not read in depth
Frontend JS in `app/web/static/*.js`	Audit #3 focused on HTML templates
`async` / `await` correctness (blocking calls inside async handlers)	Not traced
Automated scanners (`bandit`, `semgrep`, `trivy`, `pip-audit`)	Not installed in the audit environment
Dependency CVE audit	73 packages in `uv.lock`, not walked manually
`pytest` run	Test suite not executed — some findings may already be covered by existing tests
Docker hardening (`Dockerfile` — user namespaces, capabilities)	Out of scope
`infra/modules/customer-instance/` (Terraform / startup scripts)	Out of scope

Honest estimate: roughly 65–70 % of the real attack surface is covered. To close the gap I would run bandit -r app/ src/ connectors/, pip-audit, execute the pytest suite, and review outbound traffic logs from staging.

Proposed action plan

This week:

#1 script sandbox escape — one issue, RCE severity, own ticket
#3 rate limiting on auth — slowapi introduction, one PR
#4 backslash open-redirect — two-line fix plus test

Next sprint:

#5 SSRF validator in /api/admin/configure (with the IP-range variant above, not a .keboola.com suffix)
#6 atomic magic-link token consumption
#7 bootstrap hardening
#8 SessionMiddleware max_age + https_only

Backlog (single tracking issue with checkboxes):

#11–24 — timing comparisons, theme XSS gating, config reload, RBAC unification, schedule validation, silent-except triage, dead templates cleanup.

The missing /auth/password/reset endpoint (#2) is already tracked upstream in the auth-CLI repo.

16 KiB Raw Blame History Unescape Escape