agnes-the-ai-analyst/docs/archive/security-audit-2026-04.md
ZdenekSrotyr a48524509a
docs: consolidate and de-clutter the documentation tree (#306)
CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.
2026-05-14 18:54:22 +00:00

193 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Security audit — Agnes AI Data Analyst
> **Archived.** This is a point-in-time audit snapshot. Findings have been
> triaged and addressed in subsequent releases — check `CHANGELOG.md` (search
> for the relevant file/endpoint) for the fixes that landed since 2026-04-22.
> Kept for historical reference; not a live tracker.
**Date:** 2026-04-22
**Branch audited:** `main` at commit `cbb7733`
**Method:** parallel review passes over four scope areas — (1) secrets/SQLi/authz/SSRF, (2) auth flows & route wiring, (3) templates & UI wiring/XSS, (4) data layer & config & dead code. Findings deduped across the passes, severities adjusted to real-world exploitability.
Known issues already in flight are marked with their tracking links so we do not re-open them.
---
## Top 5 — fix first
### 1. `[CRITICAL]` Script-API sandbox escape → RCE for the `analyst` role
- **File:** `app/api/scripts.py:116180`
- **Required role:** `Role.ANALYST` (not admin)
- **Trigger:** `POST /api/scripts/run` with body:
```python
().__class__.__base__.__subclasses__()[N].__init__.__globals__["system"]("id")
```
- **Why existing guards miss it:** the AST walker and the string allowlist block direct `import os` / `exec`, but neither stops attribute traversal through Python's class hierarchy (`__class__ → __base____subclasses__() → __globals__`).
- **Impact:** arbitrary OS commands under the FastAPI process uid. Gives access to `DATA_DIR` (DuckDB files, cached parquet), `.jwt_secret`, env vars, and any credentials mounted into the container.
- **Fix (minimum):** add to the string-pattern blocklist: `__subclasses__`, `__globals__`, `__mro__`, `__bases__`, `__class__`, `__dict__`, `__code__`. In the AST walker, reject any `ast.Attribute` whose `attr` starts and ends with `__`.
- **Fix (correct):** do not run untrusted Python in-process. Either drop server-side script execution entirely, or run the sandbox in `nsjail`/gVisor/Pyodide in isolation, or gate the endpoint behind `Role.ADMIN` if it must stay.
- **Confidence:** broken (verified analytically).
### 2. `[HIGH]` `/auth/password/reset` endpoint missing — "Forgot Password?" returns 404
- **Template reference:** `app/web/templates/login_email.html:47``<form method="POST" action="/auth/password/reset">`
- **URL map:** `app/web/router.py:119``"password_auth.reset_request": "/auth/password/reset"`
- **Backend:** `app/auth/providers/password.py` only registers `/login`, `/login/web`, `/setup`. No `/reset` handler is wired.
- **Related dead code:** templates `password_reset.html` and `password_setup.html` exist but no route renders them — indicates an abandoned reset flow.
- **Tracking:** tracked upstream in the auth-CLI repo
- **Confidence:** broken.
### 3. `[HIGH]` No rate limiting on any auth endpoint
- **Files:** `app/auth/providers/password.py:36`, `app/auth/providers/email.py:53`, `app/auth/router.py:58`, `app/main.py` (middleware).
- **Evidence:** `grep -rn "slowapi\|limiter\|throttle\|ratelimit"` — zero hits in `app/`.
- **Impact:**
- Password brute-force against `POST /auth/password/login` and `POST /auth/token`.
- Email bombing on `POST /auth/email/send-link` — attacker floods SMTP/SendGrid quota by looping with random recipients.
- Enumeration of bootstrap state via `POST /auth/bootstrap`.
- **Fix:** add `slowapi` with `@limiter.limit("10/minute")` on the four endpoints above, `get_remote_address` as key with proxy-aware client-IP extraction (project already has `_client_ip` helper in `app/auth/dependencies.py`).
- **Confidence:** broken.
### 4. `[HIGH]` Open-redirect bypass via backslash in `safe_next_path`
- **Files:** `app/auth/_common.py:10-24`, `app/auth/providers/password.py:95-96`, `app/web/router.py:218-219`.
- **Trigger:** `https://agnes/login?next=/\evil.com`
- **Why the current check fails:** Python sees `/\` (not `//`) and the guard `startswith("//")` does not fire. Every major browser normalizes `\` to `/` in URL paths, so `Location: /\evil.com` resolves as `//evil.com` — a cross-origin redirect.
- **Impact:** phishing — attacker crafts a link on the victim's Agnes URL, lands them on a lookalike after login.
- **Fix:**
```python
if len(candidate) > 1 and candidate[1] in ("/", "\\"):
return default
```
Existing tests (`tests/test_web_ui.py:270-296`) cover `//evil.example/` but not `/\evil.com` — add the case.
- **Confidence:** broken.
### 5. `[HIGH]` SSRF in `/api/admin/configure` — `keboola_url` accepted as-is
- **File:** `app/api/admin.py:163282`; the URL is passed straight to `KeboolaClient.test_connection()` which issues a GET request.
- **Trigger:** a compromised (or insider-threat) admin sends `{"keboola_url": "http://169.254.169.254/latest/meta-data/"}` or `http://localhost:5432/`.
- **Impact:** server as SSRF proxy to the private network — GCP/AWS instance metadata service (IAM tokens), internal databases, LAN services.
- **Why a domain allowlist is wrong:** `keboola_url` is the URL of the Keboola stack the Agnes instance connects to, **not** the Agnes host. Valid values include `connection.keboola.com`, `connection.eu-central-1.keboola.com`, `connection.europe-west3.gcp.keboola.com`, plus potentially self-hosted private Keboola stacks. A `.keboola.com` suffix check would break legitimate deployments.
- **Correct fix:** enforce `https://` scheme, then resolve the hostname and reject any result in a private / loopback / link-local / reserved block. Still allows any public HTTPS host.
```python
from urllib.parse import urlparse
import ipaddress, socket
def _validate_stack_url(url: str) -> str | None:
try:
p = urlparse(url)
except Exception:
return "not a valid URL"
if p.scheme != "https":
return "must use https"
if not p.hostname:
return "missing hostname"
try:
infos = socket.getaddrinfo(p.hostname, None)
except socket.gaierror:
return f"cannot resolve {p.hostname}"
for fam, _, _, _, sa in infos:
try:
ip = ipaddress.ip_address(sa[0])
except ValueError:
continue
if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
return f"{p.hostname} resolves to a private/loopback address ({ip})"
return None
```
- **Residual risk:** DNS rebinding (hostname resolves to a public IP at validation time, then to loopback at request time). Out of scope for this fix — would need an outbound egress proxy with an IP-level ACL.
- **Confidence:** broken.
---
## Second tier (HIGH)
| # | Category | File:line | Summary | Confidence |
|---|---|---|---|---|
| 6 | race | `app/auth/providers/email.py:106-128` | `_consume_token` read-validate-clear is not atomic. Two concurrent clicks on the same magic link can both issue JWTs. Fix: `UPDATE users SET reset_token=NULL WHERE id=? AND reset_token=?` and check `rowcount == 1` before issuing the JWT. | broken |
| 7 | bootstrap | `app/auth/router.py:103-158` | Check is "any user with `password_hash`". If a seed admin exists without a password (e.g. created by `LOCAL_DEV_MODE=1` then redeployed without it, or a `SEED_ADMIN_EMAIL` without `SEED_ADMIN_PASSWORD`), `/auth/bootstrap` stays open and any caller can register a new admin account. Fix: check "any user at all" or require an explicit admin token for bootstrap. | broken |
| 8 | cookie | `app/main.py:61` | `SessionMiddleware(secret_key=...)` without `max_age` → OAuth session cookie expires only when the browser closes. Also `https_only` is not forced in production. Fix: `max_age=3600` and `https_only=bool(os.environ.get("DOMAIN"))`. | broken |
| 9 | sqli-adjacent | `connectors/keboola/extractor.py:104-106, 128` | `CREATE OR REPLACE VIEW "{table_name}" AS SELECT * FROM kbc."{bucket}"."{source_table}"` — f-string DDL with identifier quotes. Inputs come from `table_registry` (admin-controlled), but relying on quote-escaping for DDL is fragile. Fix: validate each identifier against `^[a-zA-Z0-9_]{1,63}$` before interpolation. | suspicious |
| 10 | datetime | `connectors/keboola/client.py:183` | Same offset-naive vs. offset-aware bug as the one we just fixed in `app/auth/providers/email.py`. `datetime.now() - cached_time` crashes or treats every cache entry as stale, causing needless API calls. Fix: use `datetime.now(timezone.utc)` on both sides. | broken |
---
## Third tier (MEDIUM)
| # | Category | File:line | Summary |
|---|---|---|---|
| 11 | timing | `email.py:113`, `password.py:117`, `dependencies.py:148` | Token / `token_hash` compared with `!=` instead of `hmac.compare_digest`. 32-byte entropy mitigates the brute side, but constant-time comparison is a zero-cost hardening. |
| 12 | xss-latent | `app/web/templates/_theme.html:12` | `{{ var }}: {{ val }};` inside a `<style>` block. Jinja auto-escape does not escape for CSS context. Today the values come from `instance.yaml` (FS-write), so risk is low — but it escalates to HIGH if a `/api/admin/theme` endpoint is ever added. Validate against `^[a-zA-Z0-9#()., %-]{1,50}$`. |
| 13 | config-drift | `app/instance_config.py:21-23` | `_instance_config` is cached in a module-level global with no invalidation — changes via `/api/admin/configure` or direct file edits do not take effect until the process restarts. Fix: file-mtime watch, or explicit `reload=True` kwarg on `load_instance_config`. |
| 14 | rbac-inconsistency | `app/auth/dependencies.py:203-210` | `require_admin` does an exact-match `role == "admin"` check, while `require_role(Role.ADMIN)` respects the hierarchy. Risk: if a higher role is ever added, `require_admin` rejects it. Fix: replace all call sites with `require_role(Role.ADMIN)`, delete `require_admin`. |
| 15 | sqli-defense | `src/repositories/knowledge.py:49-58` | `update(**fields)` builds `SET` clause from `fields.keys()` without an allowlist (unlike `users.py` and `metrics.py`). Today the API layer filters to a safe set of keys, but one future caller with user-controlled `**fields` turns this into SQL column injection. Add an `ALLOWED_KNOWLEDGE_FIELDS` set in the repository layer. |
| 16 | silent-fail | `services/scheduler/__main__.py:36-48` | If `SCHEDULER_API_TOKEN` is missing and auto-fetch fails, the scheduler proceeds with an empty token. API calls then 401 silently and no sync runs. Fix: `logger.error()` + non-zero exit. |
| 17 | dos-input | `src/repositories/audit.py:38-49` | `limit` is a user-supplied int passed straight into SQL `LIMIT`. `limit=1_000_000` → OOM on large audit tables. Fix: `limit = min(max(limit, 1), 1000)`. |
| 18 | pat-audit-only | `app/auth/dependencies.py:151-173` | New-IP use of a PAT logs to audit but neither blocks the request nor notifies the owner. A leaked PAT keeps working until someone reviews the audit log. Consider auto-revoke on new IP for tokens older than N days, or at least a notification hook. |
| 19 | cookie-inconsistency | `google.py:115` vs `password.py:93`, `email.py:156` | Three different rules for `secure=`: Google uses `TESTING != 1`, the others use `DOMAIN != ""`. Unify on one helper (`_is_secure_context()` keyed on `DOMAIN`). |
---
## Fourth tier (LOW) — debt, not incidents
| # | Category | File:line | Summary |
|---|---|---|---|
| 20 | path-traversal | `app/api/upload.py:77` | Target path is `md_dir / f"{user_email}.md"`. `user_email` comes from the JWT (trusted after authn) but is not sanitized — theoretical exploit requires being able to register an account with `../` in the email, which depends on the OAuth provider / domain filter. `upload_session` and `upload_artifact` correctly use `Path(raw).name`. Fix: use `user["id"]` as the filename, or `re.sub(r"[^a-zA-Z0-9@._-]", "_", email)`. |
| 21 | info-disclosure | `app/api/users.py:209` | `POST /api/users/{id}/reset-password` returns `{"reset_token": "..."}` in plaintext. Admin-only, but if the response is logged (Nginx access log, load balancer, audit proxy) the token is captured. Return `"email_sent": true` and email the token, or truncate the token in the log path. |
| 22 | dead-code | `app/web/templates/password_reset.html`, `password_setup.html`, `login_magic_link_sent.html` | Templates exist, no Python route renders them. Delete or finish the flow they belong to (password reset, magic-link "check your email" confirmation). |
| 23 | empty-schedule | `src/scheduler.py:92-99` | Empty `schedule` string silently returns "not due" — tables with misconfigured schedules never sync, no warning. Log at `WARNING` when a schedule doesn't match any known pattern. |
| 24 | silent-excepts | `src/db.py:254, 286, 309, 325, 368, 382, 489, 546`, `src/orchestrator.py:121` | `except Exception: pass`. Some are legitimate (optional `CHECKPOINT` on read-only DBs), others mask migration failures. Triage each; log at least at `DEBUG`. |
---
## Verified non-issues (do not re-open)
Several patterns looked scary at first glance but are correctly defended:
- **`UserRepository.update()` SQL injection** — one audit pass flagged this as CRITICAL, but the `allowed = {...}` allowlist at `src/repositories/users.py:50-56` drops any column not in the set before the `SET` clause is built. Safe.
- **Orchestrator `DETACH` / `ATTACH`** — identifiers are validated through `_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")` in `src/orchestrator.py` before interpolation.
- **CORS wildcard + credentials** — default `CORS_ORIGINS` is `localhost:3000,localhost:8000`, not `*`. Starlette's `CORSMiddleware` also refuses to combine `*` with `allow_credentials=True`.
- **`yaml.load` unsafe variant** — not present anywhere. Every YAML parser uses `yaml.safe_load`.
- **Hardcoded secrets in the repo** — no real secrets in `.py`/`.yaml`/`.md`. `config/deploy.yml` references env vars (`${KAMAL_REGISTRY_PASSWORD}`). Git history has no `.env` leak.
- **`LOCAL_DEV_MODE` reachable via HTTP** — the flag is read from `os.environ` only; no request header or query parameter toggles it.
- **IDOR on `/auth/tokens/{token_id}`** — handler checks `row["user_id"] != user["id"]` before returning or modifying the token. Correct.
- **Jira webhook signature** — `connectors/jira/webhook.py` uses `hmac.compare_digest`. Correct.
---
## Coverage gaps — what this audit did **not** check
| Area | Why skipped |
|---|---|
| `connectors/jira/transform.py`, `incremental_transform.py` | Scope + complex transformation logic |
| `services/corporate_memory/*`, `services/telegram_bot/*`, `services/ws_gateway/*` | Not read in depth |
| Frontend JS in `app/web/static/*.js` | Audit #3 focused on HTML templates |
| `async` / `await` correctness (blocking calls inside async handlers) | Not traced |
| Automated scanners (`bandit`, `semgrep`, `trivy`, `pip-audit`) | Not installed in the audit environment |
| Dependency CVE audit | 73 packages in `uv.lock`, not walked manually |
| `pytest` run | Test suite not executed — some findings may already be covered by existing tests |
| Docker hardening (`Dockerfile` — user namespaces, capabilities) | Out of scope |
| `infra/modules/customer-instance/` (Terraform / startup scripts) | Out of scope |
**Honest estimate:** roughly **6570 %** of the real attack surface is covered. To close the gap I would run `bandit -r app/ src/ connectors/`, `pip-audit`, execute the `pytest` suite, and review outbound traffic logs from staging.
---
## Proposed action plan
**This week:**
- #1 script sandbox escape — one issue, RCE severity, own ticket
- #3 rate limiting on auth — `slowapi` introduction, one PR
- #4 backslash open-redirect — two-line fix plus test
**Next sprint:**
- #5 SSRF validator in `/api/admin/configure` (with the IP-range variant above, **not** a `.keboola.com` suffix)
- #6 atomic magic-link token consumption
- #7 bootstrap hardening
- #8 `SessionMiddleware` `max_age` + `https_only`
**Backlog (single tracking issue with checkboxes):**
- #1124 — timing comparisons, theme XSS gating, config reload, RBAC unification, schedule validation, silent-except triage, dead templates cleanup.
The missing `/auth/password/reset` endpoint (#2) is already tracked upstream in the auth-CLI repo.