docs: consolidate and de-clutter the documentation tree (#306 )

CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.

2026-05-14 18:54:22 +00:00

16 KiB

Raw Permalink Blame History

Security audit — Agnes AI Data Analyst

Archived. This is a point-in-time audit snapshot. Findings have been triaged and addressed in subsequent releases — check CHANGELOG.md (search for the relevant file/endpoint) for the fixes that landed since 2026-04-22. Kept for historical reference; not a live tracker.

Date: 2026-04-22 Branch audited: main at commit cbb7733 Method: parallel review passes over four scope areas — (1) secrets/SQLi/authz/SSRF, (2) auth flows & route wiring, (3) templates & UI wiring/XSS, (4) data layer & config & dead code. Findings deduped across the passes, severities adjusted to real-world exploitability.

Known issues already in flight are marked with their tracking links so we do not re-open them.

Top 5 — fix first

1. `[CRITICAL]` Script-API sandbox escape → RCE for the `analyst` role

File: app/api/scripts.py:116–180
Required role: Role.ANALYST (not admin)

Trigger: POST /api/scripts/run with body:

().__class__.__base__.__subclasses__()[N].__init__.__globals__["system"]("id")

Why existing guards miss it: the AST walker and the string allowlist block direct import os / exec, but neither stops attribute traversal through Python's class hierarchy (__class__ → __base__ → __subclasses__() → __globals__).
Impact: arbitrary OS commands under the FastAPI process uid. Gives access to DATA_DIR (DuckDB files, cached parquet), .jwt_secret, env vars, and any credentials mounted into the container.
Fix (minimum): add to the string-pattern blocklist: __subclasses__, __globals__, __mro__, __bases__, __class__, __dict__, __code__. In the AST walker, reject any ast.Attribute whose attr starts and ends with __.
Fix (correct): do not run untrusted Python in-process. Either drop server-side script execution entirely, or run the sandbox in nsjail/gVisor/Pyodide in isolation, or gate the endpoint behind Role.ADMIN if it must stay.
Confidence: broken (verified analytically).

2. `[HIGH]` `/auth/password/reset` endpoint missing — "Forgot Password?" returns 404

Template reference: app/web/templates/login_email.html:47 — <form method="POST" action="/auth/password/reset">
URL map: app/web/router.py:119 — "password_auth.reset_request": "/auth/password/reset"
Backend: app/auth/providers/password.py only registers /login, /login/web, /setup. No /reset handler is wired.
Related dead code: templates password_reset.html and password_setup.html exist but no route renders them — indicates an abandoned reset flow.
Tracking: tracked upstream in the auth-CLI repo
Confidence: broken.

3. `[HIGH]` No rate limiting on any auth endpoint

Files: app/auth/providers/password.py:36, app/auth/providers/email.py:53, app/auth/router.py:58, app/main.py (middleware).
Evidence: grep -rn "slowapi\|limiter\|throttle\|ratelimit" — zero hits in app/.
Impact:
- Password brute-force against POST /auth/password/login and POST /auth/token.
- Email bombing on POST /auth/email/send-link — attacker floods SMTP/SendGrid quota by looping with random recipients.
- Enumeration of bootstrap state via POST /auth/bootstrap.
Fix: add slowapi with @limiter.limit("10/minute") on the four endpoints above, get_remote_address as key with proxy-aware client-IP extraction (project already has _client_ip helper in app/auth/dependencies.py).
Confidence: broken.

4. `[HIGH]` Open-redirect bypass via backslash in `safe_next_path`

Files: app/auth/_common.py:10-24, app/auth/providers/password.py:95-96, app/web/router.py:218-219.
Trigger: https://agnes/login?next=/\evil.com
Why the current check fails: Python sees /\ (not //) and the guard startswith("//") does not fire. Every major browser normalizes \ to / in URL paths, so Location: /\evil.com resolves as //evil.com — a cross-origin redirect.
Impact: phishing — attacker crafts a link on the victim's Agnes URL, lands them on a lookalike after login.
Fix:
```
if len(candidate) > 1 and candidate[1] in ("/", "\\"):
    return default
```
Existing tests (tests/test_web_ui.py:270-296) cover //evil.example/ but not /\evil.com — add the case.
Confidence: broken.

5. `[HIGH]` SSRF in `/api/admin/configure` — `keboola_url` accepted as-is

File: app/api/admin.py:163–282; the URL is passed straight to KeboolaClient.test_connection() which issues a GET request.
Trigger: a compromised (or insider-threat) admin sends {"keboola_url": "http://169.254.169.254/latest/meta-data/"} or http://localhost:5432/.
Impact: server as SSRF proxy to the private network — GCP/AWS instance metadata service (IAM tokens), internal databases, LAN services.
Why a domain allowlist is wrong: keboola_url is the URL of the Keboola stack the Agnes instance connects to, not the Agnes host. Valid values include connection.keboola.com, connection.eu-central-1.keboola.com, connection.europe-west3.gcp.keboola.com, plus potentially self-hosted private Keboola stacks. A .keboola.com suffix check would break legitimate deployments.

Correct fix: enforce https:// scheme, then resolve the hostname and reject any result in a private / loopback / link-local / reserved block. Still allows any public HTTPS host.

from urllib.parse import urlparse
import ipaddress, socket

def _validate_stack_url(url: str) -> str | None:
    try:
        p = urlparse(url)
    except Exception:
        return "not a valid URL"
    if p.scheme != "https":
        return "must use https"
    if not p.hostname:
        return "missing hostname"
    try:
        infos = socket.getaddrinfo(p.hostname, None)
    except socket.gaierror:
        return f"cannot resolve {p.hostname}"
    for fam, _, _, _, sa in infos:
        try:
            ip = ipaddress.ip_address(sa[0])
        except ValueError:
            continue
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return f"{p.hostname} resolves to a private/loopback address ({ip})"
    return None

Residual risk: DNS rebinding (hostname resolves to a public IP at validation time, then to loopback at request time). Out of scope for this fix — would need an outbound egress proxy with an IP-level ACL.
Confidence: broken.

Second tier (HIGH)

#	Category	File:line	Summary	Confidence
6	race	`app/auth/providers/email.py:106-128`	`_consume_token` read-validate-clear is not atomic. Two concurrent clicks on the same magic link can both issue JWTs. Fix: `UPDATE users SET reset_token=NULL WHERE id=? AND reset_token=?` and check `rowcount == 1` before issuing the JWT.	broken
7	bootstrap	`app/auth/router.py:103-158`	Check is "any user with `password_hash`". If a seed admin exists without a password (e.g. created by `LOCAL_DEV_MODE=1` then redeployed without it, or a `SEED_ADMIN_EMAIL` without `SEED_ADMIN_PASSWORD`), `/auth/bootstrap` stays open and any caller can register a new admin account. Fix: check "any user at all" or require an explicit admin token for bootstrap.	broken
8	cookie	`app/main.py:61`	`SessionMiddleware(secret_key=...)` without `max_age` → OAuth session cookie expires only when the browser closes. Also `https_only` is not forced in production. Fix: `max_age=3600` and `https_only=bool(os.environ.get("DOMAIN"))`.	broken
9	sqli-adjacent	`connectors/keboola/extractor.py:104-106, 128`	`CREATE OR REPLACE VIEW "{table_name}" AS SELECT * FROM kbc."{bucket}"."{source_table}"` — f-string DDL with identifier quotes. Inputs come from `table_registry` (admin-controlled), but relying on quote-escaping for DDL is fragile. Fix: validate each identifier against `^[a-zA-Z0-9_]{1,63}$` before interpolation.	suspicious
10	datetime	`connectors/keboola/client.py:183`	Same offset-naive vs. offset-aware bug as the one we just fixed in `app/auth/providers/email.py`. `datetime.now() - cached_time` crashes or treats every cache entry as stale, causing needless API calls. Fix: use `datetime.now(timezone.utc)` on both sides.	broken

Third tier (MEDIUM)

#	Category	File:line	Summary
11	timing	`email.py:113`, `password.py:117`, `dependencies.py:148`	Token / `token_hash` compared with `!=` instead of `hmac.compare_digest`. 32-byte entropy mitigates the brute side, but constant-time comparison is a zero-cost hardening.
12	xss-latent	`app/web/templates/_theme.html:12`	`{{ var }}: {{ val }};` inside a `<style>` block. Jinja auto-escape does not escape for CSS context. Today the values come from `instance.yaml` (FS-write), so risk is low — but it escalates to HIGH if a `/api/admin/theme` endpoint is ever added. Validate against `^[a-zA-Z0-9#()., %-]{1,50}$`.
13	config-drift	`app/instance_config.py:21-23`	`_instance_config` is cached in a module-level global with no invalidation — changes via `/api/admin/configure` or direct file edits do not take effect until the process restarts. Fix: file-mtime watch, or explicit `reload=True` kwarg on `load_instance_config`.
14	rbac-inconsistency	`app/auth/dependencies.py:203-210`	`require_admin` does an exact-match `role == "admin"` check, while `require_role(Role.ADMIN)` respects the hierarchy. Risk: if a higher role is ever added, `require_admin` rejects it. Fix: replace all call sites with `require_role(Role.ADMIN)`, delete `require_admin`.
15	sqli-defense	`src/repositories/knowledge.py:49-58`	`update(fields)` builds `SET` clause from `fields.keys()` without an allowlist (unlike `users.py` and `metrics.py`). Today the API layer filters to a safe set of keys, but one future caller with user-controlled `fields` turns this into SQL column injection. Add an `ALLOWED_KNOWLEDGE_FIELDS` set in the repository layer.
16	silent-fail	`services/scheduler/__main__.py:36-48`	If `SCHEDULER_API_TOKEN` is missing and auto-fetch fails, the scheduler proceeds with an empty token. API calls then 401 silently and no sync runs. Fix: `logger.error()` + non-zero exit.
17	dos-input	`src/repositories/audit.py:38-49`	`limit` is a user-supplied int passed straight into SQL `LIMIT`. `limit=1_000_000` → OOM on large audit tables. Fix: `limit = min(max(limit, 1), 1000)`.
18	pat-audit-only	`app/auth/dependencies.py:151-173`	New-IP use of a PAT logs to audit but neither blocks the request nor notifies the owner. A leaked PAT keeps working until someone reviews the audit log. Consider auto-revoke on new IP for tokens older than N days, or at least a notification hook.
19	cookie-inconsistency	`google.py:115` vs `password.py:93`, `email.py:156`	Three different rules for `secure=`: Google uses `TESTING != 1`, the others use `DOMAIN != ""`. Unify on one helper (`_is_secure_context()` keyed on `DOMAIN`).

Fourth tier (LOW) — debt, not incidents

#	Category	File:line	Summary
20	path-traversal	`app/api/upload.py:77`	Target path is `md_dir / f"{user_email}.md"`. `user_email` comes from the JWT (trusted after authn) but is not sanitized — theoretical exploit requires being able to register an account with `../` in the email, which depends on the OAuth provider / domain filter. `upload_session` and `upload_artifact` correctly use `Path(raw).name`. Fix: use `user["id"]` as the filename, or `re.sub(r"[^a-zA-Z0-9@._-]", "_", email)`.
21	info-disclosure	`app/api/users.py:209`	`POST /api/users/{id}/reset-password` returns `{"reset_token": "..."}` in plaintext. Admin-only, but if the response is logged (Nginx access log, load balancer, audit proxy) the token is captured. Return `"email_sent": true` and email the token, or truncate the token in the log path.
22	dead-code	`app/web/templates/password_reset.html`, `password_setup.html`, `login_magic_link_sent.html`	Templates exist, no Python route renders them. Delete or finish the flow they belong to (password reset, magic-link "check your email" confirmation).
23	empty-schedule	`src/scheduler.py:92-99`	Empty `schedule` string silently returns "not due" — tables with misconfigured schedules never sync, no warning. Log at `WARNING` when a schedule doesn't match any known pattern.
24	silent-excepts	`src/db.py:254, 286, 309, 325, 368, 382, 489, 546`, `src/orchestrator.py:121`	`except Exception: pass`. Some are legitimate (optional `CHECKPOINT` on read-only DBs), others mask migration failures. Triage each; log at least at `DEBUG`.

Verified non-issues (do not re-open)

Several patterns looked scary at first glance but are correctly defended:

UserRepository.update() SQL injection — one audit pass flagged this as CRITICAL, but the allowed = {...} allowlist at src/repositories/users.py:50-56 drops any column not in the set before the SET clause is built. Safe.
Orchestrator DETACH / ATTACH — identifiers are validated through _SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$") in src/orchestrator.py before interpolation.
CORS wildcard + credentials — default CORS_ORIGINS is localhost:3000,localhost:8000, not *. Starlette's CORSMiddleware also refuses to combine * with allow_credentials=True.
yaml.load unsafe variant — not present anywhere. Every YAML parser uses yaml.safe_load.
Hardcoded secrets in the repo — no real secrets in .py/.yaml/.md. config/deploy.yml references env vars (${KAMAL_REGISTRY_PASSWORD}). Git history has no .env leak.
LOCAL_DEV_MODE reachable via HTTP — the flag is read from os.environ only; no request header or query parameter toggles it.
IDOR on /auth/tokens/{token_id} — handler checks row["user_id"] != user["id"] before returning or modifying the token. Correct.
Jira webhook signature — connectors/jira/webhook.py uses hmac.compare_digest. Correct.

Coverage gaps — what this audit did not check

Area	Why skipped
`connectors/jira/transform.py`, `incremental_transform.py`	Scope + complex transformation logic
`services/corporate_memory/`, `services/telegram_bot/`, `services/ws_gateway/*`	Not read in depth
Frontend JS in `app/web/static/*.js`	Audit #3 focused on HTML templates
`async` / `await` correctness (blocking calls inside async handlers)	Not traced
Automated scanners (`bandit`, `semgrep`, `trivy`, `pip-audit`)	Not installed in the audit environment
Dependency CVE audit	73 packages in `uv.lock`, not walked manually
`pytest` run	Test suite not executed — some findings may already be covered by existing tests
Docker hardening (`Dockerfile` — user namespaces, capabilities)	Out of scope
`infra/modules/customer-instance/` (Terraform / startup scripts)	Out of scope

Honest estimate: roughly 65–70 % of the real attack surface is covered. To close the gap I would run bandit -r app/ src/ connectors/, pip-audit, execute the pytest suite, and review outbound traffic logs from staging.

Proposed action plan

This week:

#1 script sandbox escape — one issue, RCE severity, own ticket
#3 rate limiting on auth — slowapi introduction, one PR
#4 backslash open-redirect — two-line fix plus test

Next sprint:

#5 SSRF validator in /api/admin/configure (with the IP-range variant above, not a .keboola.com suffix)
#6 atomic magic-link token consumption
#7 bootstrap hardening
#8 SessionMiddleware max_age + https_only

Backlog (single tracking issue with checkboxes):

#11–24 — timing comparisons, theme XSS gating, config reload, RBAC unification, schedule validation, silent-except triage, dead templates cleanup.

The missing /auth/password/reset endpoint (#2) is already tracked upstream in the auth-CLI repo.

16 KiB Raw Permalink Blame History Unescape Escape