docs: consolidate and de-clutter the documentation tree (#306 )

CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.

2026-05-14 18:54:22 +00:00

54 KiB

Raw Blame History

Production Hardening Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Fix all P0/P1 issues from 4 independent code reviews (architect, data engineer, senior dev, test specialist) to make the codebase production-ready.

Architecture: Fixes are grouped into 6 independent workstreams: auth/security, SQL safety, orchestrator bugs, DuckDB lifecycle, test hardening, and docs/cleanup. Each workstream can be executed in parallel by separate agents.

Tech Stack: Python 3.13, FastAPI, DuckDB, pytest, Docker

Source: Consolidated findings from 4 review agents run 2026-04-08.

Workstream 1: Authentication & Security (P0)

Task 1.1: Fix password bypass in /auth/token

The /auth/token endpoint issues a JWT without verifying the password when request.password is empty but user.password_hash exists. Any user with a password can get a token by omitting the password field.

Files:

Modify: app/auth/router.py:47-54
Test: tests/test_auth_providers.py
Step 1: Write the failing test

# In tests/test_auth_providers.py, add to existing test class:

def test_password_required_when_hash_exists(client, e2e_env):
    """A user with password_hash must provide correct password."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from argon2 import PasswordHasher

    conn = get_system_db()
    repo = UserRepository(conn)
    ph = PasswordHasher()
    repo.create(id="pw-user", email="pw@test.com", role="analyst")
    conn.execute(
        "UPDATE users SET password_hash = ? WHERE id = ?",
        [ph.hash("correct-password"), "pw-user"],
    )
    conn.close()

    # Empty password should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": ""})
    assert resp.status_code == 401

    # Missing password field should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com"})
    assert resp.status_code == 401

    # Wrong password should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": "wrong"})
    assert resp.status_code == 401

    # Correct password should work
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": "correct-password"})
    assert resp.status_code == 200

Step 2: Run test to verify it fails

Run: pytest tests/test_auth_providers.py::test_password_required_when_hash_exists -v Expected: FAIL — empty password returns 200 instead of 401

Step 3: Fix the auth logic

In app/auth/router.py, replace lines 47-54:

    # If user has password_hash, require and verify password
    if user.get("password_hash"):
        if not request.password:
            raise HTTPException(status_code=401, detail="Password required")
        try:
            from argon2 import PasswordHasher
            ph = PasswordHasher()
            ph.verify(user["password_hash"], request.password)
        except Exception:
            raise HTTPException(status_code=401, detail="Invalid password")

Step 4: Run test to verify it passes

Run: pytest tests/test_auth_providers.py::test_password_required_when_hash_exists -v Expected: PASS

Step 5: Run full auth test suite

Run: pytest tests/test_auth_providers.py tests/test_security.py -v Expected: All pass

Step 6: Commit

git add app/auth/router.py tests/test_auth_providers.py
git commit -m "fix: require password when password_hash exists — prevents auth bypass"

Task 1.2: Fail on default JWT secret in non-test environments

The app starts with a hardcoded known secret if JWT_SECRET_KEY env var is missing. A production deployment that forgets to set it is wide open.

Files:

Modify: app/auth/jwt.py:9-16
Test: tests/test_security.py
Step 1: Write the failing test

# In tests/test_security.py, add:

def test_jwt_rejects_default_secret_in_production(monkeypatch):
    """App should refuse to start with the default JWT secret unless TESTING=1."""
    monkeypatch.delenv("JWT_SECRET_KEY", raising=False)
    monkeypatch.delenv("TESTING", raising=False)

    with pytest.raises(RuntimeError, match="JWT_SECRET_KEY"):
        # Force reimport to trigger module-level check
        import importlib
        import app.auth.jwt as jwt_mod
        importlib.reload(jwt_mod)

Step 2: Run test to verify it fails

Run: pytest tests/test_security.py::test_jwt_rejects_default_secret_in_production -v Expected: FAIL — no RuntimeError raised

Step 3: Fix jwt.py

Replace app/auth/jwt.py lines 9-16:

SECRET_KEY = os.environ.get("JWT_SECRET_KEY", "")

if not SECRET_KEY:
    if os.environ.get("TESTING", "").lower() in ("1", "true"):
        SECRET_KEY = "test-jwt-secret-key-minimum-32-chars!!"
    else:
        raise RuntimeError(
            "JWT_SECRET_KEY environment variable is required. "
            "Generate one: python -c \"import secrets; print(secrets.token_hex(32))\""
        )
elif len(SECRET_KEY) < 32 and os.environ.get("TESTING", "").lower() not in ("1", "true"):
    import warnings as _warnings
    _warnings.warn(
        f"JWT_SECRET_KEY is {len(SECRET_KEY)} chars — minimum 32 recommended",
        UserWarning, stacklevel=2,
    )

Step 4: Run tests

Run: pytest tests/test_security.py tests/test_auth_providers.py -v Expected: All pass (existing tests set TESTING=1 or JWT_SECRET_KEY via conftest)

Step 5: Commit

git add app/auth/jwt.py tests/test_security.py
git commit -m "fix: fail startup on missing JWT_SECRET_KEY in non-test environments"

Task 1.3: Reduce JWT expiry, add jti claim

30-day tokens with no revocation mechanism are too risky.

Files:

Modify: app/auth/jwt.py:18-19,28-37
Test: tests/test_security.py
Step 1: Write the failing test

# In tests/test_security.py, add:

def test_jwt_contains_jti_claim():
    """JWT tokens must contain a unique jti claim for future revocation support."""
    from app.auth.jwt import create_access_token, verify_token
    token = create_access_token(user_id="u1", email="a@b.com", role="analyst")
    payload = verify_token(token)
    assert "jti" in payload
    assert len(payload["jti"]) >= 16

def test_jwt_expiry_is_24_hours():
    """JWT tokens should expire in 24 hours, not 30 days."""
    from app.auth.jwt import ACCESS_TOKEN_EXPIRE_HOURS
    assert ACCESS_TOKEN_EXPIRE_HOURS == 24

Step 2: Run to verify failure

Run: pytest tests/test_security.py::test_jwt_contains_jti_claim tests/test_security.py::test_jwt_expiry_is_24_hours -v Expected: FAIL

Step 3: Fix jwt.py

In app/auth/jwt.py:

Change line 19: ACCESS_TOKEN_EXPIRE_HOURS = 24

Add import uuid at the top. In create_access_token, add "jti" to payload:

    payload = {
        "sub": user_id,
        "email": email,
        "role": role,
        "exp": expire,
        "iat": datetime.now(timezone.utc),
        "jti": uuid.uuid4().hex,
    }

Step 4: Run tests

Run: pytest tests/ -q --tb=short Expected: All pass

Step 5: Commit

git add app/auth/jwt.py tests/test_security.py
git commit -m "fix: reduce JWT expiry to 24h, add jti claim for future revocation"

Task 1.4: Fix get_optional_user not checking cookies

get_optional_user only checks the Authorization header, not cookies. Web UI users appear as None.

Files:

Modify: app/auth/dependencies.py:60-70
Test: tests/test_auth_providers.py
Step 1: Write the failing test

# In tests/test_auth_providers.py, add:

def test_optional_user_reads_cookie(client, e2e_env):
    """get_optional_user should detect cookie-authenticated users."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from app.auth.jwt import create_access_token

    conn = get_system_db()
    UserRepository(conn).create(id="cookie-user", email="cookie@test.com", role="analyst")
    conn.close()

    token = create_access_token(user_id="cookie-user", email="cookie@test.com", role="analyst")

    # Simulate web UI request with cookie but no Authorization header
    resp = client.get("/api/catalog", cookies={"access_token": token})
    assert resp.status_code == 200

Step 2: Run to verify behavior (this may or may not fail depending on endpoint requirements)
Step 3: Fix dependencies.py

Replace get_optional_user in app/auth/dependencies.py:

async def get_optional_user(
    request: Request = None,
    authorization: Optional[str] = Header(None),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
) -> Optional[dict]:
    """Like get_current_user but returns None instead of 401 if no token."""
    try:
        return await get_current_user(request=request, authorization=authorization, conn=conn)
    except HTTPException:
        return None

Step 4: Run tests

Run: pytest tests/test_auth_providers.py tests/test_api.py -v Expected: All pass

Step 5: Commit

git add app/auth/dependencies.py tests/test_auth_providers.py
git commit -m "fix: get_optional_user now checks cookies like get_current_user"

Workstream 2: SQL Safety (P0/P1)

Task 2.1: Add identifier validation to orchestrator

source_name from directory names and table_name from _meta are interpolated into SQL without validation. A crafted directory name or _meta entry could inject arbitrary SQL.

Files:

Modify: src/orchestrator.py (add validation helper, apply in _attach_and_create_views and _attach_remote_extensions)
Test: tests/test_orchestrator.py
Step 1: Write the failing tests

# In tests/test_orchestrator.py, add:

def test_rejects_malicious_source_name(setup_env):
    """Orchestrator must reject directory names with SQL injection characters."""
    from src.orchestrator import SyncOrchestrator

    malicious_dir = setup_env["extracts_dir"] / 'test; DROP TABLE _meta--'
    malicious_dir.mkdir()
    db_path = malicious_dir / "extract.duckdb"
    import duckdb as _duckdb
    conn = _duckdb.connect(str(db_path))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR DEFAULT 'local'
    )""")
    conn.execute('CREATE TABLE "safe_table" (id VARCHAR)')
    conn.execute("INSERT INTO _meta VALUES ('safe_table', '', 0, 0, current_timestamp, 'local')")
    conn.close()

    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    result = orch.rebuild()

    # Malicious source should be skipped, not attached
    assert 'test; DROP TABLE _meta--' not in result


def test_rejects_malicious_table_name(setup_env):
    """Orchestrator must reject table names with SQL injection characters."""
    from src.orchestrator import SyncOrchestrator

    source_dir = setup_env["extracts_dir"] / "keboola"
    source_dir.mkdir()
    (source_dir / "data").mkdir()

    db_path = source_dir / "extract.duckdb"
    import duckdb as _duckdb
    conn = _duckdb.connect(str(db_path))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR DEFAULT 'local'
    )""")
    conn.execute('CREATE TABLE "safe" (id VARCHAR)')
    conn.execute("INSERT INTO _meta VALUES ('safe', '', 0, 0, current_timestamp, 'local')")
    # Malicious table name in _meta
    conn.execute("""INSERT INTO _meta VALUES ('x"; DROP TABLE _meta; --', '', 0, 0, current_timestamp, 'local')""")
    conn.close()

    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    result = orch.rebuild()

    # Safe table should be there, malicious should be skipped
    assert "keboola" in result
    assert "safe" in result["keboola"]
    assert 'x"; DROP TABLE _meta; --' not in result.get("keboola", [])

Step 2: Run to verify failure

Run: pytest tests/test_orchestrator.py::test_rejects_malicious_source_name tests/test_orchestrator.py::test_rejects_malicious_table_name -v Expected: FAIL or crash from SQL injection

Step 3: Add validation helper and apply it

At the top of src/orchestrator.py, add after imports:

import re

_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")

def _validate_identifier(name: str, context: str) -> bool:
    """Validate a DuckDB identifier. Returns True if safe, False if not."""
    if not _SAFE_IDENTIFIER.match(name):
        logger.warning("Rejected unsafe %s identifier: %r", context, name)
        return False
    return True

In _do_rebuild (line ~92), add check before _attach_and_create_views:

                if not _validate_identifier(ext_dir.name, "source_name"):
                    continue

In _attach_and_create_views (line ~160), add check before CREATE VIEW:

            for table_name, rows, size_bytes, query_mode in meta_rows:
                if not _validate_identifier(table_name, "table_name"):
                    continue

In _attach_remote_extensions (line ~193), add check:

        for alias, extension, url, token_env in rows:
            if not _validate_identifier(alias, "alias") or not _validate_identifier(extension, "extension"):
                continue

Step 4: Run all orchestrator tests

Run: pytest tests/test_orchestrator.py -v Expected: All pass including new tests

Step 5: Commit

git add src/orchestrator.py tests/test_orchestrator.py
git commit -m "fix: validate SQL identifiers in orchestrator — prevent injection via directory/table names"

Task 2.2: Harden query endpoint SQL blocklist

The current blocklist misses parquet_scan, read_csv_auto, query_table, and has false positives on semicolons in string literals. Also add enable_external_access=false on the analytics connection.

Files:

Modify: app/api/query.py:39-51 and src/db.py (analytics readonly connection)
Test: tests/test_security.py
Step 1: Write the failing tests

# In tests/test_security.py, add to TestQuerySecurity:

def test_blocks_parquet_scan(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM parquet_scan('/etc/passwd')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_read_csv_auto(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM read_csv_auto('/data/state/system.duckdb')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_query_table(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM query_table('secret')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_httpfs(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM read_parquet('https://evil.com/data.parquet')"}, headers=auth_headers)
    assert resp.status_code == 400

Step 2: Run to verify failure

Run: pytest tests/test_security.py::TestQuerySecurity -v Expected: Some FAIL (parquet_scan, read_csv_auto, query_table not blocked)

Step 3: Expand the blocklist and set enable_external_access=false

In app/api/query.py, replace the blocked list (lines 39-49):

    blocked = [
        "drop ", "delete ", "insert ", "update ", "alter ", "create ",
        "copy ", "attach ", "detach ", "load ", "install ",
        "export ", "import ", "pragma ", "call ",
        # File access functions
        "read_csv", "read_json", "read_parquet", "read_text",
        "write_csv", "write_parquet", "read_blob", "read_ndjson",
        "parquet_scan", "parquet_metadata", "parquet_schema",
        "json_scan", "csv_scan",
        "query_table", "iceberg_scan", "delta_scan",
        "glob(", "list_files",
        "'/", '"/','http://', 'https://', 's3://', 'gcs://',
        # Multiple statements
        ";",
    ]

In src/db.py get_analytics_db_readonly(), after opening the connection, add:

    try:
        conn.execute("SET enable_external_access = false")
    except Exception:
        pass  # Older DuckDB versions

Step 4: Run tests

Run: pytest tests/test_security.py::TestQuerySecurity -v Expected: All pass

Step 5: Commit

git add app/api/query.py src/db.py tests/test_security.py
git commit -m "fix: expand query blocklist, disable external_access on analytics connection"

Workstream 3: Orchestrator Bugs (P0/P1)

Task 3.1: Fix rebuild_source destroying all other sources' views

_do_rebuild_source() creates a fresh temp DB with only one source, then replaces the entire analytics DB. Every Jira webhook wipes all Keboola/BigQuery views.

Files:

Modify: src/orchestrator.py:116-141
Test: tests/test_orchestrator.py
Step 1: Write the failing test

# In tests/test_orchestrator.py, add:

def test_rebuild_source_preserves_other_sources(setup_env):
    """rebuild_source('jira') must NOT destroy views from keboola."""
    from src.orchestrator import SyncOrchestrator

    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )
    _create_mock_extract(
        setup_env["extracts_dir"], "jira",
        [{"name": "issues", "data": [{"key": "PROJ-1"}]}],
    )

    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])

    # First: full rebuild
    result = orch.rebuild()
    assert "keboola" in result
    assert "jira" in result

    # Second: rebuild only jira
    jira_tables = orch.rebuild_source("jira")
    assert "issues" in jira_tables

    # Third: full rebuild again — keboola must still be there
    result2 = orch.rebuild()
    assert "keboola" in result2
    assert "orders" in result2["keboola"]

Step 2: Run to verify failure

Run: pytest tests/test_orchestrator.py::test_rebuild_source_preserves_other_sources -v Expected: FAIL — keboola views gone after rebuild_source("jira")

Step 3: Fix _do_rebuild_source to delegate to full rebuild

In src/orchestrator.py, replace _do_rebuild_source (lines 116-141):

    def _do_rebuild_source(self, source_name: str) -> List[str]:
        """Rebuild views for a single source by doing a full rebuild.

        A full rebuild is necessary because the analytics DB is created fresh
        each time (temp file + atomic swap). Rebuilding only one source would
        destroy views from all other sources.
        """
        extracts_dir = _get_extracts_dir()
        db_file = extracts_dir / source_name / "extract.duckdb"
        if not db_file.exists():
            logger.warning("No extract.duckdb for source %s", source_name)
            return []

        result = self._do_rebuild()
        return result.get(source_name, [])

Step 4: Run tests

Run: pytest tests/test_orchestrator.py -v Expected: All pass

Step 5: Commit

git add src/orchestrator.py tests/test_orchestrator.py
git commit -m "fix: rebuild_source delegates to full rebuild — preserves other sources' views"

Task 3.2: Handle WAL files in atomic swap

shutil.move only moves the .duckdb file. The .wal file from the old DB can corrupt the new one.

Files:

Modify: src/orchestrator.py (_do_rebuild lines 106-112)
Modify: connectors/keboola/extractor.py (lines 148-155)
Test: tests/test_orchestrator.py
Step 1: Write the failing test

# In tests/test_orchestrator.py, add:

def test_rebuild_cleans_wal_files(setup_env):
    """After rebuild, no .wal files should remain from the temp or old DB."""
    from src.orchestrator import SyncOrchestrator

    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    orch.rebuild()

    from pathlib import Path
    db_path = Path(setup_env["analytics_db"])
    assert not (db_path.parent / (db_path.name + ".wal")).exists()
    assert not (db_path.parent / (db_path.name + ".tmp.wal")).exists()

Step 2: Run to verify it passes or fails

Run: pytest tests/test_orchestrator.py::test_rebuild_cleans_wal_files -v

Step 3: Add WAL cleanup helper

In src/orchestrator.py, add a helper after _validate_identifier:

def _atomic_swap_db(tmp_path: str, target_path: str) -> None:
    """Atomically replace target DuckDB file, cleaning up WAL files."""
    import shutil
    target = Path(target_path)
    tmp = Path(tmp_path)

    # Remove old WAL file if it exists
    old_wal = Path(str(target) + ".wal")
    if old_wal.exists():
        old_wal.unlink()

    # Move temp DB into place
    if tmp.exists():
        shutil.move(str(tmp), str(target))

    # Clean up temp WAL
    tmp_wal = Path(str(tmp) + ".wal")
    if tmp_wal.exists():
        tmp_wal.unlink()

Replace shutil.move call in _do_rebuild (line ~112) with:

        _atomic_swap_db(tmp_path, self._db_path)

Also add CHECKPOINT before conn.close() in _do_rebuild:

            conn.execute("CHECKPOINT")
        finally:
            conn.close()

Apply the same pattern in connectors/keboola/extractor.py at the end of run().

Step 4: Run tests

Run: pytest tests/test_orchestrator.py tests/test_keboola_extractor.py -v Expected: All pass

Step 5: Commit

git add src/orchestrator.py connectors/keboola/extractor.py tests/test_orchestrator.py
git commit -m "fix: clean WAL files during atomic DB swap, add CHECKPOINT before close"

Task 3.3: Add temp-file swap to BigQuery extractor

BigQuery extractor writes directly to extract.duckdb, causing lock conflicts with the orchestrator.

Files:

Modify: connectors/bigquery/extractor.py:64-68
Test: tests/test_bigquery_extractor.py
Step 1: Write the test

# In tests/test_bigquery_extractor.py, add:

def test_uses_temp_file_swap(self, output_dir):
    """BigQuery extractor should write to .tmp and rename, not write directly."""
    from connectors.bigquery.extractor import _create_meta_table
    db_path = Path(output_dir) / "extract.duckdb"

    # Pre-create the DB to simulate existing file
    conn = duckdb.connect(str(db_path))
    _create_meta_table(conn)
    conn.close()

    # After init_extract, the file should exist and no .tmp should remain
    # (The actual init_extract test already covers this — we just verify no .tmp leak)
    assert db_path.exists()
    assert not (Path(output_dir) / "extract.duckdb.tmp").exists()

Step 2: Modify init_extract to use temp-file swap

In connectors/bigquery/extractor.py, replace lines 64-68:

    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    db_path = output_path / "extract.duckdb"
    tmp_db_path = output_path / "extract.duckdb.tmp"
    if tmp_db_path.exists():
        tmp_db_path.unlink()
    conn = duckdb.connect(str(tmp_db_path))

And at the end, before return stats (after conn.close()):

    import shutil
    if tmp_db_path.exists():
        shutil.move(str(tmp_db_path), str(db_path))

Step 3: Run tests

Run: pytest tests/test_bigquery_extractor.py -v Expected: All pass

Step 4: Commit

git add connectors/bigquery/extractor.py tests/test_bigquery_extractor.py
git commit -m "fix: BigQuery extractor uses temp-file swap to avoid lock conflicts"

Workstream 4: Script Sandbox Hardening (P1)

Task 4.1: Strip VIRTUAL_ENV and PYTHONPATH from sandbox subprocess

The sandbox gives scripts access to all installed packages (httpx, duckdb) via inherited env vars.

Files:

Modify: app/api/scripts.py:191-198
Test: tests/test_security.py
Step 1: Write the failing test

# In tests/test_security.py, add to TestScriptSecurity:

def test_sandbox_cannot_import_httpx(self, client, admin_headers):
    """Sandboxed scripts must not have access to httpx or other installed packages."""
    resp = client.post("/api/scripts/run", json={
        "name": "test",
        "source": "import httpx\nprint('pwned')"
    }, headers=admin_headers)
    data = resp.json()
    # httpx should be blocked by pattern OR unavailable due to stripped VIRTUAL_ENV
    assert resp.status_code == 400 or data.get("exit_code", 0) != 0

Step 2: Run to verify failure

Run: pytest tests/test_security.py::TestScriptSecurity::test_sandbox_cannot_import_httpx -v Expected: FAIL — httpx imports successfully

Step 3: Fix the subprocess env

In app/api/scripts.py, replace the env dict in subprocess.run (lines 191-198):

            env={
                "PATH": "/usr/bin:/usr/local/bin",
                "DATA_DIR": data_dir,
                "HOME": "/tmp",
                # Deliberately exclude VIRTUAL_ENV and PYTHONPATH
                # to prevent access to installed packages
            },

Also add "httpx", "from httpx" to blocked_patterns list.

Step 4: Run tests

Run: pytest tests/test_security.py::TestScriptSecurity -v Expected: All pass

Step 5: Commit

git add app/api/scripts.py tests/test_security.py
git commit -m "fix: strip VIRTUAL_ENV/PYTHONPATH from script sandbox, block httpx import"

Workstream 5: Test Hardening (P0-P1)

Task 5.1: Fix environment variable leaking in test fixtures

Most test files set os.environ["DATA_DIR"] directly without cleanup. This causes test ordering dependencies.

Files:

Modify: tests/test_db.py, tests/test_rbac.py, tests/test_repositories.py, tests/test_api.py, tests/test_api_complete.py, tests/test_api_scripts.py, tests/test_auth_providers.py, tests/test_bootstrap.py, tests/test_permissions.py, tests/test_security.py
Step 1: Search and replace pattern

In every test file that has os.environ["DATA_DIR"] = inside a fixture, replace with monkeypatch.setenv("DATA_DIR", ...). Add monkeypatch to the fixture parameters.

Example — in tests/test_db.py, change:

@pytest.fixture
def db_env(tmp_path):
    os.environ["DATA_DIR"] = str(tmp_path)
    yield tmp_path

To:

@pytest.fixture
def db_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    yield tmp_path

Apply to all affected files. Remove manual os.environ.pop("DATA_DIR", None) lines since monkeypatch handles cleanup automatically.

Step 2: Run full test suite

Run: pytest tests/ -v --tb=short Expected: 607+ passed

Step 3: Commit

git add tests/
git commit -m "fix: use monkeypatch for DATA_DIR in all test fixtures — prevent env leaking"

Task 5.2: Add extract.duckdb contract test

Create a shared validator that verifies any extract.duckdb conforms to the contract. Apply it in all extractor tests.

Files:

Create: tests/helpers/contract.py
Modify: tests/test_keboola_extractor.py, tests/test_bigquery_extractor.py
Step 1: Create contract validator

# tests/helpers/__init__.py (empty)
# tests/helpers/contract.py

"""Shared validator for the extract.duckdb contract."""

import duckdb
from pathlib import Path


def validate_extract_contract(db_path: str) -> None:
    """Verify an extract.duckdb conforms to the contract.

    Raises AssertionError with details if any check fails.
    """
    path = Path(db_path)
    assert path.exists(), f"extract.duckdb not found at {db_path}"

    conn = duckdb.connect(str(path), read_only=True)
    try:
        # _meta table must exist with correct schema
        cols = conn.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_name='_meta' ORDER BY ordinal_position"
        ).fetchall()
        col_names = [c[0] for c in cols]
        assert col_names == ["table_name", "description", "rows", "size_bytes", "extracted_at", "query_mode"], \
            f"_meta schema mismatch: {col_names}"

        # Every _meta entry with query_mode='local' must have a corresponding view or table
        local_tables = conn.execute(
            "SELECT table_name FROM _meta WHERE query_mode = 'local'"
        ).fetchall()
        for (name,) in local_tables:
            tables = conn.execute(
                "SELECT table_name FROM information_schema.tables WHERE table_name = ?", [name]
            ).fetchall()
            assert len(tables) > 0, f"Local table '{name}' in _meta but no view/table exists"

        # If _remote_attach exists, validate its schema
        ra_exists = conn.execute(
            "SELECT count(*) FROM information_schema.tables WHERE table_name='_remote_attach'"
        ).fetchone()[0]
        if ra_exists:
            ra_cols = conn.execute(
                "SELECT column_name FROM information_schema.columns "
                "WHERE table_name='_remote_attach' ORDER BY ordinal_position"
            ).fetchall()
            ra_col_names = [c[0] for c in ra_cols]
            assert ra_col_names == ["alias", "extension", "url", "token_env"], \
                f"_remote_attach schema mismatch: {ra_col_names}"
    finally:
        conn.close()

Step 2: Apply in extractor tests

In tests/test_keboola_extractor.py, add to test_creates_extract_duckdb:

        from tests.helpers.contract import validate_extract_contract
        validate_extract_contract(str(Path(output_dir) / "extract.duckdb"))

Similarly in tests/test_bigquery_extractor.py::test_creates_extract_duckdb_with_meta.

Step 3: Run tests

Run: pytest tests/test_keboola_extractor.py tests/test_bigquery_extractor.py -v Expected: All pass

Step 4: Commit

git add tests/helpers/ tests/test_keboola_extractor.py tests/test_bigquery_extractor.py
git commit -m "feat: add extract.duckdb contract validator, apply in all extractor tests"

Task 5.3: Add pytest timeout and strict markers

Prevent CI hangs and catch marker typos.

Files:

Modify: pytest.ini
Modify: requirements.txt (add pytest-timeout)
Step 1: Update pytest.ini

[pytest]
addopts = -m "not live and not docker" --timeout=60 --strict-markers
markers =
    live: tests requiring server access (run with '-m live')
    docker: tests requiring Docker (run with '-m docker')

Step 2: Add pytest-timeout to requirements.txt

Add line: pytest-timeout>=2.0.0

Step 3: Install and run

Run: uv pip install --system pytest-timeout && pytest tests/ -q --tb=short Expected: All pass within 60s timeout

Step 4: Commit

git add pytest.ini requirements.txt
git commit -m "chore: add pytest-timeout (60s) and strict-markers to pytest config"

Workstream 6: Docs & Cleanup (P1-P2)

Task 6.1: Rewrite README.md from CLAUDE.md

The current README describes the old Flask/rsync architecture. CLAUDE.md is accurate.

Files:

Modify: README.md
Step 1: Rewrite README.md

Use CLAUDE.md as the source of truth. The README should contain:

Project description (1-2 paragraphs)
Architecture diagram (from CLAUDE.md)
Quick start (Docker compose)
Development setup (venv, pytest)
Project structure (from CLAUDE.md)
Configuration overview
Supported data sources (Keboola ✅, BigQuery ✅, Jira ✅)
Links to docs/DEPLOYMENT.md for server setup
License

Remove all references to Flask, rsync, SSH, sync_data.sh, Linux groups, server/setup.sh.

Step 2: Verify no broken references

Run: grep -r "webapp/" README.md; grep -r "sync_data.sh" README.md; grep -r "server/setup" README.md Expected: No matches

Step 3: Commit

git add README.md
git commit -m "docs: rewrite README for v2 architecture (FastAPI, DuckDB, Docker)"

Task 6.2: Update .env.template to match actual code

Template references WEBAPP_SECRET_KEY but code uses JWT_SECRET_KEY.

Files:

Modify: config/.env.template
Step 1: Rewrite template

# AI Data Analyst - Environment Variables
# Copy to .env: cp config/.env.template .env
# .env is gitignored - NEVER commit it.

# Required
JWT_SECRET_KEY=           # python -c "import secrets; print(secrets.token_hex(32))"

# Google OAuth (optional — needed for Google login)
# GOOGLE_CLIENT_ID=
# GOOGLE_CLIENT_SECRET=

# Keboola adapter (optional — skip if using CSV/sample data)
# KEBOOLA_STORAGE_TOKEN=
# KEBOOLA_STACK_URL=https://connection.keboola.com

# Bootstrap admin (optional — used on first docker compose up)
# SEED_ADMIN_EMAIL=admin@example.com

# Optional services
# TELEGRAM_BOT_TOKEN=
# JIRA_WEBHOOK_SECRET=
# ANTHROPIC_API_KEY=

Step 2: Commit

git add config/.env.template
git commit -m "docs: update .env.template to match actual code (JWT_SECRET_KEY, not WEBAPP_SECRET_KEY)"

Task 6.3: Remove dead Flask Blueprint from Jira connector

connectors/jira/webhook.py uses Flask Blueprint but the app uses FastAPI. It's dead code that confuses readers.

Files:

Check: connectors/jira/webhook.py — verify it's not imported anywhere except Jira-internal code
Modify: add deprecation comment or delete if unused
Step 1: Check if webhook.py is imported

Run: grep -r "from connectors.jira.webhook" app/ src/ services/ If no matches: the Flask Blueprint is dead code.

Step 2: Add deprecation notice or delete

If unused by the FastAPI app, delete connectors/jira/webhook.py and update any imports.

Step 3: Commit

git add connectors/jira/
git commit -m "chore: remove dead Flask Blueprint from Jira connector (replaced by FastAPI)"

Task 6.4: Add upload size limit

upload_session and upload_artifact read entire files into memory with no limit.

Files:

Modify: app/api/upload.py
Test: tests/test_api_complete.py
Step 1: Write the test

# In tests/test_api_complete.py or a new test file:

def test_upload_rejects_oversized_file(client, admin_headers):
    """Uploads over 50MB should be rejected."""
    # Create a file reference that claims to be too large
    import io
    large_data = b"x" * (50 * 1024 * 1024 + 1)  # 50MB + 1 byte
    resp = client.post(
        "/api/upload/artifact/test-session",
        files={"file": ("big.csv", io.BytesIO(large_data), "text/csv")},
        headers=admin_headers,
    )
    assert resp.status_code == 413 or resp.status_code == 400

Step 2: Add size check

In app/api/upload.py, at the start of each upload endpoint:

    MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50 MB
    contents = await file.read()
    if len(contents) > MAX_UPLOAD_SIZE:
        raise HTTPException(status_code=413, detail=f"File too large (max {MAX_UPLOAD_SIZE // 1024 // 1024}MB)")

Step 3: Run tests

Run: pytest tests/test_api_complete.py -v Expected: All pass

Step 4: Commit

git add app/api/upload.py tests/test_api_complete.py
git commit -m "fix: add 50MB upload size limit — prevent memory exhaustion"

Workstream 7: DuckDB Lifecycle & Connection Management (P1)

Task 7.1: Fix SQL injection in get_analytics_db_readonly

Same unquoted ext_dir.name issue as the orchestrator, but in the read-only analytics connection used by every query request.

Files:

Modify: src/db.py:228-233
Test: tests/test_db.py
Step 1: Write the failing test

# In tests/test_db.py, add:

def test_analytics_readonly_rejects_malicious_dir_name(db_env):
    """get_analytics_db_readonly must skip directories with unsafe names."""
    extracts = db_env / "extracts"
    extracts.mkdir(parents=True)
    malicious = extracts / "test; DROP TABLE x--"
    malicious.mkdir()
    db_file = malicious / "extract.duckdb"

    import duckdb
    conn = duckdb.connect(str(db_file))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR
    )""")
    conn.close()

    # Should not crash
    from src.db import get_analytics_db_readonly
    ro_conn = get_analytics_db_readonly()
    ro_conn.close()

Step 2: Run to verify failure

Run: pytest tests/test_db.py::test_analytics_readonly_rejects_malicious_dir_name -v

Step 3: Add identifier validation

Import the validator from orchestrator (or extract to shared module). In src/db.py, add at top:

import re
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")

In get_analytics_db_readonly(), replace line 232:

            if db_file.exists() and ext_dir.is_dir():
                if not _SAFE_IDENTIFIER.match(ext_dir.name):
                    continue
                try:
                    conn.execute(f"ATTACH '{db_file}' AS {ext_dir.name} (READ_ONLY)")

Step 4: Run tests

Run: pytest tests/test_db.py -v Expected: All pass

Step 5: Commit

git add src/db.py tests/test_db.py
git commit -m "fix: validate identifiers in get_analytics_db_readonly — prevent SQL injection"

Task 7.2: Remove dead PRAGMA enable_wal code

PRAGMA enable_wal is not valid DuckDB syntax. DuckDB uses WAL by default since v0.8. This is dead code with a misleading comment.

Files:

Modify: src/db.py:200-204
Step 1: Remove the dead code

In src/db.py, delete lines 200-204:

            # WAL mode: allows concurrent readers while writing
            try:
                _system_db_conn.execute("PRAGMA enable_wal")
            except Exception:
                pass  # Older DuckDB versions may not support this

Step 2: Run tests

Run: pytest tests/test_db.py -v Expected: All pass

Step 3: Commit

git add src/db.py
git commit -m "chore: remove dead PRAGMA enable_wal — DuckDB uses WAL by default"

Task 7.3: Escape token single-quotes in ATTACH SQL

If a token contains a single quote, the ATTACH SQL breaks. DuckDB doesn't support parameterized ATTACH, so escape manually.

Files:

Modify: src/orchestrator.py (_attach_remote_extensions)
Modify: connectors/keboola/extractor.py (_try_attach_extension)
Step 1: Add escaping in orchestrator

In src/orchestrator.py, in _attach_remote_extensions, replace the ATTACH line:

                if token:
                    escaped_token = token.replace("'", "''")
                    conn.execute(
                        f"ATTACH '{url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')"
                    )

Step 2: Add escaping in Keboola extractor

In connectors/keboola/extractor.py, _try_attach_extension:

        escaped_token = keboola_token.replace("'", "''")
        conn.execute(f"ATTACH '{keboola_url}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")

Step 3: Run tests

Run: pytest tests/test_orchestrator.py tests/test_keboola_extractor.py -v Expected: All pass

Step 4: Commit

git add src/orchestrator.py connectors/keboola/extractor.py
git commit -m "fix: escape single quotes in ATTACH TOKEN to prevent SQL breakage"

Task 7.4: Add temp-file swap to Jira extract_init.update_meta

Jira's update_meta() writes directly to live extract.duckdb while the orchestrator may have it ATTACHed read-only.

Files:

Modify: connectors/jira/extract_init.py:87
Test: tests/test_e2e_extract.py
Step 1: Examine current code and fix

The update_meta() function opens extract.duckdb directly. Since it only updates _meta rows and recreates views (not bulk writes), the simplest fix is to use a short-lived connection with CHECKPOINT:

In connectors/jira/extract_init.py, after the conn.execute("UPDATE _meta ...") block, add before conn.close():

        conn.execute("CHECKPOINT")

This forces WAL flush and reduces the lock window. A full temp-file swap is not practical here since the Jira connector does incremental updates.

Step 2: Run tests

Run: pytest tests/test_e2e_extract.py::TestJiraWebhookToQuery -v Expected: Pass

Step 3: Commit

git add connectors/jira/extract_init.py
git commit -m "fix: add CHECKPOINT after Jira meta update — reduce lock window"

Workstream 8: Scalability & Robustness (P1)

Task 8.1: Fix table access check false positives in query endpoint

The query endpoint checks table access with table.lower() in sql_lower — a substring match. A table named id blocks any query containing the word "id". A table named orders triggers on ordered_items.

Files:

Modify: app/api/query.py:67-71
Test: tests/test_security.py
Step 1: Write the failing test

# In tests/test_security.py, add to TestQuerySecurity:

def test_table_access_no_false_positive_on_column_name(self, client, auth_headers):
    """A forbidden table named 'id' should not block queries that use 'id' as a column."""
    # This test verifies the table access check doesn't use naive substring matching
    resp = client.post("/api/query", json={
        "sql": "SELECT id, name FROM allowed_table"
    }, headers=auth_headers)
    # Should not get 403 just because 'id' appears in SQL
    assert resp.status_code != 403 or "id" not in resp.json().get("detail", "")

Step 2: Fix with word-boundary matching

In app/api/query.py, replace the table access check (lines 67-71):

            # Check if query references any forbidden tables (word-boundary match)
            import re
            forbidden = all_views - set(allowed)
            for table in forbidden:
                # Use word boundaries to avoid false positives on column names
                pattern = r'\b' + re.escape(table.lower()) + r'\b'
                if re.search(pattern, sql_lower):
                    raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'")

Step 3: Run tests

Run: pytest tests/test_security.py::TestQuerySecurity -v Expected: All pass

Step 4: Commit

git add app/api/query.py tests/test_security.py
git commit -m "fix: use word-boundary matching for table access check — prevent false positives"

Task 8.2: Replace Docker healthcheck with curl

The current healthcheck starts a full Python interpreter + imports httpx every 30 seconds.

Files:

Modify: Dockerfile (add curl)
Modify: docker-compose.yml:13
Step 1: Add curl to Dockerfile

In Dockerfile, add after the FROM line:

RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

Step 2: Update docker-compose healthcheck

In docker-compose.yml, replace line 13:

    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3

Step 3: Commit

git add Dockerfile docker-compose.yml
git commit -m "fix: use curl for Docker healthcheck instead of Python+httpx (faster, lighter)"

Task 8.3: Add graceful shutdown handler

No lifespan handler exists to close the shared DuckDB connection on shutdown.

Files:

Modify: app/main.py
Step 1: Add lifespan handler

In app/main.py, add a lifespan context manager and use it in FastAPI():

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # Startup
    yield
    # Shutdown: close shared DuckDB connection
    from src.db import close_system_db
    close_system_db()

Change app = FastAPI(...) to app = FastAPI(..., lifespan=lifespan).

Add close_system_db() to src/db.py:

def close_system_db() -> None:
    """Close the shared system DB connection. Called on app shutdown."""
    global _system_db_conn, _system_db_path
    if _system_db_conn:
        try:
            _system_db_conn.close()
        except Exception:
            pass
        _system_db_conn = None
        _system_db_path = None

Step 2: Run tests

Run: pytest tests/test_api.py -v Expected: All pass

Step 3: Commit

git add app/main.py src/db.py
git commit -m "feat: add graceful shutdown handler — close DuckDB on app exit"

Task 8.4: Extract shared _get_data_dir utility

_get_data_dir() is copy-pasted in 4 API files.

Files:

Create: app/utils.py
Modify: app/api/sync.py, app/api/data.py, app/api/upload.py, app/api/catalog.py
Step 1: Create shared utility

# app/utils.py
import os
from pathlib import Path

def get_data_dir() -> Path:
    return Path(os.environ.get("DATA_DIR", "./data"))

Step 2: Replace in all 4 files

In each file, replace:

def _get_data_dir():
    return Path(os.environ.get("DATA_DIR", "./data"))

With:

from app.utils import get_data_dir as _get_data_dir

Step 3: Run tests

Run: pytest tests/ -q --tb=short Expected: All pass

Step 4: Commit

git add app/utils.py app/api/sync.py app/api/data.py app/api/upload.py app/api/catalog.py
git commit -m "refactor: extract shared _get_data_dir to app/utils.py — DRY"

Task 8.5: Move faker to dev dependencies

Faker is a production dependency but only used for sample data generation.

Files:

Modify: requirements.txt
Create: requirements-dev.txt
Step 1: Move faker

Remove faker>=24.0.0 from requirements.txt.

Create requirements-dev.txt:

-r requirements.txt
faker>=24.0.0
pytest>=9.0.0
pytest-timeout>=2.0.0

Step 2: Verify app starts without faker

Run: python -c "from app.main import create_app; print('OK')" Expected: OK (faker not imported at startup)

Step 3: Commit

git add requirements.txt requirements-dev.txt
git commit -m "chore: move faker to dev dependencies — not needed in production"

Workstream 9: Missing Test Coverage (P0-P1)

Task 9.1: Add web UI smoke tests

app/web/router.py has 46 functions with almost no test coverage. A template error would not be caught.

Files:

Create: tests/test_web_ui.py
Step 1: Create smoke tests for all authenticated pages

"""Smoke tests for web UI pages — verify they render without template errors."""

import os
import pytest
import duckdb
from fastapi.testclient import TestClient


@pytest.fixture
def web_client(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    monkeypatch.setenv("TESTING", "1")
    (tmp_path / "state").mkdir()
    (tmp_path / "analytics").mkdir()
    (tmp_path / "extracts").mkdir()

    from app.main import create_app
    app = create_app()
    return TestClient(app)


@pytest.fixture
def admin_cookie(web_client, tmp_path, monkeypatch):
    """Create admin user and return cookie dict."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from app.auth.jwt import create_access_token

    conn = get_system_db()
    UserRepository(conn).create(id="admin1", email="admin@test.com", role="admin")
    conn.close()

    token = create_access_token(user_id="admin1", email="admin@test.com", role="admin")
    return {"access_token": token}


class TestWebUISmoke:
    """Every page should return 200 without template errors."""

    def test_login_page(self, web_client):
        resp = web_client.get("/login")
        assert resp.status_code == 200

    def test_dashboard(self, web_client, admin_cookie):
        resp = web_client.get("/", cookies=admin_cookie)
        assert resp.status_code in (200, 302)

    def test_catalog(self, web_client, admin_cookie):
        resp = web_client.get("/catalog", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_admin_tables(self, web_client, admin_cookie):
        resp = web_client.get("/admin/tables", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_admin_permissions(self, web_client, admin_cookie):
        resp = web_client.get("/admin/permissions", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_corporate_memory(self, web_client, admin_cookie):
        resp = web_client.get("/corporate-memory", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_activity_center(self, web_client, admin_cookie):
        resp = web_client.get("/activity-center", cookies=admin_cookie)
        assert resp.status_code == 200

Step 2: Run tests

Run: pytest tests/test_web_ui.py -v Expected: All pass (or reveal actual template errors)

Step 3: Commit

git add tests/test_web_ui.py
git commit -m "test: add web UI smoke tests — catch template errors in 7 pages"

Task 9.2: Add Jira service integration tests

connectors/jira/service.py (15 functions) orchestrates the entire Jira webhook flow but has no dedicated tests.

Files:

Create: tests/test_jira_service.py
Step 1: Create integration tests

"""Tests for Jira service — webhook event processing pipeline."""

import os
from pathlib import Path
from unittest.mock import patch, MagicMock

import duckdb
import pytest

from connectors.jira.extract_init import init_extract, update_meta


@pytest.fixture
def jira_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    jira_dir = tmp_path / "extracts" / "jira"
    jira_dir.mkdir(parents=True)
    return jira_dir


class TestJiraExtractInit:
    def test_init_creates_extract_db(self, jira_env):
        init_extract(jira_env)
        assert (jira_env / "extract.duckdb").exists()

        conn = duckdb.connect(str(jira_env / "extract.duckdb"))
        meta = conn.execute("SELECT * FROM _meta").fetchall()
        conn.close()
        assert isinstance(meta, list)

    def test_update_meta_creates_view(self, jira_env):
        init_extract(jira_env)

        # Create a parquet file for 'issues'
        issues_dir = jira_env / "data" / "issues"
        issues_dir.mkdir(parents=True)
        pq_path = str(issues_dir / "2026-04.parquet")
        tmp = duckdb.connect()
        tmp.execute(
            f"COPY (SELECT 'PROJ-1' AS issue_key, 'Bug' AS type) "
            f"TO '{pq_path}' (FORMAT PARQUET)"
        )
        tmp.close()

        update_meta(jira_env, "issues")

        conn = duckdb.connect(str(jira_env / "extract.duckdb"))
        rows = conn.execute("SELECT rows FROM _meta WHERE table_name='issues'").fetchone()
        assert rows[0] == 1

        data = conn.execute("SELECT issue_key FROM issues").fetchone()
        assert data[0] == "PROJ-1"
        conn.close()

Step 2: Run tests

Run: pytest tests/test_jira_service.py -v Expected: All pass

Step 3: Commit

git add tests/test_jira_service.py
git commit -m "test: add Jira extract_init integration tests"

Task 9.3: Add instance_config tests

app/instance_config.py (10 functions) is loaded at startup and affects all web pages. No tests exist.

Files:

Create: tests/test_instance_config.py
Step 1: Create tests

"""Tests for instance_config — YAML loading and accessor functions."""

import os
from pathlib import Path

import pytest


@pytest.fixture
def config_env(tmp_path, monkeypatch):
    config_dir = tmp_path / "config"
    config_dir.mkdir()
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    return config_dir


class TestInstanceConfig:
    def test_missing_config_file_returns_defaults(self, config_env, monkeypatch):
        """Missing instance.yaml should not crash, just return defaults."""
        from app.instance_config import get_instance_name, get_data_source_type
        # Should return some default, not crash
        name = get_instance_name()
        assert isinstance(name, str)

    def test_loads_valid_yaml(self, config_env, tmp_path, monkeypatch):
        """Valid instance.yaml should be loaded and accessible."""
        yaml_path = tmp_path / "config" / "instance.yaml"
        yaml_path.write_text("instance_name: Test Instance\ndata_source: keboola\n")

        from app.instance_config import load_instance_config, get_instance_name
        import importlib
        import app.instance_config as mod
        importlib.reload(mod)

        name = mod.get_instance_name()
        assert "Test" in name or isinstance(name, str)

Step 2: Run tests

Run: pytest tests/test_instance_config.py -v Expected: All pass

Step 3: Commit

git add tests/test_instance_config.py
git commit -m "test: add instance_config tests — missing file, valid YAML"

Task 9.4: Add concurrent rebuild safety test

Verify the atomic swap pattern works when a read connection is open.

Files:

Modify: tests/test_orchestrator.py
Step 1: Write the test

# In tests/test_orchestrator.py, add:

def test_rebuild_while_reading(setup_env):
    """Rebuild should succeed even while a read-only connection exists."""
    from src.orchestrator import SyncOrchestrator
    import duckdb

    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )

    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    orch.rebuild()

    # Open a read-only connection (simulating query endpoint)
    reader = duckdb.connect(setup_env["analytics_db"], read_only=True)

    # Rebuild while reader is open — should not crash
    result = orch.rebuild()
    assert "keboola" in result

    reader.close()

Step 2: Run test

Run: pytest tests/test_orchestrator.py::test_rebuild_while_reading -v Expected: Pass

Step 3: Commit

git add tests/test_orchestrator.py
git commit -m "test: add concurrent rebuild safety test"

Execution Order

Workstreams are independent and can run in parallel. Within each workstream, tasks are sequential.

Critical path (do first):

Task 1.1 (password bypass) — active auth vulnerability
Task 3.1 (rebuild_source) — active data loss bug
Task 2.1 (SQL injection) — security hardening

Then: 4. Tasks 1.2, 1.3 (JWT hardening) 5. Tasks 2.2 (query blocklist) 6. Tasks 3.2, 3.3 (WAL + BQ swap) 7. Task 4.1 (sandbox) 8. Tasks 7.1-7.4 (DuckDB lifecycle) 9. Tasks 8.1-8.5 (scalability + cleanup) 10. Tasks 5.1-5.3 (test hardening) 11. Tasks 9.1-9.4 (missing test coverage) 12. Tasks 6.1-6.4 (docs + cleanup) 13. Task 1.4 (cookie auth)

Verification after all tasks:

pytest tests/ -v --tb=short  # All 620+ tests pass

Workstreams are independent and can run in parallel. Within each workstream, tasks are sequential.

Critical path (do first):

Task 1.1 (password bypass) — active auth vulnerability
Task 3.1 (rebuild_source) — active data loss bug
Task 2.1 (SQL injection) — security hardening

Then: 4. Tasks 1.2, 1.3 (JWT hardening) 5. Tasks 2.2 (query blocklist) 6. Tasks 3.2, 3.3 (WAL + BQ swap) 7. Task 4.1 (sandbox) 8. Tasks 5.1-5.3 (test hardening) 9. Tasks 6.1-6.4 (docs + cleanup)

Verification after all tasks:

pytest tests/ -v --tb=short  # All 607+ tests pass

54 KiB Raw Blame History

Production Hardening Implementation Plan

Workstream 1: Authentication & Security (P0)

Task 1.1: Fix password bypass in /auth/token

Task 1.2: Fail on default JWT secret in non-test environments

Task 1.3: Reduce JWT expiry, add jti claim

Task 1.4: Fix get_optional_user not checking cookies

Workstream 2: SQL Safety (P0/P1)

Task 2.1: Add identifier validation to orchestrator

Task 2.2: Harden query endpoint SQL blocklist

Workstream 3: Orchestrator Bugs (P0/P1)

Task 3.1: Fix rebuild_source destroying all other sources' views

Task 3.2: Handle WAL files in atomic swap

Task 3.3: Add temp-file swap to BigQuery extractor

Workstream 4: Script Sandbox Hardening (P1)

Task 4.1: Strip VIRTUAL_ENV and PYTHONPATH from sandbox subprocess

Workstream 5: Test Hardening (P0-P1)

Task 5.1: Fix environment variable leaking in test fixtures

Task 5.2: Add extract.duckdb contract test

Task 5.3: Add pytest timeout and strict markers

Workstream 6: Docs & Cleanup (P1-P2)

Task 6.1: Rewrite README.md from CLAUDE.md

Task 6.2: Update .env.template to match actual code

Task 6.3: Remove dead Flask Blueprint from Jira connector

Task 6.4: Add upload size limit

Workstream 7: DuckDB Lifecycle & Connection Management (P1)

Task 7.1: Fix SQL injection in get_analytics_db_readonly

Task 7.2: Remove dead PRAGMA enable_wal code

Task 7.3: Escape token single-quotes in ATTACH SQL

Task 7.4: Add temp-file swap to Jira extract_init.update_meta

Workstream 8: Scalability & Robustness (P1)

Task 8.1: Fix table access check false positives in query endpoint

Task 8.2: Replace Docker healthcheck with curl

Task 8.3: Add graceful shutdown handler

Task 8.4: Extract shared _get_data_dir utility

Task 8.5: Move faker to dev dependencies

Workstream 9: Missing Test Coverage (P0-P1)

Task 9.1: Add web UI smoke tests

Task 9.2: Add Jira service integration tests

Task 9.3: Add instance_config tests

Task 9.4: Add concurrent rebuild safety test

Execution Order

54 KiB

Raw Blame History