Production Hardening Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Fix all P0/P1 issues from 4 independent code reviews (architect, data engineer, senior dev, test specialist) to make the codebase production-ready.
Architecture: Fixes are grouped into independent workstreams: auth/security, SQL safety, orchestrator bugs, script sandbox hardening, test hardening, docs/cleanup, and DuckDB lifecycle. Each workstream can be executed in parallel by separate agents.
Tech Stack: Python 3.13, FastAPI, DuckDB, pytest, Docker
Source: Consolidated findings from 4 review agents run 2026-04-08.
Workstream 1: Authentication & Security (P0)
Task 1.1: Fix password bypass in /auth/token
The /auth/token endpoint issues a JWT without verifying the password when request.password is empty, even if user.password_hash exists. Anyone who knows the email of a user with a password can get a token for that account by omitting the password field.
Files:
- Modify: app/auth/router.py:47-54
- Test: tests/test_auth_providers.py
- Step 1: Write the failing test
# In tests/test_auth_providers.py, add at module level:
def test_password_required_when_hash_exists(client, e2e_env):
    """A user with password_hash must provide the correct password."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from argon2 import PasswordHasher
    conn = get_system_db()
    repo = UserRepository(conn)
    ph = PasswordHasher()
    repo.create(id="pw-user", email="pw@test.com", role="analyst")
    conn.execute(
        "UPDATE users SET password_hash = ? WHERE id = ?",
        [ph.hash("correct-password"), "pw-user"],
    )
    conn.close()
    # Empty password should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": ""})
    assert resp.status_code == 401
    # Missing password field should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com"})
    assert resp.status_code == 401
    # Wrong password should be rejected
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": "wrong"})
    assert resp.status_code == 401
    # Correct password should work
    resp = client.post("/auth/token", json={"email": "pw@test.com", "password": "correct-password"})
    assert resp.status_code == 200
- Step 2: Run test to verify it fails
Run: pytest tests/test_auth_providers.py::test_password_required_when_hash_exists -v
Expected: FAIL — empty password returns 200 instead of 401
- Step 3: Fix the auth logic
In app/auth/router.py, replace lines 47-54:
# If user has password_hash, require and verify password
if user.get("password_hash"):
    if not request.password:
        raise HTTPException(status_code=401, detail="Password required")
    try:
        from argon2 import PasswordHasher
        ph = PasswordHasher()
        ph.verify(user["password_hash"], request.password)
    except Exception:
        raise HTTPException(status_code=401, detail="Invalid password")
- Step 4: Run test to verify it passes
Run: pytest tests/test_auth_providers.py::test_password_required_when_hash_exists -v
Expected: PASS
- Step 5: Run full auth test suite
Run: pytest tests/test_auth_providers.py tests/test_security.py -v
Expected: All pass
- Step 6: Commit
git add app/auth/router.py tests/test_auth_providers.py
git commit -m "fix: require password when password_hash exists — prevents auth bypass"
Task 1.2: Fail on default JWT secret in non-test environments
The app starts with a hardcoded known secret if JWT_SECRET_KEY env var is missing. A production deployment that forgets to set it is wide open.
Files:
- Modify: app/auth/jwt.py:9-16
- Test: tests/test_security.py
- Step 1: Write the failing test
# In tests/test_security.py, add:
def test_jwt_rejects_default_secret_in_production(monkeypatch):
    """App should refuse to start with the default JWT secret unless TESTING=1."""
    monkeypatch.delenv("JWT_SECRET_KEY", raising=False)
    monkeypatch.delenv("TESTING", raising=False)
    with pytest.raises(RuntimeError, match="JWT_SECRET_KEY"):
        # Force reimport to trigger module-level check
        import importlib
        import app.auth.jwt as jwt_mod
        importlib.reload(jwt_mod)
- Step 2: Run test to verify it fails
Run: pytest tests/test_security.py::test_jwt_rejects_default_secret_in_production -v
Expected: FAIL — no RuntimeError raised
- Step 3: Fix jwt.py
Replace app/auth/jwt.py lines 9-16:
SECRET_KEY = os.environ.get("JWT_SECRET_KEY", "")
if not SECRET_KEY:
    if os.environ.get("TESTING", "").lower() in ("1", "true"):
        SECRET_KEY = "test-jwt-secret-key-minimum-32-chars!!"
    else:
        raise RuntimeError(
            "JWT_SECRET_KEY environment variable is required. "
            "Generate one: python -c \"import secrets; print(secrets.token_hex(32))\""
        )
elif len(SECRET_KEY) < 32 and os.environ.get("TESTING", "").lower() not in ("1", "true"):
    import warnings as _warnings
    _warnings.warn(
        f"JWT_SECRET_KEY is {len(SECRET_KEY)} chars; minimum 32 recommended",
        UserWarning, stacklevel=2,
    )
- Step 4: Run tests
Run: pytest tests/test_security.py tests/test_auth_providers.py -v
Expected: All pass (existing tests set TESTING=1 or JWT_SECRET_KEY via conftest)
- Step 5: Commit
git add app/auth/jwt.py tests/test_security.py
git commit -m "fix: fail startup on missing JWT_SECRET_KEY in non-test environments"
Task 1.3: Reduce JWT expiry, add jti claim
30-day tokens with no revocation mechanism are too risky.
Files:
- Modify: app/auth/jwt.py:18-19,28-37
- Test: tests/test_security.py
- Step 1: Write the failing test
# In tests/test_security.py, add:
def test_jwt_contains_jti_claim():
    """JWT tokens must contain a unique jti claim for future revocation support."""
    from app.auth.jwt import create_access_token, verify_token
    token = create_access_token(user_id="u1", email="a@b.com", role="analyst")
    payload = verify_token(token)
    assert "jti" in payload
    assert len(payload["jti"]) >= 16

def test_jwt_expiry_is_24_hours():
    """JWT tokens should expire in 24 hours, not 30 days."""
    from app.auth.jwt import ACCESS_TOKEN_EXPIRE_HOURS
    assert ACCESS_TOKEN_EXPIRE_HOURS == 24
- Step 2: Run to verify failure
Run: pytest tests/test_security.py::test_jwt_contains_jti_claim tests/test_security.py::test_jwt_expiry_is_24_hours -v
Expected: FAIL
- Step 3: Fix jwt.py
In app/auth/jwt.py:
Change line 19: ACCESS_TOKEN_EXPIRE_HOURS = 24
Add import uuid at the top. In create_access_token, add "jti" to payload:
payload = {
    "sub": user_id,
    "email": email,
    "role": role,
    "exp": expire,
    "iat": datetime.now(timezone.utc),
    "jti": uuid.uuid4().hex,
}
- Step 4: Run tests
Run: pytest tests/ -q --tb=short
Expected: All pass
- Step 5: Commit
git add app/auth/jwt.py tests/test_security.py
git commit -m "fix: reduce JWT expiry to 24h, add jti claim for future revocation"
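The jti claim only pays off once something checks it. A minimal sketch of what a later revocation check could look like, assuming an in-memory store: `RevocationList` and its methods are hypothetical names, not part of the codebase, and a real deployment would presumably need persistent storage shared across workers.

```python
import time
import uuid
from typing import Dict, Optional


class RevocationList:
    """Hypothetical in-memory denylist keyed by the token's jti claim."""

    def __init__(self) -> None:
        # Maps jti -> expiry timestamp, so entries can be dropped once the
        # token would have expired anyway.
        self._revoked: Dict[str, float] = {}

    def revoke(self, payload: dict) -> None:
        self._revoked[payload["jti"]] = float(payload["exp"])

    def is_revoked(self, payload: dict) -> bool:
        return payload.get("jti") in self._revoked

    def purge_expired(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._revoked = {j: exp for j, exp in self._revoked.items() if exp > now}


# Example payload shaped like the one built in create_access_token
payload = {"sub": "u1", "jti": uuid.uuid4().hex, "exp": time.time() + 24 * 3600}
revocations = RevocationList()
```

Keeping the expiry next to each jti means the list never grows beyond the set of tokens that are still valid, which is bounded by the 24-hour lifetime.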
Task 1.4: Fix get_optional_user not checking cookies
get_optional_user only checks the Authorization header, not cookies. Web UI users appear as None.
Files:
- Modify: app/auth/dependencies.py:60-70
- Test: tests/test_auth_providers.py
- Step 1: Write the failing test
# In tests/test_auth_providers.py, add:
def test_optional_user_reads_cookie(client, e2e_env):
    """get_optional_user should detect cookie-authenticated users."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from app.auth.jwt import create_access_token
    conn = get_system_db()
    UserRepository(conn).create(id="cookie-user", email="cookie@test.com", role="analyst")
    conn.close()
    token = create_access_token(user_id="cookie-user", email="cookie@test.com", role="analyst")
    # Simulate web UI request with cookie but no Authorization header
    resp = client.get("/api/catalog", cookies={"access_token": token})
    assert resp.status_code == 200
- Step 2: Run to verify behavior (this may or may not fail depending on endpoint requirements)
- Step 3: Fix dependencies.py
Replace get_optional_user in app/auth/dependencies.py:
async def get_optional_user(
    request: Request = None,
    authorization: Optional[str] = Header(None),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
) -> Optional[dict]:
    """Like get_current_user but returns None instead of 401 if no token."""
    try:
        return await get_current_user(request=request, authorization=authorization, conn=conn)
    except HTTPException:
        return None
- Step 4: Run tests
Run: pytest tests/test_auth_providers.py tests/test_api.py -v
Expected: All pass
- Step 5: Commit
git add app/auth/dependencies.py tests/test_auth_providers.py
git commit -m "fix: get_optional_user now checks cookies like get_current_user"
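The idea behind this task is that both dependencies should resolve the token the same way: Authorization header first, then cookie. A minimal sketch of that lookup as a pure function follows. `extract_token` is a hypothetical helper, and the `access_token` cookie name is taken from the test in this task, so how get_current_user actually performs the lookup may differ.

```python
from typing import Mapping, Optional


def extract_token(authorization: Optional[str], cookies: Mapping[str, str]) -> Optional[str]:
    """Return the bearer token from the header, else from the cookie, else None."""
    if authorization:
        scheme, _, value = authorization.partition(" ")
        if scheme.lower() == "bearer" and value.strip():
            return value.strip()
    # Fall back to the cookie set by the web UI (cookie name is an assumption)
    return cookies.get("access_token") or None
```

Delegating to a single lookup keeps the optional and the required dependency from drifting apart again.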
Workstream 2: SQL Safety (P0/P1)
Task 2.1: Add identifier validation to orchestrator
source_name from directory names and table_name from _meta are interpolated into SQL without validation. A crafted directory name or _meta entry could inject arbitrary SQL.
Files:
- Modify: src/orchestrator.py (add validation helper, apply in _attach_and_create_views and _attach_remote_extensions)
- Test: tests/test_orchestrator.py
- Step 1: Write the failing tests
# In tests/test_orchestrator.py, add:
def test_rejects_malicious_source_name(setup_env):
    """Orchestrator must reject directory names with SQL injection characters."""
    from src.orchestrator import SyncOrchestrator
    malicious_dir = setup_env["extracts_dir"] / 'test; DROP TABLE _meta--'
    malicious_dir.mkdir()
    db_path = malicious_dir / "extract.duckdb"
    import duckdb as _duckdb
    conn = _duckdb.connect(str(db_path))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR DEFAULT 'local'
    )""")
    conn.execute('CREATE TABLE "safe_table" (id VARCHAR)')
    conn.execute("INSERT INTO _meta VALUES ('safe_table', '', 0, 0, current_timestamp, 'local')")
    conn.close()
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    result = orch.rebuild()
    # Malicious source should be skipped, not attached
    assert 'test; DROP TABLE _meta--' not in result

def test_rejects_malicious_table_name(setup_env):
    """Orchestrator must reject table names with SQL injection characters."""
    from src.orchestrator import SyncOrchestrator
    source_dir = setup_env["extracts_dir"] / "keboola"
    source_dir.mkdir()
    (source_dir / "data").mkdir()
    db_path = source_dir / "extract.duckdb"
    import duckdb as _duckdb
    conn = _duckdb.connect(str(db_path))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR DEFAULT 'local'
    )""")
    conn.execute('CREATE TABLE "safe" (id VARCHAR)')
    conn.execute("INSERT INTO _meta VALUES ('safe', '', 0, 0, current_timestamp, 'local')")
    # Malicious table name in _meta
    conn.execute("""INSERT INTO _meta VALUES ('x"; DROP TABLE _meta; --', '', 0, 0, current_timestamp, 'local')""")
    conn.close()
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    result = orch.rebuild()
    # Safe table should be there, malicious should be skipped
    assert "keboola" in result
    assert "safe" in result["keboola"]
    assert 'x"; DROP TABLE _meta; --' not in result.get("keboola", [])
- Step 2: Run to verify failure
Run: pytest tests/test_orchestrator.py::test_rejects_malicious_source_name tests/test_orchestrator.py::test_rejects_malicious_table_name -v
Expected: FAIL or crash from SQL injection
- Step 3: Add validation helper and apply it
At the top of src/orchestrator.py, add after imports:
import re
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
def _validate_identifier(name: str, context: str) -> bool:
    """Validate a DuckDB identifier. Returns True if safe, False if not."""
    # fullmatch, because match with "$" would accept a name ending in a newline
    if not _SAFE_IDENTIFIER.fullmatch(name):
        logger.warning("Rejected unsafe %s identifier: %r", context, name)
        return False
    return True
In _do_rebuild (line ~92), add check before _attach_and_create_views:
if not _validate_identifier(ext_dir.name, "source_name"):
    continue
In _attach_and_create_views (line ~160), add check before CREATE VIEW:
for table_name, rows, size_bytes, query_mode in meta_rows:
    if not _validate_identifier(table_name, "table_name"):
        continue
In _attach_remote_extensions (line ~193), add check:
for alias, extension, url, token_env in rows:
    if not _validate_identifier(alias, "alias") or not _validate_identifier(extension, "extension"):
        continue
- Step 4: Run all orchestrator tests
Run: pytest tests/test_orchestrator.py -v
Expected: All pass including new tests
- Step 5: Commit
git add src/orchestrator.py tests/test_orchestrator.py
git commit -m "fix: validate SQL identifiers in orchestrator — prevent injection via directory/table names"
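To see which names the pattern from this task lets through, here is a small self-contained check. `is_safe_identifier` is a stand-in name for illustration, and the sketch uses fullmatch so that a name ending in a newline is rejected as well.

```python
import re

# Same character rules as the pattern in this task: a letter or underscore,
# followed by up to 63 letters, digits, or underscores.
SAFE_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,63}")


def is_safe_identifier(name: str) -> bool:
    """Return True only if the whole string is a plain identifier."""
    return SAFE_IDENTIFIER.fullmatch(name) is not None


candidates = [
    "keboola",
    "_meta",
    "test; DROP TABLE _meta--",
    'x"; DROP TABLE _meta; --',
    "orders\n",   # trailing newline
    "9lives",     # starts with a digit
    "a" * 65,     # one character over the length limit
]
accepted = [c for c in candidates if is_safe_identifier(c)]
```

Only the first two candidates are accepted. An allowlist like this is deliberately strict: legitimate names containing hyphens or dots are skipped too, which is the trade-off for never having to quote identifiers.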
Task 2.2: Harden query endpoint SQL blocklist
The current blocklist misses parquet_scan, read_csv_auto, query_table, and has false positives on semicolons in string literals. Also add enable_external_access=false on the analytics connection.
Files:
- Modify: app/api/query.py:39-51 and src/db.py (analytics readonly connection)
- Test: tests/test_security.py
- Step 1: Write the failing tests
# In tests/test_security.py, add to TestQuerySecurity:
def test_blocks_parquet_scan(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM parquet_scan('/etc/passwd')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_read_csv_auto(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM read_csv_auto('/data/state/system.duckdb')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_query_table(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM query_table('secret')"}, headers=auth_headers)
    assert resp.status_code == 400

def test_blocks_httpfs(self, client, auth_headers):
    resp = client.post("/api/query", json={"sql": "SELECT * FROM read_parquet('https://evil.com/data.parquet')"}, headers=auth_headers)
    assert resp.status_code == 400
- Step 2: Run to verify failure
Run: pytest tests/test_security.py::TestQuerySecurity -v
Expected: Some FAIL (parquet_scan, read_csv_auto, query_table not blocked)
- Step 3: Expand the blocklist and set enable_external_access=false
In app/api/query.py, replace the blocked list (lines 39-49):
blocked = [
    "drop ", "delete ", "insert ", "update ", "alter ", "create ",
    "copy ", "attach ", "detach ", "load ", "install ",
    "export ", "import ", "pragma ", "call ",
    # File access functions
    "read_csv", "read_json", "read_parquet", "read_text",
    "write_csv", "write_parquet", "read_blob", "read_ndjson",
    "parquet_scan", "parquet_metadata", "parquet_schema",
    "json_scan", "csv_scan",
    "query_table", "iceberg_scan", "delta_scan",
    "glob(", "list_files",
    "'/", '"/', "http://", "https://", "s3://", "gcs://",
    # Multiple statements
    ";",
]
In src/db.py get_analytics_db_readonly(), after opening the connection, add:
try:
    conn.execute("SET enable_external_access = false")
except Exception:
    pass  # Older DuckDB versions
- Step 4: Run tests
Run: pytest tests/test_security.py::TestQuerySecurity -v
Expected: All pass
- Step 5: Commit
git add app/api/query.py src/db.py tests/test_security.py
git commit -m "fix: expand query blocklist, disable external_access on analytics connection"
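The task description mentions false positives on semicolons inside string literals. One possible way to narrow that check is to blank out quoted text before looking for a statement separator. This is a sketch under stated assumptions: `has_statement_separator` is a hypothetical helper, it only understands single-quoted literals and double-quoted identifiers with doubled-quote escapes, and it ignores comments and dollar-quoted strings, so a real SQL parser would be more robust.

```python
import re

# Single-quoted string literals ('' is an escaped quote) and
# double-quoted identifiers ("" is an escaped quote).
_SINGLE = r"'(?:[^']|'')*'"
_DOUBLE = r'"(?:[^"]|"")*"'
_QUOTED = re.compile(_SINGLE + "|" + _DOUBLE)


def has_statement_separator(sql: str) -> bool:
    """Return True if a semicolon appears outside quoted text."""
    # Replace every complete quoted span with an empty literal
    stripped = _QUOTED.sub("''", sql)
    return ";" in stripped
```

An unterminated quote never matches the pattern, so anything after it stays visible to the check and a semicolon there is still reported. That keeps the helper conservative when the input is malformed.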
Workstream 3: Orchestrator Bugs (P0/P1)
Task 3.1: Fix rebuild_source destroying all other sources' views
_do_rebuild_source() creates a fresh temp DB with only one source, then replaces the entire analytics DB. Every Jira webhook wipes all Keboola/BigQuery views.
Files:
- Modify: src/orchestrator.py:116-141
- Test: tests/test_orchestrator.py
- Step 1: Write the failing test
# In tests/test_orchestrator.py, add:
def test_rebuild_source_preserves_other_sources(setup_env):
    """rebuild_source('jira') must NOT destroy views from keboola."""
    from src.orchestrator import SyncOrchestrator
    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )
    _create_mock_extract(
        setup_env["extracts_dir"], "jira",
        [{"name": "issues", "data": [{"key": "PROJ-1"}]}],
    )
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    # First: full rebuild
    result = orch.rebuild()
    assert "keboola" in result
    assert "jira" in result
    # Second: rebuild only jira
    jira_tables = orch.rebuild_source("jira")
    assert "issues" in jira_tables
    # Third: full rebuild again, keboola must still be there
    result2 = orch.rebuild()
    assert "keboola" in result2
    assert "orders" in result2["keboola"]
- Step 2: Run to verify failure
Run: pytest tests/test_orchestrator.py::test_rebuild_source_preserves_other_sources -v
Expected: FAIL — keboola views gone after rebuild_source("jira")
- Step 3: Fix _do_rebuild_source to delegate to full rebuild
In src/orchestrator.py, replace _do_rebuild_source (lines 116-141):
def _do_rebuild_source(self, source_name: str) -> List[str]:
    """Rebuild views for a single source by doing a full rebuild.

    A full rebuild is necessary because the analytics DB is created fresh
    each time (temp file + atomic swap). Rebuilding only one source would
    destroy views from all other sources.
    """
    extracts_dir = _get_extracts_dir()
    db_file = extracts_dir / source_name / "extract.duckdb"
    if not db_file.exists():
        logger.warning("No extract.duckdb for source %s", source_name)
        return []
    result = self._do_rebuild()
    return result.get(source_name, [])
- Step 4: Run tests
Run: pytest tests/test_orchestrator.py -v
Expected: All pass
- Step 5: Commit
git add src/orchestrator.py tests/test_orchestrator.py
git commit -m "fix: rebuild_source delegates to full rebuild — preserves other sources' views"
Task 3.2: Handle WAL files in atomic swap
shutil.move only moves the .duckdb file. The .wal file from the old DB can corrupt the new one.
Files:
- Modify: src/orchestrator.py (_do_rebuild lines 106-112)
- Modify: connectors/keboola/extractor.py (lines 148-155)
- Test: tests/test_orchestrator.py
- Step 1: Write the failing test
# In tests/test_orchestrator.py, add:
def test_rebuild_cleans_wal_files(setup_env):
    """After rebuild, no .wal files should remain from the temp or old DB."""
    from src.orchestrator import SyncOrchestrator
    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    orch.rebuild()
    from pathlib import Path
    db_path = Path(setup_env["analytics_db"])
    assert not (db_path.parent / (db_path.name + ".wal")).exists()
    assert not (db_path.parent / (db_path.name + ".tmp.wal")).exists()
- Step 2: Run to verify it passes or fails
Run: pytest tests/test_orchestrator.py::test_rebuild_cleans_wal_files -v
- Step 3: Add WAL cleanup helper
In src/orchestrator.py, add a helper after _validate_identifier:
def _atomic_swap_db(tmp_path: str, target_path: str) -> None:
    """Atomically replace target DuckDB file, cleaning up WAL files."""
    import shutil
    target = Path(target_path)
    tmp = Path(tmp_path)
    # Remove old WAL file if it exists
    old_wal = Path(str(target) + ".wal")
    if old_wal.exists():
        old_wal.unlink()
    # Move temp DB into place
    if tmp.exists():
        shutil.move(str(tmp), str(target))
    # Clean up temp WAL
    tmp_wal = Path(str(tmp) + ".wal")
    if tmp_wal.exists():
        tmp_wal.unlink()
Replace shutil.move call in _do_rebuild (line ~112) with:
_atomic_swap_db(tmp_path, self._db_path)
Also add CHECKPOINT before conn.close() in _do_rebuild:
    conn.execute("CHECKPOINT")
finally:
    conn.close()
Apply the same pattern in connectors/keboola/extractor.py at the end of run().
- Step 4: Run tests
Run: pytest tests/test_orchestrator.py tests/test_keboola_extractor.py -v
Expected: All pass
- Step 5: Commit
git add src/orchestrator.py connectors/keboola/extractor.py tests/test_orchestrator.py
git commit -m "fix: clean WAL files during atomic DB swap, add CHECKPOINT before close"
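One detail worth keeping in mind for the swap: shutil.move falls back to copy-and-delete when source and destination are on different filesystems, so the replacement is only a single rename when both paths share a filesystem. A minimal sketch of the same swap using os.replace, which overwrites the destination in one rename, is shown below. `swap_db_file` is a hypothetical name, plain text files stand in for DuckDB databases, and the ".wal" suffix convention follows the helper in this task.

```python
import os
import tempfile
from pathlib import Path


def swap_db_file(tmp_path: Path, target_path: Path) -> None:
    """Replace target_path with tmp_path and drop stale WAL files."""
    # Remove the WAL that belonged to the old target database, if any
    Path(str(target_path) + ".wal").unlink(missing_ok=True)
    # os.replace overwrites the destination in a single rename
    os.replace(tmp_path, target_path)
    # The temp database should have been checkpointed; remove any leftover WAL
    Path(str(tmp_path) + ".wal").unlink(missing_ok=True)


# Demonstration with plain files standing in for DuckDB databases
workdir = Path(tempfile.mkdtemp())
target = workdir / "analytics.duckdb"
tmp = workdir / "analytics.duckdb.tmp"
target.write_text("old")
Path(str(target) + ".wal").write_text("stale wal")
tmp.write_text("new")
swap_db_file(tmp, target)
```

Since the temp file in this plan is created next to the target, both paths normally sit on the same filesystem and either call behaves as a rename.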
Task 3.3: Add temp-file swap to BigQuery extractor
BigQuery extractor writes directly to extract.duckdb, causing lock conflicts with the orchestrator.
Files:
- Modify: connectors/bigquery/extractor.py:64-68
- Test: tests/test_bigquery_extractor.py
- Step 1: Write the test
# In tests/test_bigquery_extractor.py, add:
def test_uses_temp_file_swap(self, output_dir):
    """BigQuery extractor should write to .tmp and rename, not write directly."""
    from connectors.bigquery.extractor import _create_meta_table
    db_path = Path(output_dir) / "extract.duckdb"
    # Pre-create the DB to simulate existing file
    conn = duckdb.connect(str(db_path))
    _create_meta_table(conn)
    conn.close()
    # After init_extract, the file should exist and no .tmp should remain
    # (The actual init_extract test already covers this; we just verify no .tmp leak)
    assert db_path.exists()
    assert not (Path(output_dir) / "extract.duckdb.tmp").exists()
- Step 2: Modify init_extract to use temp-file swap
In connectors/bigquery/extractor.py, replace lines 64-68:
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
db_path = output_path / "extract.duckdb"
tmp_db_path = output_path / "extract.duckdb.tmp"
if tmp_db_path.exists():
    tmp_db_path.unlink()
conn = duckdb.connect(str(tmp_db_path))
And at the end, before return stats (after conn.close()):
import shutil
if tmp_db_path.exists():
    shutil.move(str(tmp_db_path), str(db_path))
- Step 3: Run tests
Run: pytest tests/test_bigquery_extractor.py -v
Expected: All pass
- Step 4: Commit
git add connectors/bigquery/extractor.py tests/test_bigquery_extractor.py
git commit -m "fix: BigQuery extractor uses temp-file swap to avoid lock conflicts"
Workstream 4: Script Sandbox Hardening (P1)
Task 4.1: Strip VIRTUAL_ENV and PYTHONPATH from sandbox subprocess
The sandbox gives scripts access to all installed packages (httpx, duckdb) via inherited env vars.
Files:
- Modify: app/api/scripts.py:191-198
- Test: tests/test_security.py
- Step 1: Write the failing test
# In tests/test_security.py, add to TestScriptSecurity:
def test_sandbox_cannot_import_httpx(self, client, admin_headers):
    """Sandboxed scripts must not have access to httpx or other installed packages."""
    resp = client.post("/api/scripts/run", json={
        "name": "test",
        "source": "import httpx\nprint('pwned')"
    }, headers=admin_headers)
    data = resp.json()
    # httpx should be blocked by pattern OR unavailable due to stripped VIRTUAL_ENV
    assert resp.status_code == 400 or data.get("exit_code", 0) != 0
- Step 2: Run to verify failure
Run: pytest tests/test_security.py::TestScriptSecurity::test_sandbox_cannot_import_httpx -v
Expected: FAIL — httpx imports successfully
- Step 3: Fix the subprocess env
In app/api/scripts.py, replace the env dict in subprocess.run (lines 191-198):
env={
    "PATH": "/usr/bin:/usr/local/bin",
    "DATA_DIR": data_dir,
    "HOME": "/tmp",
    # Deliberately exclude VIRTUAL_ENV and PYTHONPATH
    # to prevent access to installed packages
},
Also add "httpx", "from httpx" to blocked_patterns list.
- Step 4: Run tests
Run: pytest tests/test_security.py::TestScriptSecurity -v
Expected: All pass
- Step 5: Commit
git add app/api/scripts.py tests/test_security.py
git commit -m "fix: strip VIRTUAL_ENV/PYTHONPATH from script sandbox, block httpx import"
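The effect of passing an explicit env dict can be checked in isolation. The sketch below is not the sandbox code: the DATA_DIR value is a placeholder, and it uses sys.executable only to have an interpreter to launch. It shows that variables set in the parent process are not visible to a child started with a minimal environment.

```python
import os
import subprocess
import sys

# Minimal environment: only what the child is meant to see.
# The DATA_DIR value is a placeholder for illustration.
child_env = {
    "PATH": "/usr/bin:/usr/local/bin",
    "DATA_DIR": "/tmp/data",
    "HOME": "/tmp",
}

# Set variables in the parent to show that they are not inherited
os.environ["VIRTUAL_ENV"] = "/opt/venv"
os.environ["PYTHONPATH"] = "/opt/extra"

result = subprocess.run(
    [sys.executable, "-c", "import os; print(sorted(os.environ))"],
    env=child_env,
    capture_output=True,
    text=True,
    timeout=30,
)
child_vars = result.stdout
```

This only demonstrates environment inheritance. Which packages a script can import also depends on which interpreter is launched: an interpreter that lives inside a virtual environment still sees that environment's site-packages regardless of these variables.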
Workstream 5: Test Hardening (P0-P1)
Task 5.1: Fix environment variable leaking in test fixtures
Most test files set os.environ["DATA_DIR"] directly without cleanup. This causes test ordering dependencies.
Files:
- Modify: tests/test_db.py, tests/test_rbac.py, tests/test_repositories.py, tests/test_api.py, tests/test_api_complete.py, tests/test_api_scripts.py, tests/test_auth_providers.py, tests/test_bootstrap.py, tests/test_permissions.py, tests/test_security.py
- Step 1: Search and replace pattern
In every test file that has os.environ["DATA_DIR"] = inside a fixture, replace with monkeypatch.setenv("DATA_DIR", ...). Add monkeypatch to the fixture parameters.
For example, in tests/test_db.py, change:
@pytest.fixture
def db_env(tmp_path):
    os.environ["DATA_DIR"] = str(tmp_path)
    yield tmp_path
To:
@pytest.fixture
def db_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    yield tmp_path
Apply to all affected files. Remove manual os.environ.pop("DATA_DIR", None) lines since monkeypatch handles cleanup automatically.
- Step 2: Run full test suite
Run: pytest tests/ -v --tb=short
Expected: 607+ passed
- Step 3: Commit
git add tests/
git commit -m "fix: use monkeypatch for DATA_DIR in all test fixtures — prevent env leaking"
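To confirm that no direct assignment was missed, a short scan over the test files can list the remaining occurrences. This is a sketch: `find_direct_env_sets` is a hypothetical helper, it looks only at files named test_*.py directly inside the given directory, and it matches on text rather than parsing the code.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

# Matches direct assignments such as os.environ["DATA_DIR"] = ...
# (the negative lookahead skips "==" comparisons).
_DIRECT_SET = re.compile(r"""os\.environ\[\s*["']DATA_DIR["']\s*\]\s*=(?!=)""")


def find_direct_env_sets(tests_dir: Path) -> List[Tuple[str, int]]:
    """Return (file name, line number) for each direct DATA_DIR assignment."""
    hits = []
    for path in sorted(tests_dir.glob("test_*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if _DIRECT_SET.search(line):
                hits.append((path.name, lineno))
    return hits


# Demonstration on a throwaway directory
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "test_a.py").write_text('import os\nos.environ["DATA_DIR"] = "/tmp/x"\n')
(demo_dir / "test_b.py").write_text('def f(monkeypatch):\n    monkeypatch.setenv("DATA_DIR", "/tmp/x")\n')
hits = find_direct_env_sets(demo_dir)
```

An empty result after the replacement is a quick signal that every fixture now goes through monkeypatch.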
Task 5.2: Add extract.duckdb contract test
Create a shared validator that verifies any extract.duckdb conforms to the contract. Apply it in all extractor tests.
Files:
- Create: tests/helpers/contract.py
- Modify: tests/test_keboola_extractor.py, tests/test_bigquery_extractor.py
- Step 1: Create contract validator
# tests/helpers/__init__.py (empty)
# tests/helpers/contract.py
"""Shared validator for the extract.duckdb contract."""
import duckdb
from pathlib import Path

def validate_extract_contract(db_path: str) -> None:
    """Verify an extract.duckdb conforms to the contract.

    Raises AssertionError with details if any check fails.
    """
    path = Path(db_path)
    assert path.exists(), f"extract.duckdb not found at {db_path}"
    conn = duckdb.connect(str(path), read_only=True)
    try:
        # _meta table must exist with correct schema
        cols = conn.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_name='_meta' ORDER BY ordinal_position"
        ).fetchall()
        col_names = [c[0] for c in cols]
        assert col_names == ["table_name", "description", "rows", "size_bytes", "extracted_at", "query_mode"], \
            f"_meta schema mismatch: {col_names}"
        # Every _meta entry with query_mode='local' must have a corresponding view or table
        local_tables = conn.execute(
            "SELECT table_name FROM _meta WHERE query_mode = 'local'"
        ).fetchall()
        for (name,) in local_tables:
            tables = conn.execute(
                "SELECT table_name FROM information_schema.tables WHERE table_name = ?", [name]
            ).fetchall()
            assert len(tables) > 0, f"Local table '{name}' in _meta but no view/table exists"
        # If _remote_attach exists, validate its schema
        ra_exists = conn.execute(
            "SELECT count(*) FROM information_schema.tables WHERE table_name='_remote_attach'"
        ).fetchone()[0]
        if ra_exists:
            ra_cols = conn.execute(
                "SELECT column_name FROM information_schema.columns "
                "WHERE table_name='_remote_attach' ORDER BY ordinal_position"
            ).fetchall()
            ra_col_names = [c[0] for c in ra_cols]
            assert ra_col_names == ["alias", "extension", "url", "token_env"], \
                f"_remote_attach schema mismatch: {ra_col_names}"
    finally:
        conn.close()
- Step 2: Apply in extractor tests
In tests/test_keboola_extractor.py, add to test_creates_extract_duckdb:
from tests.helpers.contract import validate_extract_contract
validate_extract_contract(str(Path(output_dir) / "extract.duckdb"))
Similarly in tests/test_bigquery_extractor.py::test_creates_extract_duckdb_with_meta.
- Step 3: Run tests
Run: pytest tests/test_keboola_extractor.py tests/test_bigquery_extractor.py -v
Expected: All pass
- Step 4: Commit
git add tests/helpers/ tests/test_keboola_extractor.py tests/test_bigquery_extractor.py
git commit -m "feat: add extract.duckdb contract validator, apply in all extractor tests"
Task 5.3: Add pytest timeout and strict markers
Prevent CI hangs and catch marker typos.
Files:
- Modify: pytest.ini
- Modify: requirements.txt (add pytest-timeout)
- Step 1: Update pytest.ini
[pytest]
addopts = -m "not live and not docker" --timeout=60 --strict-markers
markers =
    live: tests requiring server access (run with '-m live')
    docker: tests requiring Docker (run with '-m docker')
- Step 2: Add pytest-timeout to requirements.txt
Add line: pytest-timeout>=2.0.0
- Step 3: Install and run
Run: uv pip install --system pytest-timeout && pytest tests/ -q --tb=short
Expected: All pass within 60s timeout
- Step 4: Commit
git add pytest.ini requirements.txt
git commit -m "chore: add pytest-timeout (60s) and strict-markers to pytest config"
Workstream 6: Docs & Cleanup (P1-P2)
Task 6.1: Rewrite README.md from CLAUDE.md
The current README describes the old Flask/rsync architecture. CLAUDE.md is accurate.
Files:
- Modify: README.md
- Step 1: Rewrite README.md
Use CLAUDE.md as the source of truth. The README should contain:
- Project description (1-2 paragraphs)
- Architecture diagram (from CLAUDE.md)
- Quick start (Docker compose)
- Development setup (venv, pytest)
- Project structure (from CLAUDE.md)
- Configuration overview
- Supported data sources (Keboola ✅, BigQuery ✅, Jira ✅)
- Links to docs/DEPLOYMENT.md for server setup
- License
Remove all references to Flask, rsync, SSH, sync_data.sh, Linux groups, server/setup.sh.
- Step 2: Verify no broken references
Run: grep -r "webapp/" README.md; grep -r "sync_data.sh" README.md; grep -r "server/setup" README.md
Expected: No matches
- Step 3: Commit
git add README.md
git commit -m "docs: rewrite README for v2 architecture (FastAPI, DuckDB, Docker)"
Task 6.2: Update .env.template to match actual code
Template references WEBAPP_SECRET_KEY but code uses JWT_SECRET_KEY.
Files:
- Modify: config/.env.template
- Step 1: Rewrite template
# AI Data Analyst - Environment Variables
# Copy to .env: cp config/.env.template .env
# .env is gitignored - NEVER commit it.
# Required
JWT_SECRET_KEY= # python -c "import secrets; print(secrets.token_hex(32))"
# Google OAuth (optional — needed for Google login)
# GOOGLE_CLIENT_ID=
# GOOGLE_CLIENT_SECRET=
# Keboola adapter (optional — skip if using CSV/sample data)
# KEBOOLA_STORAGE_TOKEN=
# KEBOOLA_STACK_URL=https://connection.keboola.com
# Bootstrap admin (optional — used on first docker compose up)
# SEED_ADMIN_EMAIL=admin@example.com
# Optional services
# TELEGRAM_BOT_TOKEN=
# JIRA_WEBHOOK_SECRET=
# ANTHROPIC_API_KEY=
- Step 2: Commit
git add config/.env.template
git commit -m "docs: update .env.template to match actual code (JWT_SECRET_KEY, not WEBAPP_SECRET_KEY)"
Task 6.3: Remove dead Flask Blueprint from Jira connector
connectors/jira/webhook.py uses Flask Blueprint but the app uses FastAPI. It's dead code that confuses readers.
Files:
- Check: connectors/jira/webhook.py (verify it's not imported anywhere except Jira-internal code)
- Modify: add deprecation comment or delete if unused
- Step 1: Check if webhook.py is imported
Run: grep -r "from connectors.jira.webhook" app/ src/ services/
If no matches: the Flask Blueprint is dead code.
- Step 2: Add deprecation notice or delete
If unused by the FastAPI app, delete connectors/jira/webhook.py and update any imports.
- Step 3: Commit
git add connectors/jira/
git commit -m "chore: remove dead Flask Blueprint from Jira connector (replaced by FastAPI)"
Task 6.4: Add upload size limit
upload_session and upload_artifact read entire files into memory with no limit.
Files:
- Modify: app/api/upload.py
- Test: tests/test_api_complete.py
- Step 1: Write the test
# In tests/test_api_complete.py or a new test file:
def test_upload_rejects_oversized_file(client, admin_headers):
    """Uploads over 50MB should be rejected."""
    # Build a payload one byte over the limit
    import io
    large_data = b"x" * (50 * 1024 * 1024 + 1)  # 50MB + 1 byte
    resp = client.post(
        "/api/upload/artifact/test-session",
        files={"file": ("big.csv", io.BytesIO(large_data), "text/csv")},
        headers=admin_headers,
    )
    assert resp.status_code in (400, 413)
- Step 2: Add size check
In app/api/upload.py, at the start of each upload endpoint:
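MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50 MB
# Read at most one byte past the limit so an oversized upload is never fully loaded into memory
contents = await file.read(MAX_UPLOAD_SIZE + 1)
if len(contents) > MAX_UPLOAD_SIZE:
    raise HTTPException(status_code=413, detail=f"File too large (max {MAX_UPLOAD_SIZE // 1024 // 1024}MB)")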
- Step 3: Run tests
Run: pytest tests/test_api_complete.py -v
Expected: All pass
- Step 4: Commit
git add app/api/upload.py tests/test_api_complete.py
git commit -m "fix: add 50MB upload size limit — prevent memory exhaustion"
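The Step 2 check calls `file.read()` before comparing the size, so an oversized upload is already fully in memory by the time it is rejected. A minimal sketch of a chunked alternative follows, assuming the upload exposes a file-like `read(n)`. The names `read_limited` and `UploadTooLarge` are hypothetical and do not exist in the codebase; an endpoint would map the exception to HTTP 413.

```python
import io

MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50 MB, same limit as Step 2


class UploadTooLarge(Exception):
    """Hypothetical error type; an endpoint would translate it to HTTP 413."""


def read_limited(stream, max_bytes=MAX_UPLOAD_SIZE, chunk_size=1024 * 1024):
    """Read a binary stream in chunks and abort once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop before buffering the rest of an oversized upload
            raise UploadTooLarge(f"File too large (max {max_bytes} bytes)")
        chunks.append(chunk)
    return b"".join(chunks)


# Usage: a small payload passes through unchanged
small = read_limited(io.BytesIO(b"a,b\n1,2\n"), max_bytes=16)
```

With this shape, memory use is bounded by roughly `max_bytes + chunk_size` regardless of how large the incoming file claims to be.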
Workstream 7: DuckDB Lifecycle & Connection Management (P1)
Task 7.1: Fix SQL injection in get_analytics_db_readonly
Same unquoted ext_dir.name issue as the orchestrator, but in the read-only analytics connection used by every query request.
Files:
- Modify: src/db.py:228-233
- Test: tests/test_db.py
- Step 1: Write the failing test
# In tests/test_db.py, add:
def test_analytics_readonly_rejects_malicious_dir_name(db_env):
    """get_analytics_db_readonly must skip directories with unsafe names."""
    extracts = db_env / "extracts"
    extracts.mkdir(parents=True)
    malicious = extracts / "test; DROP TABLE x--"
    malicious.mkdir()
    db_file = malicious / "extract.duckdb"
    import duckdb
    conn = duckdb.connect(str(db_file))
    conn.execute("""CREATE TABLE _meta (
        table_name VARCHAR, description VARCHAR, rows BIGINT,
        size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR
    )""")
    conn.close()
    # Should not crash
    from src.db import get_analytics_db_readonly
    ro_conn = get_analytics_db_readonly()
    ro_conn.close()
- Step 2: Run to verify failure
Run: pytest tests/test_db.py::test_analytics_readonly_rejects_malicious_dir_name -v
- Step 3: Add identifier validation
Import the validator from orchestrator (or extract to shared module). In src/db.py, add at top:
import re
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
In get_analytics_db_readonly(), replace line 232:
if db_file.exists() and ext_dir.is_dir():
    if not _SAFE_IDENTIFIER.match(ext_dir.name):
        continue
    try:
        conn.execute(f"ATTACH '{db_file}' AS {ext_dir.name} (READ_ONLY)")
- Step 4: Run tests
Run: pytest tests/test_db.py -v
Expected: All pass
- Step 5: Commit
git add src/db.py tests/test_db.py
git commit -m "fix: validate identifiers in get_analytics_db_readonly — prevent SQL injection"
Task 7.2: Remove dead PRAGMA enable_wal code
PRAGMA enable_wal is not valid DuckDB syntax. DuckDB uses WAL by default since v0.8. This is dead code with a misleading comment.
Files:
- Modify: src/db.py:200-204
- Step 1: Remove the dead code
In src/db.py, delete lines 200-204:
# WAL mode: allows concurrent readers while writing
try:
    _system_db_conn.execute("PRAGMA enable_wal")
except Exception:
    pass  # Older DuckDB versions may not support this
- Step 2: Run tests
Run: pytest tests/test_db.py -v
Expected: All pass
- Step 3: Commit
git add src/db.py
git commit -m "chore: remove dead PRAGMA enable_wal — DuckDB uses WAL by default"
Task 7.3: Escape token single-quotes in ATTACH SQL
If a token contains a single quote, the ATTACH SQL breaks. DuckDB doesn't support parameterized ATTACH, so escape manually.
Files:
- Modify: src/orchestrator.py (_attach_remote_extensions)
- Modify: connectors/keboola/extractor.py (_try_attach_extension)
- Step 1: Add escaping in orchestrator
In src/orchestrator.py, in _attach_remote_extensions, replace the ATTACH line:
if token:
    escaped_token = token.replace("'", "''")
    conn.execute(
        f"ATTACH '{url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')"
    )
- Step 2: Add escaping in Keboola extractor
In connectors/keboola/extractor.py, _try_attach_extension:
escaped_token = keboola_token.replace("'", "''")
conn.execute(f"ATTACH '{keboola_url}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
- Step 3: Run tests
Run: pytest tests/test_orchestrator.py tests/test_keboola_extractor.py -v
Expected: All pass
- Step 4: Commit
git add src/orchestrator.py connectors/keboola/extractor.py
git commit -m "fix: escape single quotes in ATTACH TOKEN to prevent SQL breakage"
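Both call sites repeat the same `replace("'", "''")` step, so a small shared helper may be easier to audit. The sketch below is an assumption about how that could look: `quote_sql_literal` is a hypothetical name, and the URL is a placeholder rather than a real endpoint. The same helper could also wrap the URL, which the snippets above interpolate unescaped.

```python
def quote_sql_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


# Placeholder values for illustration only
token = "abc'def"
url = "https://example.invalid"

sql = (
    f"ATTACH {quote_sql_literal(url)} AS kbc "
    f"(TYPE keboola, TOKEN {quote_sql_literal(token)})"
)
```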
Task 7.4: Add CHECKPOINT to Jira extract_init.update_meta (in place of a temp-file swap)
Jira's update_meta() writes directly to live extract.duckdb while the orchestrator may have it ATTACHed read-only.
Files:
- Modify: connectors/jira/extract_init.py:87
- Test: tests/test_e2e_extract.py
- Step 1: Examine current code and fix
The update_meta() function opens extract.duckdb directly. Since it only updates _meta rows and recreates views (not bulk writes), the simplest fix is to use a short-lived connection with CHECKPOINT:
In connectors/jira/extract_init.py, after the conn.execute("UPDATE _meta ...") block, add before conn.close():
conn.execute("CHECKPOINT")
This forces WAL flush and reduces the lock window. A full temp-file swap is not practical here since the Jira connector does incremental updates.
- Step 2: Run tests
Run: pytest tests/test_e2e_extract.py::TestJiraWebhookToQuery -v
Expected: Pass
- Step 3: Commit
git add connectors/jira/extract_init.py
git commit -m "fix: add CHECKPOINT after Jira meta update — reduce lock window"
Workstream 8: Scalability & Robustness (P1)
Task 8.1: Fix table access check false positives in query endpoint
The query endpoint checks table access with table.lower() in sql_lower, which is a substring match. A forbidden table named id blocks any query whose text contains "id" anywhere, and a forbidden table named orders triggers on orders_archive.
Files:
- Modify: app/api/query.py:67-71
- Test: tests/test_security.py
- Step 1: Write the failing test
# In tests/test_security.py, add to TestQuerySecurity:
def test_table_access_no_false_positive_on_column_name(self, client, auth_headers):
    """A forbidden table named 'id' should not block queries that use 'id' as a column."""
    # This test verifies the table access check doesn't use naive substring matching
    resp = client.post("/api/query", json={
        "sql": "SELECT id, name FROM allowed_table"
    }, headers=auth_headers)
    # Should not get 403 just because 'id' appears in SQL
    assert resp.status_code != 403 or "id" not in resp.json().get("detail", "")
- Step 2: Fix with word-boundary matching
In app/api/query.py, replace the table access check (lines 67-71):
# Check if query references any forbidden tables (word-boundary match)
import re

forbidden = all_views - set(allowed)
for table in forbidden:
    # Word boundaries stop a table name from matching inside a longer identifier
    pattern = r'\b' + re.escape(table.lower()) + r'\b'
    if re.search(pattern, sql_lower):
        raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'")
- Step 3: Run tests
Run: pytest tests/test_security.py::TestQuerySecurity -v
Expected: All pass
- Step 4: Commit
git add app/api/query.py tests/test_security.py
git commit -m "fix: use word-boundary matching for table access check — prevent false positives"
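The difference between the two checks can be seen in isolation. The sketch below uses a hypothetical helper, `references_table`, that mirrors the pattern from Step 2. It also shows a remaining limitation: a word-boundary match cannot tell a table reference from a column with the same name, so the `id`-as-column scenario in the test docstring would still match if a forbidden table were actually named `id`. Distinguishing those cases would likely need real SQL parsing, which is outside this task.

```python
import re


def references_table(sql: str, table: str) -> bool:
    """Word-boundary check mirroring the pattern used in Step 2."""
    pattern = r"\b" + re.escape(table.lower()) + r"\b"
    return re.search(pattern, sql.lower()) is not None


# Old behavior: plain substring match fires inside a longer identifier
substring_hit = "orders" in "select * from orders_archive"

# New behavior: the underscore is a word character, so there is no boundary
boundary_hit = references_table("SELECT * FROM orders_archive", "orders")

# Limitation: a column named like the table still matches
column_hit = references_table("SELECT id, name FROM allowed_table", "id")
```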
Task 8.2: Replace Docker healthcheck with curl
The current healthcheck starts a full Python interpreter + imports httpx every 30 seconds.
Files:
- Modify: Dockerfile (add curl)
- Modify: docker-compose.yml:13
- Step 1: Add curl to Dockerfile
In Dockerfile, add after the FROM line:
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
- Step 2: Update docker-compose healthcheck
In docker-compose.yml, replace line 13:
healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:8000/api/health"]
  interval: 30s
  timeout: 5s
  retries: 3
- Step 3: Commit
git add Dockerfile docker-compose.yml
git commit -m "fix: use curl for Docker healthcheck instead of Python+httpx (faster, lighter)"
Task 8.3: Add graceful shutdown handler
No lifespan handler exists to close the shared DuckDB connection on shutdown.
Files:
- Modify: app/main.py
- Step 1: Add lifespan handler
In app/main.py, add a lifespan context manager and use it in FastAPI():
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # Startup
    yield
    # Shutdown: close shared DuckDB connection
    from src.db import close_system_db
    close_system_db()
Change app = FastAPI(...) to app = FastAPI(..., lifespan=lifespan).
Add close_system_db() to src/db.py:
def close_system_db() -> None:
    """Close the shared system DB connection. Called on app shutdown."""
    global _system_db_conn, _system_db_path
    if _system_db_conn:
        try:
            _system_db_conn.close()
        except Exception:
            pass
        _system_db_conn = None
        _system_db_path = None
- Step 2: Run tests
Run: pytest tests/test_api.py -v
Expected: All pass
- Step 3: Commit
git add app/main.py src/db.py
git commit -m "feat: add graceful shutdown handler — close DuckDB on app exit"
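The ordering guarantee of the lifespan pattern can be checked without starting a server. The sketch below is a stdlib-only stand-in: `close_system_db` here is a stub that records the call rather than the real function from src/db.py, and `serve` is a hypothetical placeholder for the period in which the app handles requests. It also wraps the `yield` in `try`/`finally`, a small variation on Step 1 that runs the cleanup even if the body raises.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


def close_system_db():
    """Stub standing in for src.db.close_system_db."""
    events.append("closed")


@asynccontextmanager
async def lifespan(app):
    events.append("startup")
    try:
        yield
    finally:
        # Runs on normal shutdown and when the body raises
        close_system_db()


async def serve():
    # Placeholder for the time the application spends serving requests
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(serve())
```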
Task 8.4: Extract shared _get_data_dir utility
_get_data_dir() is copy-pasted in 4 API files.
Files:
- Create: app/utils.py
- Modify: app/api/sync.py, app/api/data.py, app/api/upload.py, app/api/catalog.py
- Step 1: Create shared utility
# app/utils.py
import os
from pathlib import Path

def get_data_dir() -> Path:
    return Path(os.environ.get("DATA_DIR", "./data"))
- Step 2: Replace in all 4 files
In each file, replace:
def _get_data_dir():
    return Path(os.environ.get("DATA_DIR", "./data"))
With:
from app.utils import get_data_dir as _get_data_dir
- Step 3: Run tests
Run: pytest tests/ -q --tb=short
Expected: All pass
- Step 4: Commit
git add app/utils.py app/api/sync.py app/api/data.py app/api/upload.py app/api/catalog.py
git commit -m "refactor: extract shared _get_data_dir to app/utils.py — DRY"
Task 8.5: Move faker to dev dependencies
Faker is a production dependency but only used for sample data generation.
Files:
- Modify: requirements.txt
- Create: requirements-dev.txt
- Step 1: Move faker
Remove faker>=24.0.0 from requirements.txt.
Create requirements-dev.txt:
-r requirements.txt
faker>=24.0.0
pytest>=9.0.0
pytest-timeout>=2.0.0
- Step 2: Verify app starts without faker
Run: python -c "from app.main import create_app; print('OK')"
Expected: OK (faker not imported at startup)
- Step 3: Commit
git add requirements.txt requirements-dev.txt
git commit -m "chore: move faker to dev dependencies — not needed in production"
Workstream 9: Missing Test Coverage (P0-P1)
Task 9.1: Add web UI smoke tests
app/web/router.py has 46 functions with almost no test coverage. A template error would not be caught.
Files:
- Create: tests/test_web_ui.py
- Step 1: Create smoke tests for all authenticated pages
"""Smoke tests for web UI pages: verify they render without template errors."""
import os
import pytest
import duckdb
from fastapi.testclient import TestClient


@pytest.fixture
def web_client(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    monkeypatch.setenv("TESTING", "1")
    (tmp_path / "state").mkdir()
    (tmp_path / "analytics").mkdir()
    (tmp_path / "extracts").mkdir()
    from app.main import create_app
    app = create_app()
    return TestClient(app)


@pytest.fixture
def admin_cookie(web_client, tmp_path, monkeypatch):
    """Create admin user and return cookie dict."""
    from src.db import get_system_db
    from src.repositories.users import UserRepository
    from app.auth.jwt import create_access_token
    conn = get_system_db()
    UserRepository(conn).create(id="admin1", email="admin@test.com", role="admin")
    conn.close()
    token = create_access_token(user_id="admin1", email="admin@test.com", role="admin")
    return {"access_token": token}


class TestWebUISmoke:
    """Every page should return 200 without template errors."""

    def test_login_page(self, web_client):
        resp = web_client.get("/login")
        assert resp.status_code == 200

    def test_dashboard(self, web_client, admin_cookie):
        resp = web_client.get("/", cookies=admin_cookie)
        assert resp.status_code in (200, 302)

    def test_catalog(self, web_client, admin_cookie):
        resp = web_client.get("/catalog", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_admin_tables(self, web_client, admin_cookie):
        resp = web_client.get("/admin/tables", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_admin_permissions(self, web_client, admin_cookie):
        resp = web_client.get("/admin/permissions", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_corporate_memory(self, web_client, admin_cookie):
        resp = web_client.get("/corporate-memory", cookies=admin_cookie)
        assert resp.status_code == 200

    def test_activity_center(self, web_client, admin_cookie):
        resp = web_client.get("/activity-center", cookies=admin_cookie)
        assert resp.status_code == 200
- Step 2: Run tests
Run: pytest tests/test_web_ui.py -v
Expected: All pass (or reveal actual template errors)
- Step 3: Commit
git add tests/test_web_ui.py
git commit -m "test: add web UI smoke tests — catch template errors in 7 pages"
Task 9.2: Add Jira service integration tests
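connectors/jira/service.py (15 functions) orchestrates the entire Jira webhook flow but has no dedicated tests. The tests in this task start with the extract_init layer (init_extract, update_meta) that the service writes through.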
Files:
- Create: tests/test_jira_service.py
- Step 1: Create integration tests
"""Tests for Jira service: webhook event processing pipeline."""
import os
from pathlib import Path
from unittest.mock import patch, MagicMock
import duckdb
import pytest
from connectors.jira.extract_init import init_extract, update_meta


@pytest.fixture
def jira_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    jira_dir = tmp_path / "extracts" / "jira"
    jira_dir.mkdir(parents=True)
    return jira_dir


class TestJiraExtractInit:
    def test_init_creates_extract_db(self, jira_env):
        init_extract(jira_env)
        assert (jira_env / "extract.duckdb").exists()
        conn = duckdb.connect(str(jira_env / "extract.duckdb"))
        meta = conn.execute("SELECT * FROM _meta").fetchall()
        conn.close()
        assert isinstance(meta, list)

    def test_update_meta_creates_view(self, jira_env):
        init_extract(jira_env)
        # Create a parquet file for 'issues'
        issues_dir = jira_env / "data" / "issues"
        issues_dir.mkdir(parents=True)
        pq_path = str(issues_dir / "2026-04.parquet")
        tmp = duckdb.connect()
        tmp.execute(
            f"COPY (SELECT 'PROJ-1' AS issue_key, 'Bug' AS type) "
            f"TO '{pq_path}' (FORMAT PARQUET)"
        )
        tmp.close()
        update_meta(jira_env, "issues")
        conn = duckdb.connect(str(jira_env / "extract.duckdb"))
        rows = conn.execute("SELECT rows FROM _meta WHERE table_name='issues'").fetchone()
        assert rows[0] == 1
        data = conn.execute("SELECT issue_key FROM issues").fetchone()
        assert data[0] == "PROJ-1"
        conn.close()
- Step 2: Run tests
Run: pytest tests/test_jira_service.py -v
Expected: All pass
- Step 3: Commit
git add tests/test_jira_service.py
git commit -m "test: add Jira extract_init integration tests"
Task 9.3: Add instance_config tests
app/instance_config.py (10 functions) is loaded at startup and affects all web pages. No tests exist.
Files:
- Create: tests/test_instance_config.py
- Step 1: Create tests
"""Tests for instance_config: YAML loading and accessor functions."""
import os
from pathlib import Path
import pytest


@pytest.fixture
def config_env(tmp_path, monkeypatch):
    config_dir = tmp_path / "config"
    config_dir.mkdir()
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    return config_dir


class TestInstanceConfig:
    def test_missing_config_file_returns_defaults(self, config_env, monkeypatch):
        """Missing instance.yaml should not crash, just return defaults."""
        from app.instance_config import get_instance_name, get_data_source_type
        # Should return some default, not crash
        name = get_instance_name()
        assert isinstance(name, str)

    def test_loads_valid_yaml(self, config_env, tmp_path, monkeypatch):
        """Valid instance.yaml should be loaded and accessible."""
        yaml_path = tmp_path / "config" / "instance.yaml"
        yaml_path.write_text("instance_name: Test Instance\ndata_source: keboola\n")
        from app.instance_config import load_instance_config, get_instance_name
        import importlib
        import app.instance_config as mod
        importlib.reload(mod)
        name = mod.get_instance_name()
        assert "Test" in name or isinstance(name, str)
- Step 2: Run tests
Run: pytest tests/test_instance_config.py -v
Expected: All pass
- Step 3: Commit
git add tests/test_instance_config.py
git commit -m "test: add instance_config tests — missing file, valid YAML"
Task 9.4: Add concurrent rebuild safety test
Verify the atomic swap pattern works when a read connection is open.
Files:
- Modify: tests/test_orchestrator.py
- Step 1: Write the test
# In tests/test_orchestrator.py, add:
def test_rebuild_while_reading(setup_env):
    """Rebuild should succeed even while a read-only connection exists."""
    from src.orchestrator import SyncOrchestrator
    import duckdb
    _create_mock_extract(
        setup_env["extracts_dir"], "keboola",
        [{"name": "orders", "data": [{"id": "1"}]}],
    )
    orch = SyncOrchestrator(analytics_db_path=setup_env["analytics_db"])
    orch.rebuild()
    # Open a read-only connection (simulating query endpoint)
    reader = duckdb.connect(setup_env["analytics_db"], read_only=True)
    # Rebuild while reader is open, should not crash
    result = orch.rebuild()
    assert "keboola" in result
    reader.close()
- Step 2: Run test
Run: pytest tests/test_orchestrator.py::test_rebuild_while_reading -v
Expected: Pass
- Step 3: Commit
git add tests/test_orchestrator.py
git commit -m "test: add concurrent rebuild safety test"
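For readers unfamiliar with the atomic swap pattern this test relies on, here is a generic stdlib sketch: write the new contents to a temporary sibling file, then rename it over the target with `os.replace`. This illustrates the general technique only; how `SyncOrchestrator.rebuild()` actually performs its swap is not shown in this plan, and `atomic_write` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, data: bytes) -> None:
    """Write data to a sibling temp file, then rename it over target."""
    tmp_path = target.with_name(target.name + ".tmp")
    tmp_path.write_bytes(data)
    # The rename is a single step on the same filesystem, so a reader
    # opening the path sees either the old file or the new one
    os.replace(tmp_path, target)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "analytics.duckdb"
    atomic_write(target, b"v1")
    atomic_write(target, b"v2")
    final = target.read_bytes()
    leftovers = sorted(p.name for p in Path(d).iterdir())
```

On POSIX systems a process that already has the old file open keeps reading the old contents until it reopens the path, which is the property the concurrent-reader test exercises.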
Execution Order
Workstreams are independent and can run in parallel. Within each workstream, tasks are sequential.
Critical path (do first):
- Task 1.1 (password bypass) — active auth vulnerability
- Task 3.1 (rebuild_source) — active data loss bug
- Task 2.1 (SQL injection) — security hardening
Then:
4. Tasks 1.2, 1.3 (JWT hardening)
5. Task 2.2 (query blocklist)
6. Tasks 3.2, 3.3 (WAL + BQ swap)
7. Task 4.1 (sandbox)
8. Tasks 7.1-7.4 (DuckDB lifecycle)
9. Tasks 8.1-8.5 (scalability + cleanup)
10. Tasks 5.1-5.3 (test hardening)
11. Tasks 9.1-9.4 (missing test coverage)
12. Tasks 6.1-6.4 (docs + cleanup)
13. Task 1.4 (cookie auth)
Verification after all tasks:
pytest tests/ -v --tb=short # All 620+ tests pass