Commit graph

315 commits

Author SHA1 Message Date
ZdenekSrotyr
809448e02b fix: move kbcstorage to optional dep — unblocks urllib3 security updates
kbcstorage pins urllib3<2.0.0 which blocks Dependabot security patches.
Moved to [project.optional-dependencies] keboola-legacy since the primary
extraction path uses the DuckDB Keboola extension, not kbcstorage.
Legacy fallback uses lazy import — app works without it installed.
2026-04-09 14:46:50 +02:00
ZdenekSrotyr
22b4d830e5 chore: upgrade docker actions to Node.js 24 (login-action@v4, build-push-action@v7) 2026-04-09 14:22:11 +02:00
ZdenekSrotyr
6ebfc15010 fix: setup-uv@v7 (v8 major tag doesn't exist yet) 2026-04-09 14:19:32 +02:00
ZdenekSrotyr
1ebf50bd78 fix: upgrade setup-uv@v5 → v8 (Node.js 24 native), add uv.lock
- setup-uv@v8 runs on Node.js 24 natively — no more deprecation warnings
- Removed FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 workaround (no longer needed)
- Added uv.lock for reproducible dependency resolution
2026-04-09 14:16:55 +02:00
ZdenekSrotyr
554ba0d9f2 fix: remove Kamal deploy job (no server configured), force Node.js 24 in CI
- Removed deploy-production job — Kamal config has placeholder IPs, no secrets
- Renamed workflow to "Build & Push" — test + Docker image to GHCR
- Added FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true to suppress Node.js 20 warnings
2026-04-09 14:10:37 +02:00
ZdenekSrotyr
0279cc06fa refactor: consolidate deps into pyproject.toml, remove requirements.txt
- All dependencies now in pyproject.toml [project.dependencies]
- Dev/test deps in [project.optional-dependencies] dev and [tool.uv]
- Dockerfile uses uv pip install . from pyproject.toml
- CI uses uv pip install ".[dev]"
- Deleted requirements.txt and requirements-dev.txt
- Updated README, CLAUDE.md install instructions
- Enhanced .dockerignore (exclude tests, docs, infra from image)
2026-04-09 13:17:59 +02:00
ZdenekSrotyr
fa3aef652f chore: update GitHub Actions to Node.js 24 compatible versions (checkout@v5, setup-python@v6, setup-uv@v5) 2026-04-09 12:48:14 +02:00
ZdenekSrotyr
eb7c548025 fix: add anthropic + openai to dev deps (needed by test_llm_connector) 2026-04-09 09:13:38 +02:00
ZdenekSrotyr
f9fae6e895 fix: CI installs requirements-dev.txt (faker needed for tests), set TESTING=1 2026-04-09 09:10:29 +02:00
ZdenekSrotyr
53f39bb38d chore: clean stale docs — rewrite architecture.md, remove old plans
- architecture.md rewritten for v2 (FastAPI, DuckDB, Docker) — removed
  all Flask/rsync/SSH/systemd references
- Deleted PLAN.md and REFACTORING_PLAN.md (completed, superseded)
- auto-install.md replaced with redirect to DEPLOYMENT.md
- Fixed absolute paths in superpowers plan doc
2026-04-09 09:06:13 +02:00
ZdenekSrotyr
afa84f6585 fix: web UI smoke tests — reset DuckDB singleton, get token via API 2026-04-09 07:18:17 +02:00
ZdenekSrotyr
5131816a5b test: add missing coverage for web UI, Jira extract, instance config, and concurrent rebuild
- tests/test_web_ui.py: smoke tests for all authenticated web pages (login, dashboard, catalog, corporate-memory, activity-center, admin/tables, admin/permissions)
- tests/test_jira_service.py: unit tests for extract_init and update_meta in the Jira connector
- tests/test_instance_config.py: verifies get_instance_name() returns a string when config file is absent
- tests/test_orchestrator.py: concurrent rebuild test asserting rebuild succeeds while a read-only connection holds the analytics DB
2026-04-09 07:15:14 +02:00
ZdenekSrotyr
8df8183a9f feat: add 50 MB upload size limit to session and artifact endpoints
Rejects files exceeding MAX_UPLOAD_SIZE with HTTP 413 before writing to disk.
2026-04-09 07:14:16 +02:00
ZdenekSrotyr
c20da6d744 Remove dead Flask Blueprint from Jira connector
The Flask-based webhook endpoint at connectors/jira/webhook.py is no longer used.
FastAPI handles webhooks via app/api/jira_webhooks.py which is already integrated
into the application. This removes the redundant Flask code.
2026-04-09 07:13:20 +02:00
ZdenekSrotyr
fbbb866879 Move faker to dev dependencies 2026-04-09 07:13:20 +02:00
ZdenekSrotyr
cb557baf36 chore: update .env.template to match actual code 2026-04-09 07:13:01 +02:00
ZdenekSrotyr
8e9a0c367a fix: replace os.environ direct assignment with monkeypatch.setenv in test fixtures
Prevents environment variable leaking between tests. All DATA_DIR,
JWT_SECRET_KEY, and SCRIPT_TIMEOUT assignments in fixtures now use
monkeypatch.setenv() which auto-reverts after each test. Removes
manual os.environ.pop() cleanup lines.
2026-04-09 07:11:36 +02:00
ZdenekSrotyr
c56511f69f refactor: replace local _get_data_dir() with shared app.utils.get_data_dir()
Replace copy-pasted _get_data_dir() functions in catalog.py and upload.py
with import from app.utils.get_data_dir(). sync.py and data.py already use
the shared utility.
2026-04-09 07:05:50 +02:00
ZdenekSrotyr
53a9e838f9 feat: add graceful shutdown handler
- Add close_system_db() function in src/db.py to cleanly close shared DB connection
- Add lifespan context manager in app/main.py to trigger shutdown on app exit
- Integrate lifespan into FastAPI app initialization
- All API tests pass (77/77)
2026-04-09 07:03:45 +02:00
ZdenekSrotyr
d6458a5b53 feat: add pytest timeout and strict markers
- Add --timeout=60 to pytest.ini to catch hanging tests
- Add --strict-markers to enforce declared markers only
- Install pytest-timeout>=2.0.0 dependency
- All 627 tests pass within 60s timeout
2026-04-09 07:03:44 +02:00
ZdenekSrotyr
9e5066cf1d feat: replace Docker healthcheck with curl 2026-04-09 07:03:12 +02:00
ZdenekSrotyr
1b3acce7e9 fix: replace substring table access check with word-boundary regex
Replace substring matching with word-boundary regex in query endpoint's
table access validation. Prevents false positives where short table names
like 'id' would block any query containing the word. Uses re.escape() to
safely handle special characters in table names.

- Import re module at top
- Use regex pattern with word boundaries (\b) for matching
- Add tests to verify no false positives and proper blocking
2026-04-09 07:00:48 +02:00
ZdenekSrotyr
1488e01bf9 feat: add temp-file swap to BigQuery extractor
Write to extract.duckdb.tmp, then atomically swap into place with WAL cleanup.
Prevents lock conflicts with orchestrator holding read lock on existing database.
2026-04-09 07:00:19 +02:00
ZdenekSrotyr
f25393871d fix: escape single quotes in ATTACH TOKEN parameters
- In src/orchestrator.py _attach_remote_extensions: escape token with '' before passing to ATTACH
- In connectors/keboola/extractor.py _try_attach_extension: escape token with '' before passing to ATTACH

Prevents SQL injection if token contains single quotes.
2026-04-09 07:00:13 +02:00
ZdenekSrotyr
1b219cabe9 fix: remove dead PRAGMA enable_wal code
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
2026-04-09 06:59:57 +02:00
ZdenekSrotyr
535b5fb1bf security: strip VIRTUAL_ENV/PYTHONPATH from script sandbox and block httpx
Replace inherited env vars with a minimal env dict (PATH, DATA_DIR, HOME only),
omitting VIRTUAL_ENV and PYTHONPATH to prevent subprocess access to installed
packages. Switch subprocess invocation to sys.executable so the correct
interpreter is used with the restricted PATH. Add httpx to blocked_patterns
and BLOCKED_MODULES. Add test_sandbox_cannot_import_httpx to test_security.py.
2026-04-09 06:58:26 +02:00
ZdenekSrotyr
3e9c347cf1 fix: validate extract dir name in get_analytics_db_readonly to prevent SQL injection
Adds _SAFE_IDENTIFIER regex guard before ATTACHing extract.duckdb files in the
read-only analytics connection, matching the same fix already applied in the
orchestrator. Adds test coverage for malicious directory names.
2026-04-09 06:57:31 +02:00
ZdenekSrotyr
e425d4baa5 fix: handle WAL files in atomic swap to prevent DB corruption
Add _atomic_swap_db helper that removes stale WAL files before and after
moving the temp DuckDB into place. Apply CHECKPOINT before close in both
orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.
2026-04-09 06:57:29 +02:00
ZdenekSrotyr
3321d2e266 security: reduce JWT expiry to 24h and add jti claim
Tokens previously lasted 30 days with no revocation path. Expiry is now
24 hours and every token carries a unique jti (UUID hex) to support future
revocation checks.
2026-04-09 06:57:23 +02:00
ZdenekSrotyr
23ae6a602c security: harden query endpoint SQL blocklist and disable external access
Expand blocked keywords to cover parquet_scan, read_csv_auto, query_table,
iceberg_scan, delta_scan, call, URL schemes (http/https/s3/gcs), and
additional file-scan functions. Set enable_external_access=false on the
non-read-only analytics connection path. Add three new tests covering
parquet_scan, read_csv_auto, and query_table blocking.
2026-04-09 06:54:58 +02:00
ZdenekSrotyr
4aa97c23d2 fix: raise RuntimeError on missing JWT_SECRET_KEY in non-test environments
Prevents production deployments from silently using a hardcoded default
secret. TESTING=1 still resolves to a built-in test key so the existing
test suite is unaffected. Adds a test that verifies the RuntimeError is
raised when neither JWT_SECRET_KEY nor TESTING is set.
2026-04-09 06:54:29 +02:00
ZdenekSrotyr
0d3ab5060c fix: reject unsafe SQL identifiers in orchestrator
Adds _validate_identifier() with ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$ regex and
applies it to source_name (directory names), table_name (_meta rows), and
alias/extension (_remote_attach rows) before any SQL interpolation.
Adds two tests covering SQL-injection directory names and malicious _meta entries.
2026-04-09 06:51:07 +02:00
ZdenekSrotyr
cb9c566d07 fix: rebuild_source delegates to full rebuild to preserve all source views
_do_rebuild_source was creating a fresh temp DB with only one source,
then atomically replacing analytics.duckdb — wiping views from every
other source. Now it delegates to _do_rebuild so all extract dirs are
re-attached in a single pass.

Adds test_rebuild_source_preserves_other_sources to guard the regression.
2026-04-09 06:48:25 +02:00
ZdenekSrotyr
94c6b0f839 fix: require password verification when user has password_hash in /auth/token
Previously the password check was gated on both user.password_hash and
request.password being truthy, so an attacker could omit the password
field (which defaults to "") and receive a valid JWT. Now any user with a
stored hash must supply a non-empty password that passes argon2 verification.

Adds six TestTokenEndpoint tests covering empty, missing, wrong, and correct
password, plus no-hash user and unknown user cases.
2026-04-09 06:44:31 +02:00
ZdenekSrotyr
89154d043b chore: clean repo for public release — fix references, remove drafts
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch
2026-04-08 19:27:25 +02:00
ZdenekSrotyr
79443e0df4 fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type
  inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience:
  deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
2026-04-08 19:09:55 +02:00
ZdenekSrotyr
2635f77974 ci: add CI test suite + deploy pipeline
- ci.yml: runs 607 tests + Docker build on push/PR
- deploy.yml: tests → build → GHCR push → Kamal deploy on main
2026-04-08 18:24:05 +02:00
ZdenekSrotyr
cfa08c4b4c chore: remove obsolete CI workflows (deploy-guard, deploy.yml.example)
deploy-guard.yml referenced deleted tests and sudoers files.
deploy.yml.example used legacy SSH-based deployment.
Updated ci.yml and deploy.yml are in .gitignore (need workflow scope to push).
2026-04-08 18:16:48 +02:00
ZdenekSrotyr
3ba207a7f8 feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
2026-04-08 18:13:31 +02:00
ZdenekSrotyr
06e1cf0a8d feat: generic _remote_attach contract for remote DuckDB extension views
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.

- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
2026-04-08 18:10:12 +02:00
ZdenekSrotyr
ee7d5630ef fix: keep external_access enabled — views need read_parquet on local files
File access attacks blocked by SQL blocklist instead of DuckDB pragma
(pragma also blocks legitimate view resolution via read_parquet).
2026-04-08 12:33:05 +02:00
ZdenekSrotyr
f2f9a62803 fix: set enable_external_access=false AFTER ATTACHing extracts 2026-04-08 12:29:27 +02:00
ZdenekSrotyr
6efdf4ca64 fix: read-only analytics DB ATTACHes extract.duckdb files for view resolution 2026-04-08 12:27:12 +02:00
ZdenekSrotyr
a0f7e98f11 chore: add auth/ to gitignore (legacy dir, not tracked) 2026-04-08 12:12:37 +02:00
ZdenekSrotyr
92fbb88c15 chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs 2026-04-08 12:10:47 +02:00
ZdenekSrotyr
05a1b452e9 security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation) 2026-04-08 12:09:19 +02:00
ZdenekSrotyr
224635b88d security: fix auth (argon2, cookie, JWT), CORS, session middleware, pyproject.toml 2026-04-08 12:08:52 +02:00
ZdenekSrotyr
d5659d7091 fix: login page uses login_buttons format expected by template 2026-04-08 07:11:03 +02:00
ZdenekSrotyr
67a1e0bb45 feat: Jira webhook FastAPI adapter — replaces Flask Blueprint 2026-04-08 07:04:50 +02:00
ZdenekSrotyr
3e3f84a00e feat: dynamic login providers + profiler auto-trigger + refresh endpoint 2026-04-08 07:04:40 +02:00