# Connector Kit — Design Spec

**Date:** 2026-04-14
**Status:** Draft
**Scope:** Standardized connector SDK replacing ad-hoc extractor implementations
**Issue:** [#5 — RFC: Connector SDK](https://github.com/keboola/agnes-the-ai-analyst/issues/5)
**POC:** `tests/test_connector_kit_poc.py` (29/29 passing)

---

## 1. Problem Statement

The platform currently has three connectors (Keboola, BigQuery, Jira), each written ad-hoc with different interfaces:

| Connector | Entry point | Capabilities | Lines |
|-----------|-------------|-------------|-------|
| Keboola | `run(output_dir, table_configs, url, token)` | batch + remote | ~300 |
| BigQuery | `init_extract(output_dir, project_id, table_configs)` | remote only | ~150 |
| Jira | `init_extract(output_dir)` + `update_meta(output_dir, table)` | batch + webhook | ~200 |

All three produce `extract.duckdb` with `_meta` tables, but each re-implements:
- DuckDB file creation and atomic swap with WAL cleanup
- `_meta` table management (slightly different schemas across connectors)
- `_remote_attach` table (duplicated SQL)
- Error handling and progress reporting
- Parquet writing logic

Adding a new connector requires studying existing implementations and copying ~100 lines of boilerplate. There is no formal interface, no discovery mechanism, no schema evolution tracking, and no contract tests.
|
|
|
|
### Design goals
|
|
|
|
1. **New connector in ~50-80 lines** — author writes only API-specific code
|
|
2. **Formal contract** — Python Protocol with explicit capabilities
|
|
3. **Discovery built-in** — `discover()` returns available tables + Arrow schemas
|
|
4. **Schema evolution** — automatic detection of added/removed/changed columns
|
|
5. **Backward compatible** — existing connectors keep working, migrate incrementally
|
|
6. **Tested** — contract tests that any connector can run against itself
|
|
|
|
### Non-goals
|
|
|
|
- Replacing DuckDB as the query engine
|
|
- Building a full ETL framework (we are not dlt/Airbyte)
|
|
- Supporting non-Python connectors (future consideration, not this spec)
|
|
- SQL translation layer (we are not CData — DuckDB IS our SQL engine)
|
|
|
|
---
|
|
|
|
## 2. Architecture

### Layer model

```
┌──────────────────────────────────────────────────────────────┐
│  Layer 3: ConnectorRuntime                                   │
│  extract.duckdb lifecycle, schema tracking, state mgmt,      │
│  retry, progress reporting, contract tests, CLI scaffold     │
├──────────────────────────────────────────────────────────────┤
│  Layer 2: Connector Protocol                                 │
│  discover() → read() → stream() → remote()                   │
│  Python Protocol — implement only what you support           │
├──────────────────────────────────────────────────────────────┤
│  Layer 1: API client (external, not our concern)             │
│  HTTP calls, auth, pagination — raw data from source         │
│  May be hand-written or generated via driver_builder         │
└──────────────────────────────────────────────────────────────┘
```

### Data flow

```
Connector.discover()
       │
       ▼
ConnectorRuntime.run()
  ├─ Cap.READ   → Connector.read(table, options) → Iterator[pa.RecordBatch]
  │                      │
  │               ParquetBatchWriter
  │                      │
  │               data/{table}.parquet
  │
  ├─ Cap.STREAM → Connector.stream(table) → AsyncIterator[pa.RecordBatch]
  │                      │
  │               PartitionedParquetWriter
  │                      │
  │               data/{table}/YYYY-MM.parquet
  │
  ├─ Cap.REMOTE → Connector.remote() → RemoteAttachInfo
  │                      │
  │               _remote_attach table
  │
  └─ finalize   → extract.duckdb (_meta + views, atomic swap)
                         │
                  SyncOrchestrator.rebuild() (unchanged)
                         │
                  analytics.duckdb
```

### Relationship to existing code

| Current | After Connector Kit | Change |
|---------|---------------------|--------|
| `connectors/keboola/extractor.py:run()` | `KeboolaConnector` class + `ConnectorRuntime` | Refactor |
| `connectors/bigquery/extractor.py:init_extract()` | `BigQueryConnector` class + `ConnectorRuntime` | Refactor |
| `connectors/jira/extract_init.py` + `webhook.py` | `JiraConnector` class + `ConnectorRuntime` | Refactor |
| `src/orchestrator.py` | Unchanged — still reads extract.duckdb | No change |
| `app/api/sync.py` subprocess pattern | Updated to use `ConnectorRuntime.run()` | Minor change |

---

## 3. Connector Protocol

### 3.1 Capability flags

```python
# File: src/connector_kit/protocol.py

from enum import Flag, auto

class Cap(Flag):
    """Capabilities a connector can declare.

    Uses Flag enum for composability: Cap.READ | Cap.DISCOVER
    Check membership: Cap.READ in connector.capabilities
    Iterate: list(connector.capabilities) → individual flags (Python 3.11+)
    """
    DISCOVER = auto()  # Can list tables + schemas from source
    READ = auto()      # Can download data in batches (full or incremental)
    STREAM = auto()    # Can receive continuous changes (webhooks, CDC)
    REMOTE = auto()    # Can configure DuckDB extension pass-through
    WRITE = auto()     # Can push data back to source
```

**Design decision: `Flag` over `set[str]`.**
Flag enum is type-safe, composable (`|`, `in`), iterable, and serializable to/from YAML via name mapping. Validated in POC test `TestCapabilityFlags`.

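
The name mapping mentioned above can be sketched as follows. This is a minimal illustration, not the kit's implementation: `caps_from_names` and `caps_to_names` are hypothetical helper names, and the list of lowercase strings stands in for what a YAML `capabilities:` entry would parse to.

```python
from enum import Flag, auto


class Cap(Flag):
    DISCOVER = auto()
    READ = auto()
    STREAM = auto()
    REMOTE = auto()
    WRITE = auto()


def caps_from_names(names):
    """Fold lowercase capability names into a single Cap value (hypothetical helper)."""
    by_name = {name.lower(): member for name, member in Cap.__members__.items()}
    caps = Cap(0)
    for name in names:
        caps |= by_name[name]  # KeyError on an unknown capability name
    return caps


def caps_to_names(caps):
    """Inverse mapping: Cap value back to lowercase names, in declaration order."""
    return [name.lower() for name, member in Cap.__members__.items() if member in caps]


caps = caps_from_names(["discover", "read", "remote"])
print(Cap.READ in caps)     # True
print(Cap.STREAM in caps)   # False
print(caps_to_names(caps))  # ['discover', 'read', 'remote']
```

Going through `Cap.__members__` rather than iterating the flag value keeps the round trip independent of the Python version.
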
### 3.2 Data types

```python
# File: src/connector_kit/protocol.py

from dataclasses import dataclass, field
import pyarrow as pa

@dataclass
class TableInfo:
    """Describes a table available in the source."""
    name: str                              # View name in analytics.duckdb
    schema: pa.Schema                      # Arrow schema with types + nullability
    capabilities: Cap                      # Per-table capabilities (subset of connector caps)
    primary_key: list[str] | None = None   # For merge/upsert strategies
    description: str = ""                  # Human-readable, stored in _meta

@dataclass
class ReadOptions:
    """Options passed to read() — runtime builds these from state + config."""
    columns: list[str] | None = None       # Projection pushdown (None = all)
    filter: dict | None = None             # Filter pushdown: {"date": {">=": "2026-01-01"}}
    incremental_key: str | None = None     # Column name for incremental extraction
    incremental_value: str | None = None   # Last known value (from previous run state)
    batch_size: int = 10_000               # Rows per RecordBatch yield

@dataclass
class RemoteAttachInfo:
    """Configuration for DuckDB extension pass-through."""
    extension: str    # DuckDB extension name: 'keboola', 'bigquery'
    url: str          # Connection string for ATTACH
    token_env: str    # Environment variable name holding auth token (NOT the token)
    alias: str = ""   # DuckDB alias; defaults to extension name

@dataclass
class ExtractStats:
    """Returned by ConnectorRuntime.run() — replaces ad-hoc result dicts."""
    tables_extracted: int = 0
    tables_failed: int = 0
    total_rows: int = 0
    schema_changes: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
```

**Why Arrow schema?**
- DuckDB consumes Arrow zero-copy (`SELECT * FROM batch`)
- Schema evolution is diffable: added/removed fields, type changes
- Cross-language (Rust, C++ connectors can produce Arrow)
- Arrow batches write straight to Parquet via `pyarrow.parquet`, with no intermediate row conversion (Parquet is a separate on-disk format, not Arrow itself)
- Validated in POC: `TestArrowIntegration` (3 tests)

### 3.3 Protocol definition

```python
# File: src/connector_kit/protocol.py

from typing import Protocol, Iterator, AsyncIterator, runtime_checkable

@runtime_checkable
class Connector(Protocol):
    """
    Structural typing contract for connectors.

    Implement only the methods matching your declared capabilities.
    The runtime checks capabilities before calling methods, so unimplemented
    methods are never invoked.

    Why Protocol over ABC:
    - Structural subtyping (duck typing) — no inheritance required
    - isinstance() check works at runtime via @runtime_checkable
    - Partial implementation is natural — no NotImplementedError stubs
    - Plays well with dataclasses and existing code
    """

    @property
    def capabilities(self) -> Cap:
        """Declare what this connector supports. Required by all connectors."""
        ...

    def discover(self) -> list[TableInfo]:
        """List available tables in the source with their schemas.

        Called by runtime before extraction to:
        - Auto-populate table list if none specified
        - Detect schema evolution (compare with previous run)
        - Provide discovery in CLI: `da connector discover <name>`

        Required when: Cap.DISCOVER in capabilities
        """
        ...

    def read(self, table: str, options: ReadOptions) -> Iterator[pa.RecordBatch]:
        """Extract data from a table as Arrow RecordBatch stream.

        MUST yield RecordBatch objects — not dicts, not DataFrames.
        Each batch should contain `options.batch_size` rows (approximately).
        The runtime writes batches to Parquet incrementally (constant memory).

        For incremental extraction:
        - Check options.incremental_key and options.incremental_value
        - Only yield rows where incremental_key > incremental_value
        - Runtime tracks state between runs automatically

        Required when: Cap.READ in capabilities
        """
        ...

    def stream(self, table: str) -> AsyncIterator[pa.RecordBatch]:
        """Receive continuous changes as Arrow RecordBatch stream.

        Each yield = one event or micro-batch of events.
        Runtime handles:
        - Writing to partitioned parquets (YYYY-MM.parquet)
        - File locking for concurrent webhook writes
        - _meta updates after each write

        Required when: Cap.STREAM in capabilities
        """
        ...

    def remote(self) -> RemoteAttachInfo:
        """Provide DuckDB extension pass-through configuration.

        The runtime writes this to _remote_attach table in extract.duckdb.
        The orchestrator reads it and re-ATTACHes the extension at query time.

        IMPORTANT: Never include actual tokens — only env var names.

        Required when: Cap.REMOTE in capabilities
        """
        ...
```

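
One detail of `@runtime_checkable` is worth illustrating: `isinstance()` only verifies that each declared member exists as an attribute; it does not check signatures or capability flags. The sketch below uses hypothetical stand-ins (`HasCapabilities`, `FullConnector`, `StreamOnly`, and an `int` in place of `Cap`) rather than the kit's real classes.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasCapabilities(Protocol):
    """Narrow protocol: only the always-required member."""

    @property
    def capabilities(self) -> int: ...


@runtime_checkable
class FullConnector(Protocol):
    """Wider protocol: also declares discover() and read()."""

    @property
    def capabilities(self) -> int: ...

    def discover(self) -> list: ...

    def read(self, table: str, options: object): ...


class StreamOnly:
    """A partial implementation: no discover(), no read()."""

    capabilities = 0b100  # stand-in for Cap.STREAM

    async def stream(self, table: str):
        yield {}


obj = StreamOnly()
print(isinstance(obj, HasCapabilities))  # True
print(isinstance(obj, FullConnector))    # False (discover/read are missing)
```

Under these assumptions, a capability check such as `Cap.READ in connector.capabilities` is the dependable gate for optional methods, while `isinstance()` works best against a protocol that declares only what every connector must have.
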
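**Validated in POC:** `TestProtocolCompliance` confirms `isinstance(connector, Connector)` works, and partial implementations (e.g., stream-only connector without `read()`) are accepted. Keep in mind that `isinstance()` against a `@runtime_checkable` protocol only checks that every declared member exists as an attribute (signatures are not inspected), so the declared capability flags remain the reliable gate for optional methods.

---
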
## 4. Connector Manifest

### 4.1 Format

Each connector has a `connector.yaml` in its directory:

```yaml
# File: connectors/{name}/connector.yaml

name: keboola                    # Unique identifier, matches directory name
version: "1.0.0"                 # Semver
description: "Keboola Storage connector — batch extraction and remote query"
entrypoint: connectors.keboola.connector.KeboolaConnector  # Python import path

capabilities: [discover, read, remote]  # Maps to Cap flags

auth:
  type: token                    # token | oauth | basic | service_account | none
  env_vars:
    - name: KEBOOLA_STORAGE_TOKEN
      required: true
      description: "Keboola Storage API token"

config:                          # Connector-specific config (JSON Schema subset)
  url:
    type: string
    format: uri
    required: true
    description: "Keboola stack URL (e.g., https://connection.keboola.com)"
  bucket:
    type: string
    required: false
    description: "Default bucket for table extraction"

health_check:                    # Optional: runtime calls before extraction
  endpoint: "${url}/v2/storage"
  method: GET
  headers:
    X-StorageApi-Token: "${KEBOOLA_STORAGE_TOKEN}"
  expect_status: 200
  timeout_seconds: 10
```

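
The spec does not say how the `${...}` placeholders in `health_check` are resolved. One plausible reading, sketched here purely as an assumption, is that they are filled from the connector config plus the declared environment variables. `render_health_check` is a hypothetical helper; `string.Template` is used because it understands the `${name}` syntax out of the box.

```python
from string import Template


def render_health_check(health_check: dict, config: dict, env: dict) -> dict:
    """Resolve ${...} placeholders from config values and environment variables.

    Assumption: config keys and env var names share one namespace.
    Template.substitute raises KeyError if a placeholder has no value.
    """
    values = {**config, **env}
    return {
        "method": health_check.get("method", "GET"),
        "url": Template(health_check["endpoint"]).substitute(values),
        "headers": {
            name: Template(value).substitute(values)
            for name, value in health_check.get("headers", {}).items()
        },
        "expect_status": health_check.get("expect_status", 200),
        "timeout_seconds": health_check.get("timeout_seconds", 10),
    }


request = render_health_check(
    {
        "endpoint": "${url}/v2/storage",
        "method": "GET",
        "headers": {"X-StorageApi-Token": "${KEBOOLA_STORAGE_TOKEN}"},
        "expect_status": 200,
        "timeout_seconds": 10,
    },
    config={"url": "https://connection.keboola.com"},
    env={"KEBOOLA_STORAGE_TOKEN": "dummy-token"},  # placeholder value, not a real token
)
print(request["url"])      # https://connection.keboola.com/v2/storage
print(request["headers"])  # {'X-StorageApi-Token': 'dummy-token'}
```
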
### 4.2 Manifest loading

```python
# File: src/connector_kit/manifest.py

import importlib
from dataclasses import dataclass
from pathlib import Path

import yaml

from src.connector_kit.protocol import Cap, Connector

@dataclass
class ConnectorManifest:
    name: str
    version: str
    description: str
    entrypoint: str
    capabilities: Cap
    auth: dict
    config: dict
    health_check: dict | None = None

    @classmethod
    def load(cls, path: Path) -> "ConnectorManifest":
        """Load and validate connector.yaml."""
        data = yaml.safe_load(path.read_text())
        # Map capability strings to Cap flags
        cap_map = {c.name.lower(): c for c in Cap}
        caps = Cap(0)
        for c in data["capabilities"]:
            if c not in cap_map:
                raise ValueError(f"Unknown capability: {c}. Valid: {list(cap_map)}")
            caps |= cap_map[c]
        return cls(
            name=data["name"],
            version=data["version"],
            description=data["description"],
            entrypoint=data["entrypoint"],
            capabilities=caps,
            auth=data.get("auth", {}),
            config=data.get("config", {}),
            health_check=data.get("health_check"),
        )

    def instantiate(self, config: dict) -> Connector:
        """Import and instantiate the connector class."""
        module_path, class_name = self.entrypoint.rsplit(".", 1)
        module = importlib.import_module(module_path)
        cls = getattr(module, class_name)
        return cls(config)
```

**Validated in POC:** `TestManifestValidation` (5 tests) confirms YAML parsing, capability mapping, auth config, and health check extraction.

---

## 5. Connector Runtime

### 5.1 Responsibilities

The runtime replaces all boilerplate currently duplicated across connectors:

| Responsibility | Currently | Runtime handles |
|----------------|-----------|-----------------|
| Create output_dir + data/ | Each connector | `__init__()` |
| Create extract.duckdb | Each connector | `_build_extract_db()` |
| Create _meta table | Each connector (slightly different schemas) | `_build_extract_db()` |
| Create _remote_attach | Keboola + BigQuery | `_write_remote_attach()` |
| Write parquets from data | Each connector | `_extract_table()` |
| Atomic swap + WAL cleanup | Each connector | `_atomic_swap()` |
| Error handling per table | Each connector | `run()` try/except loop |
| Schema tracking | Nobody | `_check_schema_evolution()` |
| Incremental state | Nobody (Jira has manual partitioning) | `_save_state()` / `_load_state()` |
| Progress reporting | Nobody | `_report_progress()` callback |

### 5.2 Implementation

```python
# File: src/connector_kit/runtime.py

import logging
import re
from pathlib import Path
from typing import Callable

from src.connector_kit.protocol import Cap, Connector, ExtractStats, ReadOptions, TableInfo

logger = logging.getLogger(__name__)

# \Z (not $) anchors at the true end of the string; $ would also accept a trailing newline.
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}\Z")

class ConnectorRuntime:
    """Manages the extract.duckdb lifecycle for any Connector implementation."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir
        self.data_dir = output_dir / "data"
        self.db_path = output_dir / "extract.duckdb"
        self.state_path = output_dir / ".state.yaml"
        self.data_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _validate_identifier(name: str) -> bool:
        """Validate DuckDB identifier (same rule as src/orchestrator.py)."""
        return bool(_SAFE_IDENTIFIER.match(name))

    def run(
        self,
        connector: Connector,
        tables: list[str] | None = None,
        on_progress: Callable[[str, int], None] | None = None,
    ) -> ExtractStats:
        """Execute the full extraction pipeline.

        Args:
            connector: Any object satisfying the Connector protocol.
            tables: Specific tables to extract. None = auto-discover all.
            on_progress: Optional callback(table_name, rows_so_far).

        Returns:
            ExtractStats with counts, errors, and schema changes.
        """
        stats = ExtractStats()

        # --- Phase 1: Discovery ---
        available: list[TableInfo] = []
        if Cap.DISCOVER in connector.capabilities:
            available = connector.discover()

        if tables is None:
            tables = [t.name for t in available if Cap.READ in t.capabilities]

        # Validate all table names (SQL injection prevention)
        for name in tables:
            if not self._validate_identifier(name):
                raise ValueError(f"Invalid table name: {name!r} (must match {_SAFE_IDENTIFIER.pattern})")

        # --- Phase 2: Schema evolution check ---
        for table_name in tables:
            table_info = self._find_table(available, table_name)
            if table_info:
                change = self._check_schema_evolution(table_name, table_info.schema)
                if change:
                    stats.schema_changes.append(change)

        # --- Phase 3: Batch extraction ---
        if Cap.READ in connector.capabilities:
            for table_name in tables:
                try:
                    options = self._build_read_options(table_name)
                    rows = self._extract_table(connector, table_name, options, on_progress)
                    stats.tables_extracted += 1
                    stats.total_rows += rows
                except Exception as e:
                    stats.tables_failed += 1
                    stats.errors.append(f"{table_name}: {e}")
                    logger.exception("Failed to extract table %s", table_name)

        # --- Phase 4: Remote attach ---
        if Cap.REMOTE in connector.capabilities:
            try:
                info = connector.remote()
                self._write_remote_attach(info)
            except Exception as e:
                stats.errors.append(f"remote_attach: {e}")

        # --- Phase 5: Build extract.duckdb ---
        self._build_extract_db(available, tables)

        # --- Phase 6: Save state ---
        self._save_state(tables)

        return stats
```

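
A note on anchoring identifier patterns: in Python, `$` also matches just before a trailing newline, while `\Z` matches only at the true end of the string. The sketch below compares the two anchors on the identifier pattern from this section; the names `DOLLAR_ANCHORED` and `STRICT_ANCHORED` are illustrative only.

```python
import re

# Same character rules, different end anchors.
DOLLAR_ANCHORED = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
STRICT_ANCHORED = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}\Z")

name_with_newline = "orders\n"

print(bool(DOLLAR_ANCHORED.match(name_with_newline)))  # True: "$" tolerates the trailing "\n"
print(bool(STRICT_ANCHORED.match(name_with_newline)))  # False
print(bool(STRICT_ANCHORED.match("orders")))           # True
```

Since validated names are later interpolated into SQL text and file paths, the stricter anchor is the safer default for an allowlist.
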
### 5.3 Extract table (Arrow → Parquet)

```python
    def _extract_table(
        self,
        connector: Connector,
        table: str,
        options: ReadOptions,
        on_progress: Callable | None,
    ) -> int:
        """Extract via Arrow RecordBatch iterator → single Parquet file.

        Memory usage is constant regardless of table size — each batch
        is written and then discarded. Validated with 100K rows in POC.
        """
        parquet_path = self.data_dir / f"{table}.parquet"
        writer: pq.ParquetWriter | None = None
        total_rows = 0

        try:
            for batch in connector.read(table, options):
                if writer is None:
                    # Lazy init: the first batch supplies the file schema
                    writer = pq.ParquetWriter(
                        str(parquet_path),
                        batch.schema,
                        compression="zstd",
                    )
                writer.write_batch(batch)
                total_rows += batch.num_rows
                if on_progress:
                    on_progress(table, total_rows)
        finally:
            if writer:
                writer.close()

        return total_rows
```

**Key details:**
- `compression="zstd"` — a good compression/speed tradeoff for analytical data
- Writer is lazy-initialized from the first batch schema, so a table that yields no batches produces no Parquet file
- `finally` ensures parquet file is properly closed even on errors
- Validated in POC: `TestLargeDataBatching` (100 batches x 1000 rows)

### 5.4 Build extract.duckdb

```python
    def _build_extract_db(self, available: list[TableInfo], tables: list[str]):
        """Build extract.duckdb with _meta + views. Atomic swap.

        Produces the same contract as current connectors — orchestrator
        sees no difference. _meta schema matches existing convention with
        one addition: schema_json for evolution tracking.
        """
        tmp_db = self.output_dir / "extract.duckdb.tmp"
        if tmp_db.exists():
            tmp_db.unlink()

        con = duckdb.connect(str(tmp_db))
        try:
            # _meta table — matches existing schema + schema_json column
            con.execute("""
                CREATE TABLE _meta (
                    table_name VARCHAR NOT NULL,
                    description VARCHAR,
                    rows BIGINT,
                    size_bytes BIGINT,
                    extracted_at TIMESTAMP DEFAULT current_timestamp,
                    query_mode VARCHAR DEFAULT 'local',
                    schema_json VARCHAR
                )
            """)

            # _remote_attach table (if .remote_attach.yaml exists)
            ra_path = self.output_dir / ".remote_attach.yaml"
            if ra_path.exists():
                ra = yaml.safe_load(ra_path.read_text())
                con.execute("""
                    CREATE TABLE _remote_attach (
                        alias VARCHAR,
                        extension VARCHAR,
                        url VARCHAR,
                        token_env VARCHAR
                    )
                """)
                con.execute(
                    "INSERT INTO _remote_attach VALUES (?, ?, ?, ?)",
                    [
                        ra.get("alias") or ra["extension"],
                        ra["extension"],
                        ra["url"],
                        ra["token_env"],
                    ],
                )

            # Views and _meta entries for each extracted table
            for table_name in tables:
                info = self._find_table(available, table_name)
                parquet_path = self.data_dir / f"{table_name}.parquet"
                if parquet_path.exists():
                    con.execute(
                        f'CREATE VIEW "{table_name}" AS '
                        f"SELECT * FROM read_parquet('{parquet_path}')"
                    )
                    rows = con.execute(
                        f'SELECT count(*) FROM "{table_name}"'
                    ).fetchone()[0]
                    size = parquet_path.stat().st_size
                elif info is not None and Cap.REMOTE in info.capabilities:
                    # Remote-only table — no parquet, just _meta entry
                    rows = 0
                    size = 0
                else:
                    continue

                desc = info.description if info else ""
                schema_str = info.schema.to_string() if info else ""

                con.execute(
                    "INSERT INTO _meta VALUES (?, ?, ?, ?, current_timestamp, ?, ?)",
                    [table_name, desc, rows, size, "local", schema_str],
                )

            con.execute("CHECKPOINT")
        finally:
            con.close()

        # Atomic swap (same pattern as existing connectors)
        self._atomic_swap(tmp_db, self.db_path)
```

### 5.5 Atomic swap

```python
    @staticmethod
    def _atomic_swap(tmp_path: Path, target_path: Path):
        """Atomic DB swap with WAL cleanup.

        Same intent as the pattern used by existing connectors: readers
        on the old file continue uninterrupted (Unix inode semantics).
        The old DB is not unlinked first; replace() overwrites it in one
        step, so there is no window in which extract.duckdb is missing.
        """
        # Remove old WAL so it is never replayed against the new DB
        old_wal = Path(str(target_path) + ".wal")
        if old_wal.exists():
            old_wal.unlink()

        # Clean temp WAL before move
        tmp_wal = Path(str(tmp_path) + ".wal")
        if tmp_wal.exists():
            tmp_wal.unlink()

        # Atomic move: Path.replace() overwrites an existing target
        tmp_path.replace(target_path)
```

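
The inode behaviour that the swap relies on can be observed with the standard library alone. This is a sketch under stated assumptions: plain byte files stand in for DuckDB files, `atomic_publish` is a hypothetical helper, and the "reader keeps the old content" result holds on POSIX filesystems (on Windows, replacing a file that is still open typically fails).

```python
import os
import tempfile
from pathlib import Path


def atomic_publish(tmp_path: Path, target_path: Path) -> None:
    """Replace target_path with tmp_path in one step; both must be on the same filesystem."""
    os.replace(tmp_path, target_path)


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "extract.duckdb"
    tmp = Path(workdir) / "extract.duckdb.tmp"

    target.write_bytes(b"old")
    reader = target.open("rb")        # a reader that opened the file before the swap

    tmp.write_bytes(b"new")
    atomic_publish(tmp, target)

    new_content = target.read_bytes() # what a fresh open sees after the swap
    old_content = reader.read()       # what the pre-existing reader still sees
    tmp_exists_after = tmp.exists()
    reader.close()

print(new_content, old_content, tmp_exists_after)
```

Because the target path always points at either the complete old file or the complete new one, a concurrent open never observes a missing or half-written database.
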
### 5.6 Schema evolution detection

```python
    def _check_schema_evolution(self, table: str, new_schema: pa.Schema) -> str | None:
        """Compare Arrow schemas between runs. Returns human-readable diff or None.

        Serializes schemas via Arrow IPC stream format (compatible with all
        PyArrow versions including 23.x). Validated in POC: TestSchemaEvolution.
        """
        schema_file = self.output_dir / f".schema_{table}.arrow"

        if schema_file.exists():
            reader = pa.ipc.open_stream(schema_file.read_bytes())
            old_schema = reader.schema

            if old_schema != new_schema:
                old_names = set(old_schema.names)
                new_names = set(new_schema.names)
                # sorted() keeps the report deterministic across runs
                added = sorted(new_names - old_names)
                removed = sorted(old_names - new_names)

                parts = [f"{table}:"]
                if added:
                    parts.append(f"added {added}")
                if removed:
                    parts.append(f"removed {removed}")
                for name in sorted(old_names & new_names):
                    old_t = old_schema.field(name).type
                    new_t = new_schema.field(name).type
                    if old_t != new_t:
                        parts.append(f"{name}: {old_t} → {new_t}")

                self._save_schema(table, new_schema)
                return " ".join(parts)

        # First run or no change
        self._save_schema(table, new_schema)
        return None

    def _save_schema(self, table: str, schema: pa.Schema):
        schema_file = self.output_dir / f".schema_{table}.arrow"
        sink = pa.BufferOutputStream()
        # An IPC stream with no batches still carries the schema header
        writer = pa.ipc.new_stream(sink, schema)
        writer.close()
        schema_file.write_bytes(sink.getvalue().to_pybytes())
```

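
The diff logic itself does not depend on Arrow and can be tried in isolation. In the sketch below, plain dicts mapping column name to a type name stand in for `pa.Schema`; `diff_schema` is a hypothetical helper written for illustration, not part of the runtime.

```python
def diff_schema(table: str, old: dict, new: dict):
    """Describe added, removed, and retyped columns, or return None if nothing changed."""
    if old == new:
        return None
    parts = [f"{table}:"]
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    if added:
        parts.append(f"added {added}")
    if removed:
        parts.append(f"removed {removed}")
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            parts.append(f"{name}: {old[name]} -> {new[name]}")
    return " ".join(parts)


old = {"id": "int64", "amount": "int32", "note": "string"}
new = {"id": "int64", "amount": "double", "created_at": "timestamp[us]"}

print(diff_schema("orders", old, new))
# orders: added ['created_at'] removed ['note'] amount: int32 -> double
print(diff_schema("orders", old, old))
# None
```
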
### 5.7 Incremental state management

```python
    # Requires: from datetime import datetime, timezone

    def _build_read_options(self, table: str) -> ReadOptions:
        """Build ReadOptions with incremental state from previous run."""
        state = self._load_state()
        options = ReadOptions()
        if table in state:
            options.incremental_key = state[table].get("incremental_key")
            options.incremental_value = state[table].get("incremental_value")
        return options

    def _load_state(self) -> dict:
        if self.state_path.exists():
            return yaml.safe_load(self.state_path.read_text()) or {}
        return {}

    def _save_state(self, tables: list[str]):
        state = self._load_state()
        for table in tables:
            if table not in state:
                state[table] = {}
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            state[table]["last_extracted"] = datetime.now(timezone.utc).isoformat()
        self.state_path.write_text(yaml.dump(state, default_flow_style=False))
```

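
On the connector side, the `read()` contract asks for only those rows whose incremental key is greater than the last known value. A minimal sketch of that filter on plain dict rows follows; `filter_incremental` is a hypothetical helper, the values are ISO-8601 strings (which compare correctly as text), and a real connector would normally push the predicate down to the source API instead of filtering locally. The high-water mark computed at the end is what a runtime could persist as the next `incremental_value`; that persistence step is an assumption, not something the state code above performs.

```python
def filter_incremental(rows, incremental_key, incremental_value):
    """Keep rows strictly newer than incremental_value; keep everything on a first run."""
    if incremental_key is None or incremental_value is None:
        return list(rows)
    return [row for row in rows if str(row[incremental_key]) > incremental_value]


rows = [
    {"id": 1, "updated": "2026-01-01T00:00:00"},
    {"id": 2, "updated": "2026-02-01T00:00:00"},
    {"id": 3, "updated": "2026-03-01T00:00:00"},
]

fresh = filter_incremental(rows, "updated", "2026-01-15T00:00:00")
high_water_mark = max(row["updated"] for row in fresh)

print([row["id"] for row in fresh])  # [2, 3]
print(high_water_mark)               # 2026-03-01T00:00:00
```
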
### 5.8 Streaming support

```python
    async def run_stream(
        self,
        connector: Connector,
        table: str,
        event_data: dict,
    ) -> int:
        """Process a single stream event (e.g., webhook payload).

        Called by webhook handlers. Writes to partitioned parquets
        (YYYY-MM.parquet) matching existing Jira pattern.

        Returns number of rows written.
        """
        if Cap.STREAM not in connector.capabilities:
            raise ValueError("Connector does not support streaming")

        table_dir = self.data_dir / table
        table_dir.mkdir(parents=True, exist_ok=True)

        rows_written = 0
        async for batch in connector.stream(table):
            # Partition by the current UTC month
            partition = datetime.now(timezone.utc).strftime("%Y-%m")
            parquet_path = table_dir / f"{partition}.parquet"

            if parquet_path.exists():
                # Append to existing partition (read, concatenate, rewrite)
                existing = pq.read_table(str(parquet_path))
                combined = pa.concat_tables([existing, pa.Table.from_batches([batch])])
                pq.write_table(combined, str(parquet_path), compression="zstd")
            else:
                pq.write_table(
                    pa.Table.from_batches([batch]),
                    str(parquet_path),
                    compression="zstd",
                )

            rows_written += batch.num_rows

        # Update _meta for this table (same as Jira's update_meta pattern)
        self._update_meta_for_stream_table(table)
        return rows_written
```

---

## 6. Example Connector Implementations

### 6.1 Keboola (batch + remote)

Current `connectors/keboola/extractor.py:run()` is ~300 lines. After refactor:

```python
# File: connectors/keboola/connector.py

import os
from typing import Iterator

import duckdb
import pyarrow as pa

from connectors.keboola.client import KeboolaClient
from src.connector_kit.protocol import Cap, ReadOptions, RemoteAttachInfo, TableInfo

class KeboolaConnector:
    """Keboola Storage connector — batch extraction and remote query."""

    capabilities = Cap.DISCOVER | Cap.READ | Cap.REMOTE

    def __init__(self, config: dict):
        self.url = config["url"]
        self.token = os.environ["KEBOOLA_STORAGE_TOKEN"]
        self._default_bucket = config.get("bucket", "")
        self._table_buckets: dict[str, str] = {}  # Populated by discover()
        # Layer 1: API client (existing connectors/keboola/client.py)
        self.client = KeboolaClient(self.url, self.token)
        # DuckDB extension availability (checked once)
        self._has_extension = self._check_extension()

    def discover(self) -> list[TableInfo]:
        """List tables in configured Keboola buckets."""
        tables = []
        for bucket in self.client.list_buckets():
            for table_meta in self.client.list_bucket_tables(bucket["id"]):
                schema = self._columns_to_arrow_schema(table_meta.get("columns", []))
                # Remember which bucket each table lives in for read()
                self._table_buckets[table_meta["name"]] = bucket["id"]
                tables.append(TableInfo(
                    name=table_meta["name"],
                    schema=schema,
                    capabilities=Cap.READ | Cap.REMOTE,
                    primary_key=table_meta.get("primaryKey"),
                    description=table_meta.get("description", ""),
                ))
        return tables

    def read(self, table: str, options: ReadOptions) -> Iterator[pa.RecordBatch]:
        """Extract table data — via DuckDB extension or legacy CSV export."""
        if self._has_extension:
            yield from self._read_via_extension(table, options)
        else:
            yield from self._read_via_csv(table, options)

    def remote(self) -> RemoteAttachInfo:
        return RemoteAttachInfo(
            extension="keboola",
            url=self.url,
            token_env="KEBOOLA_STORAGE_TOKEN",
            alias="kbc",
        )

    def _read_via_extension(self, table, options):
        """Use DuckDB Keboola extension for direct parquet export.

        Note: bucket is passed per-table via ReadOptions or looked up from
        table_registry config. The runtime resolves this before calling read().
        """
        con = duckdb.connect()
        try:
            con.execute("INSTALL keboola FROM community; LOAD keboola")
            token_escaped = self.token.replace("'", "''")
            con.execute(f"ATTACH '{self.url}' AS kbc (TYPE keboola, TOKEN '{token_escaped}')")

            # Bucket comes from table_registry config, resolved by runtime
            bucket = self._table_buckets.get(table, self._default_bucket)
            query = f'SELECT * FROM kbc."{bucket}"."{table}"'

            # fetch_record_batch() returns a pyarrow.RecordBatchReader; iterate it
            reader = con.execute(query).fetch_record_batch(options.batch_size)
            for batch in reader:
                yield batch
        finally:
            # Close the connection even if the consumer stops iterating early
            con.close()

    def _read_via_csv(self, table, options):
        """Fallback: legacy KeboolaClient CSV export → Arrow."""
        for chunk_df in self.client.export_table_chunked(table, chunk_size=options.batch_size):
            yield pa.RecordBatch.from_pandas(chunk_df)

    # ... helper methods (~20 lines)
```

**Result: ~80 lines** (API-specific code only). Runtime handles extract.duckdb, _meta, atomic swap, schema tracking, state.

### 6.2 BigQuery (remote only)

```python
# File: connectors/bigquery/connector.py

class BigQueryConnector:
    """BigQuery connector — remote-only via DuckDB extension."""

    capabilities = Cap.DISCOVER | Cap.REMOTE

    def __init__(self, config: dict):
        self.project_id = config["project_id"]

    def discover(self) -> list[TableInfo]:
        """List tables in BigQuery datasets via DuckDB extension."""
        con = duckdb.connect()
        try:
            con.execute("INSTALL bigquery FROM community; LOAD bigquery")
            con.execute(f"ATTACH 'project={self.project_id}' AS bq (TYPE bigquery, READ_ONLY)")
            # Query information_schema for table list
            rows = con.execute("""
                SELECT table_schema, table_name
                FROM bq.information_schema.tables
                WHERE table_type = 'BASE TABLE'
            """).fetchall()
        finally:
            con.close()
        return [
            TableInfo(
                name=f"{dataset}_{table_name}",
                schema=pa.schema([]),  # Schema inferred at query time
                capabilities=Cap.REMOTE,
                description=f"BigQuery: {dataset}.{table_name}",
            )
            for dataset, table_name in rows
        ]

    def remote(self) -> RemoteAttachInfo:
        return RemoteAttachInfo(
            extension="bigquery",
            url=f"project={self.project_id}",
            token_env="",  # Auth via GOOGLE_APPLICATION_CREDENTIALS
            alias="bq",
        )
```

**Result: ~40 lines.**

### 6.3 Jira (batch + stream)

```python
# File: connectors/jira/connector.py

class JiraConnector:
    """Jira connector — REST API batch + webhook streaming."""

    capabilities = Cap.DISCOVER | Cap.READ | Cap.STREAM

    TABLES = {
        "issues": ISSUES_SCHEMA,
        "comments": COMMENTS_SCHEMA,
        "changelog": CHANGELOG_SCHEMA,
        "attachments": ATTACHMENTS_SCHEMA,
        "issuelinks": ISSUELINKS_SCHEMA,
        "remote_links": REMOTE_LINKS_SCHEMA,
    }

    def __init__(self, config: dict):
        self.base_url = config["url"]
        # config is a plain dict, so the secret is read from the environment
        self.token = os.environ["JIRA_API_TOKEN"]
        self.email = config.get("email", "")
        self._webhook_queue: asyncio.Queue = asyncio.Queue()

    def discover(self) -> list[TableInfo]:
        return [
            TableInfo(
                name=name,
                schema=schema,
                capabilities=Cap.READ | Cap.STREAM,
                description=f"Jira {name}",
            )
            for name, schema in self.TABLES.items()
        ]

    def read(self, table: str, options: ReadOptions) -> Iterator[pa.RecordBatch]:
        """Backfill — iterate Jira REST API search results."""
        jql = f"updated >= '{options.incremental_value}'" if options.incremental_value else ""
        for page in self._search_paginated(table, jql, options.batch_size):
            transformed = transform_jira_page(table, page)  # existing transform.py
            yield pa.RecordBatch.from_pylist(transformed, schema=self.TABLES[table])

    async def stream(self, table: str) -> AsyncIterator[pa.RecordBatch]:
        """Process webhook events from queue."""
        while not self._webhook_queue.empty():
            event = await self._webhook_queue.get()
            transformed = transform_jira_event(table, event)
            if transformed:
                yield pa.RecordBatch.from_pylist(
                    [transformed],
                    schema=self.TABLES[table],
                )

    def push_event(self, event: dict):
        """Called by webhook handler to enqueue events."""
        self._webhook_queue.put_nowait(event)
```

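
The queue-draining pattern behind `push_event()` and `stream()` can be exercised on its own with the standard library. The sketch below is an illustration under assumptions: `QueueBackedSource` is a hypothetical stand-in for the connector, and plain dicts replace the Arrow batches and Jira transforms.

```python
import asyncio


class QueueBackedSource:
    """Stand-in for a connector whose stream() drains a webhook queue."""

    def __init__(self):
        self._queue: asyncio.Queue = asyncio.Queue()

    def push_event(self, event: dict) -> None:
        # Called synchronously by the webhook handler
        self._queue.put_nowait(event)

    async def stream(self):
        # Drain what is queued right now, then stop instead of waiting for more
        while not self._queue.empty():
            yield await self._queue.get()


async def main():
    source = QueueBackedSource()
    source.push_event({"key": "A-1"})
    source.push_event({"key": "A-2"})
    first_pass = [event async for event in source.stream()]
    second_pass = [event async for event in source.stream()]
    return first_pass, second_pass


first_pass, second_pass = asyncio.run(main())
print(first_pass)   # [{'key': 'A-1'}, {'key': 'A-2'}]
print(second_pass)  # []
```

Because the generator ends as soon as the queue is empty, each call handles one burst of events, which fits a per-webhook invocation of the runtime.
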
**Result: ~60 lines** (excluding existing transform.py which stays unchanged).

---

## 7. CLI Integration

### 7.1 New CLI commands

```
da connector list                      # List installed connectors + capabilities
da connector discover <name>           # Run discover(), show available tables
da connector test <name>               # Run contract tests against connector
da connector new <name> [--caps ...]   # Scaffold new connector from template
```

### 7.2 Scaffold template

`da connector new hubspot --caps discover,read,write` generates:

```
connectors/hubspot/
├── connector.yaml          # Manifest (pre-filled with name, caps)
├── connector.py            # Connector class skeleton
├── __init__.py
└── tests/
    └── test_connector.py   # Contract tests (from runtime)
```

Generated `connector.py`:

```python
"""HubSpot connector: generated scaffold."""

from collections.abc import Iterator

import pyarrow as pa
from src.connector_kit.protocol import Cap, Connector, ReadOptions, TableInfo


class HubspotConnector:
    capabilities = Cap.DISCOVER | Cap.READ | Cap.WRITE

    def __init__(self, config: dict):
        # TODO: Initialize API client
        pass

    def discover(self) -> list[TableInfo]:
        # TODO: Query HubSpot API for available objects
        return []

    def read(self, table: str, options: ReadOptions) -> Iterator[pa.RecordBatch]:
        # TODO: Implement data extraction
        yield from []
```
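The scaffold composes capabilities with `|` and the contract tests check them with `in`. The sketch below shows how that behaves if `Cap` is an `enum.Flag`, which is an assumption here since the real definition lives in `protocol.py`. The member list and the `parse_caps` helper are hypothetical; `parse_caps` only illustrates how a `--caps discover,read,write` argument could map onto flags.

```python
from enum import Flag, auto


class Cap(Flag):
    """Hypothetical stand-in for the Cap flags defined in protocol.py."""

    DISCOVER = auto()
    READ = auto()
    WRITE = auto()
    REMOTE = auto()
    STREAM = auto()


def parse_caps(spec: str) -> Cap:
    """Turn a comma-separated list such as 'discover,read' into combined flags."""
    caps = Cap(0)  # empty flag set
    for name in spec.split(","):
        # Cap[...] raises KeyError for an unknown capability name.
        caps |= Cap[name.strip().upper()]
    return caps


caps = parse_caps("discover,read,write")
print(Cap.READ in caps)    # membership test used by the contract tests
print(Cap.REMOTE in caps)  # not requested, so the REMOTE test would be skipped
```

With a `Flag`, a connector declares everything it supports in one attribute, and a test can skip itself with a single membership check.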
### 7.3 Contract tests

The runtime provides reusable test functions that any connector can run:

```python
# File: src/connector_kit/contract_tests.py

import itertools
from pathlib import Path

import duckdb
import pyarrow as pa
import pytest

from src.connector_kit.protocol import Cap, Connector, ReadOptions
from src.connector_kit.runtime import ConnectorRuntime


def test_discover_returns_valid_tables(connector: Connector):
    """Every discovered table must have a name, schema, and valid capabilities."""
    if Cap.DISCOVER not in connector.capabilities:
        pytest.skip("Connector does not support DISCOVER")
    tables = connector.discover()
    assert len(tables) > 0, "discover() must return at least one table"
    for t in tables:
        assert t.name, "Table name must not be empty"
        assert isinstance(t.schema, pa.Schema), f"Table {t.name} schema must be Arrow Schema"
        assert t.capabilities, f"Table {t.name} must declare capabilities"


def test_read_yields_valid_batches(connector: Connector):
    """read() must yield valid Arrow RecordBatches matching declared schema."""
    if Cap.READ not in connector.capabilities:
        pytest.skip("Connector does not support READ")
    tables = connector.discover() if Cap.DISCOVER in connector.capabilities else []
    readable = [t for t in tables if Cap.READ in t.capabilities]
    if not readable:
        pytest.skip("No readable tables discovered")
    table = readable[0]
    options = ReadOptions(batch_size=10)
    # Only inspect the first three batches to keep the test fast.
    batches = list(itertools.islice(connector.read(table.name, options), 3))
    for batch in batches:
        assert isinstance(batch, pa.RecordBatch)
        assert batch.num_rows >= 0  # Empty batches are OK
        assert batch.schema == table.schema, (
            f"Batch schema mismatch for {table.name}: "
            f"expected {table.schema}, got {batch.schema}"
        )


def test_full_extract_pipeline(connector: Connector, tmp_path: Path):
    """End-to-end: connector → runtime → extract.duckdb."""
    runtime = ConnectorRuntime(tmp_path / "test_extract")
    stats = runtime.run(connector)
    assert stats.tables_failed == 0, f"Extraction errors: {stats.errors}"
    db_path = tmp_path / "test_extract" / "extract.duckdb"
    assert db_path.exists()
    con = duckdb.connect(str(db_path), read_only=True)
    meta = con.execute("SELECT table_name FROM _meta").fetchall()
    assert len(meta) > 0, "extract.duckdb must have at least one table in _meta"
    con.close()


def test_remote_attach_info(connector: Connector):
    """remote() must return valid extension info without embedded secrets."""
    if Cap.REMOTE not in connector.capabilities:
        pytest.skip("Connector does not support REMOTE")
    info = connector.remote()
    assert info.extension, "Extension name must not be empty"
    assert info.url, "URL must not be empty"
    # SECURITY: token_env must be an env var name, not an actual token
    if info.token_env:
        assert not info.token_env.startswith("sk-"), "token_env must be env var name, not token"
        assert not info.token_env.startswith("xox"), "token_env must be env var name, not token"
        assert len(info.token_env) < 100, "token_env looks like a token, not an env var name"
```

Usage in a connector's test file:

```python
# File: connectors/hubspot/tests/test_connector.py

import pytest

from connectors.hubspot.connector import HubspotConnector
from src.connector_kit.contract_tests import *  # noqa: F401,F403


@pytest.fixture
def connector():
    # Remaining config keys are elided here.
    return HubspotConnector({"url": "https://api.hubspot.com"})

# All contract tests run automatically via the wildcard import
```
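The `token_env` check in `test_remote_attach_info` relies on known token prefixes (`sk-`, `xox`) and a length limit. One possible complement, sketched here under the assumption that environment variable names in this project follow the upper-case `NAME_WITH_UNDERSCORES` convention, is to validate the shape of the name itself. The helper name `looks_like_env_var_name` is hypothetical and not part of the spec.

```python
import re

# Assumed convention: upper-case letters, digits, and underscores, not starting with a digit.
_ENV_VAR_NAME = re.compile(r"[A-Z_][A-Z0-9_]*")


def looks_like_env_var_name(value: str) -> bool:
    """Return True if value has the shape of an env var name rather than a secret."""
    return bool(_ENV_VAR_NAME.fullmatch(value))


print(looks_like_env_var_name("KEBOOLA_STORAGE_TOKEN"))  # name-shaped
print(looks_like_env_var_name("sk-abc123"))              # token-shaped: lower case and a dash
```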
---

## 8. Integration with sync.py

### 8.1 Updated sync flow

The subprocess pattern stays (DuckDB lock isolation), but the subprocess now uses ConnectorRuntime:

```python
# In app/api/sync.py: updated _run_sync()

# Before (current):
cmd = [sys.executable, "-c", """
import json, sys
configs = json.load(sys.stdin)
from connectors.keboola.extractor import run
result = run(output_dir, configs, url, token)
print(json.dumps(result))
"""]

# After (with Connector Kit):
cmd = [sys.executable, "-c", """
import json, sys
from pathlib import Path
from src.connector_kit.manifest import ConnectorManifest
from src.connector_kit.runtime import ConnectorRuntime

payload = json.load(sys.stdin)
manifest = ConnectorManifest.load(Path(payload["manifest_path"]))
connector = manifest.instantiate(payload["config"])
runtime = ConnectorRuntime(Path(payload["output_dir"]))
stats = runtime.run(connector, tables=payload.get("tables"))
print(json.dumps(stats.__dict__))
"""]
```
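The parent side of this pattern, which sends the payload on stdin and parses the JSON stats from stdout, is not shown above. A minimal sketch of that round trip follows, using only the standard library. The child script is a stand-in that does no real extraction, and `tables_extracted` is a hypothetical field name (`tables_failed` is the one used elsewhere in this spec).

```python
import json
import subprocess
import sys

# Stand-in for the real child script: it only echoes a stats-shaped dict.
CHILD_SCRIPT = """
import json, sys
payload = json.load(sys.stdin)
tables = payload.get("tables") or []
stats = {"tables_extracted": len(tables), "tables_failed": 0}
print(json.dumps(stats))
"""


def run_in_subprocess(payload: dict) -> dict:
    """Send payload as JSON on stdin and parse the JSON stats printed on stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return json.loads(proc.stdout)


print(run_in_subprocess({"tables": ["orders", "customers"]}))
```

Because the parent parses stdout as JSON, the child has to keep stdout free of anything else; logging would need to go to stderr for this to work.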
### 8.2 Orchestrator compatibility

**No changes to `src/orchestrator.py`.** The runtime produces the same `extract.duckdb` contract:

- `_meta` table with `table_name, description, rows, size_bytes, extracted_at, query_mode` (+ optional `schema_json`)
- `_remote_attach` table with `alias, extension, url, token_env`
- Views pointing to `read_parquet(...)` for local tables

The orchestrator's `_attach_and_create_views()` and `_attach_remote_extensions()` continue to work unchanged. The orchestrator SELECTs only 4 specific columns from `_meta` (`table_name, rows, size_bytes, query_mode`), so the added `schema_json` column is invisible to it.

**Note:** `src/db.py:get_analytics_db_readonly()` also reads `_remote_attach` via `_reattach_remote_extensions()`. This is a second consumer of the same 4-column contract, and it also requires no changes.
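The compatibility argument rests on the orchestrator selecting columns by name. The sketch below illustrates that property with the standard-library `sqlite3` module as a stand-in for DuckDB, so it runs without extra dependencies. The sample row values are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# _meta with the six contract columns plus the new optional schema_json column.
con.execute(
    'CREATE TABLE _meta ("table_name" TEXT, "description" TEXT, "rows" INTEGER, '
    '"size_bytes" INTEGER, "extracted_at" TEXT, "query_mode" TEXT, "schema_json" TEXT)'
)
con.execute(
    "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("orders", "", 10, 2048, "2024-01-01T00:00:00", "local", "{}"),
)

# A consumer that names its four columns never sees schema_json.
row = con.execute(
    'SELECT "table_name", "rows", "size_bytes", "query_mode" FROM _meta'
).fetchone()
print(row)
con.close()
```

A consumer that used `SELECT *` or asserted an exact column count would see the change, which is why the contract helper in `tests/helpers/contract.py` needs the update listed in Phase 1.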
### 8.3 Sync.py additional concerns

The current `_run_sync()` in `app/api/sync.py` does more than just run extractors:

1. **Custom connectors**: scans `connectors/custom/*/extractor.py` and runs each in a subprocess. Must be preserved: during transition, scan for both legacy `extractor.py` and new `connector.yaml`.
2. **Auto-profiling**: runs `ProfilerService.profile_table()` after sync for first 10 tables per source. Must be preserved in the refactored sync flow.
3. **Auto-discovery**: when no tables are registered and `KEBOOLA_STORAGE_TOKEN` is set, attempts automatic table discovery. With Connector Kit this becomes cleaner: `connector.discover()` provides this natively.
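The transition-period scan from item 1 could look roughly like the sketch below. The function name `find_custom_connectors` and the rule that a `connector.yaml` takes precedence when both files exist are assumptions, not decisions made in this spec.

```python
from pathlib import Path


def find_custom_connectors(root: Path) -> dict[str, str]:
    """Map each connector directory name under root to 'manifest' or 'legacy'."""
    found: dict[str, str] = {}
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        if (directory / "connector.yaml").is_file():
            # Assumed precedence: the new manifest wins if both files are present.
            found[directory.name] = "manifest"
        elif (directory / "extractor.py").is_file():
            found[directory.name] = "legacy"
        # Directories with neither file are ignored.
    return found
```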
---

## 9. File Layout

### New files

```
src/connector_kit/
├── __init__.py          # Public API exports
├── protocol.py          # Cap, TableInfo, ReadOptions, RemoteAttachInfo, Connector
├── runtime.py           # ConnectorRuntime
├── manifest.py          # ConnectorManifest (YAML loader)
├── contract_tests.py    # Reusable test functions
└── scaffold.py          # CLI scaffold generator (da connector new)
```

### Modified files

```
connectors/keboola/
├── connector.yaml       # NEW: manifest
├── connector.py         # NEW: KeboolaConnector class
├── extractor.py         # KEPT: deprecated, delegates to connector.py
├── client.py            # UNCHANGED: legacy API client
└── ...

connectors/bigquery/
├── connector.yaml       # NEW
├── connector.py         # NEW: BigQueryConnector class
├── extractor.py         # KEPT: deprecated, delegates to connector.py
└── ...

connectors/jira/
├── connector.yaml       # NEW
├── connector.py         # NEW: JiraConnector class
├── extract_init.py      # KEPT: deprecated, delegates to connector.py
├── transform.py         # UNCHANGED (stable infrastructure per CLAUDE.md)
├── file_lock.py         # UNCHANGED (stable infrastructure per CLAUDE.md)
└── ...

app/api/sync.py                    # MODIFIED: use ConnectorRuntime in subprocess
cli/                               # MODIFIED: add `da connector` subcommands
tests/test_connector_kit_poc.py    # EXISTS: POC validation (29 tests)
```

### Unchanged files (per CLAUDE.md: stable infrastructure)

- `connectors/jira/file_lock.py`
- `connectors/jira/transform.py`
- `services/ws_gateway/`
- `src/orchestrator.py`

---

## 10. Migration Plan

### Phase 1: Core SDK (this spec)

1. Create `src/connector_kit/` package with Protocol, Runtime, Manifest
2. Move POC code from `tests/test_connector_kit_poc.py` to production
3. Add contract tests
4. Add `da connector list` and `da connector test` CLI commands
5. Update `tests/helpers/contract.py` to accept optional `schema_json` column in `_meta` (currently enforces exact 6-column match, new SDK produces 7 columns)
6. Add POC test for `_remote_attach` table in extract.duckdb (current POC only validates YAML, not DuckDB table)

**Deliverable:** SDK exists, no connectors migrated yet. Old code untouched.
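Step 5 relaxes the `_meta` check from an exact 6-column match to "six required columns plus an optional `schema_json`". A minimal sketch of such a check is shown below. The function and constant names are hypothetical, and the real helper in `tests/helpers/contract.py` may be structured differently.

```python
REQUIRED_META_COLUMNS = (
    "table_name", "description", "rows", "size_bytes", "extracted_at", "query_mode",
)
OPTIONAL_META_COLUMNS = ("schema_json",)


def assert_meta_columns(columns: list[str]) -> None:
    """Raise AssertionError unless columns satisfy the relaxed _meta contract."""
    allowed = set(REQUIRED_META_COLUMNS) | set(OPTIONAL_META_COLUMNS)
    missing = [c for c in REQUIRED_META_COLUMNS if c not in columns]
    unexpected = [c for c in columns if c not in allowed]
    assert not missing, f"_meta is missing columns: {missing}"
    assert not unexpected, f"_meta has unexpected columns: {unexpected}"


# Both the legacy 6-column layout and the new 7-column layout pass.
assert_meta_columns(list(REQUIRED_META_COLUMNS))
assert_meta_columns(list(REQUIRED_META_COLUMNS) + ["schema_json"])
```

Listing the optional columns explicitly, rather than allowing any extra column, keeps the check strict enough to catch accidental schema drift.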
### Phase 2: Keboola migration

1. Create `connectors/keboola/connector.yaml` + `connector.py`
2. `KeboolaConnector` wraps existing `client.py` + DuckDB extension logic
3. Old `extractor.py:run()` delegates to `ConnectorRuntime + KeboolaConnector`
4. Verify: `da connector test keboola` passes contract tests
5. Verify: `pytest tests/test_keboola_extractor.py` still passes (backward compat)

**Deliverable:** Keboola works via new SDK. Old API still works.

### Phase 3: BigQuery + Jira migration

1. Same pattern as Phase 2 for BigQuery (simplest: remote only)
2. Jira is the most complex: stream capability, existing transform.py
3. Jira requires modifying `connectors/jira/webhook.py` to bridge the existing synchronous webhook handler to the queue-based `stream()` interface. Note: `webhook.py` is NOT marked as stable infrastructure (only `transform.py` and `file_lock.py` are protected)
4. Verify all existing tests pass

**Deliverable:** All three connectors use SDK. Old APIs deprecated.
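The bridge in Phase 3 step 3 has to move events from a synchronous handler into an `asyncio.Queue`. If the handler runs on a different thread than the event loop (an assumption about `webhook.py`, which this spec does not describe), calling `put_nowait` directly is not thread-safe, so the sketch below schedules it through `loop.call_soon_threadsafe`. The class name `WebhookBridge`, its `drain()` method, and the sample event are hypothetical.

```python
import asyncio
from collections.abc import AsyncIterator


class WebhookBridge:
    """Hypothetical bridge between a synchronous webhook handler and stream()."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._queue = asyncio.Queue()

    def push_event(self, event: dict) -> None:
        # Callable from any thread: the put is executed on the loop's own thread.
        self._loop.call_soon_threadsafe(self._queue.put_nowait, event)

    async def drain(self) -> AsyncIterator[dict]:
        # Same drain-until-empty shape as the stream() method in the Jira connector.
        while not self._queue.empty():
            yield await self._queue.get()


async def demo() -> list[dict]:
    bridge = WebhookBridge(asyncio.get_running_loop())
    # Simulate the synchronous handler running in a worker thread.
    await asyncio.to_thread(bridge.push_event, {"issue": "PROJ-1"})
    await asyncio.sleep(0)  # give the scheduled put_nowait a chance to run
    return [event async for event in bridge.drain()]


print(asyncio.run(demo()))
```

If the handler already runs on the event loop thread, the plain `put_nowait` call used in `push_event` in the connector is sufficient.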
### Phase 4: CLI scaffold + developer experience

1. `da connector new <name>` scaffold command
2. `da connector discover <name>` for interactive discovery
3. Documentation for third-party connector authors
4. Remove deprecated `extractor.py` entry points

**Deliverable:** External developers can create connectors.

### Phase 5: driver_builder integration (optional/future)

1. `da connector generate-client <name> <api_docs_url>`
2. Uses driver_builder to generate API client (Layer 1)
3. Generates connector scaffold wrapping the client
4. Developer fills in Arrow schema mapping

**Deliverable:** New connector from API docs in minutes.

---

## 11. Validation

### POC results (already passing)

Test file: `tests/test_connector_kit_poc.py` (**29/29 tests, 0.69s**)

| Test class | Tests | What it validates |
|------------|-------|-------------------|
| `TestCapabilityFlags` | 3 | Flag composition, per-table caps, iteration |
| `TestProtocolCompliance` | 3 | `isinstance()` check, partial implementation, structural typing |
| `TestArrowIntegration` | 3 | RecordBatch → DuckDB zero-copy, iterator consumption, Parquet roundtrip |
| `TestConnectorRuntime` | 5 | Full pipeline, selective extract, incremental state, empty tables, partial failure |
| `TestSchemaEvolution` | 5 | Added/removed columns, type changes, no-change, first-run |
| `TestStreamingCapability` | 2 | AsyncIterator, stream → DuckDB |
| `TestRemoteOnlyConnector` | 1 | Remote-only metadata without data |
| `TestManifestValidation` | 5 | YAML parsing, capability mapping, auth, config schema, health check |
| `TestDiscoveryToReadPipeline` | 1 | End-to-end: discover → read → query |
| `TestLargeDataBatching` | 1 | 100K rows in constant memory |

### Acceptance criteria for production

- [ ] All 29 POC tests pass after moving code to `src/connector_kit/`
- [ ] Existing test suite (633 tests) passes with no regressions
- [ ] `da connector test keboola` passes all contract tests
- [ ] `da connector test bigquery` passes all contract tests
- [ ] `da connector test jira` passes all contract tests
- [ ] Orchestrator produces identical analytics.duckdb from SDK-wrapped connectors
- [ ] Sync API (`POST /api/sync/trigger`) works unchanged
- [ ] Schema evolution detected on real Keboola table schema change

---

## 12. Open Questions

1. **Incremental merge strategy.** Current spec supports incremental via `incremental_key` / `incremental_value`, but doesn't specify how to merge new data with existing parquets (append vs. replace vs. upsert). Phase 1 uses full replace (current behavior); upsert support is a Phase 3+ concern.

2. **Partitioned parquet vs. single file.** Jira uses `YYYY-MM.parquet` partitions, others use single `{table}.parquet`. The runtime should support both, configurable per-table or per-connector. Current spec defaults to single file for `read()`, partitioned for `stream()`.

3. **Concurrent webhook writes.** Jira's `file_lock.py` handles concurrent webhook-to-parquet writes. The runtime should integrate this, but `file_lock.py` is marked as stable infrastructure in CLAUDE.md. Resolution: runtime delegates to existing `file_lock.py`, no changes needed.

4. **Health check execution.** Manifest declares health check, but who executes it? Options: (a) runtime before extraction, (b) CLI on demand, (c) scheduler periodically. Phase 1: CLI only (`da connector test <name>` runs health check). Automatic health check before extraction in Phase 2.

5. **Custom connector auto-discovery.** Current `sync.py` scans `connectors/custom/*/extractor.py`. With Connector Kit, scan for `connectors/*/connector.yaml` instead. Need to handle transition period where both patterns coexist.

6. **Keboola `_remote_attach` conditional creation.** Current `extractor.py` only creates `_remote_attach` when both `has_remote` AND `use_extension` are true. The Connector Kit runtime always calls `_write_remote_attach()` when `Cap.REMOTE` is declared. This means `_remote_attach` will be present even when the extension is unavailable (fallback to legacy client). The orchestrator handles missing extensions gracefully (logs warning, skips), so this behavioral change is safe but should be noted.

7. **Identifier validation shared module.** The `_SAFE_IDENTIFIER` regex is currently duplicated in `src/orchestrator.py`, `src/db.py`, and `cli/commands/analyst.py`. The Connector Kit adds a fourth copy. Consider extracting to a shared `src/validators.py` module in Phase 1.
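To make the three options in Open Question 1 concrete, the sketch below applies append, replace, and upsert to plain lists of row dicts. It is an illustration of the semantics only, not a proposal for how the runtime should merge parquet files. The `merge_rows` function and the `id` key column are hypothetical.

```python
def merge_rows(existing: list[dict], incoming: list[dict], strategy: str, key: str = "id") -> list[dict]:
    """Combine previously extracted rows with a new batch under one merge strategy."""
    if strategy == "replace":
        # Phase 1 behavior: the new extract fully supersedes the old one.
        return list(incoming)
    if strategy == "append":
        # Keeps every row, so a re-extracted row appears twice.
        return list(existing) + list(incoming)
    if strategy == "upsert":
        # Later rows overwrite earlier rows that share the same key.
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"Unknown merge strategy: {strategy}")


old = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
new = [{"id": 2, "status": "done"}, {"id": 3, "status": "open"}]
for name in ("replace", "append", "upsert"):
    print(name, merge_rows(old, new, name))
```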
---

## Appendix A: Review Findings

This spec was reviewed against the actual codebase. All findings have been addressed in the current version.

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | `_meta` schema adds `schema_json`, which breaks the `tests/helpers/contract.py` exact 6-column assert | WARNING | Added to Phase 1 migration step 5 |
| 2 | `_remote_attach` 4-column schema matches all consumers | CORRECT | No action needed |
| 3 | Stable files (file_lock.py, transform.py, ws_gateway/) respected | CORRECT | No action needed |
| 4 | POC test count (29/29) is accurate | CORRECT | No action needed |
| 5 | POC doesn't test `_remote_attach` in DuckDB (only YAML) | WARNING | Added to Phase 1 migration step 6 |
| 6 | `config.secret()` method does not exist in codebase | ERROR | Fixed → `os.environ["KEBOOLA_STORAGE_TOKEN"]` |
| 7 | `self._bucket` used but never assigned in KeboolaConnector | ERROR | Fixed → `_default_bucket` + `_table_buckets` in `__init__` |
| 8 | Keboola `_remote_attach` conditional creation not replicated | WARNING | Documented in Open Question 6 |
| 9 | Custom connectors + auto-profiling in `sync.py` not addressed | WARNING | Added Section 8.3 |
| 10 | `src/db.py` is second `_remote_attach` consumer | WARNING | Added note in Section 8.2 |
| 11 | `_SAFE_IDENTIFIER` validation missing from runtime | SUGGESTION | Added `_validate_identifier()` to runtime + validation in `run()` |
| 12 | Jira `webhook.py` incompatible with queue-based streaming | WARNING | Added to Phase 3 step 3 |