* feat(unified-stack): Browse + My Stack + Recipes + RBAC matrix (v49–v55)
Squash of 94 commits spanning the v49 → v55 unified-stack rewrite.
Full per-feature breakdown lives in CHANGELOG.md under [Unreleased].
Major buckets:
* v49 schema — first-class user_groups + user_group_members +
resource_grants; admin can CRUD groups and grants; Google
Workspace nightly sync writes into the new tables.
* v49 data_packages — admin-curated bundles of tables, RBAC-gated,
first-class section on /catalog Browse + My Stack.
* v49 memory_domains — row-backed (replaces hardcoded VALID_DOMAINS
enum); admin can CRUD; grants follow the same shape as tables and
packages.
* v50 cover_image_url + admin sidebar collapsibles + per-row Mode
tooltip + admin queue domain badges + admin "+ New Item" seed flow.
* v51 lifecycle status (prod/poc/coming-soon/draft) + category +
palette swatches on admin modals.
* v52 per-table detail page /catalog/t/<id>.
* v53 Recipes — admin-curated SQL templates as a second tab on
/catalog with full Edit/Delete admin affordances.
* v54 soft-delete (deleted_at) + Undo toast for packages, memory
domains, and recipes; hard_delete() retained as escape hatch.
* v55 Recipes RBAC — ResourceType.RECIPE registered, inline Group
Access matrix on Create + Edit Recipe modals (mirrors the Memory
Domain pattern).
* Activity Center per-resource filter (resource_prefix LIKE-anchored
on audit_log.resource); admin nav g+letter keyboard shortcuts;
loadAdminTablesLayout N+1 → single endpoint; /api/memory 30s
page-level cache.
* CI hardening — Keboola legacy tests pytest.importorskip; perf-
smoke threshold widened to stop cold-cache flake.
5002 tests passing, 35 skipped.
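The LIKE-anchored `resource_prefix` filter from the Activity Center bullet can be illustrated with a minimal sketch. It assumes an `audit_log` table with a text `resource` column and uses an in-memory SQLite database for portability; the helper names `escape_like` and `filter_by_resource_prefix` and the escaping scheme are hypothetical, not the project's actual query.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so the prefix is matched literally."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def filter_by_resource_prefix(conn: sqlite3.Connection, prefix: str) -> list[str]:
    """Return audit_log.resource values that start with ``prefix``."""
    # Anchored at the start, open at the end: 'memory_domain:' + '%'.
    pattern = escape_like(prefix) + "%"
    rows = conn.execute(
        "SELECT resource FROM audit_log "
        "WHERE resource LIKE ? ESCAPE '\\' ORDER BY resource",
        (pattern,),
    ).fetchall()
    return [r[0] for r in rows]
```

Escaping matters here because resource namespaces contain `_`, which LIKE otherwise treats as a single-character wildcard. Note that SQLite's LIKE is case-insensitive for ASCII by default; other engines may differ.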
* feat(p2 backlog): Cmd-K palette + suggest-a-domain + nightly E2E + v55 schema
10-item P2 sweep on top of the unified-stack squash. New behaviour:
* Cmd-K admin command palette (base.html) — fuzzy-search overlay over
admin + user-facing routes. Arrows/Enter to navigate, Esc to close.
* Stack-tabs digit shortcuts — 1/2/3 switch Browse / My Stack /
Recipes on /catalog + /corporate-memory.
* Friendlier non-admin empty state on /corporate-memory, plus a
"Suggest a domain" CTA → POST /api/memory-domain-suggestions, admin
queue with approve/reject. Backed by a new memory_domain_suggestions
table (schema v55).
* /admin/corporate-memory 7-tab strip grouped under Moderation /
Catalog parent labels.
* Bulk-assign table → package dropdown annotates each option with
"(N of M tables already in)" so the existing distribution is visible
before picking a target.
* GET /api/memory + /tree accept is_required filter; admin status
dropdowns route the "Required" sentinel onto it (status no longer
holds 'mandatory' post-v49, so the old dropdown returned nothing).
* chip-input.js is now opt-in per template via {% block extra_scripts %}
instead of loaded globally on every page from base.html.
* Edit-modal close helpers consolidated onto _closeEditModalById();
also documents the per-source-type modal architecture decision.
* New .github/workflows/e2e-nightly.yml runs agent-browser smoke
scripts (scripts/e2e/smoke_*.sh) against a docker-compose stack
nightly at 04:30 UTC; failures open an agent-browser-nightly issue.
5012 tests passing, 35 skipped.
* fix(visual audit): 6 page regressions on memory + data-package surfaces
agent-browser walkthrough of every memory + data-package page in the PR
turned up 6 real bugs. Fixes:
1. Admin memory modals were dead. Duplicate `let _cmdNewDomainId`
declarations from the deprecated step-2 RBAC stubs in
admin_corporate_memory.html collided with the live state vars
declared earlier in the same <script> → SyntaxError on parse →
the entire second script block silently failed → every inline
onclick= handler defined there (`+ New Memory Domain`, Edit, etc.)
was a no-op. Removed the duplicate stubs.
2. /catalog/t/<table_id> + /catalog/r/<slug> rendered unstyled.
Both templates injected their CSS via {% block head %} but
base.html exposes {% block head_extra %} — wrong block name
meant <style> rules never reached the rendered HTML. Renamed
to head_extra. Hero card, section cards, dark SQL block, proper
full-width inputs all now render as designed.
3. L49 leak — "MANDATORY" KPI label + "Make Mandatory" row buttons
on /admin/corporate-memory still used the old word. Renamed to
"Required" / "Mark as Required" so UI matches the data model
(v49 split moved the Required tier onto the orthogonal
is_required boolean; status no longer holds 'mandatory').
4. Activity Center Resource dropdown didn't know the v55
`memory_domain_suggestion:` namespace — added it.
5. Tab strip on /admin/corporate-memory wrapped text 2× per button
on narrow viewports after the L50 MODERATION/CATALOG group
labels pushed total width past most viewports. Switched the
strip to flex-wrap:nowrap + overflow-x:auto with
white-space:nowrap + flex-shrink:0 on every direct child so the
tabs stay one row and slide horizontally when they overflow.
5012 tests passing, 35 skipped.
* rebase-cleanup: align with main's 0.54.25-27 API design + comment fix
Three follow-on fixes after rebasing onto origin/main (0.54.27):
* admin_tables.html: dropped a stray nested ``{% if data_source_type
== 'keboola' %}`` around ``prefillFromKeboolaTable`` (main never had
it; the outer Phase F2 guard already covers it) and reworded a JS
comment that contained literal ``{% %}`` tokens which Jinja was
parsing as a real tag → unbalanced if/endif → 30 template render
failures across the suite.
* /api/stack/subscription/{type}/{id}: DELETE now returns 204 instead
of 200 per the 0.54.26 design rules. CLI client + parity tests
updated to accept 2xx / assert 204.
* Memory-domain suggestion approve/reject paths added to
``_VERB_PATH_ALLOWLIST`` — they are pending → approved/rejected
state-machine transitions (approve also creates the real
memory_domains row as a side effect), so the RPC shape is
intentional rather than a missed PATCH refactor.
5035 tests passing, 35 skipped.
* fix(catalog_table_detail): real polish pass — hero glyph, dedup pills, rows/size meta, scoped sync CTA
The previous fix only corrected the block-name typo so that the existing
CSS rendered. On close inspection the actual layout was still
wireframe-tier:
* No cover glyph in the hero (a flat white card with title + meta line);
data-package + memory-domain detail pages both have a colored icon
square. Restored parity — table.icon emoji if set, otherwise initials
on a colored square using table.color.
* "INTERNAL" pill rendered twice for agnes_audit etc. — the mode pill
and the source-type pill happened to be identical strings. Now skip
the source pill when it matches the mode (`internal == internal`).
* Bucket / source_table code chip showed `Agnes Internal.audit_log` for
internal rows — meaningless to a user. Hidden when source_type is
internal.
* `pairs_well_with` admin input was a comma-separated `<input>` always
visible. Wrapped all 4 sections in an Edit-on-demand toggle: read-
only display by default, "+ Add" / "Edit" button on the right edge
of each section header reveals the inline form, Cancel hides it.
* "Trigger sync now" was a cramped link squashed into the empty-state
flex row (visible as `Tr…` overflow before). Promoted to a proper
btn-primary button under the empty-state copy. Hidden entirely for
internal tables (which are server-managed — no upstream to pull).
* Hero meta now surfaces row count + payload size (when sync_state has
them) + last sync timestamp on a single line — was missing from the
original.
* Mode pills colored by tier (local=green, remote=amber,
materialized=blue, internal=gray) so the basic fact about a table
reads at a glance, not from upper-cased text alone.
* tests(v56): TDD baseline for extended data-packages content + per-table docs
68 failing tests across 8 files spec the v56 surface before any
implementation lands:
* test_schema_v55_to_v56_migration.py — schema bump, additive ALTERs
on data_packages + table_registry, idempotency, sequential-upgrade
preservation
* test_data_packages_repo_v56.py — repo create/update/get/list for
owner_name, owner_team, tags, long_description, when_to_use,
when_not_to_use, example_questions (JSON list round-trip, empty
defaults, partial-update preservation)
* test_table_registry_v56_docs.py — update_docs for grain, platforms,
partition_col, history, gotchas; preserves v52 docs columns
* test_api_data_packages_v56.py — PUT/POST/GET for all new fields,
field-level validation (tag count, bullet length, description size),
virtual badge derivation (curated/new)
* test_api_registry_docs_v56.py — PATCH /api/admin/registry/{id}/docs
for v56 fields, validation, RBAC unchanged
* test_web_catalog_package_detail_v56.py — /catalog/p/<slug> rewrite
asserts on rendered owner line, tag pills, badges, What it is,
Use it when, Skip it when, Example questions, per-table extended
detail in collapsible row, key-gotcha distinctness, admin-only Edit
* test_web_stack_card_v56_metadata.py — Browse-grid card additions
(owner chip, tag chips, badges) without breaking back-compat for
rows missing the new fields
* test_data_packages_no_vendor_content.py — CI guard: scans app/ +
src/ + cli/ + config/ + scripts/ for Groupon-specific tokens from
the colleague's spec MD; fails if any leak into OSS surfaces
* test_db_schema_version.py — bumped 55 → 56 with rationale
Plus updates schema-version assertion to 56. Implementation lands in
subsequent commits (schema migration → repo → API → templates).
* feat(v56): schema + repo for extended data-packages content
Schema additions (ALTER ADD COLUMN IF NOT EXISTS — additive + idempotent):
* data_packages: owner_name, owner_team, tags, long_description,
when_to_use, when_not_to_use, example_questions (JSON-as-VARCHAR for
the lists)
* table_registry: grain, platforms, partition_col, history, gotchas
(extends the v52 sample_questions / things_to_know / pairs_well_with
docs surface with structured per-table content)
Repo extensions:
* DataPackagesRepository.create + update accept the new fields with
the same Optional-is-no-op contract as v51 (pass an empty list to
clear a JSON column)
* _decode_row decodes the new JSON-list columns to Python lists; NULL
decodes to [] so callers don't need a None branch
* TableRegistryRepository.update_docs grew the v56 fields alongside
the existing v52 ones — single PATCH can write either tier
atomically
* TableRegistryRepository._decode_row picks up platforms + gotchas in
the same NULL-tolerant decoder
22 repo + migration tests passing. API + UI land in subsequent commits.
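The NULL-tolerant decode described in the repo extensions can be sketched roughly as below, assuming the JSON-list columns are stored as VARCHAR text. The helper name `decode_json_list_columns` and the `JSON_LIST_COLUMNS` tuple are illustrative only and are not the repository's actual `_decode_row`.

```python
import json
from typing import Any

# Hypothetical column set; the real repository may decode a different list.
JSON_LIST_COLUMNS = ("tags", "when_to_use", "when_not_to_use", "example_questions")


def decode_json_list_columns(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of ``row`` with JSON-as-VARCHAR columns decoded to lists.

    NULL (None), a missing key, or an empty string decodes to [] so callers
    never need to branch on None.
    """
    decoded = dict(row)
    for col in JSON_LIST_COLUMNS:
        raw = decoded.get(col)
        if raw is None or raw == "":
            decoded[col] = []
            continue
        value = json.loads(raw)
        # Guard against a non-list payload stored in a list column.
        decoded[col] = value if isinstance(value, list) else []
    return decoded
```

Returning a copy keeps the raw row untouched, which is a choice made for this sketch rather than something the commit states.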
* feat(v56): API surface for extended data-packages + per-table docs
CreateDataPackageRequest + UpdateDataPackageRequest grew the v56 fields
(owner_name, owner_team, tags, long_description, when_to_use,
when_not_to_use, example_questions) with per-field validators that
match the Foundry spec checklist:
* tags: ≤8 entries × ≤30 chars
* long_description: ≤4000 chars
* use/skip: ≤8 bullets × ≤200 chars
* example_questions: ≤12 × ≤200 chars
_serialize emits all v56 fields plus a virtual ``badges`` list derived
server-side at render time (no DB column needed): "curated" when the
creator is in the Admin group, "new" within 30 days of created_at.
Because badges are derived on each render, a backdated created_at or an
admin-status change is picked up automatically.
PATCH /api/admin/registry/{id}/docs extended with v56 structured
per-table fields (grain, platforms, partition_col, history, gotchas).
gotchas: list of {key: bool, body: str} Pydantic models with the same
≤8 cap; first key=true entry becomes the Key gotcha on the rendered
package detail page. PATCH echoes the fresh state so callers can
re-render without a second GET.
26 API tests passing (16 data-packages + 10 registry-docs).
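A minimal, framework-free sketch of the list limits and the virtual badge derivation described in this commit. The real code uses Pydantic validators; the names `LIST_LIMITS`, `validate_list_field` and `derive_badges` are hypothetical, the mapping of "use/skip" onto `when_to_use` / `when_not_to_use` is an assumption, and so is treating the 30-day window as inclusive.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (max entries, max chars per entry), taken from the limits listed above.
LIST_LIMITS = {
    "tags": (8, 30),
    "when_to_use": (8, 200),
    "when_not_to_use": (8, 200),
    "example_questions": (12, 200),
}
NEW_BADGE_WINDOW = timedelta(days=30)


def validate_list_field(field: str, values: list[str]) -> list[str]:
    """Raise ValueError when a list field exceeds its count or length cap."""
    max_items, max_chars = LIST_LIMITS[field]
    if len(values) > max_items:
        raise ValueError(f"{field}: at most {max_items} entries allowed")
    for value in values:
        if len(value) > max_chars:
            raise ValueError(f"{field}: entries are limited to {max_chars} chars")
    return values


def derive_badges(
    created_by: Optional[str],
    created_at: Optional[datetime],
    admin_user_ids: set[str],
    now: Optional[datetime] = None,
) -> list[str]:
    """Compute badges at render time; nothing is stored in the database.

    Both datetimes are expected to be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    badges = []
    if created_by is not None and created_by in admin_user_ids:
        badges.append("curated")
    if created_at is not None and now - created_at <= NEW_BADGE_WINDOW:
        badges.append("new")
    return badges
```

Passing the Admin member set in as an argument mirrors the later note that badges are computed once per fetch from a single Admin-group SELECT.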
* feat(v56): /catalog/p/<slug> rewrite + Browse-grid card augmentation
The third (and final) v56 commit lights up the UI surfaces backed by
the schema + API commits earlier in this PR:
* /catalog/p/<slug> template rebuilt around the Foundry spec's
section ladder — hero (icon + name + badges + owner + tags +
description + meta + Add-to-stack), "What it is" markdown body,
paired "Use it when / Skip it when" panels, "Tables in this
package" with collapsible per-table extended detail (grain /
platforms / partition_col / history / gotchas + sample questions),
and an "Example questions you can ask Claude" prompt panel. Each
section guarded by ``{% if pkg.<field> %}`` — empty content fields
hide the section entirely (no "No X yet" placeholder noise on the
public-facing drilldown).
* router catalog_package_detail hydrates per-table v56 fields onto
the tables list + derives the virtual badges (curated / new)
server-side from creator-in-Admin + 30-day created_at.
* StackResolver.ResourceEntry grew owner_name / owner_team / tags /
badges; _fetch_entries pulls the v56 columns + computes badges
once per fetch using a single Admin-group SELECT.
* _data_package_entry_dict adapter passes the new fields through to
the macro; tags are merged source-type pills + admin-authored
category tags per the spec convention.
* _stack_card.html renders the v56 badges (top-left, data-badge=
hooks) + the owner chip (data-card-owner hook) without breaking
back-compat — pre-v56 rows render unchanged.
* Admin PUT handler strips the v56 docs fields from the
read-modify-write merged dict so register() doesn't blow up
with the now-larger row shape (same pattern as the v52 docs
fields stripping).
5115 tests passing (+98 v56 + 18 fixed regressions from the merged-
register PUT path), 35 skipped.
* fix(rbac): Edit-on-package + Group-access 'required' persistence + CI vendor guard
Three related bugs reported on the merged-with-main branch:
1. Clicking Edit on a Data Package card landed on /admin/tables with
a `#<pkg.id>` hash that nothing listened to — admin saw the global
table listing, not the editor for that specific package. Added a
`?edit_package=<pkg_id>` query-param handler in admin_tables.html
(analog to the existing `?edit=<table_id>` and `?assign_to=<pkg_id>`
patterns) that calls openEditDataPackageModal on DOMContentLoaded
after a 250ms layout settle. Updated the package-detail Edit link
to use the new query param.
2. Setting Group Access to 'required' didn't persist — re-opening
the modal showed 'available'. Root cause was the v49
``resource_grants.requirement`` enum existing in the DB but the
POST /api/admin/grants endpoint not surfacing it: ``CreateGrantRequest``
declared only group_id + resource_type + resource_id, so Pydantic
silently dropped the matrix's ``requirement: 'required'`` payload
and the new row landed at the DB column default ('available').
Plumbed ``requirement`` through ``CreateGrantRequest`` →
``ResourceGrantsRepository.create`` so the value persists in one
round-trip. Plus a UNIQUE-constraint race in the matrix
diff-apply: DELETE-old + POST-new ran in parallel via
``Promise.allSettled``, so POST could fire first and trip the
unique check before DELETE freed the slot. Switched to sequential
(await all deletes; then await all writes) across all three
matrices (Edit Data Package, Edit Memory Domain, Edit Recipe).
3. CI vendor-content guard ``test_no_groupon_specific_strings_in_oss``
tripped on two of my own docstrings: a "Foundry Data team" mention
in two src/db.py comments + an ``s1_session_landings`` example in
cli/skills/agnes-table-registration.md. Rephrased the comments to
"extended-descriptions admin spec" and replaced the example with
a generic ``events_daily`` table name.
5164 tests passing, 35 skipped (+4 regression tests pinning the POST
/api/admin/grants requirement contract). Vendor guard back to green.
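The ordering fix for the matrix diff-apply (await all deletes, then await all writes) can be sketched as follows. The original code is browser JavaScript; this is a Python asyncio analog, and `apply_grant_diff` together with its two callbacks are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def apply_grant_diff(
    to_delete: Iterable[Any],
    to_create: Iterable[dict],
    delete_grant: Callable[[Any], Awaitable[None]],
    create_grant: Callable[[dict], Awaitable[None]],
) -> None:
    """Apply a grant diff in two phases so creates never race deletes."""
    # Phase 1: deletes may run concurrently with each other, but every one
    # of them must finish before any create is started.
    await asyncio.gather(*(delete_grant(key) for key in to_delete))
    # Phase 2: the unique slots are free now, so the writes cannot trip
    # the UNIQUE constraint on a row that is about to be removed.
    await asyncio.gather(*(create_grant(payload) for payload in to_create))
```

Running both phases in one concurrent batch is what allowed a write to reach the server before the delete that frees its slot.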
* fix(catalog): admin Browse path drops v58 card fields
The /catalog and /memory admin god-mode branch built ResourceEntry
instances inline from pkg_repo.list() / domains_repo.list() and skipped
owner_name, owner_team, tags, and derived badges (curated/new). Visible
symptom: a package with an owner + tags rendered with the v56 chrome
for non-admin viewers but as a bare card for admins.
Adds StackResolver.browse_admin(user_id, resource_type) — admin god-mode
Browse that walks the full table but routes through the same
_fetch_entries enrichment pass as browse(), so admin + non-admin Browse
stay visually consistent. Both /catalog and /corporate-memory routes
switch to it.
Regression test in tests/test_stack_resolver_browse_admin.py covers:
owner/tags propagation, new/curated badge derivation, in_stack from
admin subscriptions, all-packages-regardless-of-grants, and the
ValueError for unsupported resource types.
* fix(catalog): three /catalog tab-strip UX bugs
1. Required Remove → red toast
browse_admin passed empty required_ids to _fetch_entries, so the
admin's own required grants surfaced as 'available' and the macro
rendered an actionable Remove button that POST /unsubscribe 400'd
on. Now derives required_ids from the admin's own groups so
Required packages render with the disabled "In stack (required)"
button. Regression test in test_stack_resolver_browse_admin.py.
2. Remove green-toasts but card stays until refresh
The My-Stack empty-state placeholder was only emitted server-side
when stack_entries was empty at render time. Removing the last
card left the tab completely blank — users read that as "Remove
didn't work, let me refresh". Both grid + empty-state are now
always rendered with one of them initially hidden; the JS swaps
visibility on add/remove instead of injecting DOM. Same fix in
/corporate-memory.
3. "What are Recipes?" + ambiguous (admin) suffix
Recipes tab now carries its own curator-block explainer (the
shared one was moved inside Browse view so it doesn't bleed
across tabs). The grey "(admin)" suffix becomes a yellow
.admin-only-hint chip with a title tooltip — visibility hint is
now unambiguous: yellow chip = "only you see this", non-admins
don't see the affordance at all.
* schema: renumber v51..v58 → v52..v59 to make room for main's v51
Main 0.54.29 introduced a NEW v51 (table_registry.bq_fqn — issue #343)
that releases ahead of this branch. The unified-stack chain v51..v58
shifts up by one so main's v51 stays as the released schema and ours
become v52..v59. Function names, internal version bumps, dispatch
ladder thresholds, and the migration-test references all move
together. Subsequent merge with main lands the bq_fqn column at the
freed v51 slot.
* fix(seed): seed admin lands in BOTH Admin AND Everyone groups
The LOCAL_DEV_MODE / SEED_ADMIN_EMAIL bootstrap only added the seed
user to Admin. Everyone-scoped grants — the canonical "every-user-
sees-this" pattern for Required onboarding — didn't surface for the
seed admin's own /catalog because they weren't in Everyone. Symptom:
admin grants a Required-tier package to Everyone, then sees it on
/catalog still rendered with an "Add to stack" button (because the
admin's resolved required_ids was empty for that package).
The dual-membership keeps Admin (authorization) and Everyone
(default-grant target) intentionally separate per the design comment
on UserRepository.create — every membership remains traceable to a
concrete row, just now with a system_seed row in Everyone too. Both
INSERTs go through UserGroupMembersRepository.add_member which is
idempotent on (user_id, group_id), so re-fires on every lifespan
startup don't duplicate rows.
Regression test in test_main_seed_admin_everyone.py.
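A small sketch of why repeated startups do not duplicate membership rows. It uses an in-memory SQLite table as a stand-in for the real store; the table layout, the `source` column, the group ids and the function names are all assumptions made for illustration, not the actual `UserGroupMembersRepository`.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # The composite primary key is what makes a repeated insert a no-op.
    conn.execute(
        "CREATE TABLE user_group_members ("
        " user_id TEXT NOT NULL,"
        " group_id TEXT NOT NULL,"
        " source TEXT NOT NULL,"
        " PRIMARY KEY (user_id, group_id))"
    )


def add_member(conn: sqlite3.Connection, user_id: str, group_id: str, source: str) -> None:
    """Idempotent on (user_id, group_id): a second call changes nothing."""
    conn.execute(
        "INSERT OR IGNORE INTO user_group_members (user_id, group_id, source)"
        " VALUES (?, ?, ?)",
        (user_id, group_id, source),
    )


def seed_admin(conn: sqlite3.Connection, user_id: str) -> None:
    """Put the seed user in both groups, as the bootstrap now does."""
    for group_id in ("admin", "everyone"):
        add_member(conn, user_id, group_id, "system_seed")
```

Calling `seed_admin` on every startup is safe under these assumptions because the second insert for the same pair is ignored.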
* style: unify admin-only hints across marketplace + memory detail pages
Replaces three stale ``(admin)`` parentheticals with the same yellow
``admin-only`` chip introduced for /catalog tab actions. Same tooltip
copy ("Visible only to admins — analysts won't see this …") so the
visibility hint is unmistakable wherever it appears:
- Hard delete on marketplace_plugin_detail (admin-only destructive
action — same gating as the original suffix conveyed).
- Hard delete on marketplace_item_detail (same).
- Edit link on memory_domain_detail (title-attr only before; now a
visible chip too).
Non-admin viewers never saw these affordances — the gates are
unchanged. Pure styling pass for consistency.
* fix(catalog): exclude soft-deleted data packages + memory domains from Browse
``StackResolver._fetch_entries`` and ``browse_admin`` were querying
data_packages / memory_domains without a ``deleted_at IS NULL`` guard.
A package soft-deleted via /admin/* (v54 soft-delete contract) stayed
visible on /catalog and /memory until either an Undo or a hard delete
— directly contradicting the soft-delete UX which is supposed to
remove the affordance immediately and only retain the row for the
Undo window.
The repository accessors (DataPackagesRepository.list,
MemoryDomainsRepository.list, list_packages_of_table, etc.) already
filter deleted rows; this commit brings the resolver's direct SQL in
line with that contract.
Regression test in test_stack_resolver_browse_admin.py.
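The soft-delete contract the resolver now honors can be sketched like this. The example uses SQLite and a pared-down `data_packages` table with only `id` and `deleted_at`; the function names are hypothetical and the real resolver SQL selects many more columns.

```python
import sqlite3
from datetime import datetime, timezone


def soft_delete_package(conn: sqlite3.Connection, package_id: str) -> None:
    """Mark the row deleted; the row itself is kept for the Undo window."""
    conn.execute(
        "UPDATE data_packages SET deleted_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), package_id),
    )


def undo_delete_package(conn: sqlite3.Connection, package_id: str) -> None:
    """Undo simply clears the marker again."""
    conn.execute(
        "UPDATE data_packages SET deleted_at = NULL WHERE id = ?",
        (package_id,),
    )


def list_browsable_packages(conn: sqlite3.Connection) -> list[str]:
    """Browse must apply the same guard the repository accessors use."""
    rows = conn.execute(
        "SELECT id FROM data_packages WHERE deleted_at IS NULL ORDER BY id"
    ).fetchall()
    return [r[0] for r in rows]
```

Any query that reads the table directly, bypassing the repository accessors, needs the `deleted_at IS NULL` guard repeated, which is the gap this commit closes.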
* fix(catalog): Add/Remove updates full card chrome, not just button
The previous _applyStackChange flipped only the footer button label —
the card border (.is-in-stack class), top-right "In stack" badge, and
button color class (--add / --remove) stayed at their server-rendered
state. After Add the user saw the button checkmark but the rest of
the card still looked like "available, not in stack". They read this
as "the change didn't take — let me refresh".
This commit makes the optimistic update mirror what the server-side
macro renders for the new state:
* ``c.classList.toggle('is-in-stack', becameInStack)`` — flips the
border + visual state class.
* Top-right ``.stack-card__req-badge--instack`` badge is injected on
Add, removed on Remove (skipped when ``data-requirement='required'``
— that slot is owned by the Required badge).
* Button text is "Remove" / "+ Add to stack" matching the macro
(was "✓ In stack" which was visually nice but inconsistent).
* Button color class --add / --remove swaps so the destructive Remove
tint kicks in immediately.
The clone-into-My-Stack path applies the same updates so the new card
in My Stack reads identically to a server-rendered in_stack card.
Mirrored in /corporate-memory.
* fix(memory): four Devin-review bugs on /memory drill-down + manifest
PR #333 Devin review surfaced four real bugs that ship a broken
/memory experience even though the unit tests passed.
1. Manifest md5 omits is_required + content (app/api/sync.py:836-840)
_build_memory_domains_section hashed only (id|title|status) per
item. _build_per_domain_markdown routes items between "## Required"
and "## Approved" by is_required and embeds full content — so an
admin edit of either dimension left the manifest md5 unchanged,
`agnes pull` skipped the re-fetch, and the analyst kept a stale
bundle.md. Now both fields participate in the hash.
2. required_count always 0 (src/repositories/memory_domains.py)
list_items_of_domain only SELECTed (id, title, status) so the
`it.get("is_required")` in the manifest builder always evaluated
to None → required_count = 0 regardless of actual state. The
manifest builder advertised a count it could never compute. Now
projects is_required + content too (required by fix 1 anyway).
3. Vote URL 404 (memory_domain_detail.html:289-290)
Constructed `/api/memory/items/{id}/vote` but the route is
`/api/memory/{id}/vote`. Every upvote/downvote button was a
silent no-op.
4. Dismiss/undismiss URL + method both wrong (memory_domain_detail.html:296-305)
Constructed `/api/memory/items/{id}/dismiss` (extra /items/) and
/undismiss (no such route — undismiss is DELETE on /dismiss).
Both buttons silently 404'd. Now POST + DELETE on
`/api/memory/{id}/dismiss` per app/api/memory.py:635/675.
* fix: multi-agent reviewer findings — vendor-token scrubs + manifest md5 predicate + soft-delete filter
Three reviewer findings from the multi-agent review on PR #333,
fixed in-place per CLAUDE.md issue-economy rule.
Reviewer-rules (Important — vendor-agnostic OSS):
- app/main.py:218 comment: replaced 'foundryai-prod' with generic
'a customer prod instance' phrasing. Public OSS repo must not
carry customer-specific tokens (CLAUDE.md § Project conventions).
- tests/test_table_registry_v56_docs.py:70 fixture string:
replaced "user_brand_affiliation = 'groupon'" with 'acme' on
the same rule.
Reviewer-architecture (closes still-unresolved Devin 🚩 ANALYSIS):
- app/api/sync.py _build_memory_domains_section: md5 hash loop now
filters items to the SAME predicate the bundle renderer uses
(is_required OR status='approved'). Pre-fix the hash iterated ALL
items but _build_per_domain_markdown only rendered the union of
required items + approved-non-required items — so an admin edit
to a pending/rejected non-required item flipped the md5 against
an identical-bytes bundle, triggering a wasteful re-fetch on
every analyst's next 'agnes pull'. The earlier commit fixed the
hash-input fields (is_required + content); this closes the
set-of-items asymmetry Devin separately flagged.
Reviewer-RBAC (minor cleanup):
- app/resource_types.py _data_package_blocks and _memory_domain_blocks
now filter 'WHERE deleted_at IS NULL' (v54 soft-delete column) so
the /admin/access UI doesn't surface soft-deleted entities as
grantable. Mirrors the existing filter on _recipe_blocks. No
security leak pre-fix (resolver double-filters and re-checks at
serve time), just UI cleanliness.
- app/services/stack_resolver.py add_to_stack: docstring note
added explaining that authorization is enforced at the API layer
(app/api/stack.py can_access gate), not at the resolver. The
initial review suggested adding a defensive 403 here, but that
broke 5 existing tests that legitimately call add_to_stack
directly without setting up grants first; the docstring captures
the contract instead. stack() already intersects subscriptions
with current available_ids on every read, so a 'zombie' row from
a misuse never leaks into the user-facing manifest.
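The two manifest-hash fixes (hash `is_required` and `content`, and hash only the items the bundle renderer emits) can be illustrated with a minimal sketch. The function names, the field separator and the sort by id are assumptions made for this example; the actual `_build_memory_domains_section` may serialize items differently.

```python
import hashlib
from typing import Any


def is_rendered(item: dict[str, Any]) -> bool:
    """Same predicate the bundle renderer is described as using."""
    return bool(item.get("is_required")) or item.get("status") == "approved"


def domain_items_md5(items: list[dict[str, Any]]) -> str:
    """Hash only rendered items, including every field the bundle embeds."""
    h = hashlib.md5()
    rendered = sorted(
        (it for it in items if is_rendered(it)),
        key=lambda it: str(it["id"]),  # stable order (an assumption here)
    )
    for it in rendered:
        fields = (
            str(it["id"]),
            it.get("title") or "",
            it.get("status") or "",
            "1" if it.get("is_required") else "0",
            it.get("content") or "",
        )
        h.update("|".join(fields).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()
```

With this shape, an edit to a rendered item's content or required flag changes the digest, while an edit to a pending or rejected non-required item leaves it alone, so an unchanged bundle does not trigger a re-fetch.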
* release: 0.55.0 — unified Browse + My Stack (Data Packages + Memory), schema v48→v59, 3 BREAKING
1369 lines
60 KiB
Python
"""Sync endpoints — manifest, trigger, sync-settings, table-subscriptions."""
|
|
|
|
import hashlib
|
|
import logging
|
|
import os
|
|
import subprocess
|
|
import threading
|
|
import time
|
|
import traceback
|
|
from datetime import datetime, timezone
|
|
from pathlib import Path
|
|
from typing import Any, Optional, List
|
|
|
|
from fastapi import APIRouter, Body, Depends, HTTPException, BackgroundTasks
|
|
from pydantic import BaseModel, Field
|
|
import duckdb
|
|
|
|
from app.auth.access import require_admin
|
|
from app.auth.dependencies import get_current_user, _get_db
|
|
from app.auth.scheduler_token import SCHEDULER_USER_EMAIL
|
|
from app.utils import get_data_dir as _get_data_dir
|
|
from src.audit_helpers import client_kind_from_user
|
|
from src.repositories.audit import AuditRepository
|
|
from src.repositories.sync_state import SyncStateRepository
|
|
from src.repositories.sync_settings import SyncSettingsRepository
|
|
from src.repositories.table_registry import TableRegistryRepository
|
|
from src.rbac import can_access_table
|
|
from src.scheduler import filter_due_tables, is_table_due
|
|
|
|
logger = logging.getLogger(__name__)
|
|
router = APIRouter(prefix="/api/sync", tags=["sync"])
|
|
|
|
# Process-wide guard against overlapping `_run_sync` invocations. Two
|
|
# concurrent extractor subprocesses both write `extract.duckdb` and fight
|
|
# for its file lock — the first sync stalls, the second crashes, and the
|
|
# `/api/health` check times out long enough that Docker flips the
|
|
# container to `unhealthy`, which (behind a `reverse_proxy` upstream)
|
|
# bricks external traffic until contention drains. The singleton-ness is
|
|
# enforced both in the trigger handler (return 409 fast, before the work
|
|
# is scheduled) and in `_run_sync` itself (defense in depth, in case
|
|
# something bypasses the handler).
|
|
_sync_lock = threading.Lock()
|
|
|
|
# Race-protection: the trigger handler returns 200 BEFORE the background task
|
|
# acquires ``_sync_lock``. In that ~few-hundred-ms gap, ``/api/sync/status``
|
|
# would honestly report ``locked=False`` — and the host-side
|
|
# ``agnes-auto-upgrade.sh`` defer probe (which polls this endpoint) would
|
|
# proceed with ``docker compose up -d`` and SIGKILL the still-spawning
|
|
# extractor / materialized worker. Mid-sync container kill is the exact
|
|
# class of corruption the WAL replay auto-recovery is meant to be a
|
|
# safety net for, not a routine occurrence.
|
|
#
|
|
# Fix: stamp the trigger time alongside the lock. ``/api/sync/status`` also
|
|
# returns ``locked=True`` for ``_TRIGGER_HOLD_SEC`` seconds after the most
|
|
# recent trigger, even if the background task hasn't yet acquired the lock.
|
|
# The window is short enough that an operator-issued ``/api/sync/trigger``
|
|
# followed by an immediate ``GET /api/sync/status`` is consistent
|
|
# (locked=True), but long enough to cover the schedule → background-task
|
|
# spawn latency. Defense in depth: the real lock still gates the
|
|
# extractor subprocess.
|
|
_TRIGGER_HOLD_SEC = 30
|
|
_recent_trigger_at: float = 0.0 # monotonic clock; 0 = never triggered
|
|
|
|
|
|
def _file_hash(path: Path) -> str:
|
|
if not path.exists():
|
|
return ""
|
|
h = hashlib.md5()
|
|
with open(path, "rb") as f:
|
|
for chunk in iter(lambda: f.read(8192), b""):
|
|
h.update(chunk)
|
|
return h.hexdigest()
|
|
|
|
|
|
def _materialize_table(
|
|
*,
|
|
table_id: str,
|
|
sql: str,
|
|
bq,
|
|
output_dir: str,
|
|
max_bytes: Optional[int],
|
|
) -> dict:
|
|
"""Thin wrapper around `connectors.bigquery.extractor.materialize_query`
|
|
so the trigger pass can be unit-tested by patching this seam without
|
|
touching the real BqAccess factory or the duckdb import."""
|
|
from connectors.bigquery.extractor import materialize_query
|
|
return materialize_query(
|
|
table_id=table_id, sql=sql, bq=bq,
|
|
output_dir=output_dir, max_bytes=max_bytes,
|
|
)
|
|
|
|
|
|
def _run_materialized_pass(
|
|
conn: duckdb.DuckDBPyConnection,
|
|
bq,
|
|
tables: Optional[List[str]] = None,
|
|
) -> dict:
|
|
"""Walk `table_registry` for `query_mode='materialized'` rows and run any
|
|
that are due, dispatching by ``source_type`` to the correct connector's
|
|
materialize_query. Honors per-table `sync_schedule` via `is_table_due()`,
|
|
computes the file hash inline, and updates `sync_state` so the manifest
|
|
can serve the row to `agnes pull` without re-hashing on every request.
|
|
|
|
``tables`` (when not None) restricts the pass to a specific subset —
|
|
targeted re-syncs from the operator (POST /api/sync/trigger with a
|
|
body) need this, otherwise an admin asking to re-sync `kbc_job` would
|
|
re-process every other materialized row that's also due. Matched
|
|
against both the registry id and name (admins often pass either).
|
|
|
|
BigQuery rows go through BqAccess + bigquery_query() (jobs API),
|
|
optionally cost-guarded by ``max_bytes_per_materialize``.
|
|
Keboola rows go through KeboolaAccess + ATTACH-and-COPY, no
|
|
guardrail (extension has no dry-run primitive).
|
|
|
|
Returns:
|
|
``{"materialized": [ids], "skipped": [ids], "errors": [{table, error}]}``
|
|
|
|
Errors are aggregated per row — one budget-blown table doesn't stop a
|
|
healthy sibling. ``MaterializeBudgetError`` is caught and rendered with
|
|
its structured fields so operator alerting can pick out the cap-vs-actual
|
|
bytes from the log line.
|
|
"""
|
|
    from app.instance_config import get_value
    from connectors.bigquery.extractor import MaterializeBudgetError, MaterializeInFlightError

    bq_output_dir = str(Path(_get_data_dir()) / "extracts" / "bigquery")
    kb_output_dir = Path(_get_data_dir()) / "extracts" / "keboola" / "data"

    # Sentinel: max_bytes <= 0 (or None) disables the guardrail. `get_value()`
    # treats YAML `null` as "missing" → returns the default; operators must use
    # the explicit `0` sentinel to disable. See config/instance.yaml.example.
    # YAML accepts floats too (e.g. `10737418240.0`), and operators may
    # write `1e10` for readability; coerce to int and tolerate non-numeric
    # entries by falling through to the disable path with a warning.
    raw_max = get_value(
        "data_source", "bigquery", "max_bytes_per_materialize",
        default=10 * 2**30,
    )
    try:
        n = int(raw_max) if raw_max is not None else 0
    except (TypeError, ValueError):
        logger.warning(
            "data_source.bigquery.max_bytes_per_materialize is not numeric "
            "(%r); cost guardrail disabled. Set an integer or 0 to disable.",
            raw_max,
        )
        n = 0
    bq_max_bytes = n if n > 0 else None

    registry = TableRegistryRepository(conn)
    state = SyncStateRepository(conn)

    summary = {"materialized": [], "skipped": [], "errors": []}
    keboola_access = None  # lazy-init on first Keboola row

    # Targeted-trigger filter. Compare against both id and name so an admin
    # who passes either form (the registry id slug, or the human-friendly
    # name) gets the same result. `None` means "no filter — process all
    # due materialized rows".
    target_set: Optional[set] = (
        set(tables) if tables is not None else None
    )

    for row in registry.list_all():
        if row.get("query_mode") != "materialized":
            continue

        # Convention across connectors: sync_state.table_id and the parquet
        # filename are keyed by `table_registry.name` (matches Keboola's
        # `_meta.table_name`) so the manifest's `registry_by_name` lookup
        # at `_build_manifest_for_user` resolves cleanly. Without this,
        # admins who register `name="Orders_90d"` (id slugified to
        # `orders_90d`) would see `query_mode` default to `"local"` in the
        # manifest because the lookup misses on `id`.
        ref_name = row["name"]

        if target_set is not None and not (
            ref_name in target_set or row.get("id") in target_set
        ):
            summary["skipped"].append(
                {"table": ref_name, "reason": "not_in_target"}
            )
            continue

        last = state.get_last_sync(ref_name)
        last_iso = last.isoformat() if last else None
        # Per-table schedule wins; fall through to AGNES_DEFAULT_SYNC_SCHEDULE
        # (operator override), then to ``every 1h`` (OSS-historical default).
        # The env knob lets a deployment dial down the platform-wide refresh
        # cadence without having to PUT every registry row — useful when
        # data freshness budget is "once per day" and the hourly default
        # over-fetches.
        schedule = (
            row.get("sync_schedule")
            or os.environ.get("AGNES_DEFAULT_SYNC_SCHEDULE", "").strip()
            or "every 1h"
        )
        if not is_table_due(schedule, last_iso):
            summary["skipped"].append({"table": ref_name, "reason": "due_check"})
            continue

        source_type = row.get("source_type") or "bigquery"  # legacy default

        # Dispatch by source_type. BQ rows keep using `_materialize_table`
        # (the existing test seam); Keboola rows use the new Keboola
        # materialize_query via a lazily-initialized KeboolaStorageClient.
        try:
            if source_type == "bigquery":
                stats = _materialize_table(
                    table_id=ref_name,
                    sql=row["source_query"],
                    bq=bq,
                    output_dir=bq_output_dir,
                    max_bytes=bq_max_bytes,
                )
            elif source_type == "keboola":
                if keboola_access is None:
                    # Lazy-init the Storage API client (replaces the old
                    # DuckDB extension `KeboolaAccess`). One client is shared
                    # across all keboola materialized rows in this pass —
                    # `requests.Session` inside it is thread-safe and reuses
                    # the connection pool for HTTP keep-alive across rows.
                    # Variable name kept as `keboola_access` to minimise
                    # diff churn against the surrounding error-handling
                    # block; the type is now `KeboolaStorageClient`.
                    from connectors.keboola.storage_api import KeboolaStorageClient
                    keboola_url = get_value(
                        "data_source", "keboola", "stack_url", default=""
                    ) or os.environ.get("KEBOOLA_STACK_URL", "")
                    token_env = get_value(
                        "data_source", "keboola", "token_env",
                        default="KEBOOLA_STORAGE_TOKEN",
                    ) or "KEBOOLA_STORAGE_TOKEN"
                    keboola_token = os.environ.get(token_env, "")
                    if not (keboola_url and keboola_token):
                        summary["errors"].append({
                            "table": ref_name,
                            "error": (
                                "Keboola URL/token not configured for "
                                "materialized path (data_source.keboola.stack_url "
                                f"+ env {token_env})"
                            ),
                        })
                        continue
                    keboola_access = KeboolaStorageClient(
                        url=keboola_url, token=keboola_token,
                    )
                    kb_output_dir.mkdir(parents=True, exist_ok=True)
                from connectors.keboola.extractor import (
                    materialize_query as kb_materialize_query,
                )
                # Storage API needs the bucket+table split — registry rows
                # carry both fields per the standard register-table schema.
                bucket = row.get("bucket", "")
                source_table = row.get("source_table") or ref_name
                if not bucket:
                    summary["errors"].append({
                        "table": ref_name,
                        "error": (
                            "materialized keboola row is missing 'bucket'; "
                            "re-register with --bucket <in.c-...>"
                        ),
                    })
                    continue
                kb_stats = kb_materialize_query(
                    table_id=ref_name,
                    bucket=bucket,
                    source_table=source_table,
                    source_query=row.get("source_query"),
                    storage_client=keboola_access,
                    output_dir=kb_output_dir,
                )
                # Normalize Keboola materialize_query output to the shape the
                # BQ branch uses for downstream sync_state updates. KB returns
                # {table_id, path, rows, bytes, md5}; map to
                # {rows, size_bytes, hash}.
                stats = {
                    "rows": kb_stats["rows"],
                    "size_bytes": kb_stats["bytes"],
                    "hash": kb_stats["md5"],
                    "query_mode": "materialized",
                }
            else:
                summary["errors"].append({
                    "table": ref_name,
                    "error": (
                        "materialized path not supported for "
                        f"source_type={source_type!r}"
                    ),
                })
                continue
        except MaterializeInFlightError:
            # In-flight on a sibling worker / scheduler tick — treat as
            # 'skipped, in-flight'. Do NOT call state.set_error: that
            # would flip status='error' on a healthy concurrent run and
            # the registry UI would surface a false-positive failure.
            summary["skipped"].append({"table": ref_name, "reason": "in_flight"})
            continue
        except MaterializeBudgetError as e:
            logger.warning(
                "Materialize cap exceeded for %s: %s bytes > %s bytes",
                e.table_id, f"{e.current:,}", f"{e.limit:,}",
            )
            summary["errors"].append({
                "table": ref_name,
                "error": str(e),
                "current": e.current,
                "limit": e.limit,
            })
            # Persist the failure so `GET /api/admin/registry` can surface
            # `last_sync_error` to the admin UI / `agnes admin status`.
            # Without this, scheduler stderr was the only place the cap
            # failure showed up and operators had no API path to it.
            state.set_error(ref_name, str(e))
            continue
        except Exception as e:
            logger.exception("Materialize failed for %s", ref_name)
            summary["errors"].append({"table": ref_name, "error": str(e)})
            state.set_error(ref_name, str(e))
            continue

        # `materialize_query` returns the parquet's MD5 inline — hashing
        # there means we don't re-read a multi-GB file on the request
        # thread. Fallback to `_file_hash(parquet_path)` if for some
        # reason the stats dict didn't carry it (defensive).
        parquet_hash = stats.get("hash")
        if not parquet_hash:
            output_dir_for_hash = (
                bq_output_dir if source_type == "bigquery" else str(kb_output_dir.parent)
            )
            parquet_path = Path(output_dir_for_hash) / "data" / f"{ref_name}.parquet"
            parquet_hash = _file_hash(parquet_path)
        # `update_sync` resets `status='ok'` / `error=NULL` on the upsert
        # path (its argument defaults), so a row that previously errored
        # has the failure cleared by this call. No separate clear_error
        # needed here — the test invariant is that a successful materialize
        # leaves status='ok' and error='', which `update_sync` already
        # establishes.
        state.update_sync(
            table_id=ref_name,
            rows=stats["rows"],
            file_size_bytes=stats["size_bytes"],
            hash=parquet_hash,
        )
        summary["materialized"].append(ref_name)

    return summary


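The `max_bytes_per_materialize` handling in `_run_materialized_pass` can be read as a small pure function. The following is a minimal sketch that mirrors the coercion rules described in the comments there; `coerce_max_bytes` is a hypothetical helper name used only for illustration and is not part of the module.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def coerce_max_bytes(raw: Any) -> Optional[int]:
    """Return a positive byte cap, or None when the guardrail is disabled.

    Hypothetical helper: None, zero, negative and non-numeric values all
    disable the cap; ints and floats are truncated to int.
    """
    try:
        n = int(raw) if raw is not None else 0
    except (TypeError, ValueError):
        logger.warning(
            "max_bytes_per_materialize is not numeric (%r); guardrail disabled", raw
        )
        n = 0
    return n if n > 0 else None


print(coerce_max_bytes(10 * 2**30))     # 10737418240
print(coerce_max_bytes(10737418240.0))  # 10737418240
print(coerce_max_bytes(0))              # None
print(coerce_max_bytes("ten gigs"))     # None
```

Note that a value that reaches this code as the string `"1e10"` is rejected by `int()` and therefore disables the cap with a warning; only a value the YAML loader has already parsed as a float is truncated to an integer.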
def _run_sync(tables: Optional[List[str]] = None):
    """Run extractor as subprocess + orchestrator rebuild.

    Reads table configs from DuckDB (in main process which has the shared
    connection), passes them as JSON via stdin to the extractor subprocess.
    This avoids DuckDB lock conflicts — subprocess never opens system.duckdb.

    Singleton: only one invocation runs at a time per process (see
    `_sync_lock` module-level). The trigger handler also fast-fails with
    409 when the lock is held, so this branch is defense in depth.
    """
    import json as _json
    import sys as _sys

    if not _sync_lock.acquire(blocking=False):
        print(
            "[SYNC] another sync is already in flight — skipping",
            file=_sys.stderr, flush=True,
        )
        return

    try:
        from app.instance_config import get_data_source_type, get_value
        from src.db import get_system_db

        source_type = get_data_source_type()
        data_dir = _get_data_dir()

        # Read table configs in main process (has shared DuckDB connection)
        sys_conn = get_system_db()
        # Track whether the REGISTRY (not the post-filter list) was empty.
        # Auto-discovery must only fire on a truly empty registry; if the
        # filter returned [] because nothing was due, re-discovering would
        # bypass the schedule entirely on Keboola instances. (Devin BUG_0001
        # on ebb8cc9.)
        registry_has_tables = False
        try:
            repo = TableRegistryRepository(sys_conn)
            if tables:
                # Manual operator override — bypass schedule filter entirely
                # so an admin saying "sync these specific tables now" wins.
                all_configs = [repo.get(t) for t in tables]
                table_configs = [c for c in all_configs if c is not None]
                registry_has_tables = bool(table_configs)
            else:
                table_configs = repo.list_local(source_type) if source_type else repo.list_local()
                # Auto-discover gate must consider the WHOLE registry, not
                # just `local` rows. After the Keboola migration to
                # materialized (v25→v26), an instance can have 30
                # materialized Keboola rows and zero local rows — but
                # `bool(table_configs)` here would be False, and
                # `not registry_has_tables` would re-trigger
                # `_discover_and_register_tables` on every scheduler tick,
                # creating duplicate "auto-discovered" rows with the wrong
                # bucket prefix every time.
                # Use list_all (any source, any mode) for the gate.
                registry_has_tables = bool(repo.list_all())
                # Without this filter, every scheduler tick would re-sync
                # every table regardless of its sync_schedule cadence,
                # making the field a no-op at trigger time. Tables with
                # no schedule pass through unchanged (opt-in feature).
                state_repo = SyncStateRepository(sys_conn)
                table_configs = filter_due_tables(table_configs, state_repo)
        finally:
            sys_conn.close()

        if not table_configs:
            # Auto-discover tables on first sync when registry is empty.
            # `not registry_has_tables` is the load-bearing guard — without
            # it, "filter excluded everything" looks identical to "registry
            # empty" and we'd re-discover + re-sync every tick regardless of
            # sync_schedule.
            if not registry_has_tables and source_type == "keboola" and os.environ.get("KEBOOLA_STORAGE_TOKEN"):
                logger.info("No tables registered — running auto-discovery from Keboola")
                try:
                    from app.api.admin import _discover_and_register_tables
                    auto_conn = get_system_db()
                    try:
                        result = _discover_and_register_tables(auto_conn, "auto-discovery")
                        logger.info("Auto-discovered %d tables, skipped %d", result["registered"], result["skipped"])
                    finally:
                        auto_conn.close()
                    # Re-read table configs after auto-registration
                    sys_conn2 = get_system_db()
                    try:
                        table_configs = TableRegistryRepository(sys_conn2).list_local(source_type)
                    finally:
                        sys_conn2.close()
                except Exception as e:
                    logger.warning("Auto-discovery failed: %s", e)

        # CRITICAL: don't early-return when local-mode tables are empty.
        # `list_local("bigquery")` is always empty on BQ-only deployments
        # (BQ rows are always remote or materialized, never local), so an
        # early return would prevent the materialized pass AND the
        # orchestrator rebuild from ever firing on a BQ-only instance.
        # Devin BUG_0002 on PR #148 commit 2fa44f2. Just flag whether the
        # Keboola subprocess + custom-connectors should run; everything
        # below (materialized pass, orchestrator rebuild, profiler) runs
        # unconditionally so a registry with materialized rows but no
        # local rows still publishes them.
        run_extractor_subprocess = bool(table_configs)
        if not run_extractor_subprocess:
            logger.info(
                "No local-mode tables to sync for source_type=%s — "
                "skipping extractor subprocess; materialized pass + "
                "orchestrator rebuild still run.",
                source_type,
            )

        env = {**os.environ}

        if run_extractor_subprocess:
            # v26: incremental + partitioned strategies need last_sync from
            # sync_state to compute changedSince. The subprocess MUST NOT
            # reopen system.duckdb (parent holds the lock — see contract at
            # the top of this function), so the parent reads watermarks
            # here and injects them into each table_config under the key
            # `__last_sync__`. extractor.run() picks them up via
            # _read_last_sync's first-check-config-then-fall-back pattern.
            ws_conn = get_system_db()
            try:
                ws_repo = SyncStateRepository(ws_conn)
                for tc in table_configs:
                    if tc.get("sync_strategy") in ("incremental", "partitioned"):
                        state = ws_repo.get_table_state(tc.get("id") or tc.get("name"))
                        if state and state.get("status") != "error":
                            ls = state.get("last_sync")
                            if ls is not None:
                                tc["__last_sync__"] = ls
            finally:
                ws_conn.close()

            # Serialize configs — strip non-serializable fields
            serializable = []
            for tc in table_configs:
                serializable.append({k: (v.isoformat() if hasattr(v, 'isoformat') else v)
                                     for k, v in tc.items() if v is not None})

            # Run extractor subprocess with table configs via stdin
            # Subprocess does NOT open system.duckdb — no lock conflict
            cmd = [_sys.executable, "-c", """
import json, sys, os, logging, signal
from pathlib import Path

# Subprocess inherits no logging config — without basicConfig, Python's
# lastResort handler only surfaces WARNING+ to stderr and INFO-level
# extraction progress from connectors.keboola.extractor.run() is silently
# dropped. capture_output=True in the parent then swallows the rest.
# Devin BUG_0002 on PR #136 review.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Convert SIGTERM into a controlled SystemExit so the ProcessPoolExecutor
# `with` block in connectors.keboola.extractor.run() runs its __exit__
# (shutdown/wait_for_workers) before this process dies. Without this,
# SIGTERM kills this process (the pool's parent) abruptly, leaving the OS
# to clean up the pool children — but each worker holds an open Keboola
# Storage export job whose lifetime is tied to the HTTP poll loop, and
# those leak until the Keboola side TTLs them out. The signal comes from
# app.api.sync._run_sync, which launches this script with
# `subprocess.Popen(start_new_session=True)` and sends
# `os.killpg(SIGTERM)` on timeout.
def _exit_on_sigterm(signum, frame):
    sys.exit(143)
signal.signal(signal.SIGTERM, _exit_on_sigterm)

configs = json.load(sys.stdin)
url = os.environ.get("KEBOOLA_STACK_URL", "")
token = os.environ.get("KEBOOLA_STORAGE_TOKEN", "")

if not url or not token:
    print("ERROR: Missing KEBOOLA_STACK_URL or KEBOOLA_STORAGE_TOKEN", file=sys.stderr)
    sys.exit(1)

from connectors.keboola.extractor import run, compute_exit_code
data_dir = Path(os.environ.get("DATA_DIR", "./data"))
result = run(str(data_dir / "extracts" / "keboola"), configs, url, token)
print(json.dumps(result))
# Issue #81 Group B: surface partial-failure as exit 2 so the API
# caller can distinguish "every table failed" from "9/10 succeeded".
sys.exit(compute_exit_code(result, len(configs)))
"""]

            print(f"[SYNC] Starting extractor subprocess for {len(table_configs)} tables", file=_sys.stderr, flush=True)

            # Run in a new process group (start_new_session=True) so a
            # timeout can take down the whole tree — the extractor itself
            # plus any ProcessPoolExecutor workers it spawned for parallel
            # legacy-fallback. Without this, plain `subprocess.run` on
            # timeout SIGKILLs only the immediate child; the pool workers
            # are reparented to PID 1 and continue holding open Keboola
            # Storage export jobs, blocking the next sync cycle's
            # connectivity to those same job IDs.
            extractor_timeout = int(os.environ.get("AGNES_EXTRACTOR_TIMEOUT_SEC", "3600"))
            proc = subprocess.Popen(
                cmd,
                stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                text=True, env=env,
                cwd=str(Path(__file__).parent.parent.parent),
                start_new_session=True,
            )
            try:
                stdout, stderr = proc.communicate(input=_json.dumps(serializable), timeout=extractor_timeout)
                result = subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
            except subprocess.TimeoutExpired:
                # SIGTERM the whole process group first to give workers a
                # chance to shut down cleanly (release Keboola export jobs,
                # close DuckDB conns), then SIGKILL the stragglers after a
                # short grace window.
                import signal
                try:
                    os.killpg(proc.pid, signal.SIGTERM)
                except ProcessLookupError:
                    pass
                try:
                    proc.communicate(timeout=10)
                except subprocess.TimeoutExpired:
                    try:
                        os.killpg(proc.pid, signal.SIGKILL)
                    except ProcessLookupError:
                        pass
                    try:
                        proc.communicate(timeout=5)
                    except subprocess.TimeoutExpired:
                        pass
                # Catch the timeout LOCALLY so the materialized BQ pass and
                # orchestrator rebuild below still fire — pre-fix the timeout
                # propagated to the outer except handler and skipped the rest
                # of `_run_sync` (Devin BUG_0001 on PR #148 commit 2219255).
                print(
                    f"[SYNC] Extractor timed out after {extractor_timeout}s — process "
                    "group killed; continuing to materialized pass + orchestrator rebuild",
                    file=_sys.stderr, flush=True,
                )
                result = None

            if result is not None:
                if result.stdout:
                    print(f"[SYNC] Extractor stdout: {result.stdout.strip()[-500:]}", file=_sys.stderr, flush=True)
                if result.stderr:
                    print(f"[SYNC] Extractor stderr: {result.stderr[-500:]}", file=_sys.stderr, flush=True)
                # Issue #81 Group B: three exit codes. 0 = full success,
                # 1 = full failure, 2 = partial. Partial is a data-quality
                # alert, not a crash — the orchestrator's per-table _meta
                # machinery already captured which tables succeeded; we just
                # need to log loudly so operator alerting can pick it up.
                if result.returncode == 0:
                    print("[SYNC] Extractor OK", file=_sys.stderr, flush=True)
                elif result.returncode == 2:
                    print(
                        "[SYNC] Extractor PARTIAL FAILURE (exit 2) — some tables "
                        "succeeded, some failed; see stderr for per-table errors. "
                        "Successful tables will still be published by the orchestrator.",
                        file=_sys.stderr, flush=True,
                    )
                else:
                    print(f"[SYNC] Extractor FAILED (exit {result.returncode})", file=_sys.stderr, flush=True)

            # Run custom connectors (Tier A: local mount) — only when there
            # were local-mode tables to drive the extractor. Custom connectors
            # currently piggyback on the same env as the Keboola extractor.
            connectors_dir = Path(os.environ.get("CONNECTORS_DIR", str(Path(__file__).parent.parent.parent / "connectors" / "custom")))
            if connectors_dir.exists():
                for connector_dir in sorted(connectors_dir.iterdir()):
                    if not connector_dir.is_dir():
                        continue
                    extractor = connector_dir / "extractor.py"
                    if not extractor.exists():
                        continue
                    logger.info("Running custom connector: %s", connector_dir.name)
                    try:
                        custom_result = subprocess.run(
                            [_sys.executable, str(extractor)],
                            env=env, capture_output=True, text=True, timeout=600,
                            cwd=str(Path(__file__).parent.parent.parent),
                        )
                        if custom_result.returncode != 0:
                            logger.error("Custom connector %s failed: %s", connector_dir.name, custom_result.stderr[-500:])
                        else:
                            logger.info("Custom connector %s completed", connector_dir.name)
                    except subprocess.TimeoutExpired:
                        logger.error("Custom connector %s timed out", connector_dir.name)

        # Materialized SQL pass: runs admin-registered SQL through each
        # source's access layer (BQ via BqAccess, Keboola via
        # KeboolaStorageClient) and writes parquet for due rows.
        # _run_materialized_pass itself dispatches by source_type, so we
        # always run it regardless of which source types (one or both) have
        # a `project` / `stack_url` set. Keboola-only instances would
        # otherwise silently skip Keboola materialized rows just because no
        # BQ project is configured (Devin finding 2026-05-01:
        # BUG_pr-review-job-3fbd31c9_0001). The BQ branch inside
        # _run_materialized_pass uses a per-row try/except so the sentinel
        # BqAccess (not_configured) raises a typed error that gets recorded
        # against that row only, with no cascade.
        try:
            from connectors.bigquery.access import get_bq_access
            from src.db import get_system_db as _get_system_db
            bq_access = get_bq_access()  # sentinel if no BQ project; OK
            mat_conn = _get_system_db()
            try:
                mat_summary = _run_materialized_pass(
                    mat_conn, bq_access, tables=tables,
                )
            finally:
                mat_conn.close()
            skipped_count = len(mat_summary["skipped"])
            in_flight_count = sum(
                1 for s in mat_summary["skipped"] if s.get("reason") == "in_flight"
            )
            print(
                f"[SYNC] Materialized SQL: {len(mat_summary['materialized'])} ok, "
                f"{skipped_count} skipped (in_flight={in_flight_count}), "
                f"{len(mat_summary['errors'])} errors",
                file=_sys.stderr, flush=True,
            )
            for err in mat_summary["errors"]:
                print(
                    f"[SYNC]   {err['table']}: {err['error']}",
                    file=_sys.stderr, flush=True,
                )
        except Exception as e:
            print(
                f"[SYNC] Materialized SQL pass FAILED: {e}",
                file=_sys.stderr, flush=True,
            )
            traceback.print_exc()

        # Rebuild master views (reads extract.duckdb files, no write conflict)
        from src.orchestrator import SyncOrchestrator
        orch = SyncOrchestrator()
        views = orch.rebuild()
        print(f"[SYNC] Orchestrator rebuild: {{{', '.join(f'{k}: {len(v)}' for k, v in views.items())}}}", file=_sys.stderr, flush=True)

        # Auto-profile synced tables (best-effort, don't fail sync on profile error)
        try:
            from src.profiler import profile_table, TableInfo
            from src.repositories.profiles import ProfileRepository

            data_dir = Path(os.environ.get("DATA_DIR", "./data"))
            extracts_dir = data_dir / "extracts"

            sys_conn = get_system_db()
            try:
                profile_repo = ProfileRepository(sys_conn)
                profiled = 0
                for source_name, table_names in views.items():
                    for table_name in table_names[:10]:  # Limit per sync
                        pq_path = extracts_dir / source_name / "data" / f"{table_name}.parquet"
                        if not pq_path.exists():
                            continue
                        try:
                            table_info = TableInfo(name=table_name, table_id=table_name)
                            profile = profile_table(table_info, pq_path, [], {}, {})
                            profile_repo.save(table_name, profile)
                            profiled += 1
                        except Exception as pe:
                            print(f"[SYNC] Profile {table_name}: {pe}", file=_sys.stderr, flush=True)
                print(f"[SYNC] Profiled {profiled} tables", file=_sys.stderr, flush=True)
            finally:
                sys_conn.close()
        except Exception as e:
            print(f"[SYNC] Profiler skipped: {e}", file=_sys.stderr, flush=True)

    except subprocess.TimeoutExpired:
        # Outer-handler fallback for any subprocess.run call site (e.g.
        # custom-connectors above) that didn't already catch its own
        # TimeoutExpired. The concrete timeout value isn't available here,
        # so log generically.
        print("[SYNC] Extractor subprocess timed out", file=_sys.stderr, flush=True)
    except Exception as e:
        print(f"[SYNC] FAILED: {e}", file=_sys.stderr, flush=True)
        traceback.print_exc()
    finally:
        _sync_lock.release()


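The timeout handling in `_run_sync` (new session, SIGTERM to the process group, then SIGKILL after a grace window) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the module's code: `run_with_group_timeout` and its `grace` parameter are hypothetical, it uses `wait()` instead of `communicate()` because it pipes no I/O, and it is POSIX-only because it relies on `os.killpg`.

```python
import os
import signal
import subprocess
import sys
from typing import List, Optional


def run_with_group_timeout(cmd: List[str], timeout: float, grace: float = 2.0) -> Optional[int]:
    """Run cmd in its own session; on timeout, signal the whole process group.

    Returns the exit code, or None if the command timed out.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process-group id for os.killpg.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate: polite SIGTERM first, SIGKILL if the group outlives `grace`.
        for sig in (signal.SIGTERM, signal.SIGKILL):
            try:
                os.killpg(proc.pid, sig)
            except ProcessLookupError:
                break  # the group is already gone
            try:
                proc.wait(timeout=grace)
                break
            except subprocess.TimeoutExpired:
                continue
        return None


slow = [sys.executable, "-c", "import time; time.sleep(60)"]
quick = [sys.executable, "-c", "pass"]
print(run_with_group_timeout(slow, timeout=0.5))  # None
print(run_with_group_timeout(quick, timeout=30))  # 0
```

Returning `None` rather than raising corresponds to the `result = None` branch above, which lets the caller carry on with later stages after a timeout.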
# ---- Manifest ----

def _table_manifest_entry(state: dict, reg: dict) -> dict:
    """Shape one ``sync_state`` row + registry metadata into the per-table
    manifest object used in ``data_packages[].tables`` and ``direct_tables``.

    Tolerant to empty ``state`` (table is registered but never synced) and
    empty ``reg`` (sync_state row outlives the registry — race on unregister).
    Both happen in real installs; the manifest is the read path so we must
    not blow up on a partially-consistent snapshot.
    """
    name = state.get("table_id") or reg.get("name") or reg.get("id") or ""
    return {
        "id": reg.get("id") or name,
        "name": name,
        "hash": state.get("hash", ""),
        "md5": state.get("hash", ""),
        "size_bytes": state.get("file_size_bytes", 0),
        "rows": state.get("rows", 0),
        "query_mode": reg.get("query_mode") or "local",
        "source_type": reg.get("source_type") or "",
        "updated": (
            state.get("last_sync").isoformat() if state.get("last_sync") else None
        ),
    }


def _build_data_packages_section(
    conn, user: dict, registry_by_name: dict, states_by_table_id: dict
) -> tuple[list, set]:
    """Build the ``data_packages`` array per Section 5.1 of the design.

    Returns the list plus a set of ``table_registry.id`` values that were
    surfaced via at least one package — used to subtract from
    ``direct_tables`` so a table belonging to a package doesn't double-render.
    """
    from app.resource_types import ResourceType
    from app.services.stack_resolver import StackResolver
    from src.repositories.data_packages import DataPackagesRepository

    resolver = StackResolver(conn)
    pkg_entries = resolver.stack(user["id"], ResourceType.DATA_PACKAGE)
    if not pkg_entries:
        return [], set()
    repo = DataPackagesRepository(conn)
    packaged_table_ids: set = set()
    out: list = []
    for entry in pkg_entries:
        pkg = repo.get(entry.id)
        if not pkg:
            continue
        table_rows = repo.list_tables(entry.id)
        tables_payload: list = []
        total_size_bytes = 0
        for t in table_rows:
            packaged_table_ids.add(t["id"])
            # registry_by_name keys on name; sync_state.table_id mirrors
            # registry.name today. Cover the id↔name asymmetry.
            reg = registry_by_name.get(t["name"]) or {}
            state = (
                states_by_table_id.get(t["name"])
                or states_by_table_id.get(t["id"])
                or {}
            )
            entry_obj = _table_manifest_entry(state, reg or {"id": t["id"]})
            tables_payload.append(entry_obj)
            total_size_bytes += int(entry_obj.get("size_bytes") or 0)
        out.append({
            "id": pkg["id"],
            "slug": pkg["slug"],
            "name": pkg["name"],
            "icon": pkg.get("icon"),
            "color": pkg.get("color"),
            "description": pkg.get("description"),
            "requirement": entry.requirement,
            "tables": tables_payload,
            "total_size_bytes": total_size_bytes,
        })
    return out, packaged_table_ids


def _build_memory_domains_section(conn, user: dict) -> list:
    """Build the ``memory_domains`` array per Section 5.1.

    Each entry carries a per-domain ``md5`` derived from the concatenated
    item content/titles inside the domain — when the bundle changes the
    md5 flips so the CLI knows to re-fetch.

    TODO(phase-7): ``bundle_url`` points at a yet-to-implement per-domain
    bundle endpoint (``/api/memory/bundle?domain=<slug>``). The CLI in
    Phase 7 will need it; for now we emit the URL the future endpoint
    will live at so older clients keep parsing the manifest cleanly.
    """
    from app.resource_types import ResourceType
    from app.services.stack_resolver import StackResolver
    from src.repositories.memory_domains import MemoryDomainsRepository

    resolver = StackResolver(conn)
    dom_entries = resolver.stack(user["id"], ResourceType.MEMORY_DOMAIN)
    if not dom_entries:
        return []
    repo = MemoryDomainsRepository(conn)
    out: list = []
    for entry in dom_entries:
        dom = repo.get(entry.id)
        if not dom:
            continue
        items = repo.list_items_of_domain(entry.id, limit=10000)
        # Per-domain md5 — concatenate sorted item tuples so the hash
        # is stable under list ordering and flips on any content
        # mutation. MUST include ``is_required`` and ``content``
        # because the bundle rendered by ``_build_per_domain_markdown``
        # routes items between "## Required" and "## Approved" by
        # ``is_required`` and embeds the full ``content`` body; without
        # these in the hash, an admin edit of either dimension leaves
        # the manifest md5 unchanged → ``agnes pull`` skips the
        # re-fetch → analyst keeps a stale bundle.md.
        #
        # Filter to the SAME predicate the renderer uses (any
        # ``is_required`` item OR ``status='approved' AND not is_required``)
        # so edits to pending/rejected non-required items don't flip the
        # md5 against an identical-bytes bundle — the original Devin
        # review flagged this asymmetry (BUG-0001 fixed the hash inputs;
        # this commit closes the matching 🚩 ANALYSIS that the SET of
        # items hashed must also match what the renderer emits).
        h = hashlib.md5()
        renderable = [
            it for it in items
            if it.get("is_required") or it.get("status") == "approved"
        ]
        for it in sorted(renderable, key=lambda r: r["id"]):
            h.update(
                f"{it['id']}|{it.get('title','')}|{it.get('status','')}|"
                f"{it.get('is_required', False)}|{it.get('content','')}|".encode()
            )
        required_count = sum(
            1 for it in items
            if (it.get("status") == "approved" and it.get("is_required"))
        )
        out.append({
            "id": dom["id"],
            "slug": dom["slug"],
            "name": dom["name"],
            "icon": dom.get("icon"),
            "color": dom.get("color"),
            "description": dom.get("description"),
            "requirement": entry.requirement,
            "bundle_url": f"/api/memory/bundle?domain={dom['slug']}",
            "md5": h.hexdigest(),
            "items_count": len(items),
            "required_count": required_count,
        })
    return out


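The per-domain md5 in `_build_memory_domains_section` has three properties worth seeing in isolation: it ignores list order, it ignores edits to items the renderer would not emit, and it flips when a rendered item's content changes. The sketch below restates the same hashing steps as a standalone function; `domain_md5` is a hypothetical name and the item dicts are made-up sample data.

```python
import hashlib


def domain_md5(items: list) -> str:
    """Hash the renderable items of one domain, independent of list order."""
    renderable = [
        it for it in items
        if it.get("is_required") or it.get("status") == "approved"
    ]
    h = hashlib.md5()
    for it in sorted(renderable, key=lambda r: r["id"]):
        h.update(
            f"{it['id']}|{it.get('title', '')}|{it.get('status', '')}|"
            f"{it.get('is_required', False)}|{it.get('content', '')}|".encode()
        )
    return h.hexdigest()


# Sample items (illustrative only).
a = {"id": "a", "title": "A", "status": "approved", "is_required": False, "content": "x"}
b = {"id": "b", "title": "B", "status": "pending", "is_required": False, "content": "y"}

print(domain_md5([a, b]) == domain_md5([b, a]))                      # True: order-independent
print(domain_md5([a, b]) == domain_md5([a, {**b, "content": "z"}]))  # True: b is not rendered
print(domain_md5([a]) == domain_md5([{**a, "content": "z"}]))        # False: rendered content changed
```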
def _build_direct_tables_section(
    conn, user: dict, registry_by_name: dict, states_by_table_id: dict,
    packaged_table_ids: set,
) -> list:
    """Tables granted via ``TABLE`` resource_type (not DATA_PACKAGE).

    A table granted both directly AND via a package only shows up under the
    package — Section 5.1's BC story is that ``tables[]`` (legacy) still
    lists everything, while ``direct_tables[]`` is the de-duplicated
    forward-compatible projection.
    """
    group_ids = [
        r[0] for r in conn.execute(
            "SELECT group_id FROM user_group_members WHERE user_id = ?",
            [user["id"]],
        ).fetchall()
    ]
    if not group_ids:
        return []
    placeholders = ",".join(["?"] * len(group_ids))
    rows = conn.execute(
        f"""SELECT DISTINCT resource_id FROM resource_grants
            WHERE group_id IN ({placeholders})
              AND resource_type = 'table'""",
        group_ids,
    ).fetchall()
    direct_ids = {r[0] for r in rows} - packaged_table_ids
    out: list = []
    for tid in direct_ids:
        # resource_grants.resource_id for ``TABLE`` is canonically the
        # registry id; fall back to name lookup if migration left a name.
        reg = None
        for r in registry_by_name.values():
            if r.get("id") == tid:
                reg = r
                break
        if reg is None:
            reg = registry_by_name.get(tid) or {}
        state = states_by_table_id.get(reg.get("name") or tid) or {}
        out.append(_table_manifest_entry(state, reg))
    return out


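The id-then-name lookup in `_build_direct_tables_section` scans every registry row for each granted id. One possible alternative, shown here only as a sketch, is to build an id-keyed index once and consult it before falling back to the name-keyed dict. `resolve_registry_row` and `registry_by_id` are hypothetical names, and the sketch assumes registry ids are unique; the sample rows reuse the `Orders_90d` / `orders_90d` example from the comments earlier in this module.

```python
def resolve_registry_row(tid: str, registry_by_name: dict, registry_by_id: dict) -> dict:
    """Look up by registry id first, then by name; return {} when neither matches."""
    return registry_by_id.get(tid) or registry_by_name.get(tid) or {}


# Sample registry rows (illustrative only).
registry_by_name = {
    "Orders_90d": {"id": "orders_90d", "name": "Orders_90d"},
    "kbc_job": {"id": "kbc_job", "name": "kbc_job"},
}
# Built once per manifest instead of scanning the values for every grant.
registry_by_id = {row["id"]: row for row in registry_by_name.values()}

print(resolve_registry_row("orders_90d", registry_by_name, registry_by_id)["name"])  # Orders_90d
print(resolve_registry_row("Orders_90d", registry_by_name, registry_by_id)["name"])  # Orders_90d
print(resolve_registry_row("missing", registry_by_name, registry_by_id))             # {}
```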
def _build_manifest_for_user(conn, user: dict) -> dict:
    """Build manifest dict filtered by user's accessible tables.

    Joins ``sync_state`` with ``table_registry`` so each table entry exposes
    ``query_mode`` and ``source_type``. The CLI uses these to decide whether
    to download a parquet (local) or skip it (remote, e.g. BigQuery views).

    Defensive defaults: if a sync_state row has no matching registry entry
    (race / manual deletion), fall back to ``query_mode='local'`` and
    ``source_type=''`` so the manifest still serializes cleanly.

    v49: extended with ``data_packages`` / ``memory_domains`` /
    ``direct_tables`` arrays per Section 5.1 of the unified-stack design.
    Legacy ``tables`` dict stays in parallel for one release — older CLIs
    still parse it; newer clients prefer the typed sections.
    """
    sync_repo = SyncStateRepository(conn)
    table_repo = TableRegistryRepository(conn)
    all_states = sync_repo.get_all_states()
    # `sync_state.table_id` is sourced from `_meta.table_name` which equals
    # `table_registry.name`, NOT `table_registry.id`. Auto-discovered Keboola
    # tables and manually-registered ones with mixed-case/spaced names produce
    # id != name; an id-keyed lookup would miss them and silently default to
    # `query_mode=local`, causing the CLI to try downloading remote tables.
    registry_by_name = {t["name"]: t for t in table_repo.list_all()}

    # Filter by user's accessible tables. `can_access_table` has its own
    # admin shortcut (Admin group → True). Lookup translates name→id first
    # because `s["table_id"]` is sourced from `_meta.table_name` = registry
    # `name` while `can_access_table` keys on registry `id`; when id != name
    # an id-keyed call would miss.
    def _id_for(state):
        reg = registry_by_name.get(state["table_id"])
        return reg["id"] if reg else state["table_id"]
    all_states = [s for s in all_states if can_access_table(user, _id_for(s), conn)]

    data_dir = _get_data_dir()
    tables = {}
    for state in all_states:
        table_id = state["table_id"]
        reg = registry_by_name.get(table_id, {})
        tables[table_id] = {
            "hash": state.get("hash", ""),
|
|
"updated": state.get("last_sync").isoformat() if state.get("last_sync") else None,
|
|
"size_bytes": state.get("file_size_bytes", 0),
|
|
"rows": state.get("rows", 0),
|
|
"query_mode": reg.get("query_mode") or "local",
|
|
"source_type": reg.get("source_type") or "",
|
|
}
|
|
|
|
# Asset hashes
|
|
docs_dir = data_dir / "docs"
|
|
assets = {}
|
|
for asset_name, asset_path in [
|
|
("docs", docs_dir),
|
|
("profiles", data_dir / "src_data" / "metadata" / "profiles.json"),
|
|
]:
|
|
if asset_path.exists():
|
|
if asset_path.is_file():
|
|
assets[asset_name] = {"hash": _file_hash(asset_path)}
|
|
else:
|
|
newest = max(
|
|
(f.stat().st_mtime for f in asset_path.rglob("*") if f.is_file()),
|
|
default=0,
|
|
)
|
|
assets[asset_name] = {"hash": str(int(newest))}
|
|
|
|
# v49 unified-stack manifest extensions (Section 5.1).
|
|
# DEPRECATED v49: ``tables`` dict above is kept in parallel for one release because
|
|
# older CLIs depend on it; new clients prefer ``direct_tables`` +
|
|
# ``data_packages[].tables``.
|
|
states_by_table_id = {s["table_id"]: s for s in all_states}
|
|
try:
|
|
data_packages, packaged_ids = _build_data_packages_section(
|
|
conn, user, registry_by_name, states_by_table_id,
|
|
)
|
|
except Exception:
|
|
logger.exception("manifest data_packages section build failed")
|
|
data_packages, packaged_ids = [], set()
|
|
try:
|
|
memory_domains = _build_memory_domains_section(conn, user)
|
|
except Exception:
|
|
logger.exception("manifest memory_domains section build failed")
|
|
memory_domains = []
|
|
try:
|
|
direct_tables = _build_direct_tables_section(
|
|
conn, user, registry_by_name, states_by_table_id, packaged_ids,
|
|
)
|
|
except Exception:
|
|
logger.exception("manifest direct_tables section build failed")
|
|
direct_tables = []
|
|
|
|
return {
|
|
"tables": tables,
|
|
"assets": assets,
|
|
"server_time": datetime.now(timezone.utc).isoformat(),
|
|
"data_packages": data_packages,
|
|
"memory_domains": memory_domains,
|
|
"direct_tables": direct_tables,
|
|
}
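On the consuming side, the docstring says the CLI downloads parquets for local tables whose hash changed and skips remote ones. A rough sketch of that decision follows, under stated assumptions: `tables_to_download` and `local_hashes` are hypothetical names, `"remote"` is used as an example non-local `query_mode` value, and the real CLI logic is not shown in this diff.

```python
from typing import Dict, List


def tables_to_download(manifest: dict, local_hashes: Dict[str, str]) -> List[str]:
    """Return ids of local-mode tables whose manifest hash differs locally."""
    wanted = []
    for table_id, entry in manifest.get("tables", {}).items():
        # Same defensive default as the server: missing query_mode means local.
        if (entry.get("query_mode") or "local") != "local":
            continue  # remote tables have no parquet to fetch
        if local_hashes.get(table_id) != entry.get("hash", ""):
            wanted.append(table_id)
    return sorted(wanted)


# Hypothetical manifest fragment in the legacy ``tables`` shape.
manifest = {
    "tables": {
        "orders": {"hash": "abc", "query_mode": "local"},
        "customers": {"hash": "def", "query_mode": "local"},
        "bq_view": {"hash": "", "query_mode": "remote"},
    }
}
pending = tables_to_download(manifest, {"orders": "abc", "customers": "old"})
```

Only `customers` ends up in `pending`: `orders` is already current and `bq_view` is skipped because it is not a local table.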
|
|
|
|
|
|
@router.get("/manifest")
|
|
async def sync_manifest(
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Return hash-based manifest of all synced data, filtered per user.
|
|
|
|
Side-effect: stamps ``users.last_pull_at`` so the /home status frame
|
|
can show when the analyst last pulled. This GET is the canonical
|
|
"I am about to sync" signal — agnes pull hits it first, then
|
|
downloads parquets whose hash changed. UI bumps (manifest browsed in
|
|
a browser session) also count; cheap and accurate enough for a
|
|
homepage card.
|
|
"""
|
|
try:
|
|
conn.execute(
|
|
"UPDATE users SET last_pull_at = current_timestamp WHERE id = ?",
|
|
[user["id"]],
|
|
)
|
|
# Also emit an audit_log row so /me/stats Sync activity has a
|
|
# timeline of pulls (the column UPDATE only retains the most
|
|
# recent one). Action `manifest.fetch` covers both `agnes pull`
|
|
# via PAT and browser-driven manifest peeks; clients can
|
|
# disambiguate via client_kind.
|
|
AuditRepository(conn).log(
|
|
user_id=user["id"],
|
|
action="manifest.fetch",
|
|
resource="manifest",
|
|
result="ok",
|
|
client_kind="api",
|
|
)
|
|
except Exception:
|
|
# Never block a pull because the stamp UPDATE / audit row hit a
|
|
# transient issue (locked WAL, partial migration window). The
|
|
# manifest itself is the load-bearing payload.
|
|
pass
|
|
# v49 Section 9.2 — emit a server-side ``sync.pull_started`` event so
|
|
# /admin/telemetry can count distinct pulls per user per day. Best-effort.
|
|
try:
|
|
from src.repositories.usage import UsageRepository
|
|
UsageRepository(conn).emit_server_event(
|
|
event_type="sync.pull_started",
|
|
user_id=user["id"],
|
|
username=user.get("email") or user["id"],
|
|
props={"client_kind": client_kind_from_user(user)},
|
|
)
|
|
except Exception:
|
|
pass
|
|
return _build_manifest_for_user(conn, user)
|
|
|
|
|
|
# ---- Pull confirm (Phase 7, Task 7.6) ----
|
|
|
|
|
|
class PullConfirmTypeReport(BaseModel):
|
|
added: int = 0
|
|
updated: int = 0
|
|
removed: int = 0
|
|
|
|
|
|
class PullConfirmRequest(BaseModel):
|
|
"""Per-type aggregate the CLI submits after every pull finishes.
|
|
|
|
Pairs with the ``sync.pull_started`` event emitted by GET /manifest
|
|
so admin telemetry can compute pull-success rates + duration
|
|
distributions. Optional fields fall back to zero counts — older CLI
|
|
versions that don't track a section emit nothing for it.
|
|
"""
|
|
|
|
duration_ms: Optional[int] = None
|
|
direct_tables: Optional[PullConfirmTypeReport] = None
|
|
data_packages: Optional[PullConfirmTypeReport] = None
|
|
memory_domains: Optional[PullConfirmTypeReport] = None
|
|
errors: int = 0
|
|
|
|
|
|
@router.post("/pull-confirm")
|
|
async def pull_confirm(
|
|
payload: PullConfirmRequest,
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Telemetry hook the CLI fires at the end of every ``agnes pull``.
|
|
|
|
Best-effort: a telemetry insert failure must NOT bubble up to the
|
|
CLI (the user already has their parquets, the pull succeeded). The
|
|
response is a fixed shape ``{"recorded": True}`` so older clients
|
|
that ignore the body keep working when the field set evolves.
|
|
"""
|
|
props: dict = {
|
|
"duration_ms": payload.duration_ms,
|
|
"errors": payload.errors,
|
|
"client_kind": client_kind_from_user(user),
|
|
}
|
|
for section in ("direct_tables", "data_packages", "memory_domains"):
|
|
section_payload = getattr(payload, section)
|
|
if section_payload is not None:
|
|
props[f"{section}_added"] = section_payload.added
|
|
props[f"{section}_updated"] = section_payload.updated
|
|
props[f"{section}_removed"] = section_payload.removed
|
|
|
|
try:
|
|
from src.repositories.usage import UsageRepository
|
|
UsageRepository(conn).emit_server_event(
|
|
event_type="sync.pull_completed",
|
|
user_id=user["id"],
|
|
username=user.get("email") or user["id"],
|
|
props=props,
|
|
)
|
|
except Exception:
|
|
logger.warning("usage_events emit failed for sync.pull_completed", exc_info=True)
|
|
return {"recorded": True}
|
|
|
|
|
|
# ---- Status ----
|
|
|
|
@router.get("/status")
|
|
async def sync_status():
|
|
"""Whether a sync is currently in flight on this app process.
|
|
|
|
Public (no auth) — used by the host-side ``agnes-auto-upgrade.sh``
|
|
cron to decide whether to skip a `docker compose up -d` that would
|
|
kill a running extractor / materialized pass mid-flight. Cheap to
|
|
serve (single Lock.locked() check) and contains no sensitive data.
|
|
|
|
Returns:
|
|
``{"locked": bool}`` — True if `_sync_lock` is currently held by
|
|
a `_run_sync` invocation, OR a sync was triggered within the
|
|
last ``_TRIGGER_HOLD_SEC`` seconds (so the FastAPI background
|
|
task hasn't yet acquired the lock). Without the trigger-hold
|
|
window, an auto-upgrade probe firing in the gap between the
|
|
trigger handler's 200 response and the background task's
|
|
``_sync_lock.acquire()`` would see ``locked=False`` and proceed
|
|
with ``up -d`` — killing the just-spawning extractor.
|
|
"""
|
|
locked = _sync_lock.locked()
|
|
if not locked and _recent_trigger_at:
|
|
# Monotonic deadline; clock skew / DST jumps don't matter.
|
|
locked = (time.monotonic() - _recent_trigger_at) < _TRIGGER_HOLD_SEC
|
|
return {"locked": locked}
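The trigger-hold window can be illustrated on its own. The class below is a hypothetical sketch, not the module's implementation: `TriggerHold` and its injectable `clock` are assumptions made so the behaviour can be exercised without sleeping.

```python
import time
from typing import Callable, Optional


class TriggerHold:
    """Report 'locked' while a lock is held or a trigger is still recent."""

    def __init__(self, hold_sec: float, clock: Callable[[], float] = time.monotonic):
        self._hold_sec = hold_sec
        self._clock = clock
        self._triggered_at: Optional[float] = None

    def stamp(self) -> None:
        """Record that a sync was just triggered."""
        self._triggered_at = self._clock()

    def is_held(self, lock_is_held: bool) -> bool:
        if lock_is_held:
            return True
        if self._triggered_at is None:
            return False
        # Monotonic difference, so wall-clock jumps cannot shorten the window.
        return (self._clock() - self._triggered_at) < self._hold_sec


# Drive the clock by hand to show the three phases.
fake_now = [100.0]
hold = TriggerHold(hold_sec=5.0, clock=lambda: fake_now[0])
before_trigger = hold.is_held(lock_is_held=False)
hold.stamp()
fake_now[0] = 103.0
inside_window = hold.is_held(lock_is_held=False)
fake_now[0] = 106.0
after_window = hold.is_held(lock_is_held=False)
```

One design difference from the handler above: the sketch compares against `None` rather than relying on truthiness, so a stamp of exactly `0.0` would still count as a trigger.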
|
|
|
|
|
|
# ---- Trigger ----
|
|
|
|
@router.post("/trigger")
|
|
async def trigger_sync(
|
|
background_tasks: BackgroundTasks,
|
|
body: Optional[Any] = Body(None),
|
|
user: dict = Depends(require_admin),
|
|
):
|
|
"""Trigger data sync from configured source. Admin only. Runs in background.
|
|
|
|
Body accepts three shapes (all optional — empty body / `null` syncs
|
|
every registered table):
|
|
|
|
- ``["kbc_job", "orders"]`` — bare JSON array of table ids
|
|
- ``{"tables": ["kbc_job", "orders"]}`` — object with a ``tables``
|
|
key (matches the wire shape of the response, more discoverable
|
|
for clients building requests by hand)
|
|
- ``null`` / no body — sync everything
|
|
|
|
Both array forms have shipped at different times; accepting both
|
|
keeps older clients (PR-build CLIs, helper scripts) working while
|
|
surfacing the shape that mirrors the response payload. Anything
|
|
else returns HTTP 422 with a structured detail.
|
|
|
|
Returns 409 if a previously-triggered sync is still running. Two
|
|
concurrent extractor subprocesses fight for the same `extract.duckdb`
|
|
file lock — that contention starves uvicorn, makes `/api/health` time
|
|
out, flips the container to `unhealthy`, and (behind a `reverse_proxy`
|
|
upstream like the bundled Caddy overlay) bricks external traffic
|
|
until contention drains. Fast-fail here keeps that from happening.
|
|
"""
|
|
if body is None:
|
|
tables: Optional[List[str]] = None
|
|
elif isinstance(body, list):
|
|
tables = list(body)
|
|
elif isinstance(body, dict):
|
|
tables = body.get("tables")
|
|
if tables is not None and not isinstance(tables, list):
|
|
raise HTTPException(
|
|
status_code=422,
|
|
detail="`tables` must be a list of strings",
|
|
)
|
|
else:
|
|
raise HTTPException(
|
|
status_code=422,
|
|
detail=(
|
|
"body must be a list of table ids, an object with a "
|
|
"`tables` list, or null"
|
|
),
|
|
)
|
|
if tables is not None and not all(isinstance(t, str) for t in tables):
|
|
raise HTTPException(
|
|
status_code=422,
|
|
detail="all entries in `tables` must be strings",
|
|
)
|
|
|
|
if _sync_lock.locked():
|
|
try:
|
|
from src.db import get_system_db
|
|
_audit_conn = get_system_db()
|
|
AuditRepository(_audit_conn).log(
|
|
user_id=user.get("id"),
|
|
action="sync.trigger",
|
|
resource=(
|
|
(tables[0] if len(tables) == 1 else f"{len(tables)} tables")
|
|
if tables else "all_tables"
|
|
)[:256],
|
|
params={"requested_at": datetime.now(timezone.utc).isoformat(), "tables": tables},
|
|
result="error.in_progress",
|
|
client_kind=client_kind_from_user(user),
|
|
)
|
|
_audit_conn.close()
|
|
except Exception:
|
|
logger.exception("audit_log write failed for sync.trigger (in_progress); continuing")
|
|
raise HTTPException(
|
|
status_code=409,
|
|
detail="sync_already_in_progress",
|
|
)
|
|
_t0 = time.monotonic()
|
|
# Stamp the trigger time so `/api/sync/status` reports locked=True
|
|
# for the next ``_TRIGGER_HOLD_SEC`` even though the background
|
|
# task hasn't yet acquired ``_sync_lock``. Closes the race window
|
|
# the host-side ``agnes-auto-upgrade.sh`` defer probe was hitting.
|
|
global _recent_trigger_at
|
|
_recent_trigger_at = _t0
|
|
background_tasks.add_task(_run_sync, tables)
|
|
try:
|
|
from src.db import get_system_db
|
|
_audit_conn = get_system_db()
|
|
AuditRepository(_audit_conn).log(
|
|
user_id=user.get("id"),
|
|
action="sync.trigger",
|
|
resource=(
|
|
(tables[0] if len(tables) == 1 else f"{len(tables)} tables")
|
|
if tables else "all_tables"
|
|
)[:256],
|
|
params={"requested_at": datetime.now(timezone.utc).isoformat(), "tables": tables},
|
|
result="success",
|
|
duration_ms=int((time.monotonic() - _t0) * 1000),
|
|
client_kind=client_kind_from_user(user),
|
|
)
|
|
_audit_conn.close()
|
|
except Exception:
|
|
logger.exception("audit_log write failed for sync.trigger; continuing")
|
|
return {
|
|
"status": "triggered",
|
|
"tables": tables or "all",
|
|
"message": "Data sync started in background. Check /api/health for progress.",
|
|
}
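The three accepted body shapes can be checked outside FastAPI with a small normaliser. This is a sketch under assumptions: `normalize_tables_body` is a hypothetical helper, and it raises `ValueError` where the handler above raises `HTTPException(422)`.

```python
from typing import Any, List, Optional


def normalize_tables_body(body: Any) -> Optional[List[str]]:
    """Map a bare list, a {"tables": [...]} object, or None to a table list."""
    if body is None:
        return None  # no body: sync everything
    if isinstance(body, list):
        tables = list(body)
    elif isinstance(body, dict):
        tables = body.get("tables")
        if tables is None:
            return None  # object without `tables`: sync everything
        if not isinstance(tables, list):
            raise ValueError("`tables` must be a list of strings")
    else:
        raise ValueError("body must be a list, an object with `tables`, or null")
    if not all(isinstance(t, str) for t in tables):
        raise ValueError("all entries in `tables` must be strings")
    return tables


from_array = normalize_tables_body(["kbc_job", "orders"])
from_object = normalize_tables_body({"tables": ["kbc_job", "orders"]})
from_null = normalize_tables_body(None)
```

Both list-carrying shapes normalise to the same value, which is what lets older clients and newer ones share one code path after validation.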
|
|
|
|
|
|
# ---- Sync Settings (dataset subscriptions) ----
|
|
|
|
class SyncSettingsUpdate(BaseModel):
|
|
datasets: dict # {dataset_name: bool}
|
|
|
|
|
|
@router.get("/settings")
|
|
async def get_sync_settings(
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Get user's dataset sync settings."""
|
|
repo = SyncSettingsRepository(conn)
|
|
settings = repo.get_user_settings(user["id"])
|
|
enabled = repo.get_enabled_datasets(user["id"])
|
|
return {
|
|
"user_id": user["id"],
|
|
"settings": settings,
|
|
"enabled_datasets": enabled,
|
|
}
|
|
|
|
|
|
@router.post("/settings")
|
|
async def update_sync_settings(
|
|
request: SyncSettingsUpdate,
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Update user's dataset sync settings.
|
|
|
|
A dataset can only be enabled when the user has access (via
|
|
``resource_grants(group, "table", dataset)`` or Admin membership). The
|
|
user_sync_settings layer is per-user preference, not authorization —
|
|
the gate stops users from enabling sync on tables they cannot read.
|
|
"""
|
|
from app.auth.access import can_access
|
|
from app.resource_types import ResourceType
|
|
|
|
settings_repo = SyncSettingsRepository(conn)
|
|
results = {}
|
|
for dataset, enabled in request.datasets.items():
|
|
if not can_access(user["id"], ResourceType.TABLE.value, dataset, conn):
|
|
results[dataset] = {"error": "no permission"}
|
|
continue
|
|
settings_repo.set_dataset_enabled(user["id"], dataset, enabled)
|
|
results[dataset] = {"enabled": enabled}
|
|
|
|
return {"updated": results}
|
|
|
|
|
|
# ---- Table Subscriptions ----
|
|
|
|
class TableSubscriptionUpdate(BaseModel):
|
|
table_mode: str = "all" # "all" or "explicit"
|
|
tables: dict = Field(default_factory=dict, max_length=500) # {table_name: bool}
|
|
|
|
|
|
@router.get("/table-subscriptions")
|
|
async def get_table_subscriptions(
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Get user's per-table subscription settings."""
|
|
repo = SyncSettingsRepository(conn)
|
|
settings = repo.get_user_settings(user["id"])
|
|
return {"user_id": user["id"], "subscriptions": settings}
|
|
|
|
|
|
@router.post("/table-subscriptions")
|
|
async def update_table_subscriptions(
|
|
request: TableSubscriptionUpdate,
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Update per-table subscription preferences.
|
|
|
|
Mirrors the RBAC gate in POST /settings: a table can only be subscribed
|
|
to when the user holds a resource_grants row for it (or is Admin). This
|
|
prevents an authenticated user from subscribing to tables they cannot read.
|
|
"""
|
|
from app.auth.access import can_access
|
|
from app.resource_types import ResourceType
|
|
|
|
repo = SyncSettingsRepository(conn)
|
|
results = {}
|
|
for table_name, enabled in request.tables.items():
|
|
if not can_access(user["id"], ResourceType.TABLE.value, table_name, conn):
|
|
results[table_name] = {"error": "no permission"}
|
|
continue
|
|
repo.set_dataset_enabled(user["id"], table_name, enabled)
|
|
results[table_name] = {"enabled": enabled}
|
|
return {"table_mode": request.table_mode, "updated": results}
|