agnes-the-ai-analyst/cli/commands/admin.py
ZdenekSrotyr 28430ced09
Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190)
* fix: cutover regressions + parallel Keboola legacy fallback

Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.

CUTOVER REGRESSIONS

- agnes pull hash mismatch on every Keboola local-mode table —
  src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
  while the CLI compares against full 32-char content MD5. Now stores
  the same content MD5 the materialized SQL path already used.

- Trailing-slash sanitization in connectors/keboola/access.py and
  extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
  ends in / (canonical form).

- src/profiler.py:TableInfo.description becomes optional — two call
  sites instantiated without it, crashing the profiler pass.

- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
  ran as root, current runs as agnes (uid 999). Reads target uid:gid
  from /etc/passwd inside the new image and chowns ${STATE_DIR},
  /data/extracts, /data/analytics when the digest moves.

- POST /api/sync/trigger is now singleton per process — two
  near-simultaneous trigger calls each forked an extractor subprocess,
  fought for extract.duckdb's file lock, starved uvicorn, flipped the
  container to unhealthy. Trigger now returns 409
  (sync_already_in_progress) when held; _run_sync acquires non-blocking.
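
The singleton trigger described in the last item can be sketched as
follows. This is a minimal illustration, assuming a module-level
threading.Lock; the function name trigger_sync and the (status, body)
return shape are hypothetical. Only the 409 / sync_already_in_progress
contract and the non-blocking acquire come from the notes above.

```python
import threading

# Hypothetical stand-in for the process-wide sync lock.
_sync_lock = threading.Lock()


def trigger_sync(run_sync):
    """Run run_sync() unless another sync already holds the lock.

    Returns an (http_status, body) pair; the shape is illustrative only.
    """
    # Non-blocking acquire: a second caller is rejected, not queued.
    if not _sync_lock.acquire(blocking=False):
        return 409, {"detail": "sync_already_in_progress"}
    try:
        run_sync()
        return 200, {"status": "ok"}
    finally:
        _sync_lock.release()
```

Rejecting instead of queueing means at most one extractor subprocess can
hold the extract.duckdb file lock per process.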

PARALLEL LEGACY FALLBACK

- Process pool fan-out for the _extract_via_legacy queue (default 8
  workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
  thread pool, because connectors/keboola/client.py:export_table does
  os.chdir(temp_dir) — process-global, so threads raced and slice files
  landed in the wrong directory ("[Errno 2] No such file or directory:
  '<job_id>.csv_X_Y_Z.csv'").

- Extractor subprocess timeout 1800s -> 3600s (configurable via
  AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
  jobs need the headroom on telemetry-class projects.

- Process group cleanup on timeout — Popen(start_new_session=True) puts
  the extractor in its own group. On timeout the parent SIGTERMs the
  group (10s grace) then SIGKILLs stragglers. Without this, the pool
  workers were reparented to PID 1 and continued holding open Keboola
  Storage export jobs. Inline extractor script also installs a SIGTERM
  -> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
  __exit__ runs cleanly.
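
The process-group cleanup in the last item can be sketched like this.
It is a minimal, POSIX-only illustration, not the extractor's actual
code: run_with_group_timeout and its parameters are hypothetical names,
and only start_new_session=True, the SIGTERM / grace / SIGKILL order and
the 10s grace default come from the notes above.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_sec, grace_sec=10.0):
    """Run cmd in its own session; on timeout, terminate the whole group.

    Returns the exit code, or None when the run timed out.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process-group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # polite request to the group
    except ProcessLookupError:
        pass
    try:
        proc.wait(timeout=grace_sec)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(proc.pid, signal.SIGKILL)  # stragglers, if any remain
    except ProcessLookupError:
        pass  # the group is already empty
    proc.wait()  # reap the leader so it does not linger as a zombie
    return None
```

Signalling the group rather than the single pid is what reaches the pool
workers; killing only the parent would leave them reparented to PID 1.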

Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).

Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).

* feat(keboola): Storage API direct extract path; drop extension data path

The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.

connectors/keboola/storage_api.py (NEW)
  Lightweight client built on requests.Session. Three Storage API
  endpoints plus the signed-URL download:
  - POST /v2/storage/tables/{id}/export-async        (kicks off job)
  - GET  /v2/storage/jobs/{id}                        (poll until done)
  - GET  /v2/storage/files/{id}?federationToken=1     (signed URL detail)
  - GET  <signed_url>                                 (download bytes)
  Supports sliced exports (manifest + per-slice signed URLs) and gzipped
  payloads. ExportFilter dataclass mirrors the Keboola filter spec
  (whereFilters / columns / changedSince / limit) and handles JSON
  round-trip with the registry's source_query column. Token redaction
  in error messages. Bounded exponential backoff on job polling.
  No cloud-SDK dependency on the data path; thread-safe.
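
  The bounded exponential backoff on job polling can be sketched as
  below. All names and numbers here (backoff_delays, poll_until_done,
  1s initial delay, 30s cap, 20 polls) are assumptions for illustration;
  the commit does not state the real values. The terminal job statuses
  "success" and "error" are likewise assumed.

```python
def backoff_delays(initial=1.0, factor=2.0, cap=30.0):
    """Yield poll delays that grow geometrically up to cap seconds."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_until_done(fetch_status, sleep, max_polls=20):
    """Call fetch_status() until it reports a terminal state.

    fetch_status and sleep are injected so the loop can be tested
    without real HTTP calls or real waiting.
    """
    for _, delay in zip(range(max_polls), backoff_delays()):
        status = fetch_status()
        if status in ("success", "error"):
            return status
        sleep(delay)
    raise TimeoutError("job did not finish within the polling budget")
```

  Capping both the delay and the number of polls keeps the worst-case
  wait finite, which matters once the extractor itself runs under a
  subprocess timeout.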

connectors/keboola/extractor.py
  - materialize_query() rewritten: takes bucket/source_table/source_query
    (JSON filter spec), exports via KeboolaStorageClient, converts CSV
    to parquet via DuckDB, atomic os.replace. Same return shape so
    sync.py downstream code stays uniform with the BQ branch.
  - _extract_via_legacy() also moved to Storage API direct (kept the
    name for caller compatibility with _legacy_worker / the parallel
    batch extractor). Per-call temp directories — no os.chdir, threads
    don't race.

app/api/sync.py
  _run_materialized_pass for source_type='keboola' rows now constructs a
  KeboolaStorageClient (replaces KeboolaAccess) and passes
  bucket/source_table/source_query to materialize_query. Reuses one
  client across rows for HTTP keep-alive. Sources keboola URL from env
  too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
  configured.

cli/commands/admin.py
  discover-and-register defaults Keboola rows to query_mode='materialized'
  (NULL source_query = full table), matching the v26 migration's
  unification of the local/materialized split for Keboola. BigQuery and
  Jira keep their per-source defaults.

src/db.py
  Schema bump 25 → 26. Migration: UPDATE table_registry SET
  query_mode='materialized' WHERE source_type='keboola' AND
  query_mode='local'. NULL source_query on those rows means "full table
  export" — same effective behavior the local mode provided, but now
  via Storage API instead of the extension.

pyproject.toml
  kbcstorage dep stays (admin-side bucket/table list still uses the
  SDK in app/api/admin.py / connectors/keboola/client.py); only the
  data path is migrated off the SDK. Comment updated to reflect the
  new boundary.

tests
  - test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
    HTTP client (token redaction, retry logic, polling), download_file
    (single, gzipped, sliced), end-to-end export_table_to_csv.
  - test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
    instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
    contracts.
  - test_sync_trigger_keboola_materialized.py: registry rows now carry
    bucket+source_table+JSON-shape source_query.

114+ Keboola-impacted tests green locally.

* test: schema version assertion bumped to 26 alongside the keboola query_mode migration

* fix(keboola): cutover hot-patches surfaced on agnes-dev

Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.

- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
  source, any mode), not just rows where source matches and query_mode
  is local. After the v25→v26 keboola materialized migration an
  instance can have 30 materialized rows and zero local rows; the
  previous gate kept re-firing _discover_and_register_tables every
  scheduler tick, creating duplicate auto-discovered rows with the
  wrong bucket prefix every time.

- app/api/admin.py: _discover_and_register_tables reassembles the
  bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
  dropping the stage prefix; default query_mode for keboola is now
  materialized (the v26 contract); validator allows NULL source_query
  for keboola materialized rows (full-table export via Storage API
  export-async, no SQL needed).

- cli/commands/admin.py: register-table mirrors the server validator
  (NULL source_query allowed for source_type=keboola); --bucket help
  text generalized to cover both BQ dataset and Keboola bucket id.

- connectors/keboola/extractor.py: max_line_size=64 MiB on
  read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
  in particular) do not trip the default 2 MiB ceiling.

- connectors/keboola/storage_api.py: GCP backend support — when the
  Storage API returns a manifest whose slice URLs are gs://
  references with a gcsCredentials block, rewrite to the JSON REST
  download endpoint and authenticate with the issued OAuth bearer
  token; redact tokens in any surfaced error string.
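
The gs:// rewrite in the last item can be sketched as follows. The
helper names are hypothetical and the exact URL the commit builds is not
shown here; this sketch assumes the publicly documented GCS JSON API
media download form (objects.get with alt=media), where the object name
is percent-encoded as a single path segment.

```python
from urllib.parse import quote


def gcs_download_url(gs_url):
    """Map gs://<bucket>/<object> to a JSON API media download URL."""
    prefix = "gs://"
    if not gs_url.startswith(prefix):
        raise ValueError(f"not a gs:// reference: {gs_url!r}")
    bucket, _, object_name = gs_url[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"missing bucket or object name: {gs_url!r}")
    # safe="" also encodes "/" so the object name stays one path segment.
    return (
        "https://storage.googleapis.com/storage/v1/b/"
        f"{quote(bucket, safe='')}/o/{quote(object_name, safe='')}?alt=media"
    )


def gcs_auth_headers(access_token):
    """Bearer header for the short-lived OAuth token issued with the export."""
    return {"Authorization": f"Bearer {access_token}"}
```

Because the token travels in a header, any error string that echoes
request details has to be redacted before it is surfaced.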

* test: align with new keboola materialized + auto-discover-gate contracts

- test_admin_keboola_materialized: rename
  test_register_keboola_materialized_rejects_missing_source_query →
  test_register_keboola_materialized_accepts_missing_source_query.
  v25→v26 introduced 'keboola materialized with NULL source_query
  means full-table export via Storage API export-async' as the
  default registration shape; the rejection case is no longer the
  contract.

- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
  gate in _run_sync now keys off the WHOLE registry (not just local
  rows) so materialized-only Keboola instances do not re-trigger
  discovery on every tick.

* feat(keboola): native parquet export — skip CSV roundtrip

Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.

Two cases:

- **Single-file** export (small table): file_info.url points at one
  signed URL; download_file streams chunks straight to .parquet.tmp
  and we're done. No DuckDB.

- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
  default — so anything larger arrives as N parquet slices): each
  slice is a complete parquet file with its own footer; naive concat
  would corrupt them. download_file_slices keeps the slices as
  separate files in a tempdir, then DuckDB COPY (SELECT * FROM
  read_parquet([slice0, slice1, ...])) merges them into one
  consolidated parquet. DuckDB streams row groups during this — peak
  memory bounded to one row group (~1 MiB) regardless of source size.

The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.

Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
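
A minimal sketch of name-suffix-only detection, assuming the file name
is available next to the downloaded bytes. decode_payload is a
hypothetical helper, and the case-insensitive match is an assumption;
the real download_file streams chunks rather than holding the payload
in memory.

```python
import gzip


def decode_payload(file_name, payload):
    """Gunzip only when the file name says so; never guess from metadata."""
    if file_name.lower().endswith(".gz"):
        return gzip.decompress(payload)
    # Parquet and plain CSV pass through byte-for-byte.
    return payload
```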

* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes

Three small self-contained fixes uncovered during agnes-dev cutover.

- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
  ignore_cleanup_errors=True so a worker death mid-write doesn't leave
  multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
  disk-full crash where TemporaryDirectory's own cleanup also raised
  and got swallowed.)

- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
  = always due). Force-resync of an errored row no longer requires
  waiting out the default 'every 1h' interval — admin can flip the
  schedule, trigger, then flip back.

- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
  (legacy bare-array body) and {'tables': ['table_id']} (matches the
  response payload shape, more discoverable for clients building
  requests by hand). Malformed bodies return 422 with a structured
  detail; null/missing means 'sync everything' as before.
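
The two accepted body shapes of the trigger endpoint can be normalised
roughly as below. This is a sketch, not the handler itself:
normalize_trigger_body is a hypothetical name, the ValueError stands in
for whatever the endpoint maps to a 422, and the treatment of an empty
list (passed through unchanged) is an assumption the notes above do not
settle.

```python
def normalize_trigger_body(body):
    """Return a list of table ids, or None for 'sync everything'."""
    if body is None:
        return None
    if isinstance(body, dict):
        # {'tables': [...]} shape; a missing or null key means everything.
        body = body.get("tables")
        if body is None:
            return None
    if not isinstance(body, list):
        raise ValueError("expected a list of table ids")
    if not all(isinstance(t, str) and t for t in body):
        raise ValueError("table ids must be non-empty strings")
    return body
```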

Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).

* fix: targeted-trigger filter in materialized pass + auto-upgrade defer

Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.

- _run_materialized_pass now takes a 'tables' arg and skips rows not in
  the target set with reason='not_in_target'. POST /api/sync/trigger
  with a body of tables previously only scoped the legacy extractor
  subprocess — the materialized pass kept iterating every due
  materialized row, so an admin asking to re-sync kbc_job re-ran
  every other due materialized row alongside it. Match on registry id
  OR name (admins commonly pass either form). tables=None preserves
  the no-filter behavior.

- New GET /api/sync/status (public, no auth) returns {locked: bool}
  off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
  docker compose up -d and exits 0 with a 'deferred recreate' log
  line if a sync is in flight — the next 5-min cron tick retries.
  Pre-fix, an auto-upgrade triggered mid-sync would recreate the
  uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
  download (observed when kbc_job's first 7-day retry got SIGKILLed).
  Connection failures in the probe fall through to the upgrade —
  being stuck on a wedged image is worse than interrupting a
  hypothetical sync.
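
The targeted-trigger filter from the first item can be sketched as
follows. Registry rows are modelled as plain dicts with id and name
keys, and select_targets is a hypothetical name; only the id-OR-name
match, the 'not_in_target' reason and the tables=None behavior come from
the notes above.

```python
def select_targets(rows, tables=None):
    """Split due rows into (selected, skipped) for a targeted trigger."""
    wanted = None if tables is None else set(tables)
    selected, skipped = [], []
    for row in rows:
        # Admins pass either the registry id or the display name.
        if wanted is None or row["id"] in wanted or row["name"] in wanted:
            selected.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return selected, skipped
```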

* fix: auto-discover protects admin overrides + surfaces drift

Two real-world incidents on agnes-dev drove this:

1. kbc_job was registered manually with the correct
   (in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
   re-run would have inserted a SECOND kbc_job row at the slugified
   id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
   places it) — and that row's Storage API export-async 404s.

2. An earlier auto-discover bug stripped the stage prefix from
   bucket ids ('c-finance' instead of 'in.c-finance'), inserting
   137 rows whose syncs all failed.

Fix:

- _discover_and_register_tables now builds a plan first
  (_build_keboola_discovery_plan) classifying each discovered table
  into one of new / existing_match / existing_drift / invalid, then
  executes only the 'new' bucket. Drift rows are reported with both
  sides of the disagreement plus drift_kind:
  - same_id_diff_coords: registry has the same id but different
    bucket / source_table (admin migrated coords inline).
  - name_collision: discovery's slugified id differs from any
    registry id, but the discovered .name matches an existing row's
    .name (case-insensitive). Catches the kbc_job case.

- Bucket detection now prefers the API's authoritative bucket_id
  field (separate field on the Keboola tables.list response,
  normalised by KeboolaClient.discover_all_tables). Falls back to
  id-string parsing only when bucket_id is missing (older fallback
  path inside discover_all_tables).

- Endpoint POST /api/admin/discover-and-register?dry_run=true
  returns the plan without writing — would_register, drift,
  invalid lists. Lets an operator audit before merging discovery
  with a registry that has admin overrides.
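
The classification step can be sketched as below. This is not
_build_keboola_discovery_plan itself: rows are modelled as plain dicts,
build_plan is a hypothetical name, and the 'invalid' rule (missing
bucket or source_table) is an assumption, since the notes above do not
spell out what makes a discovered table invalid.

```python
def build_plan(discovered, registry):
    """Classify discovered tables against the registry without writing."""
    by_id = {r["id"]: r for r in registry}
    by_name = {r["name"].lower(): r for r in registry}
    plan = {"new": [], "existing_match": [], "existing_drift": [], "invalid": []}
    for t in discovered:
        if not t.get("bucket") or not t.get("source_table"):
            plan["invalid"].append(t)  # assumed rule, see lead-in
            continue
        existing = by_id.get(t["id"])
        if existing is not None:
            same = (existing["bucket"], existing["source_table"]) == (
                t["bucket"], t["source_table"])
            if same:
                plan["existing_match"].append(t)
            else:
                plan["existing_drift"].append({
                    "drift_kind": "same_id_diff_coords",
                    "discovered": t, "registry": existing})
            continue
        clash = by_name.get(t["name"].lower())
        if clash is not None:
            # Different id, same name: the manually registered kbc_job case.
            plan["existing_drift"].append({
                "drift_kind": "name_collision",
                "discovered": t, "registry": clash})
            continue
        plan["new"].append(t)
    return plan
```

Only plan["new"] would be executed; with dry_run=true the whole plan is
returned instead.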

Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.

* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp

The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs, so multi-GiB / many-slice exports
fill the boot disk long before the dedicated data disk fills. agnes-dev
hit 100% boot-disk while the 20 GiB data disk had 15 GiB free.

connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). materialize_query
(parquet path), _extract_via_legacy (CSV fallback), and the sliced-CSV
concat path in storage_api all use the helper now.
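
A minimal sketch of the helper's contract, assuming the behavior listed
above (mkdir on use, None on unset / empty / unwritable). It is not the
real get_temp_root: the env parameter is a hypothetical addition that
makes the function testable without touching os.environ.

```python
import os


def get_temp_root(env=None):
    """Return AGNES_TEMP_DIR if usable, else None (system tempdir)."""
    env = os.environ if env is None else env
    target = env.get("AGNES_TEMP_DIR", "").strip()
    if not target:
        return None
    try:
        os.makedirs(target, exist_ok=True)
    except OSError:
        return None  # cannot create it: fall back to the system default
    if not os.access(target, os.W_OK | os.X_OK):
        return None
    return target
```

Returning None rather than raising lets callers pass the result straight
to tempfile.TemporaryDirectory(dir=...), where None selects the system
default.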

docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
2026-05-07 12:12:14 +02:00


"""Admin commands — agnes admin."""

import json

import typer

from cli.client import api_get, api_post, api_delete, api_patch, api_put
from cli.commands.admin_metrics import admin_metrics_app
from cli.commands.admin_store import admin_store_app
from cli.commands.memory_admin import memory_admin_app

admin_app = typer.Typer(help="Admin operations (requires admin role)")
admin_app.add_typer(admin_metrics_app, name="metrics")
admin_app.add_typer(admin_store_app, name="store")
admin_app.add_typer(memory_admin_app, name="memory")


@admin_app.command("add-user")
def add_user(
    email: str = typer.Argument(..., help="User email"),
    name: str = typer.Option("", help="User display name"),
):
    """Add a new user. New users start with no group memberships — to make
    them admin, add them to the Admin group separately:

        agnes admin group add-member <admin-group-id> <email>
    """
    resp = api_post("/api/users", json={"email": email, "name": name or email.split("@")[0]})
    if resp.status_code == 201:
        data = resp.json()
        typer.echo(f"Created user: {data['email']} (id: {data['id']})")
    else:
        typer.echo(f"Failed: {resp.json().get('detail', resp.text)}", err=True)
        raise typer.Exit(1)


@admin_app.command("list-users")
def list_users(as_json: bool = typer.Option(False, "--json")):
    """List all users."""
    resp = api_get("/api/users")
    if resp.status_code != 200:
        typer.echo(f"Failed: {resp.json().get('detail', resp.text)}", err=True)
        raise typer.Exit(1)
    users = resp.json()
    if as_json:
        typer.echo(json.dumps(users, indent=2))
    else:
        for u in users:
            status_str = "active" if u.get("active", True) else "DEACTIVATED"
            admin_flag = "admin" if u.get("is_admin") else "user"
            typer.echo(
                f" {u['email']:30s} {admin_flag:6s} {status_str:12s} id={u['id'][:8]}"
            )


@admin_app.command("remove-user")
def remove_user(user_id: str = typer.Argument(..., help="User ID to remove")):
    """Remove a user."""
    resp = api_delete(f"/api/users/{user_id}")
    if resp.status_code == 204:
        typer.echo("User removed.")
    else:
        typer.echo(f"Failed: {resp.text}", err=True)
        raise typer.Exit(1)


@admin_app.command("register-table")
def register_table(
    name: str = typer.Argument(..., help="Table display name (DuckDB view name for BQ)"),
    source_type: str = typer.Option("keboola", help="Source type: keboola | bigquery | jira | local"),
    bucket: str = typer.Option("", help="Source bucket (Keboola) or dataset (BigQuery)"),
    source_table: str = typer.Option("", help="Source table name in the bucket/dataset"),
    query_mode: str = typer.Option("local", help="Query mode: local | remote | materialized"),
    query: str = typer.Option(
        "",
        "--query",
        help=(
            "SQL body for query_mode='materialized' (BigQuery only). "
            "Inline SQL or `@path/to.sql` to read from disk."
        ),
    ),
    description: str = typer.Option("", help="Table description"),
    sync_schedule: str = typer.Option(
        "",
        help="Cron schedule (e.g. 'every 6h' / 'daily 03:00'); honored by materialized BQ rows",
    ),
    dry_run: bool = typer.Option(
        False,
        "--dry-run",
        help="Run validation + (BQ) source-side check without writing to the registry",
    ),
):
    """Register a single table.

    Modes:

    - **local** (Keboola): batch pull, parquet on disk. Requires
      `--bucket` + `--source-table`.
    - **remote** (BigQuery): view only, queries go to BQ. Requires
      `--bucket` + `--source-table`.
    - **materialized** (BigQuery): server-side scheduled SQL → parquet.
      Requires `--query` (inline or `@file.sql`) AND `--bucket` (BQ
      dataset of the destination identifier). `--source-table` defaults
      to the registered `name` when omitted; explicit override is rare.
      Note: `agnes schema <name>` builds the BQ identifier as
      `bq.<bucket>.<source_table>` even for materialized rows, so an
      empty `--bucket` would break subsequent schema/describe calls;
      the CLI rejects it up front.
    - **materialized** (Keboola): full-table export via Storage API
      export-async. Requires `--bucket` (the Keboola bucket id);
      `--query` may be omitted, which registers a NULL source_query.

    `--dry-run` goes through /precheck (BQ remote only; for materialized
    rows, dry-run is a no-op since the SQL itself is the contract).
    """
    from pathlib import Path

    # Resolve --query @file.sql shorthand.
    source_query = ""
    if query:
        if query.startswith("@"):
            sql_path = Path(query[1:])
            if not sql_path.exists():
                typer.echo(f"Error: SQL file not found: {sql_path}", err=True)
                raise typer.Exit(2)
            source_query = sql_path.read_text(encoding="utf-8").strip()
        else:
            source_query = query.strip()

    # Keboola materialized rows can omit --query: a NULL source_query means
    # "full-table export via Storage API export-async" (see v25→v26
    # migration notes). For BigQuery materialized rows, --query is still
    # required — BQ has no analogous "full table" semantic at the registry
    # layer (the path is a SELECT against `<project>.<dataset>.<table>`,
    # which the admin must spell out).
    if query_mode == "materialized" and not source_query and source_type != "keboola":
        typer.echo(
            "Error: --query-mode materialized requires --query (literal SQL or @path.sql) for source_type=" + source_type,
            err=True,
        )
        raise typer.Exit(2)

    # Bucket is load-bearing on materialized rows. For BQ it backs the
    # destination identifier (`agnes schema <name>` builds `bq."<bucket>"."
    # <src>"` from it; an empty bucket trips "unsafe BQ identifier in
    # registry" at query time). For Keboola it's the bucket id passed to
    # `/v2/storage/tables/<bucket>.<source_table>/export-async` — without
    # it the export call would 404. Same requirement, different rationale.
    if query_mode == "materialized" and not bucket:
        typer.echo(
            "Error: --query-mode materialized requires --bucket (the "
            "BQ dataset / Keboola bucket id for the source identifier).",
            err=True,
        )
        raise typer.Exit(2)

    payload = {
        "name": name,
        "source_type": source_type,
        "bucket": bucket,
        "source_table": source_table or name,
        "query_mode": query_mode,
        "description": description,
    }
    # Omit empty optional fields so the server-side validator doesn't see
    # `source_query=""` on a remote/local row (which would trigger the
    # "source_query forbidden" branch).
    if source_query:
        payload["source_query"] = source_query
    if sync_schedule:
        payload["sync_schedule"] = sync_schedule

    if dry_run:
        # Hits /precheck — no DB write, but for BQ does a real
        # bigquery.Client(project).get_table() round-trip so the operator
        # gets the same NotFound / Forbidden error they'd see at
        # registration time, before committing.
        resp = api_post("/api/admin/register-table/precheck", json=payload)
        if resp.status_code == 200:
            data = resp.json()
            t = data.get("table") or {}
            typer.echo("[DRY RUN] precheck OK")
            typer.echo(f" name: {t.get('name')}")
            typer.echo(f" source_type: {t.get('source_type')}")
            typer.echo(f" bucket: {t.get('bucket')}")
            typer.echo(f" source_table: {t.get('source_table')}")
            if t.get("project_id"):
                typer.echo(f" project_id: {t.get('project_id')}")
            if t.get("rows") is not None:
                typer.echo(f" rows: {t.get('rows'):,}")
            if t.get("size_bytes") is not None:
                typer.echo(f" size_bytes: {t.get('size_bytes'):,}")
            cols = t.get("columns") or []
            if cols:
                typer.echo(f" columns ({len(cols)}):")
                for c in cols:
                    typer.echo(f" - {c.get('name'):<32s} {c.get('type', '')}")
            return
        typer.echo(f"Precheck failed: {resp.json().get('detail', resp.text)}", err=True)
        raise typer.Exit(1)

    resp = api_post("/api/admin/register-table", json=payload)
    # 200 (BQ sync materialize OK), 201 (legacy non-BQ), and 202 (BQ
    # background materialize) are all success.
    if resp.status_code in (200, 201, 202):
        if resp.status_code == 202:
            typer.echo(f"Registered (materializing in background): {name}")
        else:
            typer.echo(f"Registered: {name}")
        # Post-success hints. Two operator gotchas this catches:
        #
        # 1. `agnes pull` does not auto-materialize newly-registered
        #    rows — registration adds a registry row, but the parquet
        #    is built only when the scheduler tick runs (or first-sync
        #    is triggered manually). Without this hint operators see
        #    "Updated 0 tables" on `agnes pull` and assume something
        #    is broken.
        # 2. `register-table` does NOT auto-grant. `agnes catalog`
        #    filters per-user via `resource_grants`, so operators
        #    other than the registering admin won't see the new row
        #    until a grant is created.
        #
        # Hint #1 only fires for `local` and `materialized` (the modes
        # that actually produce a parquet); 202-async path covers a
        # different signal, so don't double-message there.
        if query_mode in ("local", "materialized") and resp.status_code != 202:
            typer.echo(
                " Next: run `agnes setup first-sync` to materialize "
                "the parquet (or wait for the scheduler tick)."
            )
        typer.echo(
            f" Note: register-table does not auto-grant. Run "
            f"`agnes admin grant create <group> table {name}` to "
            f"make this visible in `agnes catalog` for non-admin users."
        )
    elif resp.status_code == 409:
        typer.echo(f"Already exists: {name}")
    else:
        typer.echo(f"Failed: {resp.json().get('detail', resp.text)}", err=True)
        raise typer.Exit(1)


@admin_app.command("discover-and-register")
def discover_and_register(
    source_type: str = typer.Option("keboola", help="Source type"),
    token: str = typer.Option(None, help="Keboola Storage API token"),
    url: str = typer.Option(None, help="Keboola stack URL"),
    dry_run: bool = typer.Option(False, "--dry-run", help="Show what would be registered"),
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """Discover all tables from source and register them."""
    import httpx
    import os

    kbc_token = token or os.environ.get("KEBOOLA_STORAGE_TOKEN", "")
    kbc_url = url or os.environ.get("KEBOOLA_STACK_URL", "")
    if not kbc_token or not kbc_url:
        typer.echo("Need KEBOOLA_STORAGE_TOKEN and KEBOOLA_STACK_URL (env or --token/--url)", err=True)
        raise typer.Exit(1)

    typer.echo(f"Discovering tables from {kbc_url}...")
    resp = httpx.get(f"{kbc_url.rstrip('/')}/v2/storage/tables",
                     headers={"X-StorageApi-Token": kbc_token}, timeout=30)
    resp.raise_for_status()
    tables = resp.json()
    typer.echo(f"Found {len(tables)} tables")

    if as_json and dry_run:
        typer.echo(json.dumps([{"id": t["id"], "name": t["name"],
                                "bucket": t.get("bucket", {}).get("id", ""),
                                "rows": t.get("rowsCount", 0)} for t in tables], indent=2))
        return

    registered = 0
    skipped = 0
    errors = 0
    for t in tables:
        table_id = t["id"]
        name = t["name"]
        bucket_id = t.get("bucket", {}).get("id", "")
        if dry_run:
            typer.echo(f" [DRY RUN] {name:30s} bucket={bucket_id:20s} rows={t.get('rowsCount', 0):>10,}")
            continue
        # Keboola tables always go through the Storage API export-async
        # path (`materialize_query`), which is `query_mode='materialized'`
        # in the registry. A NULL source_query means "full table export"
        # — same effective semantics the old 'local' mode gave, but via
        # the Storage API instead of the DuckDB extension. See
        # connectors/keboola/storage_api.py + the v25→v26 migration.
        # Other connectors keep their per-source default.
        default_mode = "materialized" if source_type == "keboola" else "local"
        resp = api_post("/api/admin/register-table", json={
            "name": name,
            "source_type": source_type,
            "bucket": bucket_id,
            "source_table": name,
            "query_mode": default_mode,
            "description": f"Auto-discovered from {source_type}",
        })
        # 200 (BQ synchronous materialize), 201 (legacy non-BQ insert),
        # and 202 (BQ background materialize) are all success — mirrors
        # the matrix in the single-table register-table command. Pre-fix
        # this only accepted 201, so every successful BQ row counted as
        # an error (review NIT 6 in #119).
        if resp.status_code in (200, 201, 202):
            registered += 1
            suffix = " (materializing in background)" if resp.status_code == 202 else ""
            typer.echo(f"{name}{suffix}")
        elif resp.status_code == 409:
            skipped += 1
        else:
            errors += 1
            typer.echo(f"{name}: {resp.json().get('detail', resp.text)}")
    if not dry_run:
        typer.echo(f"\nDone: {registered} registered, {skipped} already existed, {errors} errors")


@admin_app.command("list-tables")
def list_tables(as_json: bool = typer.Option(False, "--json")):
    """List registered tables."""
    resp = api_get("/api/admin/registry")
    if resp.status_code != 200:
        typer.echo(f"Failed: {resp.text}", err=True)
        raise typer.Exit(1)
    data = resp.json()
    if as_json:
        typer.echo(json.dumps(data, indent=2))
    else:
        typer.echo(f"Registered tables: {data['count']}")
        for t in data["tables"]:
            typer.echo(f" {t['name']:30s} src={t.get('source_type','?'):10s} mode={t.get('query_mode','?'):6s} bucket={t.get('bucket',''):20s}")


@admin_app.command("unregister-table")
def unregister_table(
    table_id: str = typer.Argument(..., help="Table id to unregister"),
    yes: bool = typer.Option(
        False, "--yes", "-y",
        help="Skip the confirmation prompt (for scripts).",
    ),
):
    """Unregister a table from the registry.

    Calls `DELETE /api/admin/registry/{table_id}`. The server unhooks the
    master view, removes the canonical parquet for materialized rows, and
    clears the matching `sync_state` row. Issue #177.
    """
    if not yes:
        typer.echo(f"About to unregister table: {table_id}")
        if not typer.confirm("Continue?"):
            typer.echo("Aborted.")
            raise typer.Exit(0)
    resp = api_delete(f"/api/admin/registry/{table_id}")
    if resp.status_code == 204:
        typer.echo(f"Unregistered: {table_id}")
        return
    if resp.status_code == 404:
        typer.echo(f"Not registered: {table_id}", err=True)
        raise typer.Exit(1)
    try:
        detail = resp.json().get("detail", resp.text)
    except Exception:
        detail = resp.text
    typer.echo(f"Failed: {detail}", err=True)
    raise typer.Exit(1)


@admin_app.command("update-table")
def update_table(
    table_id: str = typer.Argument(..., help="Table id to update"),
    name: str = typer.Option(None, "--name", help="New display name"),
    bucket: str = typer.Option(None, "--bucket", help="New bucket / dataset"),
    source_table: str = typer.Option(
        None, "--source-table", help="New source table name"
    ),
    query_mode: str = typer.Option(
        None,
        "--query-mode",
        help="New query mode: local | remote | materialized",
    ),
    query: str = typer.Option(
        None,
        "--query",
        help=(
            "New SQL body for query_mode='materialized' (BigQuery). "
            "Inline SQL or `@path/to.sql` to read from disk. Use "
            "`--query=` (empty value) to clear."
        ),
    ),
    description: str = typer.Option(
        None, "--description", help="New description"
    ),
    sync_schedule: str = typer.Option(
        None,
        "--sync-schedule",
        help="New cron schedule (e.g. 'every 6h' / 'daily 03:00'); honored by materialized BQ rows",
    ),
    source_type: str = typer.Option(
        None,
        "--source-type",
        help="Change source type. Rare — most edits keep this fixed.",
    ),
):
    """Update a registered table.

    Calls `PUT /api/admin/registry/{table_id}` with only the supplied
    fields. Field omitted → unchanged. Issue #177.

    For BQ rows, the server schedules a background rebuild so the master
    view picks up the change without waiting for the next scheduled sync.
    Switching `query_mode` away from `materialized` clears the stale
    `source_query` automatically.
    """
    from pathlib import Path

    payload: dict = {}
    if name is not None:
        payload["name"] = name
    if bucket is not None:
        payload["bucket"] = bucket
    if source_table is not None:
        payload["source_table"] = source_table
    if query_mode is not None:
        payload["query_mode"] = query_mode
    if description is not None:
        payload["description"] = description
    if sync_schedule is not None:
        payload["sync_schedule"] = sync_schedule
    if source_type is not None:
        payload["source_type"] = source_type
    if query is not None:
        if query.startswith("@"):
            sql_path = Path(query[1:])
            if not sql_path.exists():
                typer.echo(f"Error: SQL file not found: {sql_path}", err=True)
                raise typer.Exit(2)
            payload["source_query"] = sql_path.read_text(encoding="utf-8").strip()
        else:
            payload["source_query"] = query.strip()

    if not payload:
        typer.echo(
            "No fields supplied. Pass at least one of --name, --bucket, "
            "--source-table, --query-mode, --query, --description, "
            "--sync-schedule, --source-type.",
            err=True,
        )
        raise typer.Exit(2)

    resp = api_put(f"/api/admin/registry/{table_id}", json=payload)
    if resp.status_code == 200:
        data = resp.json()
        updated = data.get("updated") or sorted(payload.keys())
        typer.echo(f"Updated {table_id}: {', '.join(updated)}")
        return
    if resp.status_code == 404:
        typer.echo(f"Not registered: {table_id}", err=True)
        raise typer.Exit(1)
    try:
        detail = resp.json().get("detail", resp.text)
    except Exception:
        detail = resp.text
    typer.echo(f"Failed: {detail}", err=True)
    raise typer.Exit(1)
@admin_app.command("metadata-show")
def metadata_show(
table_id: str = typer.Argument(..., help="Table ID to show metadata for"),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Show column metadata for a table."""
resp = api_get(f"/api/admin/metadata/{table_id}")
    if resp.status_code != 200:
        # _fail() tolerates non-JSON error bodies, then exits with code 1.
        _fail(resp)
data = resp.json()
if as_json:
typer.echo(json.dumps(data, indent=2))
else:
columns = data.get("columns", [])
if not columns:
typer.echo(f"No column metadata for table: {table_id}")
return
typer.echo(f"Column metadata for table: {table_id} ({len(columns)} columns)")
typer.echo(f" {'COLUMN':<30s} {'BASETYPE':<12s} {'CONFIDENCE':<12s} DESCRIPTION")
typer.echo(" " + "-" * 80)
for col in columns:
            # Left-align values so they line up with the left-aligned header.
            typer.echo(
                f"  {col['column_name']:<30s} {str(col.get('basetype') or ''):<12s} "
                f"{str(col.get('confidence') or ''):<12s} {col.get('description') or ''}"
            )
@admin_app.command("metadata-apply")
def metadata_apply(
proposal_path: str = typer.Argument(..., help="Path to proposal JSON file"),
push_to_source: bool = typer.Option(False, "--push-to-source", help="Push metadata to Keboola after import"),
dry_run: bool = typer.Option(False, "--dry-run", help="Show what would change without applying"),
):
"""Apply a metadata proposal JSON to DuckDB."""
import os
if not os.path.exists(proposal_path):
typer.echo(f"Proposal file not found: {proposal_path}", err=True)
raise typer.Exit(1)
with open(proposal_path, "r", encoding="utf-8") as f:
proposal = json.load(f)
tables = proposal.get("tables", {})
total = sum(len(t.get("columns", {})) for t in tables.values())
if dry_run:
typer.echo(f"[DRY RUN] Would import {total} column(s) from {len(tables)} table(s):")
for table_id, table_data in tables.items():
columns = table_data.get("columns", {})
for col_name, col_data in columns.items():
typer.echo(
f" {table_id}.{col_name}: basetype={col_data.get('basetype')} "
f"description={col_data.get('description')}"
)
return
from src.repositories.column_metadata import ColumnMetadataRepository
from src.db import get_system_db
conn = get_system_db()
try:
repo = ColumnMetadataRepository(conn)
count = repo.import_proposal(proposal_path)
typer.echo(f"Imported {count} column(s) from proposal.")
finally:
conn.close()
if push_to_source:
for table_id in tables:
resp = api_post(f"/api/admin/metadata/{table_id}/push")
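            if resp.status_code == 200:
                typer.echo(f"Pushed metadata for {table_id} to source.")
            else:
                # Error bodies are not guaranteed to be JSON; fall back to raw text.
                try:
                    detail = resp.json().get("detail", resp.text)
                except Exception:
                    detail = resp.text
                typer.echo(f"Failed to push {table_id}: {detail}", err=True)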
# ---- User management (#11) ----
def _resolve_user_id(ref: str) -> str:
"""Accept either a UUID or an email; look up email → id via list."""
if "@" not in ref:
return ref
resp = api_get("/api/users")
if resp.status_code != 200:
typer.echo(f"Could not list users: {resp.text}", err=True)
raise typer.Exit(1)
for u in resp.json():
if u.get("email") == ref:
return u["id"]
typer.echo(f"User not found: {ref}", err=True)
raise typer.Exit(1)
def _print_user_result(resp, ok_msg: str) -> None:
if resp.status_code in (200, 204):
typer.echo(ok_msg)
else:
try:
detail = resp.json().get("detail", resp.text)
except Exception:
detail = resp.text
typer.echo(f"Failed: {detail}", err=True)
raise typer.Exit(1)
@admin_app.command("set-role")
def set_role(
user_ref: str = typer.Argument(..., help="User id or email"),
role: str = typer.Argument(..., help="(removed — see message)"),
):
"""[REMOVED] Roles were replaced by group memberships in v0.25."""
typer.echo(
"Error: 'agnes admin set-role' was removed in v0.25.\n"
" Roles were replaced by group memberships.\n"
f" Make {user_ref!r} admin:\n"
" agnes admin group list # find Admin group id\n"
f" agnes admin group add-member <admin-id> {user_ref}\n",
err=True,
)
raise typer.Exit(2)
@admin_app.command("deactivate")
def deactivate(user_ref: str = typer.Argument(..., help="User id or email")):
"""Deactivate a user (blocks login, existing tokens also rejected)."""
uid = _resolve_user_id(user_ref)
resp = api_post(f"/api/users/{uid}/deactivate")
_print_user_result(resp, f"Deactivated {user_ref}")
@admin_app.command("activate")
def activate(user_ref: str = typer.Argument(..., help="User id or email")):
"""Re-activate a deactivated user."""
uid = _resolve_user_id(user_ref)
resp = api_post(f"/api/users/{uid}/activate")
_print_user_result(resp, f"Activated {user_ref}")
@admin_app.command("reset-password")
def reset_password(user_ref: str = typer.Argument(..., help="User id or email")):
"""Generate a reset token (emailed if SMTP/SendGrid configured)."""
uid = _resolve_user_id(user_ref)
resp = api_post(f"/api/users/{uid}/reset-password")
if resp.status_code == 200:
data = resp.json()
typer.echo(f"Reset URL: {data['reset_url']}")
typer.echo(f"Email sent: {data['email_sent']}")
    else:
        # _fail() tolerates non-JSON error bodies, then exits with code 1.
        _fail(resp)
@admin_app.command("set-password")
def set_password(
user_ref: str = typer.Argument(..., help="User id or email"),
password: str = typer.Option(
..., prompt=True, hide_input=True, confirmation_prompt=True,
help="New password (hidden input)",
),
):
"""Set a user's password directly (force-reset flow)."""
uid = _resolve_user_id(user_ref)
resp = api_post(f"/api/users/{uid}/set-password", json={"password": password})
if resp.status_code == 204:
typer.echo(f"Password set for {user_ref}")
    else:
        # _fail() tolerates non-JSON error bodies, then exits with code 1.
        _fail(resp)
# ---- Access management (v12 — user_groups + members + resource_grants) ----
#
# Calls the unified access REST API under /api/admin (see app/api/access.py).
# Every endpoint requires Admin user_group membership.
group_app = typer.Typer(help="User group + membership management")
grant_app = typer.Typer(help="Resource grant CRUD")
admin_app.add_typer(group_app, name="group")
admin_app.add_typer(grant_app, name="grant")
def _fail(resp, prefix: str = "Failed") -> None:
try:
detail = resp.json().get("detail", resp.text)
except Exception:
detail = resp.text
typer.echo(f"{prefix}: {detail}", err=True)
raise typer.Exit(1)
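def _print_rows(rows: list, columns: list[tuple[str, str, int]]) -> None:
    """Print rows as a fixed-width table.

    Each entry in `columns` is a (row key, header label, column width) tuple.
    """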
header = " " + " ".join(f"{h:<{w}s}" for _, h, w in columns)
typer.echo(header)
typer.echo(" " + "-" * (len(header) - 2))
for row in rows:
cells = []
for key, _, width in columns:
val = row.get(key)
cells.append(f"{(str(val) if val is not None else ''):<{width}s}")
typer.echo(" " + " ".join(cells))
def _resolve_group_id(ref: str) -> str:
"""Accept group id (UUID-ish) or name; look up via /api/admin/groups."""
resp = api_get("/api/admin/groups")
if resp.status_code != 200:
_fail(resp, prefix="Could not list groups")
for g in resp.json():
if g["id"] == ref or g["name"] == ref:
return g["id"]
typer.echo(f"Group not found: {ref}", err=True)
raise typer.Exit(1)
@group_app.command("list")
def group_list(as_json: bool = typer.Option(False, "--json")):
"""List all user groups."""
resp = api_get("/api/admin/groups")
if resp.status_code != 200:
_fail(resp)
rows = resp.json()
if as_json:
        typer.echo(json.dumps(rows, indent=2))
        return
typer.echo(f"User groups: {len(rows)}")
_print_rows(rows, [
("name", "NAME", 24),
("description", "DESCRIPTION", 40),
("is_system", "SYSTEM", 7),
("member_count", "MEMBERS", 8),
("grant_count", "GRANTS", 7),
])
@group_app.command("create")
def group_create(
name: str = typer.Argument(..., help="Group name"),
description: str = typer.Option("", help="Description"),
):
"""Create a new user group."""
resp = api_post("/api/admin/groups", json={"name": name, "description": description or None})
if resp.status_code != 201:
_fail(resp)
typer.echo(f"Created group: {name} (id={resp.json()['id']})")
@group_app.command("delete")
def group_delete(group_ref: str = typer.Argument(..., help="Group id or name")):
"""Delete a user group (and its members + grants)."""
gid = _resolve_group_id(group_ref)
resp = api_delete(f"/api/admin/groups/{gid}")
if resp.status_code in (200, 204):
        typer.echo(f"Deleted group {group_ref}")
        return
_fail(resp)
@group_app.command("members")
def group_members(group_ref: str = typer.Argument(..., help="Group id or name")):
"""List members of a group."""
gid = _resolve_group_id(group_ref)
resp = api_get(f"/api/admin/groups/{gid}/members")
if resp.status_code != 200:
_fail(resp)
rows = resp.json()
typer.echo(f"Members: {len(rows)}")
_print_rows(rows, [
("email", "EMAIL", 30),
("name", "NAME", 20),
("source", "SOURCE", 14),
("active", "ACTIVE", 7),
])
@group_app.command("add-member")
def group_add_member(
group_ref: str = typer.Argument(..., help="Group id or name"),
email: str = typer.Argument(..., help="User email"),
):
"""Add a user to a group (source='admin' — survives Google sync)."""
gid = _resolve_group_id(group_ref)
resp = api_post(f"/api/admin/groups/{gid}/members", json={"email": email})
if resp.status_code != 201:
_fail(resp)
typer.echo(f"Added {email} to {group_ref}")
@group_app.command("remove-member")
def group_remove_member(
group_ref: str = typer.Argument(..., help="Group id or name"),
email: str = typer.Argument(..., help="User email"),
):
"""Remove a user from a group (only admin-source rows can be removed this way)."""
gid = _resolve_group_id(group_ref)
user_id = _resolve_user_id(email)
resp = api_delete(f"/api/admin/groups/{gid}/members/{user_id}")
if resp.status_code in (200, 204):
        typer.echo(f"Removed {email} from {group_ref}")
        return
_fail(resp)
@grant_app.command("list")
def grant_list(
resource_type: str = typer.Option("", "--type", help="Filter by resource type"),
group_ref: str = typer.Option("", "--group", help="Filter by group id or name"),
as_json: bool = typer.Option(False, "--json"),
):
"""List resource grants."""
params = {}
if resource_type:
params["resource_type"] = resource_type
if group_ref:
params["group_id"] = _resolve_group_id(group_ref)
resp = api_get("/api/admin/grants", params=params)
if resp.status_code != 200:
_fail(resp)
rows = resp.json()
if as_json:
        typer.echo(json.dumps(rows, indent=2))
        return
typer.echo(f"Resource grants: {len(rows)}")
_print_rows(rows, [
("group_name", "GROUP", 20),
("resource_type", "RESOURCE TYPE", 22),
("resource_id", "RESOURCE ID", 40),
("assigned_by", "ASSIGNED BY", 24),
])
@grant_app.command("create")
def grant_create(
group_ref: str = typer.Argument(..., help="Group id or name"),
resource_type: str = typer.Argument(..., help="Resource type (e.g. marketplace_plugin)"),
resource_id: str = typer.Argument(..., help="Resource path (e.g. foundry-ai/metrics-plugin)"),
):
"""Grant a group access to a specific resource."""
gid = _resolve_group_id(group_ref)
resp = api_post("/api/admin/grants", json={
"group_id": gid,
"resource_type": resource_type,
"resource_id": resource_id,
})
if resp.status_code != 201:
_fail(resp)
typer.echo(f"Granted {group_ref}: {resource_type}/{resource_id}")
@grant_app.command("delete")
def grant_delete(grant_id: str = typer.Argument(..., help="Grant id")):
"""Delete a grant by id."""
resp = api_delete(f"/api/admin/grants/{grant_id}")
if resp.status_code in (200, 204):
        typer.echo(f"Deleted grant {grant_id}")
        return
_fail(resp)
@grant_app.command("resource-types")
def grant_resource_types(as_json: bool = typer.Option(False, "--json")):
"""List the resource types modules have registered."""
resp = api_get("/api/admin/resource-types")
if resp.status_code != 200:
_fail(resp)
rows = resp.json()
if as_json:
        typer.echo(json.dumps(rows, indent=2))
        return
_print_rows(rows, [
("key", "KEY", 28),
("display_name", "DISPLAY NAME", 28),
("id_format", "ID FORMAT", 36),
])
# ---------------------------------------------------------------------------
# Break-glass: out-of-band admin grant.
#
# Talks directly to system.duckdb — no HTTP, no auth dependency. The whole
# point is recovery for the case where the running server's authorization
# layer is broken or there is no admin left to authenticate as. Requires
# filesystem access to ${DATA_DIR}/state/system.duckdb and is therefore
# restricted to operators with shell access on the host.
# ---------------------------------------------------------------------------
breakglass_app = typer.Typer(
help="Out-of-band recovery (talks directly to system.duckdb)",
)
admin_app.add_typer(breakglass_app, name="break-glass")
@breakglass_app.command("grant-admin")
def break_glass_grant_admin(
email: str = typer.Argument(..., help="Email of the user to promote"),
yes: bool = typer.Option(
False, "--yes", "-y", help="Skip confirmation prompt"
),
) -> None:
    """Grant Admin-group membership to a user without going through the API.

    Operates directly on system.duckdb. Use when the server is up but the
    Admin group has no live members (race, mistake, accidental DELETE) or
    when bootstrapping a brand-new install before any admin exists. Membership
    is recorded with source='cli_break_glass' so it's distinguishable from
    google_sync / admin / system_seed in audits.

    The DuckDB file must not be locked by a running app process; stop the
    app or use a separate replica before running this.
    """
import uuid as _uuid
from src.db import SYSTEM_ADMIN_GROUP, get_system_db
from src.repositories.user_groups import UserGroupsRepository
from src.repositories.user_group_members import UserGroupMembersRepository
from src.repositories.users import UserRepository
if not yes:
confirm = typer.confirm(
f"Grant Admin-group membership to {email!r} (break-glass)?",
default=False,
)
if not confirm:
typer.echo("Aborted.")
raise typer.Exit(1)
conn = get_system_db()
try:
users = UserRepository(conn)
groups = UserGroupsRepository(conn)
members = UserGroupMembersRepository(conn)
admin_group = groups.get_by_name(SYSTEM_ADMIN_GROUP)
if admin_group is None:
typer.echo(
f"FATAL: '{SYSTEM_ADMIN_GROUP}' group missing. Start the app "
"once so _seed_system_groups can recreate it, then retry.",
err=True,
)
raise typer.Exit(2)
existing = users.get_by_email(email)
if existing is None:
user_id = _uuid.uuid4().hex
users.create(
id=user_id,
email=email,
name=email.split("@", 1)[0],
)
typer.echo(f"Created user {email} (id={user_id[:8]}…)")
else:
user_id = existing["id"]
if members.has_membership(user_id, admin_group["id"]):
typer.echo(
f"{email} is already a member of '{SYSTEM_ADMIN_GROUP}'."
)
return
members.add_member(
user_id=user_id,
group_id=admin_group["id"],
source="cli_break_glass",
added_by="cli:break-glass",
)
typer.echo(
f"Granted Admin to {email}. Audit source='cli_break_glass'."
)
finally:
try:
conn.close()
except Exception:
pass