* feat(store): flea-market upload guardrails + soft delete + JOIN-based admin queue
Adds an end-to-end guardrails pipeline for store uploads (manifest +
static-security + LLM review), persists blocked bundles for forensics,
introduces soft-delete (Archive) semantics, consolidates the legacy
/store/{id} surface into /marketplace/flea/{id}, and reworks the admin
queue so lifecycle filters read live entity visibility via LEFT JOIN
rather than a denormalized submission column.
Schema v29 → v35:
* v29 store_submissions table + store_entities.visibility_status
* v30 file_size, bundle_sha256, bundle_purged_at on submissions
* v31 reshape store_submissions (drop legacy unique on entity_id)
* v32 store_entities.archived_at/by + 'archived' visibility value
* v33 drop store_submissions.retry_count (unused)
* v34 ensure idx_store_submissions_entity exists post column-drop
* v35 broaden visibility_status enum + JOIN architecture cutover
Pipeline (src/store_guardrails/):
* Inline checks: manifest_check, static_scan, quality_check
* LLM review configurable haiku|sonnet|opus (default haiku)
* BackgroundTasks-driven async path with structured-output JSON
* Per-submitter daily quota (default 50)
* 30-day TTL purge job (POST /api/admin/run-blocked-purge)
* Bundle SHA256 + size persisted; sha256 survives purge for forensics
Visibility model:
* pending | approved | hidden | archived
* _enforce_visibility returns 404 (no leak) for non-owner non-admin
* Owner sees own non-approved entries via include_owner_id widening
* Install refused with 409 entity_not_approved when not approved
Soft-delete (DELETE /api/store/entities/{id}):
* Default = soft (visibility_status='archived'); existing installs
keep getting served the bundle so users don't lose the plugin
* ?hard=true admin-only: drops bundle + cascades user_store_installs
* Hard-delete preserves entity_id on submission as tombstone so
audit_log linkage survives for the activity timeline
Admin queue lifecycle (the JOIN refactor):
* Verdict (store_submissions.status) is immutable forensic record
* Lifecycle (store_entities.visibility_status) is live state
* /admin/store/submissions Archived chip translates to
`e.visibility_status='archived'` via LEFT JOIN — any path that
flips visibility surfaces in the queue immediately
* Detail page renders Status (verdict) and Entity lifecycle side by
side so admins see "approved at review, now archived" at a glance
URL consolidation:
* /store/{id} deleted (no redirect, stale bookmarks 404)
* /marketplace/flea/{id} is the canonical detail surface
* Three in-tree callers (upload-success, my-stack card, store
listing card) updated to point at the new URL
* Quarantine banner extracted to _quarantine_banner.html partial,
self-guarded, included from both flea detail templates
* Banner JS auto-refreshes when the verdict lands by polling
/api/marketplace/flea/{id}/detail (visibility_status +
submission_status — the latter is needed because blocked_llm
keeps the entity at visibility_status='pending')
Audit log resource format:
* runner.py emits prefixed `store_submission:{id}` (post-fix)
* Detail-page timeline query handles three patterns: prefixed
submission, helper-emitted `store_entity:{sub_id}`, and bare-id
legacy rows — all surface in the activity timeline
UX fixes:
* Owner sees Under review / Quarantined / Hidden banner with status
* Install button gray-disabled (not blue) when non-approved
* Owner cannot delete quarantined entries (403); admin can
* Admin queue: filter chips, sortable columns, paging, page-size
* Auto-refresh queue every 5s while pending rows are visible
* Store upload page file picker no longer opens twice (label →
input default action collided with explicit JS handler)
Tests: 168 passed across the guardrails suites (admin submissions,
store API, inline / LLM / purge guardrails, store repositories,
marketplace filter, schema version). New regression coverage
includes: archive surfaces via JOIN even when API path is bypassed;
deleted submission renders activity timeline (tombstone); flea
detail surfaces submission_status only for owner/admin; detail page
renders Entity lifecycle row; audit log resource format covers both
helper and runner paths.
* fix(store-guardrails): PR #233 follow-up — prompt injection, atomic PUT, BG race, schema, reaper, sort whitelist
Addresses 9 of the 23 findings from the PR #233 review (spec at
docs/superpowers/specs/2026-05-09-pr233-guardrails-fixes-spec.md).
Merge-gate items #1-#6 plus high-value mediums #7, #9-#12, #23.
Architectural items (#8 enum split, #14 factory) and pure
maintainability (#15-#22) deferred to follow-ups.
Security:
* #1 prompt injection — SYSTEM_PROMPT now passed via the SDK's
dedicated system= parameter; bundle wrapped in <bundle>...</bundle>
sentinels declared data-only by the system prompt; literal
sentinel strings in user content are escaped so an adversarial
README can't forge a close tag.
* #6 static scan honesty — module docstring + admin copy + docs
declare static scan as signal not gate; .md/.txt/.rst/.html/.json/
.yaml/.yml/.toml skipped to avoid false positives on prose.
AST mode for Python deferred (separate flag, FP comparison work).
Correctness:
* #2 PUT atomicity — bundles bake into plugin.staging-<rand>/
alongside live, atomic-rename on success; failed checks leave
live tree byte-for-byte intact.
* #3 BG-task race — set_visibility_if_pending guards verdict flips
to the (pending, hidden) review window; admin archives during
review survive; skipped flips audit-logged.
* #4 v35 NOT NULL/DEFAULT — schema v35→v36 re-applies them on
store_entities.visibility_status. CHECK constraint enforced
application-side (DuckDB ADD CHECK on existing column unsupported).
* #7 stuck-review reaper — reap_stuck_llm_reviews flips pending_llm
rows older than guardrails.stuck_review_grace_seconds (default
1800) to review_error. Scheduler runs every 15 min via new
/api/admin/run-reap-stuck-reviews. Set knob to 0 to disable.
* #9 quota counter — count_blocked_for_submitter_since now counts
blocked_inline + blocked_llm + review_error so a submitter
triggering only LLM-blocked verdicts is bounded.
* #10 missing risk_level — surfaces as review_error with
error='missing_risk_level' instead of silently defaulting to
'medium' (which looked like a model-decided block).
* #11 archived_at clear — set_visibility nulls archived_at +
archived_by when transitioning out of 'archived' so a future
read doesn't show stale archive forensics on an approved row.
Maintainability:
* #12 FSM doc comment — accurate insert/transition/lifecycle
description in src/db.py near store_submissions schema.
* #23 sort-key whitelist — admin queue rejects unknown sort keys
with 400 invalid_sort_key; substring-replace footgun removed.
Deferred (separate PRs):
* #5 quota race — proper fix requires asyncio.Lock spanning the
full pipeline; threading.Lock blocks event loop, DuckDB MVCC
doesn't help. API-level slowapi bounds worst case for now.
* #6 part 3 (AST static scan), #8 (enum split), #13 (import
bundle docs), #14 (factory consolidation), #15-#22 (maint).
Tests:
* New: tests/test_store_guardrails_prompt_injection.py (corpus +
trust-boundary invariants), tests/test_store_put_atomic.py,
tests/test_store_guardrails_reaper.py.
* Extended: test_store_guardrails_llm.py (system param, missing
risk_level, BG race), test_admin_store_submissions.py (quota
counter widening, sort whitelist 400), test_store_repositories.py
(un-archive metadata clear), test_db_schema_version.py (v36).
* Full suite: 3738 passed; 17 pre-existing baseline failures
unchanged (db migration tests, cli binary rename, catalog export,
user mgmt v5 backfill — confirmed by stash + rerun on clean tree).
463 lines
21 KiB
Text
463 lines
21 KiB
Text
# AI Data Analyst - Instance Configuration
|
||
# ==========================================
|
||
# This is the main configuration file for your instance.
|
||
# Copy to instance.yaml and fill in your values.
|
||
#
|
||
# SECRET VALUES use ${ENV_VAR} syntax - actual values go in .env file.
|
||
# Non-secret values are set directly here.
|
||
|
||
# --- Config version ---
|
||
# Incremented when the config schema changes. Must match SUPPORTED_CONFIG_VERSIONS
|
||
# in config/loader.py. Currently only version 1 is supported.
|
||
config_version: 1
|
||
|
||
# --- Instance branding ---
|
||
instance:
|
||
name: "AI Data Analyst"
|
||
subtitle: "Your Organization"
|
||
copyright: "Your Organization"
|
||
# logo_svg: Full <svg> element for header logo (optional, default: Keboola logo)
|
||
# Example: '<svg width="120" height="30" viewBox="0 0 100 30" xmlns="http://www.w3.org/2000/svg"><text y="22" font-size="24" fill="#333">Logo</text></svg>'
|
||
# sync_interval: "1 hour" # Cadence shown in analyst CLAUDE.md (e.g., "1 hour", "30 minutes", "daily")
|
||
|
||
# --- Server ---
|
||
server:
|
||
hostname: "" # DNS name (e.g., "data.acme.com")
|
||
host: "" # IP address
|
||
app_dir: "/opt/data-analyst" # Installation directory
|
||
# --- Client setup (shown in "Get Started" on dashboard) ---
|
||
# ssh_alias: "data-analyst" # SSH config Host alias for analysts (default: "data-analyst")
|
||
# ssh_key: "~/.ssh/data_analyst_server" # SSH key path for analysts (default: "~/.ssh/data_analyst_server")
|
||
# project_dir: "data-analyst" # Local project folder name (default: "data-analyst")
|
||
|
||
# --- Admin users ---
|
||
# Manage the server, own data files, get unlimited resource limits.
|
||
# SSH keys are used by server/setup.sh during provisioning.
|
||
admins:
|
||
- username: "admin"
|
||
ssh_public_key: "ssh-ed25519 AAAA..."
|
||
|
||
# --- Deployment ---
|
||
deployment:
|
||
method: "manual" # manual | github_actions
|
||
repo_url: "" # e.g., "git@github.com:acme/ai-data-analyst.git"
|
||
branch: "main"
|
||
|
||
# --- Authentication ---
|
||
# At minimum, set allowed_domain and webapp_secret_key.
|
||
# Email magic link auth works out of the box (no external service needed).
|
||
# Google OAuth is optional - add credentials to enable it.
|
||
auth:
|
||
allowed_domain: "" # Email domain(s) for login, comma-separated (e.g., "acme.com" or "acme.com, partner.org")
|
||
webapp_secret_key: "${WEBAPP_SECRET_KEY}"
|
||
# Optional: Google OAuth (if not set, only email magic link is available)
|
||
google_client_id: "${GOOGLE_CLIENT_ID}"
|
||
google_client_secret: "${GOOGLE_CLIENT_SECRET}"
|
||
|
||
# --- Webapp username shaping ---
|
||
#
|
||
# By default, a user's OS account is derived from their full email:
|
||
# e.psimecek@acme.com -> e_psimecek_acme_com
|
||
#
|
||
# Two options let you control this:
|
||
#
|
||
# username_strip_domain: true
|
||
# Use only the local part of the email (before @).
|
||
# Safe when allowed_domain ensures all users share a single domain.
|
||
# e.psimecek@acme.com -> e_psimecek
|
||
# Keeps usernames short and readable.
|
||
#
|
||
# username_prefix: "myapp_"
|
||
# Prepend a fixed string to every webapp-created account name.
|
||
# Necessary when an external identity system (GCP OS Login, LDAP, SAML)
|
||
# already creates OS accounts in /home/ using the same naming scheme.
|
||
# Without a prefix, the webapp sees those existing OS accounts and refuses
|
||
# to register new analyst accounts ("already in use by a system account").
|
||
# With prefix "myapp_" and strip_domain true:
|
||
# e.psimecek@acme.com -> myapp_e_psimecek
|
||
# Linux enforces a 32-character username limit. Keep the prefix short.
|
||
# Changing or removing either option later will invalidate all existing
|
||
# analyst accounts. Use username_mapping (top-level) to bridge legacy accounts.
|
||
#
|
||
# username_strip_domain: false
|
||
# username_prefix: ""
|
||
# disabled_providers: # Hide auth methods from login page
|
||
# - "email" # Disable email magic link (use when Google OAuth is configured)
|
||
|
||
# --- Theme (optional) ---
|
||
# Customize colors, fonts, and shape to match your brand.
|
||
# All values are optional - defaults provide a clean blue theme.
|
||
# See docs/theme-reference.html for a visual guide.
|
||
theme:
|
||
# primary: "#0073D1" # Main brand color (buttons, links, accents)
|
||
# primary_dark: "#005BA3" # Hover/active state of primary
|
||
# primary_light: "rgba(0, 115, 209, 0.1)" # Light tint backgrounds
|
||
# text_primary: "#1A253C" # Main text color
|
||
# text_secondary: "#6B7280" # Muted/secondary text
|
||
# background: "#F5F7FA" # Page background
|
||
# surface: "#FFFFFF" # Card/panel background
|
||
# border: "#E5E7EB" # Borders and dividers
|
||
# font_primary: "'Inter', system-ui, sans-serif"
|
||
# font_url: "https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap"
|
||
# radius: "6px" # Border radius (cards, buttons, inputs)
|
||
# success: "#10B77F"
|
||
# warning: "#F59F0A"
|
||
# error: "#EA580C"
|
||
|
||
# --- Data source ---
|
||
data_source:
|
||
type: "keboola" # keboola | bigquery | local
|
||
keboola:
|
||
storage_token: "${KEBOOLA_STORAGE_TOKEN}"
|
||
stack_url: "" # e.g., "https://connection.keboola.com"
|
||
project_id: ""
|
||
bigquery:
|
||
project: "${BIGQUERY_PROJECT}" # GCP project hosting the data (used in FROM clause)
|
||
location: "${BIGQUERY_LOCATION}" # BigQuery location (e.g., "us-central1", "US")
|
||
# Uses ADC (Application Default Credentials) - VM service account on GCP
|
||
# Data can live in a different project -- use fully-qualified table IDs in data_description.md
|
||
# billing_project: "prj-billing" # GCP project to bill BQ jobs to / submit jobs from.
|
||
# # Defaults to `project`. Set when the SA has bigquery.data.* on
|
||
# # the data project but lacks serviceusage.services.use there.
|
||
# # Mismatch -> every BQ call 403 USER_PROJECT_DENIED.
|
||
# # `da diagnose` warns when this falls back to `project`.
|
||
# # Configurable via /admin/server-config UI.
|
||
# max_bytes_per_materialize: 10737418240
|
||
# # Cost guardrail (bytes) for query_mode='materialized' BQ scans.
|
||
# # Dry-run check before running; exceeding -> registration / sync
|
||
# # rejected. Default 10 GiB (10737418240). Set 0 to disable.
|
||
# # null falls through to default. Configurable via /admin/server-config UI.
|
||
# query_timeout_ms: 600000
|
||
# # DuckDB BigQuery extension query timeout (milliseconds).
|
||
# # Applied via `SET bq_query_timeout_ms` after every LOAD bigquery
|
||
# # on every BQ-touching DuckDB session. Extension default is
|
||
# # 90 000 ms = 90 s, which is too tight for analyst queries against
|
||
# # view-backed datasets -- bumped to 600 000 ms = 10 min by default.
|
||
# # Set 0 to fall through to the extension default. Configurable via
|
||
# # /admin/server-config UI.
|
||
# session_pool_size: 4
|
||
# # Number of pre-warmed DuckDB+bigquery-extension sessions kept
|
||
# # in a process-local pool. Each acquire amortizes the
|
||
# # ~0.5 s INSTALL/LOAD/CREATE-SECRET cost across requests; a fresh
|
||
# # build only happens when the pool is empty. Default 4. Set 0
|
||
# # to disable pooling (every acquire builds + closes a fresh
|
||
# # session; matches pre-pool behavior).
|
||
|
||
# --- OpenMetadata catalog (optional) ---
|
||
# Enriches table and column metadata from OpenMetadata REST API.
|
||
# If not configured, app works normally without catalog enrichment.
|
||
# All openmetadata.* fields configurable via /admin/server-config UI.
|
||
# openmetadata:
|
||
# url: "https://your-catalog.example.com"
|
||
# token: "${OPENMETADATA_TOKEN}" # JWT bearer token
|
||
# cache_ttl_seconds: 3600 # Cache TTL in seconds
|
||
# verify_ssl: true # set to false ONLY for internal
|
||
# # CAs / self-signed certs; defaults
|
||
# # to true. Setting false ships the
|
||
# # JWT over an unverified channel.
|
||
|
||
# --- Email delivery (optional, for magic link auth) ---
|
||
# Without SMTP, magic links are shown directly in browser (development mode).
|
||
# For production, configure any SMTP relay (Gmail, Mailgun, SendGrid SMTP, etc.)
|
||
email:
|
||
from_address: "noreply@example.com"
|
||
from_name: "AI Data Analyst"
|
||
smtp_host: "${SMTP_HOST}" # e.g., "smtp.gmail.com"
|
||
smtp_port: 587 # 587 for STARTTLS, 465 for SSL
|
||
smtp_user: "${SMTP_USER}"
|
||
smtp_password: "${SMTP_PASSWORD}"
|
||
|
||
# --- Desktop app (optional) ---
|
||
# All desktop.* fields configurable via /admin/server-config UI (rarely changed once set).
|
||
desktop:
|
||
jwt_issuer: "data-analyst"
|
||
jwt_secret: "${DESKTOP_JWT_SECRET}"
|
||
url_scheme: "data-analyst"
|
||
|
||
# --- Telegram notifications (optional) ---
|
||
telegram:
|
||
bot_token: "${TELEGRAM_BOT_TOKEN}"
|
||
bot_username: ""
|
||
domain_suffix: ""
|
||
|
||
# --- Jira integration (optional) ---
|
||
jira:
|
||
domain: ""
|
||
email: ""
|
||
api_token: "${JIRA_API_TOKEN}"
|
||
webhook_secret: "${JIRA_WEBHOOK_SECRET}"
|
||
sla_email: ""
|
||
sla_api_token: "${JIRA_SLA_API_TOKEN}"
|
||
cloud_id: ""
|
||
|
||
# --- Corporate Memory AI (optional) ---
|
||
# Extracts shared knowledge from team members' CLAUDE.local.md files.
|
||
# Provider: "anthropic" (direct API) or "openai_compat" (LiteLLM, OpenRouter, Azure, etc.)
|
||
ai:
|
||
provider: "anthropic" # or "openai_compat"
|
||
api_key: "${ANTHROPIC_API_KEY}" # or "${LLM_API_KEY}" for proxy
|
||
# base_url: "https://litellm.example.com" # Required for provider='openai_compat' (LiteLLM,
|
||
# OpenRouter, vLLM). Ignored when provider='anthropic'.
|
||
# Configurable via /admin/server-config UI.
|
||
model: "claude-haiku-4-5-20251001" # any model available on your provider
|
||
# --- Structured output quality control ---
|
||
# AI models can return JSON in three ways, each with different reliability:
|
||
#
|
||
# Layer 1 - "json_schema" (best):
|
||
# The provider enforces an exact schema. Every field, type, and structure
|
||
# is guaranteed. Available on: Anthropic, OpenAI, Claude via LiteLLM.
|
||
#
|
||
# Layer 2 - "json_object" (good):
|
||
# The provider guarantees valid JSON, but does not enforce a specific schema.
|
||
# Fields may be missing or have wrong types. Available on most providers.
|
||
#
|
||
# Layer 3 - "prompt" (acceptable):
|
||
# The AI is asked to respond in JSON via instructions in the prompt.
|
||
# No technical enforcement -- the model may still return invalid JSON.
|
||
# Works everywhere, but least reliable.
|
||
#
|
||
# "strict" = only Layer 1. Fail if provider doesn't support json_schema.
|
||
# Use when data quality is non-negotiable.
|
||
# "json" = Layer 1, fall back to Layer 2. No prompt-based fallback.
|
||
# Good balance of quality and compatibility.
|
||
# "auto" = All three layers as progressive fallback. Maximum compatibility.
|
||
# Use when you'd rather get imperfect data than no data.
|
||
structured_output: "auto"
|
||
|
||
# Legacy format (still supported, equivalent to provider: "anthropic"):
|
||
# ai:
|
||
# anthropic_api_key: "${ANTHROPIC_API_KEY}"
|
||
|
||
# Examples:
|
||
# --- LiteLLM proxy ---
|
||
# ai:
|
||
# provider: "openai_compat"
|
||
# base_url: "https://litellm.example.com"
|
||
# api_key: "${LLM_API_KEY}"
|
||
# model: "claude-haiku-4-5-20251001"
|
||
# structured_output: "strict"
|
||
#
|
||
# --- OpenRouter ---
|
||
# ai:
|
||
# provider: "openai_compat"
|
||
# base_url: "https://openrouter.ai/api/v1"
|
||
# api_key: "${OPENROUTER_API_KEY}"
|
||
# model: "anthropic/claude-3-haiku"
|
||
# structured_output: "auto"
|
||
|
||
# --- Flea-market upload guardrails (optional) ---
|
||
# Controls the pre-publish check pipeline for skill/agent/plugin uploads
|
||
# to /store. See docs/STORE_GUARDRAILS.md for the full check catalogue.
|
||
#
|
||
# guardrails:
|
||
# # Master kill-switch. When false, inline manifest/security/quality
|
||
# # checks still run (they're free) but the LLM step is skipped and new
|
||
# # uploads are auto-approved. Useful for local dev without an LLM key.
|
||
# enabled: true
|
||
#
|
||
# # Anthropic model tier for the LLM security review.
|
||
# # haiku — ~$0.001/review, default, good enough for routine uploads
|
||
# # sonnet — ~$0.015/review, deeper reasoning, fewer false negatives
|
||
# # opus — ~$0.075/review, only for high-stakes deployments
|
||
# # You can also pin a concrete model ID (e.g. "claude-haiku-4-5-20251001").
|
||
# review_model: "haiku"
|
||
#
|
||
# # Per-submitter daily cap on inline-blocked uploads. Bounds disk +
|
||
# # admin-queue spam. Set to 0 to disable. Default 50.
|
||
# blocked_quota_per_day: 50
|
||
#
|
||
# # How many days to keep blocked bundle bytes on disk before the
|
||
# # daily TTL job purges them. Submission row + sha256 + size always
|
||
# # survive — only the bundle bytes go. Set to 0 to retain forever
|
||
# # (rely on admin Delete). Default 30.
|
||
# blocked_bundle_ttl_days: 30
|
||
|
||
# --- Corporate Memory governance (optional) ---
|
||
# Controls how AI-extracted knowledge is reviewed and distributed.
|
||
# If not present, system operates in legacy mode (democratic wiki, no admin review).
|
||
#
|
||
# The corporate_memory.* schema is editable via /admin/server-config UI; you can
|
||
# also continue to manage it via this YAML file. The UI surfaces every leaf with
|
||
# a hint, so use it to discover the schema if this comment block has aged.
|
||
#
|
||
# corporate_memory:
|
||
# # How knowledge reaches users:
|
||
# # "mandatory_only" — admin controls everything, no user voting
|
||
# # "admin_curated" — admin controls, users vote as feedback signal
|
||
# # "hybrid" — mandatory from admin + optional from user voting (default)
|
||
# distribution_mode: "hybrid"
|
||
#
|
||
# # How new AI-extracted items enter the system:
|
||
# # "review_queue" — nothing published without admin approval (default)
|
||
# # "auto_publish" — items go live immediately, admin intervenes retroactively
|
||
# # "threshold" — high-confidence auto-publish, low-confidence to review queue
|
||
# approval_mode: "review_queue"
|
||
#
|
||
# # Default review period for approved/mandatory items (months)
|
||
# review_period_months: 6
|
||
#
|
||
# # Notify km_admins about new pending items
|
||
# notify_on_new_items: true
|
||
#
|
||
# # --- V1 Context Engineering ---
|
||
#
|
||
# sources:
|
||
# claude_local_md:
|
||
# enabled: true
|
||
# confidence_base: 0.50
|
||
# session_transcripts:
|
||
# enabled: true
|
||
# confidence_base: 0.60
|
||
# max_turns_per_session: 100
|
||
# detection_types:
|
||
# - correction
|
||
# - confirmation
|
||
# - unprompted_definition
|
||
#
|
||
# extraction:
|
||
# model: "claude-haiku-4-5-20251001"
|
||
# sensitivity_check: true
|
||
# contradiction_check: true
|
||
#
|
||
# confidence:
|
||
# # Base score per extraction source. Key format: "source_type" or "source_type.detection_type"
|
||
# base:
|
||
# user_verification.correction: 0.90
|
||
# user_verification.unprompted_definition: 0.90
|
||
# user_verification.confirmation: 0.60
|
||
# admin_mandate: 1.00
|
||
# claude_local_md: 0.50
|
||
# session_transcript: 0.50
|
||
# # Per-key modifier step sizes applied to base when optional signals are present.
|
||
# modifiers:
|
||
# user_verification.correction:
|
||
# additional_verifiers: 0.05 # per extra unique verifier
|
||
# user_verification.unprompted_definition:
|
||
# additional_verifiers: 0.05
|
||
# user_verification.confirmation:
|
||
# admin_confirmed: 0.20
|
||
# session_transcript:
|
||
# user_confirmed_in_session: 0.20
|
||
# # Confidence decay applied to items as they age.
|
||
# decay:
|
||
# mode: exponential # linear | exponential
|
||
# half_life_months: 12 # used when mode=exponential
|
||
# decay_rate_monthly: 0.02 # used when mode=linear
|
||
# floor:
|
||
# admin_mandate: 0.50 # admin policies don't silently decay to zero
|
||
# user_verification: 0.40 # user-verified facts never fall below 0.40
|
||
# default: 0.0
|
||
#
|
||
# contradiction_detection:
|
||
# enabled: true
|
||
# max_candidates: 10
|
||
#
|
||
# entity_resolution:
|
||
# enabled: true
|
||
# entities:
|
||
# metrics: ["churn", "MRR", "ARR", "NPS", "CAC", "LTV"]
|
||
# products: ["Platform", "API", "Dashboard"]
|
||
#
|
||
# domain_owners:
|
||
# finance: ["cfo@company.com"]
|
||
# engineering: ["cto@company.com"]
|
||
# product: ["pm@company.com"]
|
||
#
|
||
# domains:
|
||
# - finance
|
||
# - engineering
|
||
# - product
|
||
# - data
|
||
# - operations
|
||
# - infrastructure
|
||
|
||
# --- User groups for audience targeting (optional) ---
|
||
# Used with Corporate Memory governance to target mandatory knowledge to specific groups.
|
||
#
|
||
# groups:
|
||
# finance:
|
||
# label: "Finance & Analytics"
|
||
# members: ["analyst1@company.com", "analyst2@company.com"]
|
||
# engineering:
|
||
# label: "Engineering"
|
||
# members: ["dev1@company.com", "dev2@company.com"]
|
||
|
||
# --- User display and permissions ---
|
||
# Corporate Memory avatars + optional km_admin flag for governance.
|
||
# users:
|
||
# admin@company.com:
|
||
# display_name: "Admin User"
|
||
# km_admin: true # Corporate Memory admin (approve/mandate knowledge)
|
||
# analyst@company.com:
|
||
# display_name: "Analyst User"
|
||
users: {}
|
||
|
||
# --- Username mapping (webapp email -> server username, only if different) ---
|
||
username_mapping: {}
|
||
|
||
# --- Optional datasets (sync settings UI) ---
|
||
datasets: {}
|
||
|
||
# --- Data catalog ---
|
||
catalog:
|
||
categories: {}
|
||
order: []
|
||
|
||
# --- Data profiler (optional) ---
|
||
# profiler:
|
||
# sample_size: 500000 # If table > this, sample this many rows; otherwise use all
|
||
# max_categorical_distinct: 50 # Treat as categorical if unique <= this
|
||
# top_values_limit: 10 # Top values per categorical column
|
||
# histogram_bins: 15 # Bins in histogram visualizations
|
||
# sample_rows_limit: 5 # Sample rows to show in UI "Sample" tab
|
||
# alert_high_missing_pct: 30.0 # Alert threshold for high missing %
|
||
# alert_missing_pct: 5.0 # Alert threshold for missing %
|
||
# alert_imbalance_pct: 60.0 # Alert threshold for imbalance %
|
||
# alert_high_cardinality: 50 # Alert threshold for high cardinality columns
|
||
|
||
# --- Remote query (optional) ---
|
||
# Settings for remote BigQuery queries via `python -m src.remote_query`.
|
||
# Used when tables have query_mode: "remote" in data_description.md.
|
||
# remote_query:
|
||
# timeout_seconds: 300 # BQ + DuckDB query timeout
|
||
# max_result_rows: 100000 # Max rows in final output
|
||
# max_bq_registration_rows: 500000 # Max rows per --register-bq sub-query
|
||
# default_format: "table" # Default output format
|
||
# output_dir: "/tmp/remote_query" # Directory for Parquet/CSV exports
|
||
|
||
# --- v2 API knobs (optional) ---
|
||
# Controls for the /api/v2/{catalog,schema,sample,scan,scan/estimate} endpoints.
|
||
# All values are optional — the defaults shown below are applied if keys are absent.
|
||
#
|
||
# api:
|
||
# # --- Scan / fetch limits ---
|
||
# scan:
|
||
# max_limit: 10000000 # Hard row cap per /api/v2/scan request (default: 10 M)
|
||
# max_result_bytes: 2147483648 # Hard byte cap on Arrow stream response: 2 GB (default)
|
||
# # If exceeded, partial result returned with X-Agnes-Truncated header.
|
||
# max_concurrent_per_user: 5 # In-flight /api/v2/scan requests allowed per user (default: 5)
|
||
# # Note: quota is process-local; N replicas → effective N× cap.
|
||
# max_daily_bytes_per_user: 53687091200 # Per-user daily byte quota: 50 GB (default). Resets at UTC midnight.
|
||
# bq_cost_per_tb_usd: 5.00 # Cost rate shown in /api/v2/scan/estimate response (default: $5/TB)
|
||
# request_timeout_seconds: 300 # Server-side timeout for a single scan request (default: 300 s)
|
||
# # --- Discovery cache TTLs ---
|
||
# catalog_cache_ttl_seconds: 300 # /api/v2/catalog response cache lifetime (default: 5 min)
|
||
# schema_cache_ttl_seconds: 3600 # /api/v2/schema/{table_id} cache lifetime (default: 1 h)
|
||
# sample_cache_ttl_seconds: 3600 # /api/v2/sample/{table_id} cache lifetime (default: 1 h)
|
||
# # Admins can force-refresh via POST /api/v2/sample/{id}?refresh=true
|
||
|
||
# --- Materialize concurrency safety (optional) ---
|
||
# Concurrency safety net for the materialize path (BQ + Keboola). When
|
||
# two materialize attempts race for the same table_id, the second one
|
||
# raises MaterializeInFlightError and skips. The lock is held in a
|
||
# .parquet.lock sibling file; if a holder process is hard-killed before
|
||
# kernel-level flock release, the next attempt reclaims the lock once
|
||
# the file's mtime is older than this TTL.
|
||
#
|
||
# Default 86400 (24h) is generous on purpose — anything shorter risks
|
||
# a long-running COPY being interrupted by its own scheduler successor.
|
||
# Lower it only if you know your materialize never exceeds the new
|
||
# value AND your host has a habit of hard-killing processes.
|
||
# Min 60 (1 minute), max 604800 (7 days). Configurable via /admin/server-config UI.
|
||
materialize:
|
||
lock_ttl_seconds: 86400
|