* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers.

This is the first wave of #88. The remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass, which is easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten: original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements; 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*: hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted: hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries). CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest.
   Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution.
2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix.
3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).
4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated.
5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes; the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114: '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'.
2. config/.env.template:68: stale path 'scripts/grpn/agnes-tls-rotate.sh' in operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path.
3. docs/padak-security.md:188: phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. Trivial wording fix.
Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/Caddyfile/.toml/.template/.example/.env* with the full token set (groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding CHANGELOG.md historical entries.

* fix(oss): #94 round-4: QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents.
2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:

1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set, which is safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with <redacted-ip>, preserving private/loopback/IAP ranges.
# AI Data Analyst - Instance Configuration
# ==========================================
# This is the main configuration file for your instance.
# Copy to instance.yaml and fill in your values.
#
# SECRET VALUES use ${ENV_VAR} syntax - actual values go in .env file.
# Non-secret values are set directly here.

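The `${ENV_VAR}` convention described in the header can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual loader: the function names `expand_placeholders` and `expand_config` are hypothetical, and raising an error for an unset variable is a choice made here for the sketch.

```python
import re

# Matches ${NAME}, where NAME looks like a typical environment-variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env):
    """Replace every ${NAME} in `value` with env[NAME].

    Assumption of this sketch: a missing variable is an error. A real loader
    might substitute an empty string instead.
    """
    def _lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_lookup, value)


def expand_config(node, env):
    """Walk nested dicts and lists, expanding placeholders in every string."""
    if isinstance(node, dict):
        return {key: expand_config(val, env) for key, val in node.items()}
    if isinstance(node, list):
        return [expand_config(item, env) for item in node]
    if isinstance(node, str):
        return expand_placeholders(node, env)
    return node  # numbers, booleans, None pass through unchanged


# Usage: secrets come from the environment, everything else stays as written.
config = {
    "auth": {"webapp_secret_key": "${WEBAPP_SECRET_KEY}"},
    "email": {"smtp_port": 587},
}
expanded = expand_config(config, {"WEBAPP_SECRET_KEY": "s3cret"})
```

In practice `env` would typically be `os.environ` after the `.env` file has been loaded; a plain dict is used here so the sketch stays self-contained.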
# --- Instance branding ---
instance:
  name: "AI Data Analyst"
  subtitle: "Your Organization"
  copyright: "Your Organization"
  # logo_svg: Full <svg> element for header logo (optional, default: Keboola logo)
  # Example: '<svg width="120" height="30" viewBox="0 0 100 30" xmlns="http://www.w3.org/2000/svg"><text y="22" font-size="24" fill="#333">Logo</text></svg>'

# --- Server ---
server:
  hostname: ""                        # DNS name (e.g., "data.acme.com")
  host: ""                            # IP address
  app_dir: "/opt/data-analyst"        # Installation directory
  # --- Client setup (shown in "Get Started" on dashboard) ---
  # ssh_alias: "data-analyst"                # SSH config Host alias for analysts (default: "data-analyst")
  # ssh_key: "~/.ssh/data_analyst_server"    # SSH key path for analysts (default: "~/.ssh/data_analyst_server")
  # project_dir: "data-analyst"              # Local project folder name (default: "data-analyst")

# --- Admin users ---
# Manage the server, own data files, get unlimited resource limits.
# SSH keys are used by server/setup.sh during provisioning.
admins:
  - username: "admin"
    ssh_public_key: "ssh-ed25519 AAAA..."

# --- Deployment ---
deployment:
  method: "manual"                    # manual | github_actions
  repo_url: ""                        # e.g., "git@github.com:acme/ai-data-analyst.git"
  branch: "main"

# --- Authentication ---
# At minimum, set allowed_domain and webapp_secret_key.
# Email magic link auth works out of the box (no external service needed).
# Google OAuth is optional - add credentials to enable it.
auth:
  allowed_domain: ""                  # Email domain(s) for login, comma-separated (e.g., "acme.com" or "acme.com, partner.org")
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"
  # Optional: Google OAuth (if not set, only email magic link is available)
  google_client_id: "${GOOGLE_CLIENT_ID}"
  google_client_secret: "${GOOGLE_CLIENT_SECRET}"

  # --- Webapp username shaping ---
  #
  # By default, a user's OS account is derived from their full email:
  #   e.psimecek@acme.com -> e_psimecek_acme_com
  #
  # Two options let you control this:
  #
  # username_strip_domain: true
  #   Use only the local part of the email (before @).
  #   Safe when allowed_domain ensures all users share a single domain.
  #   e.psimecek@acme.com -> e_psimecek
  #   Keeps usernames short and readable.
  #
  # username_prefix: "myapp_"
  #   Prepend a fixed string to every webapp-created account name.
  #   Necessary when an external identity system (GCP OS Login, LDAP, SAML)
  #   already creates OS accounts in /home/ using the same naming scheme.
  #   Without a prefix, the webapp sees those existing OS accounts and refuses
  #   to register new analyst accounts ("already in use by a system account").
  #   With prefix "myapp_" and strip_domain true:
  #   e.psimecek@acme.com -> myapp_e_psimecek
  #   Linux enforces a 32-character username limit. Keep the prefix short.
  #   Changing or removing either option later will invalidate all existing
  #   analyst accounts. Use username_mapping (top-level) to bridge legacy accounts.
  #
  # username_strip_domain: false
  # username_prefix: ""
  # disabled_providers:               # Hide auth methods from login page
  #   - "email"                       # Disable email magic link (use when Google OAuth is configured)

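The username shaping rules in the comments above can be sketched in Python as follows. This is an illustration inferred from the three examples in the comments, not the webapp's actual code: the function name `derive_username`, the lowercasing step, and the rule that every character outside `[a-z0-9]` becomes `_` are assumptions.

```python
import re

LINUX_USERNAME_MAX = 32  # limit stated in the config comments


def derive_username(email, strip_domain=False, prefix=""):
    """Derive an OS account name from an email address.

    strip_domain and prefix mirror the username_strip_domain and
    username_prefix options. The exact character mapping is an assumption
    inferred from the documented examples.
    """
    local, _, domain = email.strip().lower().partition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {email!r}")

    # Keep only the local part when strip_domain is on, else local + domain.
    base = local if strip_domain else f"{local}_{domain}"

    # Assumption: every character outside [a-z0-9] becomes an underscore.
    username = prefix + re.sub(r"[^a-z0-9]", "_", base)

    if len(username) > LINUX_USERNAME_MAX:
        raise ValueError(
            f"{username!r} is {len(username)} characters; "
            f"the limit is {LINUX_USERNAME_MAX}"
        )
    return username


# The three cases documented in the comments:
default_name = derive_username("e.psimecek@acme.com")
stripped_name = derive_username("e.psimecek@acme.com", strip_domain=True)
prefixed_name = derive_username(
    "e.psimecek@acme.com", strip_domain=True, prefix="myapp_"
)
```

Because the prefix counts toward the 32-character limit, a long prefix combined with full-email usernames can push otherwise valid addresses over the limit, which is one reason the comments recommend keeping the prefix short.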
# --- Theme (optional) ---
# Customize colors, fonts, and shape to match your brand.
# All values are optional - defaults provide a clean blue theme.
# See docs/theme-reference.html for a visual guide.
theme:
  # primary: "#0073D1"                        # Main brand color (buttons, links, accents)
  # primary_dark: "#005BA3"                   # Hover/active state of primary
  # primary_light: "rgba(0, 115, 209, 0.1)"   # Light tint backgrounds
  # text_primary: "#1A253C"                   # Main text color
  # text_secondary: "#6B7280"                 # Muted/secondary text
  # background: "#F5F7FA"                     # Page background
  # surface: "#FFFFFF"                        # Card/panel background
  # border: "#E5E7EB"                         # Borders and dividers
  # font_primary: "'Inter', system-ui, sans-serif"
  # font_url: "https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap"
  # radius: "6px"                             # Border radius (cards, buttons, inputs)
  # success: "#10B77F"
  # warning: "#F59F0A"
  # error: "#EA580C"

# --- Data source ---
data_source:
  type: "keboola"                     # keboola | bigquery | local
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: ""                     # e.g., "https://connection.keboola.com"
    project_id: ""
  bigquery:
    project: "${BIGQUERY_PROJECT}"    # GCP project for job execution/billing
    location: "${BIGQUERY_LOCATION}"  # BigQuery location (e.g., "us-central1", "US")
    # Uses ADC (Application Default Credentials) - VM service account on GCP
    # Data can live in a different project -- use fully-qualified table IDs in data_description.md

# --- OpenMetadata catalog (optional) ---
# Enriches table and column metadata from OpenMetadata REST API.
# If not configured, app works normally without catalog enrichment.
# openmetadata:
#   url: "https://your-catalog.example.com"
#   token: "${OPENMETADATA_TOKEN}"    # JWT bearer token
#   cache_ttl_seconds: 3600           # Cache TTL in seconds

# --- Email delivery (optional, for magic link auth) ---
# Without SMTP, magic links are shown directly in browser (development mode).
# For production, configure any SMTP relay (Gmail, Mailgun, SendGrid SMTP, etc.)
email:
  from_address: "noreply@example.com"
  from_name: "AI Data Analyst"
  smtp_host: "${SMTP_HOST}"           # e.g., "smtp.gmail.com"
  smtp_port: 587                      # 587 for STARTTLS, 465 for SSL
  smtp_user: "${SMTP_USER}"
  smtp_password: "${SMTP_PASSWORD}"

# --- Desktop app (optional) ---
desktop:
  jwt_issuer: "data-analyst"
  jwt_secret: "${DESKTOP_JWT_SECRET}"
  url_scheme: "data-analyst"

# --- Telegram notifications (optional) ---
telegram:
  bot_token: "${TELEGRAM_BOT_TOKEN}"
  bot_username: ""
  domain_suffix: ""

# --- Jira integration (optional) ---
jira:
  domain: ""
  email: ""
  api_token: "${JIRA_API_TOKEN}"
  webhook_secret: "${JIRA_WEBHOOK_SECRET}"
  sla_email: ""
  sla_api_token: "${JIRA_SLA_API_TOKEN}"
  cloud_id: ""

# --- Corporate Memory AI (optional) ---
# Extracts shared knowledge from team members' CLAUDE.local.md files.
# Provider: "anthropic" (direct API) or "openai_compat" (LiteLLM, OpenRouter, Azure, etc.)
ai:
  provider: "anthropic"                     # or "openai_compat"
  api_key: "${ANTHROPIC_API_KEY}"           # or "${LLM_API_KEY}" for proxy
  # base_url: "https://litellm.example.com" # required for openai_compat
  model: "claude-haiku-4-5-20251001"        # any model available on your provider
  # --- Structured output quality control ---
  # AI models can return JSON in three ways, each with different reliability:
  #
  # Layer 1 - "json_schema" (best):
  #   The provider enforces an exact schema. Every field, type, and structure
  #   is guaranteed. Available on: Anthropic, OpenAI, Claude via LiteLLM.
  #
  # Layer 2 - "json_object" (good):
  #   The provider guarantees valid JSON, but does not enforce a specific schema.
  #   Fields may be missing or have wrong types. Available on most providers.
  #
  # Layer 3 - "prompt" (acceptable):
  #   The AI is asked to respond in JSON via instructions in the prompt.
  #   No technical enforcement -- the model may still return invalid JSON.
  #   Works everywhere, but least reliable.
  #
  # "strict" = only Layer 1. Fail if provider doesn't support json_schema.
  #            Use when data quality is non-negotiable.
  # "json"   = Layer 1, fall back to Layer 2. No prompt-based fallback.
  #            Good balance of quality and compatibility.
  # "auto"   = All three layers as progressive fallback. Maximum compatibility.
  #            Use when you'd rather get imperfect data than no data.
  structured_output: "auto"

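The `structured_output` modes above describe a progressive fallback, which can be sketched in Python like this. The mode-to-layer mapping comes directly from the comments; everything else (`LayerUnsupported`, `request_structured`, the `call_layer` callback) is a hypothetical shape chosen for the sketch and not the project's API.

```python
# Layers each mode may try, in order, as described in the config comments.
LAYERS_BY_MODE = {
    "strict": ["json_schema"],
    "json": ["json_schema", "json_object"],
    "auto": ["json_schema", "json_object", "prompt"],
}


class LayerUnsupported(Exception):
    """Hypothetical signal that a provider cannot serve the requested layer."""


def layers_for(mode):
    """Return the ordered list of layers permitted by a structured_output mode."""
    if mode not in LAYERS_BY_MODE:
        raise ValueError(f"unknown structured_output mode: {mode!r}")
    return list(LAYERS_BY_MODE[mode])


def request_structured(mode, call_layer):
    """Try each permitted layer in order and return (layer, result).

    `call_layer(layer)` is a caller-supplied function that performs the
    request and raises LayerUnsupported when the layer is unavailable.
    """
    errors = []
    for layer in layers_for(mode):
        try:
            return layer, call_layer(layer)
        except LayerUnsupported as exc:
            errors.append(f"{layer}: {exc}")
    raise RuntimeError("no permitted layer succeeded (" + "; ".join(errors) + ")")


# Usage with a stand-in provider that lacks json_schema support.
def fake_provider(layer):
    if layer == "json_schema":
        raise LayerUnsupported("schema enforcement not available")
    return {"served_by": layer}


auto_result = request_structured("auto", fake_provider)
```

With the same stand-in provider, `"strict"` would raise instead of falling back, which matches the comment that strict mode fails when the provider does not support `json_schema`.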
# Legacy format (still supported, equivalent to provider: "anthropic"):
# ai:
#   anthropic_api_key: "${ANTHROPIC_API_KEY}"

# Examples:
# --- LiteLLM proxy ---
# ai:
#   provider: "openai_compat"
#   base_url: "https://litellm.example.com"
#   api_key: "${LLM_API_KEY}"
#   model: "claude-haiku-4-5-20251001"
#   structured_output: "strict"
#
# --- OpenRouter ---
# ai:
#   provider: "openai_compat"
#   base_url: "https://openrouter.ai/api/v1"
#   api_key: "${OPENROUTER_API_KEY}"
#   model: "anthropic/claude-3-haiku"
#   structured_output: "auto"

# --- Corporate Memory governance (optional) ---
# Controls how AI-extracted knowledge is reviewed and distributed.
# If not present, system operates in legacy mode (democratic wiki, no admin review).
#
# corporate_memory:
#   # How knowledge reaches users:
#   #   "mandatory_only" - admin controls everything, no user voting
#   #   "admin_curated"  - admin controls, users vote as feedback signal
#   #   "hybrid"         - mandatory from admin + optional from user voting (default)
#   distribution_mode: "hybrid"
#
#   # How new AI-extracted items enter the system:
#   #   "review_queue" - nothing published without admin approval (default)
#   #   "auto_publish" - items go live immediately, admin intervenes retroactively
#   #   "threshold"    - high-confidence auto-publish, low-confidence to review queue
#   approval_mode: "review_queue"
#
#   # Default review period for approved/mandatory items (months)
#   review_period_months: 6
#
#   # Notify km_admins about new pending items
#   notify_on_new_items: true

# --- User groups for audience targeting (optional) ---
# Used with Corporate Memory governance to target mandatory knowledge to specific groups.
#
# groups:
#   finance:
#     label: "Finance & Analytics"
#     members: ["analyst1@company.com", "analyst2@company.com"]
#   engineering:
#     label: "Engineering"
#     members: ["dev1@company.com", "dev2@company.com"]

# --- User display and permissions ---
# Corporate Memory avatars + optional km_admin flag for governance.
# users:
#   admin@company.com:
#     display_name: "Admin User"
#     km_admin: true                  # Corporate Memory admin (approve/mandate knowledge)
#   analyst@company.com:
#     display_name: "Analyst User"
users: {}

# --- Username mapping (webapp email -> server username, only if different) ---
username_mapping: {}

# --- Optional datasets (sync settings UI) ---
datasets: {}

# --- Data catalog ---
catalog:
  categories: {}
  order: []

# --- Data profiler (optional) ---
# profiler:
#   sample_size: 500000               # If table > this, sample this many rows; otherwise use all
#   max_categorical_distinct: 50      # Treat as categorical if unique <= this
#   top_values_limit: 10              # Top values per categorical column
#   histogram_bins: 15                # Bins in histogram visualizations
#   sample_rows_limit: 5              # Sample rows to show in UI "Sample" tab
#   alert_high_missing_pct: 30.0      # Alert threshold for high missing %
#   alert_missing_pct: 5.0            # Alert threshold for missing %
#   alert_imbalance_pct: 60.0         # Alert threshold for imbalance %
#   alert_high_cardinality: 50        # Alert threshold for high cardinality columns

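How the profiler thresholds above might be applied can be sketched in Python as follows. The sampling and categorical rules follow the inline comments; the function names, the alert labels, and the choice of `>=` for the alert comparisons are assumptions of this sketch rather than the profiler's documented behavior.

```python
def rows_to_profile(total_rows, sample_size=500_000):
    """Number of rows to read: sample large tables, use all rows otherwise."""
    return sample_size if total_rows > sample_size else total_rows


def is_categorical(distinct_values, max_categorical_distinct=50):
    """A column is treated as categorical if its unique count is <= the limit."""
    return distinct_values <= max_categorical_distinct


def missing_alert(missing_pct, alert_missing_pct=5.0, alert_high_missing_pct=30.0):
    """Classify a column's missing-value percentage.

    Returns "high_missing", "missing", or None. Both the labels and the
    inclusive comparison are assumptions made for this sketch.
    """
    if missing_pct >= alert_high_missing_pct:
        return "high_missing"
    if missing_pct >= alert_missing_pct:
        return "missing"
    return None


# Usage with the default values shown in the commented-out config.
big_table_rows = rows_to_profile(2_000_000)
small_table_rows = rows_to_profile(12_000)
```

Checking the higher threshold first keeps the two alert levels mutually exclusive, so a column with 45% missing values reports only the more severe level.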
# --- Remote query (optional) ---
# Settings for remote BigQuery queries via `python -m src.remote_query`.
# Used when tables have query_mode: "remote" in data_description.md.
# remote_query:
#   timeout_seconds: 300              # BQ + DuckDB query timeout
#   max_result_rows: 100000           # Max rows in final output
#   max_bq_registration_rows: 500000  # Max rows per --register-bq sub-query
#   default_format: "table"           # Default output format
#   output_dir: "/tmp/remote_query"   # Directory for Parquet/CSV exports