agnes-the-ai-analyst

Author	SHA1	Message	Date
Petr	5fc9526627	Phase 2: Replace demo YAML metrics with OpenMetadata catalog data - Add get_metric_by_fqn() to OpenMetadataClient - Add get_metrics() to CatalogEnricher with TTL caching - Implement _parse_om_metric() to extract category/grain from OpenMetadata tags - Implement _load_metrics_from_catalog() to fetch and categorize metrics - Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON - Add /api/catalog/metrics/<fqn> endpoint for metric detail modal - Update _load_metrics_data() to prefer catalog over YAML fallback - Update metric_modal.js to route catalog:{fqn} to catalog API endpoint - Delete 10 demo YAML files from docs/metrics/ - Replace metric tests with new unit tests for catalog parsing functions (19 tests) Catalog metrics provide single source of truth vs maintaining demo YAML files. UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.	2026-03-12 15:10:42 +01:00
Petr	be58e63394	Move profiler config to instance.yaml (KISS principle) Instead of hardcoded Python constants, load profiler settings from config: - instance.yaml: profiler section with all parameters - Defaults: fallback to sensible defaults if config not found - Centralized: all profiler tuning in one place, no code changes needed	2026-03-12 14:45:14 +01:00
Petr	c25278538c	Simplify profiler config: use single SAMPLE_SIZE parameter (KISS) Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE: - If table > SAMPLE_SIZE: sample that many rows - Otherwise: use all rows Cleaner, easier to configure.	2026-03-12 14:43:23 +01:00
Petr	e2d3afade3	Log when catalog enrichment (tags/tier) are found	2026-03-12 14:34:45 +01:00
Petr	14d75d6229	Fix: correct OpenMetadata catalog URL path and add debug logging - Change catalog URL from /explore/{fqn} to /table/{fqn} - Add debug logging to see parsed tags, owners, tier from API response	2026-03-12 14:34:12 +01:00
Petr	de66f6dd55	Fix: include partition fields in TableConfig for catalog enrichment Pass partition_by, partition_granularity, partition_column_type, and incremental_window_days from YAML to TableConfig to avoid validation errors when sync_strategy='partitioned'	2026-03-12 14:29:05 +01:00
Petr	2d03a9b557	Display OpenMetadata catalog enrichment in table profile overview - API endpoint /api/catalog/profile/ enriches response with catalog metadata (tier, owners, tags, url) - renderOverview() template function displays 'Data Catalog' section with tier, owners, tags, and catalog link - Graceful degradation: section only shown if catalog enrichment available	2026-03-12 14:28:02 +01:00
Petr	a7faf70cb3	Allow self-signed certificates for OpenMetadata catalog (internal networks) OpenMetadata catalog uses self-signed HTTPS certificate on internal networks. Disable SSL verification in httpx client and suppress related warnings.	2026-03-12 14:12:44 +01:00
Petr	c5c24cb45b	Implement OpenMetadata catalog integration (Phase 1) Add OpenMetadata REST API connector and enricher to merge table/column metadata from OpenMetadata catalog at sync and query time. Changes: - connectors/openmetadata/client.py: HTTP client for OM API - connectors/openmetadata/enricher.py: Data enrichment with TTL cache - tests/test_openmetadata_*: Unit tests for client and enricher - src/config.py: Add catalog_fqn field to TableConfig - src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md) - webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url - config/instance.yaml.example: Document openmetadata section Features: - FQN auto-derivation: bigquery.{table.id} - TTL cache (default 1h) to avoid repeated API calls - Graceful degradation: disabled if token missing, silent on HTTP errors - Column description priority: catalog > BQ API > (none) - Table description priority: catalog > data_description.md	2026-03-12 14:07:13 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	d2e83ce9d0	Set DuckDB memory_limit=4GB in profiler to prevent OOM Server has 8GB RAM with other services running. DuckDB defaults to using all available memory, causing OOM killer when profiling large tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).	2026-03-12 11:06:49 +01:00
Petr	85c87ec375	Pass explicit bqstorage_client to to_arrow_iterable() for Storage API Without explicit bqstorage_client parameter, to_arrow_iterable() silently falls back to REST API pagination (~5K rows/sec). With explicit client, it uses parallel gRPC streams via BQ Storage API (~300K rows/sec). No temp table materialization - BQ already writes query results to an internal temp table automatically. We just tell the reader to use the fast gRPC path instead of slow HTTP pagination.	2026-03-12 10:51:44 +01:00
Petr	4f74543a12	Fix streaming: use RowIterator.to_arrow_iterable() not QueryJob QueryJob only has to_arrow(), not to_arrow_iterable(). Must call query_job.result() first to get RowIterator, which has the streaming to_arrow_iterable() method.	2026-03-11 20:15:35 +01:00
Petr	ee70da86c3	Stream BQ results to Parquet instead of loading into memory Replace to_arrow() (loads entire result into RAM) with to_arrow_iterable() (streams RecordBatches). Each batch is written directly to disk via ParquetWriter - constant memory regardless of table size. Prevents OOM on 8GB server for multi-million row tables.	2026-03-11 20:13:03 +01:00
Petr	a191ede28c	Add columns and row_filter to TableConfig for selective BQ export Propagate column selection and row filtering from data_description.md through the BigQuery adapter to the BQ client. This enables exporting only needed columns and applying date range filters at the SQL level, critical for large DataView tables (e.g., 412-col unit_economics).	2026-03-11 19:37:04 +01:00
Petr	468f56092b	Add standalone DuckDB-based data profiler script Zero-dependency profiler for Parquet/CSV files producing JSON profiles with column statistics, histograms, alerts, and sample data. Supports single files, directories, composite primary keys, and optional HTML report generation.	2026-03-11 15:12:04 +01:00
Petr	c77a6f6c2e	Fix clipped annotation badges in theme-reference.html Remove overflow:hidden from mockup containers and reposition surface/text_primary badges that were cut off at edges.	2026-03-11 14:09:04 +01:00
Petr	e26e47a071	Add BQ Storage API fallback to REST when readsessions permission missing	2026-03-11 13:59:09 +01:00
Petr	d438438e33	Add configurable white-label theming via instance.yaml Extend theming from 3 CSS variables (primary colors only) to 14 configurable properties covering colors, fonts, borders, and shape. All values are optional with sensible defaults. - New _theme.html include replaces duplicated inline injection - Wire theme include into all 7 templates (base, login, dashboard, catalog, admin_tables, activity_center, corporate_memory) - Conditional font loading: skip default Inter when custom font_url set - Config.theme_overrides() classmethod generates CSS variable dict - Visual theme-reference.html guide for instance configurators - Document all theme keys in instance.yaml.example	2026-03-11 13:58:58 +01:00
Petr	758910463b	Add BigQuery data source adapter BigQuery connector that syncs BQ tables to local Parquet files via PyArrow (no CSV intermediate step). Supports full refresh, timestamp-based incremental (via incremental_column), and partition-based sync strategies. - connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized queries, metadata cache, cross-project support (job project != data project) - connectors/bigquery/adapter.py: DataSource implementation with merge/dedup - src/config.py: Add incremental_column field to TableConfig - 72 unit tests (mocked, no GCP SDK required)	2026-03-11 13:56:12 +01:00
Petr	eb5264b903	Make header logo configurable via instance.yaml logo_svg Move hardcoded Keboola SVG logo from 4 templates into config. Templates now use {{ config.LOGO_SVG | safe }}. Default falls back to Keboola logo when not configured.	2026-03-11 13:08:26 +01:00
Petr	d3c7f7feea	Fix activity-center 500 error: provide default data structure The activity_center view was passing an empty dict but the template expected nested keys (executive_summary, maturity_roadmap, etc). Added _build_activity_data() that returns properly structured defaults.	2026-03-11 12:59:40 +01:00
Petr	91a05a2c2b	Add auth.disabled_providers config to skip auth providers Reads disabled_providers list from instance.yaml auth section. Listed providers are skipped during auto-discovery.	2026-03-11 12:54:23 +01:00
Petr	954aa0f17e	Add theme color support via instance.yaml Allow instances to override primary CSS color variables through theme section in instance.yaml config.	2026-03-11 00:42:10 +01:00
Petr	e35e602c59	Update CLAUDE.md with metrics, table registry, password auth Add docs/metrics/ to project structure, Business Metrics and Table Registry patterns to implementation details, password auth provider to extensibility section, fix sync command for returning users.	2026-03-10 23:05:03 +01:00
Petr	ad3b94c168	Add Business Metrics card to dashboard	2026-03-10 22:52:48 +01:00
Petr	34fde746e7	Remove hardcoded Jira and Telemetry cards from dashboard	2026-03-10 22:50:22 +01:00
Petr	49559fba1b	Remove hardcoded Jira and Telemetry cards from catalog These Keboola-specific data source cards don't belong in the OSS repo. The catalog now shows only dynamic content: Core Business Data (from data_description.md) and Business Metrics (from docs/metrics/*.yml). Also update auto-install.md with Business Metrics documentation, pipeline diagram, and expanded checklist.	2026-03-10 22:48:07 +01:00
Petr	5a84473213	Add dynamic Business Metrics with sample e-commerce definitions Replace hardcoded Keboola-specific metrics card in Data Catalog with dynamic Jinja template that renders whatever metric YAMLs exist in docs/metrics/. Add 10 sample e-commerce metric definitions across 4 categories (revenue, customers, marketing, support) that align with the sample data generator tables. Key changes: - MetricParser: new category colors + dynamic sql_* field discovery - _load_metrics_data(): scans docs/metrics/*/*.yml with prod fallback - catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop - metric_modal.js: regex-based category class removal, new categories - 21 tests validating YAML schema, parser, and loader	2026-03-10 22:38:44 +01:00
Petr	f685dc357f	Document Data Catalog and Profiler pipeline in auto-install guide - Add architecture diagram showing data flow from instance config through profiler to webapp - Explain folder_mapping dual purpose (catalog categories + file paths) - Add Step 6c for running the profiler - Document foreign_keys for relationship diagrams - Explain profiles.json fallback for catalog header stats - Expand checklist with profiler verification steps	2026-03-10 22:14:45 +01:00
Petr	28543d98b1	Fix profiler file_size and catalog stats fallback - Profiler computes file_size_mb from actual parquet files when sync_state.json is absent (sample data / no-sync deployments) - Catalog header falls back to profiles.json for aggregate stats (tables count, total rows) when sync_state.json is missing	2026-03-10 22:12:46 +01:00
Petr	1be0dc5300	Add flat parquet fallback to profiler get_parquet_path Tries subfolder path first (Keboola-style layout), then falls back to flat path for simple deployments like sample data.	2026-03-10 22:09:14 +01:00
Petr	7f61ae8772	Update auto-install docs with Data Catalog setup - Split Step 6 into 6a (Generate Parquet) and 6b (Configure Data Catalog) - Document data_description.md + instance.yaml catalog categories - Uncomment data_description.md symlink in Step 3c - Add Data Catalog verification to Step 6 checklist	2026-03-10 22:00:28 +01:00
Petr	302494b632	Add --format parquet using project's ParquetManager Generator now supports --format {csv,parquet,both}. Parquet mode uses src.parquet_manager.ParquetManager for snappy compression, proper column types (DATE, TIMESTAMP, DOUBLE), and metadata. No more ad-hoc pandas conversion needed on the server.	2026-03-10 21:46:20 +01:00
Petr	44bf43535b	Add sample data generator with 9 e-commerce tables Synthetic data generator for demo/testing without real data adapter: - 9 tables: customers, products, campaigns, web_sessions, web_leads, orders, order_items, payments, support_tickets - 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB) - Realistic patterns: seasonality, Pareto customer distribution, segment-based behavior, referential integrity - Deterministic output via --seed parameter Also: docs/sample-data.md, updated auto-install.md with Step 6, updated CLAUDE.md (email auth provider, dual-repo architecture)	2026-03-10 12:31:14 +01:00
Petr	879bc6c44f	docs	2026-03-10 11:43:11 +01:00
Petr	495940d6b8	Rewrite auto-install guide with dual-repo architecture Document the full end-to-end workflow: OSS repo (code) + private instance repo (config/secrets). Covers SSH key isolation per repo, symlink bridging, and ongoing deployment workflow.	2026-03-10 11:38:41 +01:00
Petr	1ac868d787	Improve setup instructions for robustness - Check for existing SSH config entry before overwriting - Use --no-perms --no-group in rsync (fixes macOS permission errors) - Explicit mkdir instead of brace expansion (Claude Code compatibility) - Gracefully handle missing server directories (empty server is OK) - Conditional steps for setup_views.sh and CLAUDE.md template	2026-03-10 11:29:31 +01:00
Petr	fde1d6fc01	Move Claude Code setup to dashboard, remove step 5 from onboarding - Get Started page now has 4 steps (folder, SSH key, pubkey, register) - After account creation, dashboard shows prominent "Set up your local environment" CTA with claude command and Copy Setup Instructions - CTA only visible when user hasn't synced yet (last_sync is empty) - Bottom banner demoted to subtle secondary style for returning users	2026-03-10 11:18:56 +01:00
Petr	45454ab86a	Redesign onboarding: compact single-screen layout with terminal block - Merge steps 1-3 into a single dark terminal block with copy buttons - Inline registration form with single-row layout for step 4 - Compact step 5 with Claude Code command and copy button on one line - Full-width layout (960px) instead of narrow 640px column - Everything fits on one screen without scrolling	2026-03-10 11:10:19 +01:00
Petr	9c4208bb89	Unify onboarding into single-column stepper with inline registration Merge the two-column layout (setup steps + registration form) into one unified flow. Step 4 now contains the registration form inline, creating a natural top-to-bottom progression through the setup process.	2026-03-10 11:06:11 +01:00
Petr	21af1abb6e	Fix setup instructions: add SSH key steps, fix clipboard on HTTP - Add steps 2-4 (SSH key generation, copy pubkey, create account) - Fix clipboard copy using textarea fallback for non-HTTPS contexts - Generate simple plain-text Claude Code prompt instead of full YAML - Show what Claude will do (SSH, rsync, DuckDB, CLAUDE.md)	2026-03-10 11:00:48 +01:00
Petr	f635195c80	Add multi-domain support and full-email username generation - Support comma-separated domains in auth.allowed_domain config - Use full email as system username (user@domain.com -> user_domain_com) to avoid collisions with reserved names and across domains - Update both auth providers (google, email) for multi-domain display - Add tests for username generation and update email auth tests	2026-03-10 10:50:01 +01:00
Petr	a8a9efeb60	Update auto-install docs with steps 3-4 (config + email auth)	2026-03-10 10:43:39 +01:00
Petr	e2ab219171	Add email magic link authentication provider New pluggable auth provider that sends passwordless sign-in links. Works with domain restriction (same as Google OAuth). Falls back to showing the link in browser when SMTP is not configured (dev mode).	2026-03-10 10:39:19 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	7c9007a8f9	Update docs for modular architecture (auth/, services/, scripts/) Add auth providers, standalone services, and service patterns to project structure in README, ARCHITECTURE, and CLAUDE.md. Reflects the completed extraction of auth, telegram bot, ws gateway, corporate memory, and session collector.	2026-03-09 13:11:40 +01:00
Petr	15b513266d	Merge dev_scripts/ into scripts/ Move dev_run.py and test_sync.sh from dev_scripts/ to scripts/, eliminating the separate dev_scripts directory. Update scripts README with development scripts section.	2026-03-09 13:11:36 +01:00
Petr	2d3f127e58	Update paths in docs and sudoers after services/ extraction All references to server/telegram_bot/, server/ws_gateway/, server/corporate_memory/, server/session_collector* updated to their new locations under services/.	2026-03-09 13:02:13 +01:00
Petr	c6a711aa27	Extract pluggable auth provider system into auth/ package Replace hardcoded Google OAuth + password auth registration with auto-discovered auth providers. Each provider in auth/<name>/provider.py implements AuthProvider ABC and is automatically registered at startup. - auth/__init__.py: AuthProvider ABC + discover_providers() scanner - auth/google/: Google OAuth provider (extracted from webapp/auth.py) - auth/password/: Email/password provider (delegates to webapp/password_auth) - auth/desktop/: Desktop JWT auth (API-only, not visible on login page) - webapp/auth.py: stripped to core infra (login_required, /login, /logout) - webapp/app.py: auto-discovery loop replaces manual blueprint registration - login.html: dynamic provider buttons via Jinja loop	2026-03-09 13:02:08 +01:00

210 commits