agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	30987eef16	fix: add union_by_name=true to read_parquet calls in profiler. Handles schema evolution across partitions when profiling tables with multiple parquet files that may have different column sets.	2026-04-09 18:42:33 +02:00
ZdenekSrotyr	fa30298589	fix: use DATA_DIR env var instead of hardcoded /data paths. services/telegram_bot/config.py: NOTIFICATIONS_DIR now uses DATA_DIR fallback; src/profiler.py: DATA_DIR now uses main DATA_DIR env var instead of PROFILER_DATA_DIR; services/telegram_bot/dispatch.py: WS_GATEWAY_SOCKET_PATH now uses WS_GATEWAY_SOCKET env var.	2026-04-09 16:39:44 +02:00
ZdenekSrotyr	92fbb88c15	chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs	2026-04-08 12:10:47 +02:00
Petr	a667b4e32f	Fix profiler crash for remote-only tables without primary_key. Same issue as config.py: profiler's TableInfo and parser required primary_key and sync_strategy, breaking auto-profile after sync when daily_deal_traffic (remote-only) is in config.	2026-03-25 14:47:00 +01:00
Petr	be58e63394	Move profiler config to instance.yaml (KISS principle). Instead of hardcoded Python constants, load profiler settings from config. instance.yaml: profiler section with all parameters. Defaults: fall back to sensible defaults if config not found. Centralized: all profiler tuning in one place, no code changes needed.	2026-03-12 14:45:14 +01:00
Petr	c25278538c	Simplify profiler config: use single SAMPLE_SIZE parameter (KISS). Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE: if table > SAMPLE_SIZE, sample that many rows; otherwise, use all rows. Cleaner, easier to configure.	2026-03-12 14:43:23 +01:00
Petr	d2e83ce9d0	Set DuckDB memory_limit=4GB in profiler to prevent OOM. Server has 8GB RAM with other services running. DuckDB defaults to using all available memory, causing OOM killer when profiling large tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).	2026-03-12 11:06:49 +01:00
Petr	28543d98b1	Fix profiler file_size and catalog stats fallback. Profiler computes file_size_mb from actual parquet files when sync_state.json is absent (sample data / no-sync deployments); catalog header falls back to profiles.json for aggregate stats (tables count, total rows) when sync_state.json is missing.	2026-03-10 22:12:46 +01:00
Petr	1be0dc5300	Add flat parquet fallback to profiler get_parquet_path. Tries subfolder path first (Keboola-style layout), then falls back to flat path for simple deployments like sample data.	2026-03-10 22:09:14 +01:00
Petr	b99ec576ca	Add self-service data onboarding system. Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: discover_tables() on DataSource ABC + Keboola implementation; admin_required decorator with server-side recomputation; GET /api/admin/discover-tables endpoint. Phase 2 - Table Registry: src/table_registry.py with CRUD, validation, migration from MD; Admin API: register/update/unregister with version locking; DELETE cascade cleans up per-user subscriptions. Phase 3 - Auto-Profiling: profile_changed_tables() for incremental profiling; non-fatal hook in sync_all() after successful sync. Phase 4 - Per-Table Subscriptions: table_mode (all/explicit) with per-table toggles; GET/POST /api/table-subscriptions endpoints; subscription status in catalog and dashboard views. Phase 5 - Smart Sync: Python-generated rsync filter files (not shell YAML parsing); sync_data.sh uses --filter="merge ..." for explicit mode. Phase 6 - Admin UI: /admin/tables with discovery, registration modal, registry mgmt; vanilla JS, matching existing design system.	2026-03-09 14:25:37 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module. Move all Jira-specific code into a self-contained connector module: 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper); all imports updated to use connectors.jira.* paths; Jira is now conditional, auto-detected via JIRA_DOMAIN env var; webapp registers Jira blueprint only when available; health service monitors Jira timers only when enabled; profiler loads Jira tables dynamically from filesystem; sync settings uses config-driven dependency validation; renamed keboola_platform_url -> custom_url in transform; updated deploy.sh, sudoers-deploy, backfill_gap.sh paths; fixed pytest.ini to skip live tests by default.	2026-03-09 11:17:50 +01:00
Petr	c56905d34f	Initial commit: OSS data distribution platform. Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.	2026-03-08 23:31:28 +01:00
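
Commits fa30298589 and 1be0dc5300 describe how the profiler finds its parquet files: the data root comes from the DATA_DIR environment variable, and the lookup tries a subfolder path before a flat one. The following is a minimal sketch of that behaviour, not the repository's code. The names `resolve_data_dir` and `find_parquet_path`, the `folder` argument, the exact directory shapes, and the `/data` default are assumptions made for illustration; only DATA_DIR and the subfolder-then-flat order come from the commit messages.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_data_dir(default: str = "/data") -> Path:
    """Return the data root from DATA_DIR, or a default (the default is assumed)."""
    return Path(os.environ.get("DATA_DIR", default))


def find_parquet_path(data_dir: Path, folder: str, table: str) -> Optional[Path]:
    """Try the subfolder layout first, then fall back to the flat layout."""
    candidates = [
        data_dir / folder / f"{table}.parquet",  # subfolder layout (hypothetical shape)
        data_dir / f"{table}.parquet",           # flat layout, e.g. sample data
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # neither layout contains the table
```

Returning None instead of raising keeps the caller in charge of deciding whether a missing file is an error, which would suit a non-fatal profiling hook.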

12 commits
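
The sampling rule from commit c25278538c (one SAMPLE_SIZE parameter instead of a threshold plus a size) can be sketched as below. This is an illustration under stated assumptions: the function name `rows_to_sample` and the convention of returning None for "use all rows" are hypothetical and do not appear in the commit log.

```python
from typing import Optional


def rows_to_sample(row_count: int, sample_size: int) -> Optional[int]:
    """Apply the single-parameter rule described in the commit message.

    If the table has more rows than sample_size, sample that many rows.
    Otherwise return None, meaning "use all rows" (assumed convention).
    """
    if row_count > sample_size:
        return sample_size
    return None
```

For example, `rows_to_sample(22_000_000, 1_000_000)` returns `1_000_000`, while `rows_to_sample(500, 1_000_000)` returns `None`.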