Restructure docs for OSS readability
Remove redundant docs (GETTING_STARTED, README index, jira_schema), add ARCHITECTURE.md and llms.txt for AI-era discoverability, move notifications.md to docs/future/NOTIFICATIONS.md.
parent 1471b8addf
commit d8226c6641
7 changed files with 183 additions and 582 deletions
135
ARCHITECTURE.md
Normal file
@@ -0,0 +1,135 @@
# Architecture

## System Overview

```
Data Source (Keboola / CSV / BigQuery)
        |
        v
+------------------------------------------+
|  Data Broker Server                      |
|                                          |
|  src/data_sync.py                        |
|    -> src/adapters/*.py (fetch data)     |
|    -> src/parquet_manager.py (convert)   |
|                                          |
|  /data/src_data/parquet/  (output)       |
|  /data/docs/              (synced docs)  |
|  /data/scripts/           (helpers)      |
+------------------------------------------+
        |  rsync over SSH
        v
+------------------------------------------+
|  Analyst Machine                         |
|                                          |
|  server/parquet/  -> DuckDB views        |
|  user/duckdb/analytics.duckdb            |
|  Claude Code queries DuckDB via SQL      |
+------------------------------------------+
```

## Components

### 1. Data Sync Engine (`src/`)

Pulls data from configured source, converts to Parquet.

| File | Role |
|------|------|
| `src/data_sync.py` | Orchestration + `DataSource` ABC (line 149) |
| `src/adapters/base.py` | Adapter interface |
| `src/adapters/keboola_adapter.py` | Keboola Storage adapter |
| `src/keboola_client.py` | Low-level Keboola API client |
| `src/parquet_manager.py` | CSV -> typed Parquet conversion |
| `src/config.py` | Reads `data_description.md` for table definitions |
| `src/profiler.py` | Data profiling for catalog UI |
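
The adapter split listed in the table can be pictured roughly as follows. This is a minimal sketch, not the project's code: the method name `fetch_table`, the `InMemoryAdapter` class, the `_ADAPTERS` registry, and the `create_adapter` factory are hypothetical names chosen for illustration. Only the existence of a `DataSource` ABC and of an adapter factory comes from this document.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical stand-in for the ABC that lives in src/data_sync.py."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> list[dict]:
        """Return the rows of one table as a list of dicts."""


class InMemoryAdapter(DataSource):
    """Illustrative adapter that serves rows from a dict instead of an API."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def fetch_table(self, table_id: str) -> list[dict]:
        return list(self._tables[table_id])


# Registry mapping a source type name to an adapter class (assumed layout).
_ADAPTERS: dict[str, type[DataSource]] = {"memory": InMemoryAdapter}


def create_adapter(source_type: str, **kwargs) -> DataSource:
    """Factory: look up the adapter class by name and instantiate it."""
    try:
        adapter_cls = _ADAPTERS[source_type]
    except KeyError:
        raise ValueError(f"Unknown data source type: {source_type!r}") from None
    return adapter_cls(**kwargs)
```

With this shape, the orchestration code only depends on `DataSource`, so adding a new source would mean registering one more class rather than touching the sync loop.
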

### 2. Web Application (`webapp/`)

Flask app for user onboarding, settings, and data catalog.

| File | Role |
|------|------|
| `webapp/app.py` | Flask entry point, routes |
| `webapp/config.py` | Loads `instance.yaml`, exposes `Config` to templates |
| `webapp/account_service.py` | User account details, sync status |
| `webapp/templates/` | Jinja2 templates (dashboard, setup, catalog) |

### 3. Configuration (`config/`)

| File | Role |
|------|------|
| `config/instance.yaml` | Main instance config (not committed) |
| `config/instance.yaml.example` | Template with all options |
| `config/loader.py` | YAML loader with `${ENV_VAR}` interpolation |
| `config/.env.template` | Secret variable placeholders |
| `docs/data_description.md` | Table schemas + sync strategies (not committed) |

### 4. Server Infrastructure (`server/`)

Deployment, systemd services, security.

| File | Role |
|------|------|
| `server/setup.sh` | Initial server provisioning (groups, users, dirs) |
| `server/webapp-setup.sh` | Nginx, SSL, systemd for webapp |
| `server/deploy.sh` | CI/CD deployment script |
| `server/sudoers-deploy` | Least-privilege sudo rules for deploy user |
| `server/sudoers-webapp` | Sudo rules for www-data (webapp) |
| `server/bin/` | Management scripts (add-analyst, list-analysts, etc.) |

### 5. Analyst Scripts (`scripts/`)

Helper scripts synced to analyst machines.

| File | Role |
|------|------|
| `scripts/sync_data.sh` | Sync data from server via rsync |
| `scripts/setup_views.sh` | Create DuckDB views over Parquet files |

## Config Loading Chain

```
config/instance.yaml
    |  (loaded by config/loader.py)
    |  (${ENV_VAR} references resolved from .env / environment)
    v
webapp/config.py
    |  (_load_instance_config at module level)
    |  (_get(config, *keys) for safe nested access)
    v
inject_config() context processor
    |  (exposes Config object to all Jinja templates)
    v
{{ config.INSTANCE_NAME }} in templates
```

## Data Flow

```
1. Admin defines tables in docs/data_description.md
2. src/config.py parses YAML blocks from markdown
3. src/data_sync.py iterates tables, calls adapter
4. Adapter fetches CSV/JSON from source API
5. src/parquet_manager.py converts to typed Parquet
6. Parquet files stored in /data/src_data/parquet/
7. Analyst runs scripts/sync_data.sh (rsync over SSH)
8. scripts/setup_views.sh creates DuckDB views
9. Claude Code queries DuckDB, returns insights
```
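
Step 2 of the flow above, pulling YAML blocks out of a markdown file, could look something like the sketch below. It assumes the blocks are fenced with a `yaml` or `yml` info string, which this document does not confirm, and it only extracts the raw text. Parsing that text would need a YAML library such as PyYAML, which is left out to keep the example standard-library only. `extract_yaml_blocks` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests inside markdown

# A block starts at a line like FENCE + "yaml" and ends at the next bare FENCE line.
_YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> list[str]:
    """Return the raw text of every fenced YAML block, in document order."""
    return [match.group(1) for match in _YAML_BLOCK.finditer(markdown)]
```

Keeping extraction separate from parsing means a malformed table definition can be reported with the block it came from, rather than failing the whole file.
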

## Security Model

- **Groups**: `data-ops` (admins), `dataread` (analysts), `data-private` (privileged)
- **Sudoers**: Explicit command whitelisting (no wildcards)
- **SSH**: Key-based auth only, keys registered via webapp
- **OAuth**: Google domain restriction via `auth.allowed_domain`
- **Secrets**: `${ENV_VAR}` in YAML, actual values in `.env` (gitignored)
- **Staging**: `/tmp/data_analyst_staging` with setgid for group ownership
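
As an illustration of the OAuth bullet above, a domain restriction boils down to comparing the domain part of the signed-in email with the configured value. The sketch below is hypothetical: `is_allowed_email` is not a function named in this document, and the real webapp may rely on its OAuth library for this check. Only the `auth.allowed_domain` setting comes from the source.

```python
def is_allowed_email(email: str, allowed_domain: str) -> bool:
    """True if the email's domain equals allowed_domain (case-insensitive)."""
    local, separator, domain = email.strip().rpartition("@")
    # Reject strings without exactly one "@" or with an empty side.
    if not separator or not local or not domain or "@" in local:
        return False
    return domain.lower() == allowed_domain.strip().lower()
```

An exact match on the domain, rather than a suffix test, avoids accepting look-alike domains such as `example.com.evil.org`.
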

## Key Patterns

- **Adapter pattern**: Factory in `src/adapters/__init__.py`, ABC in `src/data_sync.py`
- **Atomic writes**: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` for JSON state files
- **User home writes**: `sudo install -o {user} -g {user}` for writing to analyst home dirs
- **Config interpolation**: `${ENV_VAR}` in YAML resolved at load time, missing vars logged as warnings
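
The atomic-write pattern named in the list combines three standard-library calls. The following is a minimal sketch of how they fit together on a POSIX system, not the project's actual helper: the function name `write_json_atomic`, the default mode `0o644`, and the added `os.fsync` call are assumptions.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data, mode: int = 0o644) -> None:
    """Write data as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0600; set the intended mode explicitly.
            os.fchmod(handle.fileno(), mode)
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because `os.replace` swaps the file in a single step, a concurrent reader sees either the old state file or the new one, never a partial write.
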

@@ -142,6 +142,7 @@ Claude Code will connect to the local DuckDB database, write and execute SQL, an

## Documentation

- **[Architecture](ARCHITECTURE.md)** -- System components, data flow, and key patterns
- **[Quick Start](docs/QUICKSTART.md)** -- End-to-end setup for new deployments
- **[Configuration](docs/CONFIGURATION.md)** -- All configuration options explained
- **[Deployment](docs/DEPLOYMENT.md)** -- Server provisioning and deployment guide

@@ -1,231 +0,0 @@
# Getting Started with AI Data Analyst

Quick start guide for analysts who want to explore company data using AI.

## What is This?

**AI Data Analyst** gives you local access to your organization's data (sales, HR, finance, telemetry) so you can analyze it using Claude Code with natural language questions.

Instead of writing SQL queries manually, you can ask Claude questions like:
- "Which companies have the highest revenue?"
- "Show me employee headcount trends over the last year"
- "Compare actual PPU usage vs contract limits for this month"

## Prerequisites

- An account on your organization's Data Analyst instance
- Claude Code installed locally ([claude.ai/code](https://claude.ai/code))
- That's it! Claude handles the rest.

## First Time Setup (5 minutes)

1. **Visit the setup page**: `https://your-instance-url`
2. **Sign in** with your organization account
3. **Click "Copy Setup Instructions"** - your username is pre-filled
4. **Open Claude Code** in a new folder (e.g., `~/data-analysis`)
5. **Paste the instructions** into Claude Code
6. **Let Claude do the setup** - it will:
   - Generate SSH keys
   - Create your server account
   - Download ~690 MB of data
   - Set up DuckDB database
   - Install Python dependencies

That's it! Claude handles everything automatically.

## How to Use It

### Starting a New Session

Every time you open Claude Code in your project folder:

1. Claude will automatically detect the project (via `CLAUDE.md`)
2. **Always check data freshness first** - ask Claude: "Is my data fresh?"
3. If stale, ask: "Sync the latest data"
4. Start asking questions!

### Example Questions to Ask Claude

**Sales & Revenue Analysis:**
- "What are our top 10 customers by total contract value?"
- "Show me new opportunities created this month"
- "Which products generate the most revenue?"

**HR & Headcount:**
- "How many employees do we have by department?"
- "Show me headcount growth over the last 6 months"
- "Who are the top salespeople by closed deals?"

**Platform Usage & Telemetry:**
- "Which projects are using the most PPU credits?"
- "Compare actual usage vs limits for our biggest customers"
- "Show me PAYG payment trends"

**Finance:**
- "What's our MRR trend over the last year?"
- "Show me budget vs actuals for Q4"
- "Compare revenue by product line"

**Cross-Domain Analysis:**
- "Which account owners have the highest win rates?"
- "Link organizations to their CRM accounts"
- "Show employee owners and their total pipeline value"

### What Claude Can Do

Claude Code can:
- ✅ Write and run SQL queries on your DuckDB database
- ✅ Create visualizations and charts
- ✅ Analyze trends and patterns
- ✅ Join data across domains (sales, HR, finance, telemetry)
- ✅ Export results to CSV or other formats
- ✅ Keep your data fresh by syncing from the server

### What Data is Available?

Your local database contains:

| Domain | Tables | Examples |
|--------|--------|----------|
| **Sales & CRM** | 14 tables | Companies, contacts, opportunities, contracts, products, activities, usage limits, MRR |
| **HR** | 2 tables | Employees, historical snapshots |
| **Finance** | 5 tables | P&L KPIs, budgets, actuals, exchange rates, infrastructure cost |
| **Telemetry** | 4 tables | Organizations, projects, usage metrics, payments |

**Total: 25 tables with full relationships documented**

For detailed schemas, ask Claude: "Show me the table relationships" or check `docs/data_description.md`.

## Tips for Better Analysis

1. **Always check data freshness** - stale data = wrong conclusions
   - Ask: "Is my data fresh?" or "When was data last synced?"

2. **Be specific with questions**
   - ❌ "Show me sales data"
   - ✅ "Show me top 10 companies by contract value in 2024"

3. **Ask Claude to explain queries**
   - "Explain this query in plain English"
   - "Why did you join these tables this way?"

4. **Iterate on results**
   - "Now group by month" or "Add a filter for Europe only"

5. **Export when ready**
   - "Export this to CSV"
   - "Create a chart of this trend"

## Keeping Data Fresh

Your local data syncs from the server. Always work with fresh data:

**Sync latest data:**
- Ask Claude: "Sync latest data"
- Or run: `bash scripts/sync_data.sh`

**How often?** Data is refreshed on the server every few hours. Sync daily or before important analysis.

## Reporting Issues

If you encounter problems:

### Option 1: GitHub Issue (Preferred)

If you have access to the project's GitHub repository:
1. Go to the repository's Issues page
2. Click "New Issue"
3. Describe the problem with:
   - What you were trying to do
   - Error message or unexpected behavior
   - Steps to reproduce

### Option 2: Internal Issue Tracker

If your organization uses an internal issue tracker (Linear, Jira, etc.):
1. Create a new issue in the appropriate project
2. Describe the problem
3. The platform team will triage and handle it.

### What to Include

When reporting issues:
- ✅ Error messages (copy the full text)
- ✅ What you were trying to do
- ✅ Output of `bash server/scripts/sync_data.sh`
- ✅ Claude Code version (if relevant)

## Technical Details (For the Curious)

If you want to understand what's under the hood:

**Architecture:**
- Data syncs from the configured data source to the server
- Your local setup downloads Parquet files via rsync
- DuckDB creates views over Parquet files (no data duplication)
- Claude Code queries DuckDB using SQL
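
To make the "views over Parquet files" point concrete, the sketch below generates one DuckDB `CREATE VIEW` statement per Parquet file using DuckDB's `read_parquet` table function. It is an illustration under assumptions: the real `setup_views.sh` is a shell script whose contents are not shown here, and the flat one-file-per-table layout, the view-named-after-file convention, and the `build_view_statements` name are all hypothetical.

```python
from pathlib import Path


def build_view_statements(parquet_dir: str) -> list[str]:
    """Return one CREATE VIEW statement per .parquet file in parquet_dir."""
    statements = []
    for file in sorted(Path(parquet_dir).glob("*.parquet")):
        # Escape quotes so unusual file names cannot break the SQL.
        view_name = file.stem.replace('"', '""')
        file_path = file.as_posix().replace("'", "''")
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet('{file_path}');"
        )
    return statements
```

Since a view stores only the query text, the DuckDB file stays small and can be rebuilt from the Parquet files at any time.
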

**File Structure:**
```
~/data-analysis/
├── CLAUDE.md              # Claude Code project context (auto-updated on sync)
├── CLAUDE.local.md        # Your personal customizations (never overwritten)
├── .claude/
│   └── settings.json      # Project permissions (synced from server)
├── server/                # Read-only data from server
│   ├── parquet/           # Data files (~690 MB)
│   ├── docs/              # Documentation
│   ├── scripts/           # Helper scripts
│   ├── examples/          # Example scripts
│   └── metadata/          # Sync state, table metadata
├── user/                  # Your workspace (writable)
│   ├── duckdb/            # DuckDB database
│   ├── notifications/     # Your notification scripts
│   ├── artifacts/         # Analysis outputs
│   └── scripts/           # Your custom scripts
└── .venv/                 # Python environment
```

**Python Environment:**
- Virtual environment with: pandas, duckdb, pyarrow
- Scripts auto-activate the venv
- Claude Code manages this automatically

For complete technical documentation, see:
- `docs/data_description.md` - Table schemas and relationships
- `CLAUDE.md` - Project context for Claude Code
- `../dev_docs/server.md` - Server architecture (for developers)

## FAQ

**Q: Do I need to know SQL?**
A: No! Ask Claude in natural language. It writes the SQL for you.

**Q: Can I break anything?**
A: No. You're only reading local data. The server and data source are read-only.

**Q: How much disk space do I need?**
A: ~2 GB (data + database + Python dependencies)

**Q: What if my data gets out of sync?**
A: Just ask Claude to sync: "Sync latest data from server"

**Q: What if DuckDB gets corrupted?**
A: The sync script automatically detects corrupted DuckDB files and recreates them from the Parquet files. This can happen if a sync is interrupted or the file is only partially transferred. All data is safe in the Parquet files; DuckDB only contains VIEW definitions that point to them.

**Q: Can I use this without Claude Code?**
A: Yes, but you'd need to write SQL manually. Claude Code makes it much easier.

**Q: Is this data secure?**
A: Yes. Data is synced via SSH (requires authentication). Only approved users with accounts can access it.

## Next Steps

1. **Complete the setup** (follow instructions at your instance URL)
2. **Ask Claude a simple question** to test: "How many companies are in the database?"
3. **Explore the data** - ask Claude: "What tables are available?"
4. **Start analyzing!** - ask real business questions

Need help? Contact your platform team or create an issue as described above.

Happy analyzing!

@@ -1,28 +0,0 @@
# Analyst Documentation

Documentation for **analysts** using the AI Data Analyst platform.

This folder is synced to all analyst machines in the `server/docs/` directory.

## Quick Start
- **[GETTING_STARTED.md](GETTING_STARTED.md)** - New user guide and setup instructions

## Data Reference
- **[data_description.md](data_description.md)** - Single source of truth for table schemas, relationships, and sync strategies
- **[jira_schema.md](jira_schema.md)** - Detailed Jira data schema

## Business Metrics
- **[metrics/](metrics/)** - Standardized metric definitions
  - `metrics/metrics.yml` - Index of all available metrics
  - `metrics/finance/` - Financial metrics (infrastructure costs, retention)
  - `metrics/product_usage/` - Usage metrics (consumption, limits, telemetry)
  - `metrics/sales_revenue/` - Sales metrics (MRR, ARR, expansions)
  - `metrics/weekly_leadership_kpis/` - Weekly KPIs for leadership reporting

## Tools & Integrations
- **[notifications.md](notifications.md)** - How to send Telegram notifications from your analysis scripts
- **[setup/](setup/)** - Bootstrap configuration and Claude Code templates

## For Developers

Server administration, development docs, and internal planning are in the **`dev_docs/`** folder (not synced to analyst machines).

@@ -1,323 +0,0 @@
# Jira Support Tickets Schema

This document describes the schema of transformed Jira data available for analysis.

## Data Location

```
/data/src_data/parquet/jira/          # Transformed Parquet files (monthly chunks)
├── issues/                           # Main issues table
│   ├── 2025-01.parquet
│   ├── 2025-02.parquet
│   └── ...
├── comments/                         # Issue comments
│   └── YYYY-MM.parquet
├── attachments/                      # Attachment metadata with local paths
│   └── YYYY-MM.parquet
├── changelog/                        # Change history
│   └── YYYY-MM.parquet
├── issuelinks/                       # Links between issues
│   └── YYYY-MM.parquet
└── remote_links/                     # External links (Confluence, Slack, etc.)
    └── YYYY-MM.parquet

/data/src_data/raw/jira/              # Raw data (JSON + files)
├── issues/                           # Raw JSON per issue
├── attachments/                      # Downloaded attachment files
│   └── {issue_key}/                  # By issue key (e.g., SUPPORT-15051/)
│       └── {id}_{filename}           # e.g., 56340_screenshot.png
└── webhook_events/                   # Raw webhook payloads (audit)
```

**Monthly Partitioning:** Parquet files are partitioned by month based on `created_at` timestamp. This enables efficient rsync (only changed months sync) and keeps individual file sizes manageable for ~15,000 tickets.
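
The mapping from a row to its monthly file follows directly from the layout above. As a minimal sketch, assuming the `created_at` value is used as-is with no timezone conversion (this document does not say which timezone applies), and with `partition_path` as a hypothetical helper name:

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_path(
    table: str,
    created_at: datetime,
    root: str = "/data/src_data/parquet/jira",
) -> str:
    """Return the monthly Parquet file a row with this created_at belongs to."""
    return str(PurePosixPath(root) / table / f"{created_at:%Y-%m}.parquet")
```

A ticket created on 31 January 2025 therefore lands in `issues/2025-01.parquet`, and later updates to it rewrite only that one file.
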

**DuckDB Query Pattern:** Use glob patterns to query all months:
```sql
SELECT * FROM 'server/parquet/jira/issues/*.parquet';
```

## Tables

### issues

Main table with support ticket information.

| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Unique issue identifier (e.g., "SUPPORT-15190") |
| `issue_id` | string | Jira internal ID |
| `issue_url` | string | Direct URL to issue in Jira |
| `summary` | string | Issue title/summary |
| `description` | string | Full description (plain text, extracted from ADF) |
| `issue_type` | string | Type (Service Request, Bug, etc.) |
| `status` | string | Current status (New, Under Review, Resolved, etc.) |
| `status_category` | string | Status category (To Do, In Progress, Done) |
| `priority` | string | Priority level (Lowest, Low, Medium, High, Highest) |
| `resolution` | string | Resolution type if resolved |
| `project_key` | string | Project key (SUPPORT) |
| `project_name` | string | Project name (e.g., your Jira project name) |
| `creator_email` | string | Email of ticket creator |
| `creator_name` | string | Display name of creator |
| `reporter_email` | string | Email of reporter |
| `reporter_name` | string | Display name of reporter |
| `assignee_email` | string | Email of assigned agent |
| `assignee_name` | string | Display name of assignee |
| `created_at` | datetime | When ticket was created |
| `updated_at` | datetime | Last update timestamp |
| `resolved_at` | datetime | When ticket was resolved (null if open) |
| `due_date` | string | Due date if set |
| `labels` | string (JSON) | Array of labels as JSON |
| `attachment_count` | int | Number of attachments |
| `comment_count` | int | Number of comments |
| `issuelink_count` | int | Number of linked issues |
| `request_type` | string | Service Desk request type name |
| `request_status` | string | Service Desk specific status |
| `severity` | string | Severity level (custom field) |
| `triage` | string (JSON) | Triage multi-select (renamed from team_tier) |
| `configuration_item` | string (JSON) | Configuration item multi-select (renamed from categories) |
| `participants` | string (JSON) | List of participant emails |
| `organizations` | string (JSON) | Related organizations |
| `spam` | string | Spam flag (True/null) |
| `context` | string | Context field (renamed from root_cause; maps to customfield_10330) |
| `keboola_platform_url` | string | Keboola platform URL (renamed from resolution_summary) |
| `slack_link` | string | Slack link (renamed from customer_type) |
| `technical_issue_category` | string | Technical issue category (renamed from satisfaction_rating) |
| `email_address` | string | Email address field (renamed from context; maps to customfield_10475) |
| `satisfaction` | int | Customer satisfaction rating (1-5) |
| `first_response_breached` | string | SLA: whether first response SLA was breached (True/False) |
| `first_response_goal_millis` | int | SLA: first response goal duration in milliseconds |
| `first_response_elapsed_millis` | int | SLA: actual first response time in milliseconds |
| `time_to_resolution_breached` | string | SLA: whether resolution SLA was breached (True/False) |
| `time_to_resolution_goal_millis` | int | SLA: resolution goal duration in milliseconds |
| `time_to_resolution_elapsed_millis` | int | SLA: actual resolution time in milliseconds |
| `l3_team` | string | L3 team assignment (new) |
| `_synced_at` | string | When data was synced from Jira |
| `_raw_file` | string | Source JSON filename |
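
Two column conventions in this table need a small decoding step once rows are loaded into Python: the `string (JSON)` columns such as `labels` hold JSON-encoded arrays, and the `*_millis` SLA columns hold durations in milliseconds. The helpers below are a hedged sketch with hypothetical names (`parse_json_column`, `millis_to_hours`), and treating null or empty values as an empty list is an assumption rather than documented behavior.

```python
import json


def parse_json_column(value):
    """Decode a JSON-encoded column value; None or empty strings become []."""
    if value is None or value == "":
        return []
    return json.loads(value)


def millis_to_hours(millis):
    """Convert an SLA duration in milliseconds to hours, preserving None."""
    return None if millis is None else millis / 3_600_000
```

For instance, `millis_to_hours` applied to `first_response_elapsed_millis` gives a figure that can be compared directly with an hours-based SLA target.
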

### comments

Issue comments from support conversations.

| Column | Type | Description |
|--------|------|-------------|
| `comment_id` | string | Unique comment ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `author_email` | string | Comment author email |
| `author_name` | string | Comment author display name |
| `body` | string | Comment text (plain text, extracted from ADF) |
| `created_at` | datetime | When comment was created |
| `updated_at` | datetime | When comment was last edited |
| `update_author_email` | string | Who last edited the comment |

### attachments

Attachment metadata with local file paths.

| Column | Type | Description |
|--------|------|-------------|
| `attachment_id` | string | Unique attachment ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `filename` | string | Original filename |
| `local_path` | string | Server path to downloaded file |
| `hierarchical_path` | string | Hierarchical path for future use (e.g., `15/051/56340_file.png`) |
| `size_bytes` | int | File size in bytes |
| `mime_type` | string | MIME type (image/png, application/pdf, etc.) |
| `author_email` | string | Who uploaded the attachment |
| `created_at` | datetime | When attachment was uploaded |
| `content_url` | string | Jira API URL to download |
| `thumbnail_url` | string | Jira API URL for thumbnail (images only) |

### changelog

History of all field changes on issues.

| Column | Type | Description |
|--------|------|-------------|
| `change_id` | string | Change history ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `author_email` | string | Who made the change |
| `author_name` | string | Display name of who made change |
| `field_name` | string | Name of changed field |
| `field_type` | string | Type of field (jira, custom) |
| `from_value` | string | Previous value (as string) |
| `to_value` | string | New value (as string) |
| `changed_at` | datetime | When change occurred |

### issuelinks

Links between Jira issues (blocks, duplicates, relates to, etc.).

| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Source issue key (FK to issues) |
| `link_id` | string | Unique link ID |
| `link_type` | string | Link type name (Blocks, Duplicate, Relates, etc.) |
| `direction` | string | Link direction: "inward" or "outward" |
| `linked_issue_key` | string | Target issue key |
| `linked_issue_summary` | string | Summary of linked issue |
| `linked_issue_status` | string | Status of linked issue |
| `linked_issue_priority` | string | Priority of linked issue |

### remote_links

External links attached to issues (Confluence pages, Slack threads, external URLs).

| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Parent issue key (FK to issues) |
| `remote_link_id` | string | Unique remote link ID |
| `url` | string | External URL |
| `title` | string | Link title/label |
| `application_name` | string | Application name (e.g., "Confluence", "Slack") |
| `application_type` | string | Application type identifier |

## Relationships

All child tables reference `jira_issues` via the `issue_key` column:

```
jira_issues (PK: issue_key)
├── jira_comments      (FK: issue_key → jira_issues.issue_key)
├── jira_attachments   (FK: issue_key → jira_issues.issue_key)
├── jira_changelog     (FK: issue_key → jira_issues.issue_key)
├── jira_issuelinks    (FK: issue_key → jira_issues.issue_key)
│                      (FK: linked_issue_key → jira_issues.issue_key)
└── jira_remote_links  (FK: issue_key → jira_issues.issue_key)
```

These relationships are used by the Data Profiler to populate the Relationships tab in the catalog UI. They enable navigation between related table profiles.

**Join examples:**

```sql
-- Issues with their comments
SELECT i.issue_key, i.summary, c.body, c.created_at
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/comments/*.parquet' c ON i.issue_key = c.issue_key;

-- Issues with linked issues
SELECT i.issue_key, i.summary, l.link_type, l.direction, l.linked_issue_key
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/issuelinks/*.parquet' l ON i.issue_key = l.issue_key;
```

## Example Queries (DuckDB)

**Note:** Use glob patterns (`*.parquet`) to query all monthly chunks at once.

### Active tickets by status

```sql
SELECT status, COUNT(*) as count
FROM 'server/parquet/jira/issues/*.parquet'
WHERE resolved_at IS NULL
GROUP BY status
ORDER BY count DESC;
```

### Average resolution time by severity

```sql
SELECT
    severity,
    COUNT(*) as tickets,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 3600) as avg_hours
FROM 'server/parquet/jira/issues/*.parquet'
WHERE resolved_at IS NOT NULL
GROUP BY severity;
```

### Most active commenters

```sql
SELECT
    author_email,
    author_name,
    COUNT(*) as comments
FROM 'server/parquet/jira/comments/*.parquet'
GROUP BY author_email, author_name
ORDER BY comments DESC
LIMIT 10;
```

### Tickets with attachments

```sql
SELECT
    i.issue_key,
    i.summary,
    a.filename,
    a.local_path
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/attachments/*.parquet' a ON i.issue_key = a.issue_key
WHERE a.local_path IS NOT NULL;
```

### Field change frequency

```sql
SELECT
    field_name,
    COUNT(*) as changes
FROM 'server/parquet/jira/changelog/*.parquet'
GROUP BY field_name
ORDER BY changes DESC;
```

### Query specific month only

```sql
-- Query only January 2026 data
SELECT * FROM 'server/parquet/jira/issues/2026-01.parquet';
```

## Data Freshness

- Data is synced in **real-time** via Jira webhooks
- Each issue update triggers: webhook → fetch → save JSON → download attachments → **incremental Parquet transform**
- Parquet files are updated within seconds of Jira change (only affected month is rewritten)
- Raw JSON is kept for audit and debugging
- Historical data can be loaded via `scripts/jira_backfill.py`

## Viewing Attachments

Attachments are stored on the server at `/data/src_data/raw/jira/attachments/{issue_key}/`.
Analysts can access them via symlink at `~/server/jira_attachments/`.

**Download attachments for a specific ticket:**
```bash
# Rsync one ticket's attachments to local temp folder
rsync -avz data-analyst:server/jira_attachments/SUPPORT-1234/ /tmp/SUPPORT-1234/

# View locally
ls /tmp/SUPPORT-1234/
open /tmp/SUPPORT-1234/screenshot.png  # macOS
```

**Find attachment info from parquet:**
```sql
SELECT issue_key, filename, size_bytes, local_path
FROM jira_attachments
WHERE issue_key = 'SUPPORT-1234';
```

## Custom Field Reference

| Field ID | Column Name | Description |
|----------|-------------|-------------|
| customfield_10004 | severity | Severity: 1-Highest to 5-Lowest |
| customfield_10323 | triage | Triage multi-select (renamed from team_tier) |
| customfield_10511 | configuration_item | Configuration item multi-select (renamed from categories) |
| customfield_10365 | spam | Spam flag: True/null |
| customfield_10010 | request_type_info | Service Desk request type metadata |
| customfield_10330 | context | Context field (renamed from root_cause) |
| customfield_10325 | keboola_platform_url | Keboola platform URL (renamed from resolution_summary) |
| customfield_10350 | slack_link | Slack link (renamed from customer_type) |
| customfield_10475 | email_address | Email address (renamed from context) |
| customfield_10676 | technical_issue_category | Technical issue category (renamed from satisfaction_rating) |
| customfield_10157 | satisfaction | Customer satisfaction rating (1-5) |
| customfield_10328 | first_response_* | SLA: first response (breached, goal_millis, elapsed_millis) |
| customfield_10161 | time_to_resolution_* | SLA: resolution time (breached, goal_millis, elapsed_millis) |
| customfield_11831 | l3_team | L3 team assignment (new) |
| customfield_10156 | participants | Participant user list |
| customfield_10002 | organizations | Organizations |

47
llms.txt
Normal file
@@ -0,0 +1,47 @@
# AI Data Analyst

> A data distribution platform for AI analytical systems. Syncs data from configured sources (Keboola, CSV, BigQuery), converts to Parquet, and distributes to analysts who query locally using Claude Code and DuckDB.

## Key Files

- [README](README.md): Project overview, quick start, and feature list
- [CLAUDE.md](CLAUDE.md): Claude Code project context and development instructions
- [ARCHITECTURE.md](ARCHITECTURE.md): System architecture, data flow, and component overview

## Documentation

- [Quick Start](docs/QUICKSTART.md): End-to-end setup for new deployments
- [Configuration](docs/CONFIGURATION.md): All instance.yaml options explained
- [Deployment](docs/DEPLOYMENT.md): Server provisioning and deployment guide
- [Data Sources](docs/DATA_SOURCES.md): Adapter system and data source configuration

## Core Source Code

- [src/data_sync.py](src/data_sync.py): Data sync orchestration and DataSource ABC
- [src/adapters/](src/adapters/): Pluggable data source adapters (Keboola, CSV)
- [src/parquet_manager.py](src/parquet_manager.py): CSV to Parquet conversion engine
- [src/config.py](src/config.py): Runtime configuration from data_description.md
- [config/loader.py](config/loader.py): Instance config loader with ${ENV_VAR} interpolation

## Web Application

- [webapp/app.py](webapp/app.py): Flask application entry point
- [webapp/config.py](webapp/config.py): Webapp configuration from instance.yaml
- [webapp/account_service.py](webapp/account_service.py): User account and sync management

## Server Operations

- [server/deploy.sh](server/deploy.sh): Deployment script
- [server/setup.sh](server/setup.sh): Initial server provisioning
- [server/webapp-setup.sh](server/webapp-setup.sh): Web application setup (Nginx, SSL, systemd)
- [config/instance.yaml.example](config/instance.yaml.example): Configuration template

## Configuration

Instance config: `config/instance.yaml` (YAML with `${ENV_VAR}` references for secrets).
Environment variables: `.env` file (never committed). See `config/.env.template`.
Data schema: `docs/data_description.md` (YAML blocks in markdown).

## Architecture Summary

Data Source -> Source Adapter -> Parquet files on server -> rsync over SSH -> Analyst machine -> DuckDB views -> Claude Code queries