agnes-the-ai-analyst/docs/jira_schema.md
Petr c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00

# Jira Support Tickets Schema
This document describes the schema of transformed Jira data available for analysis.
## Data Location
```
/data/src_data/parquet/jira/     # Transformed Parquet files (monthly chunks)
├── issues/                      # Main issues table
│   ├── 2025-01.parquet
│   ├── 2025-02.parquet
│   └── ...
├── comments/                    # Issue comments
│   └── YYYY-MM.parquet
├── attachments/                 # Attachment metadata with local paths
│   └── YYYY-MM.parquet
├── changelog/                   # Change history
│   └── YYYY-MM.parquet
├── issuelinks/                  # Links between issues
│   └── YYYY-MM.parquet
└── remote_links/                # External links (Confluence, Slack, etc.)
    └── YYYY-MM.parquet

/data/src_data/raw/jira/         # Raw data (JSON + files)
├── issues/                      # Raw JSON per issue
├── attachments/                 # Downloaded attachment files
│   └── {issue_key}/             # By issue key (e.g., SUPPORT-15051/)
│       └── {id}_{filename}      # e.g., 56340_screenshot.png
└── webhook_events/              # Raw webhook payloads (audit)
```
**Monthly Partitioning:** Parquet files are partitioned by month based on the `created_at` timestamp. This keeps rsync efficient (only months whose files changed are transferred) and keeps individual files small for a dataset of ~15,000 tickets.
**DuckDB Query Pattern:** Use glob patterns to query all months:
```sql
SELECT * FROM 'server/parquet/jira/issues/*.parquet';
```
## Tables
### issues
Main table with support ticket information.
| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Unique issue identifier (e.g., "SUPPORT-15190") |
| `issue_id` | string | Jira internal ID |
| `issue_url` | string | Direct URL to issue in Jira |
| `summary` | string | Issue title/summary |
| `description` | string | Full description (plain text, extracted from ADF) |
| `issue_type` | string | Type (Service Request, Bug, etc.) |
| `status` | string | Current status (New, Under Review, Resolved, etc.) |
| `status_category` | string | Status category (To Do, In Progress, Done) |
| `priority` | string | Priority level (Lowest, Low, Medium, High, Highest) |
| `resolution` | string | Resolution type if resolved |
| `project_key` | string | Project key (SUPPORT) |
| `project_name` | string | Project name (e.g., your Jira project name) |
| `creator_email` | string | Email of ticket creator |
| `creator_name` | string | Display name of creator |
| `reporter_email` | string | Email of reporter |
| `reporter_name` | string | Display name of reporter |
| `assignee_email` | string | Email of assigned agent |
| `assignee_name` | string | Display name of assignee |
| `created_at` | datetime | When ticket was created |
| `updated_at` | datetime | Last update timestamp |
| `resolved_at` | datetime | When ticket was resolved (null if open) |
| `due_date` | string | Due date if set |
| `labels` | string (JSON) | Array of labels as JSON |
| `attachment_count` | int | Number of attachments |
| `comment_count` | int | Number of comments |
| `issuelink_count` | int | Number of linked issues |
| `request_type` | string | Service Desk request type name |
| `request_status` | string | Service Desk specific status |
| `severity` | string | Severity level (custom field) |
| `triage` | string (JSON) | Triage multi-select (renamed from team_tier) |
| `configuration_item` | string (JSON) | Configuration item multi-select (renamed from categories) |
| `participants` | string (JSON) | List of participant emails |
| `organizations` | string (JSON) | Related organizations |
| `spam` | string | Spam flag (True/null) |
| `context` | string | Context field (renamed from root_cause; maps to customfield_10330) |
| `keboola_platform_url` | string | Keboola platform URL (renamed from resolution_summary) |
| `slack_link` | string | Slack link (renamed from customer_type) |
| `technical_issue_category` | string | Technical issue category (renamed from satisfaction_rating) |
| `email_address` | string | Email address field (renamed from context; maps to customfield_10475) |
| `satisfaction` | int | Customer satisfaction rating (1-5) |
| `first_response_breached` | string | SLA: whether first response SLA was breached (True/False) |
| `first_response_goal_millis` | int | SLA: first response goal duration in milliseconds |
| `first_response_elapsed_millis` | int | SLA: actual first response time in milliseconds |
| `time_to_resolution_breached` | string | SLA: whether resolution SLA was breached (True/False) |
| `time_to_resolution_goal_millis` | int | SLA: resolution goal duration in milliseconds |
| `time_to_resolution_elapsed_millis` | int | SLA: actual resolution time in milliseconds |
| `l3_team` | string | L3 team assignment (new) |
| `_synced_at` | string | When data was synced from Jira |
| `_raw_file` | string | Source JSON filename |
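
Several `issues` columns are stored as strings even though they carry structured values: JSON arrays (`labels`, `participants`, ...), `True`/`False` flags, and SLA durations in milliseconds. The following is a minimal Python sketch of how an analyst might normalize them after loading rows. The helper names and the sample row are hypothetical and are not part of the sync engine.

```python
import json
from typing import Optional


def decode_json_list(value: Optional[str]) -> list:
    """Decode a JSON-array column such as `labels`; null or empty becomes []."""
    if not value:
        return []
    decoded = json.loads(value)
    # Always hand back a list, even if a scalar was stored.
    return decoded if isinstance(decoded, list) else [decoded]


def parse_flag(value: Optional[str]) -> Optional[bool]:
    """Map a string flag ("True"/"False"/null) to bool or None."""
    if value is None:
        return None
    return value.strip().lower() == "true"


def millis_to_hours(millis: Optional[int]) -> Optional[float]:
    """Convert an SLA duration in milliseconds to hours."""
    if millis is None:
        return None
    return millis / 3_600_000


# Hypothetical row, shaped like the issues table above.
row = {
    "labels": '["billing", "urgent"]',
    "first_response_breached": "False",
    "first_response_elapsed_millis": 5_400_000,
    "spam": None,
}
labels = decode_json_list(row["labels"])
breached = parse_flag(row["first_response_breached"])
hours = millis_to_hours(row["first_response_elapsed_millis"])
```

Keeping null distinct from `False` matters for `spam`, which is documented as `True`/null rather than `True`/`False`.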
### comments
Issue comments from support conversations.
| Column | Type | Description |
|--------|------|-------------|
| `comment_id` | string | Unique comment ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `author_email` | string | Comment author email |
| `author_name` | string | Comment author display name |
| `body` | string | Comment text (plain text, extracted from ADF) |
| `created_at` | datetime | When comment was created |
| `updated_at` | datetime | When comment was last edited |
| `update_author_email` | string | Who last edited the comment |
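
The `body` column (like `description` in `issues`) holds plain text extracted from Atlassian Document Format (ADF). As an illustration of what such an extraction involves, here is a small recursive walk over an ADF document. It is a sketch under simplifying assumptions (only text, hard breaks, and a few block containers are handled) and is not the project's actual extractor; the function name is hypothetical.

```python
from typing import Any

# Node types whose children are joined with newlines (assumed subset of ADF).
BLOCK_CONTAINERS = {"doc", "bulletList", "orderedList", "listItem", "blockquote", "panel"}


def adf_to_text(node: Any) -> str:
    """Recursively collect the text content of an ADF node."""
    if not isinstance(node, dict):
        return ""
    node_type = node.get("type")
    if node_type == "text":
        return node.get("text", "")
    if node_type == "hardBreak":
        return "\n"
    parts = [adf_to_text(child) for child in node.get("content", [])]
    if node_type in BLOCK_CONTAINERS:
        # One line per non-empty block child.
        return "\n".join(part for part in parts if part)
    # Paragraphs, headings, etc.: concatenate inline children.
    return "".join(parts)


doc = {
    "type": "doc",
    "version": 1,
    "content": [
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Job failed "},
            {"type": "text", "text": "again", "marks": [{"type": "strong"}]},
        ]},
        {"type": "paragraph", "content": [{"type": "text", "text": "See logs."}]},
    ],
}
plain = adf_to_text(doc)  # "Job failed again\nSee logs."
```

Formatting marks such as bold are dropped, which matches the "plain text" description of the column.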
### attachments
Attachment metadata with local file paths.
| Column | Type | Description |
|--------|------|-------------|
| `attachment_id` | string | Unique attachment ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `filename` | string | Original filename |
| `local_path` | string | Server path to downloaded file |
| `hierarchical_path` | string | Hierarchical path for future use (e.g., `15/051/56340_file.png`) |
| `size_bytes` | int | File size in bytes |
| `mime_type` | string | MIME type (image/png, application/pdf, etc.) |
| `author_email` | string | Who uploaded the attachment |
| `created_at` | datetime | When attachment was uploaded |
| `content_url` | string | Jira API URL to download |
| `thumbnail_url` | string | Jira API URL for thumbnail (images only) |
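
The two path columns can be related to the directory layout under Data Location. The sketch below builds `local_path` from the documented `{issue_key}/{id}_{filename}` layout. The `hierarchical_path` rule is only inferred from the single example `15/051/56340_file.png` (issue number split into thousands and a zero-padded remainder) and should be treated as an assumption. Both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Root taken from the Data Location section.
RAW_ATTACHMENTS_ROOT = PurePosixPath("/data/src_data/raw/jira/attachments")


def local_attachment_path(issue_key: str, attachment_id: str, filename: str) -> str:
    """Server path following the documented {issue_key}/{id}_{filename} layout."""
    return str(RAW_ATTACHMENTS_ROOT / issue_key / f"{attachment_id}_{filename}")


def hierarchical_attachment_path(issue_key: str, attachment_id: str, filename: str) -> str:
    """ASSUMED rule: split the issue number into thousands / 3-digit remainder."""
    number = int(issue_key.rsplit("-", 1)[1])
    return f"{number // 1000}/{number % 1000:03d}/{attachment_id}_{filename}"


local = local_attachment_path("SUPPORT-15051", "56340", "screenshot.png")
hier = hierarchical_attachment_path("SUPPORT-15051", "56340", "file.png")
```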
### changelog
History of all field changes on issues.
| Column | Type | Description |
|--------|------|-------------|
| `change_id` | string | Change history ID |
| `issue_key` | string | Parent issue key (FK to issues) |
| `author_email` | string | Who made the change |
| `author_name` | string | Display name of who made change |
| `field_name` | string | Name of changed field |
| `field_type` | string | Type of field (jira, custom) |
| `from_value` | string | Previous value (as string) |
| `to_value` | string | New value (as string) |
| `changed_at` | datetime | When change occurred |
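
One common use of the changelog is to reconstruct how long a ticket spent in each status. The following sketch does this in Python for the rows of a single issue. It assumes status transitions are recorded with `field_name = 'status'` (Jira's usual name for that field) and that `changed_at` has been loaded as a `datetime`. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def time_in_status(changes, created_at, now):
    """Sum the time one issue spent in each status, from its changelog rows."""
    status_changes = sorted(
        (c for c in changes if c["field_name"] == "status"),
        key=lambda c: c["changed_at"],
    )
    totals = defaultdict(timedelta)
    entered_at = created_at
    current = None
    for change in status_changes:
        # from_value is the status the issue was in until this change.
        totals[change["from_value"]] += change["changed_at"] - entered_at
        entered_at = change["changed_at"]
        current = change["to_value"]
    if current is not None:
        # The issue is still in its latest status.
        totals[current] += now - entered_at
    return dict(totals)


rows = [
    {"field_name": "status", "from_value": "New", "to_value": "Under Review",
     "changed_at": datetime(2025, 1, 1, 10, 0)},
    {"field_name": "priority", "from_value": "Low", "to_value": "High",
     "changed_at": datetime(2025, 1, 1, 11, 0)},
    {"field_name": "status", "from_value": "Under Review", "to_value": "Resolved",
     "changed_at": datetime(2025, 1, 1, 13, 0)},
]
durations = time_in_status(rows, datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 14, 0))
```

If an issue has no status changes, the result is empty, because the changelog alone does not reveal the initial status; the `status` column of `issues` covers that case.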
### issuelinks
Links between Jira issues (blocks, duplicates, relates to, etc.).
| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Source issue key (FK to issues) |
| `link_id` | string | Unique link ID |
| `link_type` | string | Link type name (Blocks, Duplicate, Relates, etc.) |
| `direction` | string | Link direction: "inward" or "outward" |
| `linked_issue_key` | string | Target issue key |
| `linked_issue_summary` | string | Summary of linked issue |
| `linked_issue_status` | string | Status of linked issue |
| `linked_issue_priority` | string | Priority of linked issue |
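
Because each row carries a `direction`, the same link may appear twice, once from each side. The sketch below collapses rows into unique directed edges. It rests on two assumptions that are not confirmed by this document: that a link is recorded on both issues, and that `outward` means `issue_key` is the source of the relation (for example "A blocks B"). The function name is hypothetical.

```python
def link_edges(link_rows):
    """Collapse issuelinks rows into unique (source, link_type, target) edges."""
    edges = set()
    for row in link_rows:
        if row["direction"] == "outward":
            # ASSUMPTION: issue_key is the source of an outward link.
            edge = (row["issue_key"], row["link_type"], row["linked_issue_key"])
        else:
            # "inward": the linked issue is the source.
            edge = (row["linked_issue_key"], row["link_type"], row["issue_key"])
        edges.add(edge)
    return sorted(edges)


rows = [
    {"issue_key": "SUPPORT-1", "link_type": "Blocks", "direction": "outward",
     "linked_issue_key": "SUPPORT-2"},
    {"issue_key": "SUPPORT-2", "link_type": "Blocks", "direction": "inward",
     "linked_issue_key": "SUPPORT-1"},
]
edges = link_edges(rows)
```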
### remote_links
External links attached to issues (Confluence pages, Slack threads, external URLs).
| Column | Type | Description |
|--------|------|-------------|
| `issue_key` | string | Parent issue key (FK to issues) |
| `remote_link_id` | string | Unique remote link ID |
| `url` | string | External URL |
| `title` | string | Link title/label |
| `application_name` | string | Application name (e.g., "Confluence", "Slack") |
| `application_type` | string | Application type identifier |
## Relationships
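All child tables reference `jira_issues` via the `issue_key` column. Table names in this section carry a `jira_` prefix; `jira_issues` is the `issues` table described above, and likewise for the child tables: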
```
jira_issues (PK: issue_key)
├── jira_comments      (FK: issue_key → jira_issues.issue_key)
├── jira_attachments   (FK: issue_key → jira_issues.issue_key)
├── jira_changelog     (FK: issue_key → jira_issues.issue_key)
├── jira_issuelinks    (FK: issue_key → jira_issues.issue_key)
│                      (FK: linked_issue_key → jira_issues.issue_key)
└── jira_remote_links  (FK: issue_key → jira_issues.issue_key)
```
These relationships are used by the Data Profiler to populate the Relationships tab in the catalog UI. They enable navigation between related table profiles.
**Join examples:**
```sql
-- Issues with their comments
SELECT i.issue_key, i.summary, c.body, c.created_at
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/comments/*.parquet' c ON i.issue_key = c.issue_key;
-- Issues with linked issues
SELECT i.issue_key, i.summary, l.link_type, l.direction, l.linked_issue_key
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/issuelinks/*.parquet' l ON i.issue_key = l.issue_key;
```
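
Parquet files do not enforce foreign keys, so it can be worth checking that child rows actually point at a known issue before relying on inner joins. A minimal Python sketch, assuming the rows have already been loaded as dictionaries (the function name and sample data are hypothetical):

```python
def find_orphans(child_rows, issue_keys, key_column="issue_key"):
    """Return child rows whose key does not match any known issue."""
    known = set(issue_keys)
    return [row for row in child_rows if row[key_column] not in known]


issue_keys = ["SUPPORT-1", "SUPPORT-2"]
comments = [
    {"comment_id": "10", "issue_key": "SUPPORT-1"},
    {"comment_id": "11", "issue_key": "SUPPORT-99"},
]
orphans = find_orphans(comments, issue_keys)
```

The same helper can be pointed at `linked_issue_key` via `key_column`. Unmatched values there are not necessarily errors, since a link may target an issue that was never synced.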
## Example Queries (DuckDB)
**Note:** Use glob patterns (`*.parquet`) to query all monthly chunks at once.
### Active tickets by status
```sql
SELECT status, COUNT(*) as count
FROM 'server/parquet/jira/issues/*.parquet'
WHERE resolved_at IS NULL
GROUP BY status
ORDER BY count DESC;
```
### Average resolution time by severity
```sql
SELECT
    severity,
    COUNT(*) as tickets,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 3600) as avg_hours
FROM 'server/parquet/jira/issues/*.parquet'
WHERE resolved_at IS NOT NULL
GROUP BY severity;
```
### Most active commenters
```sql
SELECT
    author_email,
    author_name,
    COUNT(*) as comments
FROM 'server/parquet/jira/comments/*.parquet'
GROUP BY author_email, author_name
ORDER BY comments DESC
LIMIT 10;
```
### Tickets with attachments
```sql
SELECT
    i.issue_key,
    i.summary,
    a.filename,
    a.local_path
FROM 'server/parquet/jira/issues/*.parquet' i
JOIN 'server/parquet/jira/attachments/*.parquet' a ON i.issue_key = a.issue_key
WHERE a.local_path IS NOT NULL;
```
### Field change frequency
```sql
SELECT
    field_name,
    COUNT(*) as changes
FROM 'server/parquet/jira/changelog/*.parquet'
GROUP BY field_name
ORDER BY changes DESC;
```
### Query specific month only
```sql
-- Query only January 2026 data
SELECT * FROM 'server/parquet/jira/issues/2026-01.parquet';
```
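
For a range of months that is wider than one file but narrower than `*.parquet`, the partition paths can be generated from the `YYYY-MM.parquet` naming scheme. The helper below is a hypothetical sketch; the default `root` simply mirrors the relative path used in the queries above.

```python
from datetime import date


def month_partitions(table, start, end, root="server/parquet/jira"):
    """Return partition paths for every month from start to end, inclusive."""
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(f"{root}/{table}/{year:04d}-{month:02d}.parquet")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


paths = month_partitions("issues", date(2025, 11, 15), date(2026, 1, 3))
# ['server/parquet/jira/issues/2025-11.parquet',
#  'server/parquet/jira/issues/2025-12.parquet',
#  'server/parquet/jira/issues/2026-01.parquet']
```

The resulting list can be passed to DuckDB's `read_parquet([...])`. A month with no tickets may have no file, so the paths should be checked for existence first.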
## Data Freshness
- Data is synced in **real time** via Jira webhooks
- Each issue update triggers the pipeline: webhook → fetch → save JSON → download attachments → **incremental Parquet transform**
- Parquet files are updated within seconds of a change in Jira (only the affected month's file is rewritten)
- Raw JSON is kept for audit and debugging
- Historical data can be loaded via `scripts/jira_backfill.py`
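
The "only affected month is rewritten" behaviour follows from partitioning on `created_at`: an issue's creation time never changes, so every update lands in the same monthly file. The sketch below models this with in-memory dictionaries instead of Parquet files. It is an illustration of the idea, not the actual transform, and the function names are hypothetical.

```python
from datetime import datetime


def partition_key(created_at: datetime) -> str:
    """Month key matching the YYYY-MM.parquet file naming."""
    return created_at.strftime("%Y-%m")


def apply_issue_update(partitions: dict, issue: dict) -> str:
    """Upsert one issue and return the month key that must be rewritten."""
    month = partition_key(issue["created_at"])
    # Keyed by issue_key so a repeated webhook replaces the row, not duplicates it.
    partitions.setdefault(month, {})[issue["issue_key"]] = issue
    return month


partitions = {}
created = datetime(2025, 2, 3, 10, 0)
first = apply_issue_update(
    partitions, {"issue_key": "SUPPORT-1", "created_at": created, "status": "New"}
)
second = apply_issue_update(
    partitions, {"issue_key": "SUPPORT-1", "created_at": created, "status": "Resolved"}
)
```

Both updates return `2025-02`, and that partition still holds a single row for `SUPPORT-1` with the latest status.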
## Viewing Attachments
Attachments are stored on the server at `/data/src_data/raw/jira/attachments/{issue_key}/`.
Analysts can access them via symlink at `~/server/jira_attachments/`.
**Download attachments for a specific ticket:**
```bash
# Rsync one ticket's attachments to local temp folder
rsync -avz data-analyst:server/jira_attachments/SUPPORT-1234/ /tmp/SUPPORT-1234/
# View locally
ls /tmp/SUPPORT-1234/
open /tmp/SUPPORT-1234/56340_screenshot.png # macOS; files are stored as {id}_{filename}
```
**Find attachment info from parquet:**
```sql
SELECT issue_key, filename, size_bytes, local_path
FROM 'server/parquet/jira/attachments/*.parquet'
WHERE issue_key = 'SUPPORT-1234';
```
## Custom Field Reference
| Field ID | Column Name | Description |
|----------|-------------|-------------|
| `customfield_10004` | `severity` | Severity: 1-Highest to 5-Lowest |
| `customfield_10323` | `triage` | Triage multi-select (renamed from team_tier) |
| `customfield_10511` | `configuration_item` | Configuration item multi-select (renamed from categories) |
| `customfield_10365` | `spam` | Spam flag: True/null |
| `customfield_10010` | `request_type_info` | Service Desk request type metadata |
| `customfield_10330` | `context` | Context field (renamed from root_cause) |
| `customfield_10325` | `keboola_platform_url` | Keboola platform URL (renamed from resolution_summary) |
| `customfield_10350` | `slack_link` | Slack link (renamed from customer_type) |
| `customfield_10475` | `email_address` | Email address (renamed from context) |
| `customfield_10676` | `technical_issue_category` | Technical issue category (renamed from satisfaction_rating) |
| `customfield_10157` | `satisfaction` | Customer satisfaction rating (1-5) |
| `customfield_10328` | `first_response_*` | SLA: first response (breached, goal_millis, elapsed_millis) |
| `customfield_10161` | `time_to_resolution_*` | SLA: resolution time (breached, goal_millis, elapsed_millis) |
| `customfield_11831` | `l3_team` | L3 team assignment (new) |
| `customfield_10156` | `participants` | Participant user list |
| `customfield_10002` | `organizations` | Organizations |
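
When working with the raw JSON in `/data/src_data/raw/jira/issues/`, the table above can be expressed as a lookup from Jira field ID to column name. The sketch below covers only the one-to-one mappings; `customfield_10010` and the two SLA fields are left out because they expand into several columns. It copies values as-is, whereas the real transform presumably flattens nested objects, so treat it as an illustration. The function name is hypothetical.

```python
# One-to-one mappings taken from the Custom Field Reference table.
CUSTOM_FIELD_COLUMNS = {
    "customfield_10004": "severity",
    "customfield_10323": "triage",
    "customfield_10511": "configuration_item",
    "customfield_10365": "spam",
    "customfield_10330": "context",
    "customfield_10325": "keboola_platform_url",
    "customfield_10350": "slack_link",
    "customfield_10475": "email_address",
    "customfield_10676": "technical_issue_category",
    "customfield_10157": "satisfaction",
    "customfield_11831": "l3_team",
    "customfield_10156": "participants",
    "customfield_10002": "organizations",
}


def rename_custom_fields(fields: dict) -> dict:
    """Pick the known custom fields out of a raw Jira `fields` dict."""
    return {
        column: fields.get(field_id)
        for field_id, column in CUSTOM_FIELD_COLUMNS.items()
    }


# Hypothetical excerpt of a raw issue's `fields` object.
raw_fields = {"summary": "Job failed", "customfield_10004": "1-Highest"}
columns = rename_custom_fields(raw_fields)
```

Fields that are absent from the raw issue come back as `None`, which lines up with the nullable columns in the `issues` table.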