agnes-the-ai-analyst/connectors/jira/README.md

# Jira Integration

Real-time sync of Jira support tickets for AI-powered analysis.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           JIRA CLOUD                                        │
│                    (your-org.atlassian.net)                                 │
│                                                                             │
│  Issue created/updated/deleted  ───►  Webhook POST                          │
│  Comment added/updated          ───►  with HMAC signature                   │
│  Attachment uploaded            ───►                                        │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        DATA BROKER SERVER                                   │
│                    (your-instance.example.com)                              │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Flask Webapp (/webhooks/jira)                                      │   │
│  │                                                                     │   │
│  │  1. Verify HMAC-SHA256 signature                                    │   │
│  │  2. Log raw webhook event                                           │   │
│  │  3. Extract issue key from payload                                  │   │
│  │  4. Fetch complete issue data via Jira REST API                     │   │
│  │  5. Overlay SLA fields via JSM service account (cloud API)          │   │
│  │  6. Save issue JSON to disk                                         │   │
│  │  7. Download all attachments                                        │   │
│  │  8. Trigger incremental Parquet transform                           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                        │                                    │
│                                        ▼                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  /data/src_data/raw/jira/                                           │   │
│  │  ├── issues/              # Raw JSON per issue                      │   │
│  │  │   ├── SUPPORT-15186.json                                         │   │
│  │  │   └── SUPPORT-15190.json                                         │   │
│  │  ├── attachments/         # Downloaded files                        │   │
│  │  │   └── SUPPORT-15190/                                             │   │
│  │  │       └── 56340_image.png                                        │   │
│  │  └── webhook_events/      # Audit log                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                        │                                    │
│                                        ▼                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  incremental_jira_transform.py (called automatically)               │   │
│  │                                                                     │   │
│  │  • Load saved issue JSON                                            │   │
│  │  • Extract fields, convert ADF to plain text                        │   │
│  │  • Upsert into monthly Parquet (only affected month)                │   │
│  │  • Copy to distribution directory                                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                        │                                    │
│                                        ▼                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  /data/src_data/parquet/jira/    (monthly partitioned)              │   │
│  │  ├── issues/              # 49 columns, clean schema                │   │
│  │  │   ├── 2025-01.parquet                                            │   │
│  │  │   └── 2025-02.parquet                                            │   │
│  │  ├── comments/            # Extracted comment text                  │   │
│  │  ├── attachments/         # Metadata + local paths                  │   │
│  │  ├── changelog/           # Field change history                    │   │
│  │  ├── issuelinks/          # Links between issues                    │   │
│  │  └── remote_links/        # External links (Confluence, Slack)      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        │ rsync
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ANALYST MACHINE                                     │
│                                                                             │
│  ~/data-analysis/                                                           │
│  └── server/                                                                │
│      └── parquet/                                                           │
│          └── jira/           # Synced Parquet + attachments                 │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Claude Code + DuckDB                                               │   │
│  │                                                                     │   │
│  │  -- Query all months with glob pattern                              │   │
│  │  SELECT * FROM 'server/parquet/jira/issues/*.parquet'               │   │
│  │  WHERE severity LIKE '%Medium%';                                    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Components

### 1. Jira Webhook Configuration

**Location:** https://your-org.atlassian.net/plugins/servlet/webhooks

| Setting | Value |
|---------|-------|
| URL | `https://your-instance.example.com/webhooks/jira` |
| Secret | Same as `JIRA_WEBHOOK_SECRET` in server `.env` |
| JQL Filter | `project = "Your Project"` |

**Subscribed Events:**
- Issue: created, updated, deleted
- Comment: created, updated
- Attachment: created
- Issue link: created

### 2. Webhook Receiver

**File:** `connectors/jira/webhook.py`

Flask blueprint that handles incoming webhooks:

```python
@jira_bp.route("/jira", methods=["POST"])
def receive_jira_webhook():
    # 1. Verify HMAC signature
    # 2. Parse JSON payload
    # 3. Log event to webhook_events/
    # 4. Call jira_service.process_webhook_event()
```

**Endpoints:**

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/webhooks/jira` | POST | Receive webhooks from Jira |
| `/webhooks/jira/health` | GET | Health check, shows config status |
| `/webhooks/jira/test` | POST | Manual issue fetch (debug mode only) |

### 3. Jira Service

**File:** `connectors/jira/service.py`

Handles Jira API communication and data persistence:

```python
class JiraService:
    def fetch_issue(issue_key) -> dict
        # GET /rest/api/3/issue/{key}?expand=renderedFields,changelog&fields=*all

    def fetch_sla_fields(issue_key) -> dict | None
        # GET via cloud API with JSM service account
        # Returns SLA fields (first_response_time, time_to_resolution)

    def save_issue(issue_data) -> Path
        # 1. Fetch remote links
        # 2. Overlay SLA fields from service account
        # 3. Save to /data/src_data/raw/jira/issues/{key}.json
        # 4. Download attachments

    def download_attachment(attachment, issue_key) -> Path
        # GET attachment content URL with auth
        # Save to attachments/{issue_key}/{id}_{filename}
```

**Why fetch after webhook?**
- Webhook payload contains minimal data
- Full issue data requires API call with `fields=*all`
- Ensures we have complete, consistent data

**Why two API tokens?**
- Personal token fetches all fields except SLA (lacks JSM Agent licence)
- JSM service account token fetches SLA fields via Atlassian Cloud API
- SLA data is overlayed into the issue JSON before saving

### 4. Data Transformation

Two transformation modes are available:

#### 4a. Incremental Transform (Real-Time)

**File:** `connectors/jira/incremental_transform.py`

Called automatically by webhook handler after saving issue JSON and attachments. Updates only the affected monthly Parquet file.

```python
# Called from jira_service.py after save_issue()
from connectors.jira.incremental_transform import transform_single_issue

transform_single_issue(
    issue_key="SUPPORT-1234",
    deleted=False,  # or True for deletion events
)
```

**How it works:**
1. Loads the saved JSON for the issue
2. Determines the month from `created_at` date
3. Loads existing Parquet for that month (if any)
4. Upserts issue data (removes old, adds new)
5. Saves updated Parquet
6. Copies to distribution directory for rsync

**Benefits:**
- Data available within seconds of Jira change
- Only updates one monthly file (~50-100KB)
- Rsync transfers only changed files

#### 4b. Batch Transform (Initial Load / Recovery)

**File:** `connectors/jira/transform.py`

Used for initial historical load or to rebuild all Parquet from raw JSON.

```bash
python -m connectors.jira.transform \
    --raw-dir /data/src_data/raw/jira \
    --output-dir /data/src_data/parquet/jira \
    --attachments-dir /data/src_data/raw/jira/attachments
```

**Common transformations (both modes):**
- Extracts plain text from ADF (Atlassian Document Format)
- Maps custom field IDs to human-readable names
- Normalizes nested structures into flat tables
- Links attachments to local file paths
- Enforces explicit PyArrow schema for consistent types across months

### 5. Data Distribution

Analysts sync data via rsync (same as other data):

```bash
bash server/scripts/sync_data.sh
```

This syncs:
- `server/parquet/jira/` - Parquet tables (issues, comments, attachments metadata, changelog, issuelinks, remote_links)

For attachment files, see [Attachment Access](#attachment-access) section below.

## Data Flow Timeline (Real-Time)

```
T+0ms    Jira: Issue updated
T+50ms   Jira: Webhook POST to our server
T+100ms  Server: Verify signature, log event
T+150ms  Server: GET /rest/api/3/issue/{key} from Jira API
T+400ms  Server: GET SLA fields via JSM service account (cloud API)
T+500ms  Server: Save JSON (with SLA overlay) to raw/jira/issues/
T+600ms  Server: Download attachments (parallel)
T+800ms  Server: Incremental transform → update monthly Parquet
T+900ms  Server: Copy to distribution directory
T+1000ms Server: Return 200 OK to Jira

(analyst sync - any time)
T+Xsec   Analyst: bash sync_data.sh
T+Xsec   Analyst: rsync downloads only changed monthly file (~50KB)
T+Xsec   Analyst: Query with DuckDB - sees latest data
```

**Key improvement:** Incremental transform runs immediately after webhook processing, so data is available for sync within seconds of the Jira change.

## Configuration

### Server Environment Variables

In `/opt/data-analyst/.env`:

```bash
# Jira webhook integration
JIRA_WEBHOOK_SECRET=<random 64-char hex string>
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token from Atlassian>

# Jira SLA service account (JSM Agent licence for SLA fields)
JIRA_SLA_EMAIL=<JSM service account email>
JIRA_SLA_API_TOKEN=<API token from 1Password>
JIRA_CLOUD_ID=f0f7a244-4fb4-41f9-b1f0-b79e24a20f11
```

### GitHub Secrets

| Secret | Description |
|--------|-------------|
| `JIRA_WEBHOOK_SECRET` | HMAC secret for webhook verification |
| `JIRA_DOMAIN` | Jira Cloud domain |
| `JIRA_EMAIL` | Email for API authentication |
| `JIRA_API_TOKEN` | API token from Atlassian account |
| `JIRA_SLA_EMAIL` | JSM service account email (for SLA fields) |
| `JIRA_SLA_API_TOKEN` | JSM service account API token |
| `JIRA_CLOUD_ID` | Atlassian Cloud site ID |

### Getting Jira API Token

1. Go to https://id.atlassian.com/manage-profile/security/api-tokens
2. Click "Create API token"
3. Name it (e.g., "Data Analyst Integration")
4. Copy token to `JIRA_API_TOKEN`

**⚠️ IMPORTANT: API tokens expire after 365 days maximum (Atlassian limitation).**

Set a calendar reminder to rotate the token before expiration. When rotating:
1. Create new token in Atlassian
2. Update `JIRA_API_TOKEN` in GitHub Secrets and server `.env`
3. Restart webapp: `sudo systemctl restart webapp`
4. Test: `curl https://your-instance.example.com/webhooks/jira/health`

## Directory Structure

```
/data/src_data/
├── raw/
│   └── jira/                      # Raw data from webhooks
│       ├── issues/                # One JSON file per issue
│       │   ├── SUPPORT-15186.json
│       │   ├── SUPPORT-15189.json
│       │   └── SUPPORT-15190.json
│       ├── attachments/           # Downloaded files (by issue key)
│       │   ├── SUPPORT-15189/
│       │   │   ├── 56337_image.png
│       │   │   └── 56338_image-20260203-110549.png
│       │   └── SUPPORT-15190/
│       │       └── 56340_image.png
│       └── webhook_events/        # Audit log of all webhooks
│           ├── 20260203_105203_jira_issue_updated.json
│           └── 20260203_110457_comment_created.json
│
└── parquet/
    └── jira/                      # Transformed data (monthly partitioned)
        ├── issues/                # Main issues table
        │   ├── 2025-01.parquet
        │   ├── 2025-02.parquet
        │   └── ...
        ├── comments/              # Issue comments
        │   └── YYYY-MM.parquet
        ├── attachments/           # Attachment metadata
        │   └── YYYY-MM.parquet
        ├── changelog/             # Field change history
        │   └── YYYY-MM.parquet
        ├── issuelinks/            # Links between issues
        │   └── YYYY-MM.parquet
        └── remote_links/          # External links (Confluence, Slack, etc.)
            └── YYYY-MM.parquet
```

**Monthly Partitioning Benefits:**
- Efficient rsync: only changed months are transferred
- Better performance: smaller files for ~15,000 total tickets
- Incremental updates: new months don't rewrite old data

## Monitoring

### Health Check

```bash
curl https://your-instance.example.com/webhooks/jira/health
```

Response:
```json
{
  "status": "ok",
  "configured": true,
  "webhook_secret_set": true,
  "jira_domain": "your-org.atlassian.net"
}
```

### Logs

```bash
# Webapp logs (webhook processing)
tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira

# Recent webhook events
ls -lt /data/src_data/raw/jira/webhook_events/ | head -20

# Issue count
ls /data/src_data/raw/jira/issues/ | wc -l

# Attachment count
find /data/src_data/raw/jira/attachments/ -type f | wc -l
```

## Security

| Layer | Protection |
|-------|------------|
| Webhook | HMAC-SHA256 signature verification |
| API Auth | HTTP Basic Auth (email + API token) |
| Storage | Server directories with `data-ops` group permissions |
| Transport | HTTPS only (Let's Encrypt certificate) |

**Webhook Signature Verification:**
```python
expected = hmac.new(
    secret.encode('utf-8'),
    request.get_data(),
    hashlib.sha256
).hexdigest()

if not hmac.compare_digest(signature, expected):
    abort(401)
```

## Troubleshooting

### Webhook not received

1. Check Jira webhook is enabled and URL is correct
2. Verify JQL filter matches the issue's project
3. Check server firewall allows HTTPS from Atlassian IPs

### Signature verification fails

1. Verify `JIRA_WEBHOOK_SECRET` matches in both Jira and server `.env`
2. Check for trailing whitespace in secret
3. Restart webapp after changing `.env`

### Attachments not downloading

1. Check `JIRA_API_TOKEN` is valid
2. Verify API token has read access to attachments
3. Check disk space on `/data` partition
4. Large attachments (>50MB) are skipped by design

### Missing data in Parquet

1. Run transformation manually:
   ```bash
   python -m connectors.jira.transform \
       --raw-dir /data/src_data/raw/jira \
       --output-dir /data/src_data/parquet/jira \
       --attachments-dir /data/src_data/raw/jira/attachments
   ```
2. Check for errors in transformation output
3. Verify raw JSON files exist in `raw/jira/issues/`
4. Note: Output files are partitioned by month (e.g., `issues/2026-01.parquet`)

## Schema Reference

See [docs/jira_schema.md](jira_schema.md) for detailed table schemas and example queries.

## Historical Backfill

For initial setup or recovery, use the backfill script to download all historical issues.

**File:** `connectors/jira/scripts/backfill.py`

```bash
# Download all SUPPORT tickets (idempotent, skips existing)
python -m connectors.jira.scripts.backfill --parallel 4

# Environment variables required:
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token>
JIRA_DATA_DIR=/data/src_data/raw/jira  # optional, default path
```

**Features:**
- Uses new Jira Cloud API (`POST /rest/api/3/search/jql` with `nextPageToken`)
- Parallel downloads (configurable workers)
- Downloads all attachments
- Idempotent - skips already downloaded issues
- Handles rate limiting gracefully

**SLA backfill** (separate script, uses JSM service account):

**File:** `connectors/jira/scripts/backfill_sla.py`

```bash
# Fetch SLA fields for all issues (uses JIRA_SLA_* env vars)
python -m connectors.jira.scripts.backfill_sla --parallel 8

# Dry run (count files needing update):
python -m connectors.jira.scripts.backfill_sla --dry-run
```

The personal API token lacks JSM Agent licence needed for SLA fields.
This script uses the JSM service account via the
Atlassian Cloud API (`api.atlassian.com`) to fetch and embed SLA data
into existing raw JSON files.

**After backfill, run batch transform:**
```bash
python -m connectors.jira.transform \
    --raw-dir /data/src_data/raw/jira \
    --output-dir /data/src_data/parquet/jira \
    --attachments-dir /data/src_data/raw/jira/attachments

# Copy to distribution directory
cp -r /data/src_data/parquet/jira/* ~/server/parquet/jira/
```

## SLA Polling (Open Tickets)

SLA elapsed values (`first_response_elapsed_millis`, `time_to_resolution_elapsed_millis`) only update when a webhook fires. For idle open tickets (~49 tickets, ~0.3% of dataset), these values go stale and no longer reflect the actual current elapsed time.

**File:** `connectors/jira/scripts/poll_sla.py`

The SLA polling job runs every 15 minutes via systemd timer (`jira-sla-poll.timer`) as `root:data-ops` and:

1. Reads Parquet to find open issues with SLA data
2. Fetches fresh SLA **and status** fields via JSM service account (cloud API)
3. Updates raw JSON atomically (`tempfile.mkstemp()` + `os.fchmod(fd, 0o660)` + `os.replace()`)
4. Triggers incremental Parquet transform (inside advisory file lock)

**Self-healing:** The poll fetches `status`, `resolution`, `resolutiondate`, and `updated` alongside the SLA fields. If a ticket is resolved in Jira but still appears "open" in Parquet (e.g. due to a missed webhook), the poll automatically corrects the status in JSON and re-transforms to Parquet. Log output: `Self-healing: SUPPORT-XXXX is resolved in Jira`. This was added in response to [#203](https://github.com/keboola/agnes-the-ai-analyst/issues/203) where 12 tickets were permanently stale after a permission bug prevented webhooks from updating JSON files.

**File locking:** The entire read-modify-write + Parquet transform is wrapped in a per-issue advisory file lock (`connectors/jira/file_lock.py`) to prevent races with the webhook handler. The webhook handler (`connectors/jira/service.py`) uses the same lock. Different issue keys don't block each other.

**Important — `mkstemp` and ACL:** The `issues/` directory uses POSIX ACLs with `default:mask::rwx`. `tempfile.mkstemp()` creates files with mode `0600`, which overrides the ACL mask to `---` and breaks group access for www-data (webhook handler) and deploy (batch transform). The `os.fchmod(fd, 0o660)` call immediately after `mkstemp()` restores the mask to `rw-`, preserving ACL-based access. See [#203](https://github.com/keboola/agnes-the-ai-analyst/issues/203) for the full incident report.

```bash
# Manual run
python -m connectors.jira.scripts.poll_sla

# Dry run (count open issues)
python -m connectors.jira.scripts.poll_sla --dry-run

# Verbose logging
python -m connectors.jira.scripts.poll_sla --verbose
```

**Return states:**
- `updated` — SLA fields refreshed, status unchanged
- `healed` — status corrected (ticket was resolved in Jira but stale locally)
- `skipped` — no valid SLA data and ticket not resolved
- `failed` — API error or transform failure

**Note:** `sla_cycle_type` (ongoing/completed) is not stored in Parquet — compute it on-the-fly in DuckDB:
```sql
SELECT issue_key,
    CASE WHEN status_category = 'Done' THEN 'completed' ELSE 'ongoing' END AS sla_cycle_type,
    first_response_elapsed_millis,
    time_to_resolution_elapsed_millis
FROM 'server/parquet/jira/issues/*.parquet'
WHERE first_response_elapsed_millis IS NOT NULL
```

## Analyst Sync Configuration

Jira data is an **optional dataset** - not synced by default to save bandwidth.

**Enable Jira sync:**
```bash
# Edit local config (created on first sync_data.sh run)
nano ~/.config/data-analyst/sync.yaml

# Change:
datasets:
  jira: true              # Enable parquet data (~50MB)
  jira_attachments: false # Keep false unless you need actual files
```

**Then sync:**
```bash
bash server/scripts/sync_data.sh
```

DuckDB views for Jira tables are created automatically if data exists:
- `jira_issues` - main issues table
- `jira_comments` - issue comments
- `jira_attachments` - attachment metadata (filenames, sizes, URLs)
- `jira_changelog` - field change history
- `jira_issuelinks` - links between issues (blocks, duplicates, relates to)
- `jira_remote_links` - external links (Confluence, Slack, etc.)

## Attachment Access

Attachments (images, logs, PDFs) are stored separately from parquet data.

### Option 1: Download per-ticket (recommended)

Download attachments for a specific ticket to local temp folder:

```bash
# Download all attachments for one ticket
rsync -avz data-analyst:server/jira_attachments/SUPPORT-1234/ /tmp/SUPPORT-1234/

# View locally
ls /tmp/SUPPORT-1234/
open /tmp/SUPPORT-1234/screenshot.png  # macOS
```

This is fast (only downloads files for one ticket) and keeps your local machine clean.

### Option 2: Sync attachments locally (for heavy analysis)

If you need frequent access to attachments, enable full sync:

```yaml
# ~/.config/data-analyst/sync.yaml
datasets:
  jira: true
  jira_attachments: true   # Syncs ~500MB+ of files
```

Then `sync_data.sh` will rsync attachments to `./server/jira_attachments/`.

### Finding attachment path from parquet

The `jira_attachments` table has a `local_path` column with the server path:

```sql
SELECT
    issue_key,
    filename,
    local_path,
    size_bytes
FROM jira_attachments
WHERE issue_key = 'SUPPORT-1234';
```

Result:
```
issue_key     | filename        | local_path                                           | size_bytes
SUPPORT-1234  | screenshot.png  | /data/src_data/raw/jira/attachments/SUPPORT-1234/... | 45678
```

To access locally (if synced): replace `/data/src_data/raw/jira/attachments/` with `./server/jira_attachments/`.

## Future Improvements

- [x] ~~Automatic Parquet regeneration after each webhook~~ (Implemented: incremental transform)
- [x] ~~Incremental Parquet updates~~ (Implemented: upsert by issue_key)
- [x] ~~Full historical sync from Jira~~ (Implemented: jira_backfill.py)
- [x] ~~SLA polling for open tickets~~ (Implemented: jira_poll_sla.py, 15min timer)
- [ ] Comment attachment extraction (inline images in ADF)
- [ ] Custom field name resolution from Jira metadata API
- [ ] Attachment binary sync to analysts (currently metadata only)