Jira Integration
Real-time sync of Jira support tickets for AI-powered analysis.
Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ JIRA CLOUD │
│ (your-org.atlassian.net) │
│ │
│ Issue created/updated/deleted ───► Webhook POST │
│ Comment added/updated ───► with HMAC signature │
│ Attachment uploaded ───► │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA BROKER SERVER │
│ (your-instance.example.com) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Flask Webapp (/webhooks/jira) │ │
│ │ │ │
│ │ 1. Verify HMAC-SHA256 signature │ │
│ │ 2. Log raw webhook event │ │
│ │ 3. Extract issue key from payload │ │
│ │ 4. Fetch complete issue data via Jira REST API │ │
│ │ 5. Overlay SLA fields via JSM service account (cloud API) │ │
│ │ 6. Save issue JSON to disk │ │
│ │ 7. Download all attachments │ │
│ │ 8. Trigger incremental Parquet transform │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ /data/src_data/raw/jira/ │ │
│ │ ├── issues/ # Raw JSON per issue │ │
│ │ │ ├── SUPPORT-15186.json │ │
│ │ │ └── SUPPORT-15190.json │ │
│ │ ├── attachments/ # Downloaded files │ │
│ │ │ └── SUPPORT-15190/ │ │
│ │ │ └── 56340_image.png │ │
│ │ └── webhook_events/ # Audit log │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ incremental_transform.py (called automatically) │ │
│ │ │ │
│ │ • Load saved issue JSON │ │
│ │ • Extract fields, convert ADF to plain text │ │
│ │ • Upsert into monthly Parquet (only affected month) │ │
│ │ • Copy to distribution directory │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ /data/src_data/parquet/jira/ (monthly partitioned) │ │
│ │ ├── issues/ # 49 columns, clean schema │ │
│ │ │ ├── 2025-01.parquet │ │
│ │ │ └── 2025-02.parquet │ │
│ │ ├── comments/ # Extracted comment text │ │
│ │ ├── attachments/ # Metadata + local paths │ │
│ │ ├── changelog/ # Field change history │ │
│ │ ├── issuelinks/ # Links between issues │ │
│ │ └── remote_links/ # External links (Confluence, Slack) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ rsync
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ANALYST MACHINE │
│ │
│ ~/data-analysis/ │
│ └── server/ │
│ └── parquet/ │
│ └── jira/ # Synced Parquet + attachments │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Claude Code + DuckDB │ │
│ │ │ │
│ │ -- Query all months with glob pattern │ │
│ │ SELECT * FROM 'server/parquet/jira/issues/*.parquet' │ │
│ │ WHERE severity LIKE '%Medium%'; │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Components
1. Jira Webhook Configuration
Location: https://your-org.atlassian.net/plugins/servlet/webhooks
| Setting | Value |
|---|---|
| URL | https://your-instance.example.com/webhooks/jira |
| Secret | Same as JIRA_WEBHOOK_SECRET in server .env |
| JQL Filter | project = "Your Project" |
Subscribed Events:
- Issue: created, updated, deleted
- Comment: created, updated
- Attachment: created
- Issue link: created
2. Webhook Receiver
File: app/api/jira_webhooks.py
Flask blueprint that handles incoming webhooks:
@jira_bp.route("/jira", methods=["POST"])
def receive_jira_webhook():
    # 1. Verify HMAC signature
    # 2. Parse JSON payload
    # 3. Log event to webhook_events/
    # 4. Call jira_service.process_webhook_event()
    ...
Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /webhooks/jira | POST | Receive webhooks from Jira |
| /webhooks/jira/health | GET | Health check, shows config status |
| /webhooks/jira/test | POST | Manual issue fetch (debug mode only) |
3. Jira Service
File: connectors/jira/service.py
Handles Jira API communication and data persistence:
class JiraService:
    def fetch_issue(self, issue_key) -> dict:
        # GET /rest/api/3/issue/{key}?expand=renderedFields,changelog&fields=*all
        ...

    def fetch_sla_fields(self, issue_key) -> dict | None:
        # GET via cloud API with JSM service account
        # Returns SLA fields (first_response_time, time_to_resolution)
        ...

    def save_issue(self, issue_data) -> Path:
        # 1. Fetch remote links
        # 2. Overlay SLA fields from service account
        # 3. Save to /data/src_data/raw/jira/issues/{key}.json
        # 4. Download attachments
        ...

    def download_attachment(self, attachment, issue_key) -> Path:
        # GET attachment content URL with auth
        # Save to attachments/{issue_key}/{id}_{filename}
        ...
Why fetch after webhook?
- Webhook payload contains minimal data
- Full issue data requires API call with fields=*all
- Ensures we have complete, consistent data
Why two API tokens?
- Personal token fetches all fields except SLA (lacks JSM Agent licence)
- JSM service account token fetches SLA fields via Atlassian Cloud API
- SLA data is overlaid onto the issue JSON before saving
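The overlay step can be pictured as a small dictionary merge. The sketch below is illustrative only: the function name `overlay_sla_fields` is hypothetical, and it assumes the SLA values are stored under the issue's `fields` mapping, which this README does not confirm.

```python
from __future__ import annotations

import copy

# SLA field names as listed for fetch_sla_fields above.
SLA_FIELDS = ("first_response_time", "time_to_resolution")


def overlay_sla_fields(issue: dict, sla_fields: dict | None) -> dict:
    """Return a copy of the issue with SLA values merged into issue["fields"].

    Hypothetical helper: the storage location ("fields") is an assumption.
    """
    merged = copy.deepcopy(issue)
    if not sla_fields:
        # SLA fetch returned nothing: keep the issue exactly as fetched.
        return merged
    fields = merged.setdefault("fields", {})
    for name in SLA_FIELDS:
        if name in sla_fields:
            fields[name] = sla_fields[name]
    return merged
```

Working on a copy keeps the personal-token response untouched, so a failed SLA fetch never leaves a half-merged issue behind.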
4. Data Transformation
Two transformation modes are available:
4a. Incremental Transform (Real-Time)
File: connectors/jira/incremental_transform.py
Called automatically by webhook handler after saving issue JSON and attachments. Updates only the affected monthly Parquet file.
# Called from connectors/jira/service.py after save_issue()
from connectors.jira.incremental_transform import transform_single_issue

transform_single_issue(
    issue_key="SUPPORT-1234",
    deleted=False,  # or True for deletion events
)
How it works:
- Loads the saved JSON for the issue
- Determines the month from the created_at date
- Loads existing Parquet for that month (if any)
- Upserts issue data (removes old, adds new)
- Saves updated Parquet
- Copies to distribution directory for rsync
Benefits:
- Data available within seconds of Jira change
- Only updates one monthly file (~50-100KB)
- Rsync transfers only changed files
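The month lookup and upsert described above can be sketched without any Parquet dependency. This is not the code in incremental_transform.py: it uses plain lists of dicts in place of Parquet tables, and the helper names `month_partition` and `upsert_issue` are hypothetical.

```python
from __future__ import annotations

from datetime import datetime


def month_partition(created_at: str) -> str:
    """Map a timestamp such as '2025-02-03T10:52:03.000+0100' to '2025-02'."""
    prefix = created_at[:7]
    # Validate the prefix; raises ValueError on malformed input.
    datetime.strptime(prefix, "%Y-%m")
    return prefix


def upsert_issue(rows: list[dict], issue: dict | None, issue_key: str) -> list[dict]:
    """Remove any old row for issue_key, then append the new one.

    Passing issue=None models a deletion event: the old row is dropped
    and nothing is added.
    """
    kept = [row for row in rows if row["issue_key"] != issue_key]
    if issue is not None:
        kept.append(issue)
    return kept
```

Because only the rows of one month are rewritten, the other monthly files keep their modification times and rsync skips them.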
4b. Batch Transform (Initial Load / Recovery)
File: connectors/jira/transform.py
Used for initial historical load or to rebuild all Parquet from raw JSON.
python -m connectors.jira.transform \
--raw-dir /data/src_data/raw/jira \
--output-dir /data/src_data/parquet/jira \
--attachments-dir /data/src_data/raw/jira/attachments
Common transformations (both modes):
- Extracts plain text from ADF (Atlassian Document Format)
- Maps custom field IDs to human-readable names
- Normalizes nested structures into flat tables
- Links attachments to local file paths
- Enforces explicit PyArrow schema for consistent types across months
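As an illustration of the first bullet, ADF is a JSON tree in which `text` nodes carry the visible content. The sketch below walks that tree; it is a simplified stand-in rather than the connector's own extractor, the name `adf_to_text` is hypothetical, and the set of block node types it handles is an assumption that covers only common cases.

```python
# Node types treated as block-level, i.e. terminated by a newline (assumed subset).
BLOCK_TYPES = {"paragraph", "heading", "listItem", "codeBlock", "blockquote"}


def adf_to_text(node) -> str:
    """Recursively collect the plain text of an ADF node."""
    if not isinstance(node, dict):
        return ""
    node_type = node.get("type")
    if node_type == "text":
        return node.get("text", "")
    if node_type == "hardBreak":
        return "\n"
    # Container node: concatenate the text of all children.
    text = "".join(adf_to_text(child) for child in node.get("content") or [])
    if node_type in BLOCK_TYPES and text and not text.endswith("\n"):
        text += "\n"
    return text


doc = {
    "type": "doc",
    "version": 1,
    "content": [
        {"type": "paragraph", "content": [{"type": "text", "text": "Disk is full"}]},
    ],
}
print(adf_to_text(doc).strip())
```

Unknown node types (mentions, media, emoji) contribute only the text of their children here, so a real extractor would likely need extra cases.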
5. Data Distribution
Analysts sync data via rsync (same as other data):
bash server/scripts/sync_data.sh
This syncs:
- server/parquet/jira/ - Parquet tables (issues, comments, attachments metadata, changelog, issuelinks, remote_links)
For attachment files, see Attachment Access section below.
Data Flow Timeline (Real-Time)
T+0ms Jira: Issue updated
T+50ms Jira: Webhook POST to our server
T+100ms Server: Verify signature, log event
T+150ms Server: GET /rest/api/3/issue/{key} from Jira API
T+400ms Server: GET SLA fields via JSM service account (cloud API)
T+500ms Server: Save JSON (with SLA overlay) to raw/jira/issues/
T+600ms Server: Download attachments (parallel)
T+800ms Server: Incremental transform → update monthly Parquet
T+900ms Server: Copy to distribution directory
T+1000ms Server: Return 200 OK to Jira
(analyst sync - any time)
T+Xsec Analyst: bash sync_data.sh
T+Xsec Analyst: rsync downloads only changed monthly file (~50KB)
T+Xsec Analyst: Query with DuckDB - sees latest data
Key improvement: Incremental transform runs immediately after webhook processing, so data is available for sync within seconds of the Jira change.
Configuration
Server Environment Variables
In /opt/data-analyst/.env:
# Jira webhook integration
JIRA_WEBHOOK_SECRET=<random 64-char hex string>
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token from Atlassian>
# Jira SLA service account (JSM Agent licence for SLA fields)
JIRA_SLA_EMAIL=<JSM service account email>
JIRA_SLA_API_TOKEN=<API token from 1Password>
JIRA_CLOUD_ID=f0f7a244-4fb4-41f9-b1f0-b79e24a20f11
GitHub Secrets
| Secret | Description |
|---|---|
| JIRA_WEBHOOK_SECRET | HMAC secret for webhook verification |
| JIRA_DOMAIN | Jira Cloud domain |
| JIRA_EMAIL | Email for API authentication |
| JIRA_API_TOKEN | API token from Atlassian account |
| JIRA_SLA_EMAIL | JSM service account email (for SLA fields) |
| JIRA_SLA_API_TOKEN | JSM service account API token |
| JIRA_CLOUD_ID | Atlassian Cloud site ID |
Getting Jira API Token
- Go to https://id.atlassian.com/manage-profile/security/api-tokens
- Click "Create API token"
- Name it (e.g., "Data Analyst Integration")
- Copy token to JIRA_API_TOKEN
⚠️ IMPORTANT: API tokens expire after 365 days maximum (Atlassian limitation).
Set a calendar reminder to rotate the token before expiration. When rotating:
- Create new token in Atlassian
- Update JIRA_API_TOKEN in GitHub Secrets and server .env
- Restart webapp: sudo systemctl restart webapp
- Test: curl https://your-instance.example.com/webhooks/jira/health
Directory Structure
/data/src_data/
├── raw/
│ └── jira/ # Raw data from webhooks
│ ├── issues/ # One JSON file per issue
│ │ ├── SUPPORT-15186.json
│ │ ├── SUPPORT-15189.json
│ │ └── SUPPORT-15190.json
│ ├── attachments/ # Downloaded files (by issue key)
│ │ ├── SUPPORT-15189/
│ │ │ ├── 56337_image.png
│ │ │ └── 56338_image-20260203-110549.png
│ │ └── SUPPORT-15190/
│ │ └── 56340_image.png
│ └── webhook_events/ # Audit log of all webhooks
│ ├── 20260203_105203_jira_issue_updated.json
│ └── 20260203_110457_comment_created.json
│
└── parquet/
└── jira/ # Transformed data (monthly partitioned)
├── issues/ # Main issues table
│ ├── 2025-01.parquet
│ ├── 2025-02.parquet
│ └── ...
├── comments/ # Issue comments
│ └── YYYY-MM.parquet
├── attachments/ # Attachment metadata
│ └── YYYY-MM.parquet
├── changelog/ # Field change history
│ └── YYYY-MM.parquet
├── issuelinks/ # Links between issues
│ └── YYYY-MM.parquet
└── remote_links/ # External links (Confluence, Slack, etc.)
└── YYYY-MM.parquet
Monthly Partitioning Benefits:
- Efficient rsync: only changed months are transferred
- Better performance: smaller files for ~15,000 total tickets
- Incremental updates: new months don't rewrite old data
Monitoring
Health Check
curl https://your-instance.example.com/webhooks/jira/health
Response:
{
"status": "ok",
"configured": true,
"webhook_secret_set": true,
"jira_domain": "your-org.atlassian.net"
}
Logs
# Webapp logs (webhook processing)
tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira
# Recent webhook events
ls -lt /data/src_data/raw/jira/webhook_events/ | head -20
# Issue count
ls /data/src_data/raw/jira/issues/ | wc -l
# Attachment count
find /data/src_data/raw/jira/attachments/ -type f | wc -l
Security
| Layer | Protection |
|---|---|
| Webhook | HMAC-SHA256 signature verification |
| API Auth | HTTP Basic Auth (email + API token) |
| Storage | Server directories with data-ops group permissions |
| Transport | HTTPS only (Let's Encrypt certificate) |
Webhook Signature Verification:
expected = hmac.new(
    secret.encode('utf-8'),
    request.get_data(),
    hashlib.sha256
).hexdigest()
if not hmac.compare_digest(signature, expected):
    abort(401)
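A framework-free version of the same check is sketched below, mainly to show a fail-closed treatment of a missing secret. The function name `check_signature` is hypothetical, as is the optional `sha256=` prefix on the signature. Returning 503 for an unset secret (operator misconfiguration) and 401 for a bad signature is a design choice assumed for this example.

```python
from __future__ import annotations

import hashlib
import hmac


def check_signature(secret: str | None, body: bytes, signature: str | None) -> int:
    """Return the HTTP status a handler could respond with: 200, 401 or 503."""
    if not secret:
        # Fail closed: without a secret nothing can be verified, so refuse.
        return 503
    if not signature:
        return 401
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # Assumption: the header value may look like "sha256=<hex>"; keep only the hex part.
    provided = signature.split("=", 1)[-1]
    # Constant-time comparison on bytes avoids timing leaks.
    if hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8")):
        return 200
    return 401
```

Separating the two failure codes lets monitoring distinguish a misconfigured server from a sender that has the wrong secret.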
Troubleshooting
Webhook not received
- Check Jira webhook is enabled and URL is correct
- Verify JQL filter matches the issue's project
- Check server firewall allows HTTPS from Atlassian IPs
Signature verification fails
- Verify JIRA_WEBHOOK_SECRET matches in both Jira and server .env
- Check for trailing whitespace in secret
- Restart webapp after changing .env
Attachments not downloading
- Check JIRA_API_TOKEN is valid
- Verify API token has read access to attachments
- Check disk space on /data partition
- Large attachments (>50MB) are skipped by design
Missing data in Parquet
- Run transformation manually:
    python -m connectors.jira.transform \
    --raw-dir /data/src_data/raw/jira \
    --output-dir /data/src_data/parquet/jira \
    --attachments-dir /data/src_data/raw/jira/attachments
- Check for errors in transformation output
- Verify raw JSON files exist in raw/jira/issues/
- Note: Output files are partitioned by month (e.g., issues/2026-01.parquet)
Schema Reference
See docs/jira_schema.md for detailed table schemas and example queries.
Historical Backfill
For initial setup or recovery, use the backfill script to download all historical issues.
File: connectors/jira/scripts/backfill.py
# Download all SUPPORT tickets (idempotent, skips existing)
python -m connectors.jira.scripts.backfill --parallel 4
# Environment variables required:
JIRA_DOMAIN=your-org.atlassian.net
JIRA_EMAIL=integration-user@your-domain.com
JIRA_API_TOKEN=<API token>
JIRA_DATA_DIR=/data/src_data/raw/jira # optional, default path
Features:
- Uses new Jira Cloud API (POST /rest/api/3/search/jql with nextPageToken)
- Parallel downloads (configurable workers)
- Downloads all attachments
- Idempotent - skips already downloaded issues
- Handles rate limiting gracefully
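Token-based pagination of the kind named in the first bullet can be sketched as a generator. The HTTP call is injected as `post_search`, a hypothetical callable that sends the payload to POST /rest/api/3/search/jql and returns the decoded JSON. The payload and response keys used here (`jql`, `maxResults`, `issues`, `nextPageToken`) are assumptions about the endpoint's shape, not taken from backfill.py.

```python
from __future__ import annotations

from typing import Callable, Iterator


def iter_issue_keys(
    post_search: Callable[[dict], dict],
    jql: str,
    page_size: int = 100,
) -> Iterator[str]:
    """Yield issue keys page by page until no nextPageToken is returned."""
    token = None
    while True:
        payload = {"jql": jql, "maxResults": page_size}
        if token:
            payload["nextPageToken"] = token
        page = post_search(payload)
        for issue in page.get("issues", []):
            yield issue["key"]
        token = page.get("nextPageToken")
        if not token:
            # Last page: the response carries no further token.
            break
```

Injecting the HTTP call keeps the loop testable without network access and leaves retry and rate-limit handling to the caller.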
SLA backfill (separate script, uses JSM service account):
File: connectors/jira/scripts/backfill_sla.py
# Fetch SLA fields for all issues (uses JIRA_SLA_* env vars)
python -m connectors.jira.scripts.backfill_sla --parallel 8
# Dry run (count files needing update):
python -m connectors.jira.scripts.backfill_sla --dry-run
The personal API token lacks the JSM Agent licence needed for SLA fields.
This script uses the JSM service account via the
Atlassian Cloud API (api.atlassian.com) to fetch and embed SLA data
into existing raw JSON files.
After backfill, run batch transform:
python -m connectors.jira.transform \
--raw-dir /data/src_data/raw/jira \
--output-dir /data/src_data/parquet/jira \
--attachments-dir /data/src_data/raw/jira/attachments
# Copy to distribution directory
cp -r /data/src_data/parquet/jira/* ~/server/parquet/jira/
SLA Polling (Open Tickets)
SLA elapsed values (first_response_elapsed_millis, time_to_resolution_elapsed_millis) only update when a webhook fires. For idle open tickets (~49 tickets, ~0.3% of dataset), these values go stale and no longer reflect the actual current elapsed time.
File: connectors/jira/scripts/poll_sla.py
The SLA polling job runs every 15 minutes via systemd timer (jira-sla-poll.timer) as root:data-ops and:
- Reads Parquet to find open issues with SLA data
- Fetches fresh SLA and status fields via JSM service account (cloud API)
- Updates raw JSON atomically (tempfile.mkstemp() + os.fchmod(fd, 0o660) + os.replace())
- Triggers incremental Parquet transform (inside advisory file lock)
Self-healing: The poll fetches status, resolution, resolutiondate, and updated alongside the SLA fields. If a ticket is resolved in Jira but still appears "open" in Parquet (e.g. due to a missed webhook), the poll automatically corrects the status in JSON and re-transforms to Parquet. Log output: Self-healing: SUPPORT-XXXX is resolved in Jira. This was added in response to #203 where 12 tickets were permanently stale after a permission bug prevented webhooks from updating JSON files.
File locking: The entire read-modify-write + Parquet transform is wrapped in a per-issue advisory file lock (connectors/jira/file_lock.py) to prevent races with the webhook handler. The webhook handler (connectors/jira/service.py) uses the same lock. Different issue keys don't block each other.
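A per-issue advisory lock of this kind can be built on `fcntl.flock`. The sketch below is not the contents of file_lock.py: the name `issue_lock`, the `.lock` suffix and the lock directory layout are all hypothetical, and it works on Unix only.

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def issue_lock(lock_dir: Path, issue_key: str):
    """Hold an exclusive advisory lock for one issue key while the block runs."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{issue_key}.lock"
    with open(lock_path, "w") as handle:
        # Blocks until any other holder of this issue's lock releases it.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield lock_path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Since each issue key maps to its own lock file, two processes contend only when they touch the same issue.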
Important — mkstemp and ACL: The issues/ directory uses POSIX ACLs with default:mask::rwx. tempfile.mkstemp() creates files with mode 0600, which overrides the ACL mask to --- and breaks group access for www-data (webhook handler) and deploy (batch transform). The os.fchmod(fd, 0o660) call immediately after mkstemp() restores the mask to rw-, preserving ACL-based access. See #203 for the full incident report.
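The mkstemp, fchmod and replace sequence can be sketched as follows. The function name `write_json_atomic` is hypothetical and error handling is reduced to removing the temporary file; poll_sla.py itself may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place."""
    # Same directory as the target so os.replace() stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        # mkstemp creates the file with mode 0600; widen it to 0660 before writing.
        os.fchmod(fd, 0o660)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        # Atomic rename: readers see either the old file or the new one.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Calling `os.fchmod` on the open descriptor, before any data is written, means the file never becomes visible under its final name with the restrictive mode.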
# Manual run
python -m connectors.jira.scripts.poll_sla
# Dry run (count open issues)
python -m connectors.jira.scripts.poll_sla --dry-run
# Verbose logging
python -m connectors.jira.scripts.poll_sla --verbose
Return states:
- updated - SLA fields refreshed, status unchanged
- healed - status corrected (ticket was resolved in Jira but stale locally)
- skipped - no valid SLA data and ticket not resolved
- failed - API error or transform failure
Note: sla_cycle_type (ongoing/completed) is not stored in Parquet — compute it on-the-fly in DuckDB:
SELECT issue_key,
CASE WHEN status_category = 'Done' THEN 'completed' ELSE 'ongoing' END AS sla_cycle_type,
first_response_elapsed_millis,
time_to_resolution_elapsed_millis
FROM 'server/parquet/jira/issues/*.parquet'
WHERE first_response_elapsed_millis IS NOT NULL
Analyst Sync Configuration
Jira data is an optional dataset - not synced by default to save bandwidth.
Enable Jira sync:
# Edit local config (created on first sync_data.sh run)
nano ~/.config/data-analyst/sync.yaml
# Change:
datasets:
jira: true # Enable parquet data (~50MB)
jira_attachments: false # Keep false unless you need actual files
Then sync:
bash server/scripts/sync_data.sh
DuckDB views for Jira tables are created automatically if data exists:
- jira_issues - main issues table
- jira_comments - issue comments
- jira_attachments - attachment metadata (filenames, sizes, URLs)
- jira_changelog - field change history
- jira_issuelinks - links between issues (blocks, duplicates, relates to)
- jira_remote_links - external links (Confluence, Slack, etc.)
Attachment Access
Attachments (images, logs, PDFs) are stored separately from parquet data.
Option 1: Download per-ticket (recommended)
Download attachments for a specific ticket to local temp folder:
# Download all attachments for one ticket
rsync -avz data-analyst:server/jira_attachments/SUPPORT-1234/ /tmp/SUPPORT-1234/
# View locally
ls /tmp/SUPPORT-1234/
open /tmp/SUPPORT-1234/screenshot.png # macOS
This is fast (only downloads files for one ticket) and keeps your local machine clean.
Option 2: Sync attachments locally (for heavy analysis)
If you need frequent access to attachments, enable full sync:
# ~/.config/data-analyst/sync.yaml
datasets:
jira: true
jira_attachments: true # Syncs ~500MB+ of files
Then sync_data.sh will rsync attachments to ./server/jira_attachments/.
Finding attachment path from parquet
The jira_attachments table has a local_path column with the server path:
SELECT
issue_key,
filename,
local_path,
size_bytes
FROM jira_attachments
WHERE issue_key = 'SUPPORT-1234';
Result:
issue_key | filename | local_path | size_bytes
SUPPORT-1234 | screenshot.png | /data/src_data/raw/jira/attachments/SUPPORT-1234/... | 45678
To access locally (if synced): replace /data/src_data/raw/jira/attachments/ with ./server/jira_attachments/.
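The prefix replacement can be done with path objects rather than string substitution, which also rejects paths that do not live under the attachments root. A minimal sketch, with the hypothetical helper name `to_local_path`:

```python
from pathlib import PurePosixPath

# Roots taken from the text above.
SERVER_ROOT = PurePosixPath("/data/src_data/raw/jira/attachments")
LOCAL_ROOT = PurePosixPath("server/jira_attachments")


def to_local_path(local_path: str) -> PurePosixPath:
    """Translate a server-side local_path value to the synced local location.

    Raises ValueError if the path is not under the server attachments root.
    """
    relative = PurePosixPath(local_path).relative_to(SERVER_ROOT)
    return LOCAL_ROOT / relative


print(to_local_path("/data/src_data/raw/jira/attachments/SUPPORT-1234/56340_image.png"))
```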
Future Improvements
- Automatic Parquet regeneration after each webhook (Implemented: incremental transform)
- Incremental Parquet updates (Implemented: upsert by issue_key)
- Full historical sync from Jira (Implemented: jira_backfill.py)
- SLA polling for open tickets (Implemented: jira_poll_sla.py, 15min timer)
- Comment attachment extraction (inline images in ADF)
- Custom field name resolution from Jira metadata API
- Attachment binary sync to analysts (currently metadata only)