* fix(security): close Jira webhook fail-open + path traversal (#83)

  Two related vulnerabilities:

  1. Fail-open signature check: when JIRA_WEBHOOK_SECRET was unset, _verify_signature returned True and any unauthenticated POST to /webhooks/jira would run the full ingest pipeline. Now fail-closed: the handler short-circuits with 503 (operator-misconfiguration signal, distinct from 401 wrong-signature) when the secret is missing.

  2. Path traversal via attacker-controlled issue_key: webhook payloads carry issue.key, which flowed unsanitized into save_issue (issues_dir / "{issue_key}.json"), download_attachment (attachments_dir / issue_key), and incremental_transform (raw_dir / "issues" / "{issue_key}.json"). A crafted webhook with issue.key="../../etc/passwd" could write outside the Jira data dir.

  Defense-in-depth: new connectors/jira/validation.py exposes is_valid_issue_key (whitelist regex ^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$) and safe_join_under (Path.resolve() containment check). Both are enforced at the webhook entry point AND at every filesystem boundary in the connector.

  Tests:
  - New tests/test_jira_validation.py: unit tests for both helpers (parametrized invalid keys, traversal/symlink/absolute-path cases).
  - Webhook tests: test_unconfigured_secret_returns_503, test_path_traversal_in_issue_key_rejected (parametrized over 10 bad keys), test_valid_issue_key_accepted.

  CHANGELOG: two CRITICAL Fixed bullets under Unreleased.

  Closes #83.

* fix(security): close remaining #83 review findings - webhookEvent traversal, _handle_deletion guard, regex tightening

  Reviewer of PR #93 flagged four MUST-FIXes:

  1. _log_webhook_event used the attacker-controlled `webhookEvent` field as a filename component without sanitization. Payload with `webhookEvent: "../../tmp/pwn"` could escape WEBHOOK_LOG_DIR.

     Now:
     - non-`[A-Za-z0-9_-]` runs are replaced with `_` (dot excluded so `..` cannot survive sanitization as a directory component)
     - length capped at 64 chars
     - final path routed through safe_join_under

     New regression test `test_webhook_event_path_traversal_sanitized`.

  2. _handle_deletion (connectors/jira/service.py:530) and process_webhook_event (line 487) still used raw issue_key in path builds. Even though the webhook handler validates upstream, the "defense-in-depth at every filesystem boundary" claim required these too. Both now run is_valid_issue_key and safe_join_under guards.

  3. Regex `^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$` permitted underscores in project keys. Atlassian's project-key validator does not: `A_B-1` is rejected by Jira itself. Tightened to `[A-Z0-9]` and updated tests: `ABC_DEF-1` is now invalid, added Cyrillic А-1 (lookalike), CRLF, and oversize cases to the bad-key parametrization.

  4. Existing test test_deletion_of_nonexistent_issue_returns_true used `PROJ-NOEXIST` which is not a real Jira key shape. Updated to `PROJ-99999`. The test still exercises the same intent (deletion of issue with no local file is idempotent).

  73/73 jira tests pass locally (test_jira_webhooks + test_jira_validation + test_jira_service + test_jira_service_full + test_jira_incremental).

  CHANGELOG updated to document the regex tightening and the new webhookEvent sanitization.

  Refs review of #93.

* fix(tests): test_journey_jira tests assumed fail-open before #83 fix

  CI failure on PR #93 caught two journey tests that pinned the OLD fail-open contract:

  - test_webhook_with_no_secret_configured_accepted asserted 200 when JIRA_WEBHOOK_SECRET was unset. After the #83 fix that's a 503 (operator misconfig). Renamed to _refused and flipped the assertion.
  - test_webhook_empty_payload_rejected didn't set the secret, so the 503 short-circuit fired before the empty-payload 400 could. Set JIRA_WEBHOOK_SECRET in the patched Config so the test exercises the intended path.
  56/56 jira journey + webhook + validation tests now pass.

* fix(security): #93 round-3 - webhook fallback format + save_issue early validation

  Devin Review caught two real findings:

  1. Webhook handler regression: the round-2 fix extracted issue_key only from event_data['issue']['key'], but process_webhook_event has long supported a fallback 'issue_key' top-level field for certain Jira event formats (e.g. delete events historically). As a result, the handler blocked those events with 400 before they reached the service layer.

     Fix: mirror process_webhook_event's fallback in the handler: try issue.key first, fall through to event_data.get('issue_key') when empty. is_valid_issue_key still validates whichever source provided the key.

  2. save_issue defense-in-depth was incomplete: is_valid_issue_key ran AFTER fetch_remote_links and fetch_sla_fields had already used the unvalidated issue_key in HTTP URL construction ({base_url}/issue/{issue_key}/remotelink etc.). A future internal caller invoking save_issue directly with attacker-controlled input could trigger outbound requests with a malicious path component (limited SSRF / URL-path manipulation against the Jira API server).

     Fix: move the is_valid_issue_key check to immediately after the null guard, before any HTTP request or filesystem op. The webhook layer still validates upstream; this is the second layer.

  66 jira tests pass.

  Refs Devin Review of #93.

* fix(changelog): #93 round-4 - add BREAKING marker to fail-closed bullet

  Devin Review caught: the JIRA_WEBHOOK_SECRET fail-closed change is a behavior change for operators (response code 503 vs old 200) that existing alerting may treat differently. Per the CLAUDE.md changelog discipline rule, operators grep for **BREAKING** before bumping the pin. Added the marker + a short note on what action operators need to take (set the env var if they haven't).

  Refs Devin Review of #93.

* fix: #93 round-5 - null-issue crash + comment drift

  Devin Review caught two findings on the round-4 commit:

  1. Pre-existing crash on null issue field: a webhook payload with {"issue": null} (rather than omitting the key) caused event_data.get("issue", {}) to return None, then issue.get("key") raised AttributeError → unhandled 500. Pre-existing but reachable.

     Fix: 'event_data.get("issue") or {}' normalises None to {}, then the existing fallback / validation path returns 400 cleanly. New regression test test_null_issue_field_does_not_crash.

  2. Inline comment drift: the comment at line 77 documented the allowed character class as [A-Za-z0-9._-] (with dot) but the regex at line 27 excludes dot deliberately (so '..' cannot survive sanitization). Fixed the comment to match.

  52 jira tests pass.

  Refs Devin Review of #93 round 5.

* fix: #93 round-6 - process_webhook_event also normalises null issue field

  Devin Review caught: the webhook handler at app/api/jira_webhooks.py correctly handles {"issue": null} via 'event_data.get("issue") or {}', but process_webhook_event at connectors/jira/service.py:509 still used the bare 'event_data.get("issue", {})' which returns None on explicit null. Internal callers (anything that invokes process_webhook_event without going through the HTTP handler) would hit the same AttributeError the round-5 fix closed at the handler layer. Same one-line fix.

  32 jira tests pass.

  Refs Devin Review of #93 round 5.

* fix: #93 round-7 - issue-key regex uses [0-9] not \d

  Devin Review caught: Python 3's \d matches any Unicode decimal digit (Arabic-Indic ٣, Bengali ৩, Devanagari ३, …). A key like TEST-٣ would pass the regex even though it's not a valid Jira input. Tightened to [0-9] (ASCII only). Added three Unicode-digit cases to the bad-key parametrization in test_jira_validation.py to lock in the contract.

  Refs Devin Review of #93 round 6.

* fix: #93 round-8 - use \Z anchor not $ in issue-key regex

  Devin Review caught: Python's $ anchor matches before a trailing \n, so re.match('…$', 'TEST-1\n') returns a match. is_valid_issue_key returned True for keys with a trailing newline. \Z is hard end-of-string and closes that bypass.

  Manual verification:
    is_valid_issue_key('TEST-1\n') → False (was True before fix)
    is_valid_issue_key('TEST-1\r\n') → False
    is_valid_issue_key('TEST-1') → True

  Refs Devin Review of #93 round 7.

* docs: #93 round-9 - CHANGELOG regex matches implementation
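
The two helpers named in these commits are not shown on this page. As a rough sketch only, the following illustrates what an allowlist key check and a containment join of this kind could look like. The pattern is assembled from the commit notes above (no underscore, ASCII digits, `\Z` anchor); the function bodies, signatures, and error messages are assumptions, not the contents of connectors/jira/validation.py.

```python
import re
from pathlib import Path

# Assumed pattern, pieced together from the commit notes: uppercase ASCII
# project key without underscores, ASCII-only digits, hard end-of-string anchor.
_ISSUE_KEY_RE = re.compile(r"^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}\Z")


def is_valid_issue_key(key: object) -> bool:
    """Return True only for strings shaped like a Jira issue key."""
    return isinstance(key, str) and _ISSUE_KEY_RE.match(key) is not None


def safe_join_under(base: Path, *parts: str) -> Path:
    """Join parts onto base; raise ValueError if the result leaves base.

    Both sides are resolved first, so '..' segments, absolute parts, and
    symlinks are evaluated before the containment check.
    """
    base_resolved = Path(base).resolve()
    candidate = base_resolved.joinpath(*parts).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base.
        candidate.relative_to(base_resolved)
    except ValueError:
        raise ValueError(f"{candidate} escapes {base_resolved}") from None
    return candidate
```

Under these assumptions, `is_valid_issue_key("TEST-1")` is accepted while `"TEST-1\n"`, `"TEST-٣"`, and `"ABC_DEF-1"` are rejected, and `safe_join_under(issues_dir, "../x.json")` raises `ValueError`, which is the exception the transform code below catches.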
301 lines
11 KiB
Python
"""
Incremental Jira transform - update single issue in Parquet files.

Called by webhook handler after issue JSON and attachments are saved.
Updates only the affected monthly Parquet file for efficient rsync.
"""

import json
import logging
import os
from datetime import datetime
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Import transform functions from batch transform
from .file_lock import parquet_month_lock
from .validation import is_valid_issue_key, safe_join_under
from .transform import (
    ATTACHMENTS_SCHEMA,
    CHANGELOG_SCHEMA,
    COMMENTS_SCHEMA,
    ISSUES_SCHEMA,
    ISSUELINKS_SCHEMA,
    REMOTE_LINKS_SCHEMA,
    apply_schema,
    get_month_key,
    transform_attachments,
    transform_changelog,
    transform_comments,
    transform_issue,
    transform_issuelinks,
    transform_remote_links,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Default paths (can be overridden via environment)
DEFAULT_RAW_DIR = Path(os.environ.get("DATA_DIR", "/data")) / "extracts" / "jira" / "raw"
DEFAULT_OUTPUT_DIR = Path(os.environ.get("DATA_DIR", "/data")) / "extracts" / "jira" / "data"


def upsert_dataframe(
    existing_df: pd.DataFrame | None,
    new_records: list[dict],
    key_column: str,
    issue_key: str,
) -> pd.DataFrame:
    """
    Upsert new records into existing DataFrame.

    - Removes all rows matching issue_key
    - Adds new records

    Args:
        existing_df: Existing DataFrame (or None if new file)
        new_records: List of new records to add
        key_column: Column used for matching (e.g., 'issue_key')
        issue_key: Issue key to remove/replace

    Returns:
        Updated DataFrame
    """
    new_df = pd.DataFrame(new_records) if new_records else pd.DataFrame()

    if existing_df is None or existing_df.empty:
        return new_df

    if new_df.empty:
        # Remove issue from existing data (deletion case)
        return existing_df[existing_df[key_column] != issue_key].copy()

    # Remove old records for this issue, add new ones
    filtered = existing_df[existing_df[key_column] != issue_key]
    return pd.concat([filtered, new_df], ignore_index=True)


def load_parquet_month(parquet_dir: Path, month_key: str) -> pd.DataFrame | None:
    """Load existing Parquet file for a month, or return None."""
    parquet_path = parquet_dir / f"{month_key}.parquet"
    if parquet_path.exists():
        try:
            return pd.read_parquet(parquet_path)
        except Exception as e:
            logger.warning(f"Failed to read {parquet_path}: {e}")
    return None


def save_parquet_month(
    df: pd.DataFrame,
    schema: dict,
    output_dir: Path,
    month_key: str,
) -> Path:
    """Save DataFrame to monthly Parquet file with explicit schema."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"{month_key}.parquet"

    if df.empty:
        # Don't write empty files, but delete if exists
        if output_path.exists():
            output_path.unlink()
            logger.info(f"Removed empty {output_path}")
        return output_path

    table = apply_schema(df, schema)
    pq.write_table(table, output_path)
    logger.info(f"Saved {len(df)} records to {output_path}")
    return output_path


def transform_single_issue(
    issue_key: str,
    raw_dir: Path | None = None,
    output_dir: Path | None = None,
    attachments_dir: Path | None = None,
    deleted: bool = False,
) -> bool:
    """
    Transform a single issue and update monthly Parquet files.

    This is called by webhook handler after issue JSON is saved.
    Only updates the month that the issue belongs to.

    Args:
        issue_key: Jira issue key (e.g., "SUPPORT-1234")
        raw_dir: Directory with raw JSON files
        output_dir: Output directory for Parquet files
        attachments_dir: Directory with downloaded attachments
        deleted: If True, remove issue from Parquet (deletion event)

    Returns:
        True if successful, False otherwise
    """
    raw_dir = raw_dir or DEFAULT_RAW_DIR
    output_dir = output_dir or DEFAULT_OUTPUT_DIR
    attachments_dir = attachments_dir or (raw_dir / "attachments")

    # Defense-in-depth: even if a stale/legacy code path bypasses webhook
    # validation, the transform step will refuse a malformed key (issue #83).
    if not is_valid_issue_key(issue_key):
        logger.error(f"Refusing transform for malformed issue key: {issue_key!r}")
        return False
    issues_dir = raw_dir / "issues"
    try:
        json_path = safe_join_under(issues_dir, f"{issue_key}.json")
    except ValueError as e:
        logger.error(f"Path traversal blocked in transform for {issue_key!r}: {e}")
        return False

    if deleted:
        # For deletion, we need to find which month the issue was in
        # Check all monthly files - this is rare so OK to be slower
        logger.info(f"Processing deletion for {issue_key}")
        return _handle_deletion(issue_key, output_dir)

    if not json_path.exists():
        logger.error(f"Issue JSON not found: {json_path}")
        return False

    try:
        # Load raw issue data
        with open(json_path) as f:
            raw_issue = json.load(f)

        # Transform issue
        issue_record = transform_issue(raw_issue)
        issue_record["_raw_file"] = json_path.name

        # Determine month
        month_key = get_month_key(issue_record.get("created_at"))
        logger.info(f"Updating {issue_key} in month {month_key}")

        # Transform related data
        comments_records = transform_comments(raw_issue)
        attachments_records = transform_attachments(raw_issue, attachments_dir)
        changelog_records = transform_changelog(raw_issue)

        # Transform link/remote data outside lock (minimize hold time)
        issuelinks_records = transform_issuelinks(raw_issue)
        remote_links_records = transform_remote_links(raw_issue)

        # Parquet read-modify-write under per-month lock to prevent
        # "last writer wins" race when concurrent webhooks touch the
        # same monthly partition (see issue #205).
        with parquet_month_lock(output_dir, month_key):
            updated_paths = []

            # Issues
            existing_issues = load_parquet_month(output_dir / "issues", month_key)
            updated_issues = upsert_dataframe(existing_issues, [issue_record], "issue_key", issue_key)
            path = save_parquet_month(updated_issues, ISSUES_SCHEMA, output_dir / "issues", month_key)
            updated_paths.append(path)

            # Comments
            existing_comments = load_parquet_month(output_dir / "comments", month_key)
            updated_comments = upsert_dataframe(existing_comments, comments_records, "issue_key", issue_key)
            path = save_parquet_month(updated_comments, COMMENTS_SCHEMA, output_dir / "comments", month_key)
            updated_paths.append(path)

            # Attachments
            existing_attachments = load_parquet_month(output_dir / "attachments", month_key)
            updated_attachments = upsert_dataframe(existing_attachments, attachments_records, "issue_key", issue_key)
            path = save_parquet_month(updated_attachments, ATTACHMENTS_SCHEMA, output_dir / "attachments", month_key)
            updated_paths.append(path)

            # Changelog
            existing_changelog = load_parquet_month(output_dir / "changelog", month_key)
            updated_changelog = upsert_dataframe(existing_changelog, changelog_records, "issue_key", issue_key)
            path = save_parquet_month(updated_changelog, CHANGELOG_SCHEMA, output_dir / "changelog", month_key)
            updated_paths.append(path)

            # Issue links
            existing_issuelinks = load_parquet_month(output_dir / "issuelinks", month_key)
            updated_issuelinks = upsert_dataframe(existing_issuelinks, issuelinks_records, "issue_key", issue_key)
            path = save_parquet_month(updated_issuelinks, ISSUELINKS_SCHEMA, output_dir / "issuelinks", month_key)
            updated_paths.append(path)

            # Remote links
            existing_remote_links = load_parquet_month(output_dir / "remote_links", month_key)
            updated_remote_links = upsert_dataframe(existing_remote_links, remote_links_records, "issue_key", issue_key)
            path = save_parquet_month(updated_remote_links, REMOTE_LINKS_SCHEMA, output_dir / "remote_links", month_key)
            updated_paths.append(path)

        # Update extract.duckdb _meta for all affected tables
        try:
            from .extract_init import update_meta
            extract_dir = output_dir.parent  # output_dir is .../data, parent is .../jira
            for table_name in ["issues", "comments", "attachments", "changelog", "issuelinks", "remote_links"]:
                update_meta(extract_dir, table_name)
        except Exception as meta_err:
            logger.warning(f"Could not update extract.duckdb _meta: {meta_err}")

        logger.info(f"Successfully updated {issue_key} in Parquet files")
        return True

    except Exception as e:
        logger.error(f"Error transforming {issue_key}: {e}", exc_info=True)
        return False


def _handle_deletion(
    issue_key: str,
    output_dir: Path,
) -> bool:
    """Handle issue deletion by removing from all monthly files."""
    found = False

    for table_name, schema in [
        ("issues", ISSUES_SCHEMA),
        ("comments", COMMENTS_SCHEMA),
        ("attachments", ATTACHMENTS_SCHEMA),
        ("changelog", CHANGELOG_SCHEMA),
        ("issuelinks", ISSUELINKS_SCHEMA),
        ("remote_links", REMOTE_LINKS_SCHEMA),
    ]:
        table_dir = output_dir / table_name
        if not table_dir.exists():
            continue

        for parquet_file in table_dir.glob("*.parquet"):
            month_key = parquet_file.stem
            try:
                with parquet_month_lock(output_dir, month_key):
                    df = pd.read_parquet(parquet_file)
                    if "issue_key" in df.columns and issue_key in df["issue_key"].values:
                        df = df[df["issue_key"] != issue_key]
                        save_parquet_month(df, schema, table_dir, month_key)

                        found = True
                        logger.info(f"Removed {issue_key} from {parquet_file}")
            except Exception as e:
                logger.warning(f"Error checking {parquet_file}: {e}")

    return found


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Incremental Jira transform")
    parser.add_argument("issue_key", help="Jira issue key (e.g., SUPPORT-1234)")
    parser.add_argument("--raw-dir", type=Path, help="Raw JSON directory")
    parser.add_argument("--output-dir", type=Path, help="Output Parquet directory")
    parser.add_argument("--attachments-dir", type=Path, help="Attachments directory")
    parser.add_argument("--deleted", action="store_true", help="Issue was deleted")

    args = parser.parse_args()

    success = transform_single_issue(
        issue_key=args.issue_key,
        raw_dir=args.raw_dir,
        output_dir=args.output_dir,
        attachments_dir=args.attachments_dir,
        deleted=args.deleted,
    )

    exit(0 if success else 1)