agnes-the-ai-analyst/docs/archive/pd-ps-comments.md

# Corporate Memory V1 — review notes (pd × ps)

**Branch:** `pabu/local-dev` (worktree review)
**Date:** 2026-04-25
**Reviewers:** pd (with Claude Code), independent second opinion from Codex (GPT-5.5, xhigh reasoning)
**Scope:** verification flywheel + corporate memory V1 — `services/verification_detector/`, `services/corporate_memory/`, `app/api/memory.py`, schema `v7→v8`.

This document captures open questions and proposed changes raised during a walkthrough of the branch. It is not a blocker list — it is a starting point for a follow-up conversation. Severity ratings are pd's initial estimate; please push back where they feel off.

---

## Context

The branch implements a "verification flywheel": session JSONLs are scanned by `verification_detector`, an LLM extracts corrections / confirmations / unprompted definitions, and the resulting facts feed `knowledge_items` with calibrated confidence. Admin governance (HITL) gates everything before approval. New schema columns track provenance (`source_type`, `source_ref`, `confidence`, `domain`, `entities`, `valid_from/until`, `supersedes`, `sensitivity`, `is_personal`) and two new tables (`knowledge_contradictions`, `session_extraction_state`).

The mechanism is sound. The notes below are about edges where V1 calibration choices may not survive contact with real data, and one suspected access-control issue.

---

## 1. Knowledge-item deduplication

### Problem

`services/verification_detector/detector.py` derives the item ID as:

```python
id = "kv_" + sha256(f"{title}:{content}").hexdigest()[:12]
```

Two analysts independently surfacing the same fact will almost never produce identical `title` and `content` strings — LLM phrasing varies even on the same underlying claim. So this hash collides ~never in practice. It is a deterministic ID, not a deduplication mechanism.

The branch has no other dedup safeguard at create time. Practical result: the admin review queue will grow with near-duplicates that humans must triage.

### Our proposal

Do not attempt auto-dedup at create time. Instead, surface near-duplicates into the admin queue via three layers:

1. **V1.5** — entity-tag match: items sharing ≥N normalized entities become a "likely duplicate" candidate pair, exposed in the admin UI alongside the existing contradictions tab.
2. **V2** — embedding similarity: at create time, compute cosine against existing approved items in the same domain; threshold (e.g. ≥ 0.85) → flag as `likely_duplicate_of: <id>` for admin merge.
3. The contradiction detector already surfaces "soft contradictions" — these can be repurposed to also catch near-duplicates with a single LLM judge call (same infra, different prompt).

Reasoning: auto-merge at create time is risky (false positives bury new nuance under stale items). Admin queue spam is the lesser evil; embedding pre-filter at V2 keeps admin load bounded.

### Codex second opinion

> Codex run: `gpt-5.5`, `model_reasoning_effort = xhigh`, executed against the same files in worktree.

- **Verdict:** Partial. Don't auto-merge at create time, but pre-check likely duplicates *before* inserting into the admin queue (don't punt entirely to admin manual review).
- **Blind spot we missed:** contradiction detection is **not a duplicate detector**, and in the current pipeline it is not wired into `verification_detector.run()` at all. Items are created at `detector.py:178` but `detect_and_record()` is never called. The contradiction prompt explicitly says "specificity / different perspective is not a contradiction", so near-duplicates would not surface there even if it were wired.
- **Concrete alternative:** add a relation table `knowledge_item_relations(item_a_id, item_b_id, relation_type, score, resolved)` near `src/db.py:76`. Add `repo.create_relation()` near `src/repositories/knowledge.py:182`. In `detector.py:171`, run duplicate-candidate lookup (entity-overlap is fine for V1.5; embeddings later) and attach `relation_type='likely_duplicate'` so the admin queue shows duplicate candidates explicitly.
- **Severity:** medium. Must fix before broad historical backfill; acceptable for narrow V1 only if confirmations are limited and admins see explicit duplicate candidates.
- **Confidence:** 90%.

### Plan (revised after Codex)

- V1: accept that duplicate hint is missing; flag this loudly in PR description so reviewers know what's deferred.
- V1.5: add `knowledge_item_relations` table + entity-overlap-based duplicate suggestion. Surface in admin UI alongside contradictions. ~1 day work.
- V2: embedding similarity at create time, gated by domain.

**Severity:** medium (was: medium). Codex agrees on default but raises the bar for V1.5 — duplicate candidates should be surfaced explicitly, not discovered ad-hoc by admin.

---

## 2. Contradiction detection — synchronous + SQL pre-filter vs Anthropic Batch API

### Current

`services/corporate_memory/contradiction.py` runs synchronously per new item:

1. **Pre-filter (DuckDB):** find candidates with same `domain` + keyword match on `title.split()` words >3 chars. Limit 10.
2. **LLM-as-judge (sync):** Haiku prompt returns `{contradicts, explanation, severity, suggested_resolution}`.

The pre-filter has predictable recall problems: synonyms, paraphrases, cross-domain conflicts (a finance metric definition can contradict a data engineering definition of the same metric).

### Proposed migration to Batch API

Anthropic's Batch API offers ~50% cost reduction with async SLA (≤24h, typically <1h). This is attractive because contradiction detection does not need real-time response — admins review queues, not push notifications.

### Our proposal

Layered evolution:

- **V1**: keep synchronous + SQL filter as-is. Ship.
- **V1.5**: add embedding-based pre-filter. Voyage embeddings at ~$0.02/1M tokens are essentially free at our volume. Replace keyword filter with `cosine(item, candidate) > 0.6` per domain; LLM judge unchanged. Catches paraphrases the keyword filter misses.
- **V2**: switch the LLM judge phase to Batch API. Run a nightly sweep over `pending × approved` per domain. With 50% cost reduction and higher rate limits, we can afford O(N²) within domain shards (no pre-filter needed in Batch mode — let the model judge all pairs).

### Open questions

- **Hidden Batch API costs** beyond cost: dev experience (test cycles), observability (job tracking, retries), debug latency. Worth it?
- **Hybrid mode**: keep sync for high-priority sources (admin_mandate, user corrections) and batch for bulk (`session_transcript`, low-confidence pending)? Or single mode for simplicity?
- **Embedding threshold**: 0.6 is a guess. Calibrate against held-out labeled pairs from V1 data once we have them.

### Codex second opinion

- **Verdict:** Partial. Batch API is likely a V2 win for bulk sweeps, **but not as the only mode.**
- **Blind spot we missed (two of them, both severe):**
  1. The SQL pre-filter is **weaker than we described.** It is not `domain AND keyword`; it is `domain OR keyword` — `src/repositories/knowledge.py:287` joins all conditions with `OR`. Combined with `ORDER BY updated_at DESC LIMIT 10`, recent same-domain noise crowds out the actual conflict.
  2. Detector-created items **never invoke contradiction detection at all.** `detect_and_record()` exists in `services/corporate_memory/contradiction.py` but is not called from `services/verification_detector/detector.py`. So V1 contradiction governance is effectively a stub.
- **Concrete alternative:** first fix V1 before optimizing.
  1. In `src/repositories/knowledge.py:272`, build `domain = ?` as a top-level conjunct (`AND`) and the title/content keyword expansion as an inner `OR` group.
  2. In `services/verification_detector/detector.py:178`, after `repo.create()`, call `contradiction.detect_and_record(extractor, item_dict, repo)`. Or enqueue it for nightly batch — but it must run somewhere.
  3. For V2, use **one shared job model with two executors**: sync for admin/manual/high-priority items, batch for session backfills and nightly sweeps. Don't fork prompt/candidate logic.
- **On embeddings threshold:** `cosine > 0.6` is arbitrary. Calibrate using labeled pairs (duplicate / contradiction / related-but-compatible / unrelated). Optimize for recall before LLM judging — target `>95% recall` with bounded candidate count. **Use top-k plus threshold, not threshold alone.**
- **Severity:** **high.** If V1 claims contradiction governance, this needs fixing before merge — currently the feature is shipped but inert.
- **Confidence:** 92%.

### Plan (revised after Codex)

- **V1 must-fix (was missed):**
  - Fix `OR` → `AND domain + (OR keywords)` in `find_contradiction_candidates`.
  - Wire `detect_and_record()` from `verification_detector.detector.run()`. Or, if we don't want sync LLM cost in detector, enqueue items into a `pending_contradiction_check` table for a nightly batch.
  - Either of those, or remove the V1 claim that contradictions are surfaced.
- V1.5: embedding pre-filter (top-k + threshold, calibrated against labeled pairs). Single shared job model.
- V2: batch executor for nightly sweeps; sync executor for admin/manual; same prompt+candidate logic.

**Severity:** **high** (was: low for V1). Codex's nailed two bugs we missed entirely.

---

## 3. Confidence scoring — calibration, decay, feedback

### Current

`services/corporate_memory/confidence.py` is a hand-rolled lookup:

- `_BASE_CONFIDENCE`: hard-coded dict, e.g. `user_verification.correction = 0.90`, `admin_mandate = 1.00`, `claude_local_md = 0.50`.
- `_MODIFIER_EFFECTS`: hard-coded modifiers (`+0.05` per additional verifier, `+0.20` for admin confirmation).
- `apply_decay()`: linear `0.02` per month → everything reaches 0 at ~50 months including admin policies.

Three problems:

1. **Tuning requires deploy.** The numbers are guesses; in real use we will discover that, say, `user_verification.confirmation` (currently 0.60) is overoptimistic. Each iteration = code change + deploy.
2. **Linear decay is wrong for admin policies.** `admin_mandate = 1.00` decaying linearly to 0 in 50 months is sematically incorrect — admin policies are policies, not "aging facts". They are explicitly revoked, not slowly forgotten.
3. **No feedback from admin actions.** When an admin rejects an item, that signal is lost — we don't update the prior for `(source_type, detection_type)` based on observed approval rates.

### Our proposal

Three layers:

**A. Externalize all numbers to `instance.yaml`** (V1.5, mandatory):

```yaml
corporate_memory:
  confidence:
    base:
      user_verification.correction: 0.90
      user_verification.confirmation: 0.60
      user_verification.unprompted_definition: 0.90
      admin_mandate: 1.00
      claude_local_md: 0.50
      session_transcript: 0.50
    modifiers:
      additional_verifiers_per_user: 0.05
      admin_confirmed_bonus: 0.20
    decay:
      mode: exponential       # or "linear" for back-compat
      half_life_months: 12    # confidence halves every 12 months
      floor:
        admin_mandate: 0.50   # admin items never decay below 0.5
        user_verification.correction: 0.10
        claude_local_md: 0.0
        session_transcript: 0.0
```

**B. Switch to exponential decay with per-source-type floor** (V1.5):

```python
def apply_decay(confidence, created_at, source_type, half_life_months=12, floor_per_source=None):
    age_months = ...
    decayed = confidence * (0.5 ** (age_months / half_life_months))
    floor = (floor_per_source or {}).get(source_type, 0.0)
    return max(decayed, floor)
```

**C. Bayesian prior calibration from admin actions** (V2):

Nightly: compute `P(approve | source_type, detection_type)` over the last N items per category. Surface as a metric in admin UI. Initially: human-edited config update. Later: gated auto-update with selection-bias mitigation (random sampling holdout).

### Open questions

- **Power-law vs exponential decay.** Ebbinghaus forgetting curve research suggests power-law may match human knowledge half-life better. Does that translate to organizational knowledge? Probably yes for "memorized facts", probably no for "policies that are explicitly maintained".
- **Per-domain calibration.** Finance metrics churn quarterly; engineering conventions persist for years. Should `decay.half_life_months` be per-domain, not just per-source-type?
- **Selection bias in feedback loop.** If priors gate which items admins see (sorted by confidence), and we update priors from approval rate, low-confidence items rarely get reviewed → priors stay biased. Mitigation: reserve a small random sample (e.g. 10%) outside the priority queue for unbiased measurement.
- **No "post-create confirmation" mechanism.** When a new verification cites an existing approved item, only the new item gets confidence; the existing item doesn't receive a verification-count boost. This loses the "multiple analysts independently cited this" signal over time.

### Codex second opinion

- **Verdict:** Partial. External config + floors are good. **The decay framing is wrong, and there are bigger V1 bugs in the data flow.**
- **Blind spot we missed (significant):**
  1. **LLM-controlled confidence.** `services/verification_detector/prompts.py:37` asks the LLM to return `base_confidence` as part of the JSON output, and `services/verification_detector/detector.py:187` stores it directly into the DB. **Confidence should not be LLM-controlled.** The current `compute_confidence()` exists but is bypassed for verification-detector items.
  2. **Lost evidence.** The detector extracts `user_quote` from the LLM (the exact quote that constitutes the verification — the most valuable signal!) and **discards it.** It is not stored anywhere. Without `user_quote` and `detection_type` persisted in DB, future Bayesian re-calibration has no raw material.
  3. **Decay framing is wrong.** Confidence is being treated as "truth decays with age" (Ebbinghaus). Organizational facts usually change by **events, scope, or validity windows** — not by smooth forgetting. The `valid_from/valid_until` columns already exist; the right model is "fact has a validity window", not "fact slowly fades".
- **Concrete alternative:**
  1. Remove `base_confidence` from the LLM-required output in `services/verification_detector/schemas.py:30`, **or** ignore it on the read side. Compute confidence in code: `compute_confidence("user_verification", v["detection_type"])`.
  2. Add an **evidence table**: `verification_evidence(id, item_id, source_user, source_ref, detection_type, user_quote, created_at)`. Persist `user_quote` and `detection_type` per-verification. Multiple evidence rows per item enables real "additional verifiers" boost computation (one user × one quote = one evidence row).
  3. Split "evidence confidence" (signal strength from sources) from "freshness / review-due" (validity windows, explicit revocation). Don't conflate them in one number.
- **On power-law vs exponential:** secondary issue. Domain-specific volatility matters more — use **hierarchical priors**: global `(source_type, detection_type)` priors plus per-domain freshness policy.
- **On feedback loops:** mitigate via random review sampling, holdout calibration sets, and **never use learned priors as the sole gate for admin visibility.** Codex agrees with our concern but emphasizes random sampling more strongly.
- **Severity:** medium. Bayesian/config overhaul can wait, but **LLM-controlled confidence and missing evidence should be fixed before V1.**
- **Confidence:** 88%.

### Plan (revised after Codex)

- **V1 must-fix (was missed):**
  - Stop trusting LLM-returned `base_confidence`. Either drop it from `VERIFICATION_SCHEMA` or ignore it on insert and call `compute_confidence(...)` instead.
  - Add `verification_evidence` table to persist `user_quote` + `detection_type` + source reference. This is also what makes "multi-user verification boost" actually computable post-create.
- V1.5: externalize numbers to `instance.yaml` (A) + exponential decay with per-source floor (B).
- V2: Bayesian metric surface (read-only) + per-domain decay policy. Auto-update behind feature flag with random-sample holdout.

**Severity:** **high** for V1 must-fix items above; medium for the rest.

---

## 4. Entity resolution v1

### Problem

`services/corporate_memory/entities.py` does case-insensitive **substring** match against a static registry. Two consequences:

- **False positives** on short tokens. A registry containing `"MD"` matches every occurrence of `"markdown"`, `"command"`, `"admin"`, `"medical"`. Severe noise.
- **False negatives** on synonyms. `"churn"` does not match `"attrition"`, `"customer loss"`, `"logo churn"`. Severe miss.
- No lemmatization, no co-occurrence, no confidence on the match itself.

### Our proposal

Three layers:

**1. V1.5 — word-boundary regex + canonical+aliases registry:**

```yaml
corporate_memory:
  entities:
    metrics:
      - canonical: churn
        aliases: [attrition, customer loss, logo churn]
      - canonical: MRR
        aliases: [monthly recurring revenue, monthly revenue]
```

```python
import re
def resolve_entities(content, title, registry):
    text = f"{title} {content}"
    matched = set()
    for category, entries in registry.items():
        for entry in entries:
            patterns = [entry["canonical"]] + entry.get("aliases", [])
            for p in patterns:
                if re.search(rf"\b{re.escape(p)}\b", text, re.IGNORECASE):
                    matched.add(entry["canonical"])  # always store canonical
                    break
    return sorted(matched)
```

Cost: ~30 LOC, no new dependencies, deterministic, eliminates false positives on short tokens.

**2. V2 — embedding-based fuzzy match.** Pre-compute embeddings for all canonical entities (cache). Per item: embed text, cosine to each entity embedding, threshold 0.75. Catches paraphrases the alias list misses. Voyage cost is negligible.

**3. V2.5 — LLM extraction for low-entity items, batch mode.** Items with <2 resolved entities go into a nightly batch LLM job that suggests candidates for admin curation. This is also how the synonym registry grows over time without manual curation.

### Open questions

- **Skip V1.5, go straight to embeddings?** Argument for: per-item cost is ~$0.0000002, negligible. Argument against: word-boundary fix is so cheap (~30 LOC) it's a no-brainer; embeddings pull in voyage_sdk + caching infrastructure that we don't need yet.
- **Synonym registry maintenance.** Who curates it? Does it become a Big Ball Of Mud? Mitigation: V2.5 LLM-suggest pipeline auto-grows it; admin reviews additions.
- **Hierarchical entities.** "EMEA Sales" is a sub-team of "Sales". "MRR Q3" is an instance of "MRR". Layered approach doesn't model this. Do we need it for V1? Probably not. For V2? Maybe — depends on how often analysts ask drill-down questions.

### Codex second opinion

- **Verdict:** Partial. **Do not skip V1.5.** Embeddings are not a replacement for deterministic entity resolution.
- **Blind spot we missed:** embeddings are **bad at exact short entities, acronyms, product codes, metric names, and aliases where precision matters** (e.g. `MRR`, `NPS`, internal product codes). Substring → embeddings would trade one class of false positives for a different class of *silent* errors that are harder to detect. Word-boundary regex with a maintained alias list is *more correct* than embeddings for these cases, not just cheaper.
- **Concrete alternative:** rebuild `entities.py:14` to use a **typed registry with stable canonical IDs**:
  ```yaml
  metrics:
    - id: metric.churn
      canonical: churn
      aliases: [attrition, customer loss, logo churn]
      kind: metric
      parent_id: null
      blocked_terms: []
    - id: org.team.sales.emea
      canonical: EMEA Sales
      kind: team
      parent_id: org.team.sales
  ```
  Replace substring matching at `entities.py:57` with escaped word-boundary regex, with **special rules for short aliases** (e.g. require leading + trailing whitespace or punctuation). Store canonical **IDs** (`metric.churn`) in `knowledge_items.entities`, not display strings — display is a join. Use embeddings/LLM **only to suggest aliases or candidates for admin approval**, never to bypass deterministic resolution.
- **On hierarchies:** worth modeling minimally now. `parent_id` is sufficient. `EMEA Sales` rolls up to `Sales`. `MRR Q3` is "metric + time period", not a separate flat entity — model it as `entity_ref + valid_from/valid_until`.
- **Severity:** medium. Not a V1 blocker by itself, but **definitely required before V1.5** if dedup or contradiction depend on entities (which our V1.5 plan does).
- **Confidence:** 86%.

### Plan (revised after Codex)

- V1: ship as-is, accept noise.
- V1.5: typed registry with canonical IDs, word-boundary regex, **store IDs not strings**, support `parent_id` for hierarchies. Use this for duplicate-candidate detection (Q1 V1.5 plan). ~2 days now (more scope than originally planned).
- V2+: embeddings + LLM-suggest pipeline for **alias growth**, not entity resolution. Admin approves auto-suggested aliases.

**Severity:** medium. Codex sharpened the design — store canonical IDs, not strings; hierarchy is V1.5, not V2.

---

## 5. `is_personal` flag — suspected leak

### Code paths

`app/api/memory.py`:

- `POST /{item_id}/personal` (line ~196) — only the contributor (`item.source_user == user.email`) can flag. Correct.
- `GET /api/memory` (line ~63) — accepts query parameter `exclude_personal: bool = True`. Default excludes personal items, but the endpoint accepts `exclude_personal=False` from any authenticated user with no role check.
- `GET /api/memory/my-contributions` — returns contributor's own items including personal ones. Correct.

### Concern

Any authenticated user can request:

```
GET /api/memory?exclude_personal=false
```

…and receive items flagged `is_personal=true` by other users. The flag is meant to keep an item private to the contributor, but the list endpoint exposes it whenever the caller asks.

### Impact

`is_personal` was introduced for emergency-exit cases (the detector pulled out something private that shouldn't be team-wide). If any user can override the default, the flag provides false security. This is a confidentiality leak unless `is_personal` is purely an "admin convenience flag" — but that's not how the toggle UI presents it.

### Proposed fix

Two options:

**Option A — kill the override entirely.** Remove the `exclude_personal` query parameter. Personal items are never visible via `GET /api/memory`. Contributors see them only via `/my-contributions`. Admins see them via a separate admin endpoint guarded by `Role.KM_ADMIN`.

**Option B — gate the override by role.** Keep the parameter, but require `Role.KM_ADMIN` to set it to `False`. Add a server-side check; non-admin requests with `exclude_personal=False` get 403 or are silently coerced to `True`.

Option A is simpler, safer, and matches the most likely product intent. Option B preserves a flexible audit endpoint for admins but is easier to misuse.

### Codex second opinion

- **Verdict:** **Y. This is a leak.**
- **Blind spot we missed (worse than the list path alone):**
  - The public list endpoint passes caller-controlled `exclude_personal` into `repo.list_items()` at `app/api/memory.py:87`, and `repo.list_items()` only filters when that flag is true at `src/repositories/knowledge.py:113` — caller controls it.
  - **Search ignores `exclude_personal` entirely** via `app/api/memory.py:78` and `src/repositories/knowledge.py:119`. Any authenticated user can `GET /api/memory?search=foo` and see personal items unconditionally.
  - **Direct item access** (`/{id}`, `/{id}/provenance`, vote endpoints) has no `is_personal` check either. If a user knows or guesses an item ID, they retrieve it.
- **Concrete alternative:**
  1. For the public `GET /api/memory`, **force `exclude_personal=True`** unless `user.role in ("km_admin", "admin")`. Or simpler: always force true and leave personal review to admin endpoints.
  2. Add `exclude_personal` support to `repo.search()` (currently it's only on `list_items()`).
  3. Add a shared `_can_view_item(user, item)` check used by provenance, vote, and direct item actions. Personal items are visible only to the contributor + admins.
- **Severity:** **high. Must fix before merging V1.**
- **Confidence:** **99%.**

### Plan (revised after Codex)

- **V1 blocker.** Fix all three vectors (list, search, direct access), not just the list parameter.
  - Force `exclude_personal=True` for non-admins on `GET /api/memory`.
  - Plumb `exclude_personal` through `repo.search()`.
  - Add `_can_view_item(user, item)` helper, apply on `/{id}`, `/{id}/provenance`, and any other direct-item endpoint.
- Add regression test that an authenticated non-contributor cannot retrieve another user's `is_personal` item via list, search, or direct GET.

**Severity:** **high.**

---

## Top issues to address before merging V1

(Initial pd list was three items. Codex review added two more high-severity findings we missed entirely. Updated list below.)

1. **`is_personal` leak in full breadth (Q5).** Fix list, search, and direct-item access. Force `exclude_personal=True` for non-admins on `GET /api/memory`; plumb the filter through `repo.search()`; add `_can_view_item(user, item)` for `/{id}`, `/{id}/provenance`, vote endpoints. Add regression test. **Severity: high. Confidence: 99%.**
2. **Contradiction detection is shipped but inert (Q2).** `detect_and_record()` is never called from `verification_detector.run()`. Either wire it (sync after `repo.create()` or enqueue for batch) or remove the V1 claim that contradictions are surfaced. Also fix the `OR → AND` bug in the SQL pre-filter at `src/repositories/knowledge.py:272-287` so `domain` is a top-level conjunct and keywords are an inner `OR` group. **Severity: high.**
3. **LLM-controlled confidence + lost evidence (Q3).** Detector trusts `base_confidence` from the LLM and discards `user_quote`. Drop `base_confidence` from `VERIFICATION_SCHEMA` (or ignore on insert), call `compute_confidence(...)` in code instead. Add `verification_evidence` table to persist `user_quote`, `detection_type`, `source_user`, `source_ref` per evidence row. Without this, "additional verifiers boost" is computable in theory only. **Severity: high.**

The rest (Q1 explicit duplicate hint, Q2 batch sweep, Q3 YAML externalize + exponential decay, Q4 typed registry) can land in V1.5 as planned.

### What V1.5 must own (consequential refinements, not blockers)

- Q1: `knowledge_item_relations` table for explicit duplicate candidate hints in admin queue.
- Q2: embedding pre-filter (top-k + threshold, calibrated against labeled pairs).
- Q3: externalize confidence to `instance.yaml`; exponential decay with per-source floor.
- Q4: typed entity registry with canonical IDs, word-boundary regex, hierarchical `parent_id`.

---

## Process notes

- Worktree: `/Users/padak/github/agnes-pabu-local-dev`, branch `pabu/local-dev` tracks `origin/pabu/local-dev`.
- Walkthrough done file-by-file with paired reading; second-opinion run via Codex CLI (model `gpt-5.5`, `model_reasoning_effort = xhigh`) over the same files.
- Document is intended for round-trip discussion: pd commits the first pass, ps reads, replies inline or in PR comments.