agnes-the-ai-analyst/docs/archive/pd-ps-comments.md
ZdenekSrotyr a48524509a
docs: consolidate and de-clutter the documentation tree (#306)
CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.
2026-05-14 18:54:22 +00:00

383 lines
27 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Corporate Memory V1 — review notes (pd × ps)
**Branch:** `pabu/local-dev` (worktree review)
**Date:** 2026-04-25
**Reviewers:** pd (with Claude Code), independent second opinion from Codex (GPT-5.5, xhigh reasoning)
**Scope:** verification flywheel + corporate memory V1 — `services/verification_detector/`, `services/corporate_memory/`, `app/api/memory.py`, schema `v7→v8`.
This document captures open questions and proposed changes raised during a walkthrough of the branch. It is not a blocker list — it is a starting point for a follow-up conversation. Severity ratings are pd's initial estimate; please push back where they feel off.
---
## Context
The branch implements a "verification flywheel": session JSONLs are scanned by `verification_detector`, an LLM extracts corrections / confirmations / unprompted definitions, and the resulting facts feed `knowledge_items` with calibrated confidence. Admin governance (HITL) gates everything before approval. New schema columns track provenance (`source_type`, `source_ref`, `confidence`, `domain`, `entities`, `valid_from/until`, `supersedes`, `sensitivity`, `is_personal`) and two new tables (`knowledge_contradictions`, `session_extraction_state`).
The mechanism is sound. The notes below are about edges where V1 calibration choices may not survive contact with real data, and one suspected access-control issue.
---
## 1. Knowledge-item deduplication
### Problem
`services/verification_detector/detector.py` derives the item ID as:
```python
id = "kv_" + sha256(f"{title}:{content}").hexdigest()[:12]
```
Two analysts independently surfacing the same fact will almost never produce identical `title` and `content` strings — LLM phrasing varies even on the same underlying claim. So this hash collides ~never in practice. It is a deterministic ID, not a deduplication mechanism.
The branch has no other dedup safeguard at create time. Practical result: the admin review queue will grow with near-duplicates that humans must triage.
### Our proposal
Do not attempt auto-dedup at create time. Instead, surface near-duplicates into the admin queue via three layers:
1. **V1.5** — entity-tag match: items sharing ≥N normalized entities become a "likely duplicate" candidate pair, exposed in the admin UI alongside the existing contradictions tab.
2. **V2** — embedding similarity: at create time, compute cosine against existing approved items in the same domain; threshold (e.g. ≥ 0.85) → flag as `likely_duplicate_of: <id>` for admin merge.
3. The contradiction detector already surfaces "soft contradictions" — these can be repurposed to also catch near-duplicates with a single LLM judge call (same infra, different prompt).
Reasoning: auto-merge at create time is risky (false positives bury new nuance under stale items). Admin queue spam is the lesser evil; embedding pre-filter at V2 keeps admin load bounded.
### Codex second opinion
> Codex run: `gpt-5.5`, `model_reasoning_effort = xhigh`, executed against the same files in worktree.
- **Verdict:** Partial. Don't auto-merge at create time, but pre-check likely duplicates *before* inserting into the admin queue (don't punt entirely to admin manual review).
- **Blind spot we missed:** contradiction detection is **not a duplicate detector**, and in the current pipeline it is not wired into `verification_detector.run()` at all. Items are created at `detector.py:178` but `detect_and_record()` is never called. The contradiction prompt explicitly says "specificity / different perspective is not a contradiction", so near-duplicates would not surface there even if it were wired.
- **Concrete alternative:** add a relation table `knowledge_item_relations(item_a_id, item_b_id, relation_type, score, resolved)` near `src/db.py:76`. Add `repo.create_relation()` near `src/repositories/knowledge.py:182`. In `detector.py:171`, run duplicate-candidate lookup (entity-overlap is fine for V1.5; embeddings later) and attach `relation_type='likely_duplicate'` so the admin queue shows duplicate candidates explicitly.
- **Severity:** medium. Must fix before broad historical backfill; acceptable for narrow V1 only if confirmations are limited and admins see explicit duplicate candidates.
- **Confidence:** 90%.
### Plan (revised after Codex)
- V1: accept that duplicate hint is missing; flag this loudly in PR description so reviewers know what's deferred.
- V1.5: add `knowledge_item_relations` table + entity-overlap-based duplicate suggestion. Surface in admin UI alongside contradictions. ~1 day work.
- V2: embedding similarity at create time, gated by domain.
**Severity:** medium (was: medium). Codex agrees on default but raises the bar for V1.5 — duplicate candidates should be surfaced explicitly, not discovered ad-hoc by admin.
---
## 2. Contradiction detection — synchronous + SQL pre-filter vs Anthropic Batch API
### Current
`services/corporate_memory/contradiction.py` runs synchronously per new item:
1. **Pre-filter (DuckDB):** find candidates with same `domain` + keyword match on `title.split()` words >3 chars. Limit 10.
2. **LLM-as-judge (sync):** Haiku prompt returns `{contradicts, explanation, severity, suggested_resolution}`.
The pre-filter has predictable recall problems: synonyms, paraphrases, cross-domain conflicts (a finance metric definition can contradict a data engineering definition of the same metric).
### Proposed migration to Batch API
Anthropic's Batch API offers ~50% cost reduction with async SLA (≤24h, typically <1h). This is attractive because contradiction detection does not need real-time response admins review queues, not push notifications.
### Our proposal
Layered evolution:
- **V1**: keep synchronous + SQL filter as-is. Ship.
- **V1.5**: add embedding-based pre-filter. Voyage embeddings at ~$0.02/1M tokens are essentially free at our volume. Replace keyword filter with `cosine(item, candidate) > 0.6` per domain; LLM judge unchanged. Catches paraphrases the keyword filter misses.
- **V2**: switch the LLM judge phase to Batch API. Run a nightly sweep over `pending × approved` per domain. With 50% cost reduction and higher rate limits, we can afford O(N²) within domain shards (no pre-filter needed in Batch mode let the model judge all pairs).
### Open questions
- **Hidden Batch API costs** beyond cost: dev experience (test cycles), observability (job tracking, retries), debug latency. Worth it?
- **Hybrid mode**: keep sync for high-priority sources (admin_mandate, user corrections) and batch for bulk (`session_transcript`, low-confidence pending)? Or single mode for simplicity?
- **Embedding threshold**: 0.6 is a guess. Calibrate against held-out labeled pairs from V1 data once we have them.
### Codex second opinion
- **Verdict:** Partial. Batch API is likely a V2 win for bulk sweeps, **but not as the only mode.**
- **Blind spot we missed (two of them, both severe):**
1. The SQL pre-filter is **weaker than we described.** It is not `domain AND keyword`; it is `domain OR keyword` `src/repositories/knowledge.py:287` joins all conditions with `OR`. Combined with `ORDER BY updated_at DESC LIMIT 10`, recent same-domain noise crowds out the actual conflict.
2. Detector-created items **never invoke contradiction detection at all.** `detect_and_record()` exists in `services/corporate_memory/contradiction.py` but is not called from `services/verification_detector/detector.py`. So V1 contradiction governance is effectively a stub.
- **Concrete alternative:** first fix V1 before optimizing.
1. In `src/repositories/knowledge.py:272`, build `domain = ?` as a top-level conjunct (`AND`) and the title/content keyword expansion as an inner `OR` group.
2. In `services/verification_detector/detector.py:178`, after `repo.create()`, call `contradiction.detect_and_record(extractor, item_dict, repo)`. Or enqueue it for nightly batch but it must run somewhere.
3. For V2, use **one shared job model with two executors**: sync for admin/manual/high-priority items, batch for session backfills and nightly sweeps. Don't fork prompt/candidate logic.
- **On embeddings threshold:** `cosine > 0.6` is arbitrary. Calibrate using labeled pairs (duplicate / contradiction / related-but-compatible / unrelated). Optimize for recall before LLM judging target `>95% recall` with bounded candidate count. **Use top-k plus threshold, not threshold alone.**
- **Severity:** **high.** If V1 claims contradiction governance, this needs fixing before merge currently the feature is shipped but inert.
- **Confidence:** 92%.
### Plan (revised after Codex)
- **V1 must-fix (was missed):**
- Fix `OR` `AND domain + (OR keywords)` in `find_contradiction_candidates`.
- Wire `detect_and_record()` from `verification_detector.detector.run()`. Or, if we don't want sync LLM cost in detector, enqueue items into a `pending_contradiction_check` table for a nightly batch.
- Either of those, or remove the V1 claim that contradictions are surfaced.
- V1.5: embedding pre-filter (top-k + threshold, calibrated against labeled pairs). Single shared job model.
- V2: batch executor for nightly sweeps; sync executor for admin/manual; same prompt+candidate logic.
**Severity:** **high** (was: low for V1). Codex's nailed two bugs we missed entirely.
---
## 3. Confidence scoring — calibration, decay, feedback
### Current
`services/corporate_memory/confidence.py` is a hand-rolled lookup:
- `_BASE_CONFIDENCE`: hard-coded dict, e.g. `user_verification.correction = 0.90`, `admin_mandate = 1.00`, `claude_local_md = 0.50`.
- `_MODIFIER_EFFECTS`: hard-coded modifiers (`+0.05` per additional verifier, `+0.20` for admin confirmation).
- `apply_decay()`: linear `0.02` per month everything reaches 0 at ~50 months including admin policies.
Three problems:
1. **Tuning requires deploy.** The numbers are guesses; in real use we will discover that, say, `user_verification.confirmation` (currently 0.60) is overoptimistic. Each iteration = code change + deploy.
2. **Linear decay is wrong for admin policies.** `admin_mandate = 1.00` decaying linearly to 0 in 50 months is sematically incorrect admin policies are policies, not "aging facts". They are explicitly revoked, not slowly forgotten.
3. **No feedback from admin actions.** When an admin rejects an item, that signal is lost we don't update the prior for `(source_type, detection_type)` based on observed approval rates.
### Our proposal
Three layers:
**A. Externalize all numbers to `instance.yaml`** (V1.5, mandatory):
```yaml
corporate_memory:
confidence:
base:
user_verification.correction: 0.90
user_verification.confirmation: 0.60
user_verification.unprompted_definition: 0.90
admin_mandate: 1.00
claude_local_md: 0.50
session_transcript: 0.50
modifiers:
additional_verifiers_per_user: 0.05
admin_confirmed_bonus: 0.20
decay:
mode: exponential # or "linear" for back-compat
half_life_months: 12 # confidence halves every 12 months
floor:
admin_mandate: 0.50 # admin items never decay below 0.5
user_verification.correction: 0.10
claude_local_md: 0.0
session_transcript: 0.0
```
**B. Switch to exponential decay with per-source-type floor** (V1.5):
```python
def apply_decay(confidence, created_at, source_type, half_life_months=12, floor_per_source=None):
age_months = ...
decayed = confidence * (0.5 ** (age_months / half_life_months))
floor = (floor_per_source or {}).get(source_type, 0.0)
return max(decayed, floor)
```
**C. Bayesian prior calibration from admin actions** (V2):
Nightly: compute `P(approve | source_type, detection_type)` over the last N items per category. Surface as a metric in admin UI. Initially: human-edited config update. Later: gated auto-update with selection-bias mitigation (random sampling holdout).
### Open questions
- **Power-law vs exponential decay.** Ebbinghaus forgetting curve research suggests power-law may match human knowledge half-life better. Does that translate to organizational knowledge? Probably yes for "memorized facts", probably no for "policies that are explicitly maintained".
- **Per-domain calibration.** Finance metrics churn quarterly; engineering conventions persist for years. Should `decay.half_life_months` be per-domain, not just per-source-type?
- **Selection bias in feedback loop.** If priors gate which items admins see (sorted by confidence), and we update priors from approval rate, low-confidence items rarely get reviewed priors stay biased. Mitigation: reserve a small random sample (e.g. 10%) outside the priority queue for unbiased measurement.
- **No "post-create confirmation" mechanism.** When a new verification cites an existing approved item, only the new item gets confidence; the existing item doesn't receive a verification-count boost. This loses the "multiple analysts independently cited this" signal over time.
### Codex second opinion
- **Verdict:** Partial. External config + floors are good. **The decay framing is wrong, and there are bigger V1 bugs in the data flow.**
- **Blind spot we missed (significant):**
1. **LLM-controlled confidence.** `services/verification_detector/prompts.py:37` asks the LLM to return `base_confidence` as part of the JSON output, and `services/verification_detector/detector.py:187` stores it directly into the DB. **Confidence should not be LLM-controlled.** The current `compute_confidence()` exists but is bypassed for verification-detector items.
2. **Lost evidence.** The detector extracts `user_quote` from the LLM (the exact quote that constitutes the verification the most valuable signal!) and **discards it.** It is not stored anywhere. Without `user_quote` and `detection_type` persisted in DB, future Bayesian re-calibration has no raw material.
3. **Decay framing is wrong.** Confidence is being treated as "truth decays with age" (Ebbinghaus). Organizational facts usually change by **events, scope, or validity windows** not by smooth forgetting. The `valid_from/valid_until` columns already exist; the right model is "fact has a validity window", not "fact slowly fades".
- **Concrete alternative:**
1. Remove `base_confidence` from the LLM-required output in `services/verification_detector/schemas.py:30`, **or** ignore it on the read side. Compute confidence in code: `compute_confidence("user_verification", v["detection_type"])`.
2. Add an **evidence table**: `verification_evidence(id, item_id, source_user, source_ref, detection_type, user_quote, created_at)`. Persist `user_quote` and `detection_type` per-verification. Multiple evidence rows per item enables real "additional verifiers" boost computation (one user × one quote = one evidence row).
3. Split "evidence confidence" (signal strength from sources) from "freshness / review-due" (validity windows, explicit revocation). Don't conflate them in one number.
- **On power-law vs exponential:** secondary issue. Domain-specific volatility matters more use **hierarchical priors**: global `(source_type, detection_type)` priors plus per-domain freshness policy.
- **On feedback loops:** mitigate via random review sampling, holdout calibration sets, and **never use learned priors as the sole gate for admin visibility.** Codex agrees with our concern but emphasizes random sampling more strongly.
- **Severity:** medium. Bayesian/config overhaul can wait, but **LLM-controlled confidence and missing evidence should be fixed before V1.**
- **Confidence:** 88%.
### Plan (revised after Codex)
- **V1 must-fix (was missed):**
- Stop trusting LLM-returned `base_confidence`. Either drop it from `VERIFICATION_SCHEMA` or ignore it on insert and call `compute_confidence(...)` instead.
- Add `verification_evidence` table to persist `user_quote` + `detection_type` + source reference. This is also what makes "multi-user verification boost" actually computable post-create.
- V1.5: externalize numbers to `instance.yaml` (A) + exponential decay with per-source floor (B).
- V2: Bayesian metric surface (read-only) + per-domain decay policy. Auto-update behind feature flag with random-sample holdout.
**Severity:** **high** for V1 must-fix items above; medium for the rest.
---
## 4. Entity resolution v1
### Problem
`services/corporate_memory/entities.py` does case-insensitive **substring** match against a static registry. Two consequences:
- **False positives** on short tokens. A registry containing `"MD"` matches every occurrence of `"markdown"`, `"command"`, `"admin"`, `"medical"`. Severe noise.
- **False negatives** on synonyms. `"churn"` does not match `"attrition"`, `"customer loss"`, `"logo churn"`. Severe miss.
- No lemmatization, no co-occurrence, no confidence on the match itself.
### Our proposal
Three layers:
**1. V1.5 — word-boundary regex + canonical+aliases registry:**
```yaml
corporate_memory:
entities:
metrics:
- canonical: churn
aliases: [attrition, customer loss, logo churn]
- canonical: MRR
aliases: [monthly recurring revenue, monthly revenue]
```
```python
import re
def resolve_entities(content, title, registry):
text = f"{title} {content}"
matched = set()
for category, entries in registry.items():
for entry in entries:
patterns = [entry["canonical"]] + entry.get("aliases", [])
for p in patterns:
if re.search(rf"\b{re.escape(p)}\b", text, re.IGNORECASE):
matched.add(entry["canonical"]) # always store canonical
break
return sorted(matched)
```
Cost: ~30 LOC, no new dependencies, deterministic, eliminates false positives on short tokens.
**2. V2 — embedding-based fuzzy match.** Pre-compute embeddings for all canonical entities (cache). Per item: embed text, cosine to each entity embedding, threshold 0.75. Catches paraphrases the alias list misses. Voyage cost is negligible.
**3. V2.5 — LLM extraction for low-entity items, batch mode.** Items with <2 resolved entities go into a nightly batch LLM job that suggests candidates for admin curation. This is also how the synonym registry grows over time without manual curation.
### Open questions
- **Skip V1.5, go straight to embeddings?** Argument for: per-item cost is ~$0.0000002, negligible. Argument against: word-boundary fix is so cheap (~30 LOC) it's a no-brainer; embeddings pull in voyage_sdk + caching infrastructure that we don't need yet.
- **Synonym registry maintenance.** Who curates it? Does it become a Big Ball Of Mud? Mitigation: V2.5 LLM-suggest pipeline auto-grows it; admin reviews additions.
- **Hierarchical entities.** "EMEA Sales" is a sub-team of "Sales". "MRR Q3" is an instance of "MRR". Layered approach doesn't model this. Do we need it for V1? Probably not. For V2? Maybe depends on how often analysts ask drill-down questions.
### Codex second opinion
- **Verdict:** Partial. **Do not skip V1.5.** Embeddings are not a replacement for deterministic entity resolution.
- **Blind spot we missed:** embeddings are **bad at exact short entities, acronyms, product codes, metric names, and aliases where precision matters** (e.g. `MRR`, `NPS`, internal product codes). Substring embeddings would trade one class of false positives for a different class of *silent* errors that are harder to detect. Word-boundary regex with a maintained alias list is *more correct* than embeddings for these cases, not just cheaper.
- **Concrete alternative:** rebuild `entities.py:14` to use a **typed registry with stable canonical IDs**:
```yaml
metrics:
- id: metric.churn
canonical: churn
aliases: [attrition, customer loss, logo churn]
kind: metric
parent_id: null
blocked_terms: []
- id: org.team.sales.emea
canonical: EMEA Sales
kind: team
parent_id: org.team.sales
```
Replace substring matching at `entities.py:57` with escaped word-boundary regex, with **special rules for short aliases** (e.g. require leading + trailing whitespace or punctuation). Store canonical **IDs** (`metric.churn`) in `knowledge_items.entities`, not display strings display is a join. Use embeddings/LLM **only to suggest aliases or candidates for admin approval**, never to bypass deterministic resolution.
- **On hierarchies:** worth modeling minimally now. `parent_id` is sufficient. `EMEA Sales` rolls up to `Sales`. `MRR Q3` is "metric + time period", not a separate flat entity model it as `entity_ref + valid_from/valid_until`.
- **Severity:** medium. Not a V1 blocker by itself, but **definitely required before V1.5** if dedup or contradiction depend on entities (which our V1.5 plan does).
- **Confidence:** 86%.
### Plan (revised after Codex)
- V1: ship as-is, accept noise.
- V1.5: typed registry with canonical IDs, word-boundary regex, **store IDs not strings**, support `parent_id` for hierarchies. Use this for duplicate-candidate detection (Q1 V1.5 plan). ~2 days now (more scope than originally planned).
- V2+: embeddings + LLM-suggest pipeline for **alias growth**, not entity resolution. Admin approves auto-suggested aliases.
**Severity:** medium. Codex sharpened the design store canonical IDs, not strings; hierarchy is V1.5, not V2.
---
## 5. `is_personal` flag — suspected leak
### Code paths
`app/api/memory.py`:
- `POST /{item_id}/personal` (line ~196) only the contributor (`item.source_user == user.email`) can flag. Correct.
- `GET /api/memory` (line ~63) accepts query parameter `exclude_personal: bool = True`. Default excludes personal items, but the endpoint accepts `exclude_personal=False` from any authenticated user with no role check.
- `GET /api/memory/my-contributions` returns contributor's own items including personal ones. Correct.
### Concern
Any authenticated user can request:
```
GET /api/memory?exclude_personal=false
```
and receive items flagged `is_personal=true` by other users. The flag is meant to keep an item private to the contributor, but the list endpoint exposes it whenever the caller asks.
### Impact
`is_personal` was introduced for emergency-exit cases (the detector pulled out something private that shouldn't be team-wide). If any user can override the default, the flag provides false security. This is a confidentiality leak unless `is_personal` is purely an "admin convenience flag" but that's not how the toggle UI presents it.
### Proposed fix
Two options:
**Option A — kill the override entirely.** Remove the `exclude_personal` query parameter. Personal items are never visible via `GET /api/memory`. Contributors see them only via `/my-contributions`. Admins see them via a separate admin endpoint guarded by `Role.KM_ADMIN`.
**Option B — gate the override by role.** Keep the parameter, but require `Role.KM_ADMIN` to set it to `False`. Add a server-side check; non-admin requests with `exclude_personal=False` get 403 or are silently coerced to `True`.
Option A is simpler, safer, and matches the most likely product intent. Option B preserves a flexible audit endpoint for admins but is easier to misuse.
### Codex second opinion
- **Verdict:** **Y. This is a leak.**
- **Blind spot we missed (worse than the list path alone):**
- The public list endpoint passes caller-controlled `exclude_personal` into `repo.list_items()` at `app/api/memory.py:87`, and `repo.list_items()` only filters when that flag is true at `src/repositories/knowledge.py:113` caller controls it.
- **Search ignores `exclude_personal` entirely** via `app/api/memory.py:78` and `src/repositories/knowledge.py:119`. Any authenticated user can `GET /api/memory?search=foo` and see personal items unconditionally.
- **Direct item access** (`/{id}`, `/{id}/provenance`, vote endpoints) has no `is_personal` check either. If a user knows or guesses an item ID, they retrieve it.
- **Concrete alternative:**
1. For the public `GET /api/memory`, **force `exclude_personal=True`** unless `user.role in ("km_admin", "admin")`. Or simpler: always force true and leave personal review to admin endpoints.
2. Add `exclude_personal` support to `repo.search()` (currently it's only on `list_items()`).
3. Add a shared `_can_view_item(user, item)` check used by provenance, vote, and direct item actions. Personal items are visible only to the contributor + admins.
- **Severity:** **high. Must fix before merging V1.**
- **Confidence:** **99%.**
### Plan (revised after Codex)
- **V1 blocker.** Fix all three vectors (list, search, direct access), not just the list parameter.
- Force `exclude_personal=True` for non-admins on `GET /api/memory`.
- Plumb `exclude_personal` through `repo.search()`.
- Add `_can_view_item(user, item)` helper, apply on `/{id}`, `/{id}/provenance`, and any other direct-item endpoint.
- Add regression test that an authenticated non-contributor cannot retrieve another user's `is_personal` item via list, search, or direct GET.
**Severity:** **high.**
---
## Top issues to address before merging V1
(Initial pd list was three items. Codex review added two more high-severity findings we missed entirely. Updated list below.)
1. **`is_personal` leak in full breadth (Q5).** Fix list, search, and direct-item access. Force `exclude_personal=True` for non-admins on `GET /api/memory`; plumb the filter through `repo.search()`; add `_can_view_item(user, item)` for `/{id}`, `/{id}/provenance`, vote endpoints. Add regression test. **Severity: high. Confidence: 99%.**
2. **Contradiction detection is shipped but inert (Q2).** `detect_and_record()` is never called from `verification_detector.run()`. Either wire it (sync after `repo.create()` or enqueue for batch) or remove the V1 claim that contradictions are surfaced. Also fix the `OR → AND` bug in the SQL pre-filter at `src/repositories/knowledge.py:272-287` so `domain` is a top-level conjunct and keywords are an inner `OR` group. **Severity: high.**
3. **LLM-controlled confidence + lost evidence (Q3).** Detector trusts `base_confidence` from the LLM and discards `user_quote`. Drop `base_confidence` from `VERIFICATION_SCHEMA` (or ignore on insert), call `compute_confidence(...)` in code instead. Add `verification_evidence` table to persist `user_quote`, `detection_type`, `source_user`, `source_ref` per evidence row. Without this, "additional verifiers boost" is computable in theory only. **Severity: high.**
The rest (Q1 explicit duplicate hint, Q2 batch sweep, Q3 YAML externalize + exponential decay, Q4 typed registry) can land in V1.5 as planned.
### What V1.5 must own (consequential refinements, not blockers)
- Q1: `knowledge_item_relations` table for explicit duplicate candidate hints in admin queue.
- Q2: embedding pre-filter (top-k + threshold, calibrated against labeled pairs).
- Q3: externalize confidence to `instance.yaml`; exponential decay with per-source floor.
- Q4: typed entity registry with canonical IDs, word-boundary regex, hierarchical `parent_id`.
---
## Process notes
- Worktree: `/Users/padak/github/agnes-pabu-local-dev`, branch `pabu/local-dev` tracks `origin/pabu/local-dev`.
- Walkthrough done file-by-file with paired reading; second-opinion run via Codex CLI (model `gpt-5.5`, `model_reasoning_effort = xhigh`) over the same files.
- Document is intended for round-trip discussion: pd commits the first pass, ps reads, replies inline or in PR comments.