agnes-the-ai-analyst/docs/archive/pd-ps-comments.md
ZdenekSrotyr a48524509a
docs: consolidate and de-clutter the documentation tree (#306)
CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.
2026-05-14 18:54:22 +00:00

27 KiB
Raw Blame History

Corporate Memory V1 — review notes (pd × ps)

Branch: pabu/local-dev (worktree review) Date: 2026-04-25 Reviewers: pd (with Claude Code), independent second opinion from Codex (GPT-5.5, xhigh reasoning) Scope: verification flywheel + corporate memory V1 — services/verification_detector/, services/corporate_memory/, app/api/memory.py, schema v7→v8.

This document captures open questions and proposed changes raised during a walkthrough of the branch. It is not a blocker list — it is a starting point for a follow-up conversation. Severity ratings are pd's initial estimate; please push back where they feel off.


Context

The branch implements a "verification flywheel": session JSONLs are scanned by verification_detector, an LLM extracts corrections / confirmations / unprompted definitions, and the resulting facts feed knowledge_items with calibrated confidence. Admin governance (HITL) gates everything before approval. New schema columns track provenance (source_type, source_ref, confidence, domain, entities, valid_from/until, supersedes, sensitivity, is_personal) and two new tables (knowledge_contradictions, session_extraction_state).

The mechanism is sound. The notes below are about edges where V1 calibration choices may not survive contact with real data, and one suspected access-control issue.


1. Knowledge-item deduplication

Problem

services/verification_detector/detector.py derives the item ID as:

id = "kv_" + sha256(f"{title}:{content}").hexdigest()[:12]

Two analysts independently surfacing the same fact will almost never produce identical title and content strings — LLM phrasing varies even on the same underlying claim. So this hash collides ~never in practice. It is a deterministic ID, not a deduplication mechanism.

The branch has no other dedup safeguard at create time. Practical result: the admin review queue will grow with near-duplicates that humans must triage.

Our proposal

Do not attempt auto-dedup at create time. Instead, surface near-duplicates into the admin queue via three layers:

  1. V1.5 — entity-tag match: items sharing ≥N normalized entities become a "likely duplicate" candidate pair, exposed in the admin UI alongside the existing contradictions tab.
  2. V2 — embedding similarity: at create time, compute cosine against existing approved items in the same domain; threshold (e.g. ≥ 0.85) → flag as likely_duplicate_of: <id> for admin merge.
  3. The contradiction detector already surfaces "soft contradictions" — these can be repurposed to also catch near-duplicates with a single LLM judge call (same infra, different prompt).

Reasoning: auto-merge at create time is risky (false positives bury new nuance under stale items). Admin queue spam is the lesser evil; embedding pre-filter at V2 keeps admin load bounded.

Codex second opinion

Codex run: gpt-5.5, model_reasoning_effort = xhigh, executed against the same files in worktree.

  • Verdict: Partial. Don't auto-merge at create time, but pre-check likely duplicates before inserting into the admin queue (don't punt entirely to admin manual review).
  • Blind spot we missed: contradiction detection is not a duplicate detector, and in the current pipeline it is not wired into verification_detector.run() at all. Items are created at detector.py:178 but detect_and_record() is never called. The contradiction prompt explicitly says "specificity / different perspective is not a contradiction", so near-duplicates would not surface there even if it were wired.
  • Concrete alternative: add a relation table knowledge_item_relations(item_a_id, item_b_id, relation_type, score, resolved) near src/db.py:76. Add repo.create_relation() near src/repositories/knowledge.py:182. In detector.py:171, run duplicate-candidate lookup (entity-overlap is fine for V1.5; embeddings later) and attach relation_type='likely_duplicate' so the admin queue shows duplicate candidates explicitly.
  • Severity: medium. Must fix before broad historical backfill; acceptable for narrow V1 only if confirmations are limited and admins see explicit duplicate candidates.
  • Confidence: 90%.

Plan (revised after Codex)

  • V1: accept that duplicate hint is missing; flag this loudly in PR description so reviewers know what's deferred.
  • V1.5: add knowledge_item_relations table + entity-overlap-based duplicate suggestion. Surface in admin UI alongside contradictions. ~1 day work.
  • V2: embedding similarity at create time, gated by domain.

Severity: medium (was: medium). Codex agrees on default but raises the bar for V1.5 — duplicate candidates should be surfaced explicitly, not discovered ad-hoc by admin.


2. Contradiction detection — synchronous + SQL pre-filter vs Anthropic Batch API

Current

services/corporate_memory/contradiction.py runs synchronously per new item:

  1. Pre-filter (DuckDB): find candidates with same domain + keyword match on title.split() words >3 chars. Limit 10.
  2. LLM-as-judge (sync): Haiku prompt returns {contradicts, explanation, severity, suggested_resolution}.

The pre-filter has predictable recall problems: synonyms, paraphrases, cross-domain conflicts (a finance metric definition can contradict a data engineering definition of the same metric).

Proposed migration to Batch API

Anthropic's Batch API offers ~50% cost reduction with async SLA (≤24h, typically <1h). This is attractive because contradiction detection does not need real-time response — admins review queues, not push notifications.

Our proposal

Layered evolution:

  • V1: keep synchronous + SQL filter as-is. Ship.
  • V1.5: add embedding-based pre-filter. Voyage embeddings at ~$0.02/1M tokens are essentially free at our volume. Replace keyword filter with cosine(item, candidate) > 0.6 per domain; LLM judge unchanged. Catches paraphrases the keyword filter misses.
  • V2: switch the LLM judge phase to Batch API. Run a nightly sweep over pending × approved per domain. With 50% cost reduction and higher rate limits, we can afford O(N²) within domain shards (no pre-filter needed in Batch mode — let the model judge all pairs).

Open questions

  • Hidden Batch API costs beyond cost: dev experience (test cycles), observability (job tracking, retries), debug latency. Worth it?
  • Hybrid mode: keep sync for high-priority sources (admin_mandate, user corrections) and batch for bulk (session_transcript, low-confidence pending)? Or single mode for simplicity?
  • Embedding threshold: 0.6 is a guess. Calibrate against held-out labeled pairs from V1 data once we have them.

Codex second opinion

  • Verdict: Partial. Batch API is likely a V2 win for bulk sweeps, but not as the only mode.
  • Blind spot we missed (two of them, both severe):
    1. The SQL pre-filter is weaker than we described. It is not domain AND keyword; it is domain OR keywordsrc/repositories/knowledge.py:287 joins all conditions with OR. Combined with ORDER BY updated_at DESC LIMIT 10, recent same-domain noise crowds out the actual conflict.
    2. Detector-created items never invoke contradiction detection at all. detect_and_record() exists in services/corporate_memory/contradiction.py but is not called from services/verification_detector/detector.py. So V1 contradiction governance is effectively a stub.
  • Concrete alternative: first fix V1 before optimizing.
    1. In src/repositories/knowledge.py:272, build domain = ? as a top-level conjunct (AND) and the title/content keyword expansion as an inner OR group.
    2. In services/verification_detector/detector.py:178, after repo.create(), call contradiction.detect_and_record(extractor, item_dict, repo). Or enqueue it for nightly batch — but it must run somewhere.
    3. For V2, use one shared job model with two executors: sync for admin/manual/high-priority items, batch for session backfills and nightly sweeps. Don't fork prompt/candidate logic.
  • On embeddings threshold: cosine > 0.6 is arbitrary. Calibrate using labeled pairs (duplicate / contradiction / related-but-compatible / unrelated). Optimize for recall before LLM judging — target >95% recall with bounded candidate count. Use top-k plus threshold, not threshold alone.
  • Severity: high. If V1 claims contradiction governance, this needs fixing before merge — currently the feature is shipped but inert.
  • Confidence: 92%.

Plan (revised after Codex)

  • V1 must-fix (was missed):
    • Fix ORAND domain + (OR keywords) in find_contradiction_candidates.
    • Wire detect_and_record() from verification_detector.detector.run(). Or, if we don't want sync LLM cost in detector, enqueue items into a pending_contradiction_check table for a nightly batch.
    • Either of those, or remove the V1 claim that contradictions are surfaced.
  • V1.5: embedding pre-filter (top-k + threshold, calibrated against labeled pairs). Single shared job model.
  • V2: batch executor for nightly sweeps; sync executor for admin/manual; same prompt+candidate logic.

Severity: high (was: low for V1). Codex's nailed two bugs we missed entirely.


3. Confidence scoring — calibration, decay, feedback

Current

services/corporate_memory/confidence.py is a hand-rolled lookup:

  • _BASE_CONFIDENCE: hard-coded dict, e.g. user_verification.correction = 0.90, admin_mandate = 1.00, claude_local_md = 0.50.
  • _MODIFIER_EFFECTS: hard-coded modifiers (+0.05 per additional verifier, +0.20 for admin confirmation).
  • apply_decay(): linear 0.02 per month → everything reaches 0 at ~50 months including admin policies.

Three problems:

  1. Tuning requires deploy. The numbers are guesses; in real use we will discover that, say, user_verification.confirmation (currently 0.60) is overoptimistic. Each iteration = code change + deploy.
  2. Linear decay is wrong for admin policies. admin_mandate = 1.00 decaying linearly to 0 in 50 months is sematically incorrect — admin policies are policies, not "aging facts". They are explicitly revoked, not slowly forgotten.
  3. No feedback from admin actions. When an admin rejects an item, that signal is lost — we don't update the prior for (source_type, detection_type) based on observed approval rates.

Our proposal

Three layers:

A. Externalize all numbers to instance.yaml (V1.5, mandatory):

corporate_memory:
  confidence:
    base:
      user_verification.correction: 0.90
      user_verification.confirmation: 0.60
      user_verification.unprompted_definition: 0.90
      admin_mandate: 1.00
      claude_local_md: 0.50
      session_transcript: 0.50
    modifiers:
      additional_verifiers_per_user: 0.05
      admin_confirmed_bonus: 0.20
    decay:
      mode: exponential       # or "linear" for back-compat
      half_life_months: 12    # confidence halves every 12 months
      floor:
        admin_mandate: 0.50   # admin items never decay below 0.5
        user_verification.correction: 0.10
        claude_local_md: 0.0
        session_transcript: 0.0

B. Switch to exponential decay with per-source-type floor (V1.5):

def apply_decay(confidence, created_at, source_type, half_life_months=12, floor_per_source=None):
    age_months = ...
    decayed = confidence * (0.5 ** (age_months / half_life_months))
    floor = (floor_per_source or {}).get(source_type, 0.0)
    return max(decayed, floor)

C. Bayesian prior calibration from admin actions (V2):

Nightly: compute P(approve | source_type, detection_type) over the last N items per category. Surface as a metric in admin UI. Initially: human-edited config update. Later: gated auto-update with selection-bias mitigation (random sampling holdout).

Open questions

  • Power-law vs exponential decay. Ebbinghaus forgetting curve research suggests power-law may match human knowledge half-life better. Does that translate to organizational knowledge? Probably yes for "memorized facts", probably no for "policies that are explicitly maintained".
  • Per-domain calibration. Finance metrics churn quarterly; engineering conventions persist for years. Should decay.half_life_months be per-domain, not just per-source-type?
  • Selection bias in feedback loop. If priors gate which items admins see (sorted by confidence), and we update priors from approval rate, low-confidence items rarely get reviewed → priors stay biased. Mitigation: reserve a small random sample (e.g. 10%) outside the priority queue for unbiased measurement.
  • No "post-create confirmation" mechanism. When a new verification cites an existing approved item, only the new item gets confidence; the existing item doesn't receive a verification-count boost. This loses the "multiple analysts independently cited this" signal over time.

Codex second opinion

  • Verdict: Partial. External config + floors are good. The decay framing is wrong, and there are bigger V1 bugs in the data flow.
  • Blind spot we missed (significant):
    1. LLM-controlled confidence. services/verification_detector/prompts.py:37 asks the LLM to return base_confidence as part of the JSON output, and services/verification_detector/detector.py:187 stores it directly into the DB. Confidence should not be LLM-controlled. The current compute_confidence() exists but is bypassed for verification-detector items.
    2. Lost evidence. The detector extracts user_quote from the LLM (the exact quote that constitutes the verification — the most valuable signal!) and discards it. It is not stored anywhere. Without user_quote and detection_type persisted in DB, future Bayesian re-calibration has no raw material.
    3. Decay framing is wrong. Confidence is being treated as "truth decays with age" (Ebbinghaus). Organizational facts usually change by events, scope, or validity windows — not by smooth forgetting. The valid_from/valid_until columns already exist; the right model is "fact has a validity window", not "fact slowly fades".
  • Concrete alternative:
    1. Remove base_confidence from the LLM-required output in services/verification_detector/schemas.py:30, or ignore it on the read side. Compute confidence in code: compute_confidence("user_verification", v["detection_type"]).
    2. Add an evidence table: verification_evidence(id, item_id, source_user, source_ref, detection_type, user_quote, created_at). Persist user_quote and detection_type per-verification. Multiple evidence rows per item enables real "additional verifiers" boost computation (one user × one quote = one evidence row).
    3. Split "evidence confidence" (signal strength from sources) from "freshness / review-due" (validity windows, explicit revocation). Don't conflate them in one number.
  • On power-law vs exponential: secondary issue. Domain-specific volatility matters more — use hierarchical priors: global (source_type, detection_type) priors plus per-domain freshness policy.
  • On feedback loops: mitigate via random review sampling, holdout calibration sets, and never use learned priors as the sole gate for admin visibility. Codex agrees with our concern but emphasizes random sampling more strongly.
  • Severity: medium. Bayesian/config overhaul can wait, but LLM-controlled confidence and missing evidence should be fixed before V1.
  • Confidence: 88%.

Plan (revised after Codex)

  • V1 must-fix (was missed):
    • Stop trusting LLM-returned base_confidence. Either drop it from VERIFICATION_SCHEMA or ignore it on insert and call compute_confidence(...) instead.
    • Add verification_evidence table to persist user_quote + detection_type + source reference. This is also what makes "multi-user verification boost" actually computable post-create.
  • V1.5: externalize numbers to instance.yaml (A) + exponential decay with per-source floor (B).
  • V2: Bayesian metric surface (read-only) + per-domain decay policy. Auto-update behind feature flag with random-sample holdout.

Severity: high for V1 must-fix items above; medium for the rest.


4. Entity resolution v1

Problem

services/corporate_memory/entities.py does case-insensitive substring match against a static registry. Two consequences:

  • False positives on short tokens. A registry containing "MD" matches every occurrence of "markdown", "command", "admin", "medical". Severe noise.
  • False negatives on synonyms. "churn" does not match "attrition", "customer loss", "logo churn". Severe miss.
  • No lemmatization, no co-occurrence, no confidence on the match itself.

Our proposal

Three layers:

1. V1.5 — word-boundary regex + canonical+aliases registry:

corporate_memory:
  entities:
    metrics:
      - canonical: churn
        aliases: [attrition, customer loss, logo churn]
      - canonical: MRR
        aliases: [monthly recurring revenue, monthly revenue]
import re
def resolve_entities(content, title, registry):
    text = f"{title} {content}"
    matched = set()
    for category, entries in registry.items():
        for entry in entries:
            patterns = [entry["canonical"]] + entry.get("aliases", [])
            for p in patterns:
                if re.search(rf"\b{re.escape(p)}\b", text, re.IGNORECASE):
                    matched.add(entry["canonical"])  # always store canonical
                    break
    return sorted(matched)

Cost: ~30 LOC, no new dependencies, deterministic, eliminates false positives on short tokens.

2. V2 — embedding-based fuzzy match. Pre-compute embeddings for all canonical entities (cache). Per item: embed text, cosine to each entity embedding, threshold 0.75. Catches paraphrases the alias list misses. Voyage cost is negligible.

3. V2.5 — LLM extraction for low-entity items, batch mode. Items with <2 resolved entities go into a nightly batch LLM job that suggests candidates for admin curation. This is also how the synonym registry grows over time without manual curation.

Open questions

  • Skip V1.5, go straight to embeddings? Argument for: per-item cost is ~$0.0000002, negligible. Argument against: word-boundary fix is so cheap (~30 LOC) it's a no-brainer; embeddings pull in voyage_sdk + caching infrastructure that we don't need yet.
  • Synonym registry maintenance. Who curates it? Does it become a Big Ball Of Mud? Mitigation: V2.5 LLM-suggest pipeline auto-grows it; admin reviews additions.
  • Hierarchical entities. "EMEA Sales" is a sub-team of "Sales". "MRR Q3" is an instance of "MRR". Layered approach doesn't model this. Do we need it for V1? Probably not. For V2? Maybe — depends on how often analysts ask drill-down questions.

Codex second opinion

  • Verdict: Partial. Do not skip V1.5. Embeddings are not a replacement for deterministic entity resolution.
  • Blind spot we missed: embeddings are bad at exact short entities, acronyms, product codes, metric names, and aliases where precision matters (e.g. MRR, NPS, internal product codes). Substring → embeddings would trade one class of false positives for a different class of silent errors that are harder to detect. Word-boundary regex with a maintained alias list is more correct than embeddings for these cases, not just cheaper.
  • Concrete alternative: rebuild entities.py:14 to use a typed registry with stable canonical IDs:
    metrics:
      - id: metric.churn
        canonical: churn
        aliases: [attrition, customer loss, logo churn]
        kind: metric
        parent_id: null
        blocked_terms: []
      - id: org.team.sales.emea
        canonical: EMEA Sales
        kind: team
        parent_id: org.team.sales
    
    Replace substring matching at entities.py:57 with escaped word-boundary regex, with special rules for short aliases (e.g. require leading + trailing whitespace or punctuation). Store canonical IDs (metric.churn) in knowledge_items.entities, not display strings — display is a join. Use embeddings/LLM only to suggest aliases or candidates for admin approval, never to bypass deterministic resolution.
  • On hierarchies: worth modeling minimally now. parent_id is sufficient. EMEA Sales rolls up to Sales. MRR Q3 is "metric + time period", not a separate flat entity — model it as entity_ref + valid_from/valid_until.
  • Severity: medium. Not a V1 blocker by itself, but definitely required before V1.5 if dedup or contradiction depend on entities (which our V1.5 plan does).
  • Confidence: 86%.

Plan (revised after Codex)

  • V1: ship as-is, accept noise.
  • V1.5: typed registry with canonical IDs, word-boundary regex, store IDs not strings, support parent_id for hierarchies. Use this for duplicate-candidate detection (Q1 V1.5 plan). ~2 days now (more scope than originally planned).
  • V2+: embeddings + LLM-suggest pipeline for alias growth, not entity resolution. Admin approves auto-suggested aliases.

Severity: medium. Codex sharpened the design — store canonical IDs, not strings; hierarchy is V1.5, not V2.


5. is_personal flag — suspected leak

Code paths

app/api/memory.py:

  • POST /{item_id}/personal (line ~196) — only the contributor (item.source_user == user.email) can flag. Correct.
  • GET /api/memory (line ~63) — accepts query parameter exclude_personal: bool = True. Default excludes personal items, but the endpoint accepts exclude_personal=False from any authenticated user with no role check.
  • GET /api/memory/my-contributions — returns contributor's own items including personal ones. Correct.

Concern

Any authenticated user can request:

GET /api/memory?exclude_personal=false

…and receive items flagged is_personal=true by other users. The flag is meant to keep an item private to the contributor, but the list endpoint exposes it whenever the caller asks.

Impact

is_personal was introduced for emergency-exit cases (the detector pulled out something private that shouldn't be team-wide). If any user can override the default, the flag provides false security. This is a confidentiality leak unless is_personal is purely an "admin convenience flag" — but that's not how the toggle UI presents it.

Proposed fix

Two options:

Option A — kill the override entirely. Remove the exclude_personal query parameter. Personal items are never visible via GET /api/memory. Contributors see them only via /my-contributions. Admins see them via a separate admin endpoint guarded by Role.KM_ADMIN.

Option B — gate the override by role. Keep the parameter, but require Role.KM_ADMIN to set it to False. Add a server-side check; non-admin requests with exclude_personal=False get 403 or are silently coerced to True.

Option A is simpler, safer, and matches the most likely product intent. Option B preserves a flexible audit endpoint for admins but is easier to misuse.

Codex second opinion

  • Verdict: Y. This is a leak.
  • Blind spot we missed (worse than the list path alone):
    • The public list endpoint passes caller-controlled exclude_personal into repo.list_items() at app/api/memory.py:87, and repo.list_items() only filters when that flag is true at src/repositories/knowledge.py:113 — caller controls it.
    • Search ignores exclude_personal entirely via app/api/memory.py:78 and src/repositories/knowledge.py:119. Any authenticated user can GET /api/memory?search=foo and see personal items unconditionally.
    • Direct item access (/{id}, /{id}/provenance, vote endpoints) has no is_personal check either. If a user knows or guesses an item ID, they retrieve it.
  • Concrete alternative:
    1. For the public GET /api/memory, force exclude_personal=True unless user.role in ("km_admin", "admin"). Or simpler: always force true and leave personal review to admin endpoints.
    2. Add exclude_personal support to repo.search() (currently it's only on list_items()).
    3. Add a shared _can_view_item(user, item) check used by provenance, vote, and direct item actions. Personal items are visible only to the contributor + admins.
  • Severity: high. Must fix before merging V1.
  • Confidence: 99%.

Plan (revised after Codex)

  • V1 blocker. Fix all three vectors (list, search, direct access), not just the list parameter.
    • Force exclude_personal=True for non-admins on GET /api/memory.
    • Plumb exclude_personal through repo.search().
    • Add _can_view_item(user, item) helper, apply on /{id}, /{id}/provenance, and any other direct-item endpoint.
  • Add regression test that an authenticated non-contributor cannot retrieve another user's is_personal item via list, search, or direct GET.

Severity: high.


Top issues to address before merging V1

(Initial pd list was three items. Codex review added two more high-severity findings we missed entirely. Updated list below.)

  1. is_personal leak in full breadth (Q5). Fix list, search, and direct-item access. Force exclude_personal=True for non-admins on GET /api/memory; plumb the filter through repo.search(); add _can_view_item(user, item) for /{id}, /{id}/provenance, vote endpoints. Add regression test. Severity: high. Confidence: 99%.
  2. Contradiction detection is shipped but inert (Q2). detect_and_record() is never called from verification_detector.run(). Either wire it (sync after repo.create() or enqueue for batch) or remove the V1 claim that contradictions are surfaced. Also fix the OR → AND bug in the SQL pre-filter at src/repositories/knowledge.py:272-287 so domain is a top-level conjunct and keywords are an inner OR group. Severity: high.
  3. LLM-controlled confidence + lost evidence (Q3). Detector trusts base_confidence from the LLM and discards user_quote. Drop base_confidence from VERIFICATION_SCHEMA (or ignore on insert), call compute_confidence(...) in code instead. Add verification_evidence table to persist user_quote, detection_type, source_user, source_ref per evidence row. Without this, "additional verifiers boost" is computable in theory only. Severity: high.

The rest (Q1 explicit duplicate hint, Q2 batch sweep, Q3 YAML externalize + exponential decay, Q4 typed registry) can land in V1.5 as planned.

What V1.5 must own (consequential refinements, not blockers)

  • Q1: knowledge_item_relations table for explicit duplicate candidate hints in admin queue.
  • Q2: embedding pre-filter (top-k + threshold, calibrated against labeled pairs).
  • Q3: externalize confidence to instance.yaml; exponential decay with per-source floor.
  • Q4: typed entity registry with canonical IDs, word-boundary regex, hierarchical parent_id.

Process notes

  • Worktree: /Users/padak/github/agnes-pabu-local-dev, branch pabu/local-dev tracks origin/pabu/local-dev.
  • Walkthrough done file-by-file with paired reading; second-opinion run via Codex CLI (model gpt-5.5, model_reasoning_effort = xhigh) over the same files.
  • Document is intended for round-trip discussion: pd commits the first pass, ps reads, replies inline or in PR comments.