agnes-the-ai-analyst/src/fts.py
ZdenekSrotyr c5d67faad2
feat(memory): DuckDB FTS BM25 search for knowledge items (#121) (#326)
* feat(memory): DuckDB FTS BM25 search for knowledge items (#121)

Replaces `title ILIKE '%q%' OR content ILIKE '%q%'` ranked by
insertion order with BM25 relevance ranking via the DuckDB `fts`
extension. Czech queries like `cesky` match documents containing
`česky` (`strip_accents=1` + `lower=1`).
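
As a rough illustration of what case and accent folding means for matching, the sketch below normalises text with Python's standard `unicodedata` module. It is not DuckDB's implementation of `strip_accents` / `lower`; the `fold` helper is a hypothetical name used only for this example.

```python
import unicodedata


def fold(text: str) -> str:
    """Lower-case text and drop combining accent marks (illustrative only)."""
    # NFKD splits "č" into "c" plus a combining caron (U+030C).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Query and document reduce to the same token after folding.
assert fold("cesky") == fold("Česky") == "cesky"
```

Folding both the indexed text and the query the same way is what lets an unaccented query reach an accented document.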

Architecture:
- src/fts.py — ensure_fts_loaded / ensure_knowledge_fts_index helpers.
  The extension is per-connection (INSTALL persisted at engine level,
  LOAD per-conn). Both helpers are idempotent and soft-fail on
  unavailability with a logged WARNING.
- Schema v47 (_v46_to_v47) — builds the initial BM25 index over
  knowledge_items(title, content) keyed by id. Migration is
  best-effort against ANY exception (not just duckdb.Error) so the
  schema bump cannot get stuck on v46 if a non-DuckDB error escapes
  the helper.
- KnowledgeRepository.search — FTS-or-ILIKE dichotomy with execute-
  time fallback. Same filter surface (statuses / category / domain /
  source_type / personal / audience / dismissed) either way.
  ensure_fts_loaded() returning True only guarantees the extension is
  loadable, NOT that the index exists — migration soft-fail or a
  concurrent overwrite=1 rebuild's drop-then-create window leaves the
  extension loaded but the index missing. The BM25 execute is wrapped
  in try/except duckdb.Error → ILIKE retry so transient failures
  cannot 500 the /api/memory?search= endpoint.
- KnowledgeRepository.count_items — mirrors the same FTS-or-ILIKE
  decision tree plus the execute-time fallback so the count always
  matches the paginated result set.
- Per-mutation rebuild — create and title-or-content update rebuild
  the index via overwrite=1 PRAGMA. Status flips skip (token stream
  unchanged).
- app/main.py lifespan rebuilds once at boot as a safety net for
  instances already on v47 across restarts.
- bm25_score column shape: ILIKE fallback now selects
  `NULL AS bm25_score` so the result column set matches the FTS
  path. Consumers can read the score uniformly; absence of relevance
  ranking is signalled by the column being None everywhere, not
  missing.
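
The execute-time fallback and the uniform `bm25_score` column described above can be sketched as follows. This is a minimal illustration, not the repository code: `search_with_fallback`, `run_bm25`, `run_ilike`, and `db_error` are hypothetical names, and the two callables stand in for the real ranked and substring queries.

```python
from typing import Any, Callable, Dict, List, Sequence, Type

Row = Dict[str, Any]


def search_with_fallback(
    run_bm25: Callable[[str], Sequence[Row]],
    run_ilike: Callable[[str], Sequence[Row]],
    query: str,
    fts_ready: bool,
    db_error: Type[BaseException] = Exception,
) -> List[Row]:
    """Try the ranked path first; retry with the substring path on failure."""
    if fts_ready:
        try:
            return list(run_bm25(query))
        except db_error:
            # Extension loaded but index missing (e.g. mid-rebuild):
            # fall through to the substring path instead of failing.
            pass
    # Keep the column set identical to the ranked path: bm25_score is None.
    return [{**row, "bm25_score": None} for row in run_ilike(query)]
```

In the real repository `db_error` would presumably be `duckdb.Error`; applying the same decision in both `search` and `count_items` is what keeps the count aligned with the paginated rows.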

Tests in tests/test_knowledge_fts_search.py (9 tests):
- BM25 multi-term match set + adversarial-review fix asserting
  higher-density doc ranks first (skipped if extension unavailable).
- bm25_score column attached when extension available.
- ILIKE fallback path on search + count_items via patched
  ensure_fts_loaded → False; bm25_score is None on this path.
- Adversarial-review fix: search and count_items also fall back when
  the extension is loaded but the index is missing (simulated via
  drop_fts_index PRAGMA — the exact production failure mode the
  fallback guards against).
- Index rebuild on create (new item searchable immediately).
- Title update re-surfaces row under new term, drops old.
- Czech-diacritic round-trip (cesky query → česky doc).
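
The density-ranking assertion above relies on how BM25 weighs term frequency against document length. The toy scorer below is a sketch of the textbook formula with a non-negative IDF variant and the commonly used defaults `k1 = 1.2`, `b = 0.75`; DuckDB's exact scoring may differ in detail, and `bm25_scores` is a hypothetical helper, not part of the codebase.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    terms = query.lower().split()
    # Document frequency: in how many documents each query term occurs.
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in terms}

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for t in terms:
            tf = counts[t]
            if tf == 0:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
            # Term frequency saturates via k1; b penalises long documents.
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "duckdb search ranks duckdb search results",
    "duckdb is a database with many features and one search box for everything else",
    "unrelated note about lunch",
]
scores = bm25_scores("duckdb search", docs)
# The short, term-dense document outranks the long, sparse one.
assert scores[0] > scores[1] > scores[2] == 0.0
```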

Pinned schema-version asserts bumped 46 → 47 (test_db_schema_version,
test_home_stats, test_schema_v42_migration, test_schema_v46_migration).

Closes #121.

* release: 0.54.20 — Corporate Memory BM25 search + All-Items bulk-edit batch bar
2026-05-15 20:10:59 +02:00

"""DuckDB FTS extension helpers for knowledge-item BM25 search (issue #121).

The extension is per-connection: ``INSTALL fts`` is persisted at the engine
level, ``LOAD fts`` must run on every fresh DuckDB connection. The index
over ``knowledge_items`` is a *static snapshot*: DuckDB doesn't track
base-table INSERT / UPDATE / DELETE automatically.

We rebuild the whole index after each create and each title-or-content
update, plus once at application boot (cheap at corpus sizes below a few
thousand rows; acceptance bound is sub-100ms), and fall back to ``ILIKE``
when the extension can't be loaded or the index is missing. Offline
installs and sandboxed CI runners that block extension downloads must
still serve the search box, just without relevance ranking.

``strip_accents=1`` lets queries like ``cesky`` match documents
containing ``česky``, the Czech-diacritics acceptance from #121.
"""

from __future__ import annotations

import logging

import duckdb

logger = logging.getLogger(__name__)


def ensure_fts_loaded(conn: duckdb.DuckDBPyConnection) -> bool:
    """``INSTALL`` + ``LOAD`` the DuckDB ``fts`` extension on ``conn``.

    Idempotent: re-running on a connection that already has the extension
    loaded is a no-op. Returns ``True`` on success, ``False`` on any
    DuckDB error (network-blocked install, sandboxed CI runner without
    extension repo access, etc.). Callers should fall back to ``ILIKE``
    on ``False``.
    """
    try:
        conn.execute("INSTALL fts")
        conn.execute("LOAD fts")
        return True
    except duckdb.Error as e:
        logger.warning(
            "DuckDB fts extension unavailable; knowledge search will fall back to ILIKE: %s",
            e,
        )
        return False


def ensure_knowledge_fts_index(conn: duckdb.DuckDBPyConnection) -> bool:
    """Create (or rebuild) the BM25 FTS index over ``knowledge_items``.

    The index covers ``title`` and ``content``, keyed by ``id``.
    ``overwrite=1`` makes the call idempotent: if the index already
    exists it's dropped and rebuilt from the current snapshot, which is
    how we keep it in sync with INSERT/UPDATE/DELETE without per-row
    update hooks (DuckDB FTS doesn't ship those).

    ``strip_accents=1`` + ``lower=1`` give us case- and diacritic-
    insensitive matching out of the box (``cesky`` → ``česky``).

    Returns ``True`` if the index is now usable, ``False`` if either
    the extension or the PRAGMA call failed. Falsy return is the signal
    for ``KnowledgeRepository.search`` to use the ILIKE path.
    """
    if not ensure_fts_loaded(conn):
        return False
    try:
        conn.execute(
            "PRAGMA create_fts_index("
            "'main.knowledge_items', 'id', 'title', 'content', "
            "strip_accents=1, lower=1, overwrite=1)"
        )
        return True
    except duckdb.Error as e:
        logger.warning(
            "Failed to (re)create FTS index on knowledge_items; falling back to ILIKE: %s",
            e,
        )
        return False