* feat(memory): DuckDB FTS BM25 search for knowledge items (#121)

Replaces `title ILIKE '%q%' OR content ILIKE '%q%'` ranked by insertion
order with BM25 relevance ranking via the DuckDB `fts` extension. Czech
queries like `cesky` match documents containing `česky`
(`strip_accents=1` + `lower=1`).

Architecture:

- src/fts.py: ensure_fts_loaded / ensure_knowledge_fts_index helpers.
  The extension is per-connection (INSTALL persisted at engine level,
  LOAD per-conn). Both helpers are idempotent and soft-fail on
  unavailability with a logged WARNING.
- Schema v47 (_v46_to_v47): builds the initial BM25 index over
  knowledge_items(title, content) keyed by id. Migration is best-effort
  against ANY exception (not just duckdb.Error) so the schema bump
  cannot get stuck on v46 if a non-DuckDB error escapes the helper.
- KnowledgeRepository.search: FTS-or-ILIKE dichotomy with execute-time
  fallback. Same filter surface (statuses / category / domain /
  source_type / personal / audience / dismissed) either way.
  ensure_fts_loaded() returning True only guarantees the extension is
  loadable, NOT that the index exists: a migration soft-fail or the
  drop-then-create window of a concurrent overwrite=1 rebuild leaves
  the extension loaded but the index missing. The BM25 execute is
  wrapped in try/except duckdb.Error → ILIKE retry so transient
  failures cannot 500 the /api/memory?search= endpoint.
- KnowledgeRepository.count_items: mirrors the same FTS-or-ILIKE
  decision tree plus the execute-time fallback so the count always
  matches the paginated result set.
- Per-mutation rebuild: create and title-or-content update rebuild the
  index via overwrite=1 PRAGMA. Status flips skip (token stream
  unchanged).
- app/main.py lifespan rebuilds once at boot as a safety net for
  instances already on v47 across restarts.
- bm25_score column shape: ILIKE fallback now selects
  `NULL AS bm25_score` so the result column set matches the FTS path.
  Consumers can read the score uniformly; absence of relevance ranking
  is signalled by the column being None everywhere, not missing.

Tests in tests/test_knowledge_fts_search.py (9 tests):

- BM25 multi-term match set + adversarial-review fix asserting
  higher-density doc ranks first (skipped if extension unavailable).
- bm25_score column attached when extension available.
- ILIKE fallback path on search + count_items via patched
  ensure_fts_loaded → False; bm25_score is None on this path.
- Adversarial-review fix: search and count_items also fall back when
  the extension is loaded but the index is missing (simulated via
  drop_fts_index PRAGMA, the exact production failure mode the fallback
  guards against).
- Index rebuild on create (new item searchable immediately).
- Title update re-surfaces row under new term, drops old.
- Czech-diacritic round-trip (cesky query → česky doc).

Pinned schema-version asserts bumped 46 → 47 (test_db_schema_version,
test_home_stats, test_schema_v42_migration, test_schema_v46_migration).

Closes #121.

* release: 0.54.20 - Corporate Memory BM25 search + All-Items bulk-edit batch bar
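The execute-time fallback described above for KnowledgeRepository.search can be sketched roughly as follows. This is a minimal illustration, not the repository code: `search_with_fallback`, `run_bm25`, `run_ilike` and the `recoverable` parameter are hypothetical names, `RuntimeError` stands in for `duckdb.Error`, and the `bm25_score` column is filled in Python here, whereas the commit does it in SQL with `NULL AS bm25_score`.

```python
from __future__ import annotations

import logging
from typing import Any, Callable, Dict, List, Tuple, Type

logger = logging.getLogger(__name__)

Row = Dict[str, Any]


def search_with_fallback(
    fts_loaded: bool,
    run_bm25: Callable[[], List[Row]],
    run_ilike: Callable[[], List[Row]],
    # Hypothetical stand-in: real code would pass (duckdb.Error,) here.
    recoverable: Tuple[Type[BaseException], ...] = (RuntimeError,),
) -> List[Row]:
    """Try the BM25 query first; retry with ILIKE if it is unavailable or fails."""
    if fts_loaded:
        try:
            return run_bm25()
        except recoverable as exc:
            # Extension loaded but index missing or mid-rebuild: degrade, don't fail.
            logger.warning("BM25 query failed; retrying with ILIKE: %s", exc)
    # Fallback rows carry bm25_score=None so both paths expose the same columns.
    return [{**row, "bm25_score": None} for row in run_ilike()]


def broken_bm25() -> List[Row]:
    # Simulates "extension loaded, index missing".
    raise RuntimeError("fts index missing")


def ilike_rows() -> List[Row]:
    return [{"id": 1, "title": "česky"}]


rows = search_with_fallback(True, broken_bm25, ilike_rows)
```

Passing the two query paths as callables keeps the decision tree in one place, which is one way a count query could reuse the same branching and stay consistent with the paginated result set.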
78 lines
2.9 KiB
Python
"""DuckDB FTS extension helpers for knowledge-item BM25 search (issue #121).

The extension is per-connection: ``INSTALL fts`` is persisted at the engine
level, ``LOAD fts`` must run on every fresh DuckDB connection. The index
over ``knowledge_items`` is a *static snapshot*: DuckDB doesn't track
base-table INSERT / UPDATE / DELETE automatically.

We rebuild on demand inside ``search()`` (cheap at corpus sizes
< a few thousand rows; acceptance bound is sub-100ms) and fall back to
``ILIKE`` when the extension can't be loaded: offline installs and
sandboxed CI runners that block extension downloads must still serve
the search box, just without relevance ranking.

``strip_accents=1`` lets queries like ``cesky`` match documents
containing ``česky`` (the Czech-diacritics acceptance from #121).
"""

from __future__ import annotations

import logging

import duckdb

logger = logging.getLogger(__name__)


def ensure_fts_loaded(conn: duckdb.DuckDBPyConnection) -> bool:
    """``INSTALL`` + ``LOAD`` the DuckDB ``fts`` extension on ``conn``.

    Idempotent: re-running on a connection that already has the extension
    loaded is a no-op. Returns ``True`` on success, ``False`` on any
    DuckDB error (network-blocked install, sandboxed CI runner without
    extension repo access, etc.). Callers should fall back to ``ILIKE``
    on ``False``.
    """
    try:
        conn.execute("INSTALL fts")
        conn.execute("LOAD fts")
        return True
    except duckdb.Error as e:
        logger.warning(
            "DuckDB fts extension unavailable; knowledge search will fall back to ILIKE: %s",
            e,
        )
        return False


def ensure_knowledge_fts_index(conn: duckdb.DuckDBPyConnection) -> bool:
    """Create (or rebuild) the BM25 FTS index over ``knowledge_items``.

    The index covers ``title`` and ``content``, keyed by ``id``.
    ``overwrite=1`` makes the call idempotent: if the index already
    exists it's dropped and rebuilt from the current snapshot, which is
    how we keep it in sync with INSERT/UPDATE/DELETE without per-row
    update hooks (DuckDB FTS doesn't ship those).

    ``strip_accents=1`` + ``lower=1`` give us case- and diacritic-
    insensitive matching out of the box (``cesky`` → ``česky``).

    Returns ``True`` if the index is now usable, ``False`` if either
    the extension or the PRAGMA call failed. Falsy return is the signal
    for ``KnowledgeRepository.search`` to use the ILIKE path.
    """
    if not ensure_fts_loaded(conn):
        return False
    try:
        conn.execute(
            "PRAGMA create_fts_index("
            "'main.knowledge_items', 'id', 'title', 'content', "
            "strip_accents=1, lower=1, overwrite=1)"
        )
        return True
    except duckdb.Error as e:
        logger.warning(
            "Failed to (re)create FTS index on knowledge_items; falling back to ILIKE: %s",
            e,
        )
        return False
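
To make the effect of ``strip_accents=1`` + ``lower=1`` concrete, here is a small standard-library approximation of the normalisation those options imply. It is only an illustrative sketch: ``fold``, ``tokens`` and ``matches`` are hypothetical helpers, whitespace splitting stands in for the extension's tokenizer, and nothing here reflects DuckDB's actual implementation, stemming, or BM25 scoring.

```python
from __future__ import annotations

import unicodedata


def fold(text: str) -> str:
    """Lowercase, then drop combining marks left by NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokens(text: str) -> set[str]:
    # Whitespace tokenisation is an assumption made for the example only.
    return set(fold(text).split())


def matches(query: str, document: str) -> bool:
    """True when every folded query token occurs among the folded document tokens."""
    return tokens(query) <= tokens(document)


folded = fold("Česky")
hit = matches("cesky", "Mluvíme česky")
miss = matches("cesky", "Mluvíme anglicky")
```

Because both the query and the document pass through the same folding step, an unaccented query and an accented document end up with identical tokens, which is the property the Czech-diacritics acceptance check relies on.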