
fix: avoid ZeroDivisionError in BM25 retrieval on a tokenless corpus #11619

Open
santino18727-debug wants to merge 1 commit into deepset-ai:main from santino18727-debug:fix/bm25-empty-corpus-zerodivision-11598

Conversation

@santino18727-debug


Related Issues

Fixes #11598

Proposed Changes

InMemoryDocumentStore.bm25_retrieval filters out documents whose content is None, but empty strings pass that filter. When every stored document has empty content, the corpus has an empty BM25 vocabulary and an average document length of zero, so all three algorithms (BM25Okapi, BM25L, BM25Plus) divide by zero at query time. BM25Okapi additionally divides by the vocabulary size, which is also zero, when computing eps.

This happens in practice when an ingestion pipeline produces empty chunks (e.g. a converter returning empty strings for scanned/blank pages): the writes succeed and the store only blows up later, at query time.
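The failure mode described above can be illustrated without Haystack. The following is a minimal sketch, not the store's actual code: `tokenize` and `length_norm` are hypothetical helpers that stand in for the length-normalisation term shared by the BM25 variants, assuming whitespace tokenization.

```python
# Illustrative stand-in for BM25 length normalisation; NOT Haystack's implementation.
def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def length_norm(doc_len: int, avg_doc_len: float, b: float = 0.75) -> float:
    # BM25 divides each document's length by the corpus average length.
    return 1 - b + b * doc_len / avg_doc_len


# Empty strings are not None, so they survive a `content is not None` filter.
corpus = ["", "", ""]
doc_lens = [len(tokenize(text)) for text in corpus]
avg_doc_len = sum(doc_lens) / len(corpus)  # every document has zero tokens

try:
    length_norm(doc_lens[0], avg_doc_len)
    raised = False
except ZeroDivisionError:
    # Reached because avg_doc_len is zero.
    raised = True
```

Writing the documents never touches this term, which is why the error only surfaces once a query triggers scoring.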

This guards the scoring step: when the average document length is zero, every candidate is scored 0.0. The existing non-positive-score handling then produces the expected results — unscaled BM25Okapi returns the documents with score 0.0, while BM25L/BM25Plus return an empty list. Corpora with at least one non-empty document are unaffected.
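A rough sketch of the guard and its interaction with non-positive-score filtering is shown here. It is a toy BM25-style scorer written for illustration, not the code in this PR: `score_candidates`, `select`, and the `keep_non_positive` flag are hypothetical names, and the flag only mimics the described behaviour (unscaled BM25Okapi keeps zero scores, BM25L/BM25Plus drop them).

```python
import math


def score_candidates(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Toy BM25-style scorer with a zero-average-length guard (illustrative only)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs if n_docs else 0.0
    if avg_len == 0:
        # Tokenless corpus: nothing can match, so score every candidate 0.0
        # instead of dividing by the zero average length below.
        return [0.0] * n_docs
    scores = []
    for doc in docs_tokens:
        score = 0.0
        for term in query_tokens:
            tf = doc.count(term)
            df = sum(1 for other in docs_tokens if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


def select(scores, keep_non_positive):
    # Hypothetical filter: keep (index, score) pairs, optionally dropping scores <= 0.
    return [(i, s) for i, s in enumerate(scores) if keep_non_positive or s > 0]


empty_scores = score_candidates(["cat"], [[], []])
kept = select(empty_scores, keep_non_positive=True)      # Okapi-like behaviour
dropped = select(empty_scores, keep_non_positive=False)  # BM25L/BM25Plus-like behaviour
mixed_scores = score_candidates(["cat"], [["cat"], []])  # one non-empty document
```

Because the guard returns early only when the average length is zero, a corpus with at least one non-empty document follows the normal scoring path.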

How did you test it?

Added a parametrized unit test (over all three algorithms) that writes empty-content documents and asserts retrieval no longer raises. Existing BM25 tests pass.

Notes for the reviewer

Includes a release note. Prepared with the assistance of an AI agent; reviewed and verified locally (the issue's reproduction + tests + ruff).

…eepset-ai#11598)

`InMemoryDocumentStore.bm25_retrieval` raised `ZeroDivisionError` when every
stored document had empty (but not None) content: such a corpus has no
vocabulary and an average document length of zero, so all three BM25
algorithms divided by zero at query time. Score every candidate as 0.0 in
this case; the existing non-positive-score handling then keeps them for
unscaled BM25Okapi and drops them for BM25L/BM25Plus.
@santino18727-debug santino18727-debug requested a review from a team as a code owner June 13, 2026 06:19
@santino18727-debug santino18727-debug requested review from bogdankostic and removed request for a team June 13, 2026 06:19
@vercel

vercel Bot commented Jun 13, 2026


@santino18727-debug is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jun 13, 2026


CLA assistant check
All committers have signed the CLA.



Development

Successfully merging this pull request may close these issues.

fix: bm25_retrieval raises ZeroDivisionError when all stored documents have empty content
