fix: avoid `ZeroDivisionError` in BM25 retrieval on a tokenless corpus by santino18727-debug · Pull Request #11619 · deepset-ai/haystack

santino18727-debug · 2026-06-13T06:19:57Z

Related Issues

Proposed Changes

`InMemoryDocumentStore.bm25_retrieval` filters out documents whose content is `None`, but empty strings pass that filter. When every stored document has empty content, the corpus has an empty BM25 vocabulary and an average document length of zero, so all three algorithms (BM25Okapi, BM25L, BM25Plus) divide by zero at query time. BM25Okapi additionally divides by the zero vocabulary size when computing `eps`.

This happens in practice when an ingestion pipeline produces empty chunks (e.g. a converter returning empty strings for scanned/blank pages): the writes succeed and the store only blows up later, at query time.
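The failure mode can be sketched without Haystack. The snippet below is a simplified illustration, not the store's actual code: `average_doc_len` and `length_norm` are hypothetical helpers, and `k1`/`b` are the conventional BM25 parameters with assumed default values. It shows only that a corpus of empty strings yields an average length of zero, which the length-normalisation term then divides by.

```python
def average_doc_len(token_lists):
    """Mean token count per document; 0.0 when every document is empty."""
    return sum(len(tokens) for tokens in token_lists) / len(token_lists)


def length_norm(doc_len, avg_doc_len, k1=1.5, b=0.75):
    """BM25 length-normalisation term (hypothetical helper, assumed defaults).

    Divides by the average document length, so it fails when that is zero.
    """
    return k1 * (1 - b + b * doc_len / avg_doc_len)


# Empty strings are not None, so they would pass a "content is not None" filter.
corpus = ["", "", ""]
token_lists = [doc.split() for doc in corpus]
avg = average_doc_len(token_lists)

try:
    length_norm(len(token_lists[0]), avg)
    failed = False
except ZeroDivisionError:
    failed = True
```

Writing the documents succeeds in this sketch as well; only the scoring call raises, which matches the "blows up later, at query time" behaviour described above.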

This guards the scoring step: when the average document length is zero, every candidate is scored 0.0. The existing non-positive-score handling then produces the expected results: unscaled BM25Okapi returns the documents with score 0.0, while BM25L and BM25Plus return an empty list. Corpora with at least one non-empty document are unaffected.
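A minimal sketch of the guard and its interaction with the score filtering, under stated assumptions: `score_all` and `keep_results` are hypothetical names, and the rule that only unscaled BM25Okapi keeps non-positive scores is inferred from the description above rather than copied from the store's code.

```python
def score_all(doc_lens, avg_doc_len, score_one):
    """Score every candidate, short-circuiting to 0.0 on a tokenless corpus."""
    if avg_doc_len == 0:
        # Guard: no document has any tokens, so nothing can match the query.
        return [0.0 for _ in doc_lens]
    return [score_one(doc_len, avg_doc_len) for doc_len in doc_lens]


def keep_results(scores, algorithm, scale_score=False):
    """Assumed filtering rule: only unscaled BM25Okapi keeps scores <= 0."""
    keep_non_positive = algorithm == "BM25Okapi" and not scale_score
    return [s for s in scores if keep_non_positive or s > 0.0]


# The scorer passed here would raise on avg_doc_len == 0, but it is never called.
scores = score_all([0, 0], 0.0, lambda doc_len, avg: doc_len / avg)
okapi = keep_results(scores, "BM25Okapi")
bm25l = keep_results(scores, "BM25L")
bm25plus = keep_results(scores, "BM25Plus")
```

Placing the guard in the scoring step rather than at write time leaves `write_documents` behaviour unchanged and lets the existing filtering decide what is returned.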

How did you test it?

Added a parametrized unit test (over all three algorithms) that writes empty-content documents and asserts retrieval no longer raises. Existing BM25 tests pass.

Notes for the reviewer

Includes a release note. Prepared with the assistance of an AI agent; reviewed and verified locally (the issue's reproduction + tests + ruff).

…eepset-ai#11598) `InMemoryDocumentStore.bm25_retrieval` raised `ZeroDivisionError` when every stored document had empty (but not None) content: such a corpus has no vocabulary and an average document length of zero, so all three BM25 algorithms divided by zero at query time. Score every candidate as 0.0 in this case; the existing non-positive-score handling then keeps them for unscaled BM25Okapi and drops them for BM25L/BM25Plus.

vercel · 2026-06-13T06:20:03Z

@santino18727-debug is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2026-06-13T06:20:05Z

All committers have signed the CLA.

santino18727-debug requested a review from a team as a code owner June 13, 2026 06:19

santino18727-debug requested review from bogdankostic and removed request for a team June 13, 2026 06:19

github-actions Bot added the topic:tests label Jun 13, 2026

santino18727-debug wants to merge 1 commit into deepset-ai:main from santino18727-debug:fix/bm25-empty-corpus-zerodivision-11598
