fix: avoid ZeroDivisionError in BM25 retrieval on a tokenless corpus #11619
Open
santino18727-debug wants to merge 1 commit into
…eepset-ai#11598) `InMemoryDocumentStore.bm25_retrieval` raised `ZeroDivisionError` when every stored document had empty (but not None) content: such a corpus has no vocabulary and an average document length of zero, so all three BM25 algorithms divided by zero at query time. Score every candidate as 0.0 in this case; the existing non-positive-score handling then keeps them for unscaled BM25Okapi and drops them for BM25L/BM25Plus.
Related Issues
Fixes #11598
Proposed Changes
`InMemoryDocumentStore.bm25_retrieval` filters out documents whose content is `None`, but empty strings pass that filter. When every stored document has empty content, the corpus has an empty BM25 vocabulary and an average document length of zero, and all three algorithms then divide by zero at query time (`BM25Okapi` also divides by the zero vocabulary size when computing `eps`).
This happens in practice when an ingestion pipeline produces empty chunks (e.g. a converter returning empty strings for scanned/blank pages): the writes succeed and the store only blows up later, at query time.
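The failure mode can be illustrated outside Haystack with a simplified Okapi-style term score. This is only a sketch: `okapi_term_score`, its parameters, and the toy corpus are hypothetical names chosen for illustration, not the store's actual code.

```python
def okapi_term_score(tf, doc_len, avgdl, idf, k1=1.5, b=0.75):
    """Textbook Okapi BM25 contribution of one query term to one document."""
    # Length normalisation divides by the average document length.
    norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# Hypothetical corpus in which every document has empty (but not None) content.
corpus_tokens = [[], [], []]
avgdl = sum(len(tokens) for tokens in corpus_tokens) / len(corpus_tokens)

try:
    okapi_term_score(tf=0, doc_len=0, avgdl=avgdl, idf=1.0)
    raised = False
except ZeroDivisionError:
    # avgdl is 0.0, so the length normalisation divides by zero.
    raised = True
```

With at least one non-empty document, `avgdl` is positive and the same call returns a finite score.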
This PR guards the scoring step: when the average document length is zero, every candidate is scored `0.0`. The existing non-positive-score handling then produces the expected results: unscaled `BM25Okapi` returns the documents with score `0.0`, while `BM25L`/`BM25Plus` return an empty list. Corpora with at least one non-empty document are unaffected.
How did you test it?
Added a parametrized unit test (over all three algorithms) that writes empty-content documents and asserts retrieval no longer raises. Existing BM25 tests pass.
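The behaviour the test checks can be sketched without Haystack as a toy model of the guard plus the non-positive-score filter. Everything here is an assumption for illustration: `retrieve` is a hypothetical stand-in, the filter rule is inferred from the description above rather than copied from the store, and score scaling is not modelled.

```python
ALGORITHMS = ("BM25Okapi", "BM25L", "BM25Plus")


def retrieve(doc_lengths, algorithm):
    """Toy unscaled retrieval: returns the (doc_index, score) pairs that survive filtering."""
    avgdl = sum(doc_lengths) / len(doc_lengths) if doc_lengths else 0.0
    if avgdl == 0:
        # Guard: a tokenless corpus cannot match any query term, so score 0.0.
        scores = [0.0] * len(doc_lengths)
    else:
        raise NotImplementedError("real BM25 scoring is out of scope for this sketch")
    # Assumption: non-positive scores are kept only for (unscaled) BM25Okapi.
    keep_non_positive = algorithm == "BM25Okapi"
    return [(i, s) for i, s in enumerate(scores) if s > 0.0 or keep_non_positive]


# Two empty-content documents, queried under each algorithm.
results = {algorithm: retrieve([0, 0], algorithm) for algorithm in ALGORITHMS}
```

In this model none of the three algorithms raises; `BM25Okapi` keeps both documents at score `0.0`, and `BM25L`/`BM25Plus` return an empty list.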
Notes for the reviewer
Includes a release note. Prepared with the assistance of an AI agent; reviewed and verified locally (the issue's reproduction, tests, and `ruff`).