feat(tantivy): add Tantivy full-text global index via Rust FFI #346
Open
spaces-X wants to merge 14 commits into
Add a Rust tantivy-based FTS global index as a second backend alongside Lucene, wired into CMake via cbindgen + Corrosion, with 10 functional unit tests.
…tests Cross-read tests for tantivy archives shared between paimon-java and paimon-cpp, using fixtures from paimon-java's TantivyIndexFixtureGen and covering both directions.
Companion infra for the tantivy-fts integration (no production logic): devcontainer, CI workflows, sanitizer flags, and cross-platform build fixes.
Fix io_meta being null on the reader path and the jieba dictionary directory not being set when constructing the tantivy index.
Install the log bridge once on first reader Create so Rust log records surface through glog in production binaries, not only in unit tests.
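The install-once behaviour described in this commit can be sketched with `std::sync::Once`. This is only an illustration under assumptions: `install_log_bridge`, `create_reader`, and `install_count` are hypothetical names, and the placeholder body counts installs instead of registering a real logger that forwards to glog.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

// Hypothetical stand-ins; these are not the PR's actual symbols.
static INSTALL: Once = Once::new();
static INSTALL_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for registering a logger that forwards records to the host.
/// Here it only counts how many times it ran.
fn install_log_bridge() {
    INSTALL_COUNT.fetch_add(1, Ordering::SeqCst);
}

/// Called on every reader creation; only the first call installs the bridge.
pub fn create_reader() {
    INSTALL.call_once(install_log_bridge);
}

/// Number of times the bridge was actually installed.
pub fn install_count() -> usize {
    INSTALL_COUNT.load(Ordering::SeqCst)
}
```

Calling `create_reader()` any number of times, from any thread, runs the install step a single time, which is the property the commit relies on for production binaries.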
…ctor for unscored search Replace the DocSetCollector + HashSet + per-doc fast-field path with a RowIdCollector that opens the row_id column once per segment and reads it inline.
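The per-segment access pattern this commit describes might look roughly like the sketch below. It deliberately avoids the tantivy API: `Segment`, `row_id_column`, and `collect_row_ids` are hypothetical names, and a plain `Vec<u64>` stands in for the row_id fast-field column.

```rust
/// Hypothetical stand-in for one index segment: doc id -> row id column.
pub struct Segment {
    pub row_id_column: Vec<u64>,
}

/// Map matching doc ids to row ids, taking the column once per segment
/// instead of looking it up again for every matching document.
pub fn collect_row_ids(segments: &[Segment], matches: &[Vec<u32>]) -> Vec<u64> {
    let mut row_ids = Vec::new();
    for (segment, docs) in segments.iter().zip(matches) {
        // "Open" the column once for the whole segment.
        let column: &[u64] = &segment.row_id_column;
        // Read row ids inline as the matching docs are visited.
        row_ids.extend(docs.iter().map(|&doc| column[doc as usize]));
    }
    row_ids
}
```

Because each doc id belongs to exactly one segment, no set-based deduplication is needed in this sketch.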
Repurpose Path B as a true unscored LIMIT N: LimitedDocSetCollector stops collecting past N via a shared atomic, skipping BM25 scoring entirely.
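A minimal sketch of the shared-atomic limit, assuming one worker per segment and no scoring. `collect_limited` is a hypothetical name, the segments are plain vectors of doc ids, and nothing here is taken from the actual `LimitedDocSetCollector` implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Collect at most `limit` (segment index, doc id) pairs across all segments.
/// Every segment worker draws from the same atomic budget, so collection
/// stops globally once `limit` slots have been handed out.
pub fn collect_limited(segments: &[Vec<u32>], limit: usize) -> Vec<(usize, u32)> {
    let taken = AtomicUsize::new(0);
    let out = Mutex::new(Vec::new());
    thread::scope(|s| {
        for (seg_idx, docs) in segments.iter().enumerate() {
            let taken = &taken;
            let out = &out;
            s.spawn(move || {
                for &doc in docs {
                    // fetch_add returns the previous value: slots 0..limit are valid.
                    if taken.fetch_add(1, Ordering::Relaxed) >= limit {
                        break;
                    }
                    out.lock().unwrap().push((seg_idx, doc));
                }
            });
        }
    });
    out.into_inner().unwrap()
}
```

Which documents end up in the result depends on thread scheduling; only the count is bounded, which matches an unscored `LIMIT N` where any N matches are acceptable.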
Add an optional min_score applied after scoring but before sort/truncate, letting FE push `score() > X` down through the FFI into the tantivy engine.
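The ordering stated in this commit (score, then filter by `min_score`, then sort and truncate) can be sketched as below. `Hit` and `finalize_hits` are hypothetical names, and the strict `>` comparison is an assumption based on the `score() > X` wording.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub row_id: u64,
    pub score: f32,
}

/// Post-process already-scored hits: drop low scores first, then sort by
/// descending score, then truncate to `limit`.
pub fn finalize_hits(
    mut hits: Vec<Hit>,
    min_score: Option<f32>,
    limit: Option<usize>,
) -> Vec<Hit> {
    if let Some(min) = min_score {
        // Assumed strict comparison, mirroring `score() > X`.
        hits.retain(|h| h.score > min);
    }
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    if let Some(n) = limit {
        hits.truncate(n);
    }
    hits
}
```

Filtering before truncation matters: truncating first could keep low-scoring hits out of the top N slots only to discard them afterwards, returning fewer rows than the limit allows.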
Adapt to base AddBatch gaining relative_row_ids and GlobalIndexIOMeta dropping range_end, mirroring lucene; update the 8 affected tantivy test files.
setup_rust.sh pins rustc 1.88.0 (min required by the transitive time crate); build_paimon.sh turns off PAIMON_ENABLE_TANTIVY on the gcc-8 image (no Rust there), mirroring the existing LUMINA/LANCE handling.
Expand the abbreviated 'Licensed under the Apache License, Version 2.0.' line to the full Apache 2.0 boilerplate so the RAT license check recognizes it.
…nt, codespell) Apply clang-format/cmake-format; fix cpplint (functional char-casts -> static_cast, int64_t/PRId64 instead of long, NOLINT for the cbindgen-generated header include) and a codespell typo.
9e4bc01 to 77658c1
Purpose
Add an experimental `tantivy-fulltext` global index backend alongside `lucene-fts`. This change:
- wires the `tantivy`/`jieba-rs` FFI crate into CMake via Corrosion and cbindgen
- `limit`, `pre_filter`, BM25 score opt-in, and `min_score` filtering
Tests
Added / covered by:
- `cargo test --manifest-path third_party/tantivy_ffi/Cargo.toml`
- `paimon-tantivy-smoke-test`
- `paimon-tantivy-ffi-test`
- `paimon-tantivy-tokenizer-test`
- `paimon-tantivy-writer-test`
- `paimon-tantivy-reader-test`
- `paimon-tantivy-filter-limit-test`
- `paimon-tantivy-index-test`
- `paimon-tantivy-streaming-test`
- `paimon-tantivy-java-compat-test`
- `paimon-tantivy-lucene-coexist-test`
- `paimon-tantivy-equivalence-test`
- `paimon-global-index-test`
API and Format
Yes.
API:
- `include/paimon/predicate/full_text_search.h` adds `with_score` and `min_score`.
- `limit` is now a truncation switch and no longer implies BM25 score computation.
- `ReplacePreFilter` preserves scoring-related flags.
Format:
- `tantivy-fulltext` packed archive format compatible with paimon-java Tantivy archives.
- `lucene-fts` storage format is not changed.
Protocol:
Documentation
Yes, this introduces a new experimental `tantivy-fulltext` global index feature. This patch includes fixture READMEs and smoke-test script usage, but no separate user-facing documentation page.
Generative AI tooling
Generated-by: Codex (GPT-5) and Claude Opus 4.8