
feat(tantivy): add Tantivy full-text global index via Rust FFI #346

Open
spaces-X wants to merge 14 commits into alibaba:main from spaces-X:baseline-tantivy


spaces-X commented Jun 8, 2026

Purpose

Add an experimental tantivy-fulltext global index backend alongside lucene-fts.

This change:

  • wires a Rust tantivy / jieba-rs FFI crate into CMake via Corrosion and cbindgen
  • adds C++ Tantivy global index writer, reader, factory registration, archive parsing, streaming I/O callbacks, and Rust log bridging
  • supports full-text search query types with limit, pre_filter, BM25 score opt-in, and min_score filtering
  • adds Java <-> C++ Tantivy archive compatibility fixtures and cross-read coverage
  • adds CI/devcontainer Rust setup and a targeted Tantivy smoke test script

Tests

Added / covered by:

  • cargo test --manifest-path third_party/tantivy_ffi/Cargo.toml
  • paimon-tantivy-smoke-test
  • paimon-tantivy-ffi-test
  • paimon-tantivy-tokenizer-test
  • paimon-tantivy-writer-test
  • paimon-tantivy-reader-test
  • paimon-tantivy-filter-limit-test
  • paimon-tantivy-index-test
  • paimon-tantivy-streaming-test
  • paimon-tantivy-java-compat-test
  • paimon-tantivy-lucene-coexist-test
  • paimon-tantivy-equivalence-test
  • paimon-global-index-test

API and Format

Yes: the public API changes and a new storage format is added; the external protocol is unchanged.

API:

  • include/paimon/predicate/full_text_search.h adds with_score and min_score.
  • limit is now a truncation switch and no longer implies BM25 score computation.
  • ReplacePreFilter preserves scoring-related flags.
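
The API notes above can be illustrated with a minimal sketch. It is not the contents of full_text_search.h: the class name `FullTextSearchSketch`, the `RowFilterSketch` type, the constructor shape, and the `NeedsScoring` helper are all hypothetical. It only models the three stated behaviors: `with_score` and `min_score` are explicit options, `limit` alone does not turn scoring on, and `ReplacePreFilter` keeps the scoring-related flags.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for whatever row filter type the real API uses.
struct RowFilterSketch {
    std::string description;
};

// Hypothetical model of the query options described in this PR.
class FullTextSearchSketch {
 public:
    FullTextSearchSketch(std::string query, std::optional<int32_t> limit, bool with_score,
                         std::optional<float> min_score,
                         std::shared_ptr<const RowFilterSketch> pre_filter = nullptr)
        : query_(std::move(query)),
          limit_(limit),
          with_score_(with_score),
          min_score_(min_score),
          pre_filter_(std::move(pre_filter)) {
        // Assumption of this sketch only: a limit, when present, is positive.
        assert(!limit_.has_value() || *limit_ > 0);
    }

    // Returns a copy in which only the pre-filter differs; limit, with_score
    // and min_score are carried over unchanged.
    FullTextSearchSketch ReplacePreFilter(std::shared_ptr<const RowFilterSketch> pre_filter) const {
        FullTextSearchSketch copy(*this);
        copy.pre_filter_ = std::move(pre_filter);
        return copy;
    }

    // Scoring is driven by with_score, not by the presence of a limit.
    bool NeedsScoring() const { return with_score_; }

    const std::optional<int32_t>& limit() const { return limit_; }
    bool with_score() const { return with_score_; }
    const std::optional<float>& min_score() const { return min_score_; }
    const std::shared_ptr<const RowFilterSketch>& pre_filter() const { return pre_filter_; }

 private:
    std::string query_;
    std::optional<int32_t> limit_;
    bool with_score_;
    std::optional<float> min_score_;
    std::shared_ptr<const RowFilterSketch> pre_filter_;
};
```

In this model a query built with a limit but `with_score == false` reports `NeedsScoring() == false`, which is the "truncation switch" behavior described above.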

Format:

  • Adds a new tantivy-fulltext packed archive format compatible with paimon-java Tantivy archives.
  • Existing lucene-fts storage format is not changed.

Protocol:

  • No external protocol change. The new Rust/C FFI boundary is internal to the Tantivy backend.

Documentation

Yes, this introduces a new experimental tantivy-fulltext global index feature.

This patch includes fixture READMEs and smoke-test script usage, but no separate user-facing documentation page.

Generative AI tooling

Generated-by: Codex (GPT-5) and Claude Opus 4.8


CLAassistant commented Jun 8, 2026

CLA assistant check
All committers have signed the CLA.

spaces-X and others added 13 commits June 8, 2026 15:24
Add a Rust tantivy-based FTS global index as a second backend alongside Lucene,
wired into CMake via cbindgen + Corrosion, with 10 functional unit tests.
…tests

Cross-read tests for tantivy archives shared between paimon-java and paimon-cpp,
using fixtures from paimon-java's TantivyIndexFixtureGen and covering both directions.
Companion infra for the tantivy-fts integration (no production logic): devcontainer,
CI workflows, sanitizer flags, and cross-platform build fixes.
Fix io_meta being null on the reader path and the jieba dictionary directory
not being set when constructing the tantivy index.
Install the log bridge once on first reader Create so Rust log records surface
through glog in production binaries, not only in unit tests.
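
A minimal sketch of the install-once pattern this commit describes, assuming nothing about the real code: `InstallRustLogCallback`, `ForwardToSink`, `LogSink`, and `ReaderSketch` are hypothetical names, and a string vector stands in for glog. The one-time guarantee here comes from a function-local static, which C++11 initializes exactly once even under concurrent calls.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sink standing in for glog in this sketch.
std::vector<std::string>& LogSink() {
    static std::vector<std::string> sink;
    return sink;
}

// Hypothetical callback type the Rust side would invoke per log record.
using RustLogCallback = void (*)(int level, const char* message);

static RustLogCallback g_installed_callback = nullptr;
static int g_install_count = 0;

// Stand-in for the FFI entry point that registers the callback.
void InstallRustLogCallback(RustLogCallback callback) {
    assert(callback != nullptr);
    g_installed_callback = callback;
    ++g_install_count;
}

// Forwards one Rust log record to the C++ side.
void ForwardToSink(int level, const char* message) {
    LogSink().push_back("[rust:" + std::to_string(level) + "] " + message);
}

// Installs the bridge on the first call only; later calls are no-ops.
void EnsureLogBridgeInstalled() {
    static const bool installed = [] {
        InstallRustLogCallback(&ForwardToSink);
        return true;
    }();
    (void)installed;
}

struct ReaderSketch {
    // Every reader creation goes through the guard, so production binaries
    // get the bridge without any test-only setup.
    static ReaderSketch Create() {
        EnsureLogBridgeInstalled();
        return ReaderSketch{};
    }
};
```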
…ctor for unscored search

Replace the DocSetCollector + HashSet + per-doc fast-field path with a RowIdCollector
that opens the row_id column once per segment and reads it inline.
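
The per-segment idea can be sketched in C++ as follows. The actual collector lives in the Rust crate and is not shown here, so `SegmentSketch`, `ColumnHandle`, and `CollectRowIds` are hypothetical stand-ins. The point illustrated is only that the row_id column is opened once per segment and then read inline for each matching doc, with no intermediate doc set.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one index segment.
struct SegmentSketch {
    std::vector<int64_t> row_id_column;   // fast field: doc id -> row id
    std::vector<uint32_t> matching_docs;  // doc ids matched by the query
};

// Hypothetical column handle; counts how often a column is opened.
struct ColumnHandle {
    ColumnHandle(const SegmentSketch& segment, int* open_count)
        : values(&segment.row_id_column) {
        ++*open_count;
    }

    int64_t Get(uint32_t doc) const {
        assert(doc < values->size());
        return (*values)[doc];
    }

    const std::vector<int64_t>* values;
};

// Opens the row_id column once per segment and reads it inline while
// visiting matches, instead of collecting doc ids first and resolving
// each one through a separate per-doc lookup afterwards.
std::vector<int64_t> CollectRowIds(const std::vector<SegmentSketch>& segments,
                                   int* open_count) {
    std::vector<int64_t> row_ids;
    for (const SegmentSketch& segment : segments) {
        ColumnHandle column(segment, open_count);  // one open per segment
        for (uint32_t doc : segment.matching_docs) {
            row_ids.push_back(column.Get(doc));
        }
    }
    return row_ids;
}
```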
Repurpose Path B as a true unscored LIMIT N: LimitedDocSetCollector stops collecting
past N via a shared atomic, skipping BM25 scoring entirely.
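
A rough C++ model of the shared-atomic limit, under stated assumptions: `CollectUnscored` is a hypothetical name, segments are plain vectors of row ids, and the loop is sequential for clarity. The counter is a `std::atomic` so that the same logic would stay correct if segments were visited by several threads; no score is computed anywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Collects at most `limit` row ids across all segments without scoring.
// Each collected row claims one slot from a counter shared by all segments.
std::vector<int64_t> CollectUnscored(const std::vector<std::vector<int64_t>>& segments,
                                     int64_t limit) {
    assert(limit >= 0);
    std::atomic<int64_t> remaining(limit);
    std::vector<int64_t> collected;
    for (const std::vector<int64_t>& segment : segments) {
        // Skip whole segments once the budget is spent.
        if (remaining.load(std::memory_order_relaxed) <= 0) {
            break;
        }
        for (int64_t row_id : segment) {
            // fetch_sub returns the value before the decrement, so a result
            // of 0 or less means no slot was left for this row.
            if (remaining.fetch_sub(1, std::memory_order_relaxed) <= 0) {
                break;
            }
            collected.push_back(row_id);
        }
    }
    return collected;
}
```

With segments `{1, 2, 3}`, `{4, 5}`, `{6}` and a limit of 4, this sequential version stops after row 4 and never touches the third segment.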
Add an optional min_score applied after scoring but before sort/truncate, letting FE
push `score() > X` down through the FFI into the tantivy engine.
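
The ordering matters: filtering before truncation means the limit is filled from rows that pass the threshold. A small sketch of that order, with hypothetical names (`ScoredRow`, `FilterSortTruncate`) and a strict `>` comparison taken from the `score() > X` wording above; the tie-break on row id is an assumption of this sketch only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScoredRow {
    int64_t row_id;
    float score;
};

// Applies min_score after scoring, then sorts by score, then truncates.
std::vector<ScoredRow> FilterSortTruncate(std::vector<ScoredRow> hits, float min_score,
                                          std::size_t limit) {
    assert(!std::isnan(min_score));
    // 1. Keep only rows with score > min_score.
    hits.erase(std::remove_if(hits.begin(), hits.end(),
                              [min_score](const ScoredRow& hit) {
                                  return !(hit.score > min_score);
                              }),
               hits.end());
    // 2. Highest score first; ties broken by row id (sketch assumption).
    std::sort(hits.begin(), hits.end(), [](const ScoredRow& a, const ScoredRow& b) {
        if (a.score != b.score) {
            return a.score > b.score;
        }
        return a.row_id < b.row_id;
    });
    // 3. Truncate to the requested limit.
    if (hits.size() > limit) {
        hits.resize(limit);
    }
    return hits;
}
```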
Adapt to base AddBatch gaining relative_row_ids and GlobalIndexIOMeta dropping
range_end, mirroring lucene; update the 8 affected tantivy test files.
setup_rust.sh pins rustc 1.88.0 (min required by the transitive time crate); build_paimon.sh turns off PAIMON_ENABLE_TANTIVY on the gcc-8 image (no Rust there), mirroring the existing LUMINA/LANCE handling.
Expand the abbreviated 'Licensed under the Apache License, Version 2.0.' line to the full Apache 2.0 boilerplate so the RAT license check recognizes it.
…nt, codespell)

Apply clang-format/cmake-format; fix cpplint (functional char-casts -> static_cast, int64_t/PRId64 instead of long, NOLINT for the cbindgen-generated header include) and a codespell typo.
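
For readers unfamiliar with these cpplint rules, the two code-level fixes look roughly like this. The functions `FormatRowId` and `ToLowerAscii` are hypothetical examples, not code from this PR.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Uses int64_t with the PRId64 format macro instead of `long` with "%ld",
// whose width differs between platforms.
std::string FormatRowId(int64_t row_id) {
    char buffer[32];
    int written = std::snprintf(buffer, sizeof(buffer), "row_id=%" PRId64, row_id);
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
    return std::string(buffer, static_cast<std::size_t>(written));
}

// Uses static_cast<char>(...) instead of the functional cast char(...).
char ToLowerAscii(int c) {
    return static_cast<char>(c >= 'A' && c <= 'Z' ? c - 'A' + 'a' : c);
}
```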
spaces-X force-pushed the baseline-tantivy branch from 9e4bc01 to 77658c1 June 8, 2026 07:25