[global-index-eslib] integrates the Elasticsearch (Lucene) index engine into Paimon's global index system #8000
Open
CrownChu wants to merge 20 commits into
Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes:
- GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
- GlobalIndexMultiColumnWriter: new interface for multi-column writes
- GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField>
- GlobalIndexScanner: route extraFieldIds to same reader group
- VectorScanImpl/FullTextScanImpl: match against extraFieldIds
- GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
- DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
- All single-column APIs preserved for backward compatibility
Allow index_column parameter to accept comma-separated column names (e.g. "title,embedding") for both Flink and Spark procedures. Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.
…e into GlobalIndexBuilderUtils
…n, and restore observability logs
… index (indexFieldId=-1)
…-column for unsupported index types
…k, and multi-column guard
…count is unlimited
Resolve conflict in GenericIndexTopoBuilder: keep multi-column write logic, unify variable name to rowsSeen.
…alar filter support
- Implement ESIndexGlobalIndexReader/Writer/Indexer/Factory for ESLib-based multi-index
- Inject shared ExecutorService from factory layer for parallel cluster search
- Add volatile closed flag with checkNotClosed() for concurrent close safety
- Implement all scalar/keyword/text filter visitor methods via ESLib IndexFilter API
- Change eslib groupId to io.github.crownchu, version 1.0.0
- Add GitHub-hosted Maven repository for CI dependency resolution
75ebc0d to b4d2f23
Contributor
Thanks for the contribution. This is a large new global-index integration and the current CI status has many failing jobs, so I am holding off on approval for now. Please get the build/test matrix green first, then it will be easier to do a meaningful code review.
…i' into feature-globalindex-support-multi-eslib
- bump all module versions 1.5-SNAPSHOT -> 1.5-es-SNAPSHOT (es fork build)
- paimon-eslib: ESIndexGlobalIndexWriter / ESIndexGlobalIndexerFactory multi-column WIP + E2E test
- paimon-flink GenericIndexTopoBuilder + ESLibGlobalIndexITCase
Checkpoint before merging github/feature-globalindex-support-multi.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- pom.xml: eslib-core dependency classifier lucene912 -> lucene9
- ESIndexGlobalIndexE2ETest: PaimonLucene912Codec -> PaimonLucene9Codec (fixes 'package org.elasticsearch.eslib.adapter.lucene912 does not exist')
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lindex-support-multi-eslib
Resolved 18 conflicts. Both the fork and upstream independently implemented multi-column global index + vector/fulltext read; adopted upstream's on-disk data model (GlobalIndexMeta: indexFieldId=primary, extraFieldIds=rest) since the merged metadata layer enforces it, while preserving eslib multi-column support by routing the full ordered field list through GlobalIndexerFactory.create(List<DataField>). Dropped the fork's MULTI_COLUMN_INDEX_FIELD_ID(-1) sentinel encoding. Kept paimon-eslib module alongside upstream's new paimon-vector.
Verified: paimon-common, paimon-core, paimon-flink-common, paimon-spark-common all compile.
Switch paimon-eslib's eslib-api/eslib-core coordinates from io.github.crownchu to io.github.paimon.eslib and add the public OSS maven repo (https://es-demo-test.oss-cn-hangzhou.aliyuncs.com/maven/) so GitHub CI can resolve the eslib artifacts without internal alibaba maven access. Updated the shade-plugin includes to the new groupId. Verified: clean-settings build (no alibaba mirror) downloads eslib-api/eslib-core from the OSS repo and paimon-eslib compiles (BUILD SUCCESS).
5dd45fa to e96fefa
Align the whole reactor (88 module poms) to 1.5-SNAPSHOT to match upstream apache paimon. Previously fork modules used 1.5-es-SNAPSHOT while the merged-in paimon-mosaic / paimon-vector modules declared parent paimon-parent:1.5-SNAPSHOT, which broke CI at the project-reading stage (Non-resolvable parent POM). Unifying the version resolves it.
… multi-column
Drop the fork-only List<DataField> overloads and route the eslib multi-column path through upstream's create(DataField indexField, List<DataField> extraFields, Options) API. This makes GlobalIndexer, GlobalIndexerFactory and GlobalIndexBuilderUtils byte-identical to upstream master (shrinks the PR diff).
- GlobalIndexer/GlobalIndexerFactory/GlobalIndexBuilderUtils: reverted to upstream
- ESIndexGlobalIndexerFactory: override create(DataField, extraFields, Options)
- GlobalIndexScanner / VectorReadImpl / FullTextReadImpl: pass (indexField, extraFields)
- flink GenericIndexTopoBuilder / spark DefaultGlobalIndexBuilder: pass (indexField, extraFields)
Verified: paimon-common/core/flink-common/spark-common/eslib compile; paimon-core test-compile passes.
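The commit notes above describe index_column accepting comma-separated names and the on-disk model storing one primary field (indexFieldId) plus the remaining fields (extraFieldIds). A minimal sketch of that split follows. The class IndexColumns and its parse method are hypothetical names for illustration, not part of the PR, and treating the first listed column as the primary one is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Paimon class: splits an index_column value
// such as "title,embedding" into a primary column and the extra columns.
final class IndexColumns {
    final String primary;
    final List<String> extras;

    private IndexColumns(String primary, List<String> extras) {
        this.primary = primary;
        this.extras = extras;
    }

    static IndexColumns parse(String indexColumn) {
        List<String> names = new ArrayList<>();
        for (String part : indexColumn.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        if (names.isEmpty()) {
            throw new IllegalArgumentException("index_column must name at least one column");
        }
        // Assumption: the first column is the primary field, the rest are extras.
        List<String> rest = new ArrayList<>(names.subList(1, names.size()));
        return new IndexColumns(names.get(0), Collections.unmodifiableList(rest));
    }
}
```

A single-column value such as "title" yields an empty extras list, which matches the stated goal of keeping single-column usage working unchanged.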
Summary
This PR integrates the Elasticsearch (Lucene) index engine into Paimon's global index system via the paimon-eslib module, enabling Paimon tables to directly leverage the ES engine's vector search (DiskBBQ / HNSW) and scalar/text filtering capabilities. Key changes:
Details
Paimon Integration with the ES Index Engine
Paimon's global index is bridged to the ES (Lucene) engine through paimon-eslib: on the write side, Flink uses this engine to build vector/scalar indexes (producing ESLib archive files); on the query side, ES's paimon-store mounts and reuses the same engine. This PR completes the query-side engine's parallel search, concurrency safety, and filter operators.
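For the concurrency-safety part, the commit notes mention a volatile closed flag guarded by checkNotClosed(). A minimal sketch of that pattern is shown here. GuardedReader and its search method are hypothetical stand-ins, not the PR's ESIndexGlobalIndexReader.

```java
import java.io.Closeable;

// Hypothetical reader used only to illustrate the closed-flag pattern.
class GuardedReader implements Closeable {
    // volatile so that close() on one thread is visible to searches on others
    private volatile boolean closed = false;

    private void checkNotClosed() {
        if (closed) {
            throw new IllegalStateException("reader is closed");
        }
    }

    String search(String query) {
        checkNotClosed(); // fail fast instead of touching released resources
        return "hits for " + query;
    }

    @Override
    public void close() {
        closed = true; // idempotent: closing twice has no further effect
    }
}
```

Note that a volatile flag only guarantees visibility. A search that has already passed the check can still overlap with a concurrent close(), so a real reader would likely need extra coordination if that window matters.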
Parallel Search
The lifecycle of the search thread pool is owned by the factory layer (ESIndexGlobalIndexerFactory); the pool is lazily initialized and shared across all readers. It is configurable via global-index.es-index.read-search-threads.
The executor is injected through the full chain: Factory → Indexer → Reader → ESIndexSearcher → Lucene Codec (via the SearchExecutorHolder ThreadLocal bridge, to work around Lucene SPI's no-arg constructor constraint).
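The injection chain above can be sketched roughly as follows, assuming a factory that creates its pool on first use and a ThreadLocal holder that hands the executor to a component which can only be built through a no-arg constructor. SharedPoolFactory, ExecutorHolder, and NoArgCodec are hypothetical names; the PR's actual SearchExecutorHolder and codec classes are not shown in this text and may differ.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a ThreadLocal bridge such as SearchExecutorHolder.
final class ExecutorHolder {
    private static final ThreadLocal<ExecutorService> CURRENT = new ThreadLocal<>();

    static void set(ExecutorService executor) { CURRENT.set(executor); }
    static ExecutorService get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

// Stand-in for an SPI-loaded component that only has a no-arg constructor,
// so it reads the executor from the holder instead of a parameter.
final class NoArgCodec {
    final ExecutorService executor;

    NoArgCodec() {
        this.executor = ExecutorHolder.get();
    }
}

// Hypothetical factory that owns the pool lifecycle.
final class SharedPoolFactory {
    private final int threads;
    private ExecutorService pool; // created on first use, then shared

    SharedPoolFactory(int threads) {
        this.threads = threads;
    }

    synchronized ExecutorService executor() {
        if (pool == null) {
            pool = Executors.newFixedThreadPool(threads);
        }
        return pool;
    }

    NoArgCodec openCodec() {
        ExecutorHolder.set(executor());
        try {
            return new NoArgCodec(); // picks up the executor via the holder
        } finally {
            ExecutorHolder.clear(); // avoid leaking the reference on this thread
        }
    }

    synchronized void shutdown() {
        if (pool != null) {
            pool.shutdown();
            pool = null;
        }
    }
}
```

Clearing the holder in a finally block keeps the executor reference from lingering on pooled threads, which is the usual caveat with ThreadLocal-based hand-offs.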
Scalar Filter
ESIndexGlobalIndexReader now implements all visitor methods, dispatching to ESLib's unified filter API.
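As an illustration of the visitor-to-filter dispatch described here, the sketch below maps three predicate kinds to filter expressions. PredicateVisitor and FilterBuildingVisitor are hypothetical, and the returned strings merely stand in for filter objects; ESLib's real IndexFilter API and Paimon's actual visitor interface are not reproduced in this text.

```java
import java.util.List;

// Hypothetical, reduced visitor interface: one method per predicate kind.
interface PredicateVisitor<T> {
    T visitEqual(String field, Object literal);
    T visitLessThan(String field, Object literal);
    T visitIn(String field, List<Object> literals);
}

// Each visit method translates one predicate into one filter expression.
// Strings are used here as placeholders for real filter objects.
final class FilterBuildingVisitor implements PredicateVisitor<String> {
    @Override
    public String visitEqual(String field, Object literal) {
        return "term(" + field + "=" + literal + ")";
    }

    @Override
    public String visitLessThan(String field, Object literal) {
        return "range(" + field + "<" + literal + ")";
    }

    @Override
    public String visitIn(String field, List<Object> literals) {
        return "terms(" + field + " in " + literals + ")";
    }
}
```

Routing every predicate kind through one builder of this shape is what lets a reader support scalar, keyword, and text filters without separate code paths per index type.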
Dependency Publishing
ESLib jars are published to a GitHub raw repository (CrownChu/es-paimon-lib-releases) with full Maven metadata and checksums. paimon-eslib/pom.xml declares the repository so CI can resolve dependencies without manual local installation.