[global-index-eslib] integrates the Elasticsearch (Lucene) index engine into Paimon's global index system by CrownChu · Pull Request #8000 · apache/paimon

CrownChu · 2026-05-27T15:19:01Z

Summary

This PR integrates the Elasticsearch (Lucene) index engine into Paimon's global index system via the paimon-eslib module, enabling Paimon tables to directly leverage the ES engine's vector search (DiskBBQ / HNSW) and scalar/text
filtering capabilities. Key changes:

Parallel cluster search: Inject a shared ExecutorService from ESIndexGlobalIndexerFactory for DiskBBQ search, replacing per-reader thread pool creation
Concurrent close safety: Add a volatile closed flag with checkNotClosed() to fix a race condition in the reader lifecycle
Full scalar/text filtering: Implement all GlobalIndexReader visitor methods (visitEqual, visitLessThan, visitStartsWith, visitLike, visitIsNull, etc.), dispatching to ESLib's unified IndexFilter / ScalarPredicate API
Dependency coordinates: Change the eslib dependency groupId from org.elasticsearch to io.github.crownchu, version from 1.0.0-SNAPSHOT to 1.0.0
Public dependency repository: Add a GitHub-hosted Maven repository for public CI dependency resolution
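
For reviewers, the closed-flag guard named above could look roughly like the sketch below. `GuardedReader` and its `search` body are hypothetical stand-ins; the real ESIndexGlobalIndexReader code is not reproduced in this description.

```java
import java.io.Closeable;

// Hypothetical reader used only to illustrate the volatile closed flag;
// this is not the actual ESIndexGlobalIndexReader.
final class GuardedReader implements Closeable {
    // volatile: a close() on one thread becomes visible to readers on other threads.
    private volatile boolean closed = false;

    private void checkNotClosed() {
        if (closed) {
            throw new IllegalStateException("reader is closed");
        }
    }

    int search(int query) {
        checkNotClosed();
        return query * 2; // placeholder for the real search work
    }

    @Override
    public void close() {
        closed = true; // safe to call more than once
    }
}
```

A volatile flag on its own only guarantees visibility: it rejects calls that start after close(), but it does not wait for a search that is already in flight. Whether the PR adds further synchronization for that case is not stated here.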

Details

Paimon Integration with the ES Index Engine

Paimon's global index is bridged to the ES (Lucene) engine through paimon-eslib: on the write side, Flink uses this engine to build vector/scalar indexes (producing ESLib archive files); on the query side, ES's paimon-store mounts
and reuses the same engine. This PR completes the query-side engine's parallel search, concurrency safety, and filter operators.

Parallel Search

The search thread pool lifecycle is owned by the factory layer (ESIndexGlobalIndexerFactory), lazily initialized and shared across all readers. Configurable via global-index.es-index.read-search-threads:

-1 (default): auto = CPU/2, min 2
0: disable parallel search (serial only)
>0: use the specified thread count
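
A minimal sketch of how these three settings might be resolved into a lazily created shared pool. `SearchThreads` is a hypothetical name, "CPU" is assumed to mean `Runtime.availableProcessors()` with integer division, and rejecting values below -1 is an assumption rather than documented behavior.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; not the actual ESIndexGlobalIndexerFactory code.
final class SearchThreads {
    private static ExecutorService shared; // created on first use, guarded by the class lock

    // Maps the configured value to a thread count; 0 means serial search.
    static int resolve(int configured, int cpus) {
        if (configured == -1) {
            return Math.max(2, cpus / 2); // auto: CPU/2, at least 2
        }
        if (configured >= 0) {
            return configured; // 0 disables parallel search, >0 is taken as-is
        }
        // Assumption: other negative values are treated as a config error.
        throw new IllegalArgumentException("invalid read-search-threads: " + configured);
    }

    // Returns the shared pool, or null when parallel search is disabled.
    static synchronized ExecutorService sharedExecutor(int configured) {
        int threads = resolve(configured, Runtime.getRuntime().availableProcessors());
        if (threads == 0) {
            return null;
        }
        if (shared == null) {
            shared = Executors.newFixedThreadPool(threads, r -> {
                Thread t = new Thread(r, "es-index-search");
                t.setDaemon(true); // do not keep the JVM alive for idle search threads
                return t;
            });
        }
        return shared;
    }
}
```

In this sketch the first caller fixes the pool size; later calls reuse the same pool regardless of the value they pass.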

The executor is injected through the full chain: Factory → Indexer → Reader → ESIndexSearcher → Lucene Codec (via the SearchExecutorHolder ThreadLocal bridge, to work around Lucene SPI's no-arg constructor constraint).
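
To illustrate the ThreadLocal bridge: a component that SPI instantiates through a no-arg constructor cannot receive the executor as a parameter, so the caller parks it in a thread-local slot for the duration of the construction. The sketch below uses hypothetical names (`ExecutorHolderSketch`, `NoArgComponent`) and does not reproduce the real SearchExecutorHolder.

```java
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical holder illustrating the pattern; not the actual SearchExecutorHolder.
final class ExecutorHolderSketch {
    private static final ThreadLocal<ExecutorService> CURRENT = new ThreadLocal<>();

    // Runs the action with the executor visible on this thread, then restores the previous value.
    static <T> T withExecutor(ExecutorService executor, Supplier<T> action) {
        ExecutorService previous = CURRENT.get();
        CURRENT.set(executor);
        try {
            return action.get();
        } finally {
            if (previous == null) {
                CURRENT.remove(); // avoid leaking the reference on pooled threads
            } else {
                CURRENT.set(previous);
            }
        }
    }

    static ExecutorService current() {
        return CURRENT.get();
    }
}

// Stand-in for a codec component that must keep a no-arg constructor.
final class NoArgComponent {
    final ExecutorService executor;

    NoArgComponent() {
        // Picks up whatever the calling thread published; null means serial search.
        this.executor = ExecutorHolderSketch.current();
    }
}
```

The bridge only works when construction happens on the same thread that set the value, which is the usual trade-off of this pattern.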

Scalar Filter

ESIndexGlobalIndexReader now implements all visitor methods, dispatching to ESLib's unified filter API:

Numeric comparisons → ScalarPredicate.eq/lt/lte/gt/gte/in/notIn
Text matching → IndexFilter.TextFilter with TERM / PREFIX / WILDCARD ops
Null checks → IndexFilter.exists() / notExists()
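
The dispatch above can be sketched with stand-in types. Everything in `FilterSketch` is hypothetical: the real ESLib IndexFilter / ScalarPredicate signatures are not shown in this description, so filters are rendered as plain strings. The sketch also assumes that equality on non-numeric literals maps to TERM, that IS NULL maps to notExists, and that LIKE is translated to a wildcard by replacing `%` with `*` and `_` with `?`, without escape handling.

```java
// Hypothetical stand-ins for the filter API; not the actual ESLib classes.
final class FilterSketch {
    enum TextOp { TERM, PREFIX, WILDCARD }

    // Minimal filter that only records a readable description.
    static final class Filter {
        private final String description;

        Filter(String description) {
            this.description = description;
        }

        @Override
        public String toString() {
            return description;
        }
    }

    static Filter scalar(String field, String op, Object literal) {
        return new Filter(field + " " + op + " " + literal);
    }

    static Filter text(String field, TextOp op, String value) {
        return new Filter(field + " " + op + " " + value);
    }

    // Numeric literals go to the scalar predicate, everything else to a term match.
    static Filter visitEqual(String field, Object literal) {
        if (literal instanceof Number) {
            return scalar(field, "eq", literal);
        }
        return text(field, TextOp.TERM, String.valueOf(literal));
    }

    static Filter visitLessThan(String field, Number literal) {
        return scalar(field, "lt", literal);
    }

    static Filter visitStartsWith(String field, String prefix) {
        return text(field, TextOp.PREFIX, prefix);
    }

    static Filter visitLike(String field, String pattern) {
        return text(field, TextOp.WILDCARD, likeToWildcard(pattern));
    }

    static Filter visitIsNull(String field) {
        return new Filter("NOT EXISTS " + field);
    }

    static Filter visitIsNotNull(String field) {
        return new Filter("EXISTS " + field);
    }

    // Simplified SQL LIKE to wildcard translation (assumption: no escape characters).
    static String likeToWildcard(String pattern) {
        StringBuilder sb = new StringBuilder(pattern.length());
        for (char c : pattern.toCharArray()) {
            sb.append(c == '%' ? '*' : c == '_' ? '?' : c);
        }
        return sb.toString();
    }
}
```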

Dependency Publishing

ESLib jars are published to a GitHub raw repository (CrownChu/es-paimon-lib-releases) with full Maven metadata and checksums. paimon-eslib/pom.xml declares the repository so CI can resolve dependencies without manual local
installation.

Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes:
- GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
- GlobalIndexMultiColumnWriter: new interface for multi-column writes
- GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField>
- GlobalIndexScanner: route extraFieldIds to same reader group
- VectorScanImpl/FullTextScanImpl: match against extraFieldIds
- GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
- DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
- All single-column APIs preserved for backward compatibility

…alar filter support
- Implement ESIndexGlobalIndexReader/Writer/Indexer/Factory for ESLib-based multi-index
- Inject shared ExecutorService from factory layer for parallel cluster search
- Add volatile closed flag with checkNotClosed() for concurrent close safety
- Implement all scalar/keyword/text filter visitor methods via ESLib IndexFilter API
- Change eslib groupId to io.github.crownchu, version 1.0.0
- Add GitHub-hosted Maven repository for CI dependency resolution

leaves12138 · 2026-05-31T12:59:51Z

Thanks for the contribution. This is a large new global-index integration and the current CI status has many failing jobs, so I am holding off on approval for now. Please get the build/test matrix green first, then it will be easier to do a meaningful code review.

- bump all module versions 1.5-SNAPSHOT -> 1.5-es-SNAPSHOT (es fork build)
- paimon-eslib: ESIndexGlobalIndexWriter / ESIndexGlobalIndexerFactory multi-column WIP + E2E test
- paimon-flink GenericIndexTopoBuilder + ESLibGlobalIndexITCase

Checkpoint before merging github/feature-globalindex-support-multi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- pom.xml: eslib-core dependency classifier lucene912 -> lucene9
- ESIndexGlobalIndexE2ETest: PaimonLucene912Codec -> PaimonLucene9Codec (fixes 'package org.elasticsearch.eslib.adapter.lucene912 does not exist')

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lindex-support-multi-eslib

Resolved 18 conflicts. Both the fork and upstream independently implemented multi-column global index + vector/fulltext read; adopted upstream's on-disk data model (GlobalIndexMeta: indexFieldId=primary, extraFieldIds=rest) since the merged metadata layer enforces it, while preserving eslib multi-column support by routing the full ordered field list through GlobalIndexerFactory.create(List<DataField>). Dropped the fork's MULTI_COLUMN_INDEX_FIELD_ID(-1) sentinel encoding. Kept paimon-eslib module alongside upstream's new paimon-vector.

Verified: paimon-common, paimon-core, paimon-flink-common, paimon-spark-common all compile.

Switch paimon-eslib's eslib-api/eslib-core coordinates from io.github.crownchu to io.github.paimon.eslib and add the public OSS maven repo (https://es-demo-test.oss-cn-hangzhou.aliyuncs.com/maven/) so GitHub CI can resolve the eslib artifacts without internal alibaba maven access. Updated the shade-plugin includes to the new groupId. Verified: clean-settings build (no alibaba mirror) downloads eslib-api/eslib-core from the OSS repo and paimon-eslib compiles (BUILD SUCCESS).

Align the whole reactor (88 module poms) to 1.5-SNAPSHOT to match upstream apache paimon. Previously fork modules used 1.5-es-SNAPSHOT while the merged-in paimon-mosaic / paimon-vector modules declared parent paimon-parent:1.5-SNAPSHOT, which broke CI at the project-reading stage (Non-resolvable parent POM). Unifying the version resolves it.

… multi-column

Drop the fork-only List<DataField> overloads and route the eslib multi-column path through upstream's create(DataField indexField, List<DataField> extraFields, Options) API. This makes GlobalIndexer, GlobalIndexerFactory and GlobalIndexBuilderUtils byte-identical to upstream master (shrinks the PR diff).
- GlobalIndexer/GlobalIndexerFactory/GlobalIndexBuilderUtils: reverted to upstream
- ESIndexGlobalIndexerFactory: override create(DataField, extraFields, Options)
- GlobalIndexScanner / VectorReadImpl / FullTextReadImpl: pass (indexField, extraFields)
- flink GenericIndexTopoBuilder / spark DefaultGlobalIndexBuilder: pass (indexField, extraFields)

Verified: paimon-common/core/flink-common/spark-common/eslib compile; paimon-core test-compile passes.

CrownChu added 13 commits May 25, 2026 20:38

[globalindex] Support multi-column in CreateGlobalIndexProcedure

c9d47d0

Allow index_column parameter to accept comma-separated column names (e.g. "title,embedding") for both Flink and Spark procedures. Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.
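
A hedged sketch of how a comma-separated index_column value might be split into an ordered column list. `IndexColumnsSketch` is a hypothetical helper, not the procedure code from this commit; trimming whitespace and rejecting empty entries are assumptions, while rejecting duplicates follows the later commit be43a94.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parser for the index_column procedure argument.
final class IndexColumnsSketch {
    // Splits "title,embedding" into ["title", "embedding"], preserving order.
    static List<String> parse(String indexColumn) {
        if (indexColumn == null || indexColumn.trim().isEmpty()) {
            throw new IllegalArgumentException("index_column must not be empty");
        }
        Set<String> seen = new LinkedHashSet<>();
        // limit -1 keeps trailing empty entries so "a," is rejected rather than silently accepted
        for (String raw : indexColumn.split(",", -1)) {
            String name = raw.trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("empty column name in: " + indexColumn);
            }
            if (!seen.add(name)) {
                throw new IllegalArgumentException("duplicate index column: " + name);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Order matters in such a list because the merged data model described elsewhere in this PR treats the first field as indexFieldId and the rest as extraFieldIds.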

[globalindex] Fix multi-column index metadata storage and resolveFiel…

8c4d5f2

…ds validation

[globalindex] Fix GenericIndexTopoBuilder multi-column null value error

a28f05f

[globalindex] Extract findMinNonIndexableRowId and filterEntriesBefor…

0cfc7ef

…e into GlobalIndexBuilderUtils

[globalindex] Fix test to reference GlobalIndexBuilderUtils after met…

a9a731d

…hod extraction

[globalindex] Fix multi-column writer projection, add BTree validatio…

55a445a

…n, and restore observability logs

[globalindex] Fix MERGE INTO crash when table has multi-column global…

cdf2605

… index (indexFieldId=-1)

[globalindex] Fix FullText/Vector read path mismatch and reject multi…

1cb4d24

…-column for unsupported index types

[globalindex] Add input validation, Spark schema filtering, null chec…

dc550e1

…k, and multi-column guard

[globalindex] Reject duplicate index columns and document why column …

be43a94

…count is unlimited

Merge apache/master into feature-globalindex-support-multi

0ac045a

Resolve conflict in GenericIndexTopoBuilder: keep multi-column write logic, unify variable name to rowsSeen.

CrownChu force-pushed the feature-globalindex-support-multi-eslib branch from 75ebc0d to b4d2f23 Compare May 27, 2026 15:27

CrownChu changed the title ~~[paimon-eslib] Support parallel search, scalar filter predicates, and public Maven dependency~~ [global-index-eslib] integrates the Elasticsearch (Lucene) index engine into Paimon's global index system May 28, 2026

CrownChu and others added 5 commits June 5, 2026 16:20

Merge remote-tracking branch 'github/feature-globalindex-support-mult…

c525d90

…i' into feature-globalindex-support-multi-eslib

CrownChu force-pushed the feature-globalindex-support-multi-eslib branch 2 times, most recently from 5dd45fa to e96fefa Compare June 16, 2026 10:08

CrownChu added 2 commits June 16, 2026 18:51

CrownChu wants to merge 20 commits into apache:master from CrownChu:feature-globalindex-support-multi-eslib

2 participants
