
[global-index-eslib] integrates the Elasticsearch (Lucene) index engine into Paimon's global index system #8000

Open
CrownChu wants to merge 20 commits into apache:master from CrownChu:feature-globalindex-support-multi-eslib

@CrownChu CrownChu commented May 27, 2026


Summary

This PR integrates the Elasticsearch (Lucene) index engine into Paimon's global index system via the paimon-eslib module, enabling Paimon tables to directly leverage the ES engine's vector search (DiskBBQ / HNSW) and scalar/text filtering capabilities. Key changes:

  • Parallel cluster search: Inject a shared ExecutorService from ESIndexGlobalIndexerFactory for DiskBBQ search, replacing per-reader thread pool creation
  • Concurrent close safety: Add a volatile closed flag with checkNotClosed() to fix a race condition in the reader lifecycle
  • Full scalar/text filtering: Implement all GlobalIndexReader visitor methods (visitEqual, visitLessThan, visitStartsWith, visitLike, visitIsNull, etc.), dispatching to ESLib's unified IndexFilter / ScalarPredicate API
  • Dependency coordinates: Change the eslib dependency groupId from org.elasticsearch to io.github.crownchu, version from 1.0.0-SNAPSHOT to 1.0.0
  • Public dependency repository: Add a GitHub-hosted Maven repository for public CI dependency resolution
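
The closed-flag guard from the "Concurrent close safety" item can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the class `GuardedReaderSketch` and its `search` method are hypothetical stand-ins, and only the `volatile` flag and the `checkNotClosed()` name come from the description above. A volatile flag by itself only makes `close()` visible to other threads; a search that passed the check just before `close()` ran can still be in flight, and the description does not say how the PR handles that case.

```java
import java.io.Closeable;

// Hypothetical stand-in for a reader; not the actual ESIndexGlobalIndexReader.
final class GuardedReaderSketch implements Closeable {
    // volatile so a close() on one thread is visible to searches on another.
    private volatile boolean closed;

    private void checkNotClosed() {
        if (closed) {
            throw new IllegalStateException("reader is closed");
        }
    }

    // Placeholder for a real search; doubles the input so the call is observable.
    int search(int query) {
        checkNotClosed();
        return query * 2;
    }

    @Override
    public void close() {
        closed = true; // idempotent: closing twice is harmless
    }
}
```

Every public entry point would call `checkNotClosed()` first, so use-after-close fails fast with a clear exception instead of touching released resources.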

Details

Paimon Integration with the ES Index Engine

Paimon's global index is bridged to the ES (Lucene) engine through paimon-eslib: on the write side, Flink uses this engine to build vector/scalar indexes (producing ESLib archive files); on the query side, ES's paimon-store mounts and reuses the same engine. This PR completes the query-side engine's parallel search, concurrency safety, and filter operators.

Parallel Search

The search thread pool lifecycle is owned by the factory layer (ESIndexGlobalIndexerFactory), lazily initialized and shared across all readers. Configurable via global-index.es-index.read-search-threads:

  • -1 (default): auto-size to half the available CPU cores, with a minimum of 2 threads
  • 0: disable parallel search (serial only)
  • >0: use the specified thread count
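
To make the sizing rule concrete, here is a small sketch of how a configured value might be resolved and how a pool could be created lazily and shared. The names `SearchThreadsSketch` and `LazySearchPoolSketch` are hypothetical and do not come from the PR; only the meaning of -1, 0, and positive values is taken from the list above. Rejecting values below -1 is an assumption of this sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; the PR's actual option parsing is not shown in the description.
final class SearchThreadsSketch {
    private SearchThreadsSketch() {}

    // Returns the effective thread count; 0 means "search serially".
    static int resolve(int configured, int availableProcessors) {
        if (configured < -1) {
            // Assumption: values below -1 are treated as invalid.
            throw new IllegalArgumentException("invalid thread count: " + configured);
        }
        if (configured == -1) {
            return Math.max(2, availableProcessors / 2); // auto: CPU/2, at least 2
        }
        return configured; // 0 (serial) or an explicit positive count
    }
}

// Hypothetical lazily created pool, shared by every caller of get().
final class LazySearchPoolSketch {
    private final int threads;
    private ExecutorService pool;

    LazySearchPoolSketch(int threads) {
        this.threads = threads;
    }

    // Returns null when parallel search is disabled (threads == 0).
    synchronized ExecutorService get() {
        if (threads == 0) {
            return null;
        }
        if (pool == null) {
            pool = Executors.newFixedThreadPool(threads);
        }
        return pool;
    }

    synchronized void shutdown() {
        if (pool != null) {
            pool.shutdown();
            pool = null;
        }
    }
}
```

A caller would typically pass `Runtime.getRuntime().availableProcessors()` as the second argument of `resolve`.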

The executor is injected through the full chain: Factory → Indexer → Reader → ESIndexSearcher → Lucene Codec (via the SearchExecutorHolder ThreadLocal bridge, to work around Lucene SPI's no-arg constructor constraint).
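
The ThreadLocal bridge idea can be sketched as below. Components that a service loader creates through a no-arg constructor cannot receive the executor as a constructor argument, so the caller parks it in a ThreadLocal for the duration of the call and the component reads it back. `ExecutorBridgeSketch` and `NoArgComponentSketch` are hypothetical names for illustration; the PR's `SearchExecutorHolder` may be structured differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical bridge; illustrates the pattern, not the PR's SearchExecutorHolder.
final class ExecutorBridgeSketch {
    private static final ThreadLocal<ExecutorService> CURRENT = new ThreadLocal<>();

    private ExecutorBridgeSketch() {}

    // Runs body with the executor visible on this thread, then restores the old value.
    static <T> T withExecutor(ExecutorService executor, Supplier<T> body) {
        ExecutorService previous = CURRENT.get();
        CURRENT.set(executor);
        try {
            return body.get();
        } finally {
            if (previous == null) {
                CURRENT.remove(); // avoid leaking the reference on pooled threads
            } else {
                CURRENT.set(previous);
            }
        }
    }

    // Returns the executor bound to the current thread, or null if none is bound.
    static ExecutorService current() {
        return CURRENT.get();
    }
}

// Stand-in for a component that must have a no-arg constructor.
final class NoArgComponentSketch {
    final ExecutorService executor;

    NoArgComponentSketch() {
        this.executor = ExecutorBridgeSketch.current();
    }
}
```

Restoring the previous value in `finally` keeps the binding scoped to one call, which matters when the calling thread is itself reused by a pool.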

Scalar Filter

ESIndexGlobalIndexReader now implements all visitor methods, dispatching to ESLib's unified filter API:

  • Numeric comparisons → ScalarPredicate.eq/lt/lte/gt/gte/in/notIn
  • Text matching → IndexFilter.TextFilter with TERM / PREFIX / WILDCARD ops
  • Null checks → IndexFilter.exists() / notExists()
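
The dispatch above might look roughly like the following. Because ESLib's real `IndexFilter` / `ScalarPredicate` signatures are not shown in this PR description, the sketch builds plain strings instead of calling them; `FilterDispatchSketch` and all of its methods are hypothetical. The pairing of each visitor method with an operator (for example LIKE to WILDCARD, IS NULL to notExists) and the `%` to `*`, `_` to `?` pattern translation are assumptions of this sketch, and escape handling is ignored.

```java
// Hypothetical dispatcher; strings stand in for ESLib filter objects.
final class FilterDispatchSketch {
    enum TextOp { TERM, PREFIX, WILDCARD }

    private FilterDispatchSketch() {}

    // Numeric literals go to a scalar predicate, other literals to a TERM text filter.
    static String visitEqual(String field, Object literal) {
        if (literal instanceof Number) {
            return "scalar(" + field + " eq " + literal + ")";
        }
        return text(field, TextOp.TERM, String.valueOf(literal));
    }

    static String visitLessThan(String field, Number literal) {
        return "scalar(" + field + " lt " + literal + ")";
    }

    static String visitStartsWith(String field, String prefix) {
        return text(field, TextOp.PREFIX, prefix);
    }

    // Assumption: SQL LIKE wildcards are rewritten to Lucene-style wildcards.
    static String visitLike(String field, String pattern) {
        String wildcard = pattern.replace('%', '*').replace('_', '?');
        return text(field, TextOp.WILDCARD, wildcard);
    }

    static String visitIsNull(String field) {
        return "notExists(" + field + ")";
    }

    static String visitIsNotNull(String field) {
        return "exists(" + field + ")";
    }

    private static String text(String field, TextOp op, String value) {
        return "text(" + field + " " + op + " " + value + ")";
    }
}
```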

Dependency Publishing

ESLib jars are published to a GitHub raw repository (CrownChu/es-paimon-lib-releases) with full Maven metadata and checksums. paimon-eslib/pom.xml declares the repository so CI can resolve dependencies without manual local installation.

CrownChu added 13 commits May 25, 2026 20:38
Extend the GlobalIndex SPI, build path, and query path to support
one index builder handling multiple columns (e.g. Lucene indexing
title + content + tags together). Key changes:

- GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
- GlobalIndexMultiColumnWriter: new interface for multi-column writes
- GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField>
- GlobalIndexScanner: route extraFieldIds to same reader group
- VectorScanImpl/FullTextScanImpl: match against extraFieldIds
- GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
- DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
- All single-column APIs preserved for backward compatibility
Allow index_column parameter to accept comma-separated column names
(e.g. "title,embedding") for both Flink and Spark procedures.
Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.
Resolve conflict in GenericIndexTopoBuilder: keep multi-column write
logic, unify variable name to rowsSeen.
…alar filter support

- Implement ESIndexGlobalIndexReader/Writer/Indexer/Factory for ESLib-based multi-index
- Inject shared ExecutorService from factory layer for parallel cluster search
- Add volatile closed flag with checkNotClosed() for concurrent close safety
- Implement all scalar/keyword/text filter visitor methods via ESLib IndexFilter API
- Change eslib groupId to io.github.crownchu, version 1.0.0
- Add GitHub-hosted Maven repository for CI dependency resolution
@CrownChu CrownChu force-pushed the feature-globalindex-support-multi-eslib branch from 75ebc0d to b4d2f23 Compare May 27, 2026 15:27
@CrownChu CrownChu changed the title [paimon-eslib] Support parallel search, scalar filter predicates, and public Maven dependency [global-index-eslib] integrates the Elasticsearch (Lucene) index engine into Paimon's global index system May 28, 2026
@leaves12138


Thanks for the contribution. This is a large new global-index integration and the current CI status has many failing jobs, so I am holding off on approval for now. Please get the build/test matrix green first, then it will be easier to do a meaningful code review.

CrownChu and others added 5 commits June 5, 2026 16:20
…i' into feature-globalindex-support-multi-eslib
- bump all module versions 1.5-SNAPSHOT -> 1.5-es-SNAPSHOT (es fork build)
- paimon-eslib: ESIndexGlobalIndexWriter / ESIndexGlobalIndexerFactory multi-column WIP + E2E test
- paimon-flink GenericIndexTopoBuilder + ESLibGlobalIndexITCase

Checkpoint before merging github/feature-globalindex-support-multi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- pom.xml: eslib-core dependency classifier lucene912 -> lucene9
- ESIndexGlobalIndexE2ETest: PaimonLucene912Codec -> PaimonLucene9Codec
  (fixes 'package org.elasticsearch.eslib.adapter.lucene912 does not exist')

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lindex-support-multi-eslib

Resolved 18 conflicts. Both the fork and upstream independently implemented
multi-column global index + vector/fulltext read; adopted upstream's on-disk
data model (GlobalIndexMeta: indexFieldId=primary, extraFieldIds=rest) since the
merged metadata layer enforces it, while preserving eslib multi-column support by
routing the full ordered field list through GlobalIndexerFactory.create(List<DataField>).
Dropped the fork's MULTI_COLUMN_INDEX_FIELD_ID(-1) sentinel encoding.
Kept paimon-eslib module alongside upstream's new paimon-vector.

Verified: paimon-common, paimon-core, paimon-flink-common, paimon-spark-common all compile.
Switch paimon-eslib's eslib-api/eslib-core coordinates from io.github.crownchu
to io.github.paimon.eslib and add the public OSS maven repo
(https://es-demo-test.oss-cn-hangzhou.aliyuncs.com/maven/) so GitHub CI can
resolve the eslib artifacts without internal alibaba maven access.
Updated the shade-plugin includes to the new groupId.

Verified: clean-settings build (no alibaba mirror) downloads eslib-api/eslib-core
from the OSS repo and paimon-eslib compiles (BUILD SUCCESS).
@CrownChu CrownChu force-pushed the feature-globalindex-support-multi-eslib branch 2 times, most recently from 5dd45fa to e96fefa Compare June 16, 2026 10:08
CrownChu added 2 commits June 16, 2026 18:51
Align the whole reactor (88 module poms) to 1.5-SNAPSHOT to match upstream
apache paimon. Previously fork modules used 1.5-es-SNAPSHOT while the
merged-in paimon-mosaic / paimon-vector modules declared parent
paimon-parent:1.5-SNAPSHOT, which broke CI at the project-reading stage
(Non-resolvable parent POM). Unifying the version resolves it.
… multi-column

Drop the fork-only List<DataField> overloads and route the eslib multi-column
path through upstream's create(DataField indexField, List<DataField> extraFields,
Options) API. This makes GlobalIndexer, GlobalIndexerFactory and
GlobalIndexBuilderUtils byte-identical to upstream master (shrinks the PR diff).

- GlobalIndexer/GlobalIndexerFactory/GlobalIndexBuilderUtils: reverted to upstream
- ESIndexGlobalIndexerFactory: override create(DataField, extraFields, Options)
- GlobalIndexScanner / VectorReadImpl / FullTextReadImpl: pass (indexField, extraFields)
- flink GenericIndexTopoBuilder / spark DefaultGlobalIndexBuilder: pass (indexField, extraFields)

Verified: paimon-common/core/flink-common/spark-common/eslib compile;
paimon-core test-compile passes.