Port Panama SIMD kernels to C++ using Google Highway by r-devulap · Pull Request #668 · datastax/jvector

r-devulap · 2026-05-26T15:15:21Z

This PR replaces JVector's Panama Vector API-based SIMD kernels with native C++ implementations built on Google Highway, a portable SIMD library that compiles a single kernel source for multiple ISA targets (SSE4.2, AVX2, and AVX-512) and dispatches among them at runtime.

What changed

Introduce Google Highway as a git submodule.
jvector_simd.c is replaced by jvector_simd.cpp (ISA dispatch shim) and jvector_simd_kernels.cpp (all Highway kernel implementations), with a new meson.build driving multi-target compilation (adds meson as a build dependency).
All FP32, PQ, and NVQ kernels are ported to Highway, with a new calculatePartialSelfSum kernel.
Ported the optimizations from "Wire calculatePartialSums to native SIMD via Panama FFI downcall" (#651) to Google Highway.
NativeSimdOps.java is regenerated via jextract to match the updated C API.
NativeVectorUtilSupport is updated, and the native kernels are now unconditionally preferred over any Panama fallback for dot product, L2, and cosine distance.

The README file jvector-native/src/main/c/README.md is a good start before reviewing the code.

github-actions · 2026-05-26T15:15:38Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jshook

Looks ok to me.

MarkWolters

I think it looks good. I did also use Bob to do a review and double check myself and it had this comment (I understand what it is saying but I will leave it to your discretion if it is correct):

jvector_simd_kernels.cpp lines 571 and 943

Both calculate_partial_sums_f32 and calculate_partial_sums_self_magnitude_f32 have a size==2 fast path that uses hn::Shuffle2301 for the horizontal add. This is wrong.

For size==2, centroids are interleaved as [c0[0], c0[1], c1[0], c1[1], ...]. The goal is to sum adjacent pairs within each centroid. Shuffle2301 on [a, b, c, d] produces [c, d, a, b] (swaps 64-bit halves), so score + Shuffle2301(score) mixes elements from different centroids. The correct shuffle is Shuffle1032, which swaps adjacent 32-bit elements: [b, a, d, c], so score + Shuffle1032(score) gives [s0+s1, s0+s1, s2+s3, s2+s3], which is the correct per-centroid sum.

Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index past all centroids.

Fix:
// Line 571 and 943: change
hn::Shuffle2301(score) → hn::Shuffle1032(score)
hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

r-devulap · 2026-05-27T01:34:52Z

> Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index past all centroids.
>
> Fix:
> // Line 571 and 943: change
> hn::Shuffle2301(score) → hn::Shuffle1032(score)
> hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

V: {u,i,f}{32}
V Shuffle2301(V): returns blocks with 32-bit halves swapped inside 64-bit halves.
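
To make the lane semantics in this thread concrete, here is a plain C++ emulation of what the two shuffles do to one 128-bit block of four floats, following the Highway documentation quoted above. It does not use Highway itself: `shuffle2301` and `shuffle1032` are hypothetical stand-ins for `hn::Shuffle2301` and `hn::Shuffle1032`, arrays are written in memory order (lane 0 first), and the input values are made up for illustration.

```cpp
#include <array>
#include <cassert>

using Block = std::array<float, 4>;  // one 128-bit block, lane 0 first

// Stand-in for hn::Shuffle2301: swap the 32-bit lanes inside each 64-bit half.
// The digits name the source lane for output lanes 3, 2, 1, 0.
inline Block shuffle2301(const Block& v) { return {v[1], v[0], v[3], v[2]}; }

// Stand-in for hn::Shuffle1032: swap the two 64-bit halves of the block.
inline Block shuffle1032(const Block& v) { return {v[2], v[3], v[0], v[1]}; }

// Lane-wise addition of two blocks.
inline Block add(const Block& a, const Block& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// Illustrative data: two size==2 centroids laid out as [c0[0], c0[1], c1[0], c1[1]].
inline void shuffle_example() {
    const Block score = {1.0f, 2.0f, 10.0f, 20.0f};

    // Adjacent pairs are summed, so each centroid gets its own total.
    const Block pair_sum = add(score, shuffle2301(score));
    assert((pair_sum == Block{3.0f, 3.0f, 30.0f, 30.0f}));

    // Swapping 64-bit halves instead adds lanes from different centroids.
    const Block mixed = add(score, shuffle1032(score));
    assert((mixed == Block{11.0f, 22.0f, 11.0f, 22.0f}));
}
```

Under these assumptions, the pairwise horizontal add for size==2 subspaces needs the within-64-bit swap, which is consistent with the unit tests failing when `Shuffle1032` is substituted.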

r-devulap · 2026-05-27T02:54:33Z

> Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

I believe the confusion came from incorrect comments in the C++ file—Bob likely relied on those rather than the official Google Highway documentation to interpret the intrinsic. I’ve since corrected the comments here: 9a52b4d

ashkrisk · 2026-05-29T05:19:31Z

+    exit 2
+fi
+
+BUILD_DIR="build"


Should we use a subfolder in maven's target directory, for example jvector-native/target/meson-build? It keeps both Java and C++ build artefacts in the same place and has the added bonus that maven will automatically wipe everything on running mvn clean.

- Replace jvector_simd.c + jvector_simd_check.c with C++ using Highway
- Add jvector_simd.cpp (JNI dispatch layer) and jvector_simd_kernels.cpp/h (all SIMD kernel implementations: FP32, PQ, NVQ)
- Add meson.build for building with Highway targets
- Add Google Highway as git submodule (third_party/highway)
- Add supporting headers: jvector_cpuFeatures.h, assertHwyTargets.h
- Regenerate NativeSimdOps.java JNI bindings from new jvector_simd.h
- Add __fsid_t.java and max_align_t.java (jextract-generated stubs)
- Remove AVX-512 check from NativeVectorizationProvider; replace with x86_64 architecture guard (Highway selects best ISA at runtime)
- Remove AVX-512 test from NativeSimdOpsTest
- Update jextract_vector_simd.sh for new header layout
- Update README with Highway build instructions

Wire up FP32 SIMD kernels in NativeVectorUtilSupport:
- dotProduct(v1, v2) and dotProduct(v1, offset, v2, offset, len) via dot_product_f32 (dispatches to best ISA via Highway)
- cosine(v1, v2) and cosine(v1, offset, v2, offset, len) via cosine_f32
- squareDistance(v1, v2) and squareDistance(v1, offset, v2, offset, len) via euclidean_f32
- addInPlace(v1, v2) and addInPlace(v1, scalar) via add_in_place_f32 / add_scalar_in_place_f32
- subInPlace(v1, v2) and subInPlace(v1, scalar) via sub_in_place_f32 / sub_scalar_in_place_f32
- max(v) via max_f32
- minInPlace(v1, v2) via min_in_place_f32

FP32 distance kernels are gated on length >= 128 (below that threshold the Panama vector fallback is used).

Wire up PQ SIMD kernels in NativeVectorUtilSupport:
- assembleAndSum: switch to assemble_and_sum_f32 (was _512 variant)
- assembleAndSumPQ: replace Java fallback with assemble_and_sum_pq_f32 native call; validates ordinal offsets are 0 via assertions
- pqDecodedCosineSimilarity: switch to pq_decoded_cosine_similarity_f32 (was _512 variant); passes length as long
- calculatePartialSums (new): dispatches to calculate_partial_sums_euclidean_f32 or calculate_partial_sums_dot_f32 based on VectorSimilarityFunction

Wire up NVQ (Non-uniform Vector Quantization) SIMD kernels in NativeVectorUtilSupport:
- nvqShuffleQueryInPlace8bit: pre-shuffle query vector for fast-lane dequantization in scoring kernels
- nvqQuantize8bit: quantize float vector to 8-bit NVQ representation
- nvqLoss / nvqUniformLoss: compute quantization loss for parameter tuning
- nvqSquareL2Distance8bit: L2 distance between float query and 8-bit quantized vector
- nvqDotProduct8bit: dot product between float query and 8-bit quantized vector
- nvqCosine8bit: cosine similarity; native returns packed int64 (low 32 bits = dot sum, high 32 bits = quantized magnitude), unpacked to float[]
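
The packed return value described for nvqCosine8bit can be illustrated with a small sketch. It assumes that each 32-bit half carries the IEEE-754 bit pattern of a float, which the commit message does not state explicitly. The function names `pack_two_floats` and `unpack_two_floats` are hypothetical and are not part of the jvector API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Hypothetical packing: low 32 bits hold the bits of `dot`,
// high 32 bits hold the bits of `magnitude`.
inline std::int64_t pack_two_floats(float dot, float magnitude) {
    std::uint32_t lo = 0, hi = 0;
    std::memcpy(&lo, &dot, sizeof lo);        // bit-exact copy, no conversion
    std::memcpy(&hi, &magnitude, sizeof hi);
    return static_cast<std::int64_t>((static_cast<std::uint64_t>(hi) << 32) | lo);
}

// Inverse of pack_two_floats: returns {dot, magnitude}.
inline std::pair<float, float> unpack_two_floats(std::int64_t packed) {
    const std::uint64_t bits = static_cast<std::uint64_t>(packed);
    const std::uint32_t lo = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    const std::uint32_t hi = static_cast<std::uint32_t>(bits >> 32);
    float dot = 0.0f, magnitude = 0.0f;
    std::memcpy(&dot, &lo, sizeof dot);
    std::memcpy(&magnitude, &hi, sizeof magnitude);
    return {dot, magnitude};
}

// Round trip: the two floats come back bit-for-bit.
inline void packing_example() {
    const auto result = unpack_two_floats(pack_two_floats(1.5f, -2.25f));
    assert(result.first == 1.5f);
    assert(result.second == -2.25f);
}
```

Returning one 64-bit integer lets a native function hand back two values without an output buffer; the caller then splits it into the two floats.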

akash-shankaran · 2026-06-12T05:42:56Z

+JVECTOR_SIMD_API void    sub_scalar_in_place_f32(float* v1, float value, size_t length);
+JVECTOR_SIMD_API float   max_f32(const float* v, size_t length);
+JVECTOR_SIMD_API void    min_in_place_f32(float* v1, const float* v2, size_t length);
+#ifdef __cplusplus


@r-devulap - looks like the bulk_quantized_shuffle_* kernels are not present in your new API spec? Are they being removed in the highway implementation, or migrated elsewhere?

jvector/jvector-native/src/main/c/jvector_simd.c

Line 457 in 863833e

void bulk_quantized_shuffle_dot_f32_512(const unsigned char* shuffles, int codebookCount, const char* quantizedPartials, float delta, float best, float* results) {

bulk_quantized_shuffle_euclidean_f32_512
bulk_quantized_shuffle_dot_f32_512
bulk_quantized_shuffle_cosine_f32_512

While those kernels exist in the main branch, they aren't used anywhere in the code base and hence I didn't port them to Highway.

akash-shankaran · 2026-06-18T04:19:11Z

+        sum_aa = hn::MulAdd(va, va, sum_aa);
+        sum_bb = hn::MulAdd(vb, vb, sum_bb);
+    }
+    return hn::ReduceSum(tag, sum_ab)


So if either input to the sqrtf call is 0, this would produce a NaN, I think? Is that the expected behavior, or should it return 0?
Even if it returns NaN, this should be documented.
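
The concern can be reproduced with a scalar reference, sketched below. This is not the Highway kernel: the return expression in the hunk above is truncated, so the denominator form `sqrt(sum_aa * sum_bb)` is an assumption, and both function names are hypothetical. Returning 0 in the guarded variant is one possible policy, not necessarily the one the project would choose.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar reference: dot(a, b) / sqrt(|a|^2 * |b|^2), with no zero check.
inline float cosine_unguarded(const float* a, const float* b, std::size_t n) {
    float sum_ab = 0.0f, sum_aa = 0.0f, sum_bb = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_ab += a[i] * b[i];
        sum_aa += a[i] * a[i];
        sum_bb += b[i] * b[i];
    }
    return sum_ab / std::sqrt(sum_aa * sum_bb);  // 0/0 when either input is all zeros
}

// Same computation, but a zero denominator yields 0 instead of NaN.
inline float cosine_guarded(const float* a, const float* b, std::size_t n) {
    float sum_ab = 0.0f, sum_aa = 0.0f, sum_bb = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_ab += a[i] * b[i];
        sum_aa += a[i] * a[i];
        sum_bb += b[i] * b[i];
    }
    const float denom = std::sqrt(sum_aa * sum_bb);
    return denom == 0.0f ? 0.0f : sum_ab / denom;
}

inline void cosine_example() {
    const float zero[2] = {0.0f, 0.0f};
    const float v[2] = {3.0f, 4.0f};
    assert(std::isnan(cosine_unguarded(zero, v, 2)));  // zero-norm input gives NaN
    assert(cosine_guarded(zero, v, 2) == 0.0f);        // guarded policy gives 0
}
```

Note that the NaN check relies on IEEE-754 semantics, so it would not be reliable under fast-math style compiler flags.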

akash-shankaran · 2026-06-18T04:27:23Z

+template <class D>
+HWY_INLINE hn::Vec<D> LoadDup256(D d, const float *HWY_RESTRICT ptr)
+{
+    static_assert(hn::MaxLanes(d) <= 16,


Wouldn't this fail on ARM implementations with 1024-bit vectors, which SVE and SVE2 allow? Perhaps consider adding a TODO to cover those architectures in the future.

AFAIK, there is no ARM CPU with 1024-bit-wide vectors. This can be revisited if that ever comes up.
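
For reference, the arithmetic behind the `MaxLanes(d) <= 16` bound can be sketched as follows. This is plain C++ rather than Highway code, the helper names are hypothetical, and it only relates vector width in bits to float lane count; it makes no claim about how Highway reports `MaxLanes` on any particular target.

```cpp
#include <cassert>
#include <cstddef>

// Number of 32-bit float lanes in a vector register of the given bit width.
constexpr std::size_t float_lanes(std::size_t vector_bits) {
    return vector_bits / (8 * sizeof(float));
}

// Mirrors the bound in the static_assert shown above: at most 16 float lanes.
constexpr bool within_lane_bound(std::size_t vector_bits) {
    return float_lanes(vector_bits) <= 16;
}

inline void lanes_example() {
    assert(float_lanes(128) == 4);      // SSE4.2-sized vectors
    assert(float_lanes(512) == 16);     // AVX-512-sized vectors, the largest that fit
    assert(within_lane_bound(512));
    assert(!within_lane_bound(1024));   // 32 lanes would exceed the bound
}
```

So the assertion holds for every vector width up to 512 bits and would need revisiting for anything wider.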

https://developer.arm.com/documentation/102340/0100/Introducing-SVE2

The architecture already allows for this in ARM CPUs today:
"Silicon partners can choose a suitable vector length design implementation for hardware that varies between 128 bits and 2048 bits, at 128-bit increments."

akash-shankaran · 2026-06-18T04:29:44Z

@@ -0,0 +1,3 @@
+[submodule "jvector-native/src/main/c/third_party/highway"]
+	path = jvector-native/src/main/c/third_party/highway
+	url = https://github.com/google/highway.git


It seems you're depending on the main branch of Google Highway. Would it make sense to pin to a versioned release of Highway instead? That would help tie Highway versions to jvector releases.

akash-shankaran · 2026-06-18T04:32:54Z

+
+### Building native libraries
+
+The native SIMD library (`libjvector.so`) requires **g++ 11+** and is built by the script


g++ 11 is an older release of g++, which might be missing newer improvements and ISA additions. Why pin to an older version rather than something more recent like 15 or 16?

It's a minimum requirement, not a pinned version: g++ 11+.

akash-shankaran · 2026-06-18T04:38:01Z

+limitations under the License.
+-->
+
+# JVector Native SIMD Library


Should the path under jvector-native/src/main/c/* be changed to jvector-native/src/main/* or something different, given that there will be no C code going forward?

akash-shankaran · 2026-06-18T04:40:37Z

+# limitations under the License.
+
+project('jvector_simd_kernels', 'cpp',
+  version: '0.1.0',


Since the native library is strongly coupled with jvector, would it make sense to keep the version number the same across both to avoid confusion?

Unless there is a plan to decouple the native code from the library, where it makes sense to keep them separate.

akash-shankaran · 2026-06-18T04:46:41Z

+
+# Highway headers are used as an include directory only.
+# HWY_COMPILE_ONLY_STATIC / HWY_COMPILE_ONLY_SCALAR bypass the
+# dynamic-dispatch runtime, so only headers are needed at compile time.


Would it be a good idea to keep meson.build files in a separate directory from .cpp and .h files?

Meson already builds in a separate directory, so all the build artifacts are cleanly separated from the source. I am not sure this adds any value.

akash-shankaran · 2026-06-18T04:57:18Z

+The script:
+1. Verifies prerequisites (g++, meson, ninja, Highway submodule).
+2. Runs `meson setup build --wipe --buildtype=release` then `meson compile`.
+3. Copies the versioned `.so` to `../resources/libjvector.so` where the Java


If I understand this correctly - as a consumer of the jvector library, we will consume the version of the native library which is built and published as part of the jvector versioned release. Is that accurate?

If that is the case, then it is currently architecture dependent, as you are compiling using g++ with -march=skylake etc., which makes it x86-only.

Thinking out further, some possible thoughts:

Would you consider multiple jvector releases - x86 based and ARM based (and potentially ppc based)?

Perhaps a single release with multiple .so files each covering different arch would suffice? The dispatch layer would need to load the correct .so based on the environment it is executed upon.

Would this be a capability better addressed by a user, e.g. opensearch who makes their choices?

> If that is the case, then it is currently architecture dependent, as you are compiling using g++ with -march=skylake etc., which makes it x86-only.

I think we spoke about this in the meeting along with @MarkWolters. This patch currently builds the native SIMD module only for x86 (AVX-512, AVX2, SSE4.2), with auto-detection of SIMD features. I am not sure how the jar files will be packaged when we build this for ARM as well, but I assume we will build multiple shared libraries (all of which should be in the jar file) and the right one will be picked automatically at runtime. @MarkWolters is a better person to answer this and can confirm whether that is right.

r-devulap requested review from MarkWolters, ashkrisk, jshook and tlwillke as code owners May 26, 2026 15:15

r-devulap force-pushed the hwy-native-funcs branch from eb12f8e to 73efe07 Compare May 26, 2026 15:34

jshook reviewed May 26, 2026

Comment thread .github/workflows/run-bench.yml

jshook reviewed May 26, 2026

Comment thread jvector-native/src/main/c/third_party/highway

jshook approved these changes May 26, 2026

MarkWolters approved these changes May 26, 2026

Comment thread .github/workflows/run-bench.yml

Comment thread .github/workflows/run-bench.yml

Comment thread jvector-native/src/main/c/jextract_vector_simd.sh

Comment thread jvector-native/src/test/java/io/github/jbellis/jvector/vector/cnative/NativeSimdOpsTest.java

ashkrisk reviewed Jun 8, 2026

r-devulap added 16 commits June 10, 2026 05:11

Always prefer native dp, l2 and cosine

f7bf3d3

Add new calculatePartialSelfSum

de8f80f

Add Meson and Ninja installation to GitHub Actions workflows

1551c35

Add git submodule initialization to GitHub Actions workflows

dce8070

Exclude .gitmodules from RAT checks

369fcb3

Remove recursive submodule init from unit tests workflow

cdb7e53

Use DataStax license headers and exclude native build artifacts from RAT

85f267c

Remove unnecessary file and duplicate gcc version check

b57c437

Fix Shuffle2301/Shuffle1032 comments: diagrams and semantics were swapped

92789e4

Fix RAT excludes: add module-relative paths for build/ and third_party/

05600e5

Use snake case consistently

ac89981

Use --auto-install-deps instead of --auto-install-gcc

0fae13f

r-devulap force-pushed the hwy-native-funcs branch from 69f3d38 to 0fae13f Compare June 10, 2026 05:11

akash-shankaran reviewed Jun 12, 2026

This was referenced Jun 15, 2026

Wire calculatePartialSums to native SIMD via Panama FFI downcall #651

Closed

Use bulk writes instead of per-element writes in vector serialization #681

Open

Release docs #677

Merged

Usage of imprecise fp-model=fast. #360

Closed

akash-shankaran reviewed Jun 18, 2026
