Port Panama SIMD kernels to C++ using Google Highway #668
Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
eb12f8e to
73efe07
Compare
MarkWolters
left a comment
There was a problem hiding this comment.
I think it looks good. I did also use Bob to do a review and double check myself and it had this comment (I understand what it is saying but I will leave it to your discretion if it is correct):
jvector_simd_kernels.cpp lines 571 and 943
Both calculate_partial_sums_f32 and calculate_partial_sums_self_magnitude_f32 have a size==2 fast path that uses hn::Shuffle2301 for the horizontal add. This is wrong.
For size==2, centroids are interleaved as [c0[0], c0[1], c1[0], c1[1], ...]. The goal is to sum adjacent pairs within each centroid. Shuffle2301 on [a, b, c, d] produces [c, d, a, b] — swaps 64-bit halves — so
score + Shuffle2301(score) mixes elements from different centroids. The correct shuffle is Shuffle1032, which swaps adjacent 32-bit elements: [b, a, d, c], so score + Shuffle1032(score) gives [s0+s1, s0+s1,
s2+s3, s2+s3], which is the correct per-centroid sum.
Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index
past all centroids.
Fix:
// Line 571 and 943: change
hn::Shuffle2301(score) → hn::Shuffle1032(score)
hn::Shuffle2301(sum) → hn::Shuffle1032(sum)
Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing. |
I believe the confusion came from incorrect comments in the C++ file—Bob likely relied on those rather than the official Google Highway documentation to interpret the intrinsic. I’ve since corrected the comments here: 9a52b4d |
| exit 2 | ||
| fi | ||
|
|
||
| BUILD_DIR="build" |
There was a problem hiding this comment.
Should we use a subfolder in maven's target directory, for example jvector-native/target/meson-build? It keeps both Java and C++ build artefacts in the same place and has the added bonus that maven will automatically wipe everything on running mvn clean.
- Replace jvector_simd.c + jvector_simd_check.c with C++ using Highway - Add jvector_simd.cpp (JNI dispatch layer) and jvector_simd_kernels.cpp/h (all SIMD kernel implementations: FP32, PQ, NVQ) - Add meson.build for building with Highway targets - Add Google Highway as git submodule (third_party/highway) - Add supporting headers: jvector_cpuFeatures.h, assertHwyTargets.h - Regenerate NativeSimdOps.java JNI bindings from new jvector_simd.h - Add __fsid_t.java and max_align_t.java (jextract-generated stubs) - Remove AVX-512 check from NativeVectorizationProvider; replace with x86_64 architecture guard (Highway selects best ISA at runtime) - Remove AVX-512 test from NativeSimdOpsTest - Update jextract_vector_simd.sh for new header layout - Update README with Highway build instructions
Wire up FP32 SIMD kernels in NativeVectorUtilSupport: - dotProduct(v1, v2) and dotProduct(v1, offset, v2, offset, len) via dot_product_f32 (dispatches to best ISA via Highway) - cosine(v1, v2) and cosine(v1, offset, v2, offset, len) via cosine_f32 - squareDistance(v1, v2) and squareDistance(v1, offset, v2, offset, len) via euclidean_f32 - addInPlace(v1, v2) and addInPlace(v1, scalar) via add_in_place_f32 / add_scalar_in_place_f32 - subInPlace(v1, v2) and subInPlace(v1, scalar) via sub_in_place_f32 / sub_scalar_in_place_f32 - max(v) via max_f32 - minInPlace(v1, v2) via min_in_place_f32 FP32 distance kernels are gated on length >= 128 (below that threshold the Panama vector fallback is used).
Wire up PQ SIMD kernels in NativeVectorUtilSupport: - assembleAndSum: switch to assemble_and_sum_f32 (was _512 variant) - assembleAndSumPQ: replace Java fallback with assemble_and_sum_pq_f32 native call; validates ordinal offsets are 0 via assertions - pqDecodedCosineSimilarity: switch to pq_decoded_cosine_similarity_f32 (was _512 variant); passes length as long - calculatePartialSums (new): dispatches to calculate_partial_sums_euclidean_f32 or calculate_partial_sums_dot_f32 based on VectorSimilarityFunction
Wire up NVQ (Non-uniform Vector Quantization) SIMD kernels in NativeVectorUtilSupport: - nvqShuffleQueryInPlace8bit: pre-shuffle query vector for fast-lane dequantization in scoring kernels - nvqQuantize8bit: quantize float vector to 8-bit NVQ representation - nvqLoss / nvqUniformLoss: compute quantization loss for parameter tuning - nvqSquareL2Distance8bit: L2 distance between float query and 8-bit quantized vector - nvqDotProduct8bit: dot product between float query and 8-bit quantized vector - nvqCosine8bit: cosine similarity; native returns packed int64 (low 32 bits = dot sum, high 32 bits = quantized magnitude), unpacked to float[]
69f3d38 to
0fae13f
Compare
| JVECTOR_SIMD_API void sub_scalar_in_place_f32(float* v1, float value, size_t length); | ||
| JVECTOR_SIMD_API float max_f32(const float* v, size_t length); | ||
| JVECTOR_SIMD_API void min_in_place_f32(float* v1, const float* v2, size_t length); | ||
| #ifdef __cplusplus |
There was a problem hiding this comment.
@r-devulap - looks like the bulk_quantized_shuffle_* kernels are not present in your new API spec? Are they being removed in the highway implementation, or migrated elsewhere?
bulk_quantized_shuffle_euclidean_f32_512
bulk_quantized_shuffle_dot_f32_512
bulk_quantized_shuffle_cosine_f32_512
There was a problem hiding this comment.
While those kernels exist in the main branch, they aren't used anywhere in the code base and hence I didn't port them to Highway.
| sum_aa = hn::MulAdd(va, va, sum_aa); | ||
| sum_bb = hn::MulAdd(vb, vb, sum_bb); | ||
| } | ||
| return hn::ReduceSum(tag, sum_ab) |
There was a problem hiding this comment.
so if either input on the sqrtf method produces a 0, this would lead to a NaN I think? Should this be expected behavior? Or return 0?
Even if this returns NaN, this should be documented
| template <class D> | ||
| HWY_INLINE hn::Vec<D> LoadDup256(D d, const float *HWY_RESTRICT ptr) | ||
| { | ||
| static_assert(hn::MaxLanes(d) <= 16, |
There was a problem hiding this comment.
wouldn't this fail on ARM 1024 bit ISAs, part of SVE and SVE2? Perhaps consider adding a todo to cover those architectures in the future.
There was a problem hiding this comment.
AFAIk, there is no ARM CPU with 1024 byte wide lanes. This can revisited if that ever comes up.
There was a problem hiding this comment.
https://developer.arm.com/documentation/102340/0100/Introducing-SVE2
The functionality exists in ARM cpu's today.
"Silicon partners can choose a suitable vector length design implementation for hardware that varies between 128 bits and 2048 bits, at 128-bit increments."
| @@ -0,0 +1,3 @@ | |||
| [submodule "jvector-native/src/main/c/third_party/highway"] | |||
| path = jvector-native/src/main/c/third_party/highway | |||
| url = https://github.com/google/highway.git | |||
There was a problem hiding this comment.
seems like you're depending on the main branch of google highway.. would it make sense to pin to a versioned release of highway instead? This will help with version binding with jvector releases.
|
|
||
| ### Building native libraries | ||
|
|
||
| The native SIMD library (`libjvector.so`) requires **g++ 11+** and is built by the script |
There was a problem hiding this comment.
g++ 11 is an older release of g++, which might be missing out the newer improvements, isa-additions in the library. Why pin to an older version, and not something more recent like 15 or 16?
There was a problem hiding this comment.
Its a minimum requirement, not a version pinned to: g++ 11+.
| limitations under the License. | ||
| --> | ||
|
|
||
| # JVector Native SIMD Library |
There was a problem hiding this comment.
should the path under jvector-native/src/main/c/* be changed to
jvector-native/src/main/* or something different? Given there will be no c code going forward.
| # limitations under the License. | ||
|
|
||
| project('jvector_simd_kernels', 'cpp', | ||
| version: '0.1.0', |
There was a problem hiding this comment.
since the native library is strongly coupled with jvector, would it make sense to keep the version number across both the same to avoid confusion?
Unless there is a plan to decouple the native code from the library, where it makes sense to keep them separate.
|
|
||
| # Highway headers are used as an include directory only. | ||
| # HWY_COMPILE_ONLY_STATIC / HWY_COMPILE_ONLY_SCALAR bypass the | ||
| # dynamic-dispatch runtime, so only headers are needed at compile time. |
There was a problem hiding this comment.
would it be a good idea to keep meson.build files in a separate directory from .cpp and .h files?
There was a problem hiding this comment.
meson already builds in a separate directory and all the build artifacts are in that cleanly separated from source. I am not sure this adds any value
| The script: | ||
| 1. Verifies prerequisites (g++, meson, ninja, Highway submodule). | ||
| 2. Runs `meson setup build --wipe --buildtype=release` then `meson compile`. | ||
| 3. Copies the versioned `.so` to `../resources/libjvector.so` where the Java |
There was a problem hiding this comment.
If I understand this correctly - as a consumer of the jvector library, we will consume the version of the native library which is built and published as part of the jvector versioned release. Is that accurate?
If that is the case, then it becomes architecture dependent presently as you are compiling using g++ with -march=skylake etc. which make it x86 dependent only.
Thinking out further, some possible thoughts:
- Would you consider multiple jvector releases - x86 based and ARM based (and potentially ppc based)?
- Perhaps a single release with multiple .so files each covering different arch would suffice? The dispatch layer would need to load the correct .so based on the environment it is executed upon.
- Would this be a capability better addressed by a user, e.g. opensearch who makes their choices?
There was a problem hiding this comment.
If that is the case, then it becomes architecture dependent presently as you are compiling using g++ with -march=skylake etc. which make it x86 dependent only.
I think we spoke about this in the meeting along with @MarkWolters. This patch currently only builds SIMD native module only for x86 (AVX512, AVX2, SSE4.2) with auto detection of SIMD features. I am not sure how the jar files will be packaged when we build this for ARM as well, but I assume we will build multiple shared library (all of which should be the jar file) and at runtime it will auto pick the right one. @MarkWolters is a better person to answer this and can confirm if that is right.
This PR rewrites JVector's Panama Vector API-based SIMD kernels with native C++ implementations using Google Highway, a portable SIMD library that compiles a single kernel source into multiple ISA targets (SSE42, AVX2 and AVX-512) and dispatches at runtime.
What changed
jvector_simd.cis replaced byjvector_simd.cpp(ISA dispatch shim) andjvector_simd_kernels.cpp(all Highway kernel implementations), with a newmeson.builddriving multi-target compilation (adds meson as a build dependency).calculatePartialSelfSumkernel.NativeSimdOps.javais regenerated viajextractto match the updated C APINativeVectorUtilSupportis updated and the native kernels are now unconditionally preferred over any Panama fallback for dot product, L2, and cosine distanceThe README file
jvector-native/src/main/c/README.mdis a good start before reviewing the code.