Accelerating vector operations on the JVM using the new jdk.incubator.vector module
Introduction
In my work on Elastiknn, I’ve spent many hours looking for ways to optimize vector operations on the Java Virtual Machine (JVM).
The jdk.incubator.vector module, introduced in JDK 16 as part of JEP 338 and Project Panama, is the first opportunity I’ve encountered for significant performance improvements in this area.
I recently had some time to experiment with this new module, and I cover my benchmarks and findings in this post. Overall, I found improvements in operations per second on the order of 2x to 3x compared to simple baselines. If vector operations are a bottleneck in your application, I recommend you try out this module.
Background
For our purposes, a “vector” is simply an array of floats: float[] in Java, Array[Float] in Scala, etc.
These are common in data science and machine learning.
The specific operations I’m interested in are:
- cosine similarity of two vectors
- dot product of two vectors
- L1 distance (aka, Taxicab distance) between two vectors
- L2 distance (aka, Euclidean distance) between two vectors
These are used commonly in nearest neighbor search.
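For reference, given two vectors u and v, each with n elements, the four operations are defined as:

dotProduct(u, v) = u₁·v₁ + u₂·v₂ + … + uₙ·vₙ
cosineSimilarity(u, v) = dotProduct(u, v) / (sqrt(dotProduct(u, u)) · sqrt(dotProduct(v, v)))
l1Distance(u, v) = |u₁ − v₁| + |u₂ − v₂| + … + |uₙ − vₙ|
l2Distance(u, v) = sqrt((u₁ − v₁)² + (u₂ − v₂)² + … + (uₙ − vₙ)²)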
The jdk.incubator.vector API provides a way to access hardware-level optimizations for processing vectors.
The two hardware-level optimizations mentioned in JEP 338 are Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX).
Here’s my over-simplified understanding of these optimizations: the various processor vendors have agreed on a set of CPU instructions for operating directly on vectors.
Just like they provide hardware-level instructions for adding two scalars, they now provide hardware-level instructions for adding two vectors.
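For example, a single 256-bit AVX2 instruction can operate on eight 32-bit floats at once, rather than one float at a time.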
These optimized instructions have been accessible for many years in lower-level languages like C and C++.
As part of Project Panama, the JDK has recently exposed an API to leverage these optimized instructions directly from JVM languages.
This API is contained in the jdk.incubator.vector module.
Benchmark Setup
My benchmarks measure operations per second for five implementations of each of the four vector operations. I start with a simple baseline and work through four possible optimizations.
I implemented the benchmark in Java and Scala: the actual vector operations are in Java, and the benchmark harness is in Scala. I use the Java Microbenchmark Harness (JMH) framework, via the sbt-jmh plugin, to execute the benchmark. This is my first time using JMH in any serious capacity, so I’m happy to hear feedback about better or simpler ways to use it.
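For readers who haven't used JMH, here is a rough sketch of how one of these benchmarks is structured. The actual harness is written in Scala and lives in the linked repository; this sketch uses plain Java JMH annotations, and the class and method names are purely illustrative (it references the VectorOperations interface and baseline implementation shown below):

import java.util.Random;
import org.openjdk.jmh.annotations.*;

public class Bench {

    @State(Scope.Benchmark)
    public static class Fixtures {
        public float[] v1, v2;
        public VectorOperations baseline = new BaselineVectorOperations();

        @Setup
        public void setup() {
            // Two random vectors of length 999, generated once and shared across invocations.
            Random rng = new Random(0);
            v1 = new float[999];
            v2 = new float[999];
            for (int i = 0; i < v1.length; i++) {
                v1[i] = rng.nextFloat();
                v2[i] = rng.nextFloat();
            }
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public double dotProductBaseline(Fixtures f) {
        // JMH counts how many times this method runs per second.
        return f.baseline.dotProduct(f.v1, f.v2);
    }
}

JMH handles forking, warmup, and measurement; the Score column in the results below is simply the number of times each benchmark method ran per second.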
Each variation implements this Java interface:
public interface VectorOperations {
// https://en.wikipedia.org/wiki/Cosine_similarity
double cosineSimilarity(float[] v1, float[] v2);
// https://en.wikipedia.org/wiki/Dot_product
double dotProduct(float[] v1, float[] v2);
// https://en.wikipedia.org/wiki/Taxicab_geometry
double l1Distance(float[] v1, float[] v2);
// https://en.wikipedia.org/wiki/Euclidean_distance
double l2Distance(float[] v1, float[] v2);
}
I verify the correctness of these optimizations with tests that run the baseline operation and the optimized operation on a pair of random vectors and check that the results match.
All benchmarks operate on a pair of randomly-generated floating-point vectors containing 999 elements. I chose length 999 specifically to make us deal with some additional complexity in the implementation. This will make more sense later in the post.
All benchmarks run on Oracle JDK 19.0.2, installed via asdf: $ asdf install java oracle-19.0.2.
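One setup detail worth noting: jdk.incubator.vector is an incubating module, so it has to be added explicitly at both compile time and run time via the --add-modules flag, roughly like this (the file and class names here are just placeholders):

javac --add-modules jdk.incubator.vector MyVectorCode.java
java --add-modules jdk.incubator.vector MyVectorCode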
All benchmarks run on my 2018 Mac Mini, which has an Intel i7-8700B processor with SSE4.1, SSE4.2, and AVX2 instruction set extensions.
Finally, all code is available in my site-projects repository: jdk-incubator-vector-optimizations
Baseline Implementation
We start with a baseline implementation of these vector operations. No clever tricks here. This is what we get if we take the definition for each operation from Wikipedia and translate it verbatim into Java:
public class BaselineVectorOperations implements VectorOperations {
public double cosineSimilarity(float[] v1, float[] v2) {
double dotProd = 0.0;
double v1SqrSum = 0.0;
double v2SqrSum = 0.0;
for (int i = 0; i < v1.length; i++) {
dotProd += v1[i] * v2[i];
v1SqrSum += Math.pow(v1[i], 2);
v2SqrSum += Math.pow(v2[i], 2);
}
return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
public double dotProduct(float[] v1, float[] v2) {
float dotProd = 0f;
for (int i = 0; i < v1.length; i++) dotProd += v1[i] * v2[i];
return dotProd;
}
public double l1Distance(float[] v1, float[] v2) {
double sumAbsDiff = 0.0;
for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
return sumAbsDiff;
}
public double l2Distance(float[] v1, float[] v2) {
double sumSqrDiff = 0.0;
for (int i = 0; i < v1.length; i++) sumSqrDiff += Math.pow(v1[i] - v2[i], 2);
return Math.sqrt(sumSqrDiff);
}
}
This produces the following results:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Fused Multiply Add (Math.fma)
Next, we introduce an optimization based on the “Fused Multiply Add” operator, implemented by the Math.fma method, documented here.
Math.fma takes three floats, a, b, and c, and executes a * b + c as a single operation.
This has some advantages with respect to floating-point error and performance: executing one fused operation is generally faster than two separate operations, and it incurs only one rounding error.
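As a tiny illustration of the difference between the two forms:

float a = 1.1f, b = 2.2f, c = 3.3f;
float fused = Math.fma(a, b, c);  // one fused multiply-add, rounded once
float unfused = a * b + c;        // multiply, round, add, round again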
I found a way to use Math.fma in all the vector operations except l1Distance.
The implementations look like this:
public class FmaVectorOperations implements VectorOperations {
public double cosineSimilarity(float[] v1, float[] v2) {
double dotProd = 0.0;
double v1SqrSum = 0.0;
double v2SqrSum = 0.0;
for (int i = 0; i < v1.length; i++) {
dotProd = Math.fma(v1[i], v2[i], dotProd);
v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
}
return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
public double dotProduct(float[] v1, float[] v2) {
float dotProd = 0f;
for (int i = 0; i < v1.length; i++) dotProd = Math.fma(v1[i], v2[i], dotProd);
return dotProd;
}
public double l1Distance(float[] v1, float[] v2) {
// Does not actually leverage Math.fma.
double sumAbsDiff = 0.0;
for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
return sumAbsDiff;
}
public double l2Distance(float[] v1, float[] v2) {
double sumSqrDiff = 0.0;
float diff;
for (int i = 0; i < v1.length; i++) {
diff = v1[i] - v2[i];
sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
}
return Math.sqrt(sumSqrDiff);
}
}
The results are:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.cosineSimilarityFma thrpt 6 1086514.074 ± 15190.380 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.dotProductFma thrpt 6 1073368.454 ± 104684.436 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l1DistanceFma thrpt 6 1098354.824 ± 13870.211 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Bench.l2DistanceFma thrpt 6 1101736.286 ± 11985.949 ops/s
We see some improvement in two of the four cases:
- cosineSimilarity is ~1.6x faster.
- l2Distance is ~1.1x faster.
- dotProduct remains about the same, maybe a little worse.
- l1Distance does not leverage this optimization and predictably does not change much.
jdk.incubator.vector
Now we jump into optimizations using the jdk.incubator.vector module!
Crash course
I have to start with a crash course on this module. I also highly recommend reading the examples in JEP 338.
We are primarily using the jdk.incubator.vector.FloatVector class.
Given a pair of float[] arrays, the general pattern for using jdk.incubator.vector is as follows:
- We iterate over strides (i.e., segments or chunks) of the two arrays.
- At each iteration, we use the FloatVector.fromArray method to copy the current stride from each array into a FloatVector.
- We call methods on the FloatVector instances to execute mathematical operations. For example, if we have FloatVectors fv1 and fv2, then fv1.mul(fv2).reduceLanes(VectorOperators.ADD) runs a pairwise multiplication and sums the results (see the sketch after this list).
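Putting those pieces together, here is a minimal sketch of processing a single stride, assuming v1 and v2 are float[] arrays with at least species.length() elements:

VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
// Copy the first stride of each array into a FloatVector.
FloatVector fv1 = FloatVector.fromArray(species, v1, 0);
FloatVector fv2 = FloatVector.fromArray(species, v2, 0);
// Pairwise multiply the two strides, then sum the products into a scalar.
double partialDotProduct = fv1.mul(fv2).reduceLanes(VectorOperators.ADD);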
We also need to know about a helper class called jdk.incubator.vector.VectorSpecies, which does the following:
- Defines the stride length used for vector operations.
- Provides helper methods for iterating over the arrays in strides.
- Is required to copy values from an array into a FloatVector.
At the time of writing, there are four species: SPECIES_64, SPECIES_128, SPECIES_256, and SPECIES_512, plus two aliases: SPECIES_MAX and SPECIES_PREFERRED.
The numbers 64, 128, 256, and 512 refer to the number of bits in a FloatVector.
A Java float uses 4 bytes, or 32 bits, so a vector with SPECIES_256 lets us operate on 256 / 32 = 8 floats in a single operation.
I found it’s best to just stick with SPECIES_PREFERRED, which defaults to SPECIES_256 on my Mac Mini.
Throughput can actually decrease drastically with a suboptimal VectorSpecies.
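For example, on this machine:

VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
// SPECIES_PREFERRED resolves to SPECIES_256 here, so each stride holds 256 / 32 = 8 floats.
System.out.println(species.length()); // prints 8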
Finally, we need to consider what to do when the array length is shorter than, or not a multiple of, the VectorSpecies length.
If our source arrays have a length that’s an exact multiple of the stride length, then we can iterate over the strides with no elements left over.
Otherwise, we need to figure out what to do with the “tail” of elements that did not fill up a stride.
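For example, with SPECIES_256 each stride holds 8 floats, and 999 = 124 × 8 + 7, so a length-999 vector yields 124 full strides plus a 7-element tail.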
The way we handle this tail can have some non-negligible performance impact. This is why I chose to benchmark with vectors of length 999. There are three options for dealing with the tail, and they are the subject of the following three sections.
VectorMask on every Iteration
The first way to handle the tail is to avoid handling it separately by using a VectorMask on every stride.
We use the species to define the VectorMask, and then pass the mask along when creating the FloatVector.
I refer to this as Jep338FullMask, and the implementations look like this:
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338FullMaskVectorOperations implements VectorOperations {
private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
public double cosineSimilarity(float[] v1, float[] v2) {
double dotProd = 0.0;
double v1SqrSum = 0.0;
double v2SqrSum = 0.0;
FloatVector fv1, fv2;
for (int i = 0; i < v1.length; i += species.length()) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
}
return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
public double dotProduct(float[] v1, float[] v2) {
double dotProd = 0f;
FloatVector fv1, fv2;
for (int i = 0; i < v1.length; i += species.length()) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
}
return dotProd;
}
public double l1Distance(float[] v1, float[] v2) {
double sumAbsDiff = 0.0;
FloatVector fv1, fv2;
for (int i = 0; i < v1.length; i += species.length()) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
}
return sumAbsDiff;
}
public double l2Distance(float[] v1, float[] v2) {
double sumSqrDiff = 0f;
FloatVector fv1, fv2, fv3;
for (int i = 0; i < v1.length; i+= species.length()) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
fv3 = fv1.sub(fv2);
// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
}
return Math.sqrt(sumSqrDiff);
}
}
The results are:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.cosineSimilarityJep338FullMask thrpt 6 548425.342 ± 19160.168 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.dotProductJep338FullMask thrpt 6 384319.569 ± 20067.109 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l1DistanceJep338FullMask thrpt 6 356044.308 ± 5186.842 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Bench.l2DistanceJep338FullMask thrpt 6 376810.628 ± 3531.977 ops/s
The new implementation is actually significantly slower than the baseline.
Fortunately, the authors of JEP 338 mention this explicitly:
Since a mask is used in all iterations, the above implementation may not achieve optimal performance for large array lengths.
I haven’t looked extensively enough to understand why, but it seems like the VectorMask is either expensive to create, expensive to use, or maybe both.
Edit: The paper Java Vector API: Benchmarking and Performance Analysis discusses the performance of indexInRange, which is used to compute a VectorMask, in section 5.2. It turns out that indexInRange is only optimized on certain platforms, and degrades poorly on others.
Loop over the Tail
We can also handle the tail using a plain loop.
I refer to this as Jep338TailLoop, and the implementations look like this:
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338TailLoopVectorOperations implements VectorOperations {
private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
public double cosineSimilarity(float[] v1, float[] v2) {
double dotProd = 0.0;
double v1SqrSum = 0.0;
double v2SqrSum = 0.0;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
}
for (; i < v1.length; i++) {
dotProd = Math.fma(v1[i], v2[i], dotProd);
v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
}
return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
public double dotProduct(float[] v1, float[] v2) {
double dotProd = 0f;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
}
for (; i < v1.length; i++) {
dotProd = Math.fma(v1[i], v2[i], dotProd);
}
return dotProd;
}
public double l1Distance(float[] v1, float[] v2) {
double sumAbsDiff = 0.0;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
}
for (; i < v1.length; i++) {
sumAbsDiff += Math.abs(v1[i] - v2[i]);
}
return sumAbsDiff;
}
public double l2Distance(float[] v1, float[] v2) {
double sumSqrDiff = 0f;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2, fv3;
for (; i < bound; i+= species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
fv3 = fv1.sub(fv2);
// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
}
for (; i < v1.length; i++) {
float diff = v1[i] - v2[i];
sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
}
return Math.sqrt(sumSqrDiff);
}
}
The results are:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.cosineSimilarityJep338TailLoop thrpt 6 1169365.506 ± 5940.850 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.dotProductJep338TailLoop thrpt 6 3317032.038 ± 19343.830 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l1DistanceJep338TailLoop thrpt 6 2816348.680 ± 35389.932 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Bench.l2DistanceJep338TailLoop thrpt 6 2897756.618 ± 34180.451 ops/s
This time we have a significant improvement over the baseline:
- cosineSimilarity is ~1.7x faster.
- dotProduct is ~3.1x faster.
- l1Distance is ~2.6x faster.
- l2Distance is ~3x faster.
VectorMask on the Tail
What if we use a VectorMask, but only on the tail of the vector?
Is that faster than using a loop on the tail?
I refer to this as Jep338TailMask, and the implementations look like this:
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338TailMaskVectorOperations implements VectorOperations {
private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
public double cosineSimilarity(float[] v1, float[] v2) {
double dotProd = 0.0;
double v1SqrSum = 0.0;
double v2SqrSum = 0.0;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
}
if (i < v1.length) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
}
return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
public double dotProduct(float[] v1, float[] v2) {
double dotProd = 0f;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
}
if (i < v1.length) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
}
return dotProd;
}
public double l1Distance(float[] v1, float[] v2) {
double sumAbsDiff = 0.0;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2;
for (; i < bound; i += species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
}
if (i < v1.length) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
}
return sumAbsDiff;
}
public double l2Distance(float[] v1, float[] v2) {
double sumSqrDiff = 0f;
int i = 0;
int bound = species.loopBound(v1.length);
FloatVector fv1, fv2, fv3;
for (; i < bound; i+= species.length()) {
fv1 = FloatVector.fromArray(species, v1, i);
fv2 = FloatVector.fromArray(species, v2, i);
fv3 = fv1.sub(fv2);
// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
}
if (i < v1.length) {
VectorMask<Float> m = species.indexInRange(i, v1.length);
fv1 = FloatVector.fromArray(species, v1, i, m);
fv2 = FloatVector.fromArray(species, v2, i, m);
fv3 = fv1.sub(fv2);
sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
}
return Math.sqrt(sumSqrDiff);
}
}
The results are:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.cosineSimilarityJep338TailLoop thrpt 6 1169365.506 ± 5940.850 ops/s
Bench.cosineSimilarityJep338TailMask thrpt 6 1166971.620 ± 6927.790 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.dotProductJep338TailLoop thrpt 6 3317032.038 ± 19343.830 ops/s
Bench.dotProductJep338TailMask thrpt 6 2740443.003 ± 467202.628 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l1DistanceJep338TailLoop thrpt 6 2816348.680 ± 35389.932 ops/s
Bench.l1DistanceJep338TailMask thrpt 6 2717614.796 ± 14014.855 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Bench.l2DistanceJep338TailLoop thrpt 6 2897756.618 ± 34180.451 ops/s
Bench.l2DistanceJep338TailMask thrpt 6 2492492.274 ± 11376.759 ops/s
Across the board, using a VectorMask is clearly slower than just using a simple loop on the tail.
Complete Benchmark Results
Here are the full results once more for comparison:
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Bench.cosineSimilarityFma thrpt 6 1086514.074 ± 15190.380 ops/s
Bench.cosineSimilarityJep338FullMask thrpt 6 548425.342 ± 19160.168 ops/s
Bench.cosineSimilarityJep338TailLoop thrpt 6 1169365.506 ± 5940.850 ops/s
Bench.cosineSimilarityJep338TailMask thrpt 6 1166971.620 ± 6927.790 ops/s
Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Bench.dotProductFma thrpt 6 1073368.454 ± 104684.436 ops/s
Bench.dotProductJep338FullMask thrpt 6 384319.569 ± 20067.109 ops/s
Bench.dotProductJep338TailLoop thrpt 6 3317032.038 ± 19343.830 ops/s
Bench.dotProductJep338TailMask thrpt 6 2740443.003 ± 467202.628 ops/s
Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Bench.l1DistanceFma thrpt 6 1098354.824 ± 13870.211 ops/s
Bench.l1DistanceJep338FullMask thrpt 6 356044.308 ± 5186.842 ops/s
Bench.l1DistanceJep338TailLoop thrpt 6 2816348.680 ± 35389.932 ops/s
Bench.l1DistanceJep338TailMask thrpt 6 2717614.796 ± 14014.855 ops/s
Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Bench.l2DistanceFma thrpt 6 1101736.286 ± 11985.949 ops/s
Bench.l2DistanceJep338FullMask thrpt 6 376810.628 ± 3531.977 ops/s
Bench.l2DistanceJep338TailLoop thrpt 6 2897756.618 ± 34180.451 ops/s
Bench.l2DistanceJep338TailMask thrpt 6 2492492.274 ± 11376.759 ops/s
To summarize, the fastest approach for all four operations is Jep338TailLoop. This uses the jdk.incubator.vector API for all strides until the tail of the vectors, and then uses a plain loop to handle the tail.
Compared to the baseline, this approach yields some substantial improvements:
- cosineSimilarity is ~1.7x faster.
- dotProduct is ~3.1x faster.
- l1Distance is ~2.6x faster.
- l2Distance is ~3x faster.
Takeaways
I’ll close with my takeaways from this benchmark.
If vector operations are a bottleneck in your application, and jdk.incubator.vector is available on your platform, then it’s worth a try.
In my benchmarks, the speedup was anywhere from 1.7x to 3.1x.
When using jdk.incubator.vector, carefully consider and benchmark the usage of VectorMask. This abstraction seems quite expensive.
If jdk.incubator.vector is not available, then try using java.lang.Math.fma where possible. This still offers a noticeable speedup.
There are also some other optimized methods in java.lang.Math that seem like they could be useful.
My only aesthetic complaint is that the API forces us to duplicate code to handle the vector tail. However, the API is still far simpler and far more readable than the analogous APIs I’ve seen in C and C++.
Overall, I’m quite impressed by the jdk.incubator.vector module, and I’m excited to see this opportunity for tighter integration between the hardware and the JVM.
Appendix
Related Material
Here is some related material that I found useful:
- Limiting Factors in a Dot Product Calculation by Richard Startin, July 2018.
- This Twitter thread about jdk.incubator.vector by Denis Makogon, September 2022.
- Java Vector API: Benchmarking and Performance Analysis by Basso et al., February 2023.
- CS494 Lecture Notes - Some simple SIMD examples, Dr. James Plank, November 2019.
jdk.incubator.vector in Elastiknn
February 26, 2023
If you’d like to see an example of this in a real codebase, I incorporated jdk.incubator.vector as an optional optimization in Elastiknn in this pull request: alexklibisz/elastiknn #496.
This led to a speedup anywhere from 1.05x to 1.2x on the Elastiknn benchmarks.
FloatVector::pow is significantly slower than FloatVector::mul
February 26, 2023
While working on the benchmarks above, I found an interesting performance pitfall.
Namely, given a FloatVector fv, fv.mul(fv) is 36x faster than fv.pow(2).
The benchmark:
import java.lang
import jdk.incubator.vector.{FloatVector, VectorSpecies}
import org.openjdk.jmh.annotations._
import scala.util.Random

@State(Scope.Benchmark)
class BenchPowVsMulFixtures {
implicit private val rng: Random = new Random(0)
val species: VectorSpecies[lang.Float] = FloatVector.SPECIES_PREFERRED
val v: Array[Float] = (0 until species.length()).map(_ => rng.nextFloat()).toArray
val fv: FloatVector = FloatVector.fromArray(species, v, 0)
}
class BenchPowVsMul {
@Benchmark
@BenchmarkMode(Array(Mode.Throughput))
@Fork(value = 1)
@Warmup(time = 5, iterations = 3)
@Measurement(time = 5, iterations = 6)
def mul(f: BenchPowVsMulFixtures): Unit = f.fv.mul(f.fv)
@Benchmark
@BenchmarkMode(Array(Mode.Throughput))
@Fork(value = 1)
@Warmup(time = 5, iterations = 3)
@Measurement(time = 5, iterations = 6)
def pow(f: BenchPowVsMulFixtures): Unit = f.fv.pow(2)
}
The results:
Benchmark Mode Cnt Score Error Units
BenchPowVsMul.mul thrpt 6 1235649170.757 ± 216838871.439 ops/s
BenchPowVsMul.pow thrpt 6 34105529.504 ± 4899654.049 ops/s
I have no idea why this would be, but it seems like it could be a bug in the underlying implementation.
Update (March 1, 2023): I messaged the Panama dev mailing list and got an explanation for this:
The performance difference you observe is because the pow operation is falling back to scalar code (Math.pow on each lane element) and not using vector instructions. On x86 linux or windows you should observe better performance of the pow operation because it should leverage code from Intel’s Short Vector Math Library [1], but that code OS specific and is not currently ported on Mac OS.
Results on Apple Silicon (M1 Macbook Air)
February 26, 2023
I was curious how this would look on Apple silicon, so I also ran the benchmark on my 2020 M1 Macbook Air. Other than the host machine, the benchmark setup is identical to the results above.
Benchmark Mode Cnt Score Error Units
Bench.cosineSimilarityBaseline thrpt 6 1081276.495 ± 39124.749 ops/s
Bench.cosineSimilarityFma thrpt 6 836076.757 ± 47.184 ops/s
Bench.cosineSimilarityJep338FullMask thrpt 6 1050298.090 ± 81.960 ops/s
Bench.cosineSimilarityJep338TailLoop thrpt 6 1532220.920 ± 179274.549 ops/s
Bench.cosineSimilarityJep338TailMask thrpt 6 1452240.849 ± 6500.704 ops/s
Bench.dotProductBaseline thrpt 6 1136628.672 ± 180804.344 ops/s
Bench.dotProductFma thrpt 6 912200.315 ± 8839.657 ops/s
Bench.dotProductJep338FullMask thrpt 6 272444.658 ± 1642.048 ops/s
Bench.dotProductJep338TailLoop thrpt 6 4062575.031 ± 1541.393 ops/s
Bench.dotProductJep338TailMask thrpt 6 3372980.017 ± 4655.095 ops/s
Bench.l1DistanceBaseline thrpt 6 1134803.520 ± 22165.180 ops/s
Bench.l1DistanceFma thrpt 6 1146026.997 ± 2952.262 ops/s
Bench.l1DistanceJep338FullMask thrpt 6 271181.722 ± 416.756 ops/s
Bench.l1DistanceJep338TailLoop thrpt 6 4062832.939 ± 249.915 ops/s
Bench.l1DistanceJep338TailMask thrpt 6 3362605.808 ± 20805.696 ops/s
Bench.l2DistanceBaseline thrpt 6 1108095.885 ± 4237.677 ops/s
Bench.l2DistanceFma thrpt 6 860659.029 ± 8911.938 ops/s
Bench.l2DistanceJep338FullMask thrpt 6 269202.529 ± 326.229 ops/s
Bench.l2DistanceJep338TailLoop thrpt 6 2026410.994 ± 201.837 ops/s
Bench.l2DistanceJep338TailMask thrpt 6 3273131.452 ± 11284.378 ops/s
Here are the Baseline measurements merged:
Chip Benchmark Mode Cnt Score Error Units
Intel i7 Bench.cosineSimilarityBaseline thrpt 6 689586.585 ± 52359.955 ops/s
Apple M1 Bench.cosineSimilarityBaseline thrpt 6 1081276.495 ± 39124.749 ops/s
Intel i7 Bench.dotProductBaseline thrpt 6 1162553.117 ± 21328.673 ops/s
Apple M1 Bench.dotProductBaseline thrpt 6 1136628.672 ± 180804.344 ops/s
Intel i7 Bench.l1DistanceBaseline thrpt 6 1095704.529 ± 32440.308 ops/s
Apple M1 Bench.l1DistanceBaseline thrpt 6 1134803.520 ± 22165.180 ops/s
Intel i7 Bench.l2DistanceBaseline thrpt 6 951909.125 ± 23376.234 ops/s
Apple M1 Bench.l2DistanceBaseline thrpt 6 1108095.885 ± 4237.677 ops/s
The cosineSimilarity baseline is noticeably faster on the M1.
The others are comparable.
Here are the Jep338TailLoop measurements merged:
Chip Benchmark Mode Cnt Score Error Units
Intel i7 Bench.cosineSimilarityJep338TailLoop thrpt 6 1169365.506 ± 5940.850 ops/s
Apple M1 Bench.cosineSimilarityJep338TailLoop thrpt 6 1532220.920 ± 179274.549 ops/s
Intel i7 Bench.dotProductJep338TailLoop thrpt 6 3317032.038 ± 19343.830 ops/s
Apple M1 Bench.dotProductJep338TailLoop thrpt 6 4062575.031 ± 1541.393 ops/s
Intel i7 Bench.l1DistanceJep338TailLoop thrpt 6 2816348.680 ± 35389.932 ops/s
Apple M1 Bench.l1DistanceJep338TailLoop thrpt 6 4062832.939 ± 249.915 ops/s
Intel i7 Bench.l2DistanceJep338TailLoop thrpt 6 2897756.618 ± 34180.451 ops/s
Apple M1 Bench.l2DistanceJep338TailLoop thrpt 6 2026410.994 ± 201.837 ops/s
In this case, the M1 is faster in all but the l2Distance.
The M1’s error bounds are also impressively tight on the three operations that outperform the Intel.
June 7, 2023
I re-ran the benchmark on my M1 Max Macbook Pro, and the results were completely different! So please beware of benchmarking on the M-series chips. It’s an interesting endeavor, but also even more of a rabbit hole than benchmarking on x86.
Java Vector API: Benchmarking and Performance Analysis by Basso et al.
February 27, 2023
I discovered the paper Java Vector API: Benchmarking and Performance Analysis by Basso et al. shortly after releasing this post. It looks like they beat me to the release by about a week! This paper is an extremely thorough analysis of the topic, so I also highly recommend reading it.
In section 5.2, the authors discuss the negative performance effects of using indexInRange.
This matches up well with what I observed in the Jep338FullMask optimization.
It turns out that using indexInRange to build a VectorMask is only performant on systems supporting predicate registers.
I guess this particular feature is not supported by my Intel Mac Mini or my M1 MacBook Air.
In section 5.3, the authors discuss how .pow is far slower than .mul.
This aligns with my findings, also discussed in the appendix of this post, although I observed a much larger difference when evaluating the two operations in isolation.
Bug in Jep338FullMaskVectorOperations
May 20, 2023
I had a small bug in my original implementation of Jep338FullMaskVectorOperations, which I fixed in this PR on 5/20/23 and updated the code in this post.
Thanks to Twitter user Varun Thacker for finding it and proposing a fix.
Warning: Math.fma can be extremely slow on some platforms
In late May 2023, I noticed a Tweet from long-time Lucene contributor Uwe Schindler discussing how Lucene had also implemented vector optimizations based on the Panama Vector API. I replied with a link to this post and my implementation in Elastiknn. Uwe kindly responded with a warning about performance pitfalls of Math.fma.
I’m still exploring the effects of this in Elastiknn. I have some rough data indicating that it is in fact significantly slower on some platforms, so I’ll likely remove it. I figured it’s worth quickly mentioning here.