Introduction

In my work on Elastiknn, I’ve spent many hours looking for ways to optimize vector operations on the Java Virtual Machine (JVM).

The jdk.incubator.vector module, introduced in JDK 16 as part of JEP 338 and Project Panama, is the first opportunity I’ve encountered for significant performance improvements in this area.

I recently had some time to experiment with this new module, and I cover my benchmarks and findings in this post. Overall, I found improvements in operations per second on the order of 2x to 3x compared to simple baselines. If vector operations are a bottleneck in your application, I recommend you try out this module.

Background

For our purposes, a “vector” is simply an array of floats: float[] in Java, Array[Float] in Scala, etc. These are common in data science and machine learning.

The specific operations I’m interested in are:

  • cosine similarity
  • dot product
  • L1 (taxicab) distance
  • L2 (Euclidean) distance

These are commonly used in nearest neighbor search.

The jdk.incubator.vector API provides a way to access hardware-level optimizations for processing vectors. The two hardware-level optimizations mentioned in JEP 338 are Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX).

Here’s my over-simplified understanding of these optimizations: the various processor vendors have agreed on a set of CPU instructions for operating directly on vectors. Just like they provide hardware-level instructions for adding two scalars, they now provide hardware-level instructions for adding two vectors. These optimized instructions have been accessible for many years in lower-level languages like C and C++. As part of Project Panama, the JDK has recently exposed an API to leverage these optimized instructions directly from JVM languages. This API is contained in the jdk.incubator.vector module.

Benchmark Setup

My benchmarks measure operations per second for five implementations of each of the four vector operations. I start with a simple baseline and work through four possible optimizations.

I implemented the benchmark in Java and Scala: the actual vector operations are in Java, and the benchmark harness is in Scala. I use the Java Microbenchmark Harness (JMH) framework, via the sbt-jmh plugin, to execute the benchmark. This is my first time using JMH in any serious capacity, so I’m happy to hear feedback about better or simpler ways to use it.

Each variation implements this Java interface:

public interface VectorOperations {
    // https://en.wikipedia.org/wiki/Cosine_similarity
    double cosineSimilarity(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Dot_product
    double dotProduct(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Taxicab_geometry
    double l1Distance(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Euclidean_distance
    double l2Distance(float[] v1, float[] v2);
}

I verify the correctness of these optimizations with tests that run the baseline implementation and the optimized implementation on a pair of random vectors and check that the results match.

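To make the parity check concrete, here is a minimal sketch of the idea in Java. The helper below is hypothetical and not taken from the repository; the real tests live there and may be structured differently. The key detail is that the comparison allows a small tolerance, since the optimized variants accumulate and round differently than the baseline.

static void checkParity(VectorOperations baseline, VectorOperations optimized) {
    java.util.Random rng = new java.util.Random(0);
    float[] v1 = new float[999], v2 = new float[999];
    for (int i = 0; i < v1.length; i++) {
        v1[i] = rng.nextFloat();
        v2[i] = rng.nextFloat();
    }
    assertClose(baseline.cosineSimilarity(v1, v2), optimized.cosineSimilarity(v1, v2));
    assertClose(baseline.dotProduct(v1, v2), optimized.dotProduct(v1, v2));
    assertClose(baseline.l1Distance(v1, v2), optimized.l1Distance(v1, v2));
    assertClose(baseline.l2Distance(v1, v2), optimized.l2Distance(v1, v2));
}

// Compare with a small relative tolerance rather than exact equality.
static void assertClose(double expected, double actual) {
    double tol = 1e-4 * Math.max(1.0, Math.abs(expected));
    if (Math.abs(expected - actual) > tol)
        throw new AssertionError(expected + " != " + actual);
}
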
All benchmarks operate on a pair of randomly-generated floating-point vectors containing 999 elements. I chose length 999 specifically so that the implementations have to deal with some additional complexity. This will make more sense later in the post.

All benchmarks run on Oracle JDK 19.0.2, installed via asdf: $ asdf install java oracle-19.0.2.

All benchmarks run on my 2018 Mac Mini, which has an Intel i7-8700B processor with SSE4.1, SSE4.2, and AVX2 instruction set extensions.

Finally, all code is available in my site-projects repository: jdk-incubator-vector-optimizations

Baseline Implementation

We start with a baseline implementation of these vector operations. No clever tricks here. This is what we get if we take the definition for each operation from Wikipedia and translate it verbatim into Java:

public class BaselineVectorOperations implements VectorOperations {

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dotProd += v1[i] * v2[i];
            v1SqrSum += Math.pow(v1[i], 2);
            v2SqrSum += Math.pow(v2[i], 2);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        float dotProd = 0f;
        for (int i = 0; i < v1.length; i++) dotProd += v1[i] * v2[i];
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumSqrDiff += Math.pow(v1[i] - v2[i], 2);
        return Math.sqrt(sumSqrDiff);
    }
}

This produces the following results:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s

Fused Multiply Add (Math.fma)

Next, we introduce an optimization based on the “Fused Multiply Add” operator, implemented by the Math.fma method, documented here.

Math.fma takes three floats, a, b, c, and executes a * b + c as a single operation. This has some advantages with respect to floating point error and performance. Basically, executing one operation is generally faster than two operations, and incurs only one rounding error.

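To make the single-rounding point concrete, here is a small example of my own (the values are chosen specifically so that the difference is visible; the behavior follows from the fused multiply-add semantics in IEEE 754):

float a = 1.0f + 0x1p-12f;      // 1 + 2^-12
float b = -(1.0f + 0x1p-11f);   // the negation of the float nearest to a * a
// Two operations: a * a rounds to 1 + 2^-11 before the addition, so the result is 0.0f.
float plain = a * a + b;
// One fused operation: a * a + b is computed exactly, then rounded once, giving 2^-24.
float fused = Math.fma(a, a, b);
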
I found a way to use Math.fma in all the vector operations except l1Distance.

The implementations look like this:

public class FmaVectorOperations implements VectorOperations {

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
            v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
            v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        float dotProd = 0f;
        for (int i = 0; i < v1.length; i++) dotProd = Math.fma(v1[i], v2[i], dotProd);
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        // Does not actually leverage Math.fma.
        double sumAbsDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0.0;
        float diff;
        for (int i = 0; i < v1.length; i++) {
            diff = v1[i] - v2[i];
            sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s

We see some improvement in two of the four cases:

  • cosineSimilarity is ~1.6x faster.
  • l2Distance is ~1.1x faster.
  • dotProduct remains about the same, maybe a little worse.
  • l1Distance does not leverage this optimization and predictably does not change much.

jdk.incubator.vector

Now we jump into optimizations using the jdk.incubator.vector module!

Crash course

I have to start with a crash course on this module. I also highly recommend reading the examples in JEP 338.

We are primarily using the jdk.incubator.vector.FloatVector class.

Given a pair of float[] arrays, the general pattern for using jdk.incubator.vector is as follows:

  • We iterate over strides (i.e., segments or chunks) of the two arrays.
  • At each iteration, we use the FloatVector.fromArray method to copy the current stride from each array into a FloatVector.
  • We call methods on the FloatVector instances to execute mathematical operations. For example, if we have FloatVectors fv1 and fv2, then fv1.mul(fv2).reduceLanes(VectorOperators.ADD) runs a pairwise multiplication and sums the results.

We also need to know about a helper class called jdk.incubator.vector.VectorSpecies, which does the following:

  • Defines the stride length used for vector operations.
  • Provides helper methods for iterating over the arrays in strides.
  • Is required to copy values from an array into a FloatVector.

At the time of writing, there are four species: SPECIES_64, SPECIES_128, SPECIES_256, and SPECIES_512, plus two aliases: SPECIES_MAX and SPECIES_PREFERRED. The numbers 64, 128, 256, and 512 refer to the number of bits in a FloatVector. A Java float uses 4 bytes, or 32 bits, so a vector with SPECIES_256 lets us operate on 256 / 32 = 8 floats in a single operation. I found it’s best to just stick with SPECIES_PREFERRED, which resolves to SPECIES_256 on my Mac Mini. Throughput can decrease drastically with a suboptimal VectorSpecies.

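As a minimal, self-contained sketch of the pattern described above (assuming the incubator module is enabled with --add-modules jdk.incubator.vector at compile and run time):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesDemo {
    public static void main(String[] args) {
        VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
        // Print the resolved species and its lane count, e.g. 8 lanes for SPECIES_256.
        System.out.println(species + ", lanes = " + species.length());

        // Process a single stride: copy it into FloatVectors, multiply pairwise, sum the lanes.
        float[] v1 = new float[species.length()];
        float[] v2 = new float[species.length()];
        java.util.Arrays.fill(v1, 2f);
        java.util.Arrays.fill(v2, 3f);
        FloatVector fv1 = FloatVector.fromArray(species, v1, 0);
        FloatVector fv2 = FloatVector.fromArray(species, v2, 0);
        System.out.println(fv1.mul(fv2).reduceLanes(VectorOperators.ADD)); // 6.0 * lane count
    }
}
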
Finally, we need to consider what to do when the array length is less than, or not a multiple of, the VectorSpecies length. If our source arrays have a length that is an exact multiple of the stride length, then we can iterate over the strides with no elements left over. Otherwise, we need to figure out what to do with the “tail” of elements that does not fill up a stride.

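For concreteness, here is how the arithmetic works out on my setup, with SPECIES_256 (8 float lanes) and length-999 arrays; loopBound is the helper that the later implementations use to find where the tail starts:

VectorSpecies<Float> species = FloatVector.SPECIES_256;  // 8 float lanes
int bound = species.loopBound(999);                      // 992, the largest multiple of 8 <= 999
// Full strides cover indices [0, 992); the "tail" is the remaining 7 elements in [992, 999).
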
The way we handle this tail can have some non-negligible performance impact. This is why I chose to benchmark with vectors of length 999. There are three options for dealing with the tail, and they are the subject of the following three sections.

VectorMask on Every Iteration

The first way to handle the tail is to avoid handling it at all: we use a VectorMask on every stride. We use the species to build the VectorMask, and then pass the mask when copying each stride into a FloatVector.

I refer to this as Jep338FullMask, and the implementations look like this:

public class Jep338FullMaskVectorOperations implements VectorOperations {

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        FloatVector fv1, fv2, fv3;
        for (int i = 0; i < v1.length; i+= species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s

The new implementation is actually significantly slower than the baseline.

Fortunately, the authors of JEP 338 mention this explicitly:

Since a mask is used in all iterations, the above implementation may not achieve optimal performance for large array lengths.

I haven’t looked extensively enough to understand why, but it seems like the VectorMask is either expensive to create, expensive to use, or both. Edit: The paper Java Vector API: Benchmarking and Performance Analysis discusses the performance of indexInRange, which is used to compute a VectorMask, in section 5.2. It turns out that indexInRange is only optimized on certain platforms and performs poorly on others.

Loop over the Tail

We can also handle the tail using a plain loop.

I refer to this as Jep338TailLoop, and the implementations look like this:

public class Jep338TailLoopVectorOperations implements VectorOperations {

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
            v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
            v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            sumAbsDiff += Math.abs(v1[i] - v2[i]);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2, fv3;
        for (; i < bound; i+= species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            float diff = v1[i] - v2[i];
            sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s

This time we have a significant improvement over the baseline:

  • cosineSimilarity is ~1.7x faster.
  • dotProduct is ~3.1x faster.
  • l1Distance is ~2.6x faster.
  • l2Distance is ~3x faster.

VectorMask on the Tail

What if we use a VectorMask, but only on the tail of the vector? Is that faster than using a loop on the tail?

I refer to this as Jep338TailMask, and the implementation looks like this:

public class Jep338TailMaskVectorOperations implements VectorOperations {

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2, fv3;
        for (; i < bound; i+= species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            fv3 = fv1.sub(fv2);
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s

Using a VectorMask on the tail is at best on par with the simple loop (for cosineSimilarity) and measurably slower for the other three operations.

Complete Benchmark Results

Here are the full results once more for comparison:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s

To summarize, the fastest approach for all four operations is Jep338TailLoop. This uses the jdk.incubator.vector API for all full strides and a plain loop for the tail. Compared to the baseline, this approach yields some substantial improvements:

  • cosineSimilarity is ~1.7x faster.
  • dotProduct is ~3.1x faster.
  • l1Distance is ~2.6x faster.
  • l2Distance is ~3x faster.

Takeaways

I’ll close with my takeaways from this benchmark.

If vector operations are a bottleneck in your application, and jdk.incubator.vector is available on your platform, then it’s worth a try. In my benchmarks, the speedup was anywhere from 1.7x to 3.1x.

When using jdk.incubator.vector, carefully consider and benchmark the usage of VectorMask. This abstraction seems quite expensive.

If jdk.incubator.vector is not available, then try using java.lang.Math.fma where possible. This still offers a noticeable speedup. There are also some other optimized methods in java.lang.Math that seem like they could be useful.

My only aesthetic complaint is that the API forces us to duplicate code to handle the vector tail. However, the API is still far simpler and far more readable than the analogous APIs I’ve seen in C and C++.

Overall, I’m quite impressed by the jdk.incubator.vector module, and I’m excited to see this opportunity for tighter integration between the hardware and JVM.

Appendix

Here are some follow-ups and related material added after the original post:

jdk.incubator.vector in Elastiknn

February 26, 2023

If you’d like to see an example of this in a real codebase, I incorporated jdk.incubator.vector as an optional optimization in Elastiknn in this pull request: alexklibisz/elastiknn #496.

This led to a speedup anywhere from 1.05x to 1.2x on the Elastiknn benchmarks.

FloatVector::pow is significantly slower than FloatVector::mul

February 26, 2023

While working on the benchmarks above, I found an interesting performance pitfall. Namely, given a FloatVector fv, fv.mul(fv) is 36x faster than fv.pow(2).

The benchmark:

import java.lang
import jdk.incubator.vector.{FloatVector, VectorSpecies}
import org.openjdk.jmh.annotations._
import scala.util.Random

@State(Scope.Benchmark)
class BenchPowVsMulFixtures {
  implicit private val rng: Random = new Random(0)
  val species: VectorSpecies[lang.Float] = FloatVector.SPECIES_PREFERRED
  val v: Array[Float] = (0 until species.length()).map(_ => rng.nextFloat()).toArray
  val fv: FloatVector = FloatVector.fromArray(species, v, 0)
}

class BenchPowVsMul {

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  @Warmup(time = 5, iterations = 3)
  @Measurement(time = 5, iterations = 6)
  def mul(f: BenchPowVsMulFixtures): Unit = f.fv.mul(f.fv)
  
  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  @Warmup(time = 5, iterations = 3)
  @Measurement(time = 5, iterations = 6)
  def pow(f: BenchPowVsMulFixtures): Unit = f.fv.pow(2)
}

The results:

Benchmark           Mode  Cnt           Score           Error  Units
BenchPowVsMul.mul  thrpt    6  1235649170.757 ± 216838871.439  ops/s
BenchPowVsMul.pow  thrpt    6    34105529.504 ±   4899654.049  ops/s

I have no idea why this would be, but it seems like it could be a bug in the underlying implementation.

Update (March 1, 2023): I messaged the panama dev mailing list and got an explanation for this:

The performance difference you observe is because the pow operation is falling back to scalar code (Math.pow on each lane element) and not using vector instructions. On x86 linux or windows you should observe better performance of the pow operation because it should leverage code from Intel’s Short Vector Math Library [1], but that code OS specific and is not currently ported on Mac OS.

View the full thread here.

Results on Apple Silicon (M1 Macbook Air)

February 26, 2023

I was curious how this would look on Apple silicon, so I also ran the benchmark on my 2020 M1 Macbook Air. Other than the host machine, the benchmark setup is identical to the results above.

Benchmark                              Mode  Cnt           Score         Error  Units
Bench.cosineSimilarityBaseline        thrpt    6     1081276.495 ±   39124.749  ops/s
Bench.cosineSimilarityFma             thrpt    6      836076.757 ±      47.184  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6     1050298.090 ±      81.960  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6     1532220.920 ±  179274.549  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6     1452240.849 ±    6500.704  ops/s

Bench.dotProductBaseline              thrpt    6     1136628.672 ±  180804.344  ops/s
Bench.dotProductFma                   thrpt    6      912200.315 ±    8839.657  ops/s
Bench.dotProductJep338FullMask        thrpt    6      272444.658 ±    1642.048  ops/s
Bench.dotProductJep338TailLoop        thrpt    6     4062575.031 ±    1541.393  ops/s
Bench.dotProductJep338TailMask        thrpt    6     3372980.017 ±    4655.095  ops/s

Bench.l1DistanceBaseline              thrpt    6     1134803.520 ±   22165.180  ops/s
Bench.l1DistanceFma                   thrpt    6     1146026.997 ±    2952.262  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6      271181.722 ±     416.756  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6     4062832.939 ±     249.915  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6     3362605.808 ±   20805.696  ops/s

Bench.l2DistanceBaseline              thrpt    6     1108095.885 ±    4237.677  ops/s
Bench.l2DistanceFma                   thrpt    6      860659.029 ±    8911.938  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6      269202.529 ±     326.229  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6     2026410.994 ±     201.837  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6     3273131.452 ±   11284.378  ops/s

Here are the Baseline measurements merged:

Chip     Benchmark                        Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityBaseline  thrpt    6   689586.585 ±   52359.955  ops/s
Apple M1 Bench.cosineSimilarityBaseline  thrpt    6  1081276.495 ±   39124.749  ops/s

Intel i7 Bench.dotProductBaseline        thrpt    6  1162553.117 ±   21328.673  ops/s
Apple M1 Bench.dotProductBaseline        thrpt    6  1136628.672 ±  180804.344  ops/s

Intel i7 Bench.l1DistanceBaseline        thrpt    6  1095704.529 ±   32440.308  ops/s
Apple M1 Bench.l1DistanceBaseline        thrpt    6  1134803.520 ±   22165.180  ops/s

Intel i7 Bench.l2DistanceBaseline        thrpt    6   951909.125 ±   23376.234  ops/s
Apple M1 Bench.l2DistanceBaseline        thrpt    6  1108095.885 ±    4237.677  ops/s

The cosineSimilarity baseline is noticeably faster on the M1. The others are comparable.

Here are the Jep338TailLoop measurements merged:

Chip     Benchmark                              Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1169365.506 ±    5940.850  ops/s
Apple M1 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1532220.920 ±  179274.549  ops/s

Intel i7 Bench.dotProductJep338TailLoop        thrpt    6  3317032.038 ±   19343.830  ops/s
Apple M1 Bench.dotProductJep338TailLoop        thrpt    6  4062575.031 ±    1541.393  ops/s

Intel i7 Bench.l1DistanceJep338TailLoop        thrpt    6  2816348.680 ±   35389.932  ops/s
Apple M1 Bench.l1DistanceJep338TailLoop        thrpt    6  4062832.939 ±     249.915  ops/s

Intel i7 Bench.l2DistanceJep338TailLoop        thrpt    6  2897756.618 ±   34180.451  ops/s
Apple M1 Bench.l2DistanceJep338TailLoop        thrpt    6  2026410.994 ±     201.837  ops/s

In this case, the M1 is faster on all but l2Distance. The M1’s error bounds are also impressively tight for dotProduct, l1Distance, and l2Distance.

June 7, 2023

I re-ran the benchmark on my M1 Max Macbook Pro, and the results were completely different! So please beware of benchmarking on the M-series chips. It’s an interesting endeavor, but also even more of a rabbit hole than benchmarking on x86.

Java Vector API: Benchmarking and Performance Analysis by Basso et al.

February 27, 2023

I discovered the paper Java Vector API: Benchmarking and Performance Analysis by Basso et al. shortly after releasing this post. It looks like they beat me to the release by about a week! This paper is an extremely thorough analysis of the topic, so I also highly recommend reading it.

In section 5.2, the authors discuss the negative performance effects of using indexInRange. This matches up well with what I observed in the Jep338FullMask optimization. It turns out that using indexInRange to build a VectorMask is only performant on systems supporting predicate registers. I guess this particular feature is not supported by either my Intel Mac Mini or my M1 Macbook Air.

In section 5.3, the authors discuss how .pow is far slower than .mul. This aligns with my findings, also discussed in the appendix of this post, although I observed a much larger difference when evaluating the two operations in isolation.

Bug in Jep338FullMaskVectorOperations

May 20, 2023

I had a small bug in my original implementation of Jep338FullMaskVectorOperations, which I fixed in this PR on 5/20/23 and updated the code in this post. Thanks to Twitter user Varun Thacker for finding it and proposing a fix.

Warning: Math.fma can be extremely slow on some platforms

In late May 2023, I noticed a Tweet from long-time Lucene contributor Uwe Schindler discussing how Lucene had also implemented vector optimizations based on the Panama Vector API. I replied with a link to this post and my implementation in Elastiknn. Uwe kindly responded with a warning about performance pitfalls of Math.fma.

I’m still exploring the effects of this in Elastiknn. I have some rough data indicating that it is in fact significantly slower on some platforms, so I’ll likely remove it. I figured it’s worth quickly mentioning here.
