Alex Klibisz

Accelerating vector operations on the JVM using the new jdk.incubator.vector module

2023-02-25T15:00:00+00:00

Introduction

In my work on Elastiknn, I’ve spent many hours looking for ways to optimize vector operations on the Java Virtual Machine (JVM).

The jdk.incubator.vector module, introduced in JDK 16 as part of JEP 338 and Project Panama, is the first opportunity I’ve encountered for significant performance improvements in this area.

I recently had some time to experiment with this new module, and I cover my benchmarks and findings in this post. Overall, I found improvements in operations per second on the order of 2x to 3x compared to simple baselines. If vector operations are a bottleneck in your application, I recommend you try out this module.

Background

For our purposes, a “vector” is simply an array of floats: float[] in Java, Array[Float] in Scala, etc. These are common in data science and machine learning.

The specific operations I’m interested in are:

cosine similarity of two vectors
dot product of two vectors
L1 distance (aka, Taxicab distance) between two vectors
L2 distance (aka, Euclidean distance) between two vectors

These are used commonly in nearest neighbor search.
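For reference, here are the standard textbook definitions of the four operations for two vectors $u$ and $v$ of length $n$. The symbols $u$, $v$, and $n$ are my own notation, not something taken from the benchmark code:

```latex
\begin{align*}
\mathrm{dot}(u, v) &= \sum_{i=1}^{n} u_i v_i \\
\mathrm{cos}(u, v) &= \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}} \\
\mathrm{L1}(u, v) &= \sum_{i=1}^{n} \lvert u_i - v_i \rvert \\
\mathrm{L2}(u, v) &= \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
\end{align*}
```

Every one of them is a single pass over the two arrays that accumulates a sum, which is the shape of computation that SIMD instructions are good at.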

The jdk.incubator.vector API provides a way to access hardware-level optimizations for processing vectors. The two hardware-level optimizations mentioned in JEP 338 are Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX).

Here’s my over-simplified understanding of these optimizations: the various processor vendors have agreed on a set of CPU instructions for operating directly on vectors. Just like they provide hardware-level instructions for adding two scalars, they now provide hardware-level instructions for adding two vectors. These optimized instructions have been accessible for many years in lower-level languages like C and C++. As part of Project Panama, the JDK has recently exposed an API to leverage these optimized instructions directly from JVM languages. This API is contained in the jdk.incubator.vector module.

Benchmark Setup

My benchmarks measure operations per second for five implementations of each of the four vector operations. I start with a simple baseline and work through four possible optimizations.

I implemented the benchmark in Java and Scala: the actual vector operations are in Java, and the benchmark harness is in Scala. I use the Java Microbenchmark Harness (JMH) framework, via the sbt-jmh plugin, to execute the benchmark. This is my first time using JMH in any serious capacity, so I’m happy to hear feedback about better or simpler ways to use it.

Each variation implements this Java interface:

public interface VectorOperations {
    // https://en.wikipedia.org/wiki/Cosine_similarity
    double cosineSimilarity(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Dot_product
    double dotProduct(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Taxicab_geometry
    double l1Distance(float[] v1, float[] v2);

    // https://en.wikipedia.org/wiki/Euclidean_distance
    double l2Distance(float[] v1, float[] v2);
}

I verify the correctness of these optimizations with tests that run the baseline operation and the optimized operation on a pair of random vectors and check that the results match.
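A parity check of this kind might look roughly like the sketch below. This is not the test code from the repository: the class name ParityCheck, the helper names, the seed, and the relative tolerance of 1e-5 are all assumptions made for illustration. A tolerance is needed because the implementations round at different points, so an exact equality check would be too strict.

```java
import java.util.Random;

// Hypothetical parity check between two dot product implementations.
class ParityCheck {

    // Baseline: the float product is rounded first, then added to the double sum.
    public static double dotBaseline(float[] v1, float[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++) sum += v1[i] * v2[i];
        return sum;
    }

    // Candidate: multiply and add are fused, so each step rounds only once.
    public static double dotFma(float[] v1, float[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++) sum = Math.fma(v1[i], v2[i], sum);
        return sum;
    }

    // Random vector with elements in [0, 1).
    public static float[] randomVector(Random rng, int length) {
        float[] v = new float[length];
        for (int i = 0; i < length; i++) v[i] = rng.nextFloat();
        return v;
    }

    // True if the two values agree within the given relative tolerance.
    public static boolean approxEqual(double expected, double actual, double relTol) {
        double scale = Math.max(Math.abs(expected), Math.abs(actual));
        return Math.abs(expected - actual) <= relTol * scale;
    }

    public static void main(String[] args) {
        Random rng = new Random(0); // fixed seed (an assumption) keeps the check reproducible
        float[] v1 = randomVector(rng, 999);
        float[] v2 = randomVector(rng, 999);
        double expected = dotBaseline(v1, v2);
        double actual = dotFma(v1, v2);
        if (!approxEqual(expected, actual, 1e-5)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
        System.out.println("parity ok");
    }
}
```

The same pattern extends to the other three operations by swapping in the corresponding pair of methods.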

All benchmarks operate on a pair of randomly-generated floating-point vectors containing 999 elements. I chose length 999 specifically to make us deal with some additional complexity in the implementation. This will make more sense later in the post.

All benchmarks run on Oracle JDK 19.0.2, installed via asdf: $ asdf install java oracle-19.0.2.

All benchmarks run on my 2018 Mac Mini, which has an Intel i7-8700B processor with SSE4.1, SSE4.2, and AVX2 instruction set extensions.

Finally, all code is available in my site-projects repository: jdk-incubator-vector-optimizations

Baseline Implementation

We start with a baseline implementation of these vector operations. No clever tricks here. This is what we get if we take the definition for each operation from Wikipedia and translate it verbatim into Java:

public class BaselineVectorOperations implements VectorOperations {

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dotProd += v1[i] * v2[i];
            v1SqrSum += Math.pow(v1[i], 2);
            v2SqrSum += Math.pow(v2[i], 2);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        float dotProd = 0f;
        for (int i = 0; i < v1.length; i++) dotProd += v1[i] * v2[i];
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumSqrDiff += Math.pow(v1[i] - v2[i], 2);
        return Math.sqrt(sumSqrDiff);
    }
}

This produces the following results:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s

Fused Multiply Add (`Math.fma`)

Next, we introduce an optimization based on the “Fused Multiply Add” operator, implemented by the Math.fma method, documented here.

Math.fma takes three floats, a, b, c, and executes a * b + c as a single operation. This has some advantages with respect to floating point error and performance. Basically, executing one operation is generally faster than two operations, and incurs only one rounding error.
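To make the "one rounding error" point concrete, here is a small sketch using the double overload of Math.fma. The class name FmaRounding and the specific inputs are my own choices for illustration. The inputs are picked so that the exact value of a * a + c is tiny: the separate multiply rounds that tiny part away, while the fused version keeps it.

```java
// Hypothetical demonstration of one rounding (fused) versus two roundings (unfused).
class FmaRounding {

    // Two roundings: a * a is rounded to a double, then the sum is rounded.
    public static double unfused(double a, double c) {
        return a * a + c;
    }

    // One rounding: the exact value of a * a + c is rounded once.
    public static double fused(double a, double c) {
        return Math.fma(a, a, c);
    }

    public static void main(String[] args) {
        double eps = Math.ulp(1.0);    // 2^-52, the spacing of doubles just above 1.0
        double a = 1.0 + eps;          // exact square is 1 + 2^-51 + 2^-104
        double c = -(1.0 + 2 * eps);   // cancels everything except the 2^-104 term
        System.out.println(unfused(a, c)); // 0.0, because 2^-104 was lost when a * a was rounded
        System.out.println(fused(a, c));   // 2^-104, roughly 4.93e-32
    }
}
```

In the vector operations the effect is far less dramatic, since the sums do not cancel like this, but it is the same mechanism.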

I found a way to use Math.fma in all the vector operations except l1Distance.

The implementations look like this:

public class FmaVectorOperations implements VectorOperations {

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
            v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
            v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        float dotProd = 0f;
        for (int i = 0; i < v1.length; i++) dotProd = Math.fma(v1[i], v2[i], dotProd);
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        // Does not actually leverage Math.fma.
        double sumAbsDiff = 0.0;
        for (int i = 0; i < v1.length; i++) sumAbsDiff += Math.abs(v1[i] - v2[i]);
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0.0;
        float diff;
        for (int i = 0; i < v1.length; i++) {
            diff = v1[i] - v2[i];
            sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s

We see some improvement in two of the four cases:

cosineSimilarity is ~1.6x faster.
l2Distance is ~1.2x faster.
dotProduct remains about the same, maybe a little worse.
l1Distance does not leverage this optimization and predictably does not change much.

jdk.incubator.vector

Now we jump into optimizations using the jdk.incubator.vector module!

Crash course

I have to start with a crash course on this module. I also highly recommend reading the examples in JEP 338.

We are primarily using the jdk.incubator.vector.FloatVector class.

Given a pair of float[] arrays, the general pattern for using jdk.incubator.vector is as follows:

We iterate over strides (i.e., segments or chunks) of the two arrays.
At each iteration, we use the FloatVector.fromArray method to copy the current stride from each array into a FloatVector.
We call methods on the FloatVector instances to execute mathematical operations. For example, if we have FloatVectors fv1 and fv2, then fv1.mul(fv2).reduceLanes(VectorOperators.ADD) runs a pairwise multiplication and sums the results.

We also need to know about a helper class called jdk.incubator.vector.VectorSpecies, which does the following:

Defines the stride length used for vector operations.
Provides helper methods for iterating over the arrays in strides.
Is required to copy values from an array and into a FloatVector.

At the time of writing, there are four species (SPECIES_64, SPECIES_128, SPECIES_256, and SPECIES_512) and two aliases (SPECIES_MAX and SPECIES_PREFERRED). The numbers 64, 128, 256, and 512 refer to the number of bits in a FloatVector. A Java float uses 4 bytes, or 32 bits, so a vector with SPECIES_256 lets us operate on 256 / 32 = 8 floats in a single operation. I found it’s best to just stick with SPECIES_PREFERRED, which defaults to SPECIES_256 on my Mac Mini. Throughput can actually decrease drastically with a suboptimal VectorSpecies.

Finally, we need to consider what to do when the array length is less than, or not a multiple of, the VectorSpecies length. If our source arrays have a length that’s equal to or a multiple of the stride length, then we can iterate over the strides with no elements left over. Otherwise, we need to figure out what to do with the “tail” of elements that did not fill up a stride.

The way we handle this tail can have some non-negligible performance impact. This is why I chose to benchmark with vectors of length 999. There are three options for dealing with the tail, and they are the subject of the following three sections.
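The arithmetic behind the tail can be sketched in plain Java, without the vector module. The class TailArithmetic and its method names are hypothetical. The loopBound helper here only mirrors what I understand VectorSpecies.loopBound to return, namely the largest multiple of the lane count that is less than or equal to the array length.

```java
// Hypothetical helper that works out strides and tail sizes for float vectors.
class TailArithmetic {

    // Number of float lanes in a vector of the given bit width, e.g. 256 / 32 = 8.
    public static int lanes(int speciesBits) {
        return speciesBits / Float.SIZE;
    }

    // Largest multiple of `lanes` that is <= length.
    public static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    public static void main(String[] args) {
        int length = 999;
        for (int bits : new int[] {64, 128, 256, 512}) {
            int n = lanes(bits);
            int bound = loopBound(length, n);
            // Elements in [0, bound) fill whole strides; elements in [bound, length) are the tail.
            System.out.printf("SPECIES_%d: %d lanes, %d full strides, tail of %d%n",
                    bits, n, bound / n, length - bound);
        }
    }
}
```

For SPECIES_256 and length 999 this works out to 8 lanes, 124 full strides covering 992 elements, and a tail of 7 elements. A length of 1000 would leave no tail at all, which is why 999 is the more interesting choice.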

VectorMask on every Iteration

The first way to handle the tail is to avoid handling it by using a VectorMask on every stride. We use the species to define the VectorMask, and then pass through the mask when creating the FloatVector.

I refer to this as Jep338FullMask, and the implementations look like this:

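import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338FullMaskVectorOperations implements VectorOperations {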

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        FloatVector fv1, fv2;
        for (int i = 0; i < v1.length; i += species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        FloatVector fv1, fv2, fv3;
        for (int i = 0; i < v1.length; i+= species.length()) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s

The new implementation is actually significantly slower than the baseline.

Fortunately, the authors of JEP 338 mention this explicitly:

Since a mask is used in all iterations, the above implementation may not achieve optimal performance for large array lengths.

~~I haven’t looked extensively enough to understand why, but it seems like the VectorMask is either expensive to create, expensive to use, or maybe both.~~ Edit: The paper Java Vector API: Benchmarking and Performance Analysis discusses the performance of indexInRange, which is used to compute a VectorMask, in section 5.2. It turns out that indexInRange is only optimized on certain platforms and performs poorly on others.
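To make the semantics of the mask concrete, here is a plain-Java emulation of what indexInRange and a masked fromArray compute, as I understand them. It does not call jdk.incubator.vector at all, the class and method names are hypothetical, and it says nothing about the cost of the real operations. It only shows why a masked load is safe on the last stride: lanes past the end of the array are filled with zero, which contributes nothing to a sum.

```java
import java.util.Arrays;

// Hypothetical scalar emulation of mask creation and a masked load.
class MaskEmulation {

    // Lane j is set when the index offset + j falls in [0, limit).
    public static boolean[] indexInRange(int offset, int limit, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int j = 0; j < lanes; j++) {
            int idx = offset + j;
            mask[j] = idx >= 0 && idx < limit;
        }
        return mask;
    }

    // Set lanes read from the array; unset lanes get 0 and never touch the array.
    public static float[] maskedLoad(float[] a, int offset, boolean[] mask) {
        float[] out = new float[mask.length];
        for (int j = 0; j < mask.length; j++) {
            out[j] = mask[j] ? a[offset + j] : 0f;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] v = new float[999];
        Arrays.fill(v, 1f);
        // Last stride of a 999-element array with 8 lanes starts at index 992.
        boolean[] mask = indexInRange(992, v.length, 8);
        float[] lanes = maskedLoad(v, 992, mask);
        System.out.println(Arrays.toString(mask));  // seven true values, then false
        System.out.println(Arrays.toString(lanes)); // seven 1.0 values, then 0.0
    }
}
```

In the Jep338FullMask implementation this mask is recomputed on every stride, even though it only differs from "all lanes set" on the final one.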

Loop over the Tail

We can also handle the tail using a plain loop.

I refer to this as Jep338TailLoop, and the implementations look like this:

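import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338TailLoopVectorOperations implements VectorOperations {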

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
            v1SqrSum = Math.fma(v1[i], v1[i], v1SqrSum);
            v2SqrSum = Math.fma(v2[i], v2[i], v2SqrSum);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            dotProd = Math.fma(v1[i], v2[i], dotProd);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            sumAbsDiff += Math.abs(v1[i] - v2[i]);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2, fv3;
        for (; i < bound; i+= species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v1.length; i++) {
            float diff = v1[i] - v2[i];
            sumSqrDiff = Math.fma(diff, diff, sumSqrDiff);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s

This time we have a significant improvement over the baseline:

cosineSimilarity is ~1.7x faster.
dotProduct is ~2.9x faster.
l1Distance is ~2.6x faster.
l2Distance is ~3x faster.

VectorMask on the Tail

What if we use a VectorMask, but only on the tail of the vector? Is that faster than using a loop on the tail?

I refer to this as Jep338TailMask, and the implementation looks like this:

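import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Jep338TailMaskVectorOperations implements VectorOperations {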

    private final VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;

    public double cosineSimilarity(float[] v1, float[] v2) {
        double dotProd = 0.0;
        double v1SqrSum = 0.0;
        double v2SqrSum = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
            v1SqrSum += fv1.mul(fv1).reduceLanes(VectorOperators.ADD);
            v2SqrSum += fv2.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
    }

    public double dotProduct(float[] v1, float[] v2) {
        double dotProd = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            dotProd += fv1.mul(fv2).reduceLanes(VectorOperators.ADD);
        }
        return dotProd;
    }

    public double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2;
        for (; i < bound; i += species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            sumAbsDiff += fv1.sub(fv2).abs().reduceLanes(VectorOperators.ADD);
        }
        return sumAbsDiff;
    }

    public double l2Distance(float[] v1, float[] v2) {
        double sumSqrDiff = 0f;
        int i = 0;
        int bound = species.loopBound(v1.length);
        FloatVector fv1, fv2, fv3;
        for (; i < bound; i+= species.length()) {
            fv1 = FloatVector.fromArray(species, v1, i);
            fv2 = FloatVector.fromArray(species, v2, i);
            fv3 = fv1.sub(fv2);
            // For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        if (i < v1.length) {
            VectorMask<Float> m = species.indexInRange(i, v1.length);
            fv1 = FloatVector.fromArray(species, v1, i, m);
            fv2 = FloatVector.fromArray(species, v2, i, m);
            fv3 = fv1.sub(fv2);
            sumSqrDiff += fv3.mul(fv3).reduceLanes(VectorOperators.ADD);
        }
        return Math.sqrt(sumSqrDiff);
    }
}

The results are:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s

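Using a VectorMask on the tail is never faster than using a simple loop: cosineSimilarity is about the same (the difference is within the measurement error), and the other three operations are slower.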

Complete Benchmark Results

Here are the full results once more for comparison:

Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s

To summarize, the fastest approach for all operations is Jep338TailLoop. This uses the jdk.incubator.vector API for all strides until the tail of the vectors, and then uses a loop to handle the tails. Compared to the baseline, this approach yields some substantial improvements:

cosineSimilarity is ~1.7x faster.
dotProduct is ~2.9x faster.
l1Distance is ~2.6x faster.
l2Distance is ~3x faster.
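The speedups are just the ratio of the optimized score to the baseline score. A minimal sketch of that arithmetic, using the Intel scores from the table above (the class name Speedups is hypothetical):

```java
// Hypothetical helper that turns ops/s scores into speedup ratios.
class Speedups {

    // Speedup of an optimized implementation relative to a baseline, both in ops/s.
    public static double speedup(double baselineOps, double optimizedOps) {
        return optimizedOps / baselineOps;
    }

    public static void main(String[] args) {
        String[] names = {"cosineSimilarity", "dotProduct", "l1Distance", "l2Distance"};
        // Scores copied from the Baseline and Jep338TailLoop rows of the results table.
        double[] baseline = {689586.585, 1162553.117, 1095704.529, 951909.125};
        double[] tailLoop = {1169365.506, 3317032.038, 2816348.680, 2897756.618};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.2fx%n", names[i], speedup(baseline[i], tailLoop[i]));
        }
    }
}
```

These ratios ignore the error column, so small differences between them should not be over-interpreted.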

Takeaways

I’ll close with my takeaways from this benchmark.

If vector operations are a bottleneck in your application, and jdk.incubator.vector is available on your platform, then it’s worth a try. In my benchmarks, the speedup ranged from 1.7x to roughly 3x.

When using jdk.incubator.vector, carefully consider and benchmark the usage of VectorMask. This abstraction seems quite expensive.

If jdk.incubator.vector is not available, then try using java.lang.Math.fma where possible. This still offers a noticeable speedup. There are also some other optimized methods in java.lang.Math that seem like they could be useful.

My only aesthetic complaint is that the API forces us to duplicate code to handle the vector tail. However, the API is still far simpler and far more readable than the analogous APIs I’ve seen in C and C++.

Overall, I’m quite impressed by the jdk.incubator.vector module, and I’m excited to see this opportunity for tighter integration between the hardware and JVM.

Appendix

Here is some related material that I found useful:

Limiting Factors in a Dot Product Calculation by Richard Startin, July 2018.
This Twitter thread about jdk.incubator.vector by Denis Makogon, September 2022.
Java Vector API: Benchmarking and Performance Analysis by Basso et al., February 2023.
CS494 Lecture Notes - Some simple SIMD examples, Dr. James Plank, November 2019.

`jdk.incubator.vector` in Elastiknn

February 26, 2023

If you’d like to see an example of this in a real codebase, I incorporated jdk.incubator.vector as an optional optimization in Elastiknn in this pull request: alexklibisz/elastiknn #496.

This led to a speedup anywhere from 1.05x to 1.2x on the Elastiknn benchmarks.

FloatVector::pow is significantly slower than FloatVector::mul

February 26, 2023

While working on the benchmarks above, I found an interesting performance pitfall. Namely, given a FloatVector fv, fv.mul(fv) is 36x faster than fv.pow(2).

The benchmark:

@State(Scope.Benchmark)
class BenchPowVsMulFixtures {
  implicit private val rng: Random = new Random(0)
  val species: VectorSpecies[lang.Float] = FloatVector.SPECIES_PREFERRED
  val v: Array[Float] = (0 until species.length()).map(_ => rng.nextFloat()).toArray
  val fv: FloatVector = FloatVector.fromArray(species, v, 0)
}

class BenchPowVsMul {

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  @Warmup(time = 5, iterations = 3)
  @Measurement(time = 5, iterations = 6)
  def mul(f: BenchPowVsMulFixtures): Unit = f.fv.mul(f.fv)
  
  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  @Warmup(time = 5, iterations = 3)
  @Measurement(time = 5, iterations = 6)
  def pow(f: BenchPowVsMulFixtures): Unit = f.fv.pow(2)
}

The results:

Benchmark           Mode  Cnt           Score           Error  Units
BenchPowVsMul.mul  thrpt    6  1235649170.757 ± 216838871.439  ops/s
BenchPowVsMul.pow  thrpt    6    34105529.504 ±   4899654.049  ops/s

~~I have no idea why this would be, but it seems like it could be a bug in the underlying implementation.~~

Update (March 1, 2023): I messaged the panama dev mailing list and got an explanation for this:

The performance difference you observe is because the pow operation is falling back to scalar code (Math.pow on each lane element) and not using vector instructions. On x86 linux or windows you should observe better performance of the pow operation because it should leverage code from Intel’s Short Vector Math Library [1], but that code OS specific and is not currently ported on Mac OS.

View the full thread here.

Results on Apple Silicon (M1 MacBook Air)

February 26, 2023

I was curious how this would look on Apple silicon, so I also ran the benchmark on my 2020 M1 MacBook Air. Other than the host machine, the benchmark setup is identical to the results above.

Benchmark                              Mode  Cnt           Score         Error  Units
Bench.cosineSimilarityBaseline        thrpt    6     1081276.495 ±   39124.749  ops/s
Bench.cosineSimilarityFma             thrpt    6      836076.757 ±      47.184  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6     1050298.090 ±      81.960  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6     1532220.920 ±  179274.549  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6     1452240.849 ±    6500.704  ops/s

Bench.dotProductBaseline              thrpt    6     1136628.672 ±  180804.344  ops/s
Bench.dotProductFma                   thrpt    6      912200.315 ±    8839.657  ops/s
Bench.dotProductJep338FullMask        thrpt    6      272444.658 ±    1642.048  ops/s
Bench.dotProductJep338TailLoop        thrpt    6     4062575.031 ±    1541.393  ops/s
Bench.dotProductJep338TailMask        thrpt    6     3372980.017 ±    4655.095  ops/s

Bench.l1DistanceBaseline              thrpt    6     1134803.520 ±   22165.180  ops/s
Bench.l1DistanceFma                   thrpt    6     1146026.997 ±    2952.262  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6      271181.722 ±     416.756  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6     4062832.939 ±     249.915  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6     3362605.808 ±   20805.696  ops/s

Bench.l2DistanceBaseline              thrpt    6     1108095.885 ±    4237.677  ops/s
Bench.l2DistanceFma                   thrpt    6      860659.029 ±    8911.938  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6      269202.529 ±     326.229  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6     2026410.994 ±     201.837  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6     3273131.452 ±   11284.378  ops/s

Here are the Baseline measurements merged:

Chip     Benchmark                        Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityBaseline  thrpt    6   689586.585 ±   52359.955  ops/s
Apple M1 Bench.cosineSimilarityBaseline  thrpt    6  1081276.495 ±   39124.749  ops/s

Intel i7 Bench.dotProductBaseline        thrpt    6  1162553.117 ±   21328.673  ops/s
Apple M1 Bench.dotProductBaseline        thrpt    6  1136628.672 ±  180804.344  ops/s

Intel i7 Bench.l1DistanceBaseline        thrpt    6  1095704.529 ±   32440.308  ops/s
Apple M1 Bench.l1DistanceBaseline        thrpt    6  1134803.520 ±   22165.180  ops/s

Intel i7 Bench.l2DistanceBaseline        thrpt    6   951909.125 ±   23376.234  ops/s
Apple M1 Bench.l2DistanceBaseline        thrpt    6  1108095.885 ±    4237.677  ops/s

The cosineSimilarity baseline is noticeably faster on the M1 (about 1.6x). The l2Distance baseline is about 16% faster, and the other two are comparable.

Here are the Jep338TailLoop measurements merged:

Chip     Benchmark                              Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1169365.506 ±    5940.850  ops/s
Apple M1 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1532220.920 ±  179274.549  ops/s

Intel i7 Bench.dotProductJep338TailLoop        thrpt    6  3317032.038 ±   19343.830  ops/s
Apple M1 Bench.dotProductJep338TailLoop        thrpt    6  4062575.031 ±    1541.393  ops/s

Intel i7 Bench.l1DistanceJep338TailLoop        thrpt    6  2816348.680 ±   35389.932  ops/s
Apple M1 Bench.l1DistanceJep338TailLoop        thrpt    6  4062832.939 ±     249.915  ops/s

Intel i7 Bench.l2DistanceJep338TailLoop        thrpt    6  2897756.618 ±   34180.451  ops/s
Apple M1 Bench.l2DistanceJep338TailLoop        thrpt    6  2026410.994 ±     201.837  ops/s

In this case, the M1 is faster in all but the l2Distance. The M1’s error bounds are also impressively tight on dotProduct, l1Distance, and l2Distance, though noticeably wide on cosineSimilarity.

June 7, 2023

I re-ran the benchmark on my M1 Max Macbook Pro, and the results were completely different! So please beware of benchmarking on the M-series chips. It’s an interesting endeavor, but also even more of a rabbit hole than benchmarking on x86.

Java Vector API: Benchmarking and Performance Analysis by Basso et al.

February 27, 2023

I discovered the paper Java Vector API: Benchmarking and Performance Analysis by Basso et al. shortly after releasing this post. It looks like they beat me to the release by about a week! This paper is an extremely thorough analysis of the topic, so I also highly recommend reading it.

In section 5.2, the authors discuss the negative performance effects of using indexInRange. This matches up well with what I observed in the Jep338FullMask optimization. It turns out that using indexInRange to build a VectorMask is only performant on systems supporting predicate registers. I guess this particular feature is not supported by my Intel Mac Mini nor by my M1 Macbook Air.

In section 5.3, the authors discuss how .pow is far slower than .mul. This aligns with my findings, which are also discussed in the appendix of this post, although I observed a much larger difference when evaluating the two operations in isolation.

Bug in Jep338FullMaskVectorOperations

May 20, 2023

I had a small bug in my original implementation of Jep338FullMaskVectorOperations. I fixed it in this PR on 5/20/23 and updated the code in this post. Thanks to Twitter user Varun Thacker for finding it and proposing a fix.

Warning: Math.fma can be extremely slow on some platforms

In late May 2023, I noticed a Tweet from long-time Lucene contributor Uwe Schindler discussing how Lucene had also implemented vector optimizations based on the Panama Vector API. I replied with a link to this post and my implementation in Elastiknn. Uwe kindly responded with a warning about performance pitfalls of Math.fma.

I’m still exploring the effects of this in Elastiknn. I have some rough data indicating that it is in fact significantly slower on some platforms, so I’ll likely remove it. I figured it’s worth quickly mentioning here.

Are Postgres functions faster than queries? (a very simple benchmark)

2022-12-18T15:00:00+00:00

Introduction

In my work with Postgres and other databases, I’ve often heard the statement “stored procedures (i.e., functions in Postgres) are faster than queries.”

To be precise, “stored procedure” refers to a Postgres function, defined using the create function command, and executed by passing a string like select * from some_function(...) from the client to the database server. “Query” refers to a standard SQL query (e.g., select * from some_table where id = 10), defined by the client, and executed by passing the literal query string from the client to the database server.

I’m willing to accept that a function is faster than a query if the function avoids passing intermediate results back to the client. There’s an obvious cost to sending bytes over the network, so, all else equal, a function that keeps intermediate results in the database is going to execute faster than multiple queries that pass intermediate results to the client.

However, that’s not really what I’m after. I’m more interested in answering this question:

Assuming the function and query are executing the same underlying statements and passing the same data back to the client, is the function faster than the query?

In this post, I take a first pass at answering this question based on a very simple benchmark.

My takeaways from _Effective Software Testing_ by Maurício Aniche

2022-06-19T15:00:00+00:00

Introduction

I recently finished reading Effective Software Testing by Maurício Aniche as part of a weekly book club with some teammates at work. This post summarizes my biggest takeaways from the book. I also took the liberty to mix in my own related musings about software testing.

My Background

It might be useful to quickly mention my background, as it affects my opinion of the book:

I work primarily on cloud services for energy systems, using a combination of Scala, Akka, Postgres, Kafka, Kubernetes, and a long tail of other technologies.
The cloud services I work on exist in large part to model and interact with hardware, but I have very little firmware experience. So I can’t say much about firmware testing, other than I do know enough to appreciate it can be a different beast.
I enjoy writing automated unit and integration tests. I value the peace of mind it gives me about the services I build and maintain, especially as they evolve. I also enjoy the challenge of finding a way to test and verify something particularly tricky.

Overall Impressions

First, I recommend this book to any professional software engineer.

To summarize it in one statement: the book provides thorough coverage of testing techniques, with practical tips and examples for practitioners, based on a foundation of research and experience.

I found the book gave precise terms for some of the best practices I’ve learned from experience. For example, I had an intuitive understanding that I can efficiently test a complex logical expression by perturbing each parameter such that it affects the result. The book taught me that this idea has a name: modified condition/decision coverage.
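To make the idea of perturbing each parameter concrete, here is a minimal Python sketch. The decision `can_deploy` and its three parameters are hypothetical, invented for this illustration and not taken from the book. For each parameter, the sketch searches for a pair of inputs that differ only in that parameter and produce different results.

```python
from itertools import product


def can_deploy(tests_pass: bool, approved: bool, is_hotfix: bool) -> bool:
    # Hypothetical decision under test.
    return tests_pass and (approved or is_hotfix)


def independence_pairs(decision, n_params):
    """For each parameter, find one pair of inputs that differ only in
    that parameter and flip the result of the decision."""
    pairs = {}
    for i in range(n_params):
        for inputs in product([False, True], repeat=n_params):
            flipped = list(inputs)
            flipped[i] = not flipped[i]
            flipped = tuple(flipped)
            if decision(*inputs) != decision(*flipped):
                pairs[i] = (inputs, flipped)
                break
    return pairs


pairs = independence_pairs(can_deploy, 3)
# The distinct inputs across all pairs form the candidate test cases.
test_cases = sorted({case for pair in pairs.values() for case in pair})
```

For this particular decision, the pairs overlap, so `test_cases` ends up with 4 of the 8 possible input combinations.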

I especially recommend the book to those who dread testing. It presents a first-principles motivation for testing and a set of reliable approaches and tools for effective testing.

The book uses Java for all examples, so it’s particularly valuable for engineers already in that ecosystem, but it’s also accessible to those working in other languages. Having worked primarily in Scala and Python, I can say the techniques translate to those languages, too.

For those in a rush, I recommend reading the introduction and summary of each chapter, and then reading chapters 2, 6, and 7 (specification-based testing, test doubles and mocks, and designing for testability) as they were particularly information-rich.

Takeaways

Test effectively and systematically

In chapter 1, the author establishes that we should focus on effective and systematic testing, and presents methods for both throughout the book.

How do we know we’re testing effectively?

I find it helpful to consider the antithesis to effective testing: ad-hoc testing.

When we practice ad-hoc testing, we implement a feature and then toss in a few test cases just before submitting it for review. We test whatever comes to mind – maybe an example from the Jira or Github ticket, maybe something we copy-paste from another test. Over time, we end up with a hodgepodge of miscellaneous tests that people happened to think of. It’s bloated, difficult to evolve, and leaves us with either a feeling of pointlessness or a false confidence in the correctness based only on the quantity of tests. I’ve found this is also what leaves engineers with a dislike for testing.

In contrast, effective testing is the process by which we arrive at a set of tests that verify the correctness of an implementation, are maintainable, and yield an efficient ratio of time spent testing to number of bugs prevented.

Another way to think about it is to consider the information or signal-to-noise ratio of each test. Effective testing reminds me of the 20 questions game I played as a kid. We get to ask our software 20 questions. Based on the answers, we choose to deploy to production. We better pick questions with a high signal-to-noise ratio!

How do we know we’re testing systematically?

One of my favorite heuristics presented in the book is that two engineers from the same team, given the same requirements, working in the same codebase, should arrive at the same test suite.

We rarely get a chance to run this experiment, but we can also consider our reaction to tests in code review. If we find it hard to follow tests, or find a teammate’s tests arbitrary or surprising, it might mean the team needs to make its testing practices more systematic.

Exhaustive testing is intractable, but that’s no excuse

In chapter 1, the author establishes that exhaustively testing all possible paths through any interesting piece of software is an intractable problem.

As a simple example, the author presents a system with N boolean configurations. This system requires 2^N test cases for exhaustive coverage. At N=266 we exceed the number of atoms in the visible universe.
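The arithmetic is easy to check. This sketch assumes roughly 10^80 atoms in the visible universe, which is a commonly cited estimate and not a figure stated in this post.

```python
# Assumption: about 10**80 atoms in the visible universe (a commonly cited estimate).
ATOMS_ESTIMATE = 10**80


def exhaustive_cases(n_flags: int) -> int:
    # Each boolean configuration doubles the number of combinations.
    return 2**n_flags


# Smallest N for which exhaustive testing needs more cases than the estimate.
smallest_n = next(n for n in range(1, 1000) if exhaustive_cases(n) > ATOMS_ESTIMATE)
```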

So we should accept we can’t test exhaustively. Does this make testing pointless? No.

The author shows us how we can write fewer, but more effective tests through a combination of techniques: partitioning, boundary analysis, pruning unrealistic test cases, and the kind of creativity that results from thoroughly understanding the domain.

If we understand how inputs affect behavior, the boundaries at which inputs change behavior, and which inputs are nonsensical, we can prune the explosion of test cases down to a handful that actually matter. The author walks through an example of this type of pruning in chapter 2.

A brief musing: I find it interesting that, in some cases, property-based testing (sometimes called randomized testing) is an antidote to the intractability of exhaustive testing. As a trivial example, imagine we’re testing a method that multiplies two positive integers. For any pair of inputs, we can verify the method’s output using addition. If we use new randomized inputs every time we run the test suite, we can, in the limit, explore and verify the entire input space. I particularly enjoyed this talk about randomized testing in Apache Lucene and Solr.
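Here is a minimal sketch of the multiplication example in Python, using only the standard library. The names `multiply`, `multiply_by_addition`, and `check_multiply` are hypothetical. A dedicated property-based testing library would add features like input shrinking, which this sketch omits.

```python
import random


def multiply(a: int, b: int) -> int:
    # Hypothetical method under test.
    return a * b


def multiply_by_addition(a: int, b: int) -> int:
    # Independent oracle: add `a` to itself `b` times.
    total = 0
    for _ in range(b):
        total += a
    return total


def check_multiply(trials: int = 200, rng=None) -> int:
    # Unseeded by default, so every run explores new inputs.
    rng = rng or random.Random()
    for _ in range(trials):
        a = rng.randint(1, 10_000)
        b = rng.randint(1, 1_000)
        # Include the inputs in the failure message so a failure is reproducible.
        assert multiply(a, b) == multiply_by_addition(a, b), (a, b)
    return trials
```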

Code coverage is a tool, not a goal

The author introduces code coverage in chapter 3.

The rough idea of code coverage is that we can run a tool to identify parts of our software (classes, methods, branches, etc.) which are untested by our test suite. We can review the untested parts and adapt or add tests to cover them.

Like any simple metric, code coverage can be abused. If we consider coverage as the ultimate goal, we risk wasting time on pointless tests and arriving at a false sense of confidence about the correctness of our software.

However, as the author argues in chapter 3, we can still use code coverage to offload the cognitive burden of identifying the parts of software we might have forgotten to test.

I use a code coverage tool anytime I’m implementing or refactoring a substantial feature. I periodically run the tool and examine the outputs to catch any blind spots. When I find untested areas that are particularly risky or interesting, I revise or add test cases to cover them.

In some situations it’s fine to omit a test case. For example, I generally trust the correctness of a language’s standard library or an established open-source library.

At times, I’ve also integrated code coverage as a requirement in continuous integration. I still have reservations about this, as some tools can be flaky or misleading. At the very least, code coverage in CI is a good way to set a lower-bound “safety net” for coverage.

Mutation testing seems like a powerful complement to code coverage

The author briefly introduces mutation testing in chapter 3.

The rough idea of mutation testing is that we can automatically generate mutations of our codebase and run the test suite against each mutation. A mutation can be as simple as flipping < to > or == to !=. If our tests are effective, then at least one test should fail for each mutation. If no tests fail, then we should consider adding a test case to cover the mutation.

This seems like a particularly powerful complement to code coverage. It’s easy to get perfect code coverage with mediocre tests: just call each method and make an inconsequential assertion. It seems more difficult to get perfect code coverage and mutation coverage with mediocre tests.
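As a small illustration, here is a hand-written mutation in Python. Real mutation testing tools generate mutants automatically. The function `is_adult` and both test suites are hypothetical.

```python
def is_adult(age: int) -> bool:
    # Original implementation (hypothetical).
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    # Mutant: `>=` flipped to `>`.
    return age > 18


def weak_suite(fn) -> bool:
    # Executes every line of `fn`, but never probes the boundary.
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    # Adds the boundary cases around age 18.
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def kills_mutant(suite) -> bool:
    # A suite kills the mutant if it passes on the original and fails on the mutant.
    return suite(is_adult) and not suite(is_adult_mutant)
```

Both suites reach full line coverage of `is_adult`, but only `strong_suite` fails when run against the mutant.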

I’ve experimented with some libraries in the Stryker Mutator ecosystem for toy projects, but still need to run them on something more interesting.

The role of specification-based testing and structural testing

The author covers specification-based testing and structural testing in chapters 2 and 3.

Prior to reading the book, I had an intuitive understanding of specification-based testing but zero knowledge of structural testing as a distinct concept.

Here’s how my current understanding looks.

In specification-based testing, we start with a non-code specification and implement tests to demonstrate our system adheres to the specification. These tests usually sound something like, “some HTTP call returns a 401 when the caller’s token has expired.” With a sufficiently literate testing framework, specification-based tests can be read by a non-technical teammate and impart confidence that a feature is implemented as specified.

Structural tests, on the other hand, largely ignore the specification and tailor specifically to the implementation details. Maybe there’s something particularly interesting about the way we choose to return a 401 response vs. a 403 response, but it’s never stated explicitly in the specification. Structural tests are a good place to exercise and document this kind of detail.

At the risk of painting in broad strokes, maybe we can distinguish the tests by audience? Specification-based tests are for the product owners who provided the specification, and structural tests are for the other engineers who will continue to maintain and evolve the implementation.

Language design matters

The author covers contract design in chapter 4.

This gets into the details of input validation, pre-conditions, post-conditions, exceptions vs. return values, and so on.

Seeing the concrete implementations of these concepts in Java made me particularly grateful for the Scala language and how its primitives simplify contract design. Some quick examples:

Scala’s case classes are far less verbose than Java POJOs. This makes it simple to quickly define and provide an informative return type or a test data type.
Scala has native Try and Either constructs. These make it simple to provide known errors as return values instead of throwing exceptions.
Scala has had Option since day one. This means null is virtually non-existent (though, sadly, technically still legal) in Scala code.
Scala can verify a pattern-match is exhaustive at compile-time.
Scala has libraries that encode invariants in a type. For example, NonEmptyList and NonEmptySet in the cats library.

So, even though I agree with the author that “Correct by Design” is a myth (chapter 1), certain languages offer primitives that get us closer to this mythical goal.

Stubs vs. mocks and testing state vs. interaction

The author covers dummies, fakes, stubs, mocks, spies, and fixtures in chapter 6.

I found the distinction between stubs and mocks particularly useful.

In summary, a stub is an object that adheres to some API with a simplistic implementation (e.g., methods return hard-coded data). A mock does the same thing, but also provides mechanisms to verify how it was used (e.g., the number of times a specific method was called).

An analogous distinction is that of state testing vs. interaction testing. My understanding is that state testing lets us verify the ultimate state (the result) of some method, whereas interaction testing lets us verify the specific interactions (of classes, methods, etc.) which led to this state. Stubs facilitate state testing, whereas mocks facilitate interaction testing.

In most cases we really only care about the ultimate state, so stubs are sufficient. In other cases, we might actually care that a specific method was invoked with a specific input a specific number of times, etc. In these cases, we need mocks.
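Here is a minimal sketch of the difference in Python, using `unittest.mock` from the standard library. The `convert` function and the rate source are hypothetical examples, not taken from the book.

```python
from unittest.mock import Mock


class StubRateSource:
    # Stub: adheres to the API and returns hard-coded data.
    def rate(self, currency: str) -> float:
        return 2.0


def convert(amount: float, currency: str, source) -> float:
    # Hypothetical code under test.
    return amount * source.rate(currency)


# State testing with a stub: we only look at the result.
state_result = convert(10.0, "EUR", StubRateSource())

# Interaction testing with a mock: we also verify how the dependency was used.
mock_source = Mock()
mock_source.rate.return_value = 2.0
interaction_result = convert(10.0, "EUR", mock_source)
mock_source.rate.assert_called_once_with("EUR")
```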

I think I’ve only ever needed to verify interactions for very performance-sensitive code. Even then, a broader benchmarking harness was more useful for catching performance regressions.

Writing testable code is important and dependency injection helps

The author covers techniques for writing testable code in chapter 7.

The fundamental argument for writing testable code is simple: if code is difficult to test, we won’t test it, and we’ll inevitably introduce bugs. I agree with this, both in principle and from experience.

One pattern seems particularly useful for writing testable code: dependency injection.

The core idea of dependency injection (DI) is pretty simple. Define an abstract interface for each area of our domain and for each external dependency. Implement a concrete implementation for each interface. Critically, don’t let the concrete implementations communicate directly with other implementations. Instead, they can only communicate through the interfaces.

When executed well, the benefit of DI is that we can test each implementation in isolation. We do this by injecting a controlled test double (dummy, fake, stub, mock, etc.) for each of the interfaces required by a particular implementation.

There are dedicated libraries for DI in any mainstream language. In Scala, I’ve found it’s usually sufficient to just use a trait for the interface, a final class for the implementation, and constructor parameters for the actual dependency injection.
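The same recipe translates to other languages. Here is a rough Python analog, assuming a `Protocol` in place of the trait, a plain class for the implementation, and constructor parameters for the injection. All names (`Clock`, `SystemClock`, `Greeter`, `FixedClock`) are hypothetical.

```python
from datetime import datetime
from typing import Protocol


class Clock(Protocol):
    # Abstract interface for an external dependency.
    def now_hour(self) -> int: ...


class SystemClock:
    # Concrete implementation used in production.
    def now_hour(self) -> int:
        return datetime.now().hour


class Greeter:
    # Depends only on the Clock interface, injected through the constructor.
    def __init__(self, clock: Clock) -> None:
        self._clock = clock

    def greeting(self) -> str:
        return "Good morning" if self._clock.now_hour() < 12 else "Good afternoon"


class FixedClock:
    # Test double: a stub that returns a controlled hour.
    def __init__(self, hour: int) -> None:
        self._hour = hour

    def now_hour(self) -> int:
        return self._hour
```

In a test, `Greeter(FixedClock(9))` exercises the morning branch without depending on the time of day at which the test suite runs.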

Like any other technique, DI can be taken to a counterproductive extreme. There are definitely cases where we want to test one or more concrete implementations together, covered in chapter 9.

TDD is a method for implementing, not a method for testing

The author covers test-driven development (TDD) in chapter 8.

I found the most interesting point in this chapter was the distinction of TDD as a method to guide development, not a method for testing.

In other words, we can use TDD to arrive at an implementation, but we still need to re-evaluate the tests. We might keep some or most of the tests. Or we might realize the tests were effective as a means to an end (the implementation), but are not actually effective tests.

This was a refreshing perspective. I’ve found it challenging to follow TDD as I previously understood it, simply because the tests I want to write while implementing are usually different from the tests I want to write to verify the implementation. When I first start on an implementation, I’m generally satisfied testing with some scratch code in a simple main method that I know I’ll later discard. When I verify the implementation, I want to optimize for readability and maintainability.

Only integration test the dependencies that you exclusively own

The author covers large tests, including system and integration tests, in chapter 9.

Based on my experience, and the author’s perspectives, I’ve arrived at a heuristic for integration testing: I only integration test the external dependencies that my service exclusively owns.

For example, if my service exclusively owns a Postgres database, I’ll write integration tests against a realistic, local Postgres container. If my service also depends on some other shared service, I stub or mock all interactions with that service.

If I find that I actually need to think about the implementation details of another service (e.g., the underlying database), there’s probably a bug or a leaky abstraction in the other service. Instead of leaking the detail into my service, I work with the owner to fix the underlying issue. The alternative is that every service knows implementation details about other services, which is clearly untenable.

In other words, every service owner should write tests to ensure the service can be trusted to work as described in the API contract.

Keep stubs small and specific

The author covers some test smells and anti-patterns in chapter 10.

One that I was particularly happy to see covered was fixtures that are too general. This test smell consists of a large fixture that is shared by many tests.

Another way I’ve seen this play out is a singleton stub shared by many tests. This approach is expedient when first implementing a test suite, as it lets us share the same stub in many places (DRY). However, it quickly becomes brittle: when many tests depend on the specific behaviors of a shared stub, a trivial change or addition in the stub will break several unrelated tests.

The solution is to keep stubs as close and specific as possible to the tests that use them. It might result in more test code overall, but it keeps the tests decoupled and easier to maintain.

Conclusion

Buy and read Effective Software Testing by Maurício Aniche. It’s well worth the time and cost!

Appendix

Discussion

The Trending in Testing weekly newsletter highlighted this post in issue #38.

Optimizing Postgres Text Search with Trigrams

2022-02-18T15:00:00+00:00

Introduction

In this post, we’ll implement and optimize a text search system based on Postgres Trigrams.

We’ll start with some fundamental concepts, then define a test environment based on a dataset of 8.9 million Amazon reviews, then cover three possible optimizations.

Our search will start very slow, about 360 seconds. With some thoughtful optimization we’ll end up at just over 100 milliseconds – a ~3600x speedup! These optimizations won’t apply perfectly to every text search use-case, but they should at the very least spark some ideas.

Defining “text search”

For our purposes, “text search” is defined as follows:

We have a database table with multiple text columns.
A user provides one input: a query string (think Google or Amazon search box).
We search for ten rows in our table for which at least one of the columns matches the query string. A match can be either exact or fuzzy. For example, the query string “foobar” is an exact match for the value “foobarbaz” and a fuzzy match for the value “foo bar”.
Once we’ve found ten such rows, we score them, re-rank them by their scores, and return them to the user.

Why Postgres?

Postgres is a ubiquitous relational database, but dedicated search systems like Solr, Elasticsearch, and Opensearch are far better-known for text search.

Still, Postgres offers some competent text search functionality, with several benefits over a dedicated search system:

We avoid operating additional infrastructure (e.g., an Elasticsearch cluster).
We avoid syncing data between systems (e.g., Postgres to Elasticsearch).
We avoid re-implementing non-trivial features (e.g., multi-tenant authorization).

Having implemented and operated search functionality on both Postgres and Elasticsearch, my current heuristics for choosing between them are:

If search is our core offering, or we can’t afford to fit our searchable text in Postgres’ shared memory buffer, use Elasticsearch and earmark one engineer for operations and tuning.
Otherwise, Postgres is probably good enough. At the very least, we’ll end up with a competent baseline for further improvement.

As a case study, Gitlab has publicly documented their journey in growing from Postgres trigram-based search to advanced search in Elasticsearch.¹

What are Trigrams?

A trigram is simply a three-character sequence from a string.

For example, the trigrams in "hello" are {"  h"," he","hel","ell","llo","lo "}. Postgres pads each word with two leading spaces and one trailing space before extracting trigrams.

Trigrams present a simple solution for string comparison: to compare two strings, we can count the number of trigrams they share.

For example, “hello” and “helo” share 4 trigrams. “hello” and “bye” share 0. This isn’t fool-proof: “hello” shares more trigrams with “jello” than with “hi”. But it’s usually a strong baseline.
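To make the counting concrete, here is a small Python sketch that approximates how Postgres extracts trigrams. It assumes lowercase ASCII words padded with two leading spaces and one trailing space. The function names are hypothetical, and this is an approximation, not the actual pg_trgm implementation.

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm-style trigrams: lowercase the text, split it into
    alphanumeric words, and pad each word with two leading spaces and one
    trailing space before taking every three-character window."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def shared_trigrams(a: str, b: str) -> int:
    # Number of distinct trigrams that appear in both strings.
    return len(trigrams(a) & trigrams(b))
```

With this sketch, `shared_trigrams("hello", "helo")` is 4, `shared_trigrams("hello", "jello")` is 3, and `shared_trigrams("hello", "hi")` is 1.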

Why not Full Text Search?

Postgres also offers Full Text Search. I’m personally more familiar with trigram-based search, which is part of the reason we’ll use trigrams in this post.

From some experimenting with Full Text Search, I’ve found the API focuses more narrowly on natural language text (i.e., words, spaces, punctuation), than on general-purpose text (i.e., natural language text and product SKUs and email addresses, …).²

Still, some optimizations in this post might translate nicely to Full Text Search.

The Test Environment

The full test environment is available on Github. Let’s have a look at some specific components.

Host

Everything is running on my Dell XPS-9570 Laptop with an Intel i7-8750H, 32GB of memory, an SSD, and Ubuntu 20.04. Exact timing will vary by the host environment, but the relative performance should be host-agnostic.

Postgres

We’ll use Postgres version 14.1.0, running the bitnami/postgresql:14.1.0 container. All examples should work on Postgres >= 13.

We’ll set two server configurations:

# Provide 4GB of memory for buffers.
# This lets us keep the full table and indexes in memory.
shared_buffers = 4096MB

# Set to true so that `explain (... buffers ...)` will include IO timings.
track_io_timing = true

We’ll also set two query planning configurations:

-- This disables parallel gathers, as I've found they produce highly
-- variable results depending on the host system.
set max_parallel_workers_per_gather = 0;

-- This makes it more likely that the query planner chooses an index scan
-- instead of another strategy. I've found this will generally improve
-- performance on any system with an SSD.
set random_page_cost = 0.9;

Amazon Review Dataset

We’ll use data from the Amazon Review Dataset to demonstrate our optimizations.³

Specifically, we’ll use text properties from the 5-core Book Reviews Subset, a dataset of 8.9 million reviews for books sold on Amazon. An example review shows the shape of our dataset:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. ...",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

Each of these reviews includes five text properties: reviewerID, asin, reviewerName, reviewText, summary. reviewerID and asin are machine-generated identifiers. reviewerName, reviewText, summary are free-form human-generated text.

For simplicity, we’ll ignore reviewText and make a table with the remaining text properties:

create table reviews (
  review_id bigserial primary key,
  reviewer_id varchar(50),
  reviewer_name varchar(100),
  asin varchar(50),
  summary varchar(1000)
);

I wouldn’t recommend this schema in a real application. For example, the reviewer name should be factored out to a reviewer table. But it’s good enough for a demo.

It takes about 8 minutes to populate the table and the table size ends up at 926MB.

Example Query Strings

We’ll use one of my favorite authors, Michael Lewis, as a test subject.⁴

Specifically, we’ll search for two variations of his name:

“Michael Lewis” – the correct spelling – to find exact matches
“Michael Louis” – a plausible misspelling – to find fuzzy matches

To avoid confusion, let’s refer to these as the exact name and the fuzzy name.

Explain (Analyze, Buffers)

We’ll use Postgres’ explain (analyze, buffers) command to evaluate performance. This command takes a query, executes it, and returns the query plan and execution details.

This is not a fool-proof solution. A better benchmarking harness would include realistic application request patterns, authentication, authorization, logging, serialization, etc. However, building such a harness would be pointlessly cumbersome, as it would need to be re-implemented for any other non-trivial application.

The main thing we’re looking at is how the query plan, execution time, and I/O statistics⁵ react to optimizations. This is enough to conclude one approach is better than another.

To make this a bit more aesthetically appealing, I cobbled together an embedded version of the excellent PEV2 (Postgres Explain Visualizer 2) project.⁶

Let’s look at an example. I ran this query:

explain (analyze, buffers)
select count(review_id) from reviews;

Which produced this output:

Aggregate  (cost=177675.25..177675.26 rows=1 width=8) (actual time=3282.696..3282.697 rows=1 loops=1)
  Buffers: shared hit=10 read=24314
  I/O Timings: read=121.636
  ->  Index Only Scan using reviews_pkey on reviews  (cost=0.43..155430.15 rows=8898041 width=8) (actual time=1.341..1679.504 rows=8898041 loops=1)
    Heap Fetches: 0
    Buffers: shared hit=10 read=24314
    I/O Timings: read=121.636
Planning Time: 0.138 ms
Execution Time: 3282.768 ms

The PEV2 viewer shows the query plan visualization, raw query plan, query, and some stats.

Clicking around a bit in the plan reveals three important results:

Total planning and execution time: we spent 0.138ms planning and about 3.3s executing.
Timing and types of execution: we spent about 1.6s in the Aggregate and about 1.7s in the Index Only Scan. The Index Only Scan tells us we were able to make use of an index in this query.
Timing, amount, and types of I/O: if we expand the Index Only Scan and open the IO and Buffers tab, we see we spent 122ms on I/O, hit 10 blocks, and read 24,314 blocks. A hit means the block was in the shared buffer cache. A read means we went to the filesystem cache or SSD.

Baseline Search Query

With our test environment explained, let’s build a relatively simple baseline query pattern based on the trigram similarity function and its corresponding operators: % and <->.

Trigram Operators

Those already familiar with trigram similarity, %, and <-> can safely skip this section.

First, we need the function similarity(text1, text2). This function breaks both texts into a set of trigrams, computes the intersection of sets, computes the union of sets, and divides the intersection size by the union size to produce a score between 0 and 1. In other words, this is the Jaccard Index of the sets of trigrams.

The query below gives us some intuition about similarity('abc', 'abb'), 'abc' % 'abb' and 'abc' <-> 'abb':

with input as (select 'abc' as text1, 'abb' as text2)
select
  show_trgm(text1) as "text1 trigrams",
  show_trgm(text2) as "text2 trigrams",
  array(select t1.t1 
        from unnest(show_trgm(text1)) t1, 
             unnest(show_trgm(text2)) t2 where t1.t1 = t2.t2) as "intersection",
  array(select t1.t1 from unnest(show_trgm(text1)) t1 
        union 
        select t2.t2 from unnest(show_trgm(text2)) t2) as "union",
  round(similarity(text1, text2)::numeric, 3) as "similarity",
  text1 % text2 as "text1 % text2",
  text1 <-> text2 as "text1 <-> text2"
from input;

This produces:

text1 trigrams	text2 trigrams	intersection	union	similarity	text1 % text2	text1 <-> text2
{ a, ab,abc,bc }	{ a, ab,abb,bb }	{ a, ab}	{bb ,abb, a, ab,bc ,abc}	0.333	true	0.666

For “abc” and “abb”, the intersection size is 2 and union size is 6, so 2/6 is a similarity of 1/3.
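
To make the set arithmetic concrete, here is a rough Python sketch of the same computation. It is a simplification and not pg_trgm's implementation: it assumes ASCII letters and digits are the only word characters, and the names `trigrams` and `similarity` are chosen for illustration. The padding (two spaces before each word, one after) mirrors the show_trgm output above.

```python
import re


def trigrams(text: str) -> set:
    """Lowercase, split into words, pad each word, and collect 3-character windows."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(text1: str, text2: str) -> float:
    """Jaccard index of the two trigram sets: |intersection| / |union|."""
    t1, t2 = trigrams(text1), trigrams(text2)
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


print(sorted(trigrams("abc")))               # ['  a', ' ab', 'abc', 'bc ']
print(round(similarity("abc", "abb"), 3))    # prints 0.333
```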

The operator text1 % text2 returns true if similarity(text1, text2) exceeds a pre-defined threshold setting, pg_trgm.similarity_threshold. The default threshold is 0.3, so select 'abc' % 'abb' returns true.

The operator text1 <-> text2 returns the distance between text1 and text2, which is just 1 - similarity(text1, text2), so select 'abc' <-> 'abb' returns 2/3.

Why do we need these operators if they just alias the similarity function? At the risk of spoiling one of the optimizations, operators can leverage an index, and functions cannot.

Trigram Search Query

Let’s use these operators to search for reviews where summary matches the exact name.

(We should compare the query string against all text columns, but we start with just summary for simplicity.)

The query looks like this:

with input as (select 'Michael Lewis' as q) -- (1)
select review_id,
       1.0 - (summary <-> input.q) as score -- (4)
from reviews, input
where input.q % summary -- (2)
order by input.q <-> summary limit 10; -- (3)

Let’s break the query into components, numbered in correspondence to the comments above:

(1) This is a Common Table Expression (CTE). It gives us a way to reference the query string as a variable, input.q.
(2) We use input.q % summary to filter the table down to a set of candidate rows. For each of these rows, input.q and summary have a trigram similarity greater than or equal to 0.3.
(3) Once we’ve found candidate rows, we sort them by the trigram distance between input.q and summary and keep the top 10. We want the rows with highest similarity, which is equivalent to lowest distance. So we sort by the distance operator in ascending order.
(4) In order to return the score to the user, we just subtract the trigram distance from 1.0.

Let’s look at the results and performance for the exact name:

review_id	summary	score
589771	Michael Lewis Fan	0.7777
2113780	Michael Lewis Fan	0.7777
2111282	Michael Lewis bland?	0.6999
2114048	MIchael Lewis is Good	0.6666
2100962	Not Michael Lewis’ Best	0.6086
610753	Not Michael Lewis’ Best	0.6086
2111364	Boomerang, Michael Lewis	0.5833
2111212	Michael Lewis is amazing	0.5833
2111190	Michael Lewis on a Roll	0.5833
2108446	Go Long on Michael Lewis	0.5833

with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=229772.34..229772.37 rows=10 width=20) (actual time=94817.773..94817.777 rows=10 loops=1)
  Buffers: shared hit=118549
  ->  Sort  (cost=229772.34..229774.39 rows=819 width=20) (actual time=94817.772..94817.773 rows=10 loops=1)
        Sort Key: (('Michael Lewis'::text <-> (reviews.summary)::text))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=118549
        ->  Seq Scan on reviews  (cost=0.00..229754.64 rows=819 width=20) (actual time=171.828..94816.588 rows=761 loops=1)
              Filter: ('Michael Lewis'::text % (summary)::text)
              Rows Removed by Filter: 8897280
              Buffers: shared hit=118549
Planning Time: 1.669 ms
Execution Time: 94817.814 ms

And again for the fuzzy name:

review_id	summary	score
4341036	Lo Michael	0.6666
4341045	Lo, Michael	0.6666
4341030	Lo Michael!	0.6666
4341034	Lo,Michael	0.6666
4341027	Lo, Michael!	0.6666
4341043	Lo,Michael	0.6666
4341025	Lo, Michael!	0.6666
4341026	Lo. Michael !	0.6666
4341029	Lo, Michael!	0.6666
4341050	Lo michael	0.6666

with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=229792.85..229792.87 rows=10 width=20) (actual time=94591.716..94591.720 rows=10 loops=1)
  Buffers: shared hit=118549
  ->  Sort  (cost=229792.85..229794.90 rows=821 width=20) (actual time=94591.715..94591.716 rows=10 loops=1)
        Sort Key: (('Michael Louis'::text <-> (reviews.summary)::text))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=118549
        ->  Seq Scan on reviews  (cost=0.00..229775.11 rows=821 width=20) (actual time=176.054..94590.575 rows=729 loops=1)
              Filter: ('Michael Louis'::text % (summary)::text)
              Rows Removed by Filter: 8897312
              Buffers: shared hit=118549
Planning Time: 1.761 ms
Execution Time: 94591.758 ms

Qualitatively speaking, the results are reasonable. There are some exact matches for the exact name, and a few summaries containing “Lo” and “Michael” that match the fuzzy name.

👎 But the performance is terrible: over 94 seconds to find ten results!

If we extrapolate this to all four text columns, we can estimate a runtime of over 360 seconds.

How are we spending this time? The query plans suggest the following:

About 94s in Seq Scan on reviews. This is a sequential scan on the reviews table, which means Postgres iterates over all rows and keeps those that satisfy input.q % summary. This returns 761 and 729 matches for the exact and fuzzy names, respectively. The IO & Buffers tabs indicate this also involved reading 926MB of data (i.e., the whole table) from the in-memory cache. It’s better than going to SSD, but it’s still non-negligible.
About 1ms in Sort, which sorts the matches by input.q <-> summary.
Less than 1ms in Limit, which takes the first ten of the sorted rows.

Baseline Summary

Here’s what we know about our trigram search query so far:

Qualitatively, it’s not Google, but the results are reasonable.
The query is unusably slow (over 94s).
It spends virtually all its time scanning the reviews table.

Optimizations

Let’s get into some optimizations to see if we can improve on the baseline.

Indexing

The first optimization should be unsurprising: we’ll create an index for the text field.

Trigrams support both GIN and GiST index types. The main difference is that GiST supports filtering and sorting, whereas GIN only supports filtering. Since our search query involves sorting by trigram distance, we’ll use GiST.

At a high level, the GiST index works by building a lookup table from each trigram to the list of rows containing the trigram. At query time, Postgres takes the trigrams from the query string and asks the index, “which rows contain these trigrams?” The trigrams are stored as a signature (i.e., a hash), and sometimes the signatures can collide.

Since Postgres 13, the gist_trgm_ops operator class accepts a parameter called siglen, which lets us control the precision of the signature. Here’s how the docs describe it:

gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap signature. Its optional integer parameter siglen determines the signature length in bytes. The default length is 12 bytes. Valid values of signature length are between 1 and 2024 bytes. Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index.

In short, higher siglen should translate to more precise search (i.e., fewer signature collisions), at the cost of a larger index.
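
One way to picture a signature collision is a small Python sketch of a bitmap signature. This illustrates the general idea only: the hash (`zlib.crc32`) and the helper names are assumptions made for this sketch, and pg_trgm's real hashing and index layout differ.

```python
import zlib


def signature(grams, siglen):
    """Fold a set of trigrams into a bitmap of siglen * 8 bits (held in a Python int)."""
    nbits = siglen * 8
    sig = 0
    for gram in grams:
        # Stand-in hash for illustration; not the hash pg_trgm uses.
        sig |= 1 << (zlib.crc32(gram.encode("utf-8")) % nbits)
    return sig


def may_contain(row_sig, query_sig):
    """True if every bit set in the query signature is also set in the row signature.

    False means the row definitely lacks some query trigram. True only means "maybe",
    because different trigrams can land on the same bit.
    """
    return (row_sig & query_sig) == query_sig


row = {"  a", " ab", "abc", "bc "}
query = {"  a", " ab"}
for siglen in (1, 64, 256):
    row_sig = signature(row, siglen)
    bits_set = bin(row_sig).count("1")
    print(siglen, bits_set, may_contain(row_sig, signature(query, siglen)))
```

With fewer bits, more trigrams share a bit, so more rows look like possible matches and have to be rechecked against the actual text.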

We’ll start with a GiST index with siglen=64, check performance, then repeat with siglen=256.

GiST with siglen=64

create index reviews_summary_trgm_gist_idx on reviews 
  using gist(summary gist_trgm_ops(siglen=64));
vacuum analyze reviews;

This takes about 10 minutes to build and ends up using about 1000MB of storage.

Does it make a difference for performance?

For the exact name, we find:

with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=4181.401..4216.478 rows=10 loops=1)
  Buffers: shared hit=135684
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=4181.400..4216.474 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Lewis'::text)
        Order By: ((summary)::text <-> 'Michael Lewis'::text)
        Buffers: shared hit=135684
Planning Time: 1.933 ms
Execution Time: 4216.519 ms

And for the fuzzy name:

with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=4330.713..4330.850 rows=10 loops=1)
  Buffers: shared hit=135447
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=4330.711..4330.845 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Louis'::text)
        Order By: ((summary)::text <-> 'Michael Louis'::text)
        Buffers: shared hit=135447
Planning Time: 1.829 ms
Execution Time: 4330.889 ms

👍 This is a significant improvement: from over 94 seconds to under 4.5 seconds!

If we extrapolate this to all four text columns, we can estimate a runtime of under 20 seconds.

The query plans tell us how we’re making better use of time:

About 4.3s in an Index Scan on the new reviews_summary_trgm_gist_idx index. The Misc tab indicates Postgres uses the index for filtering (Index Cond) and sorting (Order By). The IO & Buffers tab indicates we’re accessing 1.03GB of data from the cache. We don’t know precisely, but this data is some combination of the index and the rows.
Less than 40ms in Limit. As far as I can tell, this is a trivial pass-through, as the index scan has already returned exactly ten rows.

GiST with siglen=256

Let’s try again with siglen=256:

drop index reviews_summary_trgm_gist_idx;
create index reviews_summary_trgm_gist_idx on reviews
  using gist(summary gist_trgm_ops(siglen=256));
vacuum analyze reviews;

This takes about 15 minutes to build and uses 1036MB of storage.

For the exact name, we find:

with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=503.082..1996.835 rows=10 loops=1)
  Buffers: shared hit=62167
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=503.079..1996.828 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Lewis'::text)
        Order By: ((summary)::text <-> 'Michael Lewis'::text)
        Buffers: shared hit=62167
Planning Time: 5.283 ms
Execution Time: 1997.397 ms

And for the fuzzy name:

with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10;

Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=707.952..708.081 rows=10 loops=1)
  Buffers: shared hit=22639
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=707.951..708.078 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Louis'::text)
        Order By: ((summary)::text <-> 'Michael Louis'::text)
        Buffers: shared hit=22639
Planning Time: 1.654 ms
Execution Time: 708.577 ms

👍 Another improvement: from 4.5 seconds to under 2 seconds!

If we extrapolate this to all four text columns, we can estimate a runtime of about 8 seconds.

Why does Siglen Matter?

Inspection of these results leads to two questions:

Why is siglen=256 over 2x faster than siglen=64?
For siglen=256, why is the fuzzy name over 2x faster than the exact name?

We can begin to answer these by looking at the IO & Buffers tabs, which tell us how much data was accessed. The numbers work out like this:

Siglen	Query String	Data Accessed	Access Type
64	exact name	1.04GB	hit (from in-memory cache)
64	fuzzy name	1.03GB	hit (from in-memory cache)
256	exact name	486MB	hit (from in-memory cache)
256	fuzzy name	177MB	hit (from in-memory cache)

Even though this data is in memory, decreasing the amount accessed makes a difference.

I’m still working on an intuitive understanding of why these two specific values of siglen work out to these specific differences, but that’s likely a topic for another post.⁵

Indexing Summary

Here’s what we know about indexing:

Adding a GiST index yields a significant speedup: 94s → 4.5s.
Increasing the siglen parameter from 64 to 256 yields another speedup: 4.5s → 2s.
The siglen parameter affects the number of buffers accessed during the index scan: greater siglen → fewer buffers → faster query.

Separate Exact and Trigram Search Queries

Recall that we’re interested in both exact and fuzzy matches. So far, we’ve used a single trigram search query to satisfy both match types. Trigrams are useful for fuzzy matches, but are they really necessary for exact matches?

Let’s take a step back, compose an exact-only search query, and see what we can do with it.

The `ilike` operator

The boolean operator text1 ilike '%' || text2 || '%' will return true if text1 contains text2, ignoring capitalization.

Here are some examples:

select
   'abc' ilike '%' || 'ab' || '%' as "abc contains ab",
   'abc' ilike '%' || 'AB' || '%' as "abc contains AB",
   'abc' ilike '%' || 'abc' || '%' as "abc contains abc",
   'abc' ilike '%' || 'abb' || '%' as "abc contains abb";

This produces:

abc contains ab	abc contains AB	abc contains abc	abc contains abb
true	true	true	false
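
One caveat when text2 comes from user input: any % or _ inside it is itself treated as a wildcard. Here is a minimal Python sketch of escaping those characters in application code, assuming the default backslash escape character for like and ilike. The helper names are hypothetical.

```python
def escape_like(q: str) -> str:
    """Escape LIKE/ILIKE wildcards so that q is matched literally."""
    # Escape the backslash first so the escapes added below are not doubled.
    return q.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def contains_pattern(q: str) -> str:
    """Build the '%...%' pattern, intended to be passed as a bound query parameter."""
    return "%" + escape_like(q) + "%"


print(contains_pattern("100%_sure"))  # prints %100\%\_sure%
```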

Exact-Only Search Query

We can use the ilike operator to compose an exact-only search query:

with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 as score -- (2)
from reviews, input
where summary ilike '%' || input.q || '%' -- (1)
limit 10; -- (3)

(1) We use the ilike operator to filter for rows where summary contains the query string.
(2) Since each summary contains the query string, we simply assign a score of 1.0.
(3) We just want ten of them. They all have the same score, so no need to sort.

How does it perform on our query strings?

For the exact name, we find:

with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 as score
from reviews, input
where summary ilike '%' || input.q || '%'
limit 10;

Limit  (cost=0.42..9.71 rows=10 width=40) (actual time=2.955..6.431 rows=10 loops=1)
  Buffers: shared hit=865
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..763.59 rows=821 width=40) (actual time=2.952..6.425 rows=10 loops=1)
        Index Cond: ((summary)::text ~~* '%Michael Lewis%'::text)
        Buffers: shared hit=865
Planning Time: 0.413 ms
Execution Time: 6.456 ms

And for the fuzzy name:

with input as (select 'Michael Louis' as q)
select review_id,
       1.0 as score
from reviews, input
where summary ilike '%' || input.q || '%'
limit 10;

Limit  (cost=0.42..9.71 rows=10 width=40) (actual time=10.582..10.583 rows=0 loops=1)
  Buffers: shared hit=1429
  ->  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..763.59 rows=821 width=40) (actual time=10.581..10.581 rows=0 loops=1)
    Index Cond: ((summary)::text ~~* '%Michael Louis%'::text)
    Rows Removed by Index Recheck: 1
    Buffers: shared hit=1429
Planning Time: 0.340 ms
Execution Time: 10.007 ms

👍 A significant improvement: from 2s to 10ms!

If we extrapolate to all four text columns, we’re down to potentially 40ms.

This tells us that finding exact matches with an exact-only query is significantly faster than finding them with a trigram search query.

The query plan is roughly the same as our trigram search query, basically just an Index Scan on reviews, but the amount of data accessed is significantly lower: under 12MB.

Crucially, this presents an opportunity for optimization: given a query string and a desired number of results, we first attempt to very quickly search for exact matches. If we find the desired number of results, we can skip the fuzzy search entirely. If we don’t find all the results, we run the fuzzy query. If we want to get fancy, we can even run the two searches in parallel and cancel the fuzzy search if our exact search is sufficient.
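
The sequential version of this idea can be sketched in a few lines of Python. This is one possible shape and not a definitive implementation: `exact_search` and `fuzzy_search` are hypothetical callables that would run the exact-only and trigram queries and return (review_id, score) pairs. Keeping the exact matches and topping them up with fuzzy ones is a design choice made for this sketch.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, float]  # (review_id, score)
SearchFn = Callable[[str, int], List[Row]]


def search(q: str, limit: int, exact_search: SearchFn, fuzzy_search: SearchFn) -> List[Row]:
    """Try the fast exact-only query first; fall back to the trigram query if needed."""
    results = list(exact_search(q, limit))[:limit]
    if len(results) >= limit:
        return results  # enough exact matches, so skip the slow fuzzy query
    seen = {review_id for review_id, _ in results}
    for review_id, score in fuzzy_search(q, limit):
        if len(results) >= limit:
            break
        if review_id not in seen:  # avoid returning a review twice
            seen.add(review_id)
            results.append((review_id, score))
    return results
```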

Separate Queries Summary

Here’s what we know about separating exact and trigram search queries:

An exact-only query accesses significantly less data than a trigram query: 177MB → 11MB
An exact-only query is significantly faster than a trigram query: 2s → 10ms
If the exact-only query finds enough results, we can skip the fuzzy query.
In the best case, we turn a 2s search into a 10ms search.
In the worst case, we turn a 2s search into a 2.01s search.

Single Query for All Text Columns

So far our search queries have only checked for matches in the summary column, and we’ve been extrapolating the timing.

Now is the time to stop extrapolating and compose a query that actually checks all four text columns. Let’s look at three ways we can make this happen.

Four Single-Column Queries

The simplest method to check each of the columns is to simply search for every column separately. Then we would deduplicate and re-rank the results in application code.

To do this, we start by building indexes on the three remaining columns:

create index reviews_reviewer_id_trgm_gist_idx on reviews
  using gist(reviewer_id gist_trgm_ops(siglen=256));
create index reviews_reviewer_name_trgm_gist_idx on reviews
  using gist(reviewer_name gist_trgm_ops(siglen=256));
create index reviews_asin_trgm_gist_idx on reviews
  using gist(asin gist_trgm_ops(siglen=256));
vacuum analyze reviews;

Each of these takes about fifteen minutes to build and uses about 690MB of storage.

The trigram search query is just a union of the original trigram search query on each column:

with input as (select 'Michael Lewis' as q)
(select review_id, 1.0 - (reviewer_id <-> input.q) as score
from reviews, input
where input.q % reviewer_id
order by input.q <-> reviewer_id limit 10)
union all
(select review_id, 1.0 - (reviewer_name <-> input.q) as score
from reviews, input
where input.q % reviewer_name
order by input.q <-> reviewer_name limit 10)
union all
(select review_id, 1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10)
union all
(select review_id, 1.0 - (asin <-> input.q) as score
from reviews, input
where input.q % asin
order by input.q <-> asin limit 10);

The exact-only query follows the same pattern:

explain (analyze, buffers)
with input as (select 'Michael Lewis' as q)
(select review_id, 1.0 as score
from reviews, input
where reviewer_id ilike '%' || input.q || '%'
limit 10)
union all
(select review_id, 1.0 as score
from reviews, input
where reviewer_name ilike '%' || input.q || '%'
limit 10)
union all
(select review_id, 1.0 as score
from reviews, input
where summary ilike '%' || input.q || '%'
limit 10)
union all
(select review_id, 1.0 as score
from reviews, input
where asin ilike '%' || input.q || '%'
limit 10);
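
Either union can return up to 40 rows, and a review can appear more than once if it matches in several columns. The deduplicate and re-rank step in application code might look like the Python sketch below. The `merge_results` name and the choice to keep each review's best score are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, float]  # (review_id, score)


def merge_results(per_column_rows: Iterable[Iterable[Row]], limit: int = 10) -> List[Row]:
    """Keep the best score per review_id, then return the top `limit` rows."""
    best: Dict[int, float] = {}
    for rows in per_column_rows:
        for review_id, score in rows:
            if score > best.get(review_id, float("-inf")):
                best[review_id] = score
    # Highest score first; ties broken by review_id for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


rows = [[(1, 0.5), (2, 0.9)], [(1, 0.8)], [], [(3, 0.8)]]
print(merge_results(rows, limit=2))  # prints [(2, 0.9), (1, 0.8)]
```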

Before analyzing the query execution, let’s review our thinking on how long this should take.

Our latest queries looked at the summary column and took about 10ms for exact-only search and 2s for trigram search. We have four text columns, so it’s not crazy to estimate somewhere between 40ms and 8s for four one-column queries.

The actual performance works out like this:

Query	Query String	Execution Time	Buffer Hits
Trigram	exact name	10.7s	336986
Trigram	fuzzy name	11.2s	336998
Exact-Only	exact name	144ms	12684
Exact-Only	fuzzy name	94ms	9263

👎 The performance is pretty bad: about 11s to find ten matches.

All four plans are roughly identical, so let’s look at the trigram query for the exact name:

with input as (select 'Michael Lewis' as q)
(select review_id, 1.0 - (reviewer_id <-> input.q) as score
from reviews, input
where input.q % reviewer_id
order by input.q <-> reviewer_id limit 10)
union all
(select review_id, 1.0 - (reviewer_name <-> input.q) as score
from reviews, input
where input.q % reviewer_name
order by input.q <-> reviewer_name limit 10)
union all
(select review_id, 1.0 - (summary <-> input.q) as score
from reviews, input
where input.q % summary
order by input.q <-> summary limit 10)
union all
(select review_id, 1.0 - (asin <-> input.q) as score
from reviews, input
where input.q % asin
order by input.q <-> asin limit 10);

Append  (cost=64064.97..256647.56 rows=40 width=16) (actual time=6764.697..10795.095 rows=20 loops=1)
  Buffers: shared hit=336986
  CTE input
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
  ->  Subquery Scan on "*SELECT* 1_1"  (cost=64064.96..64065.09 rows=10 width=16) (actual time=2506.643..2506.645 rows=0 loops=1)
        Buffers: shared hit=69700
        ->  Limit  (cost=64064.96..64064.99 rows=10 width=20) (actual time=2506.642..2506.643 rows=0 loops=1)
              Buffers: shared hit=69700
              ->  Sort  (cost=64064.96..64287.41 rows=88980 width=20) (actual time=2506.641..2506.642 rows=0 loops=1)
                    Sort Key: ((input.q <-> (reviews.reviewer_id)::text))
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=69700
                    ->  Nested Loop  (cost=0.42..62142.14 rows=88980 width=20) (actual time=2506.636..2506.636 rows=0 loops=1)
                          Buffers: shared hit=69700
                          ->  CTE Scan on input  (cost=0.00..0.02 rows=1 width=32) (actual time=0.003..0.005 rows=1 loops=1)
                          ->  Index Scan using reviews_reviewer_id_trgm_gist_idx on reviews  (cost=0.42..60584.97 rows=88980 width=22) (actual time=2506.628..2506.628 rows=0 loops=1)
                                Index Cond: ((reviewer_id)::text % input.q)
                                Buffers: shared hit=69700
  ->  Subquery Scan on "*SELECT* 2"  (cost=64056.86..64056.99 rows=10 width=16) (actual time=4258.051..4258.058 rows=10 loops=1)
        Buffers: shared hit=133720
        ->  Limit  (cost=64056.86..64056.89 rows=10 width=20) (actual time=4258.048..4258.052 rows=10 loops=1)
              Buffers: shared hit=133720
              ->  Sort  (cost=64056.86..64279.31 rows=88980 width=20) (actual time=4258.047..4258.049 rows=10 loops=1)
                    Sort Key: ((input_1.q <-> (reviews_1.reviewer_name)::text))
                    Sort Method: top-N heapsort  Memory: 26kB
                    Buffers: shared hit=133720
                    ->  Nested Loop  (cost=0.42..62134.04 rows=88980 width=20) (actual time=0.750..4239.400 rows=50214 loops=1)
                          Buffers: shared hit=133720
                          ->  CTE Scan on input input_1  (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
                          ->  Index Scan using reviews_reviewer_name_trgm_gist_idx on reviews reviews_1  (cost=0.42..60576.87 rows=88980 width=24) (actual time=0.722..3483.767 rows=50214 loops=1)
                                Index Cond: ((reviewer_name)::text % input_1.q)
                                Buffers: shared hit=133720
  ->  Subquery Scan on "*SELECT* 3"  (cost=64460.96..64461.09 rows=10 width=16) (actual time=4022.956..4022.963 rows=10 loops=1)
        Buffers: shared hit=132744
        ->  Limit  (cost=64460.96..64460.99 rows=10 width=20) (actual time=4022.954..4022.957 rows=10 loops=1)
              Buffers: shared hit=132744
              ->  Sort  (cost=64460.96..64683.41 rows=88980 width=20) (actual time=4022.953..4022.954 rows=10 loops=1)
                    Sort Key: ((input_2.q <-> (reviews_2.summary)::text))
                    Sort Method: top-N heapsort  Memory: 26kB
                    Buffers: shared hit=132744
                    ->  Nested Loop  (cost=0.42..62538.14 rows=88980 width=20) (actual time=8.015..4022.513 rows=761 loops=1)
                          Buffers: shared hit=132744
                          ->  CTE Scan on input input_2  (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.002 rows=1 loops=1)
                          ->  Index Scan using reviews_summary_trgm_gist_idx on reviews reviews_2  (cost=0.42..60980.97 rows=88980 width=34) (actual time=7.986..4009.293 rows=761 loops=1)
                                Index Cond: ((summary)::text % input_2.q)
                                Buffers: shared hit=132744
  ->  Subquery Scan on "*SELECT* 4"  (cost=64064.06..64064.19 rows=10 width=16) (actual time=7.418..7.419 rows=0 loops=1)
        Buffers: shared hit=822
        ->  Limit  (cost=64064.06..64064.09 rows=10 width=20) (actual time=7.417..7.418 rows=0 loops=1)
              Buffers: shared hit=822
              ->  Sort  (cost=64064.06..64286.51 rows=88980 width=20) (actual time=7.416..7.417 rows=0 loops=1)
                    Sort Key: ((input_3.q <-> (reviews_3.asin)::text))
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=822
                    ->  Nested Loop  (cost=0.42..62141.24 rows=88980 width=20) (actual time=7.410..7.411 rows=0 loops=1)
                          Buffers: shared hit=822
                          ->  CTE Scan on input input_3  (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.001 rows=1 loops=1)
                          ->  Index Scan using reviews_asin_trgm_gist_idx on reviews reviews_3  (cost=0.42..60584.07 rows=88980 width=19) (actual time=7.406..7.406 rows=0 loops=1)
                                Index Cond: ((asin)::text % input_3.q)
                                Buffers: shared hit=822
Planning Time: 0.433 ms
Execution Time: 10795.196 ms

Here’s how we spend this time:

Just over 10s in four Index Scan blocks (one per column). These scans return 50,214 rows for reviewer_name, 761 rows for summary and 0 rows for asin and reviewer_id. In total, they access about 2.59GB of data from the shared buffer cache.
769ms in Nested Loop blocks. These loops combine the input with the Index Scan results. It’s rather surprising that we spend any significant time here, but we could easily optimize this out by getting rid of the input CTE.

If we want to search all four text columns, we’ll need to think a bit harder!

One Four-Column Query with Disjunctions

As a second pass, what if we flatten the four unioned queries into a single disjunctive query?

with input as (select 'Michael Lewis' as q)
select review_id,
       (1 - least(
        input.q <-> reviewer_id,
        input.q <-> reviewer_name,
        input.q <-> summary,
        input.q <-> asin)) as score -- (3)
from reviews, input
where input.q % reviewer_id
   or input.q % reviewer_name
   or input.q % summary
   or input.q % asin -- (1)
order by least(
    input.q <-> reviewer_id,
    input.q <-> reviewer_name,
    input.q <-> summary,
    input.q <-> asin) limit 10; -- (2)

Explaining the numbered components:

(1) We keep the row as a candidate if it’s a trigram match for any of the four columns.
(2) We sort the candidates by the lowest trigram distance to any of the four columns.
(3) We score the candidates by one minus the lowest trigram distance to any of the four columns. This is equivalent to the greatest trigram similarity.

The performance works out like this:

Query	Query String	Execution Time	Buffer Hits
Trigram	exact name	13.8s	323953
Trigram	fuzzy name	13.9s	324728
Exact-Only	exact name	162ms	14705
Exact-Only	fuzzy name	153ms	12987

👎 The performance is even worse: about 14s to find ten matches.

explain (analyze, buffers)
with input as (select 'Michael Lewis' as q)
select review_id,
       (1 - least(
        input.q <-> reviewer_id,
        input.q <-> reviewer_name,
        input.q <-> summary,
        input.q <-> asin)) as score -- (3)
from reviews, input
where input.q % reviewer_id
   or input.q % reviewer_name
   or input.q % summary
   or input.q % asin -- (1)
order by least(
    input.q <-> reviewer_id,
    input.q <-> reviewer_name,
    input.q <-> summary,
    input.q <-> asin) limit 10; -- (2)

Limit  (cost=5389.77..5389.79 rows=10 width=20) (actual time=13856.366..13856.370 rows=10 loops=1)
  Buffers: shared hit=323953
  ->  Sort  (cost=5389.77..5403.40 rows=5452 width=20) (actual time=13856.364..13856.367 rows=10 loops=1)
        Sort Key: (LEAST(('Michael Lewis'::text <-> (reviews.reviewer_id)::text), ('Michael Lewis'::text <-> (reviews.reviewer_name)::text), ('Michael Lewis'::text <-> (reviews.summary)::text), ('Michael Lewis'::text <-> (reviews.asin)::text)))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=323953
        ->  Bitmap Heap Scan on reviews  (cost=102.01..5271.95 rows=5452 width=20) (actual time=10013.108..13837.707 rows=50929 loops=1)
              Recheck Cond: (('Michael Lewis'::text % (reviewer_id)::text) OR ('Michael Lewis'::text % (reviewer_name)::text) OR ('Michael Lewis'::text % (summary)::text) OR ('Michael Lewis'::text % (asin)::text))
              Filter: (('Michael Lewis'::text % (reviewer_id)::text) OR ('Michael Lewis'::text % (reviewer_name)::text) OR ('Michael Lewis'::text % (summary)::text) OR ('Michael Lewis'::text % (asin)::text))
              Heap Blocks: exact=34021
              Buffers: shared hit=323953
              ->  BitmapOr  (cost=102.01..102.01 rows=5453 width=0) (actual time=9995.890..9995.892 rows=0 loops=1)
                    Buffers: shared hit=289932
                    ->  Bitmap Index Scan on reviews_reviewer_id_trgm_gist_idx  (cost=0.00..15.07 rows=874 width=0) (actual time=2488.516..2488.517 rows=0 loops=1)
                          Index Cond: ((reviewer_id)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=69700
                    ->  Bitmap Index Scan on reviews_reviewer_name_trgm_gist_idx  (cost=0.00..48.25 rows=2898 width=0) (actual time=3429.824..3429.824 rows=50214 loops=1)
                          Index Cond: ((reviewer_name)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=87344
                    ->  Bitmap Index Scan on reviews_summary_trgm_gist_idx  (cost=0.00..18.28 rows=821 width=0) (actual time=4070.030..4070.030 rows=761 loops=1)
                          Index Cond: ((summary)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=132066
                    ->  Bitmap Index Scan on reviews_asin_trgm_gist_idx  (cost=0.00..14.96 rows=859 width=0) (actual time=7.515..7.515 rows=0 loops=1)
                          Index Cond: ((asin)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=822
Planning Time: 5.831 ms
Execution Time: 13856.489 ms

Here’s how we spend this time:

About 10s in four Bitmap Index Scan blocks, one per text column. Just like the previous iteration, these scans return 50,214 and 761 rows for reviewer_name and summary, respectively. In total, they access about 2.24GB of data from the shared buffer cache.
About 3s in a Bitmap Heap Scan. This step deduplicates the data returned from the Bitmap Index Scan blocks. Unfortunately, it only removes 46 of the 50,975 rows returned from the scans, and it accesses another 266MB of data from the shared buffer cache.
About 20ms in a Sort block that sorts the 50,929 rows returned from the previous blocks.

Alas, the query is a bit more compact, but it doesn’t make very good use of time.

One Four-Column Query with an Expression Index

Let’s give this one more try. For this final pass, we’ll need to introduce two new concepts: expression indexes and trigram word_similarity.

Expression Indexes

An Expression Index lets us apply some function to a set of columns (all on the same table) and index the resulting values.

The canonical example is a query for a full name against a table with first_name and last_name columns:

SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';

We don’t want to store a full_name column, as that would duplicate data and probably drift. Instead, we can create an index on the same name concatenation:

CREATE INDEX people_names ON people ((first_name || ' ' || last_name));

Then, any query with the same expression can leverage the index – pretty cool if you ask me.

`word_similarity`

The trigram word_similarity(text1, text2) function is a variation on the similarity(text1, text2) function.

As a reminder, similarity(text1, text2) computes the intersection-over-union of the two trigram sets. In contrast, word_similarity(text1, text2) computes the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string.

That is quite a mouthful. For our purposes, the point is this: similarity is sensitive to the length of the two strings, whereas word_similarity is not!

Let’s look at an example that demonstrates the sensitivity to string length:

select text1, text2, similarity(text1, text2), word_similarity(text1, text2)
from
(values ('louis', 'lewis'),
        ('louis', 'a lewis c'),
        ('louis', 'aa lewis cc'),
        ('louis', 'aaa lewis ccc')) v(text1, text2);

Note how the similarity decreases as the length of text2 increases, whereas word_similarity remains constant.

text1	text2	similarity	word_similarity
louis	lewis	0.2	0.2
louis	a lewis c	0.14285715	0.2
louis	aa lewis cc	0.125	0.2
louis	aaa lewis ccc	0.11111111	0.2
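To make the two definitions concrete, here is a minimal Python sketch that reproduces the numbers in the table above. It is a simplified model of pg_trgm, not its actual implementation: it assumes each lowercased word is padded with two leading spaces and one trailing space, and it brute-forces every contiguous extent of the second string's trigrams. The function names are illustrative and only mirror the SQL functions.

```python
import re


def trigrams(text):
    """Ordered list of trigrams; each word is padded with two leading and one trailing space."""
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def jaccard(a, b):
    """Intersection-over-union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(text1, text2):
    """Compare the full trigram sets of both strings."""
    return jaccard(set(trigrams(text1)), set(trigrams(text2)))


def word_similarity(text1, text2):
    """Best score between text1's trigram set and any contiguous extent of text2's trigrams."""
    query = set(trigrams(text1))
    ordered = trigrams(text2)
    best = 0.0
    for start in range(len(ordered)):
        for end in range(start + 1, len(ordered) + 1):
            best = max(best, jaccard(query, set(ordered[start:end])))
    return best


for text2 in ["lewis", "a lewis c", "aa lewis cc", "aaa lewis ccc"]:
    print(text2, round(similarity("louis", text2), 4), round(word_similarity("louis", text2), 4))
```

Under these assumptions, the sketch yields 0.2, 0.1429, 0.125, and 0.1111 for similarity, and 0.2 in every row for word_similarity: the extra trigrams from the padding words inflate the union for similarity, while word_similarity can pick the extent that covers only lewis.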

Why does this property matter? I don’t want to give too much away, but we just described an indexing technique that leverages concatenated text columns. Concatenated text columns are, by definition, longer than individual text columns.

Some final details, for the sake of completeness:

The order of arguments matters. In general, word_similarity(text1, text2) is not equal to word_similarity(text2, text1).
The text1 <<-> text2 operator is used to compute word_similarity distance, i.e., 1 - word_similarity(text1, text2). This is analogous to text1 <-> text2 and 1 - similarity(text1, text2).
The text1 <% text2 operator is used to filter for word_similarity(text1, text2) exceeding a fixed threshold. The default threshold is 0.6.

A Blazing Fast Search Query

Let’s put our knowledge of expression indexes and word_similarity to use.

We’ll start by building an index on the concatenation expression of all four text columns. We have to coalesce the columns to empty strings, as they are all nullable.

create index reviews_searchable_text_trgm_gist_idx on reviews
  using gist((
      coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))  gist_trgm_ops(siglen=256));

This takes about 16 minutes to build and ends up using about 2.2GB of storage.

Now we need a search query that can leverage this index. Behold, our new trigram search query:

with input as (select 'Michael Louis' as q)
select review_id,
      1 - (input.q <<-> (coalesce(asin, '') || ' ' || 
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))) as score                    -- (3)
from reviews, input
where input.q <% (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))                              -- (1)
order by input.q <<-> (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, '')) limit 10;                    -- (2)

The numbered components should help cut through the concatenations:

We use input.q <% concatenated_columns to filter the table down to a set of candidate rows. For each of these rows, input.q and the concatenated columns have a trigram word similarity greater than or equal to 0.6.
Once we have candidate rows, we compute and sort by the trigram word distance between input.q and the concatenated columns.
In order to return the score, we just subtract the trigram word distance from 1.0.

The corresponding exact-only search query looks similar:

explain (analyze, buffers)
with input as (select 'Michael Lewis' as q)
select review_id,
      1.0 as score
from reviews, input
where (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, '')) ilike '%' || input.q || '%'
limit 10;

The results for the trigram search query on the exact name look like this:

review_id	asin	reviewer_id	reviewer_name	summary	score
2108562	0393072231	A22GLZ0P4MGO0W	Thom Mitchell	Another Michael Lewis Must Read	1
2111265	0393081818	A1VJF95Y8HMXW9	Louis Kokernak	Another fun and informative read from Michael Lewis	1
2114047	0393244660	A13U0KMO103QJP	Larry L. Roberts	Another great book by Michael Lewis. A must read for the small investor.	1
2108273	0393072231	A1P1WJTZGC955H	ITS	Another Michael Lewis Masterpiece	1
2097231	0393057658	A3MYOI5BL91KKA	Joseph M. Powers	Standard, high quality, Michael Lewis offering	1
2097049	0393057658	A2QHM5HBSIXRL4	Andy Orrock	Another good work from Michael Lewis	1
2113780	0393244660	APM2KUPZYHB94	Alice	Michael Lewis Fan	1
2108394	0393072231	A2JOZET739XZT7	Mark Haslett	Big Fan of Michael Lewis	1
2108244	0393072231	A27NDIDE8W9YQC	Gderf	The Big Short by Michael Lewis	1
2111212	0393081818	A2X1XC7SQQGXFH	Ian C Freund	Michael Lewis is amazing	1

And the trigram search query on the fuzzy name:

review_id	asin	reviewer_id	reviewer_name	summary	score
1368320	0316013684	A106393MZH9T4M	Michael Louis Minns	Fun and enlightening	1
1683931	0345536592	A106393MZH9T4M	Michael Louis Minns	Odd Thomas Collection	1
3803521	077831233X	A106393MZH9T4M	Michael Louis Minns	Real law by a real lawyer	1
2990026	0553808036	A106393MZH9T4M	Michael Louis Minns	Koontz Remains the Master	1
5497049	1455546143	A106393MZH9T4M	Michael Louis Minns	Could not put this down…	1
1856766	0375411089	A106393MZH9T4M	Michael Louis Minns	skinny dip	1
2000799	0385343078	A106393MZH9T4M	Michael Louis Minns	Great Historical Fiction	1
3836540	0778327760	A106393MZH9T4M	Michael Louis Minns	Teller Rocks	1
5536658	1460201051	A106393MZH9T4M	Michael Louis Minns	The Cat Didn’t really do it	1
3478374	074326875X	A106393MZH9T4M	Michael Louis Minns	Pretty good read	1

It turns out there was an avid reviewer named Michael Louis. Go figure!

Performance works out like this:

Query	Query String	Execution Time	Buffer Hits
Trigram	exact name	39ms	1685
Trigram	fuzzy name	113ms	5094
Exact-Only	exact name	37ms	4345
Exact-Only	fuzzy name	87ms	10633

👍 A significant improvement: from over 10s to just over 100ms!

Let’s look at the plan for trigram search with the exact name to understand why this is faster:

with input as (select 'Michael Lewis' as q)
select review_id,
      1 - (input.q <<-> (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))) as score                    -- (3)
from reviews, input
where input.q <% (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))                              -- (1)
order by input.q <<-> (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, '')) limit 10;                    -- (2)

Limit  (cost=0.42..7.82 rows=10 width=20) (actual time=8.202..38.716 rows=10 loops=1)
  Buffers: shared hit=1685
  ->  Index Scan using reviews_searchable_text_trgm_gist_idx on reviews  (cost=0.42..65909.97 rows=88980 width=20) (actual time=8.200..38.709 rows=10 loops=1)
        Index Cond: ((((((((COALESCE(asin, ''::character varying))::text || ' '::text) || (COALESCE(reviewer_id, ''::character varying))::text) || ' '::text) || (COALESCE(reviewer_name, ''::character varying))::text) || ' '::text) || (COALESCE(summary, ''::character varying))::text) %> 'Michael Lewis'::text)
        Rows Removed by Index Recheck: 3
        Order By: ((((((((COALESCE(asin, ''::character varying))::text || ' '::text) || (COALESCE(reviewer_id, ''::character varying))::text) || ' '::text) || (COALESCE(reviewer_name, ''::character varying))::text) || ' '::text) || (COALESCE(summary, ''::character varying))::text) <->> 'Michael Lewis'::text)
        Buffers: shared hit=1685
Planning Time: 0.176 ms
Execution Time: 38.772 ms

One last time, here’s how we spend our time:

About 40ms in an Index Scan block. This uses the new reviews_searchable_text_trgm_gist_idx index for filtering and sorting and returns exactly 10 rows. It accesses just over 13MB of data from the shared buffer cache.
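For reference, converting buffer counts into data volume is simple arithmetic. This Python sketch assumes the default Postgres block size of 8kB (8192 bytes); the helper name is hypothetical.

```python
BLOCK_SIZE_BYTES = 8192  # assumption: default Postgres block size


def buffers_to_mib(buffers, block_size=BLOCK_SIZE_BYTES):
    """Convert a count of shared buffers into mebibytes."""
    return buffers * block_size / (1024 * 1024)


# 1685 buffers: the Index Scan in this plan
print(f"{buffers_to_mib(1685):.1f} MiB")
# 34021 buffers: the heap blocks read by the Bitmap Heap Scan in the previous plan
print(f"{buffers_to_mib(34021):.1f} MiB")
```

The first value comes out just over 13MB and the second at about 266MB, matching the figures quoted for the two plans.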

Single Query Summary

Here’s what we know about combining four columns in a single query:

Unioning four queries was more than a 4x slowdown: 2s for one column → 10s for four.
Introducing a clever disjunction made it even slower: 10s → 14s.
Leveraging an expression index and a new trigram operator is our winner: 10s → 113ms.

Conclusion

Through some effort and iteration, we’ve arrived at a very performant query.

We started at 90 seconds to search one text column and ended at 113ms for four columns.

Our implementation consisted primarily of Postgres trigram and string matching operators, and our optimizations used three main techniques:

Indexing the text columns
Separating exact search queries from trigram search queries
Cleverly combining all four text columns into a single index and single query

Throughout the iterations, we leveraged explain (analyze, buffers) with the PEV2 visualizer to understand how we were spending our time on execution and I/O.

As always, I hope this post will save someone a bit of time learning, debugging, and optimizing!

Appendix

Discussion

There was some discussion about this post on Hacker News and r/PostgreSQL.
The Scaling Postgres podcast covered this post on episode 204.
The 5mins of Postgres podcast covered this post on episode 6.

Potential Improvements

Some folks have responded with interesting suggestions for potential improvements. I’ll cover them below, and might eventually try some of them and update the post.

Generated Columns

In episode 204 of the Scaling Postgres Podcast, around 8:30, the host made a nice suggestion that we might be able to use the Generated Columns feature to minimize the string concatenation boilerplate from the final query.

Some commenters on Hacker News also mentioned that the string concatenation is tedious. I agree it’s hard to read. We also have to be careful to ensure that our concatenation matches the exact expression used in the Expression Index, otherwise we won’t hit the index, which could be a subtle and painful performance regression.

I’ve never used the Generated Columns feature, but I think the solution might look something like this: define a fifth generated text column, specify that the column is generated as the concatenation of the four other columns, build a standard index on that column, and reference that column in search queries. I think this could work.

My only hesitation would be that the generated column is materialized, so it takes up additional space. The docs say specifically, “PostgreSQL currently implements only stored generated columns.” Depending on the size of the table, it might not make any difference and optimizing for readability/simplicity would be great. But that tradeoff seems worth remembering.

Materialized Views

Some commenters on Hacker News mentioned that things get tricky if we have text columns on multiple tables and suggested it might be easier to move all of the text data into a materialized view. I agree this could work, with some caveats.

The data model would have to allow for mapping each searchable “entity” to a single row in the materialized view. This can get tricky with 1:N relationships. For example, imagine a database for a blog: an article can have many comments, with articles and comments in their own separate tables. We want to search for articles, such that our query matches against both the article text and the corresponding comment text. Our query could match multiple comments for the same article, but we only want to return the article once. We would have to find a way to represent an article and all its comments as a single row in a materialized view, and it’s not immediately obvious how we would do that.

We have to account for eventual consistency. For example, imagine the same database for a blog. A user can delete an article or comment, but it remains in the materialized view until the next refresh. Now we need some filtering logic to prevent returning stale results from the materialized view. This could introduce complexity that cancels out any wins from using the materialized view in the first place. I find that eventual consistency is a reality we should all accept in distributed systems, but we should also try to prevent introducing it within a single relational database.

Finally, we would also need a reliable mechanism to refresh the materialized view. This is actually the biggest pitfall in my opinion: I’ve yet to find a satisfying mechanism for refreshing without introducing unfortunate performance dynamics, like decreasing query throughput every five minutes because the refresh is hogging resources.

This is also why I’m particularly excited about a new Postgres feature currently under development, Incremental View Maintenance (IVM). With IVM, the promise is that we can define a materialized view that is atomically updated on any write to the source table. I encourage folks to look around the docs and discussions surrounding the feature – it’s quite interesting.

GitLab’s evolution from Postgres Trigrams to Elasticsearch: Fast Search Using PostgreSQL Trigram Text Indexes (March 2016); Lessons from our journey to enable global code search with Elasticsearch on GitLab.com (March 2019); Update: The challenge of enabling Elasticsearch on GitLab.com (July 2019); Update: Elasticsearch lessons learnt for Advanced Global Search 2020-04-28 (April 2020); Troubleshooting Elasticsearch ↩
The conclusions in GitLab’s investigation of Full Text Search align well with my findings. ↩
Please see Julian McAuley’s Amazon Product Data Landing Page, Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering, Image-based recommendations on styles and substitutes. ↩
Michael Lewis has an uncanny ability to make mundane, complicated topics entertaining. Some of my favorites are The Big Short, Boomerang, and Flash Boys. ↩
For much more about buffers, I recommend reading this excellent article from Postgres.ai: EXPLAIN (ANALYZE) needs BUFFERS to improve the Postgres query optimization process ↩ ↩²
Fun fact: I used to be a Javascript/React developer (ca. 2015). But I’m not anymore, and that’s why I used iframes to make this work. ↩

Speed Limits for Rolling Restarts in Kubernetes

2021-07-20T15:00:00+00:00

Introduction

This post shows how we can tune a Kubernetes Deployment with slow-starting pods to execute rolling restarts more gracefully. In particular, we’ll focus on a case in which each pod needs a warmup phase that is not easily captured by a probe.

For brevity, let’s assume we already have a working knowledge of Kubernetes Deployments,¹ including the rolling restart strategy, ReplicaSets,² and probes.³

TLDR: adjust maxSurge, maxUnavailable, and minReadySeconds to prevent sending traffic to too many new pods at once.

Slow-Starting Pods

Application pods can be slow to start for a variety of reasons:

In a read-heavy app, each pod might populate a local cache.
In a stateful app, each pod needs to hydrate some initial state from a datastore.
In a clustered app, each pod needs to connect to some of its peers.
In an app running an interpreted or just-in-time-compiled (JIT) language, each pod incurs some startup cost for compiling its hotspots, often called a warmup phase.⁴

The first three cases are generally solved by tuning the startup, liveness, and readiness probes.

The fourth case is subtle, because the app could pass all its probes and still not be totally ready for full traffic.

Not All at Once

The issue that led to this post involved a JVM-based web service with high traffic and low latency requirements. The story started when we rolled out a simple change. We ran a standard rolling restart to roll out the new application image, and this somehow increased latency to the point of triggering production alerts.

After inspecting metrics, we noticed a clear pattern: every new pod had a large spike in CPU usage and request latency for the first 30 seconds of its runtime. After some profiling, we were able to attribute these spikes to JVM warmup. Historical metrics suggested we had been flying close to the sun for some time.

Barring some JVM gymnastics, each new pod incurs the cost of JVM warmup.⁵ Requests sent to a pod during warmup will inevitably be slower, and if too many pods are warming up simultaneously, we end up with a significant overall spike in latency.

To summarize, our standard rolling restarts were adding too many new pods, too quickly.

The Test Bench

For the rest of this post, we’ll evaluate some options for solving this type of problem. We’ll use the Nginx deployment commonly seen in Kubernetes’ docs, running in Minikube.⁶ To be clear, Nginx doesn’t really need a warmup phase, but let’s just imagine it’s some other container that does.

For each option, we’ll execute the following steps:

Create the deployment with four pods: kubectl apply -f .
Wait for all pods to be ready: kubectl rollout status deployment/.
Initiate a rolling restart of the deployment: kubectl rollout restart deployment/.
Observe the resulting restart behavior: kubectl get replicaset, sampled every second.

For the last step, we’ll observe three counters returned from the kubectl get replicaset command:

DESIRED is the number of pods our replicaset should end up with.
CURRENT is the number of pods currently running, in any state.
READY is the number of pods that have passed their readiness probe.

Round 1: Default Rolling Restart Strategy

This is a common starting point for a deployment. We specify the deployment has four identical pods, each running a single container, with HTTP liveness and readiness probes. All else is left as defaults.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          readinessProbe:
            httpGet:
              path: /
              port: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
  replicas: 4

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       7s
---
nginx-old   3         3         3       8s
nginx-new   2         2         0       1s
---
nginx-old   2         2         2       9s
nginx-new   3         3         1       2s
---
nginx-old   2         2         2       10s
nginx-new   3         3         1       3s
---
nginx-old   1         1         1       11s
nginx-new   4         4         2       4s
---
nginx-old   1         1         1       12s
nginx-new   4         4         2       5s
---
nginx-old   0         0         0       13s
nginx-new   4         4         3       6s
---
nginx-old   0         0         0       14s
nginx-new   4         4         4       7s

There are two main behaviors to observe here.

First, for the majority of the restart, we have three pods in ready state. This turns out to make sense. The deployment spec has a setting called maxUnavailable. According to the docs, maxUnavailable “specifies the maximum number of Pods that can be unavailable during the update process” and defaults to 25% of the desired count. In our example, this means the overall deployment can have one pod unavailable during the rolling restart. This effectively means we need to be 25% over-provisioned to gracefully support a rolling restart.

Second, we go from having four old pods to four new pods in just seven seconds. We should be wary of this if our pods have any significant startup costs that cannot be captured with standard probes.

Round 2: Set maxUnavailable to 0 and maxSurge to 1

Let’s tackle the first problem: we want to maintain four pods at all times.

In order to do that, we’ll set maxUnavailable to 0.

We also need to introduce a new setting: maxSurge, which “specifies the maximum number of Pods that can be created over the desired number of Pods.” Like maxUnavailable, it defaults to 25%. The default would work in our example, but I’ve found it’s better to specify explicitly. In a large deployment, adding 25% over the desired replica count could exceed resource quotas.⁷
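As a rough illustration of how those percentages turn into pod counts, here is a small Python sketch. It relies on the rounding rules described in the Kubernetes docs (a percentage maxSurge is rounded up, a percentage maxUnavailable is rounded down); the function name and the returned keys are hypothetical.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Resolve percentage settings into absolute pod counts for a rolling update."""
    # maxSurge percentages round up; maxUnavailable percentages round down.
    surge = -(-replicas * max_surge_pct // 100)
    unavailable = replicas * max_unavailable_pct // 100
    return {
        "max_surge": surge,
        "max_unavailable": unavailable,
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rolling_update_bounds(4))    # the four-pod test bench
print(rolling_update_bounds(100))  # a larger deployment
```

With four replicas, the defaults resolve to one surge pod and one unavailable pod. With 100 replicas, they allow up to 125 pods during a restart, which is where a resource quota might bite.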

So we’ll set maxSurge to 1.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          readinessProbe:
            httpGet:
              path: /
              port: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0 # New!
      maxSurge: 1       # New!

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       6s
---
nginx-old   4         4         4       7s
nginx-new   1         1         0       1s
---
nginx-old   3         3         3       9s
nginx-new   2         2         1       3s
---
nginx-old   3         3         3       10s
nginx-new   2         2         1       4s
---
nginx-old   3         3         3       11s
nginx-new   2         2         1       5s
---
nginx-old   2         2         2       12s
nginx-new   3         3         2       6s
---
nginx-old   2         2         2       13s
nginx-new   3         3         2       7s
---
nginx-old   1         1         1       14s
nginx-new   4         4         3       8s
---
nginx-old   1         1         1       15s
nginx-new   4         4         3       9s
---
nginx-old   1         1         1       16s
nginx-new   4         4         3       10s
---
nginx-old   1         1         1       18s
nginx-new   4         4         3       12s
---
nginx-old   0         0         0       18s
nginx-new   4         4         4       12s

This solves the first problem: the total number of ready pods never fell below four.

The transition from four old pods to four new pods was a bit slower (twelve seconds), but still fast enough to make us nervous if we’re concerned about something like JVM warmup.

Round 3: Set minReadySeconds, maxUnavailable to 0, and maxSurge to 1

Now let’s solve the second problem: we want a way to control the speed of our rolling restart.

It turns out there’s a setting for this as well: minReadySeconds. According to the docs, minReadySeconds “specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available” and defaults to zero.

Let’s say our application takes about three seconds to warm up and reach steady-state on key metrics.

So we’ll set minReadySeconds to 3.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        readinessProbe:
          httpGet:
            path: /
            port: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 3 # New!

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       9s
---
nginx-old   4         4         4       10s
nginx-new   1         1         0       1s
---
nginx-old   4         4         4       11s
nginx-new   1         1         1       2s
---
nginx-old   4         4         4       13s
nginx-new   1         1         1       4s
---
nginx-old   4         4         4       14s
nginx-new   1         1         1       5s
---
nginx-old   3         3         3       15s
nginx-new   2         2         1       6s
---
nginx-old   3         3         3       16s
nginx-new   2         2         2       7s
---
nginx-old   3         3         3       17s
nginx-new   2         2         2       8s
---
nginx-old   2         2         2       18s
nginx-new   3         3         2       9s
---
nginx-old   2         2         2       19s
nginx-new   3         3         2       10s
---
nginx-old   2         2         2       21s
nginx-new   3         3         3       12s
---
nginx-old   2         2         2       22s
nginx-new   3         3         3       13s
---
nginx-old   2         2         2       23s
nginx-new   3         3         3       14s
---
nginx-old   1         1         1       24s
nginx-new   4         4         3       15s
---
nginx-old   1         1         1       25s
nginx-new   4         4         3       16s
---
nginx-old   1         1         1       26s
nginx-new   4         4         4       17s
---
nginx-old   1         1         1       27s
nginx-new   4         4         4       18s
---
nginx-old   1         1         1       29s
nginx-new   4         4         4       20s
---
nginx-old   0         0         0       29s
nginx-new   4         4         4       20s

Again, the total number of ready pods never fell below four.

Notice how each new replica is ready for three seconds before the desired and current counters increment. The nginx-new transitions, denoted (desired, current, ready), are:

(1, 1, 1) at 2s
(2, 2, 1) at 6s
(2, 2, 2) at 7s
(3, 3, 2) at 9s
(3, 3, 3) at 12s
(4, 4, 3) at 15s
(4, 4, 4) at 17s

This isn’t a perfectly uniform cadence – we’re sampling via bash script – but it demonstrates that we have in fact slowed down the introduction of new pods.

Crucially, each new replica continues passing its probes throughout its warmup period. This would not be the case if we simply increased the probes’ initialDelaySeconds.

We don’t want this setup to be a bottleneck for new releases, so we should use metrics to select the smallest satisfactory minReadySeconds value.
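When weighing that tradeoff, a back-of-the-envelope lower bound on restart duration can help. The Python sketch below assumes maxUnavailable is 0, so new pods arrive in waves of at most maxSurge, and each wave must stay ready for minReadySeconds before the next one begins. It ignores scheduling, image pulls, and probe delays, and the function name is hypothetical.

```python
def min_restart_seconds(replicas, max_surge, min_ready_seconds):
    """Lower bound on rolling restart duration when maxUnavailable is 0."""
    waves = -(-replicas // max_surge)  # ceiling division: number of surge waves
    return waves * min_ready_seconds


print(min_restart_seconds(4, 1, 3))  # 12
```

For the Round 3 settings this gives 12 seconds, compared with the roughly 20 seconds observed above; the gap is the per-pod startup time that the sketch leaves out.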

Conclusion

This post demonstrates how we can use a combination of standard Kubernetes Deployment settings to solve a subtle problem with rolling restarts. This is just one of several interesting application lifecycle edge cases we must consider as we increase traffic to an application in Kubernetes. As always, I hope this post will save someone a bit of time learning and debugging, or maybe even help anticipate and prevent a looming failure.

A Kubernetes Deployment is just a set of identical pods, referred to as replicas. I’ve most commonly used deployments to run load-balanced HTTP web services. ↩
A Kubernetes Replicaset is the abstraction that maintains a set of replicas within the Deployment. ↩
For a more thorough look at probes, see Colin Breck’s post on startup probes and his three-part series on liveness and readiness probes ↩
Baeldung has a nice article on the topic of JVM warmup ↩
You can get into some expert-level games to minimize warmup. This is most important with very short-lived apps (e.g., on AWS Lambda). This particular app regularly runs for hours or days, so this type of optimization is not worth the effort. ↩
Code for this example ↩
A Kubernetes Resource Quota gives us a way to limit the resources allocated to each namespace. ↩

Scala Type Classes From Scratch

2021-06-18T00:00:00+00:00

Introduction

When I picked up Scala in 2018, I found several new constructs immediately useful: case classes, pattern matching, and immutable collections are just a few that come to mind.

However, one construct left me thoroughly confused for over a year: type classes.

As I’ve watched others learn the language, I’ve seen type classes repeatedly become a stumbling block.

In this post, I’ll walk through a realistic example problem, presenting a few dead-ends based on conventional solutions. I’ll demonstrate why we need a type class to solve the problem and build the solution incrementally. Finally, I’ll close with some thoughts on a couple related concepts.

This is by no means a new topic, likely not even a new perspective, but hopefully it saves someone else some time while learning.

What’s In a Name?

The name type class is not particularly descriptive. I know what a type is. I know what a class is. But what is a type class?

There are certainly more rigorous definitions, but here is mine:

A type class is a set of methods which can be invoked on a generic type, without any specific type implementing or even knowing about the methods.

Let’s build some intuition by diving into an example.

A Motivating Example: JSON Encoding

Here’s our problem statement:

Write a method that takes an instance of some type, encodes it as a JSON string, and writes it to a file. The method should work for types outside the immediate codebase.

Let’s start with a User case class:

final case class User(firstName: String, lastName: String)

Then we need a method which takes two arguments: an instance of some type, and a File:

JsonUtil.toFile(User("Martin", "Odersky"), new File("/tmp/out.json"))

This call should write {"firstName": "Martin", "lastName": "Odersky"} to /tmp/out.json.

This should also work for any other type that can be encoded to JSON. We don’t yet know how we’ll implement this method, so I’m leaving the exact type signature ambiguous.

Could we add a mixin trait and only encode instances which extend it?

import java.io.File
import java.nio.file.Files

trait EncodesToJson {
  def toJson: String
}

final case class User(firstName: String, lastName: String) extends EncodesToJson {
  override def toJson: String = 
    s"""{"firstName": "$firstName", "lastName": "$lastName"}"""
}

object JsonUtil {
  // Only accepts types that extend the mixin trait.
  def toFile(o: EncodesToJson, f: File): Unit =
    Files.writeString(f.toPath, o.toJson)
}

This works for types defined in our own codebase, but we can’t make a type from an external library extend this trait.

Could we make the toFile method take an `Any`?

object JsonUtil {
  def toFile(o: Any, f: File): Unit = o match {
    case u: User => 
      Files.writeString(f.toPath, s"""{"firstName": "${u.firstName}", ... }""")
    case _       => throw new Exception("Didn't see that one coming!")
  }
}

Again, this works for our own types, but it’s tedious to handle any type. We also lose type-safety, as there’s no way for the compiler to verify that the type passed will be matched. Moreover, if we publish this code in a library, we cannot handle the custom types that a user will provide.

As an aside, the Jackson libraries in Java take a similar approach. You register encoders for specific types early in your application’s lifecycle. Then you can pass an untyped Object to an encoding method. The method uses reflection to get your instance’s type and look for a registered encoder. If it finds an encoder, it encodes the instance. If not, it throws an exception. This would work for our problem as well, but I find it best to avoid untyped APIs and reflection.
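To make the trade-off concrete, here is a minimal sketch of the registry idea. It is not Jackson's actual API: RuntimeJson, register, and encode are hypothetical names I made up for illustration, and the lookup uses the instance's runtime class rather than anything the compiler can check.

```scala
import scala.collection.mutable

// Hypothetical runtime registry, for illustration only (NOT Jackson's API).
object RuntimeJson {
  private val encoders = mutable.Map.empty[Class[_], Any => String]

  // Register an encoder for one concrete class, typically at startup.
  def register[T](cls: Class[T])(f: T => String): Unit =
    encoders(cls) = (a: Any) => f(a.asInstanceOf[T])

  // Look up the encoder by the instance's runtime class.
  // A missing encoder is only discovered when this line runs.
  def encode(o: Any): String =
    encoders.get(o.getClass) match {
      case Some(enc) => enc(o)
      case None =>
        throw new IllegalArgumentException(
          s"No encoder registered for ${o.getClass.getName}")
    }
}

final case class User(firstName: String, lastName: String)
```

Calling RuntimeJson.encode on an unregistered type compiles fine and fails at runtime, which is exactly the property we would like to avoid.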

We need something else

Let’s take stock of the problem:

We need a certain piece of functionality to work for multiple types.
We can’t guarantee that the types themselves will implement this piece of functionality.
We could implement it on our own for many types, but not all types, because we can’t know all the types up-front.

So if we work backwards from the problem, we basically need some way to call a method that returns a JSON String for any arbitrary type, without any type necessarily implementing, or even knowing about, the method. That sounds a lot like my original definition of a type class.

We finally need a type class

The type class Encoder[T] describes an API for encoding an instance of T as a JSON String.

trait Encoder[T] {
  def encode(t: T): String
}

Our first-pass implementation of the toFile method takes a T, an encoder for T, and the file.

object JsonUtil {
  def toFile[T](t: T, enc: Encoder[T], f: File): Unit = {
    val jsonString: String = enc.encode(t)
    Files.writeString(f.toPath, jsonString)
  }
}

Now we need an Encoder for the User type:

val userEncoder = new Encoder[User] {
  override def encode(t: User): String =
    s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
}

Finally, we can call the toFile method:

JsonUtil.toFile(User("Martin", "Odersky"), userEncoder, new File("/tmp/out.json"))

Is that any better?

It’s slightly better than the previous solutions in that we can provide our own implementation of an Encoder for any T. But we’ve added a third argument to the toFile method, and now we have to keep track of the Encoder[User]. We can definitely still improve it.

We need some implicits!

We can leverage Scala’s implicits to make this type class solution far more elegant.

Implicits are another common source of confusion in Scala, so I’ll proceed slowly.

The first implicit we need is an implicit parameter to the JsonUtil.toFile method:

object JsonUtil {
  def toFile[T](t: T, f: File)(implicit enc: Encoder[T]): Unit = {
    val jsonString: String = enc.encode(t)
    Files.writeString(f.toPath, jsonString)
  }
}

This lets us call toFile with just two arguments, and the compiler finds the Encoder[T].

As an aside, we’ll often see these implicit parameters called ev, or ev1, ev2, .... ev stands for “evidence.” We’re basically saying, “To call this method, we need evidence that there exists an Encoder for T.”

So then we need an implicit Encoder for a User:

final case class User(firstName: String, lastName: String)

object User {
  implicit val userEncoder = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
  }
}

Why did we put the implicit in the User companion object? Because the compiler automatically looks for implicits related to a class in its companion object. You can learn more about implicit resolution from this excellent Stack Overflow answer.

Now we can call toFile without any explicit reference to an Encoder:

JsonUtil.toFile(User("Martin", "Odersky"), new File("/tmp/out.json"))

Another way to write toFile

We can slightly alter the toFile method to use a Context Bound to express that T should have an Encoder, and then access the Encoder using the implicitly method:

object JsonUtil {
  def toFile[T : Encoder](t: T, f: File): Unit = {
    val jsonString: String = implicitly[Encoder[T]].encode(t)
    Files.writeString(f.toPath, jsonString)
  }
}

When I see the context bound T : Encoder, I read “there exists an instance of Encoder for T.” The implicitly method just lets us grab that instance.

What’s the advantage of this? As far as I know, it’s just syntactic sugar. I like to use it because it lets me avoid thinking of a name for the implicit parameter.
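Here is a small sketch showing the two styles side by side. To keep it easy to run, it returns the encoded String instead of writing a file; ContextBoundDemo, encodeA, encodeB, and the Int instance are names I introduced just for this comparison.

```scala
trait Encoder[T] {
  def encode(t: T): String
}

object Encoder {
  // Small instance used only to exercise the two methods below.
  implicit val intEncoder: Encoder[Int] = new Encoder[Int] {
    override def encode(t: Int): String = t.toString
  }
}

object ContextBoundDemo {
  // Explicit implicit parameter list: we have to name the parameter.
  def encodeA[T](t: T)(implicit enc: Encoder[T]): String =
    enc.encode(t)

  // Context bound: the compiler adds an unnamed implicit Encoder[T]
  // parameter, and implicitly retrieves it.
  def encodeB[T: Encoder](t: T): String =
    implicitly[Encoder[T]].encode(t)
}
```

Both methods resolve the same instance from the Encoder companion object, so callers can't tell them apart.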

Edge Cases

What happens if there’s no Encoder for a type?

A problem with the method that takes an Any was that it could be called on types which we hadn’t anticipated, and it would only fail at runtime. With the type class, if we call toFile on an instance of a type with no corresponding Encoder, we get a compiler error instead.

JsonUtil.toFile(("Scala", 3), new File("/tmp/out.json"))

could not find implicit value for parameter enc: Encoder[(String, Int)]

As soon as we provide an implicit instance of Encoder[(String, Int)], we can proceed.

What happens if there’s no companion object?

So far we’ve only provided an implicit Encoder[T] in the companion object of the type T.

In the previous section, I mentioned providing an implicit instance of Encoder[(String, Int)]. We don’t have access to the companion object for (String, Int). Likewise, if the type T comes from an external library, we can’t modify its companion object. So how do we proceed?

In either case, we can provide an implicit Encoder in the companion object for Encoder:

trait Encoder[T] {
  def encode(t: T): String
}
object Encoder {
  // Type User comes from an external library
  // so we can't modify the companion object.
  implicit val userEncoder = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
  }
  
  // (String, Int) comes from the standard library,
  // so we can't modify its companion object either.
  implicit val stringIntEncoder = new Encoder[(String, Int)] {
    override def encode(t: (String, Int)): String =
      s"""["${t._1}", ${t._2}]"""
  }
}

The implicits are resolved just like the implicits in the User companion object.

What happens if all the types are in an external library?

Let’s say that User and Encoder both come from an external library, so we can’t modify either companion object.

There are a few ways to solve this. My preferred approach is to use a mixin trait with implicit instances of a single type class.

trait EncoderInstances {
  implicit val userEncoder = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
  }
}
object EncoderInstances extends EncoderInstances

If we want all implicit instances for the Encoder type class, we just extend the trait:

object Example extends App with EncoderInstances {
  JsonUtil.toFile(User("Martin", "Odersky"), new File("/tmp/out.json"))
}

If we just want a specific instance, we can import it from the EncoderInstances object:

import EncoderInstances.userEncoder

object Example extends App {
  JsonUtil.toFile(User("Martin", "Odersky"), new File("/tmp/out.json"))
}

What happens if we have several instances of the same type class?

Sometimes there is no single canonical implementation of a type class. For example, there might be one Encoder for User that uses camel-case, and another that uses snake-case.

When this happens, I generally avoid implementing any instances in the companion objects. Instead, I suggest using separate mixin traits, e.g., CamelCaseInstances and SnakeCaseInstances. Then we can pick which one to use in specific areas of the application.
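As a rough sketch of that layout, assuming the same Encoder and User types used throughout this post (PublicApi and InternalLogs are hypothetical call sites invented for the example):

```scala
trait Encoder[T] {
  def encode(t: T): String
}

final case class User(firstName: String, lastName: String)

trait CamelCaseInstances {
  implicit val userEncoder: Encoder[User] = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
  }
}
object CamelCaseInstances extends CamelCaseInstances

trait SnakeCaseInstances {
  implicit val userEncoder: Encoder[User] = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"first_name": "${t.firstName}", "last_name": "${t.lastName}"}"""
  }
}
object SnakeCaseInstances extends SnakeCaseInstances

// Hypothetical call sites: each one picks exactly one set of instances.
object PublicApi extends SnakeCaseInstances {
  def render(u: User): String = implicitly[Encoder[User]].encode(u)
}

object InternalLogs extends CamelCaseInstances {
  def render(u: User): String = implicitly[Encoder[User]].encode(u)
}
```

Because neither instance lives in a companion object, nothing is picked up by accident. Each area of the application has to opt in to one convention.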

Problem Solved?

Let’s review what we accomplished using a type class:

We defined Encoder[T], a type class specifying an API for encoding a T as a JSON String.
We implemented an instance of Encoder for the User type.
We introduced implicits to minimize boilerplate in the API.
We verified we can only call JsonUtil.toFile[T] if T has an Encoder.
We discussed strategies for handling several edge cases.

The Complete Example

There are a lot of code snippets floating around. Here is the example again, end-to-end:

import java.io.File
import java.nio.file.Files

trait Encoder[T] {
  def encode(t: T): String
}

object JsonUtil {
  def toFile[T : Encoder](t: T, f: File): Unit = {
    val encoder: Encoder[T] = implicitly[Encoder[T]]
    val jsonString: String = encoder.encode(t)
    Files.writeString(f.toPath, jsonString)
  }
}

final case class User(firstName: String, lastName: String)

object User {
  implicit val userEncoder: Encoder[User] = new Encoder[User] {
    override def encode(t: User): String =
      s"""{"firstName": "${t.firstName}", "lastName": "${t.lastName}"}"""
  }
}

object TypeClassesFromScratch extends App {
  JsonUtil.toFile(User("Martin", "Odersky"), new File("/tmp/out.json"))
}

Running TypeClassesFromScratch produces a file /tmp/out.json, with contents as expected:

$ cat /tmp/out.json 
{"firstName": "Martin", "lastName": "Odersky"}%

Ad-hoc Polymorphism and the Single Responsibility Principle

I’ll end this post with some discussion of a couple related ideas: Ad-hoc Polymorphism and the Single Responsibility Principle (SRP).

Type classes are often referenced as a means to achieve ad-hoc polymorphism in Scala. To me, the term ad-hoc means that something works without up-front planning. I think of polymorphism as a pattern in which different types can be accessed through a common API. So ad-hoc polymorphism describes the existence of polymorphism without up-front planning.

A good example of polymorphism in Scala is the collections API. For example, you can call map on a List[Int] and a List[String]. However, this type of polymorphism is not ad-hoc. The Scala language team anticipated that users want to call map, flatMap, reduce, fold, etc. on a List. So they carefully designed an API with these methods.

The Scala language team did not, however, provide support for JSON encoding. If we want polymorphic JSON encoding of List, we need to implement it ourselves in an ad-hoc fashion.
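Here is one way that could look, as a sketch rather than a production encoder. The listEncoder, intEncoder, and stringEncoder instances and the ListEncoding.toJson helper are names I made up for this example, and the String encoder deliberately skips escaping.

```scala
trait Encoder[T] {
  def encode(t: T): String
}

object Encoder {
  implicit val intEncoder: Encoder[Int] = new Encoder[Int] {
    override def encode(t: Int): String = t.toString
  }

  implicit val stringEncoder: Encoder[String] = new Encoder[String] {
    // Simplified: no escaping of quotes or control characters.
    override def encode(t: String): String = "\"" + t + "\""
  }

  // Given an Encoder for T, derive an Encoder for List[T].
  implicit def listEncoder[T](implicit enc: Encoder[T]): Encoder[List[T]] =
    new Encoder[List[T]] {
      override def encode(ts: List[T]): String =
        ts.map(enc.encode).mkString("[", ", ", "]")
    }
}

object ListEncoding {
  def toJson[T: Encoder](t: T): String = implicitly[Encoder[T]].encode(t)
}
```

List itself knows nothing about JSON. The compiler assembles an Encoder[List[T]] on demand from whatever Encoder[T] is in scope, including for nested lists.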

For all we know, the Scala language team could have anticipated the need for JSON encoding, and chose not to provide it. I consider this an example of the Single Responsibility Principle.

I think about SRP like this: someone implementing the List should focus on providing an intuitive and performant API for dealing with an ordered, in-memory collection of things. Someone implementing a User case class should focus on a useful representation of a User.

In other words, each type should focus on doing one thing very well.

However, “one thing” can mean different things. Should a type with a map method be distinct from a type with a flatMap method? This distinction can be justified mathematically via Category Theory and is implemented (using type classes) in the popular Cats library.

Though perhaps less mathematically sound, I often find it useful to delineate responsibilities by considering the physical mediums involved in a given API.

A List[T], for any T, exists entirely in memory, so its API should deal with things in memory. Encoding a List can occur entirely in memory, but that’s rarely useful. Useful encoding generally involves some network or persistent storage medium, so it makes sense that this API is decoupled from the in-memory API. We can’t do anything with a List without going through memory, so it also makes sense that the polymorphic methods implemented directly on a List focus on in-memory operations, and any remaining polymorphism is ad-hoc.

Conclusion

I’ve done my best to guide us up to the need for type classes, including some dead-ends along the way, rather than just introducing and justifying them. If you’ve read this far, I hope this approach was useful, and also thank you for entertaining my musings in the last section. I am hoping to follow up with a post containing type class examples from the Scala standard library and several popular open-source libraries.

Low-Effort, High-Value Documentation

2021-03-07T15:00:00+00:00

Introduction

Maintaining quality documentation is one of the most important and also most difficult parts of software engineering.

I’ve found engineers, myself included, rarely budget enough time for writing and maintaining documentation. I’ve been on teams delivering high-value projects that started with great documentation but devolved to a mix of stale and inconsistent comments and markdown files, all propped up by tribal knowledge. I’ve accepted that great documentation requires dedicated effort independent of the actual implementation.

With this in mind, I’d like to present two documentation workflows that strike a good balance between effort and value.

Both workflows fit the category of recipes or how-to guides and leverage a compiler and continuous integration. They by no means replace the carefully-curated docs you’ll find in major open-source projects and cloud service SDKs, but they’re certainly an improvement over improvised comments and markdown files and only require a bit of initial setup.

A Project of Executable Recipes

In this workflow, we maintain a project of common recipes alongside the actual library or service. Crucially, these recipes should be easy to clone and execute.

I’ve seen this in several open-source projects. Keras has a directory of examples and benchmarks. Scalapb has example SBT projects – useful for a codegen tool, where build config is the main interface. The Play Framework has an entire repo of example projects with CI infrastructure for testing them.

When building internal libraries, I find it’s useful to maintain these recipes as a project of test suites. They don’t have to apply to functionality provided by your specific project. For instance, if engineers frequently ask about a particular use-case of some open-source library, we can write a recipe to demonstrate it.

Here’s an example for using Circe with snake_case naming.

import org.scalatest.freespec.AnyFreeSpec
import org.scalatest.matchers.should.Matchers

class CirceRecipesSpec extends AnyFreeSpec with Matchers {

  "JSON with snake_case Member Names" - {
    // Important to import from io.circe.generic.extras.
    // The Configuration class lets you customize the codecs.
    import io.circe.generic.extras.Configuration
    import io.circe.generic.extras.semiauto._
    import io.circe.syntax._
    import io.circe.parser._

    // Standard case class with implicit codecs in companion.
    case class User(firstName: String, lastName: String)
    object User {
      // Config tells encoder and decoder to use snake_case.
      implicit val config = Configuration.default.withSnakeCaseMemberNames
      implicit val encoder = deriveConfiguredEncoder[User]
      implicit val decoder = deriveConfiguredDecoder[User]
    }

    "encode with snake_case" in {
      val u = User("Jane", "Doe")
      val s = """{"first_name":"Jane","last_name":"Doe"}"""
      u.asJson.noSpaces shouldBe s
    }

    "decode with snake_case" in {
      val s = """{"first_name":"Jane","last_name":"Doe"}"""
      decode[User](s) shouldBe Right(User("Jane", "Doe"))
    }
  }
}

A critical property of this workflow is the use of real code.

Compared to a code snippet in a markdown file, an engineer can easily clone, compile, run, debug, and improve this recipe. The comments and names should guide the reader through subtleties like imports and implicits. Moreover, these tests are generally cheap to run, so the CI pipeline can continuously exercise them to ensure they stay up-to-date.

I’ve found this especially useful when the recipe demonstrates some functionality at the intersection of two libraries (e.g., using Kantan and Circe to parse a CSV containing JSON columns).

With communication and discipline, a team can develop a virtuous cycle of continually using and improving these recipes.

Compiled Code Snippets

Probably the lowest-effort documentation strategy is to simply add some code snippets to a markdown file (README.md).

If carefully maintained, this can be entirely sufficient for a simple project. For example, David Moten has a host of Java libraries for geospatial computing. Most of them are delightfully minimalistic, solving a single interesting problem with documentation entirely in the README.

Of course the downside to code snippets is their tendency to fall out-of-date. It seems someone inevitably makes a typo, fails to update them after a breaking change, or references some unknown piece of code.

The code snippet story can have a happy ending. The trick is to use a tool to compile the actual markdown files.

I’ve successfully used mdoc in Scala projects for exactly this purpose. It’s a simple yet powerful workflow: add a subproject to the SBT build, configure it to use the mdoc SBT plugin, set the location of markdown files, and run sbt mdoc in the CI pipeline. If the code snippets are stale, the build will fail.

For example, to reproduce the Circe recipe from above, first create a docs project in build.sbt:

lazy val docs = project.in(file("docs"))
  .enablePlugins(MdocPlugin)
  .settings(
    mdocIn := file("README.md"),
    mdocOut := file("/dev/null"),
    libraryDependencies ++= Seq(
      "io.circe" %% "circe-generic-extras" % "0.13.0",
      "io.circe" %% "circe-parser" % "0.13.0"
    )
  )

The subproject has two important configurations. First, mdocIn and mdocOut are set to simply take the README, compile it, and discard the result. Mdoc has some features that would eventually warrant publishing the results, but for now we just check the code. Second, the project includes the circe dependencies. Mdoc makes these available for use in the markdown.
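For MdocPlugin to resolve, the build also needs the mdoc SBT plugin on its plugin classpath. A minimal sketch of that line, assuming the usual project/plugins.sbt location; the version shown is only a placeholder, so substitute whichever sbt-mdoc release matches your build:

```scala
// project/plugins.sbt
// The version here is a placeholder, not a recommendation.
addSbtPlugin("org.scalameta" % "sbt-mdoc" % "2.2.18")
```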

With this configured, we can add code snippets to the README:

## JSON with snake_case names
Note: import from `io.circe.generic.extras.semiauto`, 
not from `io.circe.generic.semiauto`.
```scala mdoc
import io.circe.generic.extras.Configuration
import io.circe.generic.extras.semiauto._
import io.circe.syntax._
import io.circe.parser._
```

Then define a standard case class with implicit codecs 
in a companion. The implicit configuration tells the 
codecs to use snake_case.
```scala mdoc
case class User(firstName: String, lastName: String)
object User {
  implicit val config = Configuration.default.withSnakeCaseMemberNames
  implicit val encoder = deriveConfiguredEncoder[User]
  implicit val decoder = deriveConfiguredDecoder[User]
}
```

Create, encode, and decode the case class.
```scala mdoc
val u = User("Jane", "Doe")
val s = u.asJson.noSpaces // {"first_name":"Jane","last_name":"Doe"}
val r = decode[User](s)   // Right(User("Jane", "Doe"))
```

To compile the snippets, simply run sbt docs/mdoc:

$ sbt docs/mdoc
[info] running mdoc.Main 
info: Compiling 1 file to /dev/null
info: Compiled in 3.82s (0 errors)

If we introduce an error, for example forgetting to import io.circe.parser._, we get immediate feedback:

$ sbt docs/mdoc
info: Compiling 1 file to /dev/null
error: README.md:37:9: not found: value decode
val r = decode[User](s) // Right(User("Jane", "Doe"))
        ^^^^^^
info: Compiled in 3.08s (1 error)

Compared to using a test suite for recipes, this has the advantage that markdown files are more easily discoverable and the disadvantage that the code doesn’t actually execute.

Conclusion

Writing great docs is hard, but we don’t have to settle for scattered, stale documentation. If we invest in some simple up-front setup, we can leverage a compiler and CI to keep our docs up-to-date and useful with minimal ongoing effort.

Sane Scala Dependencies in a Poly-Repo Codebase

2021-01-31T15:00:00+00:00

Introduction

In this post I’ll offer some good practices for managing Scala dependencies in a poly-repo codebase. In other words, I have some tips for avoiding dependency hell when working across multiple Scala/SBT projects.

There are many definitions of dependency hell. For me it’s tracing through a half-dozen repos to fix a cryptic NoSuchMethodError or spending a day bumping versions and releasing artifacts to propagate a trivial bug-fix across several repos.

I’ll start by defining some anti-patterns which lead to dependency hell. Then I’ll provide several solutions which require only existing tools and discipline. I’ll cover my experience with Scala Steward, a tool that offers tremendous value if used correctly and moderate chaos if not. Finally, I’ll sketch out some features I’d like to see in SBT or perhaps an open-source SBT plugin.

The problems and solutions I’ll cover are based on almost three years in a poly-repo Scala codebase. I grew into the role of “SBT guy” and spent time simplifying dependency graphs, fixing dependency conflicts, re-structuring repos, reviewing PRs for backwards-compatibility, and improving various build tools to support all of the above.

It’s unlikely any solutions I present are particularly novel. I learned them through a combination of trial and error and exploring major open-source projects. They’re all fairly obvious, especially after you’ve burned yourself a few times. Finally, the exact terms and examples are specific to Scala and SBT, but the principles should apply to other languages and build tools.

A Poly-Repo Codebase

To be clear, here’s what I mean by poly-repo codebase:

Around ten to twenty repos – enough to overwhelm a new-hire, but an experienced developer with some tribal knowledge can keep a mental model.
Each repo is a single SBT project, typically with SBT sub-projects. Repo and SBT project are interchangeable concepts.
Each repo produces one or more artifacts. E.g., a service publishes its backend as a Docker container and client as a JAR.
Each repo is of non-trivial scope and complexity. Let’s say a full round of CI/CD for one repo takes about 30 minutes.
There are dependency chains spanning multiple repos. E.g., megacorp-backend depends on megacorp-sso depends on megacorp-utils, and all three artifacts are published from separate repos.

The Roads to Dependency Hell

Version Conflicts

When we add a dependency, we also add transitive dependencies (the dependency’s dependencies). Each of these has a specific version, and it’s inevitable that any interesting project will end up with conflicting versions among its transitive dependencies.

A concrete example: say we use akka-http in a service and want to add akka-grpc. We already use circe and akka-http-circe for marshaling JSON in akka-http. akka-grpc uses scalapb for generating Protobufs, but we still need some Protobuf-to-JSON conversion for backwards compatibility. So we grab scalapb-circe, which uses circe to convert protobufs to and from JSON.

At this point our dependencies look like this:

An arrow from A to B indicates that A depends on B, and each of these arrows is a new opportunity to introduce a version conflict. As of January 2021 there exists no conflict-free combination of these dependencies. For example, akka-grpc 1.0.2 depends on scalapb-runtime 0.10.8, but there’s no version of scalapb-circe which depends on scalapb at exactly 0.10.8.

So they actually look more like this:

Note that scalapb-circe and akka-http-circe might also depend on different versions of circe, and akka-grpc and akka-http-circe might depend on different versions of akka-http. So if we’re really unlucky we end up with this:

This is an exaggeration for this particular scenario, but it’s not surprising to find this general structure in a very large project.

So what happens when we introduce version conflicts?

The JVM can only load one version of a dependency at runtime. There are various ways to produce an executable artifact from SBT, and we can generally customize how they handle a version conflict. The default behavior is typically to pick the newest version of any given dependency and evict all others. This means if one dependency requires version 1.2.3 of library A and another dependency requires version 2.3.4 of library A, then only 2.3.4 will be available at runtime. We can run sbt evicted to see which versions are picked and which are evicted.
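As one example of customizing this behavior, SBT's dependencyOverrides setting lets us pin the version that wins, regardless of what the transitive dependencies ask for. This is only a sketch: the coordinates and version below are illustrative and not a recommendation for the akka-grpc scenario above.

```scala
// build.sbt
// Force a single version of a (possibly transitive) dependency.
// Coordinates and version are illustrative only.
dependencyOverrides += "io.circe" %% "circe-core" % "0.13.0"
```

Pinning a version this way doesn't make the libraries compatible. It only makes the choice explicit, so we still need to run the code to find out whether the chosen version works.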

If the conflicting library versions are not binary compatible along a codepath called from our project, then we’ll find runtime exceptions like ClassNotFoundException or NoSuchMethodError. Sometimes we get lucky and we don’t actually call the binary-incompatible part of the library, but the only built-in way to verify this is to run the code.

Library authors generally try to avoid breaking binary compatibility, but it’s easy to overlook breaking changes. Even trivial changes like adding a default case class member or a default method parameter break binary compatibility.
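To see why a default member is a breaking change, it helps to look at the constructors that end up in the bytecode. In this sketch, ConfigV1 and ConfigV2 are hypothetical stand-ins for two releases of the same library class, and ctorArity is a small helper I added to inspect them via Java reflection.

```scala
object BinaryCompatDemo {
  // "Version 1" of a library class.
  final case class ConfigV1(host: String)

  // "Version 2" adds a member with a default value.
  final case class ConfigV2(host: String, port: Int = 80)

  // Number of parameters on the class's public constructor in the bytecode.
  def ctorArity(c: Class[_]): Int =
    c.getConstructors.head.getParameterCount
}
```

Source code written as ConfigV2("localhost") still compiles, because the compiler fills in the default at the call site. But code that was compiled against version 1 looks for a one-argument constructor, and version 2 only has a two-argument one. That mismatch is what surfaces as a NoSuchMethodError at runtime.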

Library authors also generally follow semver, meaning a breaking change results in a major version bump. Even if everyone follows semver perfectly, libraries evolve at different speeds. I’ve found myself in this scenario several times: we need a bug-fix or feature from the latest version of library A. We already use the latest version of library B, which depends on an older, binary-incompatible version of library A. The maintainers of B are nowhere to be found. We have a few options:

Update A and risk runtime exceptions.
Update B’s version of A, PR the change, and wait for the maintainers to merge and release the update.
Update B’s version of A and release our own artifact internally, effectively a fork.

None of these are particularly productive patterns.

In short, as we add dependencies, we inevitably end up with conflicting versions of transitive dependencies. If these conflicting versions are binary-incompatible, we’ll find ourselves debugging runtime exceptions.

To clear up any additional confusion, I recommend reading the official Scala documentation on Binary Compatibility for Library Authors, especially the section Why Binary Compatibility Matters.

Circular Dependencies

A circular dependency occurs when repo A depends on repo B and repo B depends on repo A.

In my experience, it’s rarely this simple. It’s more common to find several hops across multiple repos:

Again, an arrow from A to B means A depends on B.

SBT will prevent you from introducing circular dependencies within a single SBT project, but it has no way to prevent a circular dependency across SBT projects (i.e., across repos).

A more accurate representation is this:

From SBT’s perspective, v1 and v2 of repo A are distinct artifacts. It doesn’t know that they come from the same repo. However, it’s clearly an undesirable state. At some point we’ll need to make a breaking change in A and propagate the change through B, C, and D. Since D still depends on an older version of A, we’ll be forced to break B, C, and D along the way.

I’ve seen this several times when writing tests for utility code: we have a repo of common utilities at the root of the dependency graph. We write some tests and publish a new utility, but then encounter an issue in a downstream project caused by a bug in the utilities. We fix the bug and want to add some regression tests in the utilities project, but the bug involves some constructs from the downstream project. Instead of copying the code, we introduce a test dependency from the utility tests to the downstream project and write a nice regression test. SBT doesn’t prevent this, as it’s unaware of the broader dependency graph. Code-reviewers are happy to see more tests and overlook the dependency. We only discover the issue several months later when trying to update to a new version of some library used in the utilities and find runtime exceptions in the tests.

Long Dependency Chains

One natural side-effect of a poly-repo codebase is that changes in one repo require changes to other repos.

Many organizations have a project at the root of the dependency hierarchy called common or util. In several cases I’ve found myself working on a service with several library dependencies that depend on the util project. In other words, the util project is a transitive dependency, several times removed:

We reach for a construct from Util, but realize something about it is just slightly wrong. We have two choices. We can upstream a change to Util, make sure it works with A and B, and spend time updating and releasing the intermediate projects. Or we can re-implement a slight variation of the problematic construct and add an aspirational comment:

// TODO: move to util

It seems that the number of aspirational TODOs increases with the length of the dependency chain.

The Road to Sanity

Inspect Evictions

In a perfect world, sbt evicted would return nothing, meaning there are no version conflicts. In reality, this is virtually impossible. Even the most critical, carefully-maintained open-source libraries have evictions. There are some low-level, broadly-used libraries which will virtually always appear on this list (e.g., slf4j-api).

One of the simplest things we can do is to inspect the output of sbt evicted when adding or updating dependencies. When bumping a dependency to a newer version, we should compare the output from sbt evicted before and after to make sure there aren’t any new, binary-incompatible evictions.

Test, test, test

We can’t entirely eliminate version conflicts, and the Scala compiler also can’t help us detect problematic conflicts – instead, they manifest as runtime exceptions like NoSuchMethodError or ClassNotFoundException.

One of the simplest and most effective things we can do is to improve test coverage so that all important codepaths are regularly tested and runtime exceptions are found in CI. It’s really pretty simple: disciplined testing gives us the opportunity to crash and debug in CI rather than in production.

Reduce Dependencies

Scala has a relatively un-opinionated standard library (e.g., no standard JSON or CSV support). So we end up using a lot of libraries. There’s nothing inherently wrong with that, but we should be intentional about the dependencies we use.

When considering a dependency, we can evaluate three criteria:

Is it really necessary? The simplest thing we can do is avoid carelessly introducing over-complicated dependencies. For example, breeze is a neat library, but has a non-trivial dependency footprint. So if we just need to compute summary statistics or do some basic vector arithmetic, we’re probably better off using built-in collection methods (.sum, .length, .zip, .product, etc.) or using a zero-dependency alternative like Apache commons-math3.
Does the library have a reasonable number of dependencies? I usually bias towards small and focused libraries. It’s really just a numbers game. Each dependency has its own transitive dependencies, and each new dependency is a new opportunity for version conflicts.
Do the authors regularly and thoughtfully update their dependencies? At the very least, the authors should release a new version for every major release of their dependencies.

A good example for the second and third criteria is elastic4s, a client and DSL for Elasticsearch. The library includes several sub-projects for using elastic4s with different effect systems (elastic4s-effect-cats, elastic4s-effect-monix, etc.), each published as a distinct artifact. The author also releases a new version for every version of Elasticsearch.

Carefully Structure Sub-projects

SBT allows us to define sub-projects within a larger project, and each sub-project is published as a distinct artifact.

When building projects that are published and consumed by downstream users, we should structure them thoughtfully, and in particular avoid grouping many unrelated dependencies in a single project.

For example, we might have a fairly large utils project with, among others, utilities for consuming data from Amazon S3 and utilities for working with JSON data. The S3 utilities depend on the S3 Java SDK, and the JSON utilities depend on circe. We should split the project into two SBT sub-projects, published as distinct artifacts. S3 users need only depend on the S3 artifact and its dependencies. JSON users only need the JSON artifact and its dependencies. By defining this separation, we decrease the chance of dependency conflict for both sets of users. Users working with S3 and JSON can just use both of the artifacts.
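A build.sbt for that split might look roughly like the sketch below. The project names, directories, and dependency versions are all placeholders I chose for illustration.

```scala
// build.sbt (sketch; names and versions are placeholders)
lazy val utilsS3 = project.in(file("utils-s3"))
  .settings(
    name := "megacorp-utils-s3",
    // Only the S3 sub-project pulls in the AWS SDK.
    libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.946"
  )

lazy val utilsJson = project.in(file("utils-json"))
  .settings(
    name := "megacorp-utils-json",
    // Only the JSON sub-project pulls in circe.
    libraryDependencies += "io.circe" %% "circe-core" % "0.13.0"
  )

// The root project aggregates both but publishes nothing itself.
lazy val root = project.in(file("."))
  .aggregate(utilsS3, utilsJson)
  .settings(publish / skip := true)
```

Running sbt publish from the root then produces two artifacts, and a downstream project lists only the ones it needs.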

This is a pattern used by many libraries. For example, AWS publishes over 300 Java SDKs, delineated by their cloud services. Lightbend’s alpakka project includes over 40 sub-projects for using akka and akka streams with various datastores.

The main advantage of having small, focused sub-projects is that it decreases the opportunity for version conflicts in downstream projects. Remember, it’s a numbers game. It also decreases the time spent downloading artifacts in CI and the size of artifacts published from downstream projects which use our artifacts.

It might seem tedious to carefully define sub-projects and tempting to clump everything together. I’ve found the effort is a one-time investment, and we reap the rewards for the lifetime of the project.

Pin Standard Versions Across Your Organization

An organization typically arrives at some de facto official libraries – maybe akka-http for web services and cats-effect for IO.

Beyond picking standard libraries, we should also agree on standard versions of these libraries.

For example, we agree to use akka-http 10.1.12. When 10.2.0 comes out, we carefully examine the changelog, test it out on branches in several important projects, and only then promote it as the new official version. This diligence serves to prevent upgrading a version of a widely-used dependency in one project, only to introduce a binary-incompatibility with other projects.

We can go a step further and enumerate the standard versions in an internally-published SBT plugin. This allows users to forget about specific library versions in favor of simpler syntax like:

libraryDependencies ++= Seq(
  StandardLibraries.akkaHttp,
  StandardLibraries.catsEffect,
  ...
)

This pattern ensures that the standard versions are defined in code rather than some random wiki. Versions are promoted via code-reviewed PRs, and we can even write plugin tests to catch new evictions caused by upgrading a standard version. Users can forget about bumping a bunch of library versions and instead just make sure they’re using the latest version of the plugin.
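Here is a minimal sketch of what such a plugin could look like. The plugin object name and the cats-effect version are assumptions for illustration. Only the StandardLibraries names and the akka-http version come from the examples above.

```scala
// Source file in the internally-published plugin project (hypothetical names).
import sbt._

object StandardLibrariesPlugin extends AutoPlugin {

  // Enable the plugin automatically in every project of a build that adds it.
  override def trigger = allRequirements

  // Everything inside autoImport is visible in build.sbt without an import.
  object autoImport {
    object StandardLibraries {
      val akkaHttp: ModuleID = "com.typesafe.akka" %% "akka-http" % "10.1.12"
      // Illustrative version; the organization would pick its own.
      val catsEffect: ModuleID = "org.typelevel" %% "cats-effect" % "2.2.0"
    }
  }
}
```

Promoting a new standard version then amounts to changing one string in this file and releasing a new version of the plugin.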

As an aside, I find it extremely useful to have an internal plugin that abstracts away the common SBT boilerplate for loading credentials, running linters, releasing, etc.

Prune Unused Dependencies

When we include a dependency, that dependency becomes available to downstream users of our published artifacts and can cause dependency conflicts in their projects. Even if it doesn’t cause a conflict, it gives downstream users an opportunity to start using the dependency transitively, only to find a broken build once we remove the unused dependency from the original project.

As such, we should regularly prune unused dependencies, especially in projects which are published as libraries.

The wonderful sbt-explicit-dependencies plugin already includes two tasks for pruning unused dependencies:

unusedCompileDependencies prints a list of the unused dependencies
unusedCompileDependenciesTest throws an exception if there are any unused dependencies

Unfortunately there are some subtle edge cases to be aware of. For example, some very common logging libraries (e.g., logback-classic) are loaded exclusively at runtime. So the unusedCompileDependencies task will mark logback-classic as an unused dependency. The solution is to either add these dependencies with a % Runtime modifier or tune the unusedCompileDependenciesFilter configuration.
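The two workarounds might look like the following in build.sbt. This is a sketch that assumes the sbt-explicit-dependencies plugin is already enabled; the logback version is illustrative.

```scala
// build.sbt: two alternatives, use one or the other.

// Option 1: scope the dependency to the Runtime configuration,
// so it is no longer treated as a compile dependency.
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.2.3" % Runtime

// Option 2: keep it as a compile dependency, but exclude it from the check.
// libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.2.3"
// unusedCompileDependenciesFilter -= moduleFilter("ch.qos.logback", "logback-classic")
```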

Prune Redundant Transitive Dependencies

This tip is subtle and still up for debate in my own mind. Anyone who finds it confusing can safely skip it.

A redundant transitive dependency occurs when we declare dependencies A and B, where B already depends on A. We already get A by depending on B, so we don’t need to declare it explicitly. In this case, A is a redundant transitive dependency.

For example, circe already depends on cats-core. So if we declare both circe and cats-core in our libraryDependencies, cats-core is a redundant transitive dependency.

Why should we avoid redundant transitive dependencies? It prevents us from accidentally updating the redundant transitive dependency without updating the dependency that originally brought it in. For example, if we declare both circe and cats-core and later see a new release of cats-core, we might update cats-core without updating circe. This evicts the version of cats-core expected by circe and can introduce a binary incompatibility.
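In build.sbt terms, the difference is small. The version numbers in this sketch are illustrative and are not meant to reflect the exact cats-core version that any circe release depends on.

```scala
// build.sbt

// Redundant: cats-core is declared even though circe-core already brings it in.
// Bumping only the cats-core line later can evict the version circe expects.
// libraryDependencies ++= Seq(
//   "io.circe"      %% "circe-core" % "0.13.0",
//   "org.typelevel" %% "cats-core"  % "2.1.1"
// )

// Preferred under this tip: declare only circe-core and get cats-core transitively.
libraryDependencies += "io.circe" %% "circe-core" % "0.13.0"
```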

There’s also a pretty simple argument against this tactic: if we only declare circe, start heavily using cats-core, and for whatever reason stop using circe, then circe technically becomes an unused dependency. And I previously suggested we avoid unused dependencies.

On this topic, the sbt-explicit-dependencies plugin includes a couple of tasks designed for the opposite of what I recommend:

undeclaredCompileDependencies prints a list of dependencies used to compile which are not explicitly declared
undeclaredCompileDependenciesTest throws an exception if there are any dependencies used to compile but not explicitly declared

In my experience it can be quite shocking (even JARring) to see all of the dependencies used in a large project. I find explicitly enumerating them all adds little value, is tedious to maintain, and increases the chance that someone introduces a binary incompatibility by updating one of them.

Scala Steward

Scala Steward, in the authors’ words, is “a bot that helps you keep your scala projects up-to-date.” It works by cloning an SBT project, checking for new versions of its dependencies, and creating a branch and PR for each out-of-date dependency.

Here’s a Scala Steward PR to update akka-actor and akka-stream in the elastic4s repo.

It actually re-writes source code, which is impressive considering there are several ways to define dependencies in SBT.

Scala Steward runs for free on open-source repos and we can use the Scala Steward docker image to run it on internal infrastructure for private repos.

In the past I’ve used it by setting up a weekly Scala Steward cronjob for each internal repo. This seemed like an obvious win at first. Nobody wants to spend time checking for new versions and baby-sitting version bump PRs – let a bot do it.

I learned there are some pitfalls if it’s not rolled out carefully.

The main issue is that repos are maintained at different cadences, and introducing Scala Steward to a fleet of diverse but dependent repos will lead to version conflicts. Organizations usually have some actively-maintained repos with snappy and robust CI. They also have some barely-maintained legacy repos with insufficient or flaky CI. When an active repo starts merging and releasing dependency updates, and a legacy repo doesn’t, we end up with dependency conflicts when the two repos’ artifacts are combined in a third downstream repo.

I found the best way to use Scala Steward is to first create an internal SBT plugin which enumerates standard versions of libraries used across the organization. I elaborated on the exact design of this above. The gist of it is that we should only be updating versions of broadly-used external libraries in one place, and we should coordinate these updates among teams and repo maintainers. Ideally we arrive at a steady-state where Scala Steward only updates internal dependencies and rarely-used external dependencies. Unless all of our repos have uniform maintenance and release cycles, setting Scala Steward loose to update all dependencies can get ugly.

An SBT Wishlist

Finally, here are some dependency management features I’d like to see in SBT or an SBT plugin.

Detect Circular Dependencies

This feature would detect circular dependencies across repos, addressing the issues in the circular dependencies section.

One possible implementation: provide a task called circularTest. It grabs the organization and project name, walks the project’s dependency graph, and throws an exception if it finds any artifacts with the same organization and name. I think it’s safe to assume another artifact with the same organization and name is simply an older version of the current project. We would execute this task in CI.
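A rough, untested sketch of how this might look as a custom task in build.sbt follows. It assumes the resolved update report lists only the project's dependencies and not the project itself, and that cross-built artifacts carry the usual Scala binary version suffix.

```scala
// build.sbt: an untested sketch of the hypothetical circularTest task.
lazy val circularTest = taskKey[Unit](
  "Fails if an artifact with this project's organization and name is in its own dependency graph"
)

circularTest := {
  val org = organization.value
  val plainName = moduleName.value
  // Cross-built artifacts are published with a suffix such as _2.13.
  val crossName = s"${plainName}_${scalaBinaryVersion.value}"

  // Every module resolved for this project, across all configurations.
  val offenders = update.value.allModules.filter { m =>
    m.organization == org && (m.name == plainName || m.name == crossName)
  }

  if (offenders.nonEmpty)
    sys.error(s"Possible circular dependency: ${offenders.mkString(", ")}")
}
```

In a multi-project build, the task would need to be defined in shared settings so that it runs for each sub-project.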

Note that this doesn’t actually prevent the introduction of a circular dependency – it simply detects that one exists. In order to truly prevent circular dependencies, the SBT project would have to know about the projects which depend on it. This would require maintaining an aggregate view of the dependency graph, external to any single project.

Disallow Evictions by Default

This feature would disallow all evictions (version conflicts) by default, except for a user-provided list of permitted evictions.

One possible implementation: provide a task called evictedTest, which uses the existing evicted task to find all evictions, checks them against a list of permitted evictions defined in the project’s settings, and throws an exception if a new eviction is found. We would execute this task in CI.
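Here is a rough, untested sketch of the idea. Note that it swaps in a different mechanism: instead of calling the evicted task, it reads the evicted flag on the module reports from the update task. The permittedEvictions setting and its example entry are hypothetical.

```scala
// build.sbt: an untested sketch of the hypothetical evictedTest task.
lazy val permittedEvictions = settingKey[Set[(String, String)]](
  "Evictions we have reviewed and accepted, as (organization, name) pairs"
)
lazy val evictedTest = taskKey[Unit]("Fails if there is an eviction that is not explicitly permitted")

// Hypothetical entry; names include the Scala binary version suffix where applicable.
permittedEvictions := Set("org.typelevel" -> "cats-core_2.13")

evictedTest := {
  val permitted = permittedEvictions.value
  // Collect every (organization, name) for which some version was evicted.
  val found = for {
    config <- update.value.configurations
    detail <- config.details
    if detail.modules.exists(_.evicted)
  } yield (detail.organization, detail.name)

  val unexpected = found.toSet -- permitted
  if (unexpected.nonEmpty)
    sys.error(s"Unpermitted evictions: ${unexpected.mkString(", ")}")
}
```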

Perhaps there’s also a way to use Mima to check specifically for binary-incompatible evictions?

Similar to circularTest, this doesn’t actually prevent version conflicts. It simply forces us to explicitly define them, so that we know what we’re getting into before merging and releasing a dependency update.

Appendix

What about mono-repos?

I’ve worked in some big repos, but never an intentional mono-repo. It can be a controversial topic, but all merits aside, I think it’s much more natural for an organization to expand horizontally across repos, rather than vertically within a single repo. I also don’t think GitHub, GitLab, Bitbucket, et al. offer many features that truly support or encourage a mono-repo. In fact it seems that features like GitHub’s repository templates actively encourage repo proliferation.

I’d like to find some time to explore newer build tools like Pants and Bazel, which seem to facilitate mono-repo development. I really hope that the smart folks at Google, Facebook, Twitter, Microsoft, etc. moved to mono-repos at least partially because they solve a lot of the dependency-related issues I covered.

Finding Dependency Versions

To find the exact versions of a library’s dependencies, just find the library on search.maven.org (e.g., akka-grpc), and you’ll see an XML section containing the exact dependencies and versions:

  ...
  <dependency>
    <groupId>io.grpc</groupId>
    <artifactId>grpc-core</artifactId>
    <version>1.32.1</version>
  </dependency>
  ...
