<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="https://alexklibisz.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://alexklibisz.com/" rel="alternate" type="text/html" /><updated>2026-03-08T17:00:31+00:00</updated><id>https://alexklibisz.com/feed.xml</id><title type="html">Alex Klibisz</title><entry><title type="html">Terrifi: a Terraform provider for UniFi networks (built with Claude Code and hardware-in-the-loop testing)</title><link href="https://alexklibisz.com/2026/03/07/terrifi.html" rel="alternate" type="text/html" title="Terrifi: a Terraform provider for UniFi networks (built with Claude Code and hardware-in-the-loop testing)" /><published>2026-03-07T00:00:00+00:00</published><updated>2026-03-07T00:00:00+00:00</updated><id>https://alexklibisz.com/2026/03/07/terrifi</id><content type="html" xml:base="https://alexklibisz.com/2026/03/07/terrifi.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>My most recent hobby project has been <a href="https://github.com/alexklibisz/terraform-provider-terrifi">Terrifi</a>, a Terraform provider to manage my home UniFi network.</p>

<p>At the time of writing, the provider supports <a href="https://github.com/alexklibisz/terraform-provider-terrifi/tree/main/examples/alexnet">my entire home UniFi network</a> and is live on the <a href="https://search.opentofu.org/provider/alexklibisz/terrifi/latest">OpenTofu Registry</a>. I just released <a href="https://github.com/alexklibisz/terraform-provider-terrifi/releases/tag/v0.2.0">version 0.2.0</a>, and it might be ready enough for others to try.
The provider also includes a CLI that makes it trivial to import existing resources to Terraform files.</p>

<p>This is my first end-to-end “vibe-coded” project, built mostly using Claude Code.
I’ve used Terraform a lot over the years, and I’ve implemented some mildly-interesting modules, but I’ve never written a provider, and I’ve never used Golang in any serious capacity.</p>

<p>It’s also the first time I’ve built a hardware-in-the-loop test harness for a personal project.
In short, every feature of Terrifi is validated end-to-end on real UniFi hardware.
I think it was a particularly useful way to use Claude.</p>

<p>So this all seemed sufficiently fun and interesting to justify a post.</p>

<h2 id="my-home-network">My Home Network</h2>

<p>I started using UniFi for my home network about six months ago.
This section is just a quick overview of my setup.
I think this is pretty basic, so feel free to skip if you’re already familiar with UniFi hardware.</p>

<h3 id="hardware">Hardware</h3>

<p>At this point my network consists of the following:</p>

<ol>
  <li><a href="https://store.ui.com/us/en/products/uxg-lite">UniFi Gateway Lite</a> - the primary router+firewall.</li>
  <li><a href="https://store.ui.com/us/en/products/u7-lite">Access Point U7 Lite</a> - access point for most of the house.</li>
  <li><a href="https://store.ui.com/us/en/products/uap-ac-pro">AC Pro</a> - access point for the garage (bought used on eBay for ~$40).</li>
  <li>10-port unmanaged PoE switch - this is how I connect the access points and all other wired devices to the Gateway Lite.</li>
  <li>Raspberry Pi 4 (4GB RAM) running <a href="https://help.ui.com/hc/en-us/articles/360012282453-Self-Hosting-a-UniFi-Network-Server">UniFi OS Server</a> - this is the control plane for the network.</li>
</ol>

<p>As far as I know, this is a pretty standard UniFi setup.
The only notable part is the UniFi OS Server.</p>

<p>Unlike most routers I’ve used in the past, the Gateway Lite doesn’t actually host its own control plane (the thing that lets you see clients, configure the firewall, static IPs, WiFi passwords, etc).
Some of the higher-end UniFi hardware has the control plane built in, but for the cheaper stuff you either have to use their managed/online offering or host it yourself.
I chose the latter, mostly because I just like to keep things local when possible.</p>

<h3 id="structure">Structure</h3>

<p>This might deserve its own post, but the following aspects are worth mentioning.</p>

<p>I have 5 networks, and each of these is also a zone for firewall purposes:</p>

<ol>
  <li>Default/Internal: this is everything connected via Ethernet, so a couple Proxmox hosts running some self-hosted services (Home Assistant, Scrypted, Joplin, Immich, and Cusdis), a TrueNAS server, and a Mac Mini.</li>
  <li>Personal Devices: this is for our personal laptops, smartphones, and tablets.</li>
  <li>Apple Home: this is for all our Apple home devices, so a couple Apple TVs, a HomePod, and a couple AirPort Express units used as AirPlay adapters for speakers.</li>
  <li>IoT: for all our WiFi IoT devices, so a couple smart vacuums, smart plugs, cameras, home alarm, my Tesla vehicle, and my Tesla Powerwalls.</li>
  <li>Untrusted: the default network for anything connected to WiFi. If I trust the device, I promote it to one of the other networks.</li>
</ol>

<p>I have 3 WiFi networks:</p>

<ol>
  <li>A 5GHz network for anything that can use 5GHz.</li>
  <li>A 2.4GHz network for all the IoT devices that can only use 2.4GHz.</li>
  <li>A 5GHz guest network.</li>
</ol>

<p>And I have some pretty basic firewall rules.
By default no device can communicate across networks.
And then there are exceptions, for example:</p>

<ul>
  <li>Home Assistant and Scrypted can communicate with anything on the IoT network.</li>
  <li>Personal Devices can communicate with any of the Apple Home devices, mostly to allow AirPlay.</li>
  <li>Some Personal Devices can communicate with anything on the Internal network, so I can administer all of this from my laptop.</li>
  <li>Several classes of IoT devices are blocked from communicating with the external world. For example, I block all my Tapo cameras from communicating with the Internet, except for NTP, which is needed to set the camera time (although I’d like to eventually self-host an NTP server for this). I’m also considering refactoring to a deny-by-default setup for IoT devices.</li>
</ul>

<h2 id="terraform-and-unifi">Terraform and UniFi</h2>

<h3 id="where-the-provider-fits">Where the Provider Fits</h3>

<p>The UniFi OS Server talks to the UniFi Gateway and access points, essentially telling them how to behave.</p>

<p>The UniFi OS Server follows a pretty standard architecture.
There’s a MongoDB database and a Java server.
The Java server exposes some API endpoints (JSON over HTTP) and serves up a very nice client-side browser app.
There’s also an official UniFi mobile app which talks directly to the server, presumably over the HTTP endpoints.</p>

<p>So a Terraform provider hits the API endpoints on the UniFi OS Server, essentially just like the web app or mobile app.</p>

<h3 id="why-it-helps">Why it Helps</h3>

<p>Why do I need Terraform for my home network?
Basically all the typical reasons for using infrastructure-as-code.
Just calling out a few:</p>

<ul>
  <li>With a sufficiently complicated network, editing text is way faster than using a UI.</li>
  <li>I can see and read all the configuration in one place.</li>
  <li>I can ask an LLM to make changes for me, and I still get to review it all before it takes effect.</li>
</ul>

<h2 id="why-not-use-an-existing-provider">Why not use an existing provider?</h2>

<p>There are a few community providers that have been developed over the years.
As far as I can tell, the main ones are:</p>

<ul>
  <li><a href="https://github.com/paultyng/terraform-provider-unifi">paultyng/terraform-provider-unifi</a></li>
  <li><a href="https://github.com/filipowm/terraform-provider-unifi">filipowm/terraform-provider-unifi</a></li>
  <li><a href="https://github.com/ubiquiti-community/terraform-provider-unifi">ubiquiti-community/terraform-provider-unifi</a></li>
</ul>

<p>I gave these a try and ran into problems very quickly.
Basic things like editing the name of a network crashed with a 400 response from the server.
They also seem to be somewhere between abandoned and barely maintained.</p>

<p>Ideally, I would be a good community member and open a bunch of PRs to the existing providers and upstream the fixes.
But when I started this, it wasn’t clear that anyone was even reading the issues, let alone the PRs.
Claude and I did some research into upstreaming fixes for ubiquiti-community/terraform-provider-unifi, and concluded it would be cleaner to start fresh.</p>

<p>I’m certainly not opposed to upstreaming fixes in the long run.
And it’s all open-source, so anyone is entitled to take any and all parts of this.
I suspect the most useful contribution of this provider is not the actual code, but rather the hardware-in-the-loop testing and development setup.</p>

<p>It’s also worth mentioning that maintaining anything related to the UniFi API seems very complicated.
As far as I can tell, UniFi does not officially support or document their API.
In the process of building Terrifi, Claude and I found a ton of quirks about the API.
And that’s just on my own single version of UniFi with fairly basic hardware and architecture.
I can’t imagine what it would be like to try to support the entire API surface.
So I’m not at all surprised by the state of the Terraform providers.</p>

<h2 id="what-terrifi-brings-to-the-table">What Terrifi brings to the table</h2>

<ol>
  <li>A handful of resources for managing a basic UniFi network, all documented on the <a href="https://search.opentofu.org/provider/alexklibisz/terrifi/latest">Tofu Registry</a>.</li>
  <li>A CLI for a few related tasks, like generating Terraform imports and resources from your existing network.</li>
</ol>

<p>The number of resources is limited compared to some other providers.
I’ve prioritized necessity and quality over breadth.
If I didn’t need it, I didn’t add it.
Every PR goes through automated testing on real UniFi hardware, covered more below.
If I can’t test it with real hardware, I won’t add it.
I’ll consider other resources, as long as they can be thoroughly tested.</p>

<p>The CLI is also somewhat novel.
Anytime I use a new Terraform provider, it’s a big pain to import and re-define all the existing infrastructure.
So Claude and I built the CLI to simplify this process.
You can just run <code class="language-plaintext highlighter-rouge">terrifi generate-imports &lt;resource name&gt;</code>: it calls your UniFi server to get the existing infrastructure and prints out the corresponding <code class="language-plaintext highlighter-rouge">import</code> and <code class="language-plaintext highlighter-rouge">resource</code> blocks.
You’ll still need to do some editing and re-arranging, but it saves a ton of time.</p>

<h2 id="hardware-in-the-loop-testing">Hardware-in-the-loop Testing</h2>

<p>In this section I’ll stick to covering the <em>why</em> of hardware-in-the-loop (HIL) testing.
The repository includes <a href="https://github.com/alexklibisz/terraform-provider-terrifi/tree/main/hardware-testing">documentation and source code for the setup</a>, and I intend to keep the repo up-to-date as it evolves, whereas the post is a point-in-time snapshot.</p>

<p>So why do we need HIL testing?</p>

<p>Claude and I very quickly found there was a long tail of undocumented behaviors in the UniFi API.
There were even issues just making requests and parsing responses with the <a href="https://github.com/ubiquiti-community/go-unifi">community Go SDK</a>.
I had Claude summarize these quirks; <a href="#quirks-in-the-unifi-api-and-go-unifi-sdk">see the appendix</a>.</p>

<p>The existing providers test against a Docker container that’s running the UniFi API in simulation mode.
That’s better than just unit testing, but it’s clearly not enough.
The simulation mode supports very few of the resources.
For example, to create a WiFi network, you need a real access point.
To create a firewall zone, you need a real gateway.</p>

<p>So it quickly became clear that I needed to run these tests against real hardware.</p>

<p>I didn’t want to break my actual home network for this purpose, so I went on Amazon and eBay and bought the cheapest real hardware I could find.
The Gateway Lite was something like $55 on Amazon and the AC Pro was like $35 on eBay.
I already had a travel router, switch, and mini PC available from other projects, and I already had a lot of experience with Proxmox, GitHub Actions, and Tailscale, all of which came in handy for the setup.
I built it so that the HIL harness sits behind a router, so it can connect to the Internet but can’t connect to my actual UniFi network.</p>

<p>Getting the setup working well took something like 20 hours.
So nothing major, but I was also re-using a ton of prior knowledge.</p>

<p>I think it has paid off really nicely:</p>

<ul>
  <li>Every GitHub PR runs a full suite of HIL tests against the real hardware.</li>
  <li>Claude makes extensive use of the UniFi OS Server when working on new resources and fixing existing bugs. It will sit there and run ad-hoc curl commands against the UniFi API to figure out how exactly it works, which parts of the community SDK it can use, and which parts need workarounds. This is another benefit of the isolated HIL testing environment; I don’t want it doing this reverse-engineering against my real network.</li>
  <li>Sometimes these tests flake, so I ask Claude to look at the recent flakes and figure out how to fix them. For example, <a href="https://github.com/alexklibisz/terraform-provider-terrifi/pull/53">this PR</a>. I told it something like “run the HIL tests and fix any failures until the test suite has passed five times consecutively”. More recently I started running the HIL testing workflow on an hourly schedule in GitHub Actions. I plan to ask Claude to go find the tests that have flaked and work on a PR to fix them.</li>
</ul>

<p>And here’s how it looks:</p>

<p><img src="hardware.jpg" alt="HIL Testing Hardware" /></p>

<h2 id="vibe-coding-the-provider">Vibe-coding the provider</h2>

<p>Terrifi was largely “vibe-coded”, primarily using Claude Code (Opus 4.6).</p>

<p>My human contribution to this project was building the testing harness, determining the resource APIs, and prompting Claude to implement them.
My direct interaction with source code was minimal.
In code review, I primarily reviewed the docs and tests; I didn’t pay much attention to the actual implementation.</p>

<p>To get this working on my own home UniFi network, Claude and I ended up merging just over 80 pull requests over the course of ~3 weeks, about 75% on weekends.
So it still required some focus and attention, but it was without a doubt much faster than if I had coded it all “by hand”.</p>

<p>I think a few aspects of this project made it particularly amenable to vibe-coding.</p>

<p>The HIL testing harness provides an extremely tight feedback loop.
I can prompt Claude to go implement a Terraform resource or attribute and let it iterate on real hardware to figure out how it should work.
Once it writes the tests, they’ll automatically run for all future PRs.</p>

<p>Performance is mostly inconsequential.
There isn’t going to be some subtle performance regression that gets through CI and a staging environment but crashes production.
The “production” is running a single executable to figure out which resources to create/update/delete.
Funnily enough, there actually was a case where Terraform’s request parallelism and the inefficiency of some response types would make the UniFi OS Server fall over (<a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/81">Tofu apply keeps crashing my UniFi server</a>).
But crashing the UniFi OS Server doesn’t actually affect the network’s functionality, and it was pretty easy to fix by tuning a CLI parameter and introducing a short-lived cache.</p>

<p>Finally, I’m not particularly knowledgeable or opinionated about Golang.
So I wasn’t picky about how the code looked.
Early on, I did ask it to add comments to improve my own understanding of how a provider works.</p>

<p>All that said, it’s by no means perfect.
There were still several cases where Claude’s first-pass implementation passed tests and then had to be revised:</p>

<ul>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/84">Unclear how to use port_group_id with match_opposite_ports</a>: I wanted to block all traffic except to a particular group of ports. The bug was in the interaction of a couple fields in the API. Claude quickly figured this out with a few test requests to the UniFi API and opened a PR to fix it.</li>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/81">Tofu apply keeps crashing my UniFi server</a>: the Terraform provider was firing off many requests in parallel which led to out-of-memory errors on my little Raspberry Pi-hosted UniFi OS Server. Claude solved this by proposing the use of <code class="language-plaintext highlighter-rouge">-parallelism=1</code> and caching some responses.</li>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/70">Unexpected new value was X but now Y</a>: there was a field on the <code class="language-plaintext highlighter-rouge">terrifi_firewall_policy</code> resource that indicated it would be used for ordering the policy, and the API accepted it on POST/PUT requests, but then it didn’t actually persist the value and returned a different one. Claude figured out a different way to order policies, using a new standalone resource <code class="language-plaintext highlighter-rouge">terrifi_firewall_policy_order</code>.</li>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/69">400 when creating policy with mac_addresses in destination</a>: I don’t remember all the details, but some interaction of fields led to a 400.</li>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/issues/65">400 when creating terrifi_firewall_policy with device_ids in source</a>: I don’t remember all the details, but some interaction of fields led to a 400.</li>
</ul>

<p>So overall Claude massively accelerated the implementation, but it still required some amount of judgement, guidance, and feedback.</p>

<h2 id="conclusion">Conclusion</h2>

<p>If you like UniFi and you like Terraform, give Terrifi a try. 
Some good places to start:</p>

<ul>
  <li><a href="https://search.opentofu.org/provider/alexklibisz/terrifi/latest">The OpenTofu Registry</a></li>
  <li><a href="https://github.com/alexklibisz/terraform-provider-terrifi/tree/main/examples">The examples directory</a>, including a <a href="https://github.com/alexklibisz/terraform-provider-terrifi/tree/main/examples/alexnet">partly-redacted copy of my own network setup</a>.</li>
</ul>

<h2 id="appendix">Appendix</h2>

<h3 id="discussion">Discussion</h3>

<ul>
  <li><a href="https://www.reddit.com/r/Ubiquiti/comments/1rnzule/terrifi_a_vibecoded_terraform_provider_to_manage/">Reddit</a></li>
  <li><a href="https://news.ycombinator.com/item?id=47298798">Hacker News</a></li>
</ul>

<h3 id="quirks-in-the-unifi-api-and-go-unifi-sdk">Quirks in the UniFi API and go-unifi SDK</h3>

<p>I prompted Claude Opus 4.6:</p>

<blockquote>
  <p>summarize the quirky behaviors in the unifi API that we’ve had to work around in building this provider</p>
</blockquote>

<p>Here’s what it told me:</p>

<ul>
  <li><strong>v2 API requires <code class="language-plaintext highlighter-rouge">_id</code> in both URL path AND request body (PUT).</strong> Affects firewall zones and firewall policies. The SDK only puts the ID in the URL, but the v2 API returns <code class="language-plaintext highlighter-rouge">500: "The given id must not be null"</code> if it’s absent from the JSON body too.</li>
  <li><strong>DELETE returns 204, not 200.</strong> The v2 firewall zone and policy DELETE endpoints return <code class="language-plaintext highlighter-rouge">204 No Content</code> on success. The SDK only treats <code class="language-plaintext highlighter-rouge">200</code> as success, so it misinterprets successful deletes as errors.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">default_zone: false</code> serialization breaks zone creation.</strong> The SDK’s <code class="language-plaintext highlighter-rouge">FirewallZone</code> struct serializes <code class="language-plaintext highlighter-rouge">"default_zone": false</code> (no <code class="language-plaintext highlighter-rouge">omitempty</code>), which the UniFi v2 API rejects with <code class="language-plaintext highlighter-rouge">400 Bad Request</code>. We had to write a custom minimal request struct without that field.</li>
  <li><strong>v1 vs v2 endpoint inconsistency for firewall zones.</strong> The SDK’s <code class="language-plaintext highlighter-rouge">GetFirewallZone</code> hits the v1 REST endpoint, which doesn’t reliably return <code class="language-plaintext highlighter-rouge">network_ids</code>. Since create/update use v2 (which does return it), Terraform would see empty <code class="language-plaintext highlighter-rouge">network_ids</code> after every refresh, producing phantom diffs.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">port</code> field is a string in the JSON, not an integer.</strong> The SDK defines <code class="language-plaintext highlighter-rouge">FirewallPolicySource/Destination.port</code> as <code class="language-plaintext highlighter-rouge">*int64</code>, but the v2 API returns it as a JSON string (e.g. <code class="language-plaintext highlighter-rouge">"443"</code>). This causes <code class="language-plaintext highlighter-rouge">json.Unmarshal</code> to fail, breaking all GET/list operations for firewall policies.</li>
  <li><strong>Boolean fields serialized without <code class="language-plaintext highlighter-rouge">omitempty</code>.</strong> Affects both firewall policies (<code class="language-plaintext highlighter-rouge">enabled</code>, <code class="language-plaintext highlighter-rouge">logging</code>, <code class="language-plaintext highlighter-rouge">match_ip_sec</code>, etc.) and client devices (<code class="language-plaintext highlighter-rouge">use_fixedip</code>, <code class="language-plaintext highlighter-rouge">local_dns_record_enabled</code>, <code class="language-plaintext highlighter-rouge">fixed_ap_enabled</code>). Without <code class="language-plaintext highlighter-rouge">omitempty</code>, they always serialize as <code class="language-plaintext highlighter-rouge">false</code>, silently clearing controller settings managed outside Terraform.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">setting_preference</code> defaults to <code class="language-plaintext highlighter-rouge">"auto"</code>, overriding user settings.</strong> The SDK’s <code class="language-plaintext highlighter-rouge">marshalCorporate()</code> for networks calls <code class="language-plaintext highlighter-rouge">valueOrDefault(n.SettingPreference, "auto")</code>. Sending <code class="language-plaintext highlighter-rouge">"auto"</code> tells the controller to auto-manage settings like DHCP enable, which stomps on explicitly configured values. We force <code class="language-plaintext highlighter-rouge">"manual"</code> before every API call.</li>
  <li><strong>DHCP range fields crash the controller when empty strings are sent.</strong> Also in <code class="language-plaintext highlighter-rouge">marshalCorporate()</code>: it uses <code class="language-plaintext highlighter-rouge">valueOrDefault(n.DHCPDStart, defaultStart)</code> which fills in a default even for <code class="language-plaintext highlighter-rouge">nil</code> fields. Terraform’s “unknown” computed fields resolve to <code class="language-plaintext highlighter-rouge">""</code>, and passing <code class="language-plaintext highlighter-rouge">&amp;""</code> causes the controller to throw <code class="language-plaintext highlighter-rouge">java.lang.IllegalArgumentException: Could not parse []</code>. We guard with <code class="language-plaintext highlighter-rouge">IsUnknown()</code> checks to skip those fields entirely.</li>
  <li><strong>SDK doesn’t expose its HTTP client or CSRF token.</strong> Because the SDK creates its own internal HTTP client with no <code class="language-plaintext highlighter-rouge">Do()</code> escape hatch, we can’t reuse its session for the v2 API calls that bypass it. We do a full independent login to get our own session cookie + CSRF token — effectively dual-login on every provider initialization.</li>
</ul>

<h3 id="device-types-browser">Device Types Browser</h3>

<p>As part of this project, Claude and I also built a <a href="https://alexklibisz.github.io/terraform-provider-terrifi/device-types/">Device Types Browser</a> — a single-page app to fuzzy-search the ~5600 device types available in UniFi.
This came up when working on the <code class="language-plaintext highlighter-rouge">terrifi_client_device.device_type_id</code> attribute, which lets you set the device type icon and metadata so devices show up with nice icons in the UI.
The built-in UniFi device type browser is pretty limited: fuzzy search doesn’t work well and it’s hard to find the right ID.
So I had Claude add a CLI command that pulls the device type index from the UniFi API and generates the app.</p>

<h3 id="thoughts-on-ai-the-future-of-software-yada-yada">Thoughts on AI, the Future of Software, yada yada</h3>

<p>I doubt I have anything novel to add to the AI discussion.
But it’s my blog, so I’ll opine briefly.</p>

<p>My personal AI/LLM experience has gone something like this:</p>

<ol>
  <li>Late 2022 - mid 2024: This is pretty cool, but spits out a lot of junk. I’ll occasionally ask it some questions or have it wordsmith a document.</li>
  <li>Mid 2024 - mid 2025: This can be quite useful, but you really have to ask the right question and present the right context. The web search integration is a huge unlock for doing research. But it still doesn’t make a big difference in my software work.</li>
  <li>Late 2025 - now: Holy shit this is incredible. Execution and implementation are no longer my moat/bottleneck; now it’s my ability to understand a problem and to orchestrate and manage the agents.</li>
</ol>

<p>At this point, I have to say that Claude Code is the biggest technological advancement I’ve seen in my software engineering career.</p>

<p>I think the rate of advancement and adoption of these tools is largely a problem of tooling and economics.</p>

<p>The tooling problem: how do we adapt and scale existing tools (source control, CI, CD) and processes (code review, planning, deployment) to let these agents work uninterrupted, but in a way that’s secure and works well with human meat brains?
I currently feel a lot of friction having to constantly review the agents’ permission checks, and haven’t quite found a way to let them cook securely.
I’m also starting to do way more concurrent work, and definitely starting to feel the mental cost of context switching and reviewing all my colleagues’ concurrent work.</p>

<p>The economics problem: do the economics of all this really make sense?
Can these companies really afford to sell this for ~$20 to ~$200 / month / user?
Or is it a game of economic chicken, and eventually the one or two winners get to charge $150 for a 30-minute ride to the airport?</p>

<p>It’s both nerve-wracking and exciting to be a participant in this shift.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post summarizes a new project I've been working on recently. It could be useful to some other UniFi and Terraform enthusiasts.]]></summary></entry><entry><title type="html">Some basic smartphone-free controls for the Eight Sleep using Flic buttons and Alexa</title><link href="https://alexklibisz.com/2024/12/30/eight-sleep-alexa-flic-integration.html" rel="alternate" type="text/html" title="Some basic smartphone-free controls for the Eight Sleep using Flic buttons and Alexa" /><published>2024-12-30T00:00:00+00:00</published><updated>2024-12-30T00:00:00+00:00</updated><id>https://alexklibisz.com/2024/12/30/eight-sleep-alexa-flic-integration</id><content type="html" xml:base="https://alexklibisz.com/2024/12/30/eight-sleep-alexa-flic-integration.html"><![CDATA[<h2 id="problem">Problem</h2>

<p>I love my Eight Sleep Pod 3 cover, but I really hate that the only way to control it is through the smartphone app.
Why do I need to use a sleep-disrupting device to control the device that’s supposed to help me sleep?</p>

<p>I’m not the only one who has expressed this frustration:</p>

<ul>
  <li><a href="https://www.reddit.com/r/EightSleep/comments/1ajiin0/way_to_control_the_pod_without_phone/">Reddit: Way to control the pod without a phone</a></li>
  <li><a href="https://www.reddit.com/r/EightSleep/comments/16gu183/way_to_turn_off_8_sleep_without_opening_app/">Reddit: Way to turn off Eight Sleep without opening app</a></li>
  <li><a href="https://www.reddit.com/r/EightSleep/comments/13qhax2/why_is_a_phone_required/">Reddit: Why is a phone required?</a></li>
</ul>

<h2 id="solution">Solution</h2>

<p>I think I finally found a complicated but functional way to get some basic control of the Eight Sleep without a smartphone.</p>

<p>In short, I’m using a set of Amazon Alexa routines, triggered by manual inputs to a pair of <a href="https://flic.io/flic2">Flic 2 buttons</a>, to send text commands to Alexa, which controls the Eight Sleep pod via the <a href="https://www.amazon.com/Eightsleep-Eight-Sleep/dp/B075FGLM9S">Eight Sleep Alexa skill</a>.</p>

<p>I’ll expand a bit below.</p>

<h3 id="alexa-to-eight-sleep-integration">Alexa to Eight Sleep Integration</h3>

<p>I installed the <a href="https://www.amazon.com/Eightsleep-Eight-Sleep/dp/B075FGLM9S">Eight Sleep Alexa skill</a>.
The skill supports some basic voice controls:</p>

<ul>
  <li>“Alexa, ask Eight to set the right side of my bed to one” (or negative one, zero, etc.)</li>
  <li>“Alexa, ask Eight to turn off the right side of my bed”</li>
</ul>

<p>Unfortunately, the voice control is very fragile.
For example, you can’t tell it “increase the temperature by one”.
I tried a dozen different ways, and it just doesn’t know what to do.
And if you mess up the commands even just slightly, Alexa doesn’t know what to do.
This fragility is reflected in the terrible Alexa skill reviews.</p>

<p>But, if you know the right incantation, it executes the limited functionality reliably.</p>

<p>Telling Alexa to set the temperature to two:</p>

<p><img src="alexa-1.png" alt="alexa-1.png" /></p>

<p>Telling Alexa to increase the temperature by one, which it clearly does not understand:</p>

<p><img src="alexa-2.png" alt="alexa-2.png" /></p>

<h3 id="flic-button-to-alexa-integration">Flic Button to Alexa Integration</h3>

<p>I don’t want to talk to Alexa at night.
Even if I did, I probably couldn’t remember the exact command.
I also don’t want an Alexa in my bedroom.
So I need some way to control this manually.</p>

<p>So I bought some <a href="https://flic.io/flic2">Flic 2 buttons</a>.
I actually bought them for another project, but ended up only needing one for that project.
Each button has three possible inputs: a “push”, a “double push”, and a “hold” (basically push and hold for like three seconds).</p>

<p>I installed the <a href="https://flic.io/applications/alexa/setup">Alexa Flic Skill</a>, which adds each input for each button as a distinct smart home device:</p>

<p><img src="alexa-3.png" alt="alexa-3.png" /></p>

<h3 id="flic-to-alexa-to-eight-sleep-routine">Flic to Alexa to Eight Sleep Routine</h3>

<p>With the two Flic buttons and the Eight Sleep skill working, I can set up some routines.</p>

<p>The general form of the routine is: when one of the buttons is pushed/double-pushed/held, send a specific text command to Alexa.
The text command tells Alexa to tell Eight Sleep to do something.</p>

<p><img src="alexa-4.png" alt="alexa-4.png" /></p>

<p><img src="alexa-5.png" alt="alexa-5.png" /></p>

<p>This is the action that needs to be selected to tell Alexa to do something via text:</p>

<p><img src="alexa-7.png" alt="alexa-7.png" /></p>

<p>I’ve mounted the two buttons on my bed-side stand using some double-sided tape, and I’ve configured the following routines:</p>

<ul>
  <li>When button 1 is held, turn off the Eight Sleep.</li>
  <li>When button 1 is double-pushed, set the temperature to negative one.</li>
  <li>When button 1 is pushed, set the temperature to zero.</li>
  <li>When button 2 is pushed, set the temperature to one.</li>
  <li>When button 2 is double-pushed, set the temperature to two.</li>
  <li>When button 2 is held, set the temperature to three.</li>
</ul>

<p><img src="alexa-6.png" alt="alexa-6.png" /></p>

<p>Notice that I’ve set “Hear Alexa From” to be an Echo Dot device.
The other option is “This mobile device”, but when I use that, the integration doesn’t seem to work reliably.
Luckily I had a spare Alexa lying around.
This is the only thing I use it for.
It sits in a cabinet with the microphone muted, quietly whispering what it told the Eight Sleep to do when I press a button.</p>

<h2 id="reflecting">Reflecting</h2>

<p>This solution lets me set the Eight Sleep temperature and turn the pod off via manual controls.
I still use my phone to set and disable the vibrating alarm.
That’s actually a good limitation, as it forces me to get up and walk to my phone in another room instead of opening my phone and pointlessly scrolling in bed.</p>

<p>Still, the fact that Eight Sleep provides zero smartphone-free controls is very disappointing.
If we can agree that Eight Sleep’s mission is to promote good sleep, and we can agree that staring at a screen in the bedroom is not good for sleep, it seems clearly on-mission to provide a way to control the $3000 sleep device without a smartphone.
If someone has already paid $3000+ for the pod and cover, Eight Sleep could probably ship a glorified TV remote for $250 and people would happily buy it.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="cheaper-buttons">Cheaper buttons</h3>

<p>If you don’t already have some Flic buttons, and you also find ~$100 for three buttons and a hub surprisingly pricey, I <em>think</em> this could all work with some cheaper <a href="https://www.aliexpress.us/item/3256807093499837.html">Sonoff buttons from AliExpress</a>.
These are more like $8/button instead of $30/button.
I’ve purchased a handful of these but they haven’t arrived yet.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post covers how I've set up some basic smartphone-free controls for my Eight Sleep cover, using Flic buttons and the Eight Sleep Alexa skill.]]></summary></entry><entry><title type="html">My Homelab, September 2024 (TrueNAS, Proxmox, Tailscale, a 2014 Mac Mini, and more)</title><link href="https://alexklibisz.com/2024/09/27/homelab-september-2024.html" rel="alternate" type="text/html" title="My Homelab, September 2024 (TrueNAS, Proxmox, Tailscale, a 2014 Mac Mini, and more)" /><published>2024-09-27T15:00:00+00:00</published><updated>2024-09-27T15:00:00+00:00</updated><id>https://alexklibisz.com/2024/09/27/homelab-september-2024</id><content type="html" xml:base="https://alexklibisz.com/2024/09/27/homelab-september-2024.html"><![CDATA[<h2 id="background">Background</h2>

<p>This post is a tour of my homelab as of September 2024.
Homelab, aka self-hosting, has been a hobby of mine for about 10 years now.
I’ve gone through many iterations over the years, but my overall setup more-or-less stabilized about a year ago.</p>

<p>I’m writing this post with two goals in mind.
First, to give other homelabbers and self-hosters an idea or two.
Second, to get some feedback and ideas about improvements that I could make.</p>

<h2 id="my-homelab-requirements">My Homelab Requirements</h2>

<p>I’ve designed and built my homelab with the following requirements in mind:</p>

<ul>
  <li>My homelab should be a secure, long-term source-of-truth for all of my files and media. My storage consists of boring personal documents (projects, personal finances, ebooks, etc.), my own photos and videos, an archive of my family’s old photos and videos that I’ve digitized, and Time Machine backups of my and my fiancé’s MacBooks. Right now I’ve used up about 4TB of 12TB available storage, and I don’t anticipate expanding anytime soon.</li>
  <li>My homelab should securely host self-hosted web services that I find useful.</li>
  <li>My data and most of my services should be securely accessible remotely but not publicly.</li>
  <li>Some of my services should be securely accessible publicly.</li>
  <li>All data and services should be backed up in an automated fashion, following the 3-2-1 pattern.</li>
  <li>All services should be monitored 24x7 with alerts sent to my email if something is misbehaving.</li>
  <li>The hardware should be nearly silent, quiet enough to live in a bedroom.</li>
  <li>The setup should be relatively cost-efficient. Ideally I pay less than I would for similar cloud services.</li>
</ul>

<p>Here are some notable non-requirements that often come up in homelab discussions:</p>

<ul>
  <li>I don’t need a massive amount of storage for movies and TV shows, as I’m not really a movie and TV buff.</li>
  <li>I don’t need anything faster than gigabit speeds on my LAN. I dabbled with 2.5 gigabit hardware, kind-of got it working, but the benefits weren’t worth the cost and complexity. I think I’m happy to wait until 10 gigabit is the default.</li>
</ul>

<h2 id="why-build-a-homelab">Why Build a Homelab?</h2>

<p>Getting my homelab to a state where it’s consistently useful and reliable has taken non-trivial effort.
So I think it’s useful to briefly reflect on the pros and cons of building and maintaining a homelab.</p>

<p>Pros:</p>

<ul>
  <li>Learning. I work in software, so many of the things I learn in my homelab are useful in my career, and vice-versa.</li>
  <li>Data privacy. All of my important data lives securely on my LAN, with no chance of being used to optimize my ads.</li>
  <li>Independence. I can host a service without worrying that a company is going to shut it down or hike prices.</li>
  <li>Cost (money). There comes a point where it’s actually cheaper to buy and run your own hardware.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>Cost (time). It would be significantly less time-consuming to dump my data onto a cloud service, pay a monthly fee, and hope they don’t misuse it or lose it or shut down the service.</li>
  <li>Responsibility. It’s on me to ensure my data and services are secure and backed up.</li>
</ul>

<p>I’m not a gardener, but I imagine building a homelab is like growing a vegetable garden.
You could just go buy some vegetables at the store, but it’s also pretty cool and useful and enjoyable to do it yourself.</p>

<h2 id="hardware">Hardware</h2>

<h3 id="network-box">Network Box</h3>

<p><img src="hardware-network-box.jpg" alt="My network box" /></p>

<p>The majority of my networking hardware lives in a neutral-colored cloth filing cabinet next to my desk in my living room.</p>

<p>This is in my living room simply because that’s where the cable company decided to install the incoming fiber connection.
If I remember correctly, I bought the box on Amazon for ~$15.
I cut out a hole for cables and airflow.
The cables are a bit messy, but it doesn’t bother me as it’s covered up.
The fake Ikea plant is strategically duct-taped to the top of the box to prevent our two cats from sleeping on it and cratering the lid.</p>

<p>The components as numbered are:</p>

<ol>
  <li>GL.iNet GL-AX1800 Flint router (<a href="https://www.amazon.com/gp/product/B09HBW45ZJ">Amazon</a>).</li>
  <li>Fiber modem provided by my ISP. I get gigabit up and down for ~$70/month.</li>
  <li>350VA Tripp Lite UPS (<a href="https://www.amazon.com/dp/B00007FHDP">Amazon</a>). Probably overkill, but it keeps things running during occasional power interruptions.</li>
  <li>TP-Link smart plug with power monitoring (<a href="https://www.amazon.com/Kasa-Smart-Supported-Scheduling-EP25P4/dp/B0B14C719T">Amazon</a>).</li>
</ol>

<p>According to the smart plug, the router and modem run at about 10W.</p>

<h3 id="compute-and-storage-cabinet">Compute and Storage Cabinet</h3>

<p><img src="hardware-compute-storage-cabinet.jpg" alt="Compute and storage cabinet" /></p>

<p>All of my compute and storage lives in an Ikea file cabinet in our guest bedroom.</p>

<p>I purchased the cabinet on Craigslist for around $50.
I believe the line is called Galant, but I’m not totally sure, and I can’t find the exact model online.
The bottom half has a door that slides out with arms for mounting file hangers; I just use it to store spare hardware.
The upper half is pictured and opens with standard swinging cabinet doors.
I keep these doors closed, so I just removed the back panel on the top half for cabling and airflow.</p>

<p>The components as numbered are:</p>

<ol>
  <li>8-port gigabit network switch (<a href="https://www.amazon.com/gp/product/B07PFYM5MZ">Amazon</a>). This is connected to my router via ~40 feet of Cat6 cable running out of my living room, around the building, and through the guest bedroom wall.</li>
  <li>APC 425VA UPS (<a href="https://www.amazon.com/gp/product/B01HDC236Q">Amazon</a>). Last I checked, it can keep everything running for about 15 minutes. The main purpose is just to handle occasional power interruptions.</li>
  <li>2014 Mac Mini, purchased for ~$100 on eBay, used for backups. It has an Intel i5, a 250GB NVMe, and a 1TB HDD and idles at about 10W.</li>
  <li>Seagate IronWolf 12TB Hard Drive (<a href="https://www.amazon.com/gp/product/B084ZTSMWF">Amazon</a>) in a cheap USB 3.0 enclosure, attached to the Mac Mini. This idles at about 5W.</li>
  <li>Beelink Mini S12 Pro (<a href="https://www.amazon.com/gp/product/B0BVFKN7ZL">Amazon</a>), running Proxmox with an Ubuntu Server VM for all of my self-hosted services. It has an Intel N100, 16GB RAM, a 500GB NVMe SSD, and a 2TB 2.5” SSD and idles at about 5W.</li>
  <li>HP ProLiant Microserver Gen8, purchased for ~$150 on eBay, running TrueNAS Core. This has a Xeon E3-1220L CPU, 16GB DDR3 ECC memory, and four 4TB Seagate IronWolf NAS Drives (<a href="https://www.amazon.com/gp/product/B07H289S79">Amazon</a>) and idles at about 40W.</li>
  <li>HP ProLiant Microserver Gen7, purchased for ~$50 on Craigslist, used as a test-bench for trying new operating systems and services. This has an AMD Turion II Neo N40L CPU, 16GB DDR3 ECC memory, and three 2TB drives. It’s powered off unless I’m using it to test something.</li>
</ol>

<h2 id="networking">Networking</h2>

<h3 id="tailscale">Tailscale</h3>

<p>I use <a href="https://tailscale.com/">Tailscale</a> to make services accessible <em>remotely</em> but not <em>publicly</em>.</p>

<p>Tailscale is essentially a mesh VPN that enables secure peer-to-peer communication between services and clients on a <a href="https://tailscale.com/kb/1136/tailnet">Tailnet</a>.
I have the Tailscale client on my MacBook, iPhone, and iPad, which lets me access any of the Tailscale-enabled services that I host.
For more details, I wrote a post: <a href="/2024/09/07/accessing-docker-compose-application-tailscale-tls.html">Accessing Docker Compose applications via Tailscale with HTTPS (TLS)</a>.</p>

<p>For services that don’t have a native Tailscale client, I use my router as a <a href="https://tailscale.com/kb/1019/subnets">subnet router</a>.
For example, this lets me access my TrueNAS server remotely from my iPhone via the Files app:</p>

<p><img src="truenas-on-ios-via-tailscale.jpg" width="33%" height="auto" alt="Accessing my TrueNAS server from my iPhone via Tailscale" /></p>

<h3 id="glinet-gl-ax1800-router">GL.iNet GL-AX1800 Router</h3>

<p>I use this router for all networking at home.
I picked this router because it runs <a href="https://openwrt.org/">OpenWrt</a> and supports Tailscale.</p>

<p>I have the router connected to my Tailnet, which lets me remotely administer the router.
I’ve configured it as a <a href="https://tailscale.com/kb/1019/subnets">subnet router</a>, which lets me access other IPs on my LAN that can’t connect natively to Tailscale.
I’ve also configured it as an <a href="https://tailscale.com/kb/1103/exit-nodes">exit node</a>, which lets me route traffic through my home router while I’m traveling.</p>

<p>Overall it’s a great piece of hardware for the price.
I ended up buying and installing the same router for my parents, and configured it similarly. 
This makes it easier to be the IT guy from across the country.</p>

<h3 id="cloudflare-tunnels">Cloudflare Tunnels</h3>

<p>I use <a href="https://www.cloudflare.com/products/tunnel/">Cloudflare Tunnels</a> for any services that need to be accessible via domain name on the public Internet.</p>

<p>A Cloudflare Tunnel amounts to running a Cloudflare client on a server.
The client establishes a secure tunnel between the server and the Cloudflare infrastructure.
You configure the tunnel to proxy traffic between a domain or subdomain you own and a specific application on the server.
For example, I could configure <code class="language-plaintext highlighter-rouge">https://foo.alexklibisz.com</code> to proxy traffic to <code class="language-plaintext highlighter-rouge">http://localhost:8080</code> on my server.
I typically run the Cloudflare client as a Docker container in a Docker Compose application, proxying traffic to and from the service container.</p>

<p>Tunnels automatically include free TLS and some DDoS protection and analytics from Cloudflare.
Another great feature is the ability to configure additional layers of security for a service.
For example, I can configure an allow-list of emails for a service, and Cloudflare will prompt anyone visiting that service to enter an email and authenticate with a token before proceeding to the service.
If the visitor isn’t on the allow-list, they don’t get a token and can’t access the service.</p>

<p>Tunnels are by far the simplest way I’ve found to securely expose a self-hosted service on the Internet.</p>

<h2 id="services">Services</h2>

<h3 id="truenas-core">TrueNAS Core</h3>

<p>I run <a href="https://www.truenas.com/truenas-core/">TrueNAS Core</a> on the Microserver Gen8.
I have four 4TB disks in RAIDZ1, meaning I have ~12TB of usable storage and can recover from up to one disk failure.</p>
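
<p>The ~12TB figure follows from simple parity arithmetic: one disk’s worth of capacity goes to parity and the rest holds data.
Here is a minimal sketch of that calculation, assuming a single vdev and ignoring ZFS metadata and padding overhead; the function name is made up for illustration.</p>

```python
def raidz_usable_tb(disk_count, disk_size_tb, parity_disks=1):
    """Approximate usable capacity of one RAIDZ vdev, before ZFS overhead."""
    if disk_count <= parity_disks:
        raise ValueError("a RAIDZ vdev needs more disks than parity disks")
    # Parity consumes the equivalent of `parity_disks` whole disks.
    return (disk_count - parity_disks) * disk_size_tb


# Four 4TB disks in RAIDZ1: three disks of data, one disk of parity.
print(raidz_usable_tb(4, 4))  # 12
```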

<p>I use TrueNAS for two main purposes:</p>

<ol>
  <li>As a file server, accessed via SMB from my MacBook, iPhone, and iPad and from some of my self-hosted services.</li>
  <li>As a backup server for Time Machine backups from my and my fiancé’s MacBooks.</li>
</ol>

<p>I’ve made TrueNAS accessible on my Tailnet by configuring my router as a subnet router.
I run a nightly backup via Rsync to my Mac Mini, which gets backed up to Backblaze.
I’ll describe that more below.</p>

<h3 id="proxmox">Proxmox</h3>

<p>I run <a href="https://www.proxmox.com/en/proxmox-virtual-environment/overview">Proxmox Virtual Environment</a> on the Beelink Mini PC.
Right now I run a single “production” Ubuntu Server VM for all of my self-hosted services and some temporary other VMs for testing and learning.
I run services on the Ubuntu Server VM via Docker Compose, and I access them via Tailscale or Cloudflare Tunnels.</p>

<p>I used to just run Ubuntu Server directly on the mini PC, but I recently added Proxmox to unlock a few nice features.
I can create new VMs to experiment with new operating systems and services without affecting the “production” VM.
It’s also much simpler to back up the production VM in an automated fashion.</p>

<h3 id="cusdis">Cusdis</h3>

<p><a href="https://cusdis.com/">Cusdis</a> is a very basic comment system for websites.
I run the <a href="https://hub.docker.com/r/djyde/cusdis">Cusdis server</a> for this website as a Docker Compose application on my Ubuntu Server VM, exposed to the Internet using a Cloudflare Tunnels container.</p>

<p>I originally tried to run Cusdis on a Hetzner VM, but I ran into trouble getting email notifications to work.
It turns out Hetzner blocks email traffic by default, and I was never able to get my support ticket for unblocking this to go through.
Running it locally has been totally sufficient.</p>

<h3 id="firefly">Firefly</h3>

<p><a href="https://www.firefly-iii.org/">Firefly-III</a> is a personal finance manager with features similar to Mint.com.
I run the <a href="https://hub.docker.com/r/fireflyiii/core">server</a> and the <a href="https://hub.docker.com/r/fireflyiii/data-importer">data-importer</a> as Docker Compose applications on my Ubuntu Server VM, exposed only on my Tailnet.</p>

<p>I use Firefly to import a CSV of all my credit card transactions about once a month and run some reports on expense categories, basically as a way to catch surprise expenses and trends.
I used to use it much more extensively, but found that this use-case has the best practical value.</p>

<p>I’ve found this service particularly valuable from a privacy perspective.
Call me paranoid or old-fashioned, but I refuse to use any service that grants automated access to my bank accounts, e.g., Plaid.
The benefit-to-consequence ratio of granting a mysterious third-party access to my financial accounts does not compute for me.</p>

<h3 id="joplin">Joplin</h3>

<p><a href="https://joplinapp.org/">Joplin</a> is a note-taking app, with features similar to Evernote (at least circa 2016, as that’s the last time I used Evernote).
I run the <a href="https://hub.docker.com/r/joplin/server">Joplin server</a> as a Docker Compose application on my Ubuntu Server VM, exposed only on my Tailnet.
The Joplin server acts as the source-of-truth storage and a synchronization mechanism between clients.</p>

<p>I’ve been a happy user of Joplin since late 2022.
I originally picked it because the clients are performant and reliable, and because it has some features I found missing in other note-taking apps.
For example, every note is just a markdown file, but it can also be edited in rich text format, and I can easily embed images and documents into a note.</p>

<h3 id="nextcloud">Nextcloud</h3>

<p><a href="https://nextcloud.com">Nextcloud</a> is essentially a file server with features similar to Dropbox or Google Drive.
I run the <a href="https://hub.docker.com/_/nextcloud">Nextcloud community image</a> as a Docker Compose application on my Ubuntu Server VM, exposed only on my Tailnet.</p>

<p>I’ve mounted my TrueNAS as <a href="https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/external_storage_configuration_gui.html">external storage</a>, so I can access my TrueNAS files via Nextcloud.
This seems redundant - why not just access the files via TrueNAS?
The main value of Nextcloud in this setup is that the Nextcloud clients can cache files for offline usage.
So I cache a subset of my TrueNAS files on my MacBook and iPhone for offline usage, e.g., my ebooks for flights.</p>

<h3 id="photoprism">Photoprism</h3>

<p><a href="https://www.photoprism.app/">Photoprism</a> is a photos app with features similar to Google Photos.
I run the <a href="https://hub.docker.com/r/photoprism/photoprism">server</a> as a Docker Compose application on my Ubuntu Server VM, exposed only on my Tailnet.</p>

<p>I’ve mounted several directories from my TrueNAS into Photoprism as read-only CIFS volumes.
This feature is surprisingly under-documented, or maybe I’m searching for the wrong keywords.
I basically did what’s described in <a href="https://www.zdyn.net/docker/2021/07/12/docker-cifs.html">this blog post</a>.
In any case, this means my media files live on TrueNAS as plain old files in folders, but I still get the benefits of a fancy photo app (face detection, geo-tagging, AI search, etc.).</p>

<p>So far I’ve been very happy with Photoprism, but I’d also like to try <a href="https://immich.app/">Immich</a> at some point.</p>

<h3 id="photosync">Photosync</h3>

<p><a href="https://www.photosync-app.com/home">Photosync</a> is a smartphone app for transferring photos and videos from a smartphone to a variety of backends.
I have Photosync configured to automatically back up all photos and videos from my iPhone to my TrueNAS server.
I mount the backup directory into my Photoprism server (see above), so that my iPhone photos and videos are automatically indexed and available in Photoprism.</p>

<p>I also still use iCloud to back up my photos, but I’ll probably turn that off once I’m close to the next iCloud storage tier.</p>

<h2 id="backups">Backups</h2>

<h3 id="automated-truenas-backups">Automated TrueNAS Backups</h3>

<p>I run a 2014 Mac Mini with a 12TB external HDD using encrypted APFS to facilitate TrueNAS backups.
I call it the <em>Backmini</em>.</p>

<p>Once a day, the Backmini runs a script that rsyncs all data from my TrueNAS to the external hard drive.
Later in the day, it runs another script that backs up the hard drive using <a href="https://www.backblaze.com/cloud-backup/personal">Backblaze unlimited personal backup</a>.</p>

<p>This is not a standard TrueNAS backup strategy, but I designed it intentionally for two reasons:</p>

<p>First, I specifically want to have a local copy of all my data on an APFS encrypted drive.
If something horrible happens to the TrueNAS, I can just plug the APFS drive into my MacBook and access a fresh copy of my data.
If something horrible happens to <em>me</em>, my fiancé or a family member can plug the APFS drive into a MacBook and access a fresh copy of my data.</p>

<p>Second, it’s the most cost-effective strategy I’ve found to back up the amount of data that I have (currently ~4TB with capacity for up to 12TB).
I made a simple spreadsheet comparing the prices of backing up data over five years using my Backmini+Backblaze setup vs. Backblaze B2 vs. Amazon S3.
I made the following assumptions for Backmini+Backblaze:</p>

<ul>
  <li>Hardware: I paid ~$100 for the 2014 Mac Mini and ~$265 for the 12TB drive and its enclosure. I spread the hardware cost over five years, basically assuming that’s the lifespan of the hardware.</li>
  <li>Electricity: The system runs 24x7 at 15W, with rates at ~$0.48/kWh (🤯, <a href="https://www.pge.com/assets/pge/docs/account/rate-plans/residential-electric-rate-plan-pricing.pdf">see PGE pricing</a>), totaling ~$63/year.</li>
  <li>Storage: I pay $99/year for the Backblaze personal unlimited backup plan.</li>
</ul>

<p>For Backblaze B2 and Amazon S3, I used the current prices: $6/TB/month (B2) and $0.023/GB/month (S3).</p>

<p>The numbers look like this:</p>

<p><img src="backup-pricing.jpg" alt="Backup Pricing" /></p>

<p>So the Backmini+Backblaze strategy becomes the most cost-effective somewhere between 2TB and 4TB of storage.</p>

<p>At 12TB it’s ~4x cheaper than B2 and ~14x cheaper than S3.
In reality S3 would actually be even more expensive due to read and write fees, but I think the point is clear.
If interested, here’s the spreadsheet: <a href="./backup-pricing.ods">backup-pricing.ods</a>.</p>
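
<p>For anyone who wants to plug in their own numbers, here is a rough re-derivation of the comparison from the assumptions listed above.
It’s a sketch rather than a copy of the spreadsheet: the function names are made up, I’m assuming 1TB = 1000GB for the S3 price, and the spreadsheet may round or structure things differently.</p>

```python
HOURS_PER_YEAR = 24 * 365


def backmini_annual_cost(hardware_usd=100 + 265, lifespan_years=5,
                         watts=15, usd_per_kwh=0.48,
                         backblaze_usd_per_year=99):
    """Yearly cost of the Backmini+Backblaze setup, independent of data size."""
    hardware = hardware_usd / lifespan_years                      # ~$73/year
    electricity = watts / 1000 * HOURS_PER_YEAR * usd_per_kwh     # ~$63/year
    return hardware + electricity + backblaze_usd_per_year


def b2_annual_cost(tb, usd_per_tb_month=6.0):
    """Yearly Backblaze B2 storage cost for `tb` terabytes."""
    return tb * usd_per_tb_month * 12


def s3_annual_cost(tb, usd_per_gb_month=0.023, gb_per_tb=1000):
    """Yearly S3 storage cost for `tb` terabytes (storage only, no request fees)."""
    return tb * gb_per_tb * usd_per_gb_month * 12


backmini = backmini_annual_cost()
print(f"Backmini+Backblaze: ${backmini:.0f}/year")
print(f"B2 at 12TB: {b2_annual_cost(12) / backmini:.1f}x the Backmini cost")
print(f"S3 at 12TB: {s3_annual_cost(12) / backmini:.1f}x the Backmini cost")
print(f"Break-even vs. B2: {backmini / b2_annual_cost(1):.1f}TB")
```

<p>Under these assumptions the break-even point against B2 lands a bit above 3TB, which is consistent with the “somewhere between 2TB and 4TB” estimate.</p>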

<p>Also note that I intentionally chose to run the rsync script on the Backmini, i.e., pulling data from TrueNAS to the Backmini’s external HDD.
I did this to account for the case where either the Backmini or TrueNAS is compromised.
If TrueNAS is compromised, the local data can be deleted, but there’s no TrueNAS user that can access the Backmini’s storage, so the backup is unaffected.
If the Backmini is compromised, the local backup data can be deleted, but the Backmini user can only read from TrueNAS, so TrueNAS is unaffected.
So an attacker would have to compromise both systems to delete all the data.
In that case I still have periodic offline backups, see below.
To make the permissions work, I added a TrueNAS user called backmini, added the user to the group of each user whose data needs to be backed up, and configured the directory permissions as 750, i.e., permitting the group to read and execute files.
Now the backmini user can read each TrueNAS user’s data but can only write to its own data.</p>
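
<p>If the octal notation is unfamiliar, the 750 mode can be decoded with Python’s standard <code class="language-plaintext highlighter-rouge">stat</code> module.
This is only an illustration of what the mode bits mean, not something that runs on TrueNAS; the helper name is made up.
On a directory, the “execute” bit is what allows a user to traverse into it.</p>

```python
import stat

DIRECTORY_MODE = 0o750  # owner: rwx, group: r-x, others: ---


def group_permissions(mode):
    """Return which permissions the group has for the given mode bits."""
    return {
        "read": bool(mode & stat.S_IRGRP),
        "write": bool(mode & stat.S_IWGRP),
        "execute": bool(mode & stat.S_IXGRP),
    }


# The same string that `ls -l` would show for a directory with mode 750.
print(stat.filemode(stat.S_IFDIR | DIRECTORY_MODE))  # drwxr-x---

# The group (which includes the backmini user) can read and traverse, but not write.
print(group_permissions(DIRECTORY_MODE))
```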

<p>Because I’ve been burned by silent drive failures before, I use a program called <a href="https://binaryfruit.com/drivedx">DriveDX</a> to run periodic SMART checks and send me the results.</p>

<p>While this setup works well for me, I’ll mention some cons:</p>

<ul>
  <li>It’s complicated. It would definitely be simpler to just sync the data to S3.</li>
  <li>The external hard drive is a single point of failure.</li>
  <li>The Backblaze backup is encrypted, but I find Backblaze’s encryption scheme flawed. Specifically, I have to enter my encryption key <em>in the browser</em> just to browse the files. So I basically have to hand over the key and trust that Backblaze isn’t doing anything dumb with it. I would find it much nicer if Backblaze either didn’t encrypt file names or had separate keys for file name and data encryption.</li>
  <li>The 2014 Mac Mini is almost deprecated by Apple. It’s running Monterey and gets a security patch a couple times a year, but it’s not getting any new major releases. At some point I’ll probably upgrade to a 2018 or a 2020 M1 Mac Mini.</li>
</ul>

<h3 id="manual-truenas-backups">Manual TrueNAS Backups</h3>

<p>Every three months I run an rsync script to copy the most important data from my TrueNAS onto an external hard drive.
Then the hard drive goes back into a small safe.
The hard drive uses encrypted APFS, so it’s useless to anyone without the key.</p>

<h3 id="proxmox-backups">Proxmox Backups</h3>

<p>I use Proxmox’s built-in backup functionality to back up the Proxmox VMs on my Beelink Mini PC to an internal 2TB SSD once a month.
Similar to my TrueNAS backups, I have a script on the Backmini that rsyncs the Proxmox VM backups to the 12TB external HDD, which then gets backed up to Backblaze.</p>

<h3 id="docker-compose-application-backups">Docker Compose Application Backups</h3>

<p>To back up my Docker Compose applications, I use a containerized script that I wrote called <a href="https://github.com/alexklibisz/bdv2s3">bdv2s3</a>.
This horrible name is short for “backup docker volume to S3”.</p>

<p>The container runs as a service in a Docker Compose application with all of the volumes that need to be backed up mounted as read-only.
For example, if I have a Postgres service with a volume mounted for data storage, I’ll also mount the volume into the bdv2s3 container as read-only.
I use labels to identify containers which should be stopped before a backup is started. 
On a configurable cron, the bdv2s3 container will stop all labeled containers, tar up the contents of its mounted volumes, restart the stopped containers, gzip the tar file, encrypt the tar file using a configurable key, copy the encrypted file to S3 (or any S3-compatible storage), and curl a configurable monitoring endpoint.
All configuration is injected via environment variables.</p>
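
<p>The ordering of those steps matters: the containers are restarted right after the tar step, so the slower compress/encrypt/upload work happens while the services are already back up.
Here is a small sketch of that ordering. It is not the bdv2s3 implementation; every function and parameter name is hypothetical, each step is injected as a plain callable, and the <code class="language-plaintext highlighter-rouge">try/finally</code> around the tar step is my own assumption about how one might guarantee the restart.</p>

```python
def run_backup(stop_containers, tar_volumes, start_containers,
               gzip_file, encrypt_file, upload_file, ping_monitor):
    """Run one backup cycle in the order described above."""
    stop_containers()
    try:
        tar_path = tar_volumes()
    finally:
        # Assumption: restart the services even if the tar step fails.
        start_containers()
    # Services are running again; the slow steps happen afterwards.
    encrypted_path = encrypt_file(gzip_file(tar_path))
    upload_file(encrypted_path)
    ping_monitor()


def demo():
    """Run the cycle with stand-in steps that only record their names."""
    steps = []

    def record(name, result=None):
        steps.append(name)
        return result

    run_backup(
        stop_containers=lambda: record("stop"),
        tar_volumes=lambda: record("tar", "volumes.tar"),
        start_containers=lambda: record("start"),
        gzip_file=lambda path: record("gzip", path + ".gz"),
        encrypt_file=lambda path: record("encrypt", path + ".enc"),
        upload_file=lambda path: record("upload " + path),
        ping_monitor=lambda: record("ping"),
    )
    return steps


print(demo())
```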

<p>When I need to restore the volumes, I download the backup file, decrypt it, untar it, and use a <code class="language-plaintext highlighter-rouge">docker run</code> command to write the data back to a local docker volume.
I guess it could be a bit more automated, but it’s all documented in the <a href="https://github.com/alexklibisz/bdv2s3">bdv2s3 readme</a>.</p>

<p>I run the backups nightly and store them in a Backblaze B2 bucket with a lifecycle rule that deletes files 60 days after creation.</p>

<p>I’ve been using this since late 2022, and I’ve successfully restored backups several times, so I generally trust this setup.</p>

<p>While this setup works for me, I’ll mention some cons to consider:</p>

<ul>
  <li>It requires stopping the services. This is a homelab, so I’m not concerned with a few seconds of downtime.</li>
  <li>At some point the volumes become too big to back up this way. So far my biggest backups are on the order of 3 GB.</li>
  <li>Some services make nuanced assumptions about the permissions of the data in the volume. So I occasionally have to do a bit of additional permissions surgery when restoring.</li>
</ul>

<h2 id="monitoring">Monitoring</h2>

<h3 id="uptime-kuma">Uptime Kuma</h3>

<p>I use a self-hosted service called <a href="https://github.com/louislam/uptime-kuma">Uptime Kuma</a> for monitoring.</p>

<p>Like some of my other services, Uptime Kuma runs as a Docker Compose application, accessible via Cloudflare Tunnel.
Unlike my other services, it’s running on a tiny VM in the cloud.
I happen to be using <a href="https://www.cloudserver.net/">cloudserver.net</a> because I got a good deal: $10 / year for a VM with 1 CPU and 1GB memory.
I’ll likely move it over to Hetzner when that deal expires.
I chose to run it in the cloud so that it’s decoupled from any failures in my homelab.</p>

<p>I use two types of monitoring in Uptime Kuma:</p>

<ul>
  <li>Push monitoring. The service or task that’s being monitored has to periodically request an endpoint on the monitoring server. If it doesn’t, the service or task is considered to be down.</li>
  <li>HTTP monitoring. Uptime Kuma periodically sends an HTTP request to a specific URL on the server being monitored. If the request returns an error, the server is considered to be down.</li>
</ul>

<p>Here are some examples of alerts I have configured:</p>

<ul>
  <li>Each of my servers has a cron task to request a push monitoring URL once every five minutes. If that URL hasn’t been requested in a certain amount of time, the server is down and I get an email.</li>
  <li>Each of my self-hosted services has an HTTP monitor that sends a request to the service every five minutes. If that request hasn’t succeeded in a certain amount of time, the service is down and I get an email.</li>
  <li>Each of my application backups is configured to send a request to a push monitoring URL each time it runs (typically once/day). If that URL hasn’t been accessed in a certain amount of time, the backup is down and I get an email.</li>
</ul>
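
<p>The push-style checks above all boil down to the same rule: if the last heartbeat is older than the expected interval plus some slack, treat the thing as down.
Here is a minimal sketch of that rule. It is not Uptime Kuma’s code, and the function name and the <code class="language-plaintext highlighter-rouge">grace</code> parameter are made up for illustration.</p>

```python
from datetime import datetime, timedelta


def push_monitor_is_down(last_heartbeat, now, interval, grace=timedelta(0)):
    """Return True if no heartbeat arrived within `interval + grace` of `now`."""
    return now - last_heartbeat > interval + grace


last_seen = datetime(2024, 9, 27, 12, 0)
five_minutes = timedelta(minutes=5)

# A heartbeat four minutes ago is still within the five-minute interval.
print(push_monitor_is_down(last_seen, datetime(2024, 9, 27, 12, 4), five_minutes))

# Eleven minutes of silence exceeds a five-minute interval plus five minutes of grace.
print(push_monitor_is_down(last_seen, datetime(2024, 9, 27, 12, 11),
                           five_minutes, grace=five_minutes))
```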

<h3 id="uptime-robot">Uptime Robot</h3>

<p>I use a SaaS called <a href="https://uptimerobot.com/">Uptime Robot</a> to monitor Uptime Kuma.
I have a single HTTP monitor that periodically pings my Uptime Kuma server, just to make sure it’s still up.
I would use Uptime Robot for everything, but the free tier is limited and I can get everything I need from Uptime Kuma.</p>

<h3 id="amazon-ses">Amazon SES</h3>

<p>I use Amazon’s <a href="https://aws.amazon.com/ses/">Simple Email Service</a> to send email notifications from Uptime Kuma and any other self-hosted service with an SMTP email integration.</p>

<p>Like many things in AWS, it’s immediately obvious that this should work, but not immediately obvious how to wire it up.
The high-level summary is that you create an IAM user (i.e., the type with an access key ID and secret access key), grant the user permission to send emails via SES, send a confirmation email to yourself to “subscribe” to the emails, and set <code class="language-plaintext highlighter-rouge">email-smtp.&lt;region&gt;.amazonaws.com</code> as the SMTP host and the IAM ID and IAM key as the SMTP username and password.</p>
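
<p>Most services just need those values typed into an SMTP settings form, but the same wiring can be sketched in a few lines of Python using only the standard library.
This is an illustration, not what Uptime Kuma does internally: the helper names are made up, port 587 with STARTTLS is an assumption on my part, and the username and password are whatever SMTP credentials you ended up with for SES.</p>

```python
import smtplib
from email.message import EmailMessage


def ses_smtp_host(region):
    """Build the SES SMTP hostname for an AWS region, e.g. us-west-2."""
    return f"email-smtp.{region}.amazonaws.com"


def build_alert(sender, recipient, subject, body):
    """Create a plain-text email message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(message, region, username, password, port=587):
    """Send the message through SES over SMTP with STARTTLS (assumed port 587)."""
    with smtplib.SMTP(ses_smtp_host(region), port) as client:
        client.starttls()
        client.login(username, password)
        client.send_message(message)


alert = build_alert("alerts@example.com", "me@example.com",
                    "Service down", "The push monitor missed its heartbeat.")
print(ses_smtp_host("us-west-2"))
```

<p>Calling <code class="language-plaintext highlighter-rouge">send_alert(alert, "us-west-2", username, password)</code> would then attempt the actual delivery; the addresses above are placeholders.</p>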

<h2 id="conclusion">Conclusion</h2>

<p>So that’s my current setup.
It’s not a particularly impressive amount of storage/compute/networking compared to some other setups I’ve seen, but I feel like I’ve done a good job of designing and building just enough to meet my requirements.</p>

<p>If you have questions/comments/suggestions, feel free to use the self-hosted comment system below, or join <a href="https://www.reddit.com/r/selfhosted/comments/1frvnlr/my_homelab_september_2024_truenas_proxmox/">the Reddit thread</a>.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="why-truenas">Why TrueNAS?</h3>

<p>I first got interested in ZFS and TrueNAS after experiencing two incidents of data corruption.</p>

<p>In the first, I found ~100 photos that I had stored on Google Drive had been corrupted.
This varied from small lines through the photo, to large swaths of the photo grayed out, to not being able to open the file at all.
Luckily I had a backup of these particular files on an old external drive that I was planning to wipe.
I doubt they were corrupted on Google Drive.
It’s more likely they were corrupted at some point while transferring from my phone, to my computer, to Google Drive, etc.</p>

<p>In the second, I found an important folder was completely missing from an external hard drive.
After doing some SMART tests, I learned the drive had some hardware failures, but there had been no warning of this in MacOS.
Luckily I also had a backup in this case.</p>

<p>After these incidents I went down the rabbit hole of preventing bitrot, detecting data corruption, etc.
ZFS seemed like the best tool for the job, and TrueNAS seemed like the most user-friendly implementation, so that’s what I used.</p>

<h3 id="why-not-truenas-scale">Why not TrueNAS Scale?</h3>

<p>I chose TrueNAS Core over TrueNAS Scale because Scale performed extremely poorly with the built-in storage controller in my Microserver Gen8.
I ended up buying a storage controller on eBay, which improved performance, but it also increased the power usage by about 10W.
I wrote <a href="https://www.truenas.com/community/threads/slow-writes-and-device-x-is-causing-slow-i-o-with-scale-on-intel-cougar-point-controller-hp-microserver-gen8-but-works-fine-on-core.112441/">a summary about it</a> on the old TrueNAS forums.</p>

<p>So far the only feature I’ve missed from TrueNAS Scale is the ability to natively connect to Tailscale.
But using my Tailscale-connected router as a subnet router has been totally sufficient.
I haven’t been able to notice any performance penalty compared to transferring directly over the LAN.
To me that makes sense, as the packets traverse the router anyways.
I haven’t felt myself wanting any of the fancier features of TrueNAS Scale (VMs, Docker containers, etc.).
I actually prefer to keep this setup as simple as possible for reliability and security purposes.</p>

<h3 id="why-not-pfsense--opensense">Why not pfSense / OPNsense?</h3>

<p>I actually ran pfSense for about six months but ultimately decommissioned it in favor of the GL.iNet router described above.
pfSense was pretty easy to set up, but it also occasionally failed in mysterious ways, a couple of times while I was traveling.
This could have totally been hardware-related, but I just wanted something that I can set and forget.
The pfSense subreddit also left a sour taste in my mouth on a couple of occasions (<a href="https://www.reddit.com/r/PFSENSE/comments/165osls/comment/jyk87o5/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">example</a>).</p>

<h3 id="update-101324">Update: 10/13/24</h3>

<p>I’ve replaced the 2014 Mac Mini with a 2015 11.6” Macbook Air, purchased for ~$100 on Backmarket.
My main reasoning for this is the built-in battery.
For some reason the Mac Mini would lose power ~once every couple weeks, even when on the UPS.
Rebooting it was a pain.
Newer versions of MacOS let you log in after reboot through the Screen Sharing app, but MacOS Monterey doesn’t seem to have this feature.
So anytime the Mac Mini rebooted, I would have to attach a USB keyboard and type in the password to get it started.
I’m hoping that the Macbook Air, by virtue of having a built-in UPS, solves this problem.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post is a tour of my homelab as of September 2024.]]></summary></entry><entry><title type="html">Accessing a Docker Compose application via Tailscale with TLS (HTTPS)</title><link href="https://alexklibisz.com/2024/09/07/accessing-docker-compose-application-tailscale-tls.html" rel="alternate" type="text/html" title="Accessing a Docker Compose application via Tailscale with TLS (HTTPS)" /><published>2024-09-07T15:00:00+00:00</published><updated>2024-09-07T15:00:00+00:00</updated><id>https://alexklibisz.com/2024/09/07/accessing-docker-compose-application-tailscale-tls</id><content type="html" xml:base="https://alexklibisz.com/2024/09/07/accessing-docker-compose-application-tailscale-tls.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this post I’ll show how I self-host Docker Compose applications that I can access from my Tailnet with TLS (i.e., via HTTPS).
At the time of writing, I’ve yet to see an end-to-end example for getting this all to work together.
So I figure I can share my own setup that’s been working for about a year now.
Maybe someone will comment and tell me a way simpler way to do this!<sup id="fnref:internet" role="doc-noteref"><a href="#fn:internet" class="footnote" rel="footnote">1</a></sup></p>

<p>For demo purposes, I’ll be using the <a href="https://hub.docker.com/r/joplin/server">server image</a> for the <a href="https://joplinapp.org/">Joplin</a> app.
It’s relatively simple to set up, and the Joplin mobile client requires that the server is using HTTPS/TLS.
However, the same concepts apply to any Dockerized application.</p>

<p>The source code for this post is available <a href="https://github.com/alexklibisz/site-projects/tree/main/accessing-docker-compose-applications-via-tailscale-with-tls">on GitHub</a>.</p>

<h2 id="background">Background</h2>

<h3 id="docker-compose">Docker Compose</h3>

<p><a href="https://docs.docker.com/compose/">Docker Compose</a> is a command line application for orchestrating multiple Docker containers on a single host.
I’ve found it’s an indispensable tool for self-hosting applications.</p>

<h3 id="tailscale">Tailscale</h3>

<p><a href="https://tailscale.com/">Tailscale</a> is a mesh VPN that allows users to securely connect between devices (servers, PCs, phones) across different networks, without exposing any of the devices to the public Internet.
That means I can install the Tailscale client on my iPhone and connect to my Joplin server running at home, without requiring me to have both devices on the same network, and without requiring me to expose the Joplin server over the public Internet.
The traffic flows device-to-device, as the Tailscale service itself is only facilitating discovery and authentication. 
I’ve tried a few VPNs, and Tailscale is by far the simplest and most reliable.</p>

<h3 id="joplin">Joplin</h3>

<p><a href="https://joplinapp.org/">Joplin</a> is an open-source note-taking application, like Evernote circa 2014 but with Markdown.
It has reliable desktop and mobile applications and provides a few options for syncing notes across devices.
One of the options is to host your own storage and sync server, which is what I’m doing in this post.</p>

<h2 id="step-1-get-the-joplin-server-running">Step 1: get the Joplin server running</h2>

<p>As a first step, I just need to get the Joplin server running.
I have a Docker Compose service for Joplin, configured to bind to port 80 and exposed on <a href="http://localhost:8080">http://localhost:8080</a>.
For now it’s just using an ephemeral SQLite database.<sup id="fnref:postgres" role="doc-noteref"><a href="#fn:postgres" class="footnote" rel="footnote">2</a></sup></p>

<p>Here’s the docker-compose.yaml file:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">joplin</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">joplin/server:3.0.1-beta</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">8080:80</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">APP_PORT=80</span>
      <span class="pi">-</span> <span class="s">APP_BASE_URL=http://localhost:8080</span>
</code></pre></div></div>

<p>I run it with <code class="language-plaintext highlighter-rouge">docker compose up</code>, and I’m able to access the web UI on <a href="http://localhost:8080">http://localhost:8080</a>:</p>

<p><img src="joplin-localhost-8080.png" alt="The Joplin web UI running at localhost:8080" width="80%" height="auto" /></p>

<h2 id="step-2-add-a-tailscale-service">Step 2: add a Tailscale service</h2>

<p>Now I need to add a Tailscale service to Docker Compose.
I add the service with the following details:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">TS_HOSTNAME=joplin-server</code> environment variable tells Tailscale that this device should be accessible at <code class="language-plaintext highlighter-rouge">joplin-server.mytailnet.ts.net</code>.</li>
  <li>The <code class="language-plaintext highlighter-rouge">tailscale</code> volume and <code class="language-plaintext highlighter-rouge">TS_STATE_DIR</code> environment variable allow Tailscale to persist authentication state across restarts.</li>
</ul>

<p>Here’s the updated docker-compose.yaml file.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">joplin</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">joplin/server:3.0.1-beta</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">APP_PORT=80</span>
      <span class="pi">-</span> <span class="s">APP_BASE_URL=http://localhost:8080</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">8080:80</span>
  <span class="na">tailscale</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">tailscale/tailscale:v1.72.1</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">tailscale:/var/run/tailscale</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">TS_HOSTNAME=joplin-server</span>
      <span class="pi">-</span> <span class="s">TS_STATE_DIR=/var/run/tailscale</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">tailscale</span><span class="pi">:</span>
</code></pre></div></div>

<p>When I run <code class="language-plaintext highlighter-rouge">docker compose up</code>, I see a log like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tailscale-1  | To authenticate, visit:
tailscale-1  | 
tailscale-1  |  https://login.tailscale.com/a/d41d8cd98f00b2
</code></pre></div></div>

<p>So I follow the link and Tailscale prompts me to connect the device:</p>

<p><img src="tailscale-connect-device.png" alt="Tailscale prompting me to connect the new device" width="80%" height="auto" /></p>

<p>After some prompts, I can see it’s online:</p>

<p><img src="tailscale-online.png" alt="Tailscale device is online" width="80%" height="auto" /></p>

<h2 id="step-3-access-the-joplin-service-on-my-tailnet">Step 3: access the Joplin service on my Tailnet</h2>

<p>Now I want the Joplin service to be available on my Tailnet, at <a href="http://joplin-server.mytailnet.ts.net">http://joplin-server.mytailnet.ts.net</a>.</p>

<p>The simplest way I’ve found to do this is still a bit more involved than I would like it to be.
I need to do the following:</p>

<ul>
  <li>I add an Nginx service, configured to use the Docker DNS and to forward traffic from port <code class="language-plaintext highlighter-rouge">80</code> to the Joplin service.</li>
  <li>I configure the Nginx service to share the Tailscale service’s network, via <code class="language-plaintext highlighter-rouge">network_mode: service:tailscale</code>.</li>
</ul>

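<p>Here’s the <code class="language-plaintext highlighter-rouge">nginx.conf</code> file, which I keep as <code class="language-plaintext highlighter-rouge">nginx-1.conf</code> for this step.</p>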

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>events {}
http {
  server {
    # Telling Nginx to use the Docker internal DNS server,
    # so it can resolve the `joplin` host name.
    resolver 127.0.0.11 [::1]:5353 valid=3600s;
    set $backend "http://joplin:80";
    location / {
      proxy_pass $backend;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      client_max_body_size 64M;
    }
  }
}
</code></pre></div></div>

<p>Here’s the updated docker-compose.yaml:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">joplin</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">joplin/server:3.0.1-beta</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">APP_PORT=80</span>
      <span class="pi">-</span> <span class="s">APP_BASE_URL=http://joplin-server.${TAILNET}.ts.net</span>
  <span class="na">nginx</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">nginx:1.27.0</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./nginx-1.conf:/etc/nginx/nginx.conf</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">network_mode</span><span class="pi">:</span> <span class="s">service:tailscale</span>
  <span class="na">tailscale</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">tailscale/tailscale:v1.72.1</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">tailscale:/var/run/tailscale</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">TS_HOSTNAME=joplin-server</span>
      <span class="pi">-</span> <span class="s">TS_STATE_DIR=/var/run/tailscale</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">tailscale</span><span class="pi">:</span>
</code></pre></div></div>

<p>Now the Joplin service is accessible at <a href="http://joplin-server.mytailnet.ts.net">http://joplin-server.mytailnet.ts.net</a>:</p>

<p><img src="joplin-on-tailnet-http.png" alt="Joplin server accessible over HTTP on the Tailnet" width="80%" height="auto" /></p>

<h2 id="step-4-access-the-joplin-service-via-tailnet-with-https">Step 4: access the Joplin service via Tailnet with HTTPS</h2>

<p>The server is currently accessible over HTTP, but I need it to use HTTPS.
To achieve this, I’ll use the Tailscale CLI to generate the TLS key and certificate files, mount them on the Nginx service, and configure Nginx to use them.</p>

<p>This requires the following changes to the docker-compose file:</p>

<ul>
  <li>I add a <code class="language-plaintext highlighter-rouge">tls</code> volume to store the key and certificate files.</li>
  <li>I mount the <code class="language-plaintext highlighter-rouge">tls</code> volume on the Nginx service in read-only mode.</li>
  <li>I mount the <code class="language-plaintext highlighter-rouge">tls</code> volume on the Tailscale service in read/write mode.</li>
</ul>

<p>Here’s the updated docker-compose.yaml:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">joplin</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">joplin/server:3.0.1-beta</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">APP_PORT=80</span>
      <span class="pi">-</span> <span class="s">APP_BASE_URL=https://joplin-server.${TAILNET}.ts.net</span>
  <span class="na">nginx</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">nginx:1.27.0</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./nginx-2.conf:/etc/nginx/nginx.conf</span>
      <span class="pi">-</span> <span class="s">tls:/mnt/tls:ro</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">network_mode</span><span class="pi">:</span> <span class="s">service:tailscale</span>
  <span class="na">tailscale</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">tailscale/tailscale:v1.72.1</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">tailscale:/var/run/tailscale</span>
      <span class="pi">-</span> <span class="s">tls:/mnt/tls</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">TS_HOSTNAME=joplin-server</span>
      <span class="pi">-</span> <span class="s">TS_STATE_DIR=/var/run/tailscale</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">tailscale</span><span class="pi">:</span>
  <span class="na">tls</span><span class="pi">:</span>
</code></pre></div></div>

<p>I need to configure Nginx to use the key and certificate, so I modify the Nginx conf file to include the <code class="language-plaintext highlighter-rouge">listen 443 ssl</code>, <code class="language-plaintext highlighter-rouge">ssl_certificate</code>, and <code class="language-plaintext highlighter-rouge">ssl_certificate_key</code> directives.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>events {}
http {
  server {
    resolver 127.0.0.11 [::1]:5353 valid=15s;
    set $backend "http://joplin:80";
    listen 443 ssl;
    ssl_certificate /mnt/tls/cert.pem;
    ssl_certificate_key /mnt/tls/cert.key;
    location / {
      proxy_pass $backend;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      client_max_body_size 64M;
    }
  }
}
</code></pre></div></div>

<p>I start the services via <code class="language-plaintext highlighter-rouge">docker compose up --detach</code>.
If I look at the logs at this point, Nginx is crashing and restarting, because there’s no key and cert file yet.</p>

<p>To generate the key and cert files, I use the <code class="language-plaintext highlighter-rouge">tailscale cert</code> command via <code class="language-plaintext highlighter-rouge">docker compose exec</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker compose <span class="nb">exec </span>tailscale <span class="se">\</span>
    /bin/sh <span class="nt">-c</span> <span class="s2">"tailscale cert --cert-file /mnt/tls/cert.pem --key-file /mnt/tls/cert.key joplin-server.mytailnet.ts.net"</span>
Wrote public cert to /mnt/tls/cert.pem
Wrote private key to /mnt/tls/cert.key
</code></pre></div></div>

<p>When Nginx restarts, it picks up these files and starts proxying requests with TLS.</p>

<p>And finally, I can access the Joplin server over HTTPS, at <a href="https://joplin-server.mytailnet.ts.net">https://joplin-server.mytailnet.ts.net</a>:</p>

<p><img src="joplin-on-tailnet-https.png" alt="Joplin server accessible over HTTPS on the Tailnet" width="80%" height="auto" /></p>

<h2 id="step-5-maintenance">Step 5: Maintenance</h2>

<p>I’ve been running this setup in several Docker Compose applications for about a year now.
The only required maintenance is periodically re-generating the certificates.
I have the <code class="language-plaintext highlighter-rouge">docker compose exec</code> command in a script called <code class="language-plaintext highlighter-rouge">renew_certs.sh</code>, and I’m using <a href="https://github.com/louislam/uptime-kuma">Uptime Kuma</a> to notify me when the certs are nearing their expiry.
I could automate this with a simple cronjob, but this is easy enough.</p>
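
<p>For reference, here’s a minimal sketch of what a <code class="language-plaintext highlighter-rouge">renew_certs.sh</code> along these lines could look like. It is not the exact script I run. The function names, the <code class="language-plaintext highlighter-rouge">TAILNET_DOMAIN</code> variable, and the <code class="language-plaintext highlighter-rouge">renew</code> argument are made up for illustration. It also assumes that the script sits next to the docker-compose.yaml file and that Nginx needs a reload to pick up renewed files.</p>

```shell
#!/bin/bash
# Sketch of a renew_certs.sh. Paths mirror the compose file from Step 4.
# TAILNET_DOMAIN is a hypothetical variable, e.g. mytailnet.ts.net.

# Build the command that runs inside the tailscale container.
cert_command() {
  local fqdn="$1"
  echo "tailscale cert --cert-file /mnt/tls/cert.pem --key-file /mnt/tls/cert.key ${fqdn}"
}

renew_certs() {
  local fqdn="joplin-server.${TAILNET_DOMAIN:-mytailnet.ts.net}"
  # Assumption: this script lives beside docker-compose.yaml.
  cd "$(dirname "$0")" || return 1
  # -T disables TTY allocation, which matters when there is no terminal (e.g. cron).
  docker compose exec -T tailscale /bin/sh -c "$(cert_command "$fqdn")" || return 1
  # Assumption: a running Nginx needs a reload to serve the renewed files.
  docker compose exec -T nginx nginx -s reload
}

# Only touch Docker when called with the "renew" argument,
# so the functions can be sourced and inspected safely.
if [ "${1:-}" = "renew" ]; then
  renew_certs
fi
```

<p>A crontab entry such as <code class="language-plaintext highlighter-rouge">0 4 1 * * /path/to/renew_certs.sh renew</code> (placeholder path and schedule) would then run it once a month.</p>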

<h2 id="conclusion">Conclusion</h2>

<p>So this is how I self-host Docker Compose applications on my Tailnet with TLS.
The setup is maybe a bit more complicated than I’d like; I somewhat wish Nginx wasn’t necessary here.
Please comment if you know of a simpler way!
It’s been working reliably for about a year now, so hopefully the write-up is useful for some other self-hosters.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:internet" role="doc-endnote">
      <p>I’ve found a useful tactic for gleaning new information on the Internet is to semi-confidently state something you’re unsure about. More often than not, someone suddenly appears to correct you. <a href="#fnref:internet" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:postgres" role="doc-endnote">
      <p>You can use a persistent Postgres database based on <a href="https://github.com/laurent22/joplin/blob/091bf45149eb530a59b86ea04e917fb56734252a/docker-compose.server-dev.yml">this docker-compose.yaml file</a>, but I recommend waiting until after getting all the Tailscale and HTTPS working, to avoid unnecessary intermediate complexity. <a href="#fnref:postgres" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[In this post I'll show how to self-host a Dockerized application that I can access from my Tailnet with HTTPS/TLS.]]></summary></entry><entry><title type="html">A script to dim my Macbook display when an external display is detected</title><link href="https://alexklibisz.com/2024/09/02/script-to-dim-macbook-display-when-external-display-detected.html" rel="alternate" type="text/html" title="A script to dim my Macbook display when an external display is detected" /><published>2024-09-02T15:00:00+00:00</published><updated>2024-09-02T15:00:00+00:00</updated><id>https://alexklibisz.com/2024/09/02/script-to-dim-macbook-display-when-external-display-detected</id><content type="html" xml:base="https://alexklibisz.com/2024/09/02/script-to-dim-macbook-display-when-external-display-detected.html"><![CDATA[<h2 id="background">Background</h2>

<p>I live in PG&amp;E territory, where my current off-peak electricity cost is about $0.48 / kWh.
So I occasionally get curious about how much power various appliances and devices are using.
Today I decided to check my 2021 M1 Max Macbook Pro.
It turns out it’s drawing about 25 watts at idle.
But if I dim the screen all the way down, it’s only about 15 watts!</p>

<p>I generally use my Macbook with an external display in mirrored mode, i.e., the external display and the Macbook display show the same thing, but I look at the external display.
I just keep the lid open because I like the fingerprint sensor.
But I don’t actually need the display to be on.</p>

<p>So that got me thinking.
If I keep the screen dimmed all the way down, I could use 40% less energy on my Macbook.
Assuming I use this laptop ~8 hours / day, I can save a whopping 29 kWh / year, or about $14!
Obviously this is peanuts, but it’s still a fun, quick exercise.</p>
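
<p>The arithmetic behind those numbers, as a quick sketch. The function name is made up; the inputs are the 10-watt difference measured above, 8 hours per day, and $0.48 per kWh.</p>

```shell
#!/bin/bash
# Annual savings from a lower idle power draw.
# Arguments: watts saved, hours per day, price in dollars per kWh.
annual_savings() {
  awk -v w="$1" -v h="$2" -v p="$3" 'BEGIN {
    kwh = w * h * 365 / 1000
    printf "%.1f kWh/year, $%.2f/year\n", kwh, kwh * p
  }'
}

# 25 W at idle minus 15 W dimmed = 10 W saved.
annual_savings 10 8 0.48
```

<p>With these inputs it prints 29.2 kWh/year and $14.02/year, which matches the rounded figures above.</p>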

<h2 id="requirements">Requirements</h2>

<p>The requirements are pretty simple:</p>

<ul>
  <li>The laptop should automatically dim the screen when it detects that it’s plugged into an external display.</li>
  <li>I should be able to manually override this as-needed, i.e., it shouldn’t dim the screen after I’ve overridden it.</li>
</ul>

<h2 id="research">Research</h2>

<p>I looked through the MacOS settings for like 10 minutes to see if there’s a built-in setting for this.
I couldn’t find one.</p>

<p>I Googled around and found out you can use AppleScript to simulate pressing the screen brightness buttons.
I think I ultimately found this answer most helpful: <a href="https://apple.stackexchange.com/a/285907">https://apple.stackexchange.com/a/285907</a></p>

<p>Then I needed a way to detect whether the Macbook is connected to an external monitor.
This answer does the trick: <a href="https://stackoverflow.com/a/20115806">https://stackoverflow.com/a/20115806</a>
It basically just returns some metadata and a list of displays.
My monitor is the “DELL P2715Q”:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>system_profiler SPDisplaysDataType
Graphics/Displays:

    Apple M1 Max:

      Chipset Model: Apple M1 Max
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 24
      Vendor: Apple <span class="o">(</span>0x106b<span class="o">)</span>
      Metal Support: Metal 3
      Displays:
        DELL P2715Q:
          Resolution: 6016 x 3384
          UI Looks like: 3008 x 1692 @ 30.00Hz
          Main Display: Yes
          Mirror: On
          Mirror Status: Master Mirror
          Online: Yes
          Rotation: Supported
        Color LCD:
          Display Type: Built-in Liquid Retina XDR Display
          Resolution: 3456 x 2234 Retina
          Mirror: On
          Mirror Status: Hardware Mirror
          Online: Yes
          Automatically Adjust Brightness: No
          Connection Type: Internal
</code></pre></div></div>
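
<p>Matching on a specific monitor name is enough for a single desk setup. As a more generic alternative, here’s a hedged sketch that counts displays instead. It assumes every display section has a <code class="language-plaintext highlighter-rouge">Resolution:</code> line and that only the built-in panel reports <code class="language-plaintext highlighter-rouge">Connection Type: Internal</code>. That holds for the output above but may not hold on every MacOS version. The function name <code class="language-plaintext highlighter-rouge">has_external_display</code> is hypothetical.</p>

```shell
#!/bin/bash
# Sketch: detect an external display without hard-coding its name.
# Reads system_profiler output on stdin and succeeds (exit 0) when there
# are more displays than internal panels.
has_external_display() {
  awk '
    /Resolution:/               { displays++ }
    /Connection Type: Internal/ { internal++ }
    END { exit (displays + 0 > internal + 0) ? 0 : 1 }
  '
}

# Demo with a trimmed-down copy of the output above.
if printf 'DELL P2715Q:\n  Resolution: 6016 x 3384\nColor LCD:\n  Resolution: 3456 x 2234 Retina\n  Connection Type: Internal\n' | has_external_display
then
  echo "External display detected"
else
  echo "No external display detected"
fi
```

<p>On the Macbook itself, the input would come from <code class="language-plaintext highlighter-rouge">/usr/sbin/system_profiler SPDisplaysDataType | has_external_display</code>.</p>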

<h2 id="solution">Solution</h2>

<p>With those in hand, I assembled this bash script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-e</span>

<span class="c"># Check if my external display is connected.</span>
<span class="c"># For some reason I have to use the full path, else crontab can't find the system_profiler binary.</span>
<span class="c"># If you use this, change the P2715Q to a string that's part of your display name when you run the system_profiler command.</span>
<span class="k">if</span> /usr/sbin/system_profiler SPDisplaysDataType | <span class="nb">grep </span>P2715Q
<span class="k">then
  if</span> <span class="o">[</span> <span class="nt">-f</span> /tmp/brightness <span class="o">]</span>
  <span class="k">then</span>
    <span class="c"># If the /tmp/brightness file already exists, that means the script has already set the brightness</span>
    <span class="c"># (the file is created below, after setting brightness), so we just leave it as-is.</span>
    <span class="c"># This prevents the script from reverting the brightness if I override it manually.</span>
    <span class="nb">echo</span> <span class="s2">"External monitor detected, but /tmp/brightness file already exists, so we're leaving brightness as-is."</span>
  <span class="k">else</span>
    <span class="c"># If the /tmp/brightness file does not exist, that means the script has not yet set the brightness.</span>
    <span class="c"># Set the brightness and create the file.</span>
    <span class="nb">echo</span> <span class="s2">"External monitor detected, and /tmp/brightness file does not exist, so we're dimming brightness."</span>
    <span class="c"># In theory it should be sufficient to call this once,</span>
    <span class="c"># but for some reason it doesn't seem to actually repeat 16 times,</span>
    <span class="c"># so I wrapped it in another loop for good measure</span>
    <span class="k">for </span>i <span class="k">in</span> <span class="o">{</span>1..5<span class="o">}</span>
    <span class="k">do
      </span>osascript <span class="o">&lt;&lt;</span><span class="no">SCRIPT</span><span class="sh">
        tell application "System Events"
          repeat 16 times
            key code 145
          end repeat
        end tell
</span><span class="no">SCRIPT
</span>    <span class="k">done
    </span><span class="nb">touch</span> /tmp/brightness
  <span class="k">fi
else
  </span><span class="nb">echo</span> <span class="s2">"No external monitor detected"</span>
  <span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> /tmp/brightness <span class="o">]</span>
  <span class="k">then</span>
    <span class="c"># If there's no monitor and the file exists, we delete the file.</span>
    <span class="c"># The next time the script detects a monitor, it will set the brightness.</span>
    <span class="nb">echo</span> <span class="s2">"Deleting /tmp/brightness file"</span>
    <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/brightness
  <span class="k">fi
fi</span>
</code></pre></div></div>

<p>I think the script comments should explain the logic adequately.
It’s essentially using <code class="language-plaintext highlighter-rouge">system_profiler</code> to detect my monitor, using <code class="language-plaintext highlighter-rouge">osascript</code> to adjust the brightness, and using a file <code class="language-plaintext highlighter-rouge">/tmp/brightness</code> to signal whether the brightness has already been set.</p>

<p>When I ran it from iTerm, I received and approved a permissions request like this:</p>

<p><img src="permissions-iterm.png" alt="MacOS permissions request to run the script from iTerm" /></p>

<p>Then I configured it to run as a cron job, once per minute:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">*</span> <span class="k">*</span> <span class="k">*</span> <span class="k">*</span> <span class="k">*</span> /path/to/script.sh <span class="o">&gt;</span> /dev/null
</code></pre></div></div>

<p>After about a minute, I received and approved another permissions request:</p>

<p><img src="permissions-cron.png" alt="MacOS permissions request to run the script from cron" /></p>

<p>If I ever need to modify or revoke the permissions, they are stored in <code class="language-plaintext highlighter-rouge">System Settings &gt; Privacy and Security &gt; Accessibility</code>:</p>

<p><img src="permissions-accessibility.png" alt="Permissions in System Settings &gt; Privacy and Security &gt; Accessibility" /></p>

<p>I added the <code class="language-plaintext highlighter-rouge">&gt;/dev/null</code> redirect to the crontab because otherwise MacOS seems to “mail” me the output from stdout.
Without the redirect, each time I open iTerm I see a message “You have new mail”.
I can view and delete it using the <code class="language-plaintext highlighter-rouge">mail</code> program:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last login: Thu Sep 12 19:26:45 on ttys009
You have new mail.
➜  ~ mail
Mail version 8.1 6/6/93.  Type ? <span class="k">for </span>help.
<span class="s2">"/var/mail/alex"</span>: 3 messages 3 new
<span class="o">&gt;</span>N  1 alex@AKMBPRO2021.loc  Thu Sep 12 19:27  19/719   <span class="s2">"Cron &lt;alex@AKMBPRO202"</span>
 N  2 alex@AKMBPRO2021.loc  Thu Sep 12 19:28  19/719   <span class="s2">"Cron &lt;alex@AKMBPRO202"</span>
 N  3 alex@AKMBPRO2021.loc  Thu Sep 12 19:29  19/719   <span class="s2">"Cron &lt;alex@AKMBPRO202
&gt; delete *
&gt; q
</span></code></pre></div></div>

<p>If you still see this message with the redirect in place, that probably means the script is printing to stderr, and you should check why it’s failing.</p>

<h2 id="conclusion">Conclusion</h2>

<p>So now my MacBook automatically dims its screen all the way down when I connect to my external display.
When I disconnect, I just turn the brightness back up.
I’m saving about 10 watts.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[To save a bit of energy while my laptop is mirrored to my external display.]]></summary></entry><entry><title type="html">Connecting a Gli-Net SFT1200 Travel Router to a pfSense OpenVPN server</title><link href="https://alexklibisz.com/2024/02/11/gl-inet-opal-pfsense-openvpn.html" rel="alternate" type="text/html" title="Connecting a Gli-Net SFT1200 Travel Router to a pfSense OpenVPN server" /><published>2024-02-11T15:00:00+00:00</published><updated>2024-02-11T15:00:00+00:00</updated><id>https://alexklibisz.com/2024/02/11/gl-inet-opal-pfsense-openvpn</id><content type="html" xml:base="https://alexklibisz.com/2024/02/11/gl-inet-opal-pfsense-openvpn.html"><![CDATA[<h2 id="tldr">TLDR</h2>

<details>
<summary>Just show me the answer!</summary>
<p></p>
<p>
If you're struggling to connect your GL.iNet SFT1200 Travel Router to your pfSense OpenVPN server, and you've verified the OpenVPN configuration is valid by connecting from another client, then try using the "Legacy Client" option when you export the OpenVPN configuration from pfSense.
This is what ultimately resolved my issues after a few hours of debugging.
</p>
<img src="pfsense-setting-legacy-client.png" />
</details>

<h2 id="background">Background</h2>

<p>Last year I added a <a href="https://www.pfsense.org/">pfSense</a> firewall at home, running on a <a href="https://www.amazon.com/gp/product/B0C69WJ516">$100 mini PC from Amazon</a>.
If you haven’t heard of pfSense, it’s an operating system that acts as your router and firewall, with a bunch of features: highly-customized networking, custom local DNS, built-in VPNs, ad-blocking, etc.
When I say “router”, I don’t mean a wireless router: you have to attach a separate wireless access point.
I just repurposed my old Asus router by running it in AP mode.
Anyway, to learn more about pfSense, I recommend checking out <a href="https://www.youtube.com/watch?v=fsdm5uc_LsU&amp;list=PLjGQNuuUzvmsuXCoj6g6vm1N-ZeLJso6o">Tom Lawrence’s YouTube videos</a>, or
<a href="https://www.youtube.com/watch?v=_IzyJTcnPu8">this video from Linus Tech Tips</a>, which talks about a similar piece of software called OPNsense.
So far, it has worked great.
I don’t think I really needed it, but it’s been a fun way to learn some networking!</p>

<p>More recently, I bought a <a href="https://www.amazon.com/GL-iNet-GL-SFT1200-Secure-Travel-Router/dp/B09N72FMH5">GL.iNet SFT1200 Travel Router</a>.
My main motivation for buying this is to avoid the hassle of connecting multiple devices to a new Wi-Fi network every time I stay at a hotel or short-term rental.
I haven’t checked this yet, but I should also be able to use the travel router to connect to a single paid Wi-Fi network and connect several devices to the router’s Wi-Fi.
At $40, this should only take a handful of flights to pay for itself.
If you’re curious about the other features, <a href="https://www.youtube.com/watch?v=28kDU4qTNt8&amp;pp=ygUSb3BhbCB0cmF2ZWwgcm91dGVy">this video is a good overview</a>.</p>

<p>After purchasing the travel router, I noticed there’s also an option to use the pfSense router as an OpenVPN server and the travel router as an OpenVPN client.
The benefit of this is that I can connect the travel router back to my pfSense router and websites will see my traffic as if I’m at home.
This is nice for streaming sites that get upset when I access them while traveling.</p>

<p>So far I’ve used Tailscale for this (<a href="https://www.youtube.com/watch?v=P-q-8R67OPY">How to Setup The Tailscale VPN and Routing on pfsense</a>), but not all devices have a Tailscale client, and the travel router doesn’t directly support Tailscale.
So it’s nice to just connect the router via OpenVPN.</p>

<p>While I managed to get the OpenVPN client/server setup working, it turned out to be a bit of a rabbit hole!
So I figured I’d document my solution for my future self and for anyone else who might run into this (and for the LLMs that will index it and provide a subtly-incorrect answer to someone else’s related question).</p>

<h2 id="versions">Versions</h2>

<p>If you’re reading this to debug your own issue, be sure to compare my versions to your own.</p>

<p>On the pfSense router, I’m running pfSense community edition 2.7.2 and OpenVPN 2.6.8.
To determine the OpenVPN version, I looked for “OpenVPN 2.” in <a href="https://192.168.1.1/status_logs.php?logfile=openvpn">https://192.168.1.1/status_logs.php?logfile=openvpn</a>.</p>

<p>On the travel router, I’m running the default GL.iNet firmware version 3.216 and OpenVPN 2.5.
To determine the OpenVPN version, I SSHed into the router and ran <code class="language-plaintext highlighter-rouge">openvpn --version</code>.
It printed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OpenVPN 2.5_git mipsel-openwrt-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD]
library versions: OpenSSL 1.1.1i  8 Dec 2020, LZO 2.10
Originally developed by James Yonan
Copyright (C) 2002-2018 OpenVPN Inc &lt;sales@openvpn.net&gt;
</code></pre></div></div>

<h2 id="openvpn-server-on-pfsense">OpenVPN Server on pfSense</h2>

<p>To set up the OpenVPN server on pfSense, I followed this tutorial from Tom Lawrence:</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/I61t7aoGC2Q?si=QgHbrkFchvvFyi7I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>

<p>The only thing I found slightly confusing was how to create the VPN users.
Just click System, then User Manager, and create a standard user.</p>

<p>I had the OpenVPN server running on my home public IP address.
I exported the OpenVPN config as shown in the video, and I verified I could connect from my MacBook, using the <a href="https://openvpn.net/client/">OpenVPN Connect client</a>.</p>

<p>However, my home public IP address is dynamic, so this IP-based config is going to eventually break.
To solve this, I configured <a href="https://www.cloudflare.com/learning/dns/glossary/dynamic-dns/">dynamic DNS</a> in pfSense to make the OpenVPN server accessible at a subdomain of a domain name I own on Cloudflare, based on this video:</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/8GYs61ThGBM?si=ZpruaLifTWgE3Hb3" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>

<p>The only slightly confusing aspect is that you don’t actually need to use a global API key.
You can use a more narrowly-scoped credential, which is always recommended.
There’s a pinned comment that explains this.</p>

<p>So now I had an OpenVPN server running at <code class="language-plaintext highlighter-rouge">mysubdomain.mydomain.com</code>.
I re-exported the config file and verified that I could still connect from my MacBook.</p>

<h2 id="connecting-the-glinet-travel-router-to-the-openvpn-server">Connecting the GL.iNet Travel Router to the OpenVPN server</h2>

<p>This was the trickier part, but it ends up being quite simple once you know which settings to use.</p>

<p>Start by exporting an OpenVPN configuration from <a href="https://192.168.1.1/vpn_openvpn_export.php">https://192.168.1.1/vpn_openvpn_export.php</a>.
When you do this, be sure to use the DDNS subdomain, and <strong>be sure to select the <code class="language-plaintext highlighter-rouge">Legacy Client</code> option</strong>:</p>

<p><img src="pfsense-setting-hostname-resolution.png" alt="pfsense setting for hostname resolution" /></p>

<p><img src="pfsense-setting-legacy-client.png" alt="pfsense setting for legacy client" /></p>

<p>The <code class="language-plaintext highlighter-rouge">Legacy Client</code> option is particularly important.
Without it, I couldn’t get the OpenVPN client working at all; it would just hang with no useful errors and no useful logging.
I scoured the web for at least three hours trying to debug this, and ultimately I compared the OpenVPN versions on the client and server and decided to try this legacy setting.
My best guess is that some new settings were introduced between OpenVPN 2.5.x (running on the client) and OpenVPN 2.6.x (running on the server).</p>

<p>Download the Inline “Most Clients” configuration, which produces a single <code class="language-plaintext highlighter-rouge">.ovpn</code> file:</p>

<p><img src="pfsense-download-ovpn.png" alt="pfsense download ovpn file" /></p>

<p>At this point you have a config file.
Mine looks like this (with sensitive settings redacted):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dev tun
persist-tun
persist-key
ncp-ciphers AES-256-CBC
cipher AES-256-CBC
auth SHA256
tls-client
client
resolv-retry infinite
remote mysubdomain.mydomain.com 1194 udp
nobind
auth-user-pass
remote-cert-tls server
explicit-exit-notify

&lt;ca&gt;
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
&lt;/ca&gt;
setenv CLIENT_CERT 0
key-direction 1
&lt;tls-auth&gt;
...
&lt;/tls-auth&gt;
</code></pre></div></div>

<p>Next, connect to the travel router admin panel.
For my router, the default admin panel is at <a href="http://192.168.8.1">http://192.168.8.1</a>.</p>

<p>Go to VPN on the left and click OpenVPN client.
Click “Add a new OpenVPN Configuration”, enter a description, the username and password from pfSense, and upload the config file.
Click “Connect”, and wait a few seconds.</p>

<p>If it’s working, you should see a green dot next to “OpenVPN Client”, and you’ll be able to connect to websites as usual.
Here’s how it looked for me:</p>

<p><img src="gl-inet-vpn-client.png" alt="travel router admin panel showing the OpenVPN client" /></p>

<p>If it’s not working, you’ll see a yellow dot instead of a green dot, and you won’t be able to connect to any websites.</p>

<p>If you’re not at home, you can also verify that it’s working by going to <a href="https://whatismyip.com">https://whatismyip.com</a>.
You should see your home public IP.</p>

<p>I also saw a couple of warnings (pictured above).
I haven’t figured out what to make of them yet.
As far as I can tell, the first warning is basically saying “if the VPN doesn’t work, you just won’t be able to connect to anything”.
I.e., if the VPN is enabled but doesn’t work, the travel router will block traffic to prevent “leaking”.
The second one is something about IPv6.
There are some IPv6 settings in the travel router, but as far as I could tell, none of them affect this warning.</p>

<h2 id="connection-speeds">Connection Speeds</h2>

<p>I noticed that the VPN seems to affect my connection speeds pretty drastically.</p>

<p>I tested this while traveling, about 2500 miles from home.
Here are my speeds without the VPN:</p>

<p><img src="speed-without-vpn.png" alt="travel router speedtest without VPN" /></p>

<p>Here they are with the VPN:</p>

<p><img src="speed-with-vpn.png" alt="travel router speedtest with VPN" /></p>

<p>So it’s pretty abysmal with the VPN.
It seems to still be good enough for some streaming (with a bit of up-front buffering).
Maybe there are some settings I could modify on the server and client to improve this.</p>

<h2 id="conclusion">Conclusion</h2>

<p>It seems to work!
The main problem in my case was figuring out to use the <code class="language-plaintext highlighter-rouge">Legacy Client</code> setting when exporting my OpenVPN config from pfSense.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Summarizing the rabbithole of getting this travel router connected to a VPN.]]></summary></entry><entry><title type="html">Accelerating vector operations on the JVM using the new jdk.incubator.vector module</title><link href="https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama.html" rel="alternate" type="text/html" title="Accelerating vector operations on the JVM using the new jdk.incubator.vector module" /><published>2023-02-25T15:00:00+00:00</published><updated>2023-02-25T15:00:00+00:00</updated><id>https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama</id><content type="html" xml:base="https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In my work on <a href="https://elastiknn.com/">Elastiknn</a>, I’ve spent many hours looking for ways to optimize vector operations on the Java Virtual Machine (JVM).</p>

<p>The <a href="https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/module-summary.html"><code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> module</a>, introduced in JDK 16 as part of <a href="https://openjdk.org/jeps/338">JEP 338</a> and <a href="https://openjdk.org/projects/panama/">Project Panama</a>, is the first opportunity I’ve encountered for significant performance improvements in this area.</p>

<p>I recently had some time to experiment with this new module, and I cover my benchmarks and findings in this post.
Overall, I found improvements in operations per second on the order of 2x to 3x compared to simple baselines.
If vector operations are a bottleneck in your application, I recommend you try out this module.</p>

<h2 id="background">Background</h2>

<p>For our purposes, a “vector” is simply an array of floats: <code class="language-plaintext highlighter-rouge">float[]</code> in Java, <code class="language-plaintext highlighter-rouge">Array[Float]</code> in Scala, etc. 
These are common in data science and machine learning.</p>

<p>The specific operations I’m interested in are:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of two vectors</li>
  <li><a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> of two vectors</li>
  <li><a href="https://en.wikipedia.org/wiki/Taxicab_geometry">L1 distance</a> (aka, Taxicab distance) between two vectors</li>
  <li><a href="https://en.wikipedia.org/wiki/Euclidean_distance">L2 distance</a> (aka, Euclidean distance) between two vectors</li>
</ul>

<p>These are used commonly in <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search">nearest neighbor search</a>.</p>

<p>The <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> API provides a way to access hardware-level optimizations for processing vectors. 
The two hardware-level optimizations mentioned in JEP 338 are <a href="https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">Streaming SIMD Extensions (SSE)</a> and <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">Advanced Vector Extensions (AVX)</a>.</p>

<p>Here’s my over-simplified understanding of these optimizations: the various processor vendors have agreed on a set of CPU instructions for operating directly on vectors. 
Just like they provide hardware-level instructions for adding two scalars, they now provide hardware-level instructions for adding two vectors.
These optimized instructions have been accessible for many years in lower-level languages like C and C++.
As part of Project Panama, the JDK has recently exposed an API to leverage these optimized instructions directly from JVM languages.
This API is contained in the <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> module.</p>

<h2 id="benchmark-setup">Benchmark Setup</h2>

<p>My benchmarks measure operations per second for five implementations of each of the four vector operations.
I start with a simple baseline and work through four possible optimizations.</p>

<p>I implemented the benchmark in Java and Scala: the actual vector operations are in Java, and the benchmark harness is in Scala.
I use the Java Microbenchmark Harness (JMH) framework, via the sbt-jmh plugin, to execute the benchmark.
This is my first time using JMH in any serious capacity, so I’m happy to hear feedback about better or simpler ways to use it.</p>

<p>Each variation implements this Java interface:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">VectorOperations</span> <span class="o">{</span>
    <span class="c1">// https://en.wikipedia.org/wiki/Cosine_similarity</span>
    <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">);</span>

    <span class="c1">// https://en.wikipedia.org/wiki/Dot_product</span>
    <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">);</span>

    <span class="c1">// https://en.wikipedia.org/wiki/Taxicab_geometry</span>
    <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">);</span>

    <span class="c1">// https://en.wikipedia.org/wiki/Euclidean_distance</span>
    <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>I verify the correctness of each optimization with tests that run the baseline operation and the optimized operation on a pair of random vectors and check that the results match.</p>
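
<p>A parity check of this kind might look roughly like the sketch below.
It is not the test code from the repository: the <code class="language-plaintext highlighter-rouge">ParityCheck</code> class, the <code class="language-plaintext highlighter-rouge">DotProduct</code> interface, the two sample implementations, and the tolerance are all assumptions made for illustration.
The comparison uses a tolerance rather than exact equality, because two implementations that round at different points will rarely agree bit-for-bit.</p>

```java
import java.util.Random;

// Hypothetical helper for illustration; not taken from the benchmark repository.
final class ParityCheck {

    // Stand-in for a single method of the VectorOperations interface.
    interface DotProduct {
        double apply(float[] v1, float[] v2);
    }

    // Baseline: multiply in float, accumulate in double.
    static double baselineDot(float[] v1, float[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++) sum += v1[i] * v2[i];
        return sum;
    }

    // Sample "optimized" variant: fused multiply-add in double precision.
    static double fmaDot(float[] v1, float[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++) sum = Math.fma(v1[i], v2[i], sum);
        return sum;
    }

    // Reproducible random vector with values in [-1, 1).
    static float[] randomVector(Random rng, int length) {
        float[] v = new float[length];
        for (int i = 0; i < length; i++) v[i] = rng.nextFloat() * 2f - 1f;
        return v;
    }

    // True if both implementations agree within a tolerance that scales
    // with the magnitude of the baseline result (but never below relTol).
    static boolean agrees(DotProduct baseline, DotProduct optimized,
                          float[] v1, float[] v2, double relTol) {
        double expected = baseline.apply(v1, v2);
        double actual = optimized.apply(v1, v2);
        double scale = Math.max(1.0, Math.abs(expected));
        return Math.abs(expected - actual) <= relTol * scale;
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        float[] v1 = randomVector(rng, 999);
        float[] v2 = randomVector(rng, 999);
        System.out.println(agrees(ParityCheck::baselineDot, ParityCheck::fmaDot, v1, v2, 1e-4));
    }
}
```

<p>A fixed seed keeps the check reproducible, and the same pattern extends to the other three operations by swapping in a different pair of methods.</p>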

<p>All benchmarks operate on a pair of randomly-generated floating-point vectors containing 999 elements.
I chose length 999 specifically to make us deal with some additional complexity in the implementation.
This will make more sense later in the post.</p>
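
<p>As a rough preview of that complexity, here is a sketch in plain Java (it does not use <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code>).
It assumes the hardware processes 8 floats at a time, which is what a 256-bit AVX2 register holds; the <code class="language-plaintext highlighter-rouge">TailHandling</code> class and its method names are made up for illustration.
Since 999 is not a multiple of 8, the main loop covers only the first 992 elements and a separate tail loop has to handle the remaining 7.</p>

```java
// Hypothetical illustration in plain Java; it does not use jdk.incubator.vector.
final class TailHandling {

    // Assumption: 8 float lanes, as in a 256-bit AVX2 register (256 / 32 = 8).
    static final int LANES = 8;

    // Largest multiple of LANES that does not exceed length.
    static int loopBound(int length) {
        return length - (length % LANES);
    }

    static double dotProduct(float[] v1, float[] v2) {
        double sum = 0.0;
        int bound = loopBound(v1.length);
        int i = 0;
        // Main loop: full chunks of LANES elements. A SIMD implementation
        // would process each chunk with vector instructions instead of
        // this inner scalar loop.
        for (; i < bound; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                sum += (double) v1[i + lane] * v2[i + lane];
            }
        }
        // Tail loop: leftover elements that do not fill a whole chunk.
        for (; i < v1.length; i++) {
            sum += (double) v1[i] * v2[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int length = 999;
        System.out.println(loopBound(length));          // 992
        System.out.println(length - loopBound(length)); // 7
    }
}
```

<p>With a length like 1000 the tail loop would never run, so a benchmark on that length would not exercise this part of the implementation.</p>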

<p>All benchmarks run on Oracle JDK 19.0.2, installed via asdf: <code class="language-plaintext highlighter-rouge">$ asdf install java oracle-19.0.2</code>.</p>

<p>All benchmarks run on my <a href="https://support.apple.com/kb/SP782">2018 Mac Mini</a>, which has an <a href="https://www.intel.com/content/www/us/en/products/sku/134905/intel-core-i78700b-processor-12m-cache-up-to-4-60-ghz/specifications.html">Intel i7-8700B processor</a> with SSE4.1, SSE4.2, and AVX2 instruction set extensions.</p>

<p>Finally, all code is available in my site-projects repository: <a href="https://github.com/alexklibisz/site-projects/tree/main/jdk-incubator-vector-optimizations">jdk-incubator-vector-optimizations</a></p>

<h2 id="baseline-implementation">Baseline Implementation</h2>

<p>We start with a baseline implementation of these vector operations.
No clever tricks here.
This is what we get if we take the definition for each operation from Wikipedia and translate it verbatim into Java:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">BaselineVectorOperations</span> <span class="kd">implements</span> <span class="nc">VectorOperations</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">*</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
            <span class="n">v1SqrSum</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="mi">2</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="mi">2</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span> <span class="o">/</span> <span class="o">(</span><span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v1SqrSum</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v2SqrSum</span><span class="o">));</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">float</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">*</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
        <span class="k">return</span> <span class="n">dotProd</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumAbsDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">]);</span>
        <span class="k">return</span> <span class="n">sumAbsDiff</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="n">sumSqrDiff</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="mi">2</span><span class="o">);</span>
        <span class="k">return</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">sumSqrDiff</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This produces the following results:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
</code></pre></div></div>

<h2 id="fused-multiply-add-mathfma">Fused Multiply Add (<code class="language-plaintext highlighter-rouge">Math.fma</code>)</h2>

<p>Next, we introduce an optimization based on the “Fused Multiply Add” operator, implemented by the <code class="language-plaintext highlighter-rouge">Math.fma</code> method, <a href="https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Math.html#fma(float,float,float)">documented here</a>.</p>

<p><code class="language-plaintext highlighter-rouge">Math.fma</code> takes three floats, <code class="language-plaintext highlighter-rouge">a, b, c</code>, and executes <code class="language-plaintext highlighter-rouge">a * b + c</code> as a single operation.
This has some advantages with respect to floating point error and performance.
Basically, executing one operation is generally faster than two operations, and incurs only one rounding error.</p>
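
<p>The rounding difference is easy to see with hand-picked inputs.
The sketch below is only an illustration: the <code class="language-plaintext highlighter-rouge">FmaRounding</code> class and the specific values are my own choices, picked so that the exact result of <code class="language-plaintext highlighter-rouge">a * a + c</code> is 2<sup>-24</sup>.
The separate multiply rounds that small term away before the addition, while <code class="language-plaintext highlighter-rouge">Math.fma</code> rounds only once at the end and keeps it.</p>

```java
// Hypothetical demo class; the values are chosen so the rounding difference is visible.
final class FmaRounding {

    // Two operations: the product is rounded to float, then the sum is rounded again.
    static float unfused(float a, float b, float c) {
        return a * b + c;
    }

    // One operation: a * b + c is computed exactly and rounded once.
    static float fused(float a, float b, float c) {
        return Math.fma(a, b, c);
    }

    public static void main(String[] args) {
        float a = 1f + 0x1p-12f;    // 1 + 2^-12, exactly representable as a float
        float c = -(1f + 0x1p-11f); // -(1 + 2^-11)
        // Exact math: a * a = 1 + 2^-11 + 2^-24, so a * a + c = 2^-24.
        System.out.println(unfused(a, a, c)); // 0.0
        System.out.println(fused(a, a, c));   // 5.9604645E-8
    }
}
```

<p>Random benchmark inputs will not show a gap this clean, but the same effect is why results from the fused and unfused implementations should be compared with a small tolerance.</p>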

<p>I found a way to use <code class="language-plaintext highlighter-rouge">Math.fma</code> in all the vector operations except <code class="language-plaintext highlighter-rouge">l1Distance</code>.</p>

<p>The implementations look like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">FmaVectorOperations</span> <span class="kd">implements</span> <span class="nc">VectorOperations</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">dotProd</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">dotProd</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v1SqrSum</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2SqrSum</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span> <span class="o">/</span> <span class="o">(</span><span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v1SqrSum</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v2SqrSum</span><span class="o">));</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">float</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">dotProd</span><span class="o">);</span>
        <span class="k">return</span> <span class="n">dotProd</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="c1">// Does not actually leverage Math.fma.</span>
        <span class="kt">double</span> <span class="n">sumAbsDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">]);</span>
        <span class="k">return</span> <span class="n">sumAbsDiff</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">float</span> <span class="n">diff</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">diff</span> <span class="o">=</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
            <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">diff</span><span class="o">,</span> <span class="n">diff</span><span class="o">,</span> <span class="n">sumSqrDiff</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">sumSqrDiff</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The results are:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s
</code></pre></div></div>

<p>We see some improvement in two of the four cases:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">cosineSimilarity</code> is ~1.6x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">l2Distance</code> is ~1.15x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">dotProduct</code> remains about the same, maybe a little worse, though the difference is within the reported error margins.</li>
  <li><code class="language-plaintext highlighter-rouge">l1Distance</code> does not leverage this optimization and predictably does not change much.</li>
</ul>

<h2 id="jdkincubatorvector">jdk.incubator.vector</h2>

<p>Now we jump into optimizations using the <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> module!</p>

<h3 id="crash-course">Crash course</h3>

<p>I have to start with a crash course on this module.
I also highly recommend reading the examples in <a href="https://openjdk.org/jeps/338">JEP 338</a>.</p>

<p>We are primarily using the <code class="language-plaintext highlighter-rouge">jdk.incubator.vector.FloatVector</code> class.</p>

<p>Given a pair of <code class="language-plaintext highlighter-rouge">float[]</code> arrays, the general pattern for using <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> is as follows:</p>

<ul>
  <li>We iterate over strides (i.e., segments or chunks) of the two arrays.</li>
  <li>At each iteration, we use the <code class="language-plaintext highlighter-rouge">FloatVector.fromArray</code> method to copy the current stride from each array into a <code class="language-plaintext highlighter-rouge">FloatVector</code>.</li>
  <li>We call methods on the <code class="language-plaintext highlighter-rouge">FloatVector</code> instances to execute mathematical operations. For example, if we have <code class="language-plaintext highlighter-rouge">FloatVector</code>s <code class="language-plaintext highlighter-rouge">fv1</code> and <code class="language-plaintext highlighter-rouge">fv2</code>, then <code class="language-plaintext highlighter-rouge">fv1.mul(fv2).reduceLanes(VectorOperators.ADD)</code> runs an element-wise multiplication and sums the resulting lanes into a single <code class="language-plaintext highlighter-rouge">float</code>.</li>
</ul>

<p>We also need to know about a helper interface called <code class="language-plaintext highlighter-rouge">jdk.incubator.vector.VectorSpecies</code>, which does the following:</p>

<ul>
  <li>Defines the stride length used for vector operations.</li>
  <li>Provides helper methods for iterating over the arrays in strides.</li>
  <li>Is required to copy values from an array into a <code class="language-plaintext highlighter-rouge">FloatVector</code>.</li>
</ul>

<p>At the time of writing, there are four fixed-width species: <code class="language-plaintext highlighter-rouge">SPECIES_64</code>, <code class="language-plaintext highlighter-rouge">SPECIES_128</code>, <code class="language-plaintext highlighter-rouge">SPECIES_256</code>, and <code class="language-plaintext highlighter-rouge">SPECIES_512</code>, plus two aliases: <code class="language-plaintext highlighter-rouge">SPECIES_MAX</code> and <code class="language-plaintext highlighter-rouge">SPECIES_PREFERRED</code>.
The numbers 64, 128, 256, and 512 refer to the number of bits in a <code class="language-plaintext highlighter-rouge">FloatVector</code>.
A Java <code class="language-plaintext highlighter-rouge">float</code> uses 4 bytes, or 32 bits, so a vector with <code class="language-plaintext highlighter-rouge">SPECIES_256</code> lets us operate on 256 / 32 = 8 floats in a single operation.
I found it’s best to just stick with <code class="language-plaintext highlighter-rouge">SPECIES_PREFERRED</code>, which resolves to <code class="language-plaintext highlighter-rouge">SPECIES_256</code> on my Mac Mini.
Choosing a suboptimal <code class="language-plaintext highlighter-rouge">VectorSpecies</code> can actually decrease throughput drastically.</p>
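
<p>To make the lane arithmetic concrete, here is a small sketch that computes how many <code class="language-plaintext highlighter-rouge">float</code> lanes fit in each species width. It is plain Java and does not touch <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code>; the names <code class="language-plaintext highlighter-rouge">SpeciesLanes</code> and <code class="language-plaintext highlighter-rouge">lanesFor</code> are made up for this illustration. With the real API, I would expect <code class="language-plaintext highlighter-rouge">species.length()</code> to report the same numbers.</p>

```java
final class SpeciesLanes {

    // Number of float lanes in a vector of the given width in bits.
    // Float.SIZE is 32, the number of bits in a Java float.
    static int lanesFor(int vectorBits) {
        if (vectorBits <= 0 || vectorBits % Float.SIZE != 0) {
            throw new IllegalArgumentException(
                "width must be a positive multiple of " + Float.SIZE);
        }
        return vectorBits / Float.SIZE;
    }

    public static void main(String[] args) {
        // The four fixed widths named in the text above.
        for (int bits : new int[] {64, 128, 256, 512}) {
            System.out.println("SPECIES_" + bits + " holds " + lanesFor(bits) + " floats");
        }
    }
}
```

<p>The narrowest species holds only 2 floats per operation and the widest holds 16, which is part of why the choice of species matters for throughput.</p>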

<p>Finally, we need to consider what to do when the array length is not a multiple of the <code class="language-plaintext highlighter-rouge">VectorSpecies</code> length, which includes arrays shorter than a single stride.
If our source arrays have a length that’s an exact multiple of the stride length, then we can iterate over the strides with no elements left over.
Otherwise, we need to figure out what to do with the “tail” of elements that did not fill up a stride.</p>

<p>The way we handle this tail can have a non-negligible performance impact.
This is why I chose to benchmark with vectors of length 999: with 8-float strides, that leaves a 7-element tail, so the tail handling is always exercised.
There are three options for dealing with the tail, and they are the subject of the following three sections.</p>
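
<p>The stride and tail arithmetic can be sketched in a few lines of plain Java. The class <code class="language-plaintext highlighter-rouge">StrideMath</code> and its methods are hypothetical helpers written for this illustration; the <code class="language-plaintext highlighter-rouge">loopBound</code> helper mirrors what I understand <code class="language-plaintext highlighter-rouge">VectorSpecies.loopBound</code> to compute, and the 8-lane stride assumes a 256-bit species.</p>

```java
final class StrideMath {

    // Largest multiple of `lanes` that does not exceed `length`,
    // i.e. the index where the last full stride ends.
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Number of elements left over after the last full stride.
    static int tailLength(int length, int lanes) {
        return length % lanes;
    }

    public static void main(String[] args) {
        int length = 999; // benchmark vector length used in this post
        int lanes = 8;    // assumption: 256-bit species, 8 floats per stride
        System.out.println("full strides: " + loopBound(length, lanes) / lanes);
        System.out.println("loop bound:   " + loopBound(length, lanes));
        System.out.println("tail length:  " + tailLength(length, lanes));
    }
}
```

<p>A length of 1000 would divide evenly into 8-float strides and leave no tail at all, which would hide the cost of whichever tail strategy we pick.</p>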

<h3 id="vectormask-on-every-iteration">VectorMask on every Iteration</h3>

<p>The first way to handle the tail is to avoid handling it separately, by applying a <code class="language-plaintext highlighter-rouge">VectorMask</code> on every stride.
We use the species to compute the <code class="language-plaintext highlighter-rouge">VectorMask</code> for the current offset, and then pass the mask to <code class="language-plaintext highlighter-rouge">FloatVector.fromArray</code>, so lanes that fall past the end of the array are set to zero instead of being read.</p>
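
<p>Before looking at the real implementation, it may help to see a scalar model of what the mask does. The sketch below is plain Java with ordinary arrays standing in for vectors and masks; <code class="language-plaintext highlighter-rouge">MaskedStrideSketch</code> and its methods are made up for this illustration and only imitate the behavior of <code class="language-plaintext highlighter-rouge">indexInRange</code> and the masked <code class="language-plaintext highlighter-rouge">fromArray</code>, without any of the SIMD speedup.</p>

```java
final class MaskedStrideSketch {

    // Scalar stand-in for a mask: lane j is on when offset + j is inside the array.
    static boolean[] indexInRange(int offset, int length, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int j = 0; j < lanes; j++) mask[j] = offset + j < length;
        return mask;
    }

    // Scalar stand-in for a masked load: lanes that are off are never read and hold 0.
    static float[] maskedLoad(float[] src, int offset, boolean[] mask) {
        float[] stride = new float[mask.length];
        for (int j = 0; j < mask.length; j++) stride[j] = mask[j] ? src[offset + j] : 0f;
        return stride;
    }

    static double dotProduct(float[] v1, float[] v2, int lanes) {
        double dotProd = 0.0;
        for (int i = 0; i < v1.length; i += lanes) {
            // A fresh mask is computed on every stride, not just the last one.
            boolean[] m = indexInRange(i, v1.length, lanes);
            float[] s1 = maskedLoad(v1, i, m);
            float[] s2 = maskedLoad(v2, i, m);
            // Multiply lane by lane and sum the lanes; zeroed lanes contribute nothing.
            for (int j = 0; j < lanes; j++) dotProd += s1[j] * s2[j];
        }
        return dotProd;
    }

    public static void main(String[] args) {
        float[] v1 = {1f, 2f, 3f, 4f, 5f};
        float[] v2 = {1f, 1f, 1f, 1f, 1f};
        // 5 elements with 4 lanes: the second stride has 1 live lane and 3 masked lanes.
        System.out.println(dotProduct(v1, v2, 4));
    }
}
```

<p>The appeal of this approach is that one loop covers both the full strides and the tail. The cost is that the mask is recomputed and applied on every stride, even though only the last stride needs it.</p>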

<p>I refer to this as <code class="language-plaintext highlighter-rouge">Jep338FullMask</code>, and the implementations look like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Jep338FullMaskVectorOperations</span> <span class="kd">implements</span> <span class="nc">VectorOperations</span><span class="o">{</span>

    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">VectorSpecies</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">species</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">SPECIES_PREFERRED</span><span class="o">;</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv1</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">+=</span> <span class="n">fv2</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span> <span class="o">/</span> <span class="o">(</span><span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v1SqrSum</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v2SqrSum</span><span class="o">));</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumAbsDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">abs</span><span class="o">().</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">sumAbsDiff</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">,</span> <span class="n">fv3</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv3</span> <span class="o">=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">);</span>
            <span class="c1">// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).</span>
            <span class="n">sumSqrDiff</span> <span class="o">+=</span> <span class="n">fv3</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv3</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">sumSqrDiff</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The results are:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s
</code></pre></div></div>

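<p>The new implementation is actually significantly slower than the baseline in all four cases: about 1.3x slower for <code class="language-plaintext highlighter-rouge">cosineSimilarity</code> and roughly 2.5x to 3x slower for the other three.</p>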

<p>Fortunately, the authors of JEP 338 mention this explicitly:</p>

<blockquote>
  <p>Since a mask is used in all iterations, the above implementation may not achieve optimal performance for large array lengths.</p>
</blockquote>

<p><del>I haven’t looked extensively enough to understand why, but it seems like the <code class="language-plaintext highlighter-rouge">VectorMask</code> is either expensive to create, expensive to use, or maybe both.</del>
Edit: The paper <a href="https://dl.acm.org/doi/pdf/10.1145/3578360.3580265">Java Vector API: Benchmarking and Performance Analysis</a> discusses the performance of <code class="language-plaintext highlighter-rouge">indexInRange</code>, which is used to compute a <code class="language-plaintext highlighter-rouge">VectorMask</code>, in section 5.2.
It turns out that <code class="language-plaintext highlighter-rouge">indexInRange</code> is only optimized on certain platforms, and performs poorly on others.</p>

<h3 id="loop-over-the-tail">Loop over the Tail</h3>

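<p>We can also handle the tail using a plain loop: the vectorized loop runs up to <code class="language-plaintext highlighter-rouge">species.loopBound(v1.length)</code>, the largest multiple of the stride length that fits in the array, and a scalar loop processes whatever elements remain.</p>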

<p>I refer to this as <code class="language-plaintext highlighter-rouge">Jep338TailLoop</code>, and the implementations look like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Jep338TailLoopVectorOperations</span> <span class="kd">implements</span> <span class="nc">VectorOperations</span><span class="o">{</span>

    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">VectorSpecies</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">species</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">SPECIES_PREFERRED</span><span class="o">;</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv1</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">+=</span> <span class="n">fv2</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">dotProd</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">dotProd</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v1SqrSum</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2SqrSum</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span> <span class="o">/</span> <span class="o">(</span><span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v1SqrSum</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v2SqrSum</span><span class="o">));</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">dotProd</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">],</span> <span class="n">dotProd</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumAbsDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">abs</span><span class="o">().</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">]);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">sumAbsDiff</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">,</span> <span class="n">fv3</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv3</span> <span class="o">=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">);</span>
            <span class="c1">// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).</span>
            <span class="n">sumSqrDiff</span> <span class="o">+=</span> <span class="n">fv3</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv3</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
            <span class="kt">float</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">v1</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">-</span> <span class="n">v2</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
            <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="nc">Math</span><span class="o">.</span><span class="na">fma</span><span class="o">(</span><span class="n">diff</span><span class="o">,</span> <span class="n">diff</span><span class="o">,</span> <span class="n">sumSqrDiff</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">sumSqrDiff</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The results are:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
</code></pre></div></div>

<p>This time we have a significant improvement over the baseline:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">cosineSimilarity</code> is ~1.7x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">dotProduct</code> is ~2.9x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">l1Distance</code> is ~2.6x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">l2Distance</code> is ~3x faster.</li>
</ul>
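
<p>These ratios are simply the tail-loop throughput divided by the baseline throughput. As a minimal sketch, the following recomputes them from the scores in the table above. The class name <code class="language-plaintext highlighter-rouge">SpeedupSketch</code> and the <code class="language-plaintext highlighter-rouge">speedup</code> helper are hypothetical names used only for illustration; they are not part of the benchmark code.</p>

```java
// Hypothetical helper for illustration only; not part of the benchmark project.
final class SpeedupSketch {

    // Throughput ratio: optimized ops/s divided by baseline ops/s.
    static double speedup(double baselineOpsPerSec, double optimizedOpsPerSec) {
        return optimizedOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // Scores copied from the results table (ops/s).
        String[] names = {"cosineSimilarity", "dotProduct", "l1Distance", "l2Distance"};
        double[] baseline = {689586.585, 1162553.117, 1095704.529, 951909.125};
        double[] tailLoop = {1169365.506, 3317032.038, 2816348.680, 2897756.618};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: ~%.1fx%n", names[i], speedup(baseline[i], tailLoop[i]));
        }
    }
}
```

<p>The error margins in the table are small relative to the scores, so rounding each ratio to one decimal place seems reasonable.</p>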

<h3 id="vectormask-on-the-tail">VectorMask on the Tail</h3>

<p>What if we use a <code class="language-plaintext highlighter-rouge">VectorMask</code>, but only on the tail of the vector?
Is that faster than using a loop on the tail?</p>
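
<p>To make the tail handling concrete, here is a minimal plain-Java sketch of the two index computations involved. The helpers <code class="language-plaintext highlighter-rouge">loopBound</code> and <code class="language-plaintext highlighter-rouge">indexInRange</code> below are hypothetical stand-ins that mimic what the <code class="language-plaintext highlighter-rouge">VectorSpecies</code> methods of the same names compute for non-negative offsets. They are not the incubator API itself, and the lane count of 8 is just an assumed example.</p>

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration; the real methods live on VectorSpecies.
final class TailMaskSketch {

    // Largest multiple of laneCount that is no greater than length.
    static int loopBound(int length, int laneCount) {
        return length - (length % laneCount);
    }

    // Lane k is active when offset + k is still inside the array (offset assumed non-negative).
    static boolean[] indexInRange(int offset, int limit, int laneCount) {
        boolean[] mask = new boolean[laneCount];
        for (int k = 0; k < laneCount; k++) {
            mask[k] = offset + k < limit;
        }
        return mask;
    }

    public static void main(String[] args) {
        int length = 10;    // assumed array length
        int laneCount = 8;  // assumed number of float lanes in the species
        int bound = loopBound(length, laneCount);
        System.out.println("full vectors cover indices [0, " + bound + ")");
        System.out.println("tail mask: " + Arrays.toString(indexInRange(bound, length, laneCount)));
    }
}
```

<p>With a length of 10 and 8 lanes, the main loop covers indices 0 through 7, and the mask enables only the first two lanes of the final vector, so the remaining lanes are not loaded from the array.</p>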

<p>I refer to this as <code class="language-plaintext highlighter-rouge">Jep338TailMask</code>, and the implementation looks like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Jep338TailMaskVectorOperations</span> <span class="kd">implements</span> <span class="nc">VectorOperations</span> <span class="o">{</span>

    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">VectorSpecies</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">species</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">SPECIES_PREFERRED</span><span class="o">;</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">cosineSimilarity</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v1SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">double</span> <span class="n">v2SqrSum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv1</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">+=</span> <span class="n">fv2</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v1SqrSum</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv1</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
            <span class="n">v2SqrSum</span> <span class="o">+=</span> <span class="n">fv2</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span> <span class="o">/</span> <span class="o">(</span><span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v1SqrSum</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">v2SqrSum</span><span class="o">));</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">dotProduct</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">dotProd</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">dotProd</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">dotProd</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l1Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumAbsDiff</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">abs</span><span class="o">().</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">sumAbsDiff</span> <span class="o">+=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">).</span><span class="na">abs</span><span class="o">().</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="n">sumAbsDiff</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">double</span> <span class="nf">l2Distance</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">v1</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">v2</span><span class="o">)</span> <span class="o">{</span>
        <span class="kt">double</span> <span class="n">sumSqrDiff</span> <span class="o">=</span> <span class="mi">0</span><span class="n">f</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
        <span class="kt">int</span> <span class="n">bound</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">loopBound</span><span class="o">(</span><span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
        <span class="nc">FloatVector</span> <span class="n">fv1</span><span class="o">,</span> <span class="n">fv2</span><span class="o">,</span> <span class="n">fv3</span><span class="o">;</span>
        <span class="k">for</span> <span class="o">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">bound</span><span class="o">;</span> <span class="n">i</span><span class="o">+=</span> <span class="n">species</span><span class="o">.</span><span class="na">length</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
            <span class="n">fv3</span> <span class="o">=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">);</span>
            <span class="c1">// For some unknown reason, fv3.mul(fv3) is significantly faster than fv3.pow(2).</span>
            <span class="n">sumSqrDiff</span> <span class="o">+=</span> <span class="n">fv3</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv3</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
            <span class="nc">VectorMask</span><span class="o">&lt;</span><span class="nc">Float</span><span class="o">&gt;</span> <span class="n">m</span> <span class="o">=</span> <span class="n">species</span><span class="o">.</span><span class="na">indexInRange</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">v1</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
            <span class="n">fv1</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v1</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv2</span> <span class="o">=</span> <span class="nc">FloatVector</span><span class="o">.</span><span class="na">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v2</span><span class="o">,</span> <span class="n">i</span><span class="o">,</span> <span class="n">m</span><span class="o">);</span>
            <span class="n">fv3</span> <span class="o">=</span> <span class="n">fv1</span><span class="o">.</span><span class="na">sub</span><span class="o">(</span><span class="n">fv2</span><span class="o">);</span>
            <span class="n">sumSqrDiff</span> <span class="o">+=</span> <span class="n">fv3</span><span class="o">.</span><span class="na">mul</span><span class="o">(</span><span class="n">fv3</span><span class="o">).</span><span class="na">reduceLanes</span><span class="o">(</span><span class="nc">VectorOperators</span><span class="o">.</span><span class="na">ADD</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="nc">Math</span><span class="o">.</span><span class="na">sqrt</span><span class="o">(</span><span class="n">sumSqrDiff</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The results are:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s
</code></pre></div></div>

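<p>Using a <code class="language-plaintext highlighter-rouge">VectorMask</code> on the tail is slower than a simple loop for <code class="language-plaintext highlighter-rouge">dotProduct</code>, <code class="language-plaintext highlighter-rouge">l1Distance</code>, and <code class="language-plaintext highlighter-rouge">l2Distance</code>. For <code class="language-plaintext highlighter-rouge">cosineSimilarity</code>, the two approaches are within the margin of error.</p>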

<h2 id="complete-benchmark-results">Complete Benchmark Results</h2>

<p>Here are the full results once more for comparison:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt         Score        Error  Units
Bench.cosineSimilarityBaseline        thrpt    6    689586.585 ±  52359.955  ops/s
Bench.cosineSimilarityFma             thrpt    6   1086514.074 ±  15190.380  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6    548425.342 ±  19160.168  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6   1169365.506 ±   5940.850  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6   1166971.620 ±   6927.790  ops/s

Bench.dotProductBaseline              thrpt    6   1162553.117 ±  21328.673  ops/s
Bench.dotProductFma                   thrpt    6   1073368.454 ± 104684.436  ops/s
Bench.dotProductJep338FullMask        thrpt    6    384319.569 ±  20067.109  ops/s
Bench.dotProductJep338TailLoop        thrpt    6   3317032.038 ±  19343.830  ops/s
Bench.dotProductJep338TailMask        thrpt    6   2740443.003 ± 467202.628  ops/s

Bench.l1DistanceBaseline              thrpt    6   1095704.529 ±  32440.308  ops/s
Bench.l1DistanceFma                   thrpt    6   1098354.824 ±  13870.211  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6    356044.308 ±   5186.842  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6   2816348.680 ±  35389.932  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6   2717614.796 ±  14014.855  ops/s

Bench.l2DistanceBaseline              thrpt    6    951909.125 ±  23376.234  ops/s
Bench.l2DistanceFma                   thrpt    6   1101736.286 ±  11985.949  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6    376810.628 ±   3531.977  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6   2897756.618 ±  34180.451  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6   2492492.274 ±  11376.759  ops/s
</code></pre></div></div>

<p>To summarize, the fastest approach for all operations is <code class="language-plaintext highlighter-rouge">Jep338TailLoop</code>.
This uses the <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> API for all strides until the tail of the vectors, and then uses a loop to handle the tails.
Compared to the baseline, this approach yields some substantial improvements:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">cosineSimilarity</code> is ~1.7x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">dotProduct</code> is ~2.9x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">l1Distance</code> is ~2.6x faster.</li>
  <li><code class="language-plaintext highlighter-rouge">l2Distance</code> is ~3x faster.</li>
</ul>
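
<p>For reference, each ratio is just the <code class="language-plaintext highlighter-rouge">Jep338TailLoop</code> score divided by the corresponding baseline score. Here is a minimal sketch of that arithmetic, using the scores from the table above. The class and method names are hypothetical, chosen only for illustration.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final class SpeedupSketch {

    /** Ratio of the optimized throughput to the baseline throughput (both in ops/s). */
    static double speedup(double baselineOpsPerSec, double optimizedOpsPerSec) {
        return optimizedOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // Scores copied from the Baseline and Jep338TailLoop rows of the results table.
        System.out.printf("cosineSimilarity: %.2fx%n", speedup(689586.585, 1169365.506));
        System.out.printf("dotProduct:       %.2fx%n", speedup(1162553.117, 3317032.038));
        System.out.printf("l1Distance:       %.2fx%n", speedup(1095704.529, 2816348.680));
        System.out.printf("l2Distance:       %.2fx%n", speedup(951909.125, 2897756.618));
    }
}
</code></pre></div></div>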

<h2 id="takeaways">Takeaways</h2>

<p>I’ll close with my takeaways from this benchmark.</p>

<p>If vector operations are a bottleneck in your application, and <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> is available on your platform, then it’s worth a try.
In my benchmarks, the speedup over the baseline ranged from about 1.7x to about 3x.</p>

<p>When using <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code>, carefully consider and benchmark the usage of <code class="language-plaintext highlighter-rouge">VectorMask</code>.
This abstraction seems quite expensive.</p>

<p>If <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> is not available, then try using <code class="language-plaintext highlighter-rouge">java.lang.Math.fma</code> where possible.
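In my benchmarks, it offered a noticeable speedup for <code class="language-plaintext highlighter-rouge">cosineSimilarity</code> and <code class="language-plaintext highlighter-rouge">l2Distance</code>, but not for <code class="language-plaintext highlighter-rouge">dotProduct</code> or <code class="language-plaintext highlighter-rouge">l1Distance</code>.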
There are also some other optimized methods in <code class="language-plaintext highlighter-rouge">java.lang.Math</code> that seem like they could be useful.</p>
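
<p>As a rough illustration, a dot product written with <code class="language-plaintext highlighter-rouge">Math.fma</code> might look like the sketch below. The class and method names are hypothetical, and this is not necessarily the exact <code class="language-plaintext highlighter-rouge">Fma</code> variant benchmarked above. <code class="language-plaintext highlighter-rouge">Math.fma</code> itself has been part of <code class="language-plaintext highlighter-rouge">java.lang.Math</code> since Java 9.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final class FmaSketch {

    /** Dot product where each multiply-add is expressed as one Math.fma call. */
    static float dotProduct(float[] v1, float[] v2) {
        float sum = 0f;
        for (int i = 0; i &lt; v1.length; i++) {
            // Computes v1[i] * v2[i] + sum with a single rounding step.
            sum = Math.fma(v1[i], v2[i], sum);
        }
        return sum;
    }
}
</code></pre></div></div>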

<p>My only aesthetic complaint is that the API forces us to duplicate code to handle the vector tail.
However, the API is still far simpler and far more readable than the analogous APIs I’ve seen in C and C++.</p>
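
<p>To make the duplication concrete, here is a plain-Java sketch of the stride-plus-tail structure. It does not use <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code>: the constant <code class="language-plaintext highlighter-rouge">LANES</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">species.length()</code>, and the inner loop stands in for a single vector operation. The point is only that the per-element logic ends up written twice.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final class TailLoopSketch {

    // Hypothetical lane count. With the real API this would come from species.length().
    static final int LANES = 8;

    static double l1Distance(float[] v1, float[] v2) {
        double sumAbsDiff = 0.0;
        int i = 0;
        // Largest multiple of LANES that is no greater than v1.length,
        // analogous to species.loopBound(v1.length).
        int bound = v1.length - (v1.length % LANES);
        for (; i &lt; bound; i += LANES) {
            // Stand-in for one vector operation over LANES elements.
            for (int j = i; j &lt; i + LANES; j++) {
                sumAbsDiff += Math.abs(v1[j] - v2[j]);
            }
        }
        // Tail: the same per-element logic, repeated for the leftover elements.
        for (; i &lt; v1.length; i++) {
            sumAbsDiff += Math.abs(v1[i] - v2[i]);
        }
        return sumAbsDiff;
    }
}
</code></pre></div></div>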

<p>Overall, I’m quite impressed by the <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> module, and I’m excited to see this opportunity for tighter integration between the hardware and JVM.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="related-material">Related Material</h3>

<p>Here is some related material that I found useful:</p>

<ul>
  <li><a href="https://richardstartin.github.io/posts/vector-api-dot-product">Limiting Factors in a Dot Product Calculation</a> by Richard Startin, July 2018.</li>
  <li><a href="https://twitter.com/denis_makogon/status/1574833657247928329">This Twitter thread about jdk.incubator.vector</a> by Denis Makogon, September 2022.</li>
  <li><a href="https://dl.acm.org/doi/pdf/10.1145/3578360.3580265">Java Vector API: Benchmarking and Performance Analysis</a> by Basso et al., February 2023.</li>
  <li><a href="http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/SIMD/index.html">CS494 Lecture Notes - Some simple SIMD examples</a>, Dr. James Plank, November 2019.</li>
</ul>

<h3 id="jdkincubatorvector-in-elastiknn"><code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> in Elastiknn</h3>

<p><em>February 26, 2023</em></p>

<p>If you’d like to see an example of this in a real codebase, I incorporated <code class="language-plaintext highlighter-rouge">jdk.incubator.vector</code> as an optional optimization in Elastiknn in this pull request: <a href="https://github.com/alexklibisz/elastiknn/pull/496">alexklibisz/elastiknn #496</a>.</p>

<p>This led to a speedup of 1.05x to 1.2x on the Elastiknn benchmarks.</p>

<h3 id="floatvectorpow-is-significantly-slower-than-floatvectormul">FloatVector::pow is significantly slower than FloatVector::mul</h3>

<p><em>February 26, 2023</em></p>

<p>While working on the benchmarks above, I found an interesting performance pitfall.
Namely, given a <code class="language-plaintext highlighter-rouge">FloatVector</code> <code class="language-plaintext highlighter-rouge">fv</code>, <code class="language-plaintext highlighter-rouge">fv.mul(fv)</code> is 36x faster than <code class="language-plaintext highlighter-rouge">fv.pow(2)</code>.</p>

<p>The benchmark:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@State</span><span class="o">(</span><span class="nv">Scope</span><span class="o">.</span><span class="py">Benchmark</span><span class="o">)</span>
<span class="k">class</span> <span class="nc">BenchPowVsMulFixtures</span> <span class="o">{</span>
  <span class="k">implicit</span> <span class="k">private</span> <span class="k">val</span> <span class="nv">rng</span><span class="k">:</span> <span class="kt">Random</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Random</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
  <span class="k">val</span> <span class="nv">species</span><span class="k">:</span> <span class="kt">VectorSpecies</span><span class="o">[</span><span class="kt">lang.Float</span><span class="o">]</span> <span class="k">=</span> <span class="nv">FloatVector</span><span class="o">.</span><span class="py">SPECIES_PREFERRED</span>
  <span class="k">val</span> <span class="nv">v</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Float</span><span class="o">]</span> <span class="k">=</span> <span class="o">(</span><span class="mi">0</span> <span class="n">until</span> <span class="nv">species</span><span class="o">.</span><span class="py">length</span><span class="o">()).</span><span class="py">map</span><span class="o">(</span><span class="k">_</span> <span class="k">=&gt;</span> <span class="nv">rng</span><span class="o">.</span><span class="py">nextFloat</span><span class="o">()).</span><span class="py">toArray</span>
  <span class="k">val</span> <span class="nv">fv</span><span class="k">:</span> <span class="kt">FloatVector</span> <span class="o">=</span> <span class="nv">FloatVector</span><span class="o">.</span><span class="py">fromArray</span><span class="o">(</span><span class="n">species</span><span class="o">,</span> <span class="n">v</span><span class="o">,</span> <span class="mi">0</span><span class="o">)</span>
<span class="o">}</span>

<span class="k">class</span> <span class="nc">BenchPowVsMul</span> <span class="o">{</span>

  <span class="nd">@Benchmark</span>
  <span class="nd">@BenchmarkMode</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="nv">Mode</span><span class="o">.</span><span class="py">Throughput</span><span class="o">))</span>
  <span class="nd">@Fork</span><span class="o">(</span><span class="n">value</span> <span class="k">=</span> <span class="mi">1</span><span class="o">)</span>
  <span class="nd">@Warmup</span><span class="o">(</span><span class="n">time</span> <span class="k">=</span> <span class="mi">5</span><span class="o">,</span> <span class="n">iterations</span> <span class="k">=</span> <span class="mi">3</span><span class="o">)</span>
  <span class="nd">@Measurement</span><span class="o">(</span><span class="n">time</span> <span class="k">=</span> <span class="mi">5</span><span class="o">,</span> <span class="n">iterations</span> <span class="k">=</span> <span class="mi">6</span><span class="o">)</span>
  <span class="k">def</span> <span class="nf">mul</span><span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">BenchPowVsMulFixtures</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="nv">f</span><span class="o">.</span><span class="py">fv</span><span class="o">.</span><span class="py">mul</span><span class="o">(</span><span class="nv">f</span><span class="o">.</span><span class="py">fv</span><span class="o">)</span>
  
  <span class="nd">@Benchmark</span>
  <span class="nd">@BenchmarkMode</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="nv">Mode</span><span class="o">.</span><span class="py">Throughput</span><span class="o">))</span>
  <span class="nd">@Fork</span><span class="o">(</span><span class="n">value</span> <span class="k">=</span> <span class="mi">1</span><span class="o">)</span>
  <span class="nd">@Warmup</span><span class="o">(</span><span class="n">time</span> <span class="k">=</span> <span class="mi">5</span><span class="o">,</span> <span class="n">iterations</span> <span class="k">=</span> <span class="mi">3</span><span class="o">)</span>
  <span class="nd">@Measurement</span><span class="o">(</span><span class="n">time</span> <span class="k">=</span> <span class="mi">5</span><span class="o">,</span> <span class="n">iterations</span> <span class="k">=</span> <span class="mi">6</span><span class="o">)</span>
  <span class="k">def</span> <span class="nf">pow</span><span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">BenchPowVsMulFixtures</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="nv">f</span><span class="o">.</span><span class="py">fv</span><span class="o">.</span><span class="py">pow</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The results:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark           Mode  Cnt           Score           Error  Units
BenchPowVsMul.mul  thrpt    6  1235649170.757 ± 216838871.439  ops/s
BenchPowVsMul.pow  thrpt    6    34105529.504 ±   4899654.049  ops/s
</code></pre></div></div>

<p><del>I have no idea why this would be, but it seems like it could be a bug in the underlying implementation.</del></p>

<p>Update (March 1, 2023): I messaged the panama dev mailing list and got an explanation for this:</p>

<blockquote>
  <p>The performance difference you observe is because the pow operation is falling back to scalar code (Math.pow on each lane element) and not using vector instructions.
On x86 linux or windows you should observe better performance of the pow operation because it should leverage code from Intel’s Short Vector Math Library [1], but that code OS specific and is not currently ported on Mac OS.</p>
</blockquote>

<p><a href="https://mail.openjdk.org/pipermail/panama-dev/2023-February/018735.html">View the full thread here.</a></p>
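
<p>To picture what that scalar fallback means, here is a rough plain-Java sketch. The first method calls <code class="language-plaintext highlighter-rouge">Math.pow</code> once per element, which is roughly what the quoted explanation describes for <code class="language-plaintext highlighter-rouge">pow</code> on this platform. The second is only a scalar analogue of <code class="language-plaintext highlighter-rouge">mul</code>, which in the real API maps to vector instructions. The class and method names are hypothetical.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final class PowVsMulSketch {

    /** Squares each element by calling Math.pow once per element. */
    static float[] squareWithPow(float[] v) {
        float[] out = new float[v.length];
        for (int i = 0; i &lt; v.length; i++) {
            out[i] = (float) Math.pow(v[i], 2);
        }
        return out;
    }

    /** Squares each element with a plain multiplication. */
    static float[] squareWithMul(float[] v) {
        float[] out = new float[v.length];
        for (int i = 0; i &lt; v.length; i++) {
            out[i] = v[i] * v[i];
        }
        return out;
    }
}
</code></pre></div></div>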

<h3 id="results-on-apple-silicon-m1-macbook-air">Results on Apple Silicon (M1 MacBook Air)</h3>

<p><em>February 26, 2023</em></p>

<p>I was curious how this would look on Apple silicon, so I also ran the benchmark on my <a href="https://support.apple.com/kb/SP825">2020 M1 MacBook Air</a>.
Other than the host machine, the benchmark setup is identical to the results above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Benchmark                              Mode  Cnt           Score         Error  Units
Bench.cosineSimilarityBaseline        thrpt    6     1081276.495 ±   39124.749  ops/s
Bench.cosineSimilarityFma             thrpt    6      836076.757 ±      47.184  ops/s
Bench.cosineSimilarityJep338FullMask  thrpt    6     1050298.090 ±      81.960  ops/s
Bench.cosineSimilarityJep338TailLoop  thrpt    6     1532220.920 ±  179274.549  ops/s
Bench.cosineSimilarityJep338TailMask  thrpt    6     1452240.849 ±    6500.704  ops/s

Bench.dotProductBaseline              thrpt    6     1136628.672 ±  180804.344  ops/s
Bench.dotProductFma                   thrpt    6      912200.315 ±    8839.657  ops/s
Bench.dotProductJep338FullMask        thrpt    6      272444.658 ±    1642.048  ops/s
Bench.dotProductJep338TailLoop        thrpt    6     4062575.031 ±    1541.393  ops/s
Bench.dotProductJep338TailMask        thrpt    6     3372980.017 ±    4655.095  ops/s

Bench.l1DistanceBaseline              thrpt    6     1134803.520 ±   22165.180  ops/s
Bench.l1DistanceFma                   thrpt    6     1146026.997 ±    2952.262  ops/s
Bench.l1DistanceJep338FullMask        thrpt    6      271181.722 ±     416.756  ops/s
Bench.l1DistanceJep338TailLoop        thrpt    6     4062832.939 ±     249.915  ops/s
Bench.l1DistanceJep338TailMask        thrpt    6     3362605.808 ±   20805.696  ops/s

Bench.l2DistanceBaseline              thrpt    6     1108095.885 ±    4237.677  ops/s
Bench.l2DistanceFma                   thrpt    6      860659.029 ±    8911.938  ops/s
Bench.l2DistanceJep338FullMask        thrpt    6      269202.529 ±     326.229  ops/s
Bench.l2DistanceJep338TailLoop        thrpt    6     2026410.994 ±     201.837  ops/s
Bench.l2DistanceJep338TailMask        thrpt    6     3273131.452 ±   11284.378  ops/s
</code></pre></div></div>

<p>Here are the Baseline measurements merged:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chip     Benchmark                        Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityBaseline  thrpt    6   689586.585 ±   52359.955  ops/s
Apple M1 Bench.cosineSimilarityBaseline  thrpt    6  1081276.495 ±   39124.749  ops/s

Intel i7 Bench.dotProductBaseline        thrpt    6  1162553.117 ±   21328.673  ops/s
Apple M1 Bench.dotProductBaseline        thrpt    6  1136628.672 ±  180804.344  ops/s

Intel i7 Bench.l1DistanceBaseline        thrpt    6  1095704.529 ±   32440.308  ops/s
Apple M1 Bench.l1DistanceBaseline        thrpt    6  1134803.520 ±   22165.180  ops/s

Intel i7 Bench.l2DistanceBaseline        thrpt    6   951909.125 ±   23376.234  ops/s
Apple M1 Bench.l2DistanceBaseline        thrpt    6  1108095.885 ±    4237.677  ops/s
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">cosineSimilarity</code> baseline is noticeably faster on the M1.
The others are comparable.</p>

<p>Here are the Jep338TailLoop measurements merged:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chip     Benchmark                              Mode  Cnt        Score         Error  Units
Intel i7 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1169365.506 ±    5940.850  ops/s
Apple M1 Bench.cosineSimilarityJep338TailLoop  thrpt    6  1532220.920 ±  179274.549  ops/s

Intel i7 Bench.dotProductJep338TailLoop        thrpt    6  3317032.038 ±   19343.830  ops/s
Apple M1 Bench.dotProductJep338TailLoop        thrpt    6  4062575.031 ±    1541.393  ops/s

Intel i7 Bench.l1DistanceJep338TailLoop        thrpt    6  2816348.680 ±   35389.932  ops/s
Apple M1 Bench.l1DistanceJep338TailLoop        thrpt    6  4062832.939 ±     249.915  ops/s

Intel i7 Bench.l2DistanceJep338TailLoop        thrpt    6  2897756.618 ±   34180.451  ops/s
Apple M1 Bench.l2DistanceJep338TailLoop        thrpt    6  2026410.994 ±     201.837  ops/s
</code></pre></div></div>

<p>In this case, the M1 is faster in all but the <code class="language-plaintext highlighter-rouge">l2Distance</code>.
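The M1’s error bounds are also impressively tight on <code class="language-plaintext highlighter-rouge">dotProduct</code>, <code class="language-plaintext highlighter-rouge">l1Distance</code>, and <code class="language-plaintext highlighter-rouge">l2Distance</code>, though not on <code class="language-plaintext highlighter-rouge">cosineSimilarity</code>.</p>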

<p><em>June 7, 2023</em></p>

<p>I re-ran the benchmark on my M1 Max MacBook Pro, and the results were completely different!
So please beware of benchmarking on the M-series chips.
It’s an interesting endeavor, but also even more of a rabbit hole than benchmarking on x86.</p>

<h3 id="java-vector-api-benchmarking-and-performance-analysis-by-basso-et-al">Java Vector API: Benchmarking and Performance Analysis by Basso et al.</h3>

<p><em>February 27, 2023</em></p>

<p>I discovered the paper <a href="https://dl.acm.org/doi/pdf/10.1145/3578360.3580265">Java Vector API: Benchmarking and Performance Analysis</a> by Basso et al. shortly after releasing this post.
It looks like they beat me to the release by about a week!
This paper is an extremely thorough analysis of the topic, so I also highly recommend reading it.</p>

<p>In section 5.2, the authors discuss the negative performance effects of using <code class="language-plaintext highlighter-rouge">indexInRange</code>.
This matches up well with what I observed in the <code class="language-plaintext highlighter-rouge">Jep338FullMask</code> optimization.
It turns out that using <code class="language-plaintext highlighter-rouge">indexInRange</code> to build a <code class="language-plaintext highlighter-rouge">VectorMask</code> is only performant on systems supporting predicate registers.
I guess this particular feature is supported by neither my Intel Mac Mini nor my M1 MacBook Air.</p>
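
<p>Conceptually, the mask built by <code class="language-plaintext highlighter-rouge">indexInRange</code> marks which lanes of a stride still fall inside the array. Here is a plain-Java sketch of that idea, with hypothetical names and a <code class="language-plaintext highlighter-rouge">boolean[]</code> standing in for the real <code class="language-plaintext highlighter-rouge">VectorMask</code>. It says nothing about how the JVM actually implements masks. It only shows the per-lane bookkeeping that a masked load implies.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final class MaskSketch {

    /**
     * Lane n is set only when offset + n is a valid index, i.e. in [0, limit).
     * The lanes parameter is a hypothetical stand-in for species.length().
     */
    static boolean[] indexInRange(int lanes, int offset, int limit) {
        boolean[] mask = new boolean[lanes];
        for (int n = 0; n &lt; lanes; n++) {
            int index = offset + n;
            mask[n] = index &gt;= 0 &amp;&amp; index &lt; limit;
        }
        return mask;
    }
}
</code></pre></div></div>

<p>For example, with 4 lanes, an offset of 8, and an array of length 10, only the first two lanes are set, so a masked load would read two elements and leave the other two lanes at zero.</p>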

<p>In section 5.3, the authors discuss how <code class="language-plaintext highlighter-rouge">.pow</code> is far slower than <code class="language-plaintext highlighter-rouge">.mul</code>.
This aligns with my findings, which are also discussed in the appendix of this post, although I observed a much larger difference when evaluating the two operations in isolation.</p>

<h3 id="bug-in-jep338fullmaskvectoroperations">Bug in Jep338FullMaskVectorOperations</h3>

<p><em>May 20, 2023</em></p>

<p>My original implementation of <code class="language-plaintext highlighter-rouge">Jep338FullMaskVectorOperations</code> had a small bug. I fixed it <a href="https://github.com/alexklibisz/site-projects/pull/6">in this PR on 5/20/23</a> and updated the code in this post.
Thanks to Twitter user Varun Thacker for <a href="https://twitter.com/varunthacker/status/1659719167665397760">finding it and proposing a fix</a>.</p>

<h3 id="warning-mathfma-can-be-extremely-slow-on-some-platforms">Warning: Math.fma can be extremely slow on some platforms</h3>

<p>In late May 2023, I noticed <a href="https://twitter.com/thetaph1/status/1662148334134476835">a Tweet from long-time Lucene contributor Uwe Schindler discussing how Lucene had also implemented vector optimizations based on the Panama Vector API</a>.
I replied with <a href="https://twitter.com/AlexKlibisz/status/1662514518680039424">a link to this post and my implementation in Elastiknn</a>.
Uwe kindly responded with a <a href="https://twitter.com/thetaph1/status/1662753591034101760">warning about performance pitfalls of Math.fma.</a></p>

<p>I’m still exploring the effects of this in Elastiknn.
I have some rough data indicating that it is in fact significantly slower on some platforms, so I’ll likely remove it.
I figured it’s worth quickly mentioning here.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post shows how we can accelerate vector operations on the JVM using the new `jdk.incubator.vector` module, introduced in JDK 16 as part of Project Panama.]]></summary></entry><entry><title type="html">Are Postgres functions faster than queries? (a very simple benchmark)</title><link href="https://alexklibisz.com/2022/12/18/are-postgres-functions-faster-than-queries.html" rel="alternate" type="text/html" title="Are Postgres functions faster than queries? (a very simple benchmark)" /><published>2022-12-18T15:00:00+00:00</published><updated>2022-12-18T15:00:00+00:00</updated><id>https://alexklibisz.com/2022/12/18/are-postgres-functions-faster-than-queries</id><content type="html" xml:base="https://alexklibisz.com/2022/12/18/are-postgres-functions-faster-than-queries.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In my work with Postgres and other databases, I’ve often heard the statement “stored procedures (i.e., <em>functions</em> in Postgres) are faster than queries.”</p>

<p>To be precise, “stored procedure” refers to a Postgres function, defined using the <a href="https://www.postgresql.org/docs/current/sql-createfunction.html"><code class="language-plaintext highlighter-rouge">create function</code> command</a>, and executed by passing a string like <code class="language-plaintext highlighter-rouge">select * from some_function(...)</code> from the client to the database server.
“Query” refers to a standard SQL query (e.g., <code class="language-plaintext highlighter-rouge">select * from some_table where id = 10</code>), defined by the client, and executed by passing the literal query string from the client to the database server.</p>

<p>I’m willing to accept that a function is faster than a query if the function avoids passing intermediate results back to the client.
There’s an obvious cost to sending bytes over the network, so, all else equal, a function that keeps intermediate results in the database is going to execute faster than multiple queries that pass intermediate results to the client.</p>

<p>However, that’s not really what I’m after.
I’m more interested in answering this question:</p>

<blockquote>
  <p>Assuming the function and query are executing the same underlying statements and passing the same data back to the client, is the function faster than the query?</p>
</blockquote>

<p>In this post, I take a first pass at answering this question based on a very simple benchmark.</p>

<details>
<summary>Expand for the spoiler!</summary>
On this benchmark, the difference between Postgres functions and queries is negligible.
I.e., the answer to the title is: <i>not really</i>.
</details>
<p><br /></p>

<h1 id="the-benchmark">The Benchmark</h1>

<p>All the benchmarking code is available in my <a href="https://github.com/alexklibisz/site-projects/tree/main/postgres-queries-vs-functions-performance">site-projects repo</a>, but here’s a quick summary.</p>

<h2 id="database-schema">Database Schema</h2>

<p>I created a small blog-style database with two tables for posts and comments.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">table</span> <span class="n">post</span> <span class="p">(</span>
  <span class="n">post_id</span> <span class="nb">serial</span> <span class="k">primary</span> <span class="k">key</span><span class="p">,</span>
  <span class="n">post_uuid</span> <span class="n">uuid</span> <span class="k">unique</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">contents</span> <span class="nb">text</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">created_at</span> <span class="nb">timestamp</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">updated_at</span> <span class="nb">timestamp</span>
<span class="p">);</span>
<span class="k">create</span> <span class="k">table</span> <span class="k">comment</span> <span class="p">(</span>
  <span class="n">comment_id</span> <span class="nb">serial</span> <span class="k">primary</span> <span class="k">key</span><span class="p">,</span>
  <span class="n">comment_uuid</span> <span class="n">uuid</span> <span class="k">unique</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">post_id</span> <span class="nb">serial</span> <span class="k">references</span> <span class="n">post</span><span class="p">(</span><span class="n">post_id</span><span class="p">),</span>
  <span class="n">contents</span> <span class="nb">text</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">created_at</span> <span class="nb">timestamp</span> <span class="k">not</span> <span class="k">null</span><span class="p">,</span>
  <span class="n">updated_at</span> <span class="nb">timestamp</span>
<span class="p">);</span>
<span class="k">create</span> <span class="k">index</span> <span class="n">comment_post_id_idx</span> <span class="k">on</span> <span class="k">comment</span><span class="p">(</span><span class="n">post_id</span><span class="p">);</span>
</code></pre></div></div>

<p>I populated the database with 100,000 random posts and 1,000,000 random comments (exactly 10 comments per post).</p>

<p>For the comparison, I wrote a function and query to count the number of comments for a given <code class="language-plaintext highlighter-rouge">post_uuid</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">function</span> <span class="n">comment_count_by_post_uuid</span><span class="p">(</span><span class="n">uuid</span><span class="p">)</span>
<span class="k">returns</span> <span class="nb">integer</span>
<span class="k">language</span> <span class="k">sql</span>
<span class="n">parallel</span> <span class="n">safe</span>
<span class="k">returns</span> <span class="k">null</span> <span class="k">on</span> <span class="k">null</span> <span class="k">input</span>
<span class="k">as</span> <span class="err">$$</span>
    <span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="n">comment_id</span><span class="p">)</span>
    <span class="k">from</span> <span class="k">comment</span> <span class="k">c</span>
    <span class="k">join</span> <span class="n">post</span> <span class="n">p</span> <span class="k">on</span> <span class="k">c</span><span class="p">.</span><span class="n">post_id</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">post_id</span>
    <span class="k">where</span> <span class="n">p</span><span class="p">.</span><span class="n">post_uuid</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span>
<span class="err">$$</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="n">comment_id</span><span class="p">)</span>
<span class="k">from</span> <span class="k">comment</span> <span class="k">c</span>
<span class="k">join</span> <span class="n">post</span> <span class="n">p</span> <span class="k">on</span> <span class="k">c</span><span class="p">.</span><span class="n">post_id</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">post_id</span>
<span class="k">where</span> <span class="n">p</span><span class="p">.</span><span class="n">post_uuid</span> <span class="o">=</span> <span class="o">%</span><span class="n">s</span><span class="p">::</span><span class="n">uuid</span>
</code></pre></div></div>

<h2 id="benchmarking-scripts">Benchmarking Scripts</h2>

<p>To execute the functions and queries, I wrote a Python script that does the following:</p>

<ol>
  <li>Connects to Postgres using the <code class="language-plaintext highlighter-rouge">psycopg2</code> client library.</li>
  <li>Selects all the post UUIDs, sorts them, and keeps them in memory.</li>
  <li>If called with arg <code class="language-plaintext highlighter-rouge">function</code>, loops over the post UUIDs and runs the function on each UUID.</li>
  <li>If called with arg <code class="language-plaintext highlighter-rouge">query</code>, loops over the post UUIDs and runs the query on each UUID.</li>
</ol>
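<p>The script itself isn’t reproduced here, but a minimal sketch of the steps above might look like the following. The names <code class="language-plaintext highlighter-rouge">load_post_uuids</code> and <code class="language-plaintext highlighter-rouge">run_benchmark</code> are hypothetical, and the cursor is assumed to be any DB-API cursor that accepts <code class="language-plaintext highlighter-rouge">%s</code> placeholders, as <code class="language-plaintext highlighter-rouge">psycopg2</code> cursors do.</p>

```python
import time

# Hypothetical constants: the SQL mirrors the function call and query shown above.
FUNCTION_SQL = "select * from comment_count_by_post_uuid(%s::uuid)"
QUERY_SQL = (
    "select count(comment_id) "
    "from comment c "
    "join post p on c.post_id = p.post_id "
    "where p.post_uuid = %s::uuid"
)


def load_post_uuids(cursor):
    """Select all post UUIDs, sort them, and keep them in memory."""
    cursor.execute("select post_uuid from post")
    return sorted(str(row[0]) for row in cursor.fetchall())


def run_benchmark(cursor, post_uuids, mode):
    """Run the function or the query once per UUID and return elapsed seconds."""
    if mode not in ("function", "query"):
        raise ValueError("mode must be 'function' or 'query'")
    sql = FUNCTION_SQL if mode == "function" else QUERY_SQL
    start = time.perf_counter()
    for post_uuid in post_uuids:
        cursor.execute(sql, (post_uuid,))
        cursor.fetchone()  # read the single count so the round trip completes
    return time.perf_counter() - start
```

<p>With <code class="language-plaintext highlighter-rouge">psycopg2</code>, the cursor would come from <code class="language-plaintext highlighter-rouge">psycopg2.connect(...).cursor()</code>; passing it in as a parameter keeps the loop easy to exercise without a database.</p>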

<p>I incorporated this Python script in a bash script that does the following:</p>

<ol>
  <li>Starts the Postgres container and runs migrations to create tables and insert random data.</li>
  <li>Restarts the container to reset metrics.</li>
  <li>Starts recording container metrics from <code class="language-plaintext highlighter-rouge">docker stats</code>.</li>
  <li>Runs the query benchmark.</li>
  <li>Restarts the container to reset metrics.</li>
  <li>Starts recording container metrics from <code class="language-plaintext highlighter-rouge">docker stats</code>.</li>
  <li>Runs the function benchmark.</li>
  <li>Shuts down the containers.</li>
  <li>Plots the results.</li>
</ol>

<h2 id="environment">Environment</h2>

<p>I ran Postgres 15.1.0 in a Docker container on my <a href="https://support.apple.com/kb/SP782?locale=en_US">2018 Intel i7 Mac Mini</a>.
I used a combination of docker-compose and Postgres configurations to limit the resources: 1 CPU, 8GB memory reservation, 8GB memory limit, and <code class="language-plaintext highlighter-rouge">shared_buffers</code> set to 4096MB.
I ran the benchmarking script on the same Mac Mini to eliminate variability from a network connection.</p>

<h2 id="analysis">Analysis</h2>

<p>To analyze the execution time, I simply used the <code class="language-plaintext highlighter-rouge">time</code> command.</p>

<p>To analyze database resource usage, I wrote a simple bash script that polls the <code class="language-plaintext highlighter-rouge">docker stats</code> command and exports the following metrics:</p>

<ul>
  <li>Container CPU usage (%)</li>
  <li>Container memory usage (MiB)</li>
  <li>Network in (MB)</li>
  <li>Network out (MB)</li>
</ul>
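<p>The polling script is bash and isn’t shown here. As a rough Python sketch of the parsing step only, assume each sample line was produced with a format string such as <code class="language-plaintext highlighter-rouge">docker stats --no-stream --format "{{.CPUPerc}};{{.MemUsage}};{{.NetIO}}"</code>. The separator, the function names, and the unit table are assumptions for illustration.</p>

```python
import re

# Assumed unit table: binary units for memory, decimal units for network I/O.
_UNIT_BYTES = {
    "B": 1,
    "kB": 1e3, "MB": 1e6, "GB": 1e9,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
}


def to_bytes(text):
    """Convert a size such as '512MiB' or '1.5MB' to a number of bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_BYTES[unit]


def parse_stats_line(line):
    """Parse one 'cpu;mem used / mem limit;net in / net out' sample line."""
    cpu, mem, net = line.strip().split(";")
    mem_used = mem.split("/")[0]
    net_in, net_out = net.split("/")
    return {
        "cpu_percent": float(cpu.strip().rstrip("%")),
        "memory_mib": to_bytes(mem_used) / 1024**2,
        "network_in_mb": to_bytes(net_in) / 1e6,
        "network_out_mb": to_bytes(net_out) / 1e6,
    }
```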

<h1 id="why-would-the-function-be-faster">Why would the function be faster?</h1>

<p>Before we get into the results, let’s speculate a bit: if the function and query are doing the same thing, and returning the same data, why would the function be faster?</p>

<p>A few possible reasons come to mind for me:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">select * from comment_count_by_post_uuid(...)</code> is a shorter string than the equivalent SQL query, so by calling the function, we’re sending fewer bytes over the network. Perhaps this is enough to give the function an edge?</li>
  <li>Perhaps the function lets Postgres avoid repeatedly parsing the underlying query?</li>
  <li>Perhaps the function lets Postgres cache the query plan, and therefore skip query planning?</li>
</ul>

<p>We’ll see if any of these hold up.</p>

<h1 id="results">Results</h1>

<p>The execution times for 100,000 iterations are about 1 minute and 35 seconds for both the function and the query, or about 1050 iterations per second.</p>
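<p>The throughput figure follows directly from the two numbers above. As a quick check, using the rounded time of 1 minute and 35 seconds:</p>

```python
iterations = 100_000
elapsed_seconds = 1 * 60 + 35  # about 1 minute and 35 seconds

iterations_per_second = iterations / elapsed_seconds
milliseconds_per_iteration = 1000 * elapsed_seconds / iterations

print(round(iterations_per_second))          # 1053
print(round(milliseconds_per_iteration, 2))  # 0.95
```

<p>So each round trip, including the Python client overhead, takes a little under one millisecond.</p>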

<p>The container metrics are plotted below:</p>

<p><a href="/assets/img/posts/are-postgres-functions-faster-than-queries/metrics.png"><img src="/assets/img/posts/are-postgres-functions-faster-than-queries/metrics.png" alt="Metrics" /></a></p>

<p>Execution time and memory usage are indistinguishable between the function and the query.</p>

<p>The query seems to use slightly less CPU than the function.
I’m not sure why this would be the case, but it happens consistently.</p>

<p>Network input is consistently higher for the query. 
This makes sense, as we’re sending a larger string over the network to the database.</p>

<p>Network output is very slightly higher for the function.
I’m not sure why, but this also happens consistently.</p>

<p>I ran this several times and also swapped the order (function first, then query, and vice-versa).
The only difference I observed was the execution times moving a couple seconds in either direction.
Otherwise, everything remains as described above.</p>

<h1 id="interpretation">Interpretation</h1>

<p>Based on these results, for the simple select query that I benchmarked, functions are neither obviously better nor worse than queries.</p>

<p>Functions have a slight advantage in lower network input, since the function call is a shorter string than the query.
Queries might have a slight advantage in lower CPU usage.</p>

<p>In cases like this, my tie-breaker tends to be maintainability.
With the toolchains I’ve used, queries are typically easier to maintain and evolve than functions.
Changing a function generally requires both a database migration and application deployment, whereas changing a query generally only requires an application deployment.</p>

<p>Note, this is by no means a definitive conclusion!
I can think of several scenarios where I would still consider both a function and a query.
But, for a simple select query like the one I benchmarked, I will probably reach for a query over a function.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post compares the performance of Postgres functions and queries on a very simple benchmark.]]></summary></entry><entry><title type="html">My takeaways from _Effective Software Testing_ by Maurício Aniche</title><link href="https://alexklibisz.com/2022/06/19/my-takeaways-from-effective-software-testing-by-mauricio-aniche.html" rel="alternate" type="text/html" title="My takeaways from _Effective Software Testing_ by Maurício Aniche" /><published>2022-06-19T15:00:00+00:00</published><updated>2022-06-19T15:00:00+00:00</updated><id>https://alexklibisz.com/2022/06/19/my-takeaways-from-effective-software-testing-by-mauricio-aniche</id><content type="html" xml:base="https://alexklibisz.com/2022/06/19/my-takeaways-from-effective-software-testing-by-mauricio-aniche.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>I recently finished reading <a href="https://www.manning.com/books/effective-software-testing"><em>Effective Software Testing</em></a> by <a href="https://twitter.com/mauricioaniche">Maurício Aniche</a> as part of a weekly book club with some teammates at work.
This post summarizes my biggest takeaways from the book.
I also took the liberty of mixing in my own related musings about software testing.</p>

<h2 id="my-background">My Background</h2>

<p>It might be useful to quickly mention my background, as it affects my opinion of the book:</p>

<ul>
  <li>I work primarily on cloud services for energy systems, using a combination of Scala, Akka, Postgres, Kafka, Kubernetes, and a long tail of other technologies.</li>
  <li>The cloud services I work on exist in large part to model and interact with hardware, but I have very little firmware experience. So I can’t say much about firmware testing, other than that I know enough to appreciate it can be a different beast.</li>
  <li>I enjoy writing automated unit and integration tests. I value the peace of mind it gives me about the services I build and maintain, especially as they evolve. I also enjoy the challenge of finding a way to test and verify something particularly tricky.</li>
</ul>

<h1 id="overall-impressions">Overall Impressions</h1>

<p>First, I recommend this book to any professional software engineer.</p>

<p>To summarize it in one statement: the book provides thorough coverage of testing techniques, with practical tips and examples for practitioners, based on a foundation of research and experience.</p>

<p>I found the book gave precise terms for some of the best practices I’ve learned from experience. 
For example, I had an intuitive understanding that I can efficiently test a complex logical expression by perturbing each parameter such that it affects the result. 
The book taught me that this idea has a name: <a href="https://en.wikipedia.org/wiki/Modified_condition/decision_coverage"><em>modified condition/decision coverage</em></a>.</p>
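<p>To make the idea concrete, here is a small sketch (not taken from the book) that enumerates the pairs of inputs in which flipping a single condition flips the outcome. The <code class="language-plaintext highlighter-rouge">decision</code> expression and the helper name <code class="language-plaintext highlighter-rouge">independence_pairs</code> are hypothetical.</p>

```python
from itertools import product


def decision(a, b, c):
    """Hypothetical decision under test."""
    return a and (b or c)


def independence_pairs(func, arity):
    """For each condition index, list the input pairs that differ only in that
    condition and produce different outcomes."""
    pairs = {i: [] for i in range(arity)}
    for inputs in product([False, True], repeat=arity):
        for i in range(arity):
            if inputs[i]:
                continue  # visit each pair once, starting from the False side
            flipped = inputs[:i] + (True,) + inputs[i + 1:]
            if func(*inputs) != func(*flipped):
                pairs[i].append((inputs, flipped))
    return pairs


for index, found in independence_pairs(decision, 3).items():
    print(index, found)
```

<p>For this expression, four inputs are enough to cover one pair per condition, for example (True, False, False), (True, True, False), (True, False, True), and (False, True, False), compared with eight for the full truth table.</p>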

<p>I especially recommend the book to those who <em>dread</em> testing. It presents a first-principles motivation for testing and a set of reliable approaches and tools for effective testing.</p>

<p>The book uses Java for all examples, so it’s particularly valuable for engineers already in that ecosystem, but it’s also accessible to those working in other languages.
Having worked primarily in Scala and Python, I can say the techniques translate to those languages, too.</p>

<p>For those in a rush, I recommend reading the introduction and summary of each chapter, and then reading chapters 2, 6, and 7 (specification-based testing, test doubles and mocks, and designing for testability) as they were particularly information-rich.</p>

<h1 id="takeaways">Takeaways</h1>

<h2 id="test-effectively-and-systematically">Test <em>effectively</em> and <em>systematically</em></h2>

<p>In chapter 1, the author establishes that we should focus on <em>effective</em> and <em>systematic</em> testing, and presents methods for both throughout the book.</p>

<p><em>How do we know we’re testing effectively?</em></p>

<p>I find it helpful to consider the antithesis to effective testing: <em>ad-hoc testing</em>.</p>

<p>When we practice ad-hoc testing, we implement a feature and then toss in a few test cases just before submitting it for review.
We test whatever comes to mind – maybe an example from the Jira or GitHub ticket, maybe something we copy-paste from another test.
Over time, we end up with a hodgepodge of miscellaneous tests that people happened to think of.
The resulting suite is bloated, difficult to evolve, and leaves us with either a feeling of pointlessness or false confidence in correctness based only on the number of tests.
I’ve found this is also what leaves engineers with a dislike for testing.</p>

<p>In contrast, <em>effective</em> testing is the process by which we arrive at a set of tests that verify the correctness of an implementation, are maintainable, and yield an efficient ratio of time spent testing to number of bugs prevented.</p>

<p>Another way to think about it is to consider the information or signal-to-noise ratio of each test.
Effective testing reminds me of the <a href="https://en.wikipedia.org/wiki/20Q">20 questions game I played as a kid</a>.
We get to ask our software 20 questions.
Based on the answers, we choose to deploy to production.
We’d better pick questions with a high signal-to-noise ratio!</p>

<p><em>How do we know we’re testing systematically?</em></p>

<p>One of my favorite heuristics presented in the book is that two engineers from the same team, given the same requirements, working in the same codebase, should arrive at the same test suite.</p>

<p>We rarely get a chance to run this experiment, but we can also consider our reaction to tests in code review.
If we find it hard to follow tests, or find a teammate’s tests arbitrary or surprising, it might mean the team needs to make its testing practices more systematic.</p>

<h2 id="exhaustive-testing-is-intractable-but-thats-no-excuse">Exhaustive testing is intractable, but that’s no excuse</h2>

<p>In chapter 1, the author establishes that exhaustively testing all possible paths through any interesting piece of software is an intractable problem.</p>

<p>As a simple example, the author presents a system with N boolean configurations.
This system requires 2^N test cases for exhaustive coverage.
At <a href="https://www.wolframalpha.com/input?i=2%5E266">N=266</a> we exceed the number of atoms in the visible universe.</p>
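<p>As a quick sanity check of that threshold, the sketch below assumes the commonly cited rough estimate of 10^80 atoms (the constant is an assumption, not a figure from the book) and finds the smallest N for which 2^N exceeds it.</p>

```python
ATOMS_ESTIMATE = 10**80  # assumption: rough, commonly cited order of magnitude


def exhaustive_cases(n_flags):
    """Number of test cases needed to cover every combination of N booleans."""
    return 2**n_flags


def smallest_n_exceeding(limit):
    """Smallest N for which the exhaustive case count exceeds the limit."""
    n = 0
    while limit >= exhaustive_cases(n):
        n += 1
    return n


print(smallest_n_exceeding(ATOMS_ESTIMATE))  # 266
```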

<p>So we should accept we can’t test exhaustively. Does this make testing pointless? No.</p>

<p>The author shows us how we can write fewer, but more effective tests through a combination of techniques: partitioning, boundary analysis, pruning unrealistic test cases, and the kind of creativity that results from thoroughly understanding the domain.</p>

<p>If we understand how inputs affect behavior, the boundaries at which inputs change behavior, and which inputs are nonsensical, we can prune the explosion of test cases down to a handful that actually matter.
The author walks through an example of this type of pruning in chapter 2.</p>

<p>A brief musing: I find it interesting that, in some cases, property-based testing (sometimes called randomized testing) is an antidote to the intractability of exhaustive testing.
As a trivial example, imagine we’re testing a method that multiplies two positive integers.
For any pair of inputs, we can verify the method’s output using addition.
If we use new randomized inputs every time we run the test suite, we can, in the limit, explore and verify the entire input space.
I particularly enjoyed <a href="https://www.youtube.com/watch?v=zD57QKzqdCw">this talk about randomized testing in Apache Lucene and Solr</a>.</p>
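<p>A minimal sketch of the multiplication example, using only the standard library: <code class="language-plaintext highlighter-rouge">multiply</code> stands in for the method under test, and repeated addition serves as the oracle. All names here are hypothetical, and a dedicated library such as Hypothesis would normally handle input generation and shrinking.</p>

```python
import random


def multiply(a, b):
    """Hypothetical method under test."""
    return a * b


def multiply_by_addition(a, b):
    """Reference oracle: add a to itself b times."""
    total = 0
    for _ in range(b):
        total += a
    return total


def check_multiply(trials=200, max_value=1000, seed=None):
    """Compare the method against the oracle on random positive integers."""
    rng = random.Random(seed)  # seed=None draws fresh inputs on every run
    for _ in range(trials):
        a = rng.randint(1, max_value)
        b = rng.randint(1, max_value)
        assert multiply(a, b) == multiply_by_addition(a, b), (a, b)
    return trials


check_multiply()
```

<p>Leaving the seed unset explores new inputs on each run; logging the seed when a run fails would make the failure reproducible.</p>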

<h2 id="code-coverage-is-a-tool-not-a-goal">Code coverage is a tool, not a goal</h2>

<p>The author introduces code coverage in chapter 3.</p>

<p>The rough idea of code coverage is that we can run a tool to identify parts of our software (classes, methods, branches, etc.) which are untested by our test suite.
We can review the untested parts and adapt or add tests to cover them.</p>

<p>Like any simple metric, code coverage can be abused. 
If we consider coverage as the ultimate goal, we risk wasting time on pointless tests and arriving at a false sense of confidence about the correctness of our software.</p>

<p>However, as the author argues in chapter 3, we can still use code coverage to offload the cognitive burden of identifying the parts of software we might have forgotten to test.</p>

<p>I use a code coverage tool anytime I’m implementing or refactoring a substantial feature.
I periodically run the tool and examine the outputs to catch any blind spots.
When I find untested areas that are particularly risky or interesting, I revise or add test cases to cover them.</p>

<p>In some situations it’s fine to omit a test case.
For example, I generally trust the correctness of a language’s standard library or an established open-source library.</p>

<p>At times, I’ve also integrated code coverage as a requirement in continuous integration.
I still have reservations about this, as some tools can be flaky or misleading.
At the very least, code coverage in CI is a good way to set a lower-bound “safety net” for coverage.</p>

<h2 id="mutation-testing-seems-like-a-powerful-complement-to-code-coverage">Mutation testing seems like a powerful complement to code coverage</h2>

<p>The author briefly introduces mutation testing in chapter 3.</p>

<p>The rough idea of mutation testing is that we can automatically generate mutations of our codebase and run the test suite against each mutation.
A mutation can be as simple as flipping <code class="language-plaintext highlighter-rouge">&lt;</code> to <code class="language-plaintext highlighter-rouge">&gt;</code> or <code class="language-plaintext highlighter-rouge">==</code> to <code class="language-plaintext highlighter-rouge">!=</code>.
If our tests are effective, then at least one test should fail for each mutation.
If no tests fail, then we should consider adding a test case to cover the mutation.</p>
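<p>Here is a hand-rolled sketch of a single mutation, with a hypothetical <code class="language-plaintext highlighter-rouge">is_adult</code> function and two hypothetical test suites. Real tools generate and run the mutants automatically; this only shows why a surviving mutant points at a missing test.</p>

```python
def is_adult(age):
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the comparison `>=` was changed to `>`."""
    return age > 18


def weak_suite(func):
    """Passes (returns True) for any implementation that handles far-off values."""
    return func(30) is True and func(5) is False


def boundary_suite(func):
    """Adds the boundary cases that distinguish `>=` from `>`."""
    return weak_suite(func) and func(18) is True and func(17) is False


print(weak_suite(is_adult_mutant))      # True: the mutant survives
print(boundary_suite(is_adult_mutant))  # False: the mutant is killed
```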

<p>This seems like a particularly powerful complement to code coverage.
It’s easy to get perfect code coverage with mediocre tests: just call each method and make an inconsequential assertion.
It seems more difficult to get perfect code coverage <em>and</em> mutation coverage with mediocre tests.</p>

<p>I’ve experimented with some libraries in the <a href="https://stryker-mutator.io/">Stryker Mutator ecosystem</a> for toy projects, but still need to run them on something more interesting.</p>

<h2 id="the-role-of-specification-based-testing-and-structural-testing">The role of specification-based testing and structural testing</h2>

<p>The author covers specification-based testing and structural testing in chapters 2 and 3.</p>

<p>Prior to reading the book, I had an intuitive understanding of specification-based testing but zero knowledge of structural testing as a distinct concept.</p>

<p>Here’s my current understanding.</p>

<p>In specification-based testing, we start with a non-code specification and implement tests to demonstrate our system adheres to the specification. 
These tests usually sound something like, “some HTTP call returns a 401 when the caller’s token has expired.” 
With a sufficiently literate testing framework, specification-based tests can be read by a non-technical teammate and impart confidence that a feature is implemented as specified.</p>

<p>Structural tests, on the other hand, largely ignore the specification and are tailored specifically to the implementation details.
Maybe there’s something particularly interesting about the way we choose to return a 401 response vs. a 403 response, but it’s never stated explicitly in the specification.
Structural tests are a good place to exercise and document this kind of detail.</p>

<p>At the risk of painting in broad strokes, maybe we can distinguish the tests by audience?
Specification-based tests are for the product owners who provided the specification, and structural tests are for the other engineers who will continue to maintain and evolve the implementation.</p>

<h2 id="language-design-matters">Language design matters</h2>

<p>The author covers contract design in chapter 4.</p>

<p>This gets into the details of input validation, pre-conditions, post-conditions, exceptions vs. return values, and so on.</p>

<p>Seeing the concrete implementations of these concepts in Java made me particularly grateful for the Scala language and how its primitives simplify contract design.
Some quick examples:</p>

<ol>
  <li>Scala’s case classes are far less verbose than Java POJOs. This makes it simple to quickly define and provide an informative return type or a test data type.</li>
  <li>Scala has native <code class="language-plaintext highlighter-rouge">Try</code> and <code class="language-plaintext highlighter-rouge">Either</code> constructs. These make it simple to provide known errors as return values instead of throwing exceptions.</li>
  <li>Scala has had <code class="language-plaintext highlighter-rouge">Option</code> since day one. This means <code class="language-plaintext highlighter-rouge">null</code> is virtually non-existent (though, sadly, technically still legal) in Scala code.</li>
  <li>Scala can verify a pattern-match is exhaustive at compile-time.</li>
  <li>Scala has libraries that encode invariants in a type. For example, <code class="language-plaintext highlighter-rouge">NonEmptyList</code> and <code class="language-plaintext highlighter-rouge">NonEmptySet</code> in the cats library.</li>
</ol>

<p>So, even though I agree with the author that “Correct by Design” is a myth (chapter 1), certain languages offer primitives that get us closer to this mythical goal.</p>

<h2 id="stubs-vs-mocks-and-testing-state-vs-interaction">Stubs vs. mocks and testing state vs. interaction</h2>

<p>The author covers dummies, fakes, stubs, mocks, spies, and fixtures in chapter 6.</p>

<p>I found the distinction between stubs and mocks particularly useful.</p>

<p>In summary, a stub is an object that adheres to some API with a simplistic implementation (e.g., methods return hard-coded data).
A mock does the same thing, but also provides mechanisms to verify how it was used (e.g., the number of times a specific method was called).</p>

<p>An analogous distinction is that of <em>state testing</em> vs. <em>interaction testing</em>.
My understanding is that state testing lets us verify the ultimate state (the result) of some method, whereas interaction testing lets us verify the specific interactions (of classes, methods, etc.) which led to this state.
Stubs facilitate state testing, whereas mocks facilitate interaction testing.</p>

<p>In most cases we really only care about the ultimate state, so stubs are sufficient.
In other cases, we might actually care that a specific method was invoked with a specific input a specific number of times, etc.
In these cases, we need mocks.</p>
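<p>The book’s examples are in Java; as a rough Python sketch of the same distinction, the snippet below tests a hypothetical <code class="language-plaintext highlighter-rouge">convert</code> function once with a hand-written stub (state testing) and once with <code class="language-plaintext highlighter-rouge">unittest.mock.Mock</code> (interaction testing).</p>

```python
from unittest.mock import Mock


class StubRateSource:
    """Stub: returns hard-coded data and keeps no record of how it was used."""

    def get_rate(self, currency):
        return 2.0


def convert(amount, currency, rate_source):
    """Hypothetical code under test."""
    return amount * rate_source.get_rate(currency)


def test_state_with_stub():
    # State testing: only the result matters.
    assert convert(10, "EUR", StubRateSource()) == 20.0


def test_interaction_with_mock():
    # Interaction testing: also verify how the dependency was called.
    rate_source = Mock()
    rate_source.get_rate.return_value = 2.0
    assert convert(10, "EUR", rate_source) == 20.0
    rate_source.get_rate.assert_called_once_with("EUR")


test_state_with_stub()
test_interaction_with_mock()
```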

<p>I think I’ve only ever needed to verify interactions for very performance-sensitive code.
Even then, a broader benchmarking harness was more useful for catching performance regressions.</p>

<h2 id="writing-testable-code-is-important-and-dependency-injection-helps">Writing testable code is important and dependency injection helps</h2>

<p>The author covers techniques for writing testable code in chapter 7.</p>

<p>The fundamental argument for writing testable code is simple: if code is difficult to test, we won’t test it, and we’ll inevitably introduce bugs.
I agree with this, both in principle and from experience.</p>

<p>One pattern seems particularly useful for writing testable code: dependency injection.</p>

<p>The core idea of dependency injection (DI) is pretty simple.
Define an abstract interface for each area of our domain and for each external dependency.
Implement a concrete implementation for each interface.
Critically, don’t let the concrete implementations communicate directly with other implementations.
Instead, they can only communicate through the interfaces.</p>

<p>When executed well, the benefit of DI is that we can test each implementation in isolation.
We do this by injecting a controlled test double (dummy, fake, stub, mock, etc.) for each of the interfaces required by a particular implementation.</p>

<p>There are dedicated libraries for DI in any mainstream language.
In Scala, I’ve found it’s usually sufficient to just use a <code class="language-plaintext highlighter-rouge">trait</code> for the interface, a <code class="language-plaintext highlighter-rouge">final class</code> for the implementation, and constructor parameters for the actual dependency injection.</p>
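<p>The same shape carries over to Python, roughly: a <code class="language-plaintext highlighter-rouge">typing.Protocol</code> plays the role of the interface, a plain class is the implementation, and dependencies arrive through the constructor. The class and method names below are hypothetical.</p>

```python
from typing import Protocol


class CommentStore(Protocol):
    """Interface for an external dependency."""

    def count_for_post(self, post_uuid: str) -> int: ...


class PostSummaryService:
    """Implementation that only talks to the dependency through the interface."""

    def __init__(self, comments: CommentStore) -> None:
        self._comments = comments

    def summary(self, post_uuid: str) -> str:
        count = self._comments.count_for_post(post_uuid)
        return f"{post_uuid}: {count} comments"


class InMemoryCommentStore:
    """Test double injected in place of a database-backed store."""

    def __init__(self, counts: dict) -> None:
        self._counts = counts

    def count_for_post(self, post_uuid: str) -> int:
        return self._counts.get(post_uuid, 0)


service = PostSummaryService(InMemoryCommentStore({"abc": 10}))
print(service.summary("abc"))  # abc: 10 comments
```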

<p>Like any other technique, DI can be taken to a counterproductive extreme.
There are definitely cases where we want to test one or more concrete implementations together, covered in chapter 9.</p>

<h2 id="tdd-is-a-method-for-implementing-not-a-method-for-testing">TDD is a method for implementing, not a method for testing</h2>

<p>The author covers test-driven development (TDD) in chapter 8.</p>

<p>I found the most interesting point in this chapter was the distinction of TDD as a method to guide development, not a method for testing.</p>

<p>In other words, we can use TDD to arrive at an implementation, but we still need to re-evaluate the tests.
We might keep some or most of the tests.
Or we might realize the tests were effective as a means to an end (the implementation), but are not actually <em>effective tests</em>.</p>

<p>This was a refreshing perspective.
I’ve found it challenging to follow TDD as I previously understood it, simply because the tests I want to write while implementing are usually different from the tests I want to write to verify the implementation.
When I first start on an implementation, I’m generally satisfied testing with some scratch code in a simple <code class="language-plaintext highlighter-rouge">main</code> method that I know I’ll later discard.
When I verify the implementation, I want to optimize for readability and maintainability.</p>

<h2 id="only-integration-test-the-dependencies-that-you-exclusively-own">Only integration test the dependencies that you exclusively own</h2>

<p>The author covers large tests, including system and integration tests, in chapter 9.</p>

<p>Based on my experience, and the author’s perspectives, I’ve arrived at a heuristic for integration testing: I only integration test the external dependencies that my service exclusively owns.</p>

<p>For example, if my service exclusively owns a Postgres database, I’ll write integration tests against a realistic, local Postgres container.
If my service also depends on some other shared service, I stub or mock all interactions with that service.</p>

<p>If I find that I actually need to think about the implementation details of another service (e.g., the underlying database), there’s probably a bug or a leaky abstraction in the other service.
Instead of leaking the detail into my service, I work with the owner to fix the underlying issue.
The alternative is that every service knows implementation details about other services, which is clearly untenable.</p>

<p>In other words, every service owner should write tests to ensure the service can be trusted to work as described in the API contract.</p>

<h2 id="keep-stubs-small-and-specific">Keep stubs small and specific</h2>

<p>The author covers some test smells and anti-patterns in chapter 10.</p>

<p>One that I was particularly happy to see covered was <em>fixtures that are too general</em>.
This test smell consists of a large fixture that is shared by many tests.</p>

<p>Another way I’ve seen this play out is a singleton stub shared by many tests.
This approach is expedient when first implementing a test suite, as it lets us share the same stub in many places (DRY).
However, it quickly becomes brittle: when many tests depend on the specific behaviors of a shared stub, a trivial change or addition in the stub will break several unrelated tests.</p>

<p>The solution is to keep stubs as close and specific as possible to the tests that use them.
It might result in more test code overall, but it keeps the tests decoupled and easier to maintain.</p>

<h1 id="conclusion">Conclusion</h1>

<p>Buy and read <a href="https://www.manning.com/books/effective-software-testing"><em>Effective Software Testing</em></a> by <a href="https://twitter.com/mauricioaniche">Maurício Aniche</a>.
It’s well worth the time and cost!</p>

<h1 id="appendix">Appendix</h1>

<h2 id="discussion">Discussion</h2>

<ul>
  <li>The <a href="https://trendingintesting.com/trending-in-testing-weekly-newsletter/">Trending in Testing weekly newsletter</a> highlighted this post in issue #38.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[This post summarizes my takeaways from Effective Software Testing. Overall, I recommend the book to any professional software engineer.]]></summary></entry><entry><title type="html">Optimizing Postgres Text Search with Trigrams</title><link href="https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search.html" rel="alternate" type="text/html" title="Optimizing Postgres Text Search with Trigrams" /><published>2022-02-18T15:00:00+00:00</published><updated>2022-02-18T15:00:00+00:00</updated><id>https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search</id><content type="html" xml:base="https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In this post, we’ll implement and optimize a text search system based on <a href="https://www.postgresql.org/docs/14/pgtrgm.html">Postgres Trigrams</a>.</p>

<p>We’ll start with some fundamental concepts, then define a test environment based on a dataset of 8.9 million Amazon reviews, then cover three possible optimizations.</p>

<p>Our search will start out very slow, taking about 360 seconds.
With some thoughtful optimization we’ll end up at just over 100 milliseconds – a ~3600x speedup!
These optimizations won’t apply perfectly to every text search use-case, but they should at the very least spark some ideas.</p>

<h2 id="defining-text-search">Defining “text search”</h2>

<p>For our purposes, “text search” is defined as follows:</p>

<ol>
  <li>We have a database table with multiple text columns.</li>
  <li>A user provides one input: a query string (think Google or Amazon search box).</li>
  <li>We search for ten rows in our table for which at least one of the columns matches the query string. A match can be either exact or fuzzy. For example, the query string “foobar” is an exact match for the value “foobarbaz” and a fuzzy match for the value “foo bar”.</li>
  <li>Once we’ve found ten such rows, we score them, re-rank them by their scores, and return them to the user.</li>
</ol>

<h2 id="why-postgres">Why Postgres?</h2>

<p>Postgres is a ubiquitous relational database, but dedicated search systems like Solr, Elasticsearch, and OpenSearch are far better-known for text search.</p>

<p>Still, Postgres offers some competent text search functionality, with several benefits over a dedicated search system:</p>

<ol>
  <li>We avoid operating additional infrastructure (e.g., an Elasticsearch cluster).</li>
  <li>We avoid syncing data between systems (e.g., Postgres to Elasticsearch).</li>
  <li>We avoid re-implementing non-trivial features (e.g., multi-tenant authorization).</li>
</ol>

<p>Having implemented and operated search functionality on both Postgres and Elasticsearch, my current heuristics for choosing between them are:</p>

<ul>
  <li>If search is our core offering, or we can’t afford to fit our searchable text in Postgres’ shared memory buffer, use Elasticsearch and earmark one engineer for operations and tuning.</li>
  <li>Otherwise, Postgres is probably good enough. At the very least, we’ll end up with a competent baseline for further improvement.</li>
</ul>

<p>As a case study, GitLab has publicly documented their journey in growing from Postgres trigram-based search to advanced search in Elasticsearch.<sup id="fnref:gitlab-elasticsearch" role="doc-noteref"><a href="#fn:gitlab-elasticsearch" class="footnote" rel="footnote">1</a></sup></p>

<h2 id="what-are-trigrams">What are Trigrams?</h2>

<p>A trigram is simply a three-character sequence from a string.</p>

<p>For example, the trigrams in <code class="language-plaintext highlighter-rouge">"hello"</code> are <code class="language-plaintext highlighter-rouge">{"  h"," he","hel","ell","llo","lo "}</code>.</p>

<p>Trigrams present a simple solution for string comparison: to compare two strings, we can count the number of trigrams they share.</p>

<p>For example, “hello” and “helo” share 4 trigrams. “hello” and “bye” share 0. This isn’t fool-proof: “hello” shares more trigrams with “jello” than with “hi”. But it’s usually a strong baseline.</p>
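<p>As a quick sanity check of these counts, here is a minimal sketch that assumes the <code class="language-plaintext highlighter-rouge">pg_trgm</code> extension can be installed on the server. It uses the extension’s <code class="language-plaintext highlighter-rouge">show_trgm</code> function to list the trigrams of a string, then counts the trigrams shared by “hello” and “helo”:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumes sufficient privileges to install the pg_trgm extension.
create extension if not exists pg_trgm;

-- List the trigrams Postgres extracts from a string.
select show_trgm('hello');

-- Count the trigrams shared by two strings.
select count(*) as shared_trigrams
from (
  select unnest(show_trgm('hello'))
  intersect
  select unnest(show_trgm('helo'))
) as shared;
</code></pre></div></div>

<p>The second query should return 4, matching the count above.</p>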

<h3 id="why-not-full-text-search">Why not Full Text Search?</h3>

<p>Postgres also offers <a href="https://www.postgresql.org/docs/14/textsearch.html">Full Text Search</a>. 
I’m personally more familiar with trigram-based search, which is part of the reason we’ll use trigrams in this post.</p>

<p>From some experimenting with Full Text Search, I’ve found the API focuses more narrowly on natural language text (i.e., words, spaces, punctuation) than on general-purpose text (i.e., natural language text <em>and</em> product SKUs <em>and</em> email addresses, …).<sup id="fnref:gitlab-full-text-search" role="doc-noteref"><a href="#fn:gitlab-full-text-search" class="footnote" rel="footnote">2</a></sup></p>

<p>Still, some optimizations in this post might translate nicely to Full Text Search.</p>

<h1 id="the-test-environment">The Test Environment</h1>

<p>The full test environment is available <a href="https://github.com/alexklibisz/site-projects/tree/main/optimizing-postgres-text-search-with-trigrams">on GitHub</a>.
Let’s have a look at some specific components.</p>

<h2 id="host">Host</h2>

<p>Everything is running on my Dell XPS-9570 Laptop with an Intel i7-8750H, 32GB of memory, an SSD, and Ubuntu 20.04. 
Exact timing will vary by the host environment, but the relative performance should be host-agnostic.</p>

<h2 id="postgres">Postgres</h2>

<p>We’ll use Postgres version 14.1.0, running the <a href="https://hub.docker.com/r/bitnami/postgresql/tags"><code class="language-plaintext highlighter-rouge">bitnami/postgresql:14.1.0</code></a> container. 
All examples should work on Postgres &gt;= 13.</p>

<p>We’ll set two <a href="https://www.postgresql.org/docs/14/runtime-config.html">server configurations</a>:</p>

<div class="language-conf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Provide 4GB of memory for buffers.
# This lets us keep the full table and indexes in memory.
</span><span class="n">shared_buffers</span> = <span class="m">4096</span><span class="n">MB</span>

<span class="c"># Set to true so that `explain (... buffers ...)` will include IO timings.
</span><span class="n">track_io_timing</span> = <span class="n">true</span>
</code></pre></div></div>

<p>We’ll also set two <a href="https://www.postgresql.org/docs/14/runtime-config-query.html">query planning configurations</a>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- This disables parallel gathers, as I've found they produce highly</span>
<span class="c1">-- variable results depending on the host system.</span>
<span class="k">set</span> <span class="n">max_parallel_workers_per_gather</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="c1">-- This makes it more likely that the query planner chooses an index scan</span>
<span class="c1">-- instead of another strategy. I've found this will generally improve</span>
<span class="c1">-- performance on any system with an SSD.</span>
<span class="k">set</span> <span class="n">random_page_cost</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">9</span><span class="p">;</span>
</code></pre></div></div>
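<p>To confirm these settings took effect in the current session, one option is to read them back from the <code class="language-plaintext highlighter-rouge">pg_settings</code> system view. This is only a sketch for checking the setup and is not part of the benchmark itself:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Read back the four settings used in this post.
select name, setting, unit
from pg_settings
where name in (
  'shared_buffers',
  'track_io_timing',
  'max_parallel_workers_per_gather',
  'random_page_cost'
);
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">shared_buffers</code> is reported in units of blocks, so with the default 8kB block size, 4096MB shows up as 524288.</p>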

<h2 id="amazon-review-dataset">Amazon Review Dataset</h2>

<p>We’ll use data from the <a href="http://jmcauley.ucsd.edu/data/amazon/links.html">Amazon Review Dataset</a> to demonstrate our optimizations.<sup id="fnref:jmacauley-citation" role="doc-noteref"><a href="#fn:jmacauley-citation" class="footnote" rel="footnote">3</a></sup></p>

<p>Specifically, we’ll use text properties from the 5-core Book Reviews Subset, a dataset of 8.9 million reviews for books sold on Amazon. 
An example review shows the shape of our dataset:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"reviewerID"</span><span class="p">:</span><span class="w"> </span><span class="s2">"A2SUAM1J3GNN3B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"asin"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0000013714"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"reviewerName"</span><span class="p">:</span><span class="w"> </span><span class="s2">"J. McDonald"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"helpful"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">],</span><span class="w">
  </span><span class="nl">"reviewText"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I bought this for my husband who plays the piano. ..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"overall"</span><span class="p">:</span><span class="w"> </span><span class="mf">5.0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"summary"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Heavenly Highway Hymns"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"unixReviewTime"</span><span class="p">:</span><span class="w"> </span><span class="mi">1252800000</span><span class="p">,</span><span class="w">
  </span><span class="nl">"reviewTime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"09 13, 2009"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Each of these reviews includes five text properties: <code class="language-plaintext highlighter-rouge">reviewerID, asin, reviewerName, reviewText, summary</code>. 
<code class="language-plaintext highlighter-rouge">reviewerID</code> and <code class="language-plaintext highlighter-rouge">asin</code> are machine-generated identifiers. 
<code class="language-plaintext highlighter-rouge">reviewerName, reviewText, summary</code> are free-form human-generated text.</p>

<p>For simplicity, we’ll ignore <code class="language-plaintext highlighter-rouge">reviewText</code> and make a table with the remaining text properties:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">table</span> <span class="n">reviews</span> <span class="p">(</span>
  <span class="n">review_id</span> <span class="n">bigserial</span> <span class="k">primary</span> <span class="k">key</span><span class="p">,</span>
  <span class="n">reviewer_id</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span>
  <span class="n">reviewer_name</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
  <span class="n">asin</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span>
  <span class="n">summary</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>I wouldn’t recommend this schema in a real application. 
For example, the reviewer name should be factored out to a <code class="language-plaintext highlighter-rouge">reviewer</code> table. 
But it’s good enough for a demo.</p>

<p>It takes about 8 minutes to populate the table and the table size ends up at 926MB.</p>
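<p>For anyone reproducing the setup, the row count and table size can be checked with Postgres’ built-in size functions. The post doesn’t state which function produced the 926MB figure, so treat this as a sketch; the numbers may differ slightly depending on which function is used:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Number of reviews loaded (about 8.9 million expected).
select count(*) as row_count from reviews;

-- Size of the table alone, and of the table plus all of its indexes.
select pg_size_pretty(pg_table_size('reviews')) as table_size,
       pg_size_pretty(pg_total_relation_size('reviews')) as table_and_indexes_size;
</code></pre></div></div>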

<h2 id="example-query-strings">Example Query Strings</h2>

<p>We’ll use one of my favorite authors, Michael Lewis, as a test subject.<sup id="fnref:michael-lewis" role="doc-noteref"><a href="#fn:michael-lewis" class="footnote" rel="footnote">4</a></sup></p>

<p>Specifically, we’ll search for two variations of his name:</p>

<ol>
  <li>“Michael Lewis” – the correct spelling – to find exact matches</li>
  <li>“Michael L<u><i>ou</i></u>is” – a plausible misspelling – to find fuzzy matches</li>
</ol>

<p>To avoid confusion, let’s refer to these as the <em>exact name</em> and the <em>fuzzy name</em>.</p>

<h2 id="explain-analyze-buffers">Explain (Analyze, Buffers)</h2>

<p>We’ll use Postgres’ <a href="https://www.postgresql.org/docs/14/sql-explain.html"><code class="language-plaintext highlighter-rouge">explain (analyze, buffers)</code> command</a> to evaluate performance.
This command takes a query, executes it, and returns the query plan and execution details.</p>

<p>This is not a fool-proof solution. 
A better benchmarking harness would include realistic application request patterns, authentication, authorization, logging, serialization, etc. 
However, building such a harness would be pointlessly cumbersome, as it would need to be re-implemented for any other non-trivial application.</p>

<p>The main thing we’re looking at is how the query plan, execution time, and I/O statistics<sup id="fnref:buffers" role="doc-noteref"><a href="#fn:buffers" class="footnote" rel="footnote">5</a></sup> react to optimizations. This is enough to conclude one approach is better than another.</p>

<p>To make this a bit more aesthetically appealing, I cobbled together an embedded version of the excellent <a href="https://github.com/dalibo/pev2">PEV2 (Postgres Explain Visualizer 2) project</a>.<sup id="fnref:pev2" role="doc-noteref"><a href="#fn:pev2" class="footnote" rel="footnote">6</a></sup></p>

<p>Let’s look at an example. I ran this query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">explain</span> <span class="p">(</span><span class="k">analyze</span><span class="p">,</span> <span class="n">buffers</span><span class="p">)</span>
<span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="n">review_id</span><span class="p">)</span> <span class="k">from</span> <span class="n">reviews</span><span class="p">;</span>
</code></pre></div></div>

<p>Which produced this output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Aggregate  (cost=177675.25..177675.26 rows=1 width=8) (actual time=3282.696..3282.697 rows=1 loops=1)
  Buffers: shared hit=10 read=24314
  I/O Timings: read=121.636
  -&gt;  Index Only Scan using reviews_pkey on reviews  (cost=0.43..155430.15 rows=8898041 width=8) (actual time=1.341..1679.504 rows=8898041 loops=1)
    Heap Fetches: 0
    Buffers: shared hit=10 read=24314
    I/O Timings: read=121.636
Planning Time: 0.138 ms
Execution Time: 3282.768 ms
</code></pre></div></div>

<p>The PEV2 viewer shows the query plan visualization, raw query plan, query, and some stats.</p>

<div data-app-component="pev2">
  <pre>
select count(review_id) from reviews;
  </pre>
  <pre>
Aggregate  (cost=177675.25..177675.26 rows=1 width=8) (actual time=3282.696..3282.697 rows=1 loops=1)
  Buffers: shared hit=10 read=24314
  I/O Timings: read=121.636
  -&gt;  Index Only Scan using reviews_pkey on reviews  (cost=0.43..155430.15 rows=8898041 width=8) (actual time=1.341..1679.504 rows=8898041 loops=1)
    Heap Fetches: 0
    Buffers: shared hit=10 read=24314
    I/O Timings: read=121.636
Planning Time: 0.138 ms
Execution Time: 3282.768 ms
  </pre>
</div>

<p>Clicking around a bit in the plan reveals three important results:</p>

<ol>
  <li>Total planning and execution time: we spent 0.138ms planning and about 3.3s executing.</li>
  <li>Timing and types of execution: we spent 1.6s in an <code class="language-plaintext highlighter-rouge">Aggregate</code> and 1.6s in an <code class="language-plaintext highlighter-rouge">Index Only Scan</code>. The <code class="language-plaintext highlighter-rouge">Index Only Scan</code> tells us we were able to make use of an index in this query.</li>
  <li>Timing, amount, and types of I/O: if we expand the <code class="language-plaintext highlighter-rouge">Index Only Scan</code> and open the <code class="language-plaintext highlighter-rouge">IO and Buffers</code> tab, we see we spent 122ms on I/O, hit 10 blocks, and read 24,314 blocks. A hit means the block was in the shared buffer cache. A read means we went to the filesystem cache or SSD.</li>
</ol>

<h1 id="baseline-search-query">Baseline Search Query</h1>

<p>With our test environment explained, let’s build a relatively simple baseline query pattern based on the trigram <code class="language-plaintext highlighter-rouge">similarity</code> function and its corresponding operators: <code class="language-plaintext highlighter-rouge">%</code> and <code class="language-plaintext highlighter-rouge">&lt;-&gt;</code>.</p>

<h2 id="trigram-operators">Trigram Operators</h2>

<p>Those already familiar with trigram <code class="language-plaintext highlighter-rouge">similarity</code>, <code class="language-plaintext highlighter-rouge">%</code>, and <code class="language-plaintext highlighter-rouge">&lt;-&gt;</code> can safely skip this section.</p>

<p>First, we need the function <code class="language-plaintext highlighter-rouge">similarity(text1, text2)</code>. 
This function breaks both texts into a set of trigrams, computes the intersection of sets, computes the union of sets, and divides the intersection size by the union size to produce a score between 0 and 1. 
In other words, this is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a> of the sets of trigrams.</p>

<p>The query below gives us some intuition about <code class="language-plaintext highlighter-rouge">similarity('abc', 'abb')</code>, <code class="language-plaintext highlighter-rouge">'abc' % 'abb'</code> and <code class="language-plaintext highlighter-rouge">'abc' &lt;-&gt; 'abb'</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'abc'</span> <span class="k">as</span> <span class="n">text1</span><span class="p">,</span> <span class="s1">'abb'</span> <span class="k">as</span> <span class="n">text2</span><span class="p">)</span>
<span class="k">select</span>
  <span class="n">show_trgm</span><span class="p">(</span><span class="n">text1</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"text1 trigrams"</span><span class="p">,</span>
  <span class="n">show_trgm</span><span class="p">(</span><span class="n">text2</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"text2 trigrams"</span><span class="p">,</span>
  <span class="n">array</span><span class="p">(</span><span class="k">select</span> <span class="n">t1</span><span class="p">.</span><span class="n">t1</span> 
        <span class="k">from</span> <span class="k">unnest</span><span class="p">(</span><span class="n">show_trgm</span><span class="p">(</span><span class="n">text1</span><span class="p">))</span> <span class="n">t1</span><span class="p">,</span> 
             <span class="k">unnest</span><span class="p">(</span><span class="n">show_trgm</span><span class="p">(</span><span class="n">text2</span><span class="p">))</span> <span class="n">t2</span> <span class="k">where</span> <span class="n">t1</span><span class="p">.</span><span class="n">t1</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">t2</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"intersection"</span><span class="p">,</span>
  <span class="n">array</span><span class="p">(</span><span class="k">select</span> <span class="n">t1</span><span class="p">.</span><span class="n">t1</span> <span class="k">from</span> <span class="k">unnest</span><span class="p">(</span><span class="n">show_trgm</span><span class="p">(</span><span class="n">text1</span><span class="p">))</span> <span class="n">t1</span> 
        <span class="k">union</span> 
        <span class="k">select</span> <span class="n">t2</span><span class="p">.</span><span class="n">t2</span> <span class="k">from</span> <span class="k">unnest</span><span class="p">(</span><span class="n">show_trgm</span><span class="p">(</span><span class="n">text2</span><span class="p">))</span> <span class="n">t2</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"union"</span><span class="p">,</span>
  <span class="n">round</span><span class="p">(</span><span class="n">similarity</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">)::</span><span class="nb">numeric</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"similarity"</span><span class="p">,</span>
  <span class="n">text1</span> <span class="o">%</span> <span class="n">text2</span> <span class="k">as</span> <span class="nv">"text1 % text2"</span><span class="p">,</span>
  <span class="n">text1</span> <span class="o">&lt;-&gt;</span> <span class="n">text2</span> <span class="k">as</span> <span class="nv">"text1 &lt;-&gt; text2"</span>
<span class="k">from</span> <span class="k">input</span><span class="p">;</span>
</code></pre></div></div>

<p>This produces:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">text1 trigrams</th>
      <th style="text-align: left">text2 trigrams</th>
      <th style="text-align: left">intersection</th>
      <th style="text-align: left">union</th>
      <th style="text-align: left">similarity</th>
      <th style="text-align: left">text1 % text2</th>
      <th style="text-align: left">text1 &lt;-&gt; text2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">{  a, ab,abc,bc }</td>
      <td style="text-align: left">{  a, ab,abb,bb }</td>
      <td style="text-align: left">{  a, ab}</td>
      <td style="text-align: left">{bb ,abb,  a, ab,bc ,abc}</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">true</td>
      <td style="text-align: left">0.666</td>
    </tr>
  </tbody>
</table>

<p>For “abc” and “abb”, the intersection size is 2 and union size is 6, so 2/6 is a similarity of 1/3.</p>

<p>The operator <code class="language-plaintext highlighter-rouge">text1 % text2</code> returns true if <code class="language-plaintext highlighter-rouge">similarity(text1, text2)</code> exceeds a pre-defined threshold setting, <code class="language-plaintext highlighter-rouge">pg_trgm.similarity_threshold</code>. The default threshold is 0.3 and the similarity of “abc” and “abb” is about 0.333, so <code class="language-plaintext highlighter-rouge">select 'abc' % 'abb'</code> returns <code class="language-plaintext highlighter-rouge">true</code>.</p>

<p>The operator <code class="language-plaintext highlighter-rouge">text1 &lt;-&gt; text2</code> returns the distance between text1 and text2, which is just <code class="language-plaintext highlighter-rouge">1 - similarity(text1, text2)</code>, so <code class="language-plaintext highlighter-rouge">select 'abc' &lt;-&gt; 'abb'</code> returns 2/3.</p>
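<p>The threshold is an ordinary session setting, so its effect on the <code class="language-plaintext highlighter-rouge">%</code> operator is easy to see. The sketch below raises it to 0.5, an arbitrary value chosen only for illustration, and then restores the default:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Calling a pg_trgm function first ensures the extension's settings
-- are loaded in this session.
select similarity('abc', 'abb');        -- roughly 0.333, per the table above

show pg_trgm.similarity_threshold;      -- 0.3 by default

-- Raise the threshold for the current session only.
set pg_trgm.similarity_threshold = 0.5;
select 'abc' % 'abb' as matches;        -- false, since 0.333 is below 0.5

-- Restore the default.
reset pg_trgm.similarity_threshold;
</code></pre></div></div>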

<p>Why do we need these operators if they just alias the <code class="language-plaintext highlighter-rouge">similarity</code> function? At the risk of spoiling one of the optimizations, operators can leverage an index, and functions cannot.</p>

<h2 id="trigram-search-query">Trigram Search Query</h2>

<p>Let’s use these operators to search for reviews where <code class="language-plaintext highlighter-rouge">summary</code> matches the exact name.</p>

<p>(We should compare the query string against all text columns, but we’ll start with just <code class="language-plaintext highlighter-rouge">summary</code> for simplicity.)</p>

<p>The query looks like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Lewis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span> <span class="c1">-- (1)</span>
<span class="k">select</span> <span class="n">review_id</span><span class="p">,</span>
       <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="p">(</span><span class="n">summary</span> <span class="o">&lt;-&gt;</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span> <span class="c1">-- (4)</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">summary</span> <span class="c1">-- (2)</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">summary</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">-- (3)</span>
</code></pre></div></div>

<p>Let’s break the query into components, numbered in correspondence to the comments above:</p>

<ol>
  <li>This is a Common Table Expression (CTE). It gives us a way to reference the query string as a variable, <code class="language-plaintext highlighter-rouge">input.q</code>.</li>
  <li>We use <code class="language-plaintext highlighter-rouge">input.q % summary</code> to filter the table down to a set of candidate rows. For each of these rows, <code class="language-plaintext highlighter-rouge">input.q</code> and <code class="language-plaintext highlighter-rouge">summary</code> have a trigram similarity greater than or equal to 0.3.</li>
  <li>Once we’ve found candidate rows, we sort them by the trigram distance between <code class="language-plaintext highlighter-rouge">input.q</code> and <code class="language-plaintext highlighter-rouge">summary</code> and keep the top 10. We want the rows with highest similarity, which is equivalent to lowest distance. So we sort by the distance operator in ascending order.</li>
  <li>In order to return the score to the user, we just subtract the trigram distance from 1.0.</li>
</ol>

<p>Let’s look at the results and performance for the exact name:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">review_id</th>
      <th style="text-align: left">summary</th>
      <th style="text-align: left">score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">589771</td>
      <td style="text-align: left">Michael Lewis Fan</td>
      <td style="text-align: left">0.7777</td>
    </tr>
    <tr>
      <td style="text-align: left">2113780</td>
      <td style="text-align: left">Michael Lewis Fan</td>
      <td style="text-align: left">0.7777</td>
    </tr>
    <tr>
      <td style="text-align: left">2111282</td>
      <td style="text-align: left">Michael Lewis bland?</td>
      <td style="text-align: left">0.6999</td>
    </tr>
    <tr>
      <td style="text-align: left">2114048</td>
      <td style="text-align: left">MIchael Lewis is Good</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">2100962</td>
      <td style="text-align: left">Not Michael Lewis’ Best</td>
      <td style="text-align: left">0.6086</td>
    </tr>
    <tr>
      <td style="text-align: left">610753</td>
      <td style="text-align: left">Not Michael Lewis’ Best</td>
      <td style="text-align: left">0.6086</td>
    </tr>
    <tr>
      <td style="text-align: left">2111364</td>
      <td style="text-align: left">Boomerang, Michael Lewis</td>
      <td style="text-align: left">0.5833</td>
    </tr>
    <tr>
      <td style="text-align: left">2111212</td>
      <td style="text-align: left">Michael Lewis is amazing</td>
      <td style="text-align: left">0.5833</td>
    </tr>
    <tr>
      <td style="text-align: left">2111190</td>
      <td style="text-align: left">Michael Lewis on a Roll</td>
      <td style="text-align: left">0.5833</td>
    </tr>
    <tr>
      <td style="text-align: left">2108446</td>
      <td style="text-align: left">Go Long on Michael Lewis</td>
      <td style="text-align: left">0.5833</td>
    </tr>
  </tbody>
</table>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=229772.34..229772.37 rows=10 width=20) (actual time=94817.773..94817.777 rows=10 loops=1)
  Buffers: shared hit=118549
  -&gt;  Sort  (cost=229772.34..229774.39 rows=819 width=20) (actual time=94817.772..94817.773 rows=10 loops=1)
        Sort Key: (('Michael Lewis'::text &lt;-&gt; (reviews.summary)::text))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=118549
        -&gt;  Seq Scan on reviews  (cost=0.00..229754.64 rows=819 width=20) (actual time=171.828..94816.588 rows=761 loops=1)
              Filter: ('Michael Lewis'::text % (summary)::text)
              Rows Removed by Filter: 8897280
              Buffers: shared hit=118549
Planning Time: 1.669 ms
Execution Time: 94817.814 ms
</pre>
</div>

<p>And again for the fuzzy name:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">review_id</th>
      <th style="text-align: left">summary</th>
      <th style="text-align: left">score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">4341036</td>
      <td style="text-align: left">Lo Michael</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341045</td>
      <td style="text-align: left">Lo, Michael</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341030</td>
      <td style="text-align: left">Lo Michael!</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341034</td>
      <td style="text-align: left">Lo,Michael</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341027</td>
      <td style="text-align: left">Lo, Michael!</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341043</td>
      <td style="text-align: left">Lo,Michael</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341025</td>
      <td style="text-align: left">Lo, Michael!</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341026</td>
      <td style="text-align: left">Lo. Michael !</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341029</td>
      <td style="text-align: left">Lo, Michael!</td>
      <td style="text-align: left">0.6666</td>
    </tr>
    <tr>
      <td style="text-align: left">4341050</td>
      <td style="text-align: left">Lo michael</td>
      <td style="text-align: left">0.6666</td>
    </tr>
  </tbody>
</table>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=229792.85..229792.87 rows=10 width=20) (actual time=94591.716..94591.720 rows=10 loops=1)
  Buffers: shared hit=118549
  -&gt;  Sort  (cost=229792.85..229794.90 rows=821 width=20) (actual time=94591.715..94591.716 rows=10 loops=1)
        Sort Key: (('Michael Louis'::text &lt;-&gt; (reviews.summary)::text))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=118549
        -&gt;  Seq Scan on reviews  (cost=0.00..229775.11 rows=821 width=20) (actual time=176.054..94590.575 rows=729 loops=1)
              Filter: ('Michael Louis'::text % (summary)::text)
              Rows Removed by Filter: 8897312
              Buffers: shared hit=118549
Planning Time: 1.761 ms
Execution Time: 94591.758 ms
</pre>
</div>

<p>Qualitatively speaking, the results are reasonable. There are some exact matches for the exact name, and a few summaries containing “Lo” and “Michael” that match the fuzzy name.</p>

<p>👎 <strong>But the performance is terrible: <em>over 94 seconds to find ten results!</em></strong></p>

<p>If we extrapolate this to all four text columns, we can estimate a runtime of over 360 seconds.</p>
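<p>To make that extrapolation concrete, here is one hypothetical shape a four-column version of the baseline could take. It is a sketch for illustration rather than the query measured in this post: it filters on a match in any column and scores each row by its best-matching column, using <code class="language-plaintext highlighter-rouge">greatest</code> as an assumed way to combine the scores.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical extension of the baseline to all four text columns.
-- Shown only to illustrate the extrapolation; expect it to be very slow.
with input as (select 'Michael Lewis' as q)
select review_id,
       greatest(
         similarity(input.q, reviewer_id),
         similarity(input.q, reviewer_name),
         similarity(input.q, asin),
         similarity(input.q, summary)
       ) as score
from reviews, input
where input.q % reviewer_id
   or input.q % reviewer_name
   or input.q % asin
   or input.q % summary
order by score desc
limit 10;
</code></pre></div></div>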

<p>How are we spending this time? The query plans suggest the following:</p>

<ol>
  <li>About 94s in <code class="language-plaintext highlighter-rouge">Seq Scan on reviews</code>. This is a sequential scan on the reviews table, which means Postgres iterates over all rows and keeps those that satisfy <code class="language-plaintext highlighter-rouge">input.q % summary</code>. This returns 761 and 729 matches for the exact and fuzzy names, respectively. The <code class="language-plaintext highlighter-rouge">IO &amp; Buffers</code> tabs indicate this also involved reading 926MB of data (i.e., the whole table) from the in-memory cache. It’s better than going to SSD, but it’s still non-negligible.</li>
  <li>About 1ms in <code class="language-plaintext highlighter-rouge">Sort</code>, which sorts the matches by <code class="language-plaintext highlighter-rouge">input.q &lt;-&gt; summary</code>.</li>
  <li>Less than 1ms in <code class="language-plaintext highlighter-rouge">Limit</code>, which takes the first ten of the sorted rows.</li>
</ol>
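<p>The 926MB figure can be derived from the <code class="language-plaintext highlighter-rouge">Buffers: shared hit=118549</code> line in the plans: each buffer is one block, so the amount of data touched is the block count multiplied by the block size. A small sketch of that arithmetic:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- 118549 is the shared hit count from the query plans above.
select current_setting('block_size') as block_size_bytes,
       pg_size_pretty(118549 * current_setting('block_size')::bigint) as data_read;
</code></pre></div></div>

<p>With the default 8192-byte block size, this comes to 926 MB.</p>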

<h2 id="baseline-summary">Baseline Summary</h2>

<p>Here’s what we know about our trigram search query so far:</p>

<ol>
  <li>Qualitatively, it’s not Google, but the results are reasonable.</li>
  <li>The query is unusably slow (over 94s).</li>
  <li>It spends virtually all its time scanning the reviews table.</li>
</ol>

<h1 id="optimizations">Optimizations</h1>

<p>Let’s get into some optimizations to see if we can improve on the baseline.</p>

<h2 id="indexing">Indexing</h2>

<p>The first optimization should be unsurprising: we’ll create an index for the text field.</p>

<p>Trigrams support both GIN and GiST index types. 
The main difference is that GiST supports filtering and sorting, whereas GIN only supports filtering. 
Since our search query involves sorting by trigram distance, we’ll use GiST.</p>
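<p>For comparison, this is roughly what the GIN alternative would look like. It is not used in this post, and the index name is made up for illustration. A GIN index like this can serve the <code class="language-plaintext highlighter-rouge">%</code> filter, but not the <code class="language-plaintext highlighter-rouge">order by ... &lt;-&gt; ...</code> clause:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical GIN alternative (not used in this post).
-- It supports filtering with %, but not ordering by the &lt;-&gt; distance operator.
create index reviews_summary_trgm_gin_idx on reviews
  using gin (summary gin_trgm_ops);

-- Drop it again after experimenting, so that the planner's choices
-- stay comparable to the plans shown in this post.
drop index reviews_summary_trgm_gin_idx;
</code></pre></div></div>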

<p>At a high level, the GiST index works by building a lookup table from each trigram to the list of rows containing the trigram.
At query time, Postgres takes the trigrams from the query string and asks the index, “which rows contain these trigrams?”
The trigrams are stored as a signature (i.e., a hash), and sometimes the signatures can collide.</p>
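
<p>To make the “trigrams from the query string” part concrete, here is a small sketch using the <code class="language-plaintext highlighter-rouge">show_trgm</code> function from pg_trgm, applied to the two query strings used in this post. The column aliases are just illustrative names.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Lists the trigrams Postgres extracts from each query string.
select show_trgm('Michael Lewis') as exact_trigrams,
       show_trgm('Michael Louis') as fuzzy_trigrams;
</code></pre></div></div>

<p>The two arrays should share most of their entries, which is why the fuzzy name can still match rows that contain the exact name.</p>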

<p>Since Postgres 13, the GiST index type includes a parameter called <code class="language-plaintext highlighter-rouge">siglen</code>, which lets us control the precision of the signature. Here’s how the docs describe it:</p>

<blockquote>
  <p>gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap signature. Its optional integer parameter siglen determines the signature length in bytes. The default length is 12 bytes. Valid values of signature length are between 1 and 2024 bytes. Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index.</p>
</blockquote>

<p>In short, higher siglen should translate to more precise search (i.e., fewer signature collisions), at the cost of a larger index.</p>

<p>We’ll start with a GiST index with siglen=64, check performance, then repeat with siglen=256.</p>

<h3 id="gist-with-siglen64">GiST with siglen=64</h3>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">index</span> <span class="n">reviews_summary_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span> 
  <span class="k">using</span> <span class="n">gist</span><span class="p">(</span><span class="n">summary</span> <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">64</span><span class="p">));</span>
<span class="k">vacuum</span> <span class="k">analyze</span> <span class="n">reviews</span><span class="p">;</span>
</code></pre></div></div>

<p>This takes about 10 minutes to build and ends up using about 1000MB of storage.</p>
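
<p>One way to check a figure like this is to ask Postgres for the on-disk size of each index on the table. This is a sketch of how such a number could be measured, not necessarily how the number above was obtained:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- On-disk size of every index on the reviews table, largest first.
select indexrelname as index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where relname = 'reviews'
order by pg_relation_size(indexrelid) desc;
</code></pre></div></div>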

<p>Does it make a difference for performance?</p>

<p>For the exact name, we find:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=4181.401..4216.478 rows=10 loops=1)
  Buffers: shared hit=135684
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=4181.400..4216.474 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Lewis'::text)
        Order By: ((summary)::text &lt;-&gt; 'Michael Lewis'::text)
        Buffers: shared hit=135684
Planning Time: 1.933 ms
Execution Time: 4216.519 ms
</pre>
</div>

<p>And for the fuzzy name:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=4330.713..4330.850 rows=10 loops=1)
  Buffers: shared hit=135447
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=4330.711..4330.845 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Louis'::text)
        Order By: ((summary)::text &lt;-&gt; 'Michael Louis'::text)
        Buffers: shared hit=135447
Planning Time: 1.829 ms
Execution Time: 4330.889 ms
</pre>
</div>

<p><br /></p>

<p>👍 <strong>This is a significant improvement: from over 94 seconds to under 4.5 seconds!</strong></p>

<p>If we extrapolate this to all four text columns, we can estimate a runtime of under 20 seconds.</p>

<p>The query plans tell us how we’re making better use of time:</p>

<ol>
  <li>About 4.3s in an <code class="language-plaintext highlighter-rouge">Index Scan</code> on the new <code class="language-plaintext highlighter-rouge">reviews_summary_trgm_gist_idx</code> index. The <code class="language-plaintext highlighter-rouge">Misc</code> tab indicates Postgres uses the index for filtering (<code class="language-plaintext highlighter-rouge">Index Cond</code>) and sorting (<code class="language-plaintext highlighter-rouge">Order By</code>). The <code class="language-plaintext highlighter-rouge">IO &amp; Buffers</code> tab indicates we’re accessing 1.03GB of data from the cache. We don’t know precisely, but this data is some combination of the index and the rows.</li>
  <li>Less than 40ms in <code class="language-plaintext highlighter-rouge">Limit</code>. As far as I can tell, this is a trivial pass-through, as the index scan has already returned exactly ten rows.</li>
</ol>

<h3 id="gist-with-siglen256">GiST with siglen=256</h3>

<p>Let’s try again with siglen=256:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">drop</span> <span class="k">index</span> <span class="n">reviews_summary_trgm_gist_idx</span><span class="p">;</span>
<span class="k">create</span> <span class="k">index</span> <span class="n">reviews_summary_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span>
  <span class="k">using</span> <span class="n">gist</span><span class="p">(</span><span class="n">summary</span> <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">256</span><span class="p">));</span>
<span class="k">vacuum</span> <span class="k">analyze</span> <span class="n">reviews</span><span class="p">;</span>
</code></pre></div></div>

<p>This takes about 15 minutes to build and uses 1036MB of storage.</p>

<p>For the exact name, we find:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=503.082..1996.835 rows=10 loops=1)
  Buffers: shared hit=62167
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=503.079..1996.828 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Lewis'::text)
        Order By: ((summary)::text &lt;-&gt; 'Michael Lewis'::text)
        Buffers: shared hit=62167
Planning Time: 5.283 ms
Execution Time: 1997.397 ms
</pre>
</div>

<p>And for the fuzzy name:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Louis' as q)
select review_id,
       1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.81 rows=10 width=20) (actual time=707.952..708.081 rows=10 loops=1)
  Buffers: shared hit=22639
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..771.80 rows=821 width=20) (actual time=707.951..708.078 rows=10 loops=1)
        Index Cond: ((summary)::text % 'Michael Louis'::text)
        Order By: ((summary)::text &lt;-&gt; 'Michael Louis'::text)
        Buffers: shared hit=22639
Planning Time: 1.654 ms
Execution Time: 708.577 ms
</pre>
</div>

<p><br /></p>

<p>👍 <strong>Another improvement: from 4.5 seconds to under 2 seconds!</strong></p>

<p>If we extrapolate this to all four text columns, we can estimate a runtime of about 8 seconds.</p>

<h3 id="why-does-siglen-matter">Why does Siglen Matter?</h3>

<p>Inspection of these results leads to two questions:</p>

<ol>
  <li>Why is siglen=256 over 2x faster than siglen=64?</li>
  <li>For siglen=256, why is the fuzzy name over 2x faster than the exact name?</li>
</ol>

<p>We can begin to answer these by looking at the <code class="language-plaintext highlighter-rouge">IO &amp; Buffers</code> tabs, which tell us how much data was accessed. The numbers work out like this:</p>

<table>
  <thead>
    <tr>
      <th>Siglen</th>
      <th>Query String</th>
      <th>Data Accessed</th>
      <th>Access Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>64</td>
      <td>exact name</td>
      <td>1.04GB</td>
      <td>hit (from in-memory cache)</td>
    </tr>
    <tr>
      <td>64</td>
      <td>fuzzy name</td>
      <td>1.03GB</td>
      <td>hit (from in-memory cache)</td>
    </tr>
    <tr>
      <td>256</td>
      <td>exact name</td>
      <td>486MB</td>
      <td>hit (from in-memory cache)</td>
    </tr>
    <tr>
      <td>256</td>
      <td>fuzzy name</td>
      <td>177MB</td>
      <td>hit (from in-memory cache)</td>
    </tr>
  </tbody>
</table>

<p>Even though this data is in memory, decreasing the amount accessed makes a difference.</p>
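
<p>The sizes in this table can be derived from the <code class="language-plaintext highlighter-rouge">Buffers: shared hit=...</code> counts in the plans, since each buffer is one block (8kB unless Postgres was built with a different block size). Here is a sketch of the arithmetic, assuming “MB” means 2<sup>20</sup> bytes; the buffer counts are copied from the four plans above.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Convert buffer hits to megabytes using the server's actual block size.
select siglen, query_string, buffer_hits,
       round(buffer_hits::bigint * current_setting('block_size')::int
             / (1024.0 * 1024.0)) as mb_accessed
from (values
  (64,  'exact name', 135684),
  (64,  'fuzzy name', 135447),
  (256, 'exact name', 62167),
  (256, 'fuzzy name', 22639)
) as t(siglen, query_string, buffer_hits);
</code></pre></div></div>

<p>With the default block size, the two siglen=256 rows should come out at roughly 486 and 177, in line with the table.</p>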

<p>I’m still working on an intuitive understanding of why these two specific values of siglen work out to these specific differences, but that’s likely a topic for another post.<sup id="fnref:buffers:1" role="doc-noteref"><a href="#fn:buffers" class="footnote" rel="footnote">5</a></sup></p>

<h3 id="indexing-summary">Indexing Summary</h3>

<p>Here’s what we know about indexing:</p>

<ol>
  <li>Adding a GiST index yields a significant speedup: 94s → 4.5s.</li>
  <li>Increasing the siglen parameter from 64 to 256 yields another speedup: 4.5s → 2s.</li>
  <li>The siglen parameter affects the number of buffers read to execute the index scan: greater siglen → fewer buffers → faster query.</li>
</ol>

<h2 id="separate-exact-and-trigram-search-queries">Separate Exact and Trigram Search Queries</h2>

<p>Recall that we’re interested in both exact and fuzzy matches. 
So far, we’ve used a single trigram search query to satisfy both match types. 
Trigrams are useful for fuzzy matches, but are they really necessary for exact matches?</p>

<p>Let’s take a step back, compose an exact-only search query, and see what we can do with it.</p>

<h3 id="the-ilike-operator">The <code class="language-plaintext highlighter-rouge">ilike</code> operator</h3>

<p>The boolean operator <code class="language-plaintext highlighter-rouge">text1 ilike '%' || text2 || '%'</code> will return true if <code class="language-plaintext highlighter-rouge">text1</code> contains <code class="language-plaintext highlighter-rouge">text2</code>, ignoring capitalization.</p>

<p>Here are some examples:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span>
   <span class="s1">'abc'</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="s1">'ab'</span> <span class="o">||</span> <span class="s1">'%'</span> <span class="k">as</span> <span class="nv">"abc contains ab"</span><span class="p">,</span>
   <span class="s1">'abc'</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="s1">'AB'</span> <span class="o">||</span> <span class="s1">'%'</span> <span class="k">as</span> <span class="nv">"abc contains AB"</span><span class="p">,</span>
   <span class="s1">'abc'</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="s1">'abc'</span> <span class="o">||</span> <span class="s1">'%'</span> <span class="k">as</span> <span class="nv">"abc contains abc"</span><span class="p">,</span>
   <span class="s1">'abc'</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="s1">'abb'</span> <span class="o">||</span> <span class="s1">'%'</span> <span class="k">as</span> <span class="nv">"abc contains abb"</span>
</code></pre></div></div>

<p>This produces:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">abc contains ab</th>
      <th style="text-align: left">abc contains AB</th>
      <th style="text-align: left">abc contains abc</th>
      <th style="text-align: left">abc contains abb</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">true</td>
      <td style="text-align: left">true</td>
      <td style="text-align: left">true</td>
      <td style="text-align: left">false</td>
    </tr>
  </tbody>
</table>
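
<p>One caveat that goes slightly beyond the examples above: <code class="language-plaintext highlighter-rouge">%</code> and <code class="language-plaintext highlighter-rouge">_</code> are wildcards inside the pattern, so a query string containing them matches more than its literal text. Here is a minimal sketch of escaping them with the default backslash escape character, assuming <code class="language-plaintext highlighter-rouge">standard_conforming_strings</code> is on (the default):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>select
   -- Unescaped: the underscore matches any single character, including 'b'.
   'abc' ilike '%' || 'a_c' || '%' as "unescaped a_c matches abc",
   -- Escaped: backslash, percent, and underscore are treated literally.
   'abc' ilike '%' || replace(replace(replace('a_c', '\', '\\'), '%', '\%'), '_', '\_') || '%' as "escaped a_c matches abc";
</code></pre></div></div>

<p>The first column should be true and the second false, since <code class="language-plaintext highlighter-rouge">abc</code> does not contain a literal <code class="language-plaintext highlighter-rouge">a_c</code>.</p>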

<h3 id="exact-only-search-query">Exact-Only Search Query</h3>

<p>We can use the <code class="language-plaintext highlighter-rouge">ilike</code> operator to compose an exact-only search query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Lewis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span>
<span class="k">select</span> <span class="n">review_id</span><span class="p">,</span>
       <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span> <span class="c1">-- (2)</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="n">summary</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span> <span class="c1">-- (1)</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">-- (3)</span>
</code></pre></div></div>

<ol>
  <li>We use the <code class="language-plaintext highlighter-rouge">ilike</code> operator to filter for rows where <code class="language-plaintext highlighter-rouge">summary</code> contains the query string.</li>
  <li>Since each <code class="language-plaintext highlighter-rouge">summary</code> contains the query string, we simply assign a score of 1.0.</li>
  <li>We just want ten of them. They all have the same score, so no need to sort.</li>
</ol>

<p>How does it perform on our query strings?</p>

<p>For the exact name, we find:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
select review_id,
       1.0 as score
from reviews, input
where summary ilike '%' || input.q || '%'
limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.71 rows=10 width=40) (actual time=2.955..6.431 rows=10 loops=1)
  Buffers: shared hit=865
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..763.59 rows=821 width=40) (actual time=2.952..6.425 rows=10 loops=1)
        Index Cond: ((summary)::text ~~* '%Michael Lewis%'::text)
        Buffers: shared hit=865
Planning Time: 0.413 ms
Execution Time: 6.456 ms
</pre>
</div>

<p>And for the fuzzy name:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Louis' as q)
select review_id,
       1.0 as score
from reviews, input
where summary ilike '%' || input.q || '%'
limit 10;
</pre>
<pre>
Limit  (cost=0.42..9.71 rows=10 width=40) (actual time=10.582..10.583 rows=0 loops=1)
  Buffers: shared hit=1429
  -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews  (cost=0.42..763.59 rows=821 width=40) (actual time=10.581..10.581 rows=0 loops=1)
    Index Cond: ((summary)::text ~~* '%Michael Louis%'::text)
    Rows Removed by Index Recheck: 1
    Buffers: shared hit=1429
Planning Time: 0.340 ms
Execution Time: 10.007 ms
</pre>
</div>

<p>👍 <strong>A significant improvement: from 2s to 10ms!</strong></p>

<p>If we extrapolate to all four text columns, we’re down to potentially 40ms.</p>

<p>This tells us that finding exact matches with an exact-only query is significantly faster than finding them with a trigram search query.</p>

<p>The query plan is roughly the same as our trigram search query, basically just an <code class="language-plaintext highlighter-rouge">Index Scan on reviews</code>, but the amount of data accessed is significantly lower: under 12MB.</p>

<p>Crucially, this presents an opportunity for optimization: given a query string and a desired number of results, we first attempt to very quickly search for exact matches. 
If we find the desired number of results, we can skip the fuzzy search entirely. 
If we don’t find all the results, we run the fuzzy query. 
If we want to get fancy, we can even run the two searches in parallel and cancel the fuzzy search if our exact search is sufficient.</p>
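
<p>The exact-first, fuzzy-as-fallback idea could live in application code, but here is a rough sketch of the same logic as a single statement. The CTE names <code class="language-plaintext highlighter-rouge">exact</code> and <code class="language-plaintext highlighter-rouge">fuzzy</code> are my own, and I have not benchmarked this form. Because the count check does not depend on any row of <code class="language-plaintext highlighter-rouge">reviews</code>, Postgres can usually evaluate it once and skip the fuzzy index scan, but that is worth confirming with <code class="language-plaintext highlighter-rouge">explain (analyze)</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>with input as (select 'Michael Lewis' as q),
exact as (
  -- Cheap exact-only search.
  select review_id, 1.0::float8 as score
  from reviews, input
  where summary ilike '%' || input.q || '%'
  limit 10
),
fuzzy as (
  select review_id, 1.0 - (summary &lt;-&gt; input.q) as score
  from reviews, input
  where input.q % summary
    -- Only run the expensive trigram search if exact search came up short.
    and (select count(*) from exact) &lt; 10
  order by input.q &lt;-&gt; summary
  limit 10
)
select review_id, score from exact
union all
select review_id, score from fuzzy;
</code></pre></div></div>

<p>When the fallback does run, a row found by both branches appears twice, so the caller still needs to deduplicate.</p>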

<h3 id="separate-queries-summary">Separate Queries Summary</h3>

<p>Here’s what we know about separating exact and trigram search queries:</p>

<ol>
  <li>An exact-only query accesses significantly less data than a trigram query: 177MB → 11MB.</li>
  <li>An exact-only query is significantly faster than a trigram query: 2s → 10ms.</li>
  <li>If the exact-only query finds enough results, we can skip the fuzzy query.</li>
  <li>In the best case, we turn a 2s search into a 10ms search.</li>
  <li>In the worst case, we turn a 2s search into a 2.01s search.</li>
</ol>

<h2 id="single-query-for-all-text-columns">Single Query for All Text Columns</h2>

<p>So far our search queries have only checked for matches in the <code class="language-plaintext highlighter-rouge">summary</code> column, and we’ve been extrapolating the timing.</p>

<p>Now is the time to stop extrapolating and compose a query that actually checks all four text columns.
Let’s look at three ways we can make this happen.</p>

<h3 id="four-single-column-queries">Four Single-Column Queries</h3>

<p>The simplest way to cover all four columns is to search each column separately. 
Then we would deduplicate and re-rank the results in application code.</p>

<p>To do this, we start by building indexes on the three remaining columns:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">index</span> <span class="n">reviews_reviewer_id_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span>
  <span class="k">using</span> <span class="n">gist</span><span class="p">(</span><span class="n">reviewer_id</span> <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">256</span><span class="p">));</span>
<span class="k">create</span> <span class="k">index</span> <span class="n">reviews_reviewer_name_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span>
  <span class="k">using</span> <span class="n">gist</span><span class="p">(</span><span class="n">reviewer_name</span> <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">256</span><span class="p">));</span>
<span class="k">create</span> <span class="k">index</span> <span class="n">reviews_asin_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span>
  <span class="k">using</span> <span class="n">gist</span><span class="p">(</span><span class="n">asin</span> <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">256</span><span class="p">));</span>
<span class="k">vacuum</span> <span class="k">analyze</span> <span class="n">reviews</span><span class="p">;</span>
</code></pre></div></div>

<p>Each of these takes about fifteen minutes to build and uses about 690MB of storage.</p>

<p>The trigram search query is just a union of the original trigram search query on each column:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Lewis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="p">(</span><span class="n">reviewer_id</span> <span class="o">&lt;-&gt;</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">reviewer_id</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_id</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="p">(</span><span class="n">reviewer_name</span> <span class="o">&lt;-&gt;</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">reviewer_name</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_name</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="p">(</span><span class="n">summary</span> <span class="o">&lt;-&gt;</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">summary</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">summary</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="p">(</span><span class="n">asin</span> <span class="o">&lt;-&gt;</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">asin</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">asin</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>
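
<p>The deduplicate-and-re-rank step is described here as application code. If you would rather keep it in SQL, one possible sketch is to wrap the union in a CTE and keep each review’s best score. The name <code class="language-plaintext highlighter-rouge">candidates</code> is my own, and only two of the four branches are shown for brevity.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>with input as (select 'Michael Lewis' as q),
candidates as (
  (select review_id, 1.0 - (reviewer_name &lt;-&gt; input.q) as score
   from reviews, input
   where input.q % reviewer_name
   order by input.q &lt;-&gt; reviewer_name limit 10)
  union all
  (select review_id, 1.0 - (summary &lt;-&gt; input.q) as score
   from reviews, input
   where input.q % summary
   order by input.q &lt;-&gt; summary limit 10)
  -- The reviewer_id and asin branches would follow the same pattern.
)
-- A review can match in several columns; keep its best score, then re-rank.
select review_id, max(score) as score
from candidates
group by review_id
order by score desc
limit 10;
</code></pre></div></div>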

<p>The exact-only query follows the same pattern:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">explain</span> <span class="p">(</span><span class="k">analyze</span><span class="p">,</span> <span class="n">buffers</span><span class="p">)</span>
<span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Lewis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="n">reviewer_id</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="n">reviewer_name</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="n">summary</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">union</span> <span class="k">all</span>
<span class="p">(</span><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="n">asin</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>

<p>Before analyzing the query execution, let’s review our thinking on how long this <em>should</em> take.</p>

<p>Our latest queries looked at the <code class="language-plaintext highlighter-rouge">summary</code> column and took about 10ms for exact-only search and 2s for trigram search.
We have four text columns, so it’s not crazy to estimate somewhere between 40ms and 8s for four one-column queries.</p>

<p>The actual performance works out like this:</p>

<table>
  <thead>
    <tr>
      <th>Query</th>
      <th>Query String</th>
      <th>Execution Time</th>
      <th>Buffer Hits</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Trigram</td>
      <td>exact name</td>
      <td>10.7s</td>
      <td>336986</td>
    </tr>
    <tr>
      <td>Trigram</td>
      <td>fuzzy name</td>
      <td>11.2s</td>
      <td>336998</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>exact name</td>
      <td>144ms</td>
      <td>12684</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>fuzzy name</td>
      <td>94ms</td>
      <td>9263</td>
    </tr>
  </tbody>
</table>

<p>👎 <strong>The performance is pretty bad: about 11s to find ten matches.</strong></p>

<p>All four plans are roughly identical, so let’s look at the trigram query for the exact name:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
(select review_id, 1.0 - (reviewer_id &lt;-&gt; input.q) as score
from reviews, input
where input.q % reviewer_id
order by input.q &lt;-&gt; reviewer_id limit 10)
union all
(select review_id, 1.0 - (reviewer_name &lt;-&gt; input.q) as score
from reviews, input
where input.q % reviewer_name
order by input.q &lt;-&gt; reviewer_name limit 10)
union all
(select review_id, 1.0 - (summary &lt;-&gt; input.q) as score
from reviews, input
where input.q % summary
order by input.q &lt;-&gt; summary limit 10)
union all
(select review_id, 1.0 - (asin &lt;-&gt; input.q) as score
from reviews, input
where input.q % asin
order by input.q &lt;-&gt; asin limit 10);
</pre>
<pre>
Append  (cost=64064.97..256647.56 rows=40 width=16) (actual time=6764.697..10795.095 rows=20 loops=1)
  Buffers: shared hit=336986
  CTE input
    -&gt;  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
"  -&gt;  Subquery Scan on ""*SELECT* 1_1""  (cost=64064.96..64065.09 rows=10 width=16) (actual time=2506.643..2506.645 rows=0 loops=1)"
        Buffers: shared hit=69700
        -&gt;  Limit  (cost=64064.96..64064.99 rows=10 width=20) (actual time=2506.642..2506.643 rows=0 loops=1)
              Buffers: shared hit=69700
              -&gt;  Sort  (cost=64064.96..64287.41 rows=88980 width=20) (actual time=2506.641..2506.642 rows=0 loops=1)
                    Sort Key: ((input.q &lt;-&gt; (reviews.reviewer_id)::text))
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=69700
                    -&gt;  Nested Loop  (cost=0.42..62142.14 rows=88980 width=20) (actual time=2506.636..2506.636 rows=0 loops=1)
                          Buffers: shared hit=69700
                          -&gt;  CTE Scan on input  (cost=0.00..0.02 rows=1 width=32) (actual time=0.003..0.005 rows=1 loops=1)
                          -&gt;  Index Scan using reviews_reviewer_id_trgm_gist_idx on reviews  (cost=0.42..60584.97 rows=88980 width=22) (actual time=2506.628..2506.628 rows=0 loops=1)
                                Index Cond: ((reviewer_id)::text % input.q)
                                Buffers: shared hit=69700
"  -&gt;  Subquery Scan on ""*SELECT* 2""  (cost=64056.86..64056.99 rows=10 width=16) (actual time=4258.051..4258.058 rows=10 loops=1)"
        Buffers: shared hit=133720
        -&gt;  Limit  (cost=64056.86..64056.89 rows=10 width=20) (actual time=4258.048..4258.052 rows=10 loops=1)
              Buffers: shared hit=133720
              -&gt;  Sort  (cost=64056.86..64279.31 rows=88980 width=20) (actual time=4258.047..4258.049 rows=10 loops=1)
                    Sort Key: ((input_1.q &lt;-&gt; (reviews_1.reviewer_name)::text))
                    Sort Method: top-N heapsort  Memory: 26kB
                    Buffers: shared hit=133720
                    -&gt;  Nested Loop  (cost=0.42..62134.04 rows=88980 width=20) (actual time=0.750..4239.400 rows=50214 loops=1)
                          Buffers: shared hit=133720
                          -&gt;  CTE Scan on input input_1  (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
                          -&gt;  Index Scan using reviews_reviewer_name_trgm_gist_idx on reviews reviews_1  (cost=0.42..60576.87 rows=88980 width=24) (actual time=0.722..3483.767 rows=50214 loops=1)
                                Index Cond: ((reviewer_name)::text % input_1.q)
                                Buffers: shared hit=133720
"  -&gt;  Subquery Scan on ""*SELECT* 3""  (cost=64460.96..64461.09 rows=10 width=16) (actual time=4022.956..4022.963 rows=10 loops=1)"
        Buffers: shared hit=132744
        -&gt;  Limit  (cost=64460.96..64460.99 rows=10 width=20) (actual time=4022.954..4022.957 rows=10 loops=1)
              Buffers: shared hit=132744
              -&gt;  Sort  (cost=64460.96..64683.41 rows=88980 width=20) (actual time=4022.953..4022.954 rows=10 loops=1)
                    Sort Key: ((input_2.q &lt;-&gt; (reviews_2.summary)::text))
                    Sort Method: top-N heapsort  Memory: 26kB
                    Buffers: shared hit=132744
                    -&gt;  Nested Loop  (cost=0.42..62538.14 rows=88980 width=20) (actual time=8.015..4022.513 rows=761 loops=1)
                          Buffers: shared hit=132744
                          -&gt;  CTE Scan on input input_2  (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.002 rows=1 loops=1)
                          -&gt;  Index Scan using reviews_summary_trgm_gist_idx on reviews reviews_2  (cost=0.42..60980.97 rows=88980 width=34) (actual time=7.986..4009.293 rows=761 loops=1)
                                Index Cond: ((summary)::text % input_2.q)
                                Buffers: shared hit=132744
"  -&gt;  Subquery Scan on ""*SELECT* 4""  (cost=64064.06..64064.19 rows=10 width=16) (actual time=7.418..7.419 rows=0 loops=1)"
        Buffers: shared hit=822
        -&gt;  Limit  (cost=64064.06..64064.09 rows=10 width=20) (actual time=7.417..7.418 rows=0 loops=1)
              Buffers: shared hit=822
              -&gt;  Sort  (cost=64064.06..64286.51 rows=88980 width=20) (actual time=7.416..7.417 rows=0 loops=1)
                    Sort Key: ((input_3.q &lt;-&gt; (reviews_3.asin)::text))
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=822
                    -&gt;  Nested Loop  (cost=0.42..62141.24 rows=88980 width=20) (actual time=7.410..7.411 rows=0 loops=1)
                          Buffers: shared hit=822
                          -&gt;  CTE Scan on input input_3  (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.001 rows=1 loops=1)
                          -&gt;  Index Scan using reviews_asin_trgm_gist_idx on reviews reviews_3  (cost=0.42..60584.07 rows=88980 width=19) (actual time=7.406..7.406 rows=0 loops=1)
                                Index Cond: ((asin)::text % input_3.q)
                                Buffers: shared hit=822
Planning Time: 0.433 ms
Execution Time: 10795.196 ms
</pre>
</div>

<p>Here’s how we spend this time:</p>

<ol>
  <li>Just over 10s in four <code class="language-plaintext highlighter-rouge">Index Scan</code> blocks (one per column). These scans return 50,214 rows for <code class="language-plaintext highlighter-rouge">reviewer_name</code>, 761 rows for <code class="language-plaintext highlighter-rouge">summary</code>, and 0 rows for <code class="language-plaintext highlighter-rouge">asin</code> and <code class="language-plaintext highlighter-rouge">reviewer_id</code>. In total, they hit 336,986 shared buffers, which is about 2.57GB at the default 8kB block size.</li>
  <li>769ms in <code class="language-plaintext highlighter-rouge">Nested Loop</code> blocks. These loops combine the input with the <code class="language-plaintext highlighter-rouge">Index Scan</code> results. It’s rather surprising that we spend any significant time here, but we could easily optimize this out by getting rid of the <code class="language-plaintext highlighter-rouge">input</code> CTE.</li>
</ol>
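
<p>As a rough sketch of that last point, here is what one branch of the union might look like with the query string inlined instead of joined from the <code class="language-plaintext highlighter-rouge">input</code> CTE. The shape of the branch is inferred from the plan above, and this variant isn’t benchmarked here, so treat any savings as an assumption. The <code class="language-plaintext highlighter-rouge">Index Scan</code> cost would remain either way.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- One branch of the union, with the query string inlined.
-- The other three branches would follow the same pattern.
select review_id,
       1 - ('Michael Lewis' &lt;-&gt; reviewer_name) as score
from reviews
where reviewer_name % 'Michael Lewis'
order by 'Michael Lewis' &lt;-&gt; reviewer_name
limit 10;
</code></pre></div></div>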

<p>If we want to search all four text columns, we’ll need to think a bit harder!</p>

<h3 id="one-four-column-query-with-disjunctions">One Four-Column Query with Disjunctions</h3>

<p>As a second pass, what if we flatten the four unioned queries into a single disjunctive query?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="n">review_id</span><span class="p">,</span>
       <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">least</span><span class="p">(</span>
        <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_id</span><span class="p">,</span>
        <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_name</span><span class="p">,</span>
        <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">summary</span><span class="p">,</span>
        <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">asin</span><span class="p">))</span> <span class="k">as</span> <span class="n">score</span> <span class="c1">-- (3)</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">reviewer_id</span>
   <span class="k">or</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">reviewer_name</span>
   <span class="k">or</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">summary</span>
   <span class="k">or</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">%</span> <span class="n">asin</span> <span class="c1">-- (1)</span>
<span class="k">order</span> <span class="k">by</span> <span class="n">least</span><span class="p">(</span>
    <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_id</span><span class="p">,</span>
    <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">reviewer_name</span><span class="p">,</span>
    <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">summary</span><span class="p">,</span>
    <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;-&gt;</span> <span class="n">asin</span><span class="p">)</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">-- (2)</span>
</code></pre></div></div>

<p>Explaining the numbered components:</p>

<ol>
  <li>We keep the row as a candidate if it’s a trigram match for any of the four columns.</li>
  <li>We sort the candidates by the lowest trigram distance to any of the four columns.</li>
  <li>We score the candidates by one minus the lowest trigram distance to any of the four columns. This is equivalent to the greatest trigram similarity.</li>
</ol>

<p>The performance works out like this:</p>

<table>
  <thead>
    <tr>
      <th>Query</th>
      <th>Query String</th>
      <th>Execution Time</th>
      <th>Buffer Hits</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Trigram</td>
      <td>exact name</td>
      <td>13.8s</td>
      <td>323953</td>
    </tr>
    <tr>
      <td>Trigram</td>
      <td>fuzzy name</td>
      <td>13.9s</td>
      <td>324728</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>exact name</td>
      <td>162ms</td>
      <td>14705</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>fuzzy name</td>
      <td>153ms</td>
      <td>12987</td>
    </tr>
  </tbody>
</table>

<p>👎 <strong>The performance is even worse: about 14s to find ten matches.</strong></p>

<div data-app-component="pev2">
<pre>
explain (analyze, buffers)
with input as (select 'Michael Lewis' as q)
select review_id,
       (1 - least(
        input.q &lt;-&gt; reviewer_id,
        input.q &lt;-&gt; reviewer_name,
        input.q &lt;-&gt; summary,
        input.q &lt;-&gt; asin)) as score -- (3)
from reviews, input
where input.q % reviewer_id
   or input.q % reviewer_name
   or input.q % summary
   or input.q % asin -- (1)
order by least(
    input.q &lt;-&gt; reviewer_id,
    input.q &lt;-&gt; reviewer_name,
    input.q &lt;-&gt; summary,
    input.q &lt;-&gt; asin) limit 10; -- (2)
</pre>
<pre>
Limit  (cost=5389.77..5389.79 rows=10 width=20) (actual time=13856.366..13856.370 rows=10 loops=1)
  Buffers: shared hit=323953
  -&gt;  Sort  (cost=5389.77..5403.40 rows=5452 width=20) (actual time=13856.364..13856.367 rows=10 loops=1)
"        Sort Key: (LEAST(('Michael Lewis'::text &lt;-&gt; (reviews.reviewer_id)::text), ('Michael Lewis'::text &lt;-&gt; (reviews.reviewer_name)::text), ('Michael Lewis'::text &lt;-&gt; (reviews.summary)::text), ('Michael Lewis'::text &lt;-&gt; (reviews.asin)::text)))"
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=323953
        -&gt;  Bitmap Heap Scan on reviews  (cost=102.01..5271.95 rows=5452 width=20) (actual time=10013.108..13837.707 rows=50929 loops=1)
              Recheck Cond: (('Michael Lewis'::text % (reviewer_id)::text) OR ('Michael Lewis'::text % (reviewer_name)::text) OR ('Michael Lewis'::text % (summary)::text) OR ('Michael Lewis'::text % (asin)::text))
              Filter: (('Michael Lewis'::text % (reviewer_id)::text) OR ('Michael Lewis'::text % (reviewer_name)::text) OR ('Michael Lewis'::text % (summary)::text) OR ('Michael Lewis'::text % (asin)::text))
              Heap Blocks: exact=34021
              Buffers: shared hit=323953
              -&gt;  BitmapOr  (cost=102.01..102.01 rows=5453 width=0) (actual time=9995.890..9995.892 rows=0 loops=1)
                    Buffers: shared hit=289932
                    -&gt;  Bitmap Index Scan on reviews_reviewer_id_trgm_gist_idx  (cost=0.00..15.07 rows=874 width=0) (actual time=2488.516..2488.517 rows=0 loops=1)
                          Index Cond: ((reviewer_id)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=69700
                    -&gt;  Bitmap Index Scan on reviews_reviewer_name_trgm_gist_idx  (cost=0.00..48.25 rows=2898 width=0) (actual time=3429.824..3429.824 rows=50214 loops=1)
                          Index Cond: ((reviewer_name)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=87344
                    -&gt;  Bitmap Index Scan on reviews_summary_trgm_gist_idx  (cost=0.00..18.28 rows=821 width=0) (actual time=4070.030..4070.030 rows=761 loops=1)
                          Index Cond: ((summary)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=132066
                    -&gt;  Bitmap Index Scan on reviews_asin_trgm_gist_idx  (cost=0.00..14.96 rows=859 width=0) (actual time=7.515..7.515 rows=0 loops=1)
                          Index Cond: ((asin)::text % 'Michael Lewis'::text)
                          Buffers: shared hit=822
Planning Time: 5.831 ms
Execution Time: 13856.489 ms
</pre>
</div>

<p>Here’s how we spend this time:</p>

<ol>
  <li>About 10s in four <code class="language-plaintext highlighter-rouge">Bitmap Index Scan</code> blocks, one per text column. Just like the previous iteration, these scans return 50,214 and 761 rows for <code class="language-plaintext highlighter-rouge">reviewer_name</code> and <code class="language-plaintext highlighter-rouge">summary</code>, respectively. In total, they hit 289,932 shared buffers, which is about 2.21GB at the default 8kB block size.</li>
  <li>About 3.8s in a <code class="language-plaintext highlighter-rouge">Bitmap Heap Scan</code>. This step deduplicates the data returned from the <code class="language-plaintext highlighter-rouge">Bitmap Index Scan</code> blocks. Unfortunately, it only removes 46 of the 50,975 rows returned from the scans, and it accesses another 266MB of data from the shared buffer cache.</li>
  <li>About 20ms in a <code class="language-plaintext highlighter-rouge">Sort</code> block that sorts the 50,929 rows returned from the previous blocks.</li>
</ol>

<p>Alas, the query is a bit more compact, but it doesn’t make very good use of time.</p>

<h3 id="one-four-column-query-with-an-expression-index">One Four-Column Query with an Expression Index</h3>

<p>Let’s give this one more try.
For this final pass, we’ll need to introduce two new concepts: <a href="https://www.postgresql.org/docs/14/indexes-expressional.html">expression indexes</a> and <a href="https://www.postgresql.org/docs/14/pgtrgm.html#id-1.11.7.42.6">trigram word_similarity</a>.</p>

<h4 id="expression-indexes">Expression Indexes</h4>

<p>An Expression Index lets us apply some function to a set of columns (all on the same table) and index the resulting values.</p>

<p>The canonical example is a query for a full name against a table with <code class="language-plaintext highlighter-rouge">first_name</code> and <code class="language-plaintext highlighter-rouge">last_name</code> columns:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">people</span> <span class="k">WHERE</span> <span class="p">(</span><span class="n">first_name</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span> <span class="n">last_name</span><span class="p">)</span> <span class="o">=</span> <span class="s1">'John Smith'</span><span class="p">;</span>
</code></pre></div></div>

<p>We don’t want to store a <code class="language-plaintext highlighter-rouge">full_name</code> column, as that would duplicate data and probably drift.
Instead, we can create an index on the same name concatenation:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">people_names</span> <span class="k">ON</span> <span class="n">people</span> <span class="p">((</span><span class="n">first_name</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span> <span class="n">last_name</span><span class="p">));</span>
</code></pre></div></div>

<p>Then, any query with the same expression can leverage the index – pretty cool if you ask me.</p>
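
<p>One way to check whether a query actually matches the indexed expression is to look at its plan. This is a minimal sketch against the hypothetical <code class="language-plaintext highlighter-rouge">people</code> table and <code class="language-plaintext highlighter-rouge">people_names</code> index from above. On a very small table, the planner may still prefer a sequential scan.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The WHERE clause repeats the indexed expression exactly,
-- so the planner can consider people_names.
explain
select * from people
where (first_name || ' ' || last_name) = 'John Smith';

-- A different expression (here, a different separator)
-- does not match the index expression, so people_names can't be used.
explain
select * from people
where (first_name || '-' || last_name) = 'John-Smith';
</code></pre></div></div>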

<h4 id="word_similarity"><code class="language-plaintext highlighter-rouge">word_similarity</code></h4>

<p>The trigram <code class="language-plaintext highlighter-rouge">word_similarity(text1, text2)</code> function is a variation on the <code class="language-plaintext highlighter-rouge">similarity(text1, text2)</code> function.</p>

<p>As a reminder, <code class="language-plaintext highlighter-rouge">similarity(text1, text2)</code> computes the intersection-over-union of the two trigram sets.
In contrast, <code class="language-plaintext highlighter-rouge">word_similarity(text1, text2)</code> computes the <em>greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string</em>.</p>

<p>That is quite a mouthful. 
For our purposes, the point is this: <code class="language-plaintext highlighter-rouge">similarity</code> is sensitive to the length of the two strings, whereas <code class="language-plaintext highlighter-rouge">word_similarity</code> is not!</p>

<p>Let’s look at an example that demonstrates the sensitivity to string length:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">,</span> <span class="n">similarity</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">),</span> <span class="n">word_similarity</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">)</span>
<span class="k">from</span>
<span class="p">(</span><span class="k">values</span> <span class="p">(</span><span class="s1">'louis'</span><span class="p">,</span> <span class="s1">'lewis'</span><span class="p">),</span>
        <span class="p">(</span><span class="s1">'louis'</span><span class="p">,</span> <span class="s1">'a lewis c'</span><span class="p">),</span>
        <span class="p">(</span><span class="s1">'louis'</span><span class="p">,</span> <span class="s1">'aa lewis cc'</span><span class="p">),</span>
        <span class="p">(</span><span class="s1">'louis'</span><span class="p">,</span> <span class="s1">'aaa lewis ccc'</span><span class="p">))</span> <span class="n">v</span><span class="p">(</span><span class="n">text1</span><span class="p">,</span> <span class="n">text2</span><span class="p">);</span>
</code></pre></div></div>

<p>Note how the <code class="language-plaintext highlighter-rouge">similarity</code> decreases as the length of <code class="language-plaintext highlighter-rouge">text2</code> increases, whereas <code class="language-plaintext highlighter-rouge">word_similarity</code> remains constant.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">text1</th>
      <th style="text-align: left">text2</th>
      <th style="text-align: left">similarity</th>
      <th style="text-align: left">word_similarity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">louis</td>
      <td style="text-align: left">lewis</td>
      <td style="text-align: left">0.2</td>
      <td style="text-align: left">0.2</td>
    </tr>
    <tr>
      <td style="text-align: left">louis</td>
      <td style="text-align: left">a lewis c</td>
      <td style="text-align: left">0.14285715</td>
      <td style="text-align: left">0.2</td>
    </tr>
    <tr>
      <td style="text-align: left">louis</td>
      <td style="text-align: left">aa lewis cc</td>
      <td style="text-align: left">0.125</td>
      <td style="text-align: left">0.2</td>
    </tr>
    <tr>
      <td style="text-align: left">louis</td>
      <td style="text-align: left">aaa lewis ccc</td>
      <td style="text-align: left">0.11111111</td>
      <td style="text-align: left">0.2</td>
    </tr>
  </tbody>
</table>
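
<p>To see where the 0.2 in the first row comes from, we can list the trigrams that pg_trgm extracts from each string using its <code class="language-plaintext highlighter-rouge">show_trgm</code> function. A minimal sketch:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- List the trigrams pg_trgm extracts from each string.
select show_trgm('louis') as louis_trigrams,
       show_trgm('lewis') as lewis_trigrams;
</code></pre></div></div>

<p>Each string yields six trigrams, because pg_trgm pads every word with two spaces in front and one behind. The two sets share only the trigram at the start of the word (two padding spaces followed by <code class="language-plaintext highlighter-rouge">l</code>) and the one at the end (<code class="language-plaintext highlighter-rouge">is</code> followed by a padding space). That’s two shared trigrams out of ten distinct ones, hence 2 / 10 = 0.2.</p>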

<p>Why does this property matter?
I don’t want to give too much away, but we just described an indexing technique that leverages concatenated text columns.
Concatenated text columns are, by definition, longer than individual text columns.</p>

<p>Some final details, for the sake of completeness:</p>

<ol>
  <li>The order of arguments matters. <code class="language-plaintext highlighter-rouge">word_similarity(text1, text2)</code> is not symmetric: it looks for the trigrams of <code class="language-plaintext highlighter-rouge">text1</code> within <code class="language-plaintext highlighter-rouge">text2</code>, so swapping the arguments generally changes the result.</li>
  <li>The <code class="language-plaintext highlighter-rouge">text1 &lt;&lt;-&gt; text2</code> operator is used to compute word_similarity distance, i.e., <code class="language-plaintext highlighter-rouge">1 - word_similarity(text1, text2)</code>. This is analogous to <code class="language-plaintext highlighter-rouge">text1 &lt;-&gt; text2</code> and <code class="language-plaintext highlighter-rouge">1 - similarity(text1, text2)</code>.</li>
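  <li>The <code class="language-plaintext highlighter-rouge">text1 &lt;% text2</code> operator is used to filter for <code class="language-plaintext highlighter-rouge">word_similarity(text1, text2)</code> exceeding a threshold. The threshold defaults to 0.6 and is controlled by the <code class="language-plaintext highlighter-rouge">pg_trgm.word_similarity_threshold</code> setting. (The similar-looking <code class="language-plaintext highlighter-rouge">&lt;&lt;%</code> operator applies to <code class="language-plaintext highlighter-rouge">strict_word_similarity</code> instead.)</li>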
</ol>
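
<p>These details can be poked at directly in a session. The snippet below is only a sketch: the value 0.5 is an arbitrary illustration, not a recommendation, and the column aliases are made up for readability.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Inspect the threshold used by the word_similarity filter operator.
show pg_trgm.word_similarity_threshold;

-- Change it for the current session only, then restore the default.
set pg_trgm.word_similarity_threshold = 0.5;
reset pg_trgm.word_similarity_threshold;

-- Argument order matters: the first argument is matched against
-- a continuous extent of the second.
select word_similarity('lewis', 'michael lewis') as query_in_text,
       word_similarity('michael lewis', 'lewis') as text_in_query;
</code></pre></div></div>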

<h4 id="a-blazing-fast-search-query">A Blazing Fast Search Query</h4>

<p>Let’s put our knowledge of expression indexes and <code class="language-plaintext highlighter-rouge">word_similarity</code> to use.</p>

<p>We’ll start by building an index on the concatenation expression of all four text columns. 
We have to coalesce the columns to empty strings, as they are all nullable.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">index</span> <span class="n">reviews_searchable_text_trgm_gist_idx</span> <span class="k">on</span> <span class="n">reviews</span>
  <span class="k">using</span> <span class="n">gist</span><span class="p">((</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">asin</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_id</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_name</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span>  <span class="n">gist_trgm_ops</span><span class="p">(</span><span class="n">siglen</span><span class="o">=</span><span class="mi">256</span><span class="p">));</span>
</code></pre></div></div>

<p>This takes about 16 minutes to build and ends up using about 2.2GB of storage.</p>
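
<p>To check the storage figure on your own data, Postgres can report the on-disk size of the index. A minimal sketch, assuming the index name from the statement above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Report the on-disk size of the expression index in human-readable units.
select pg_size_pretty(
         pg_relation_size('reviews_searchable_text_trgm_gist_idx')
       ) as index_size;
</code></pre></div></div>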

<p>Now we need a search query that can leverage this index. 
Behold, our new trigram search query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Louis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span>
<span class="k">select</span> <span class="n">review_id</span><span class="p">,</span>
      <span class="mi">1</span> <span class="o">-</span> <span class="p">(</span><span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;&lt;-&gt;</span> <span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">asin</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span> 
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_id</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_name</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="s1">''</span><span class="p">)))</span> <span class="k">as</span> <span class="n">score</span>                    <span class="c1">-- (3)</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;%</span> <span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">asin</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_id</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_name</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span>                              <span class="c1">-- (1)</span>
<span class="k">order</span> <span class="k">by</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">&lt;&lt;-&gt;</span> <span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">asin</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_id</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_name</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span> <span class="k">limit</span> <span class="mi">10</span><span class="p">;</span>                    <span class="c1">-- (2)</span>
</code></pre></div></div>

<p>The numbered components should help cut through the concatenations:</p>

<ol>
  <li>We use <code class="language-plaintext highlighter-rouge">input.q &lt;% concatenated_columns</code> to filter the table down to a set of candidate rows. For each of these rows, <code class="language-plaintext highlighter-rouge">input.q</code> and the concatenated columns have a trigram word similarity greater than or equal to 0.6.</li>
  <li>Once we have candidate rows, we compute and sort by the trigram word distance between <code class="language-plaintext highlighter-rouge">input.q</code> and the concatenated columns.</li>
  <li>In order to return the score, we just subtract the trigram word distance from 1.0.</li>
</ol>

<p>The corresponding exact-only search query looks similar:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">explain</span> <span class="p">(</span><span class="k">analyze</span><span class="p">,</span> <span class="n">buffers</span><span class="p">)</span>
<span class="k">with</span> <span class="k">input</span> <span class="k">as</span> <span class="p">(</span><span class="k">select</span> <span class="s1">'Michael Lewis'</span> <span class="k">as</span> <span class="n">q</span><span class="p">)</span>
<span class="k">select</span> <span class="n">review_id</span><span class="p">,</span>
      <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="k">as</span> <span class="n">score</span>
<span class="k">from</span> <span class="n">reviews</span><span class="p">,</span> <span class="k">input</span>
<span class="k">where</span> <span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">asin</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_id</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">reviewer_name</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span>
      <span class="n">coalesce</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span> <span class="k">ilike</span> <span class="s1">'%'</span> <span class="o">||</span> <span class="k">input</span><span class="p">.</span><span class="n">q</span> <span class="o">||</span> <span class="s1">'%'</span>
<span class="k">limit</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>The results for the trigram search query on the exact name look like this:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">review_id</th>
      <th style="text-align: left">asin</th>
      <th style="text-align: left">reviewer_id</th>
      <th style="text-align: left">reviewer_name</th>
      <th style="text-align: left">summary</th>
      <th style="text-align: left">score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">2108562</td>
      <td style="text-align: left">0393072231</td>
      <td style="text-align: left">A22GLZ0P4MGO0W</td>
      <td style="text-align: left">Thom Mitchell</td>
      <td style="text-align: left">Another Michael Lewis Must Read</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2111265</td>
      <td style="text-align: left">0393081818</td>
      <td style="text-align: left">A1VJF95Y8HMXW9</td>
      <td style="text-align: left">Louis Kokernak</td>
      <td style="text-align: left">Another fun and informative read from Michael Lewis</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2114047</td>
      <td style="text-align: left">0393244660</td>
      <td style="text-align: left">A13U0KMO103QJP</td>
      <td style="text-align: left">Larry L. Roberts</td>
      <td style="text-align: left">Another great book by Michael Lewis. A must read for the small investor.</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2108273</td>
      <td style="text-align: left">0393072231</td>
      <td style="text-align: left">A1P1WJTZGC955H</td>
      <td style="text-align: left">ITS</td>
      <td style="text-align: left">Another Michael Lewis Masterpiece</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2097231</td>
      <td style="text-align: left">0393057658</td>
      <td style="text-align: left">A3MYOI5BL91KKA</td>
      <td style="text-align: left">Joseph M. Powers</td>
      <td style="text-align: left">Standard, high quality, Michael Lewis offering</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2097049</td>
      <td style="text-align: left">0393057658</td>
      <td style="text-align: left">A2QHM5HBSIXRL4</td>
      <td style="text-align: left">Andy Orrock</td>
      <td style="text-align: left">Another good work from Michael Lewis</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2113780</td>
      <td style="text-align: left">0393244660</td>
      <td style="text-align: left">APM2KUPZYHB94</td>
      <td style="text-align: left">Alice</td>
      <td style="text-align: left">Michael Lewis Fan</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2108394</td>
      <td style="text-align: left">0393072231</td>
      <td style="text-align: left">A2JOZET739XZT7</td>
      <td style="text-align: left">Mark Haslett</td>
      <td style="text-align: left">Big Fan of Michael Lewis</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2108244</td>
      <td style="text-align: left">0393072231</td>
      <td style="text-align: left">A27NDIDE8W9YQC</td>
      <td style="text-align: left">Gderf</td>
      <td style="text-align: left">The Big Short by Michael Lewis</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2111212</td>
      <td style="text-align: left">0393081818</td>
      <td style="text-align: left">A2X1XC7SQQGXFH</td>
      <td style="text-align: left">Ian C Freund</td>
      <td style="text-align: left">Michael Lewis is amazing</td>
      <td style="text-align: left">1</td>
    </tr>
  </tbody>
</table>

<p>And the trigram search query on the fuzzy name:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">review_id</th>
      <th style="text-align: left">asin</th>
      <th style="text-align: left">reviewer_id</th>
      <th style="text-align: left">reviewer_name</th>
      <th style="text-align: left">summary</th>
      <th style="text-align: left">score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">1368320</td>
      <td style="text-align: left">0316013684</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Fun and enlightening</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">1683931</td>
      <td style="text-align: left">0345536592</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Odd Thomas Collection</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">3803521</td>
      <td style="text-align: left">077831233X</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Real law by a real lawyer</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2990026</td>
      <td style="text-align: left">0553808036</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Koontz Remains the Master</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">5497049</td>
      <td style="text-align: left">1455546143</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Could not put this down…</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">1856766</td>
      <td style="text-align: left">0375411089</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">skinny dip</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">2000799</td>
      <td style="text-align: left">0385343078</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Great Historical Fiction</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">3836540</td>
      <td style="text-align: left">0778327760</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Teller Rocks</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">5536658</td>
      <td style="text-align: left">1460201051</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">The Cat Didn’t really do it</td>
      <td style="text-align: left">1</td>
    </tr>
    <tr>
      <td style="text-align: left">3478374</td>
      <td style="text-align: left">074326875X</td>
      <td style="text-align: left">A106393MZH9T4M</td>
      <td style="text-align: left">Michael Louis Minns</td>
      <td style="text-align: left">Pretty good read</td>
      <td style="text-align: left">1</td>
    </tr>
  </tbody>
</table>

<p>It turns out there was an avid reviewer named Michael Louis. Go figure!</p>

<p>Performance works out like this:</p>

<table>
  <thead>
    <tr>
      <th>Query</th>
      <th>Query String</th>
      <th>Execution Time</th>
      <th>Buffer Hits</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Trigram</td>
      <td>exact name</td>
      <td>39ms</td>
      <td>1685</td>
    </tr>
    <tr>
      <td>Trigram</td>
      <td>fuzzy name</td>
      <td>113ms</td>
      <td>5094</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>exact name</td>
      <td>37ms</td>
      <td>4345</td>
    </tr>
    <tr>
      <td>Exact-Only</td>
      <td>fuzzy name</td>
      <td>87ms</td>
      <td>10633</td>
    </tr>
  </tbody>
</table>

<p>👍 <strong>A significant improvement: from over 10s to just over 100ms!</strong></p>

<p>Let’s look at the plan for trigram search with the exact name to understand why this is faster:</p>

<div data-app-component="pev2">
<pre>
with input as (select 'Michael Lewis' as q)
select review_id,
      1 - (input.q &lt;&lt;-&gt; (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))) as score                    -- (3)
from reviews, input
where input.q &lt;% (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, ''))                              -- (1)
order by input.q &lt;&lt;-&gt; (coalesce(asin, '') || ' ' ||
      coalesce(reviewer_id, '') || ' ' ||
      coalesce(reviewer_name, '') || ' ' ||
      coalesce(summary, '')) limit 10;                    -- (2)
</pre>
<pre>
Limit  (cost=0.42..7.82 rows=10 width=20) (actual time=8.202..38.716 rows=10 loops=1)
  Buffers: shared hit=1685
  -&gt;  Index Scan using reviews_searchable_text_trgm_gist_idx on reviews  (cost=0.42..65909.97 rows=88980 width=20) (actual time=8.200..38.709 rows=10 loops=1)
"        Index Cond: ((((((((COALESCE(asin, ''::character varying))::text || ' '::text) || (COALESCE(reviewer_id, ''::character varying))::text) || ' '::text) || (COALESCE(reviewer_name, ''::character varying))::text) || ' '::text) || (COALESCE(summary, ''::character varying))::text) %&gt; 'Michael Lewis'::text)"
        Rows Removed by Index Recheck: 3
"        Order By: ((((((((COALESCE(asin, ''::character varying))::text || ' '::text) || (COALESCE(reviewer_id, ''::character varying))::text) || ' '::text) || (COALESCE(reviewer_name, ''::character varying))::text) || ' '::text) || (COALESCE(summary, ''::character varying))::text) &lt;-&gt;&gt; 'Michael Lewis'::text)"
        Buffers: shared hit=1685
Planning Time: 0.176 ms
Execution Time: 38.772 ms
</pre>
</div>

<p>One last time, here’s how we spend our time:</p>

<ol>
  <li>About 40ms in an <code class="language-plaintext highlighter-rouge">Index Scan</code> block. This uses the new <code class="language-plaintext highlighter-rouge">reviews_searchable_text_trgm_gist_idx</code> index for filtering and sorting and returns exactly 10 rows. It accesses just over 13MB of data from the shared buffer cache.</li>
</ol>
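
<p>As a rough check on that 13MB figure, the arithmetic can be sketched in SQL. This assumes the default Postgres block size of 8kB (8192 bytes), which can differ on custom builds:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Each shared buffer hit corresponds to one block served from the buffer cache.
-- Assumption: the default 8192-byte block size. Verify with: show block_size;
select pg_size_pretty(1685 * 8192::bigint) as data_accessed;
</code></pre></div></div>

<p>That is 1685 blocks × 8kB = 13,480kB, or a little over 13MB.</p>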

<h3 id="single-query-summary">Single Query Summary</h3>

<p>Here’s what we know about combining four columns in a single query:</p>

<ol>
  <li>Unioning four queries was more than a 4x slowdown: 2s for one column → 10s for four.</li>
  <li>Introducing a clever disjunction made it even slower: 10s → 14s.</li>
  <li>Leveraging an expression index and a new trigram operator is our winner: 10s → 113ms.</li>
</ol>

<h1 id="conclusion">Conclusion</h1>

<p>Through some effort and iteration, we’ve arrived at a very performant query.</p>

<p><em>We started at 90 seconds to search one text column and ended at 113ms for four columns.</em></p>

<p>Our implementation consisted primarily of Postgres trigram and string matching operators, and our optimizations used three main techniques:</p>

<ol>
  <li>Indexing the text columns</li>
  <li>Separating exact search queries from trigram search queries</li>
  <li>Cleverly combining all four text columns into a single index and single query</li>
</ol>

<p>Throughout the iterations, we leveraged <code class="language-plaintext highlighter-rouge">explain (analyze, buffers)</code> with the PEV2 visualizer to understand how we were spending our time on execution and I/O.</p>

<p>As always, I hope this post will save someone a bit of time learning, debugging, and optimizing!</p>

<h1 id="appendix">Appendix</h1>

<h2 id="discussion">Discussion</h2>

<ul>
  <li>There was some discussion about this post on <a href="https://news.ycombinator.com/item?id=30433269">Hacker News</a> and <a href="https://www.reddit.com/r/PostgreSQL/comments/swe8v8/optimizing_postgres_text_search_with_trigrams/">r/PostgreSQL</a>.</li>
  <li>The Scaling Postgres podcast covered this post in <a href="https://www.scalingpostgres.com/episodes/204-optimizing-trigram-search-replication-review-logical-improvements-timescale-investment/">episode 204</a>.</li>
  <li>The 5mins of Postgres podcast covered this post in <a href="https://pganalyze.com/blog/5mins-postgres-optimizing-postgres-text-search-trigrams-gist-indexes">episode 6</a>.</li>
</ul>

<h2 id="potential-improvements">Potential Improvements</h2>

<p>Some folks have responded with interesting suggestions for potential improvements.
I’ll cover them below, and might eventually try some of them and update the post.</p>

<h3 id="generated-columns">Generated Columns</h3>

<p>In <a href="https://youtu.be/yih3qEiIC_U?t=510">episode 204 of the Scaling Postgres Podcast, around 8:30</a>, the host made a nice suggestion that we might be able to use the <a href="https://www.postgresql.org/docs/14/ddl-generated-columns.html">Generated Columns feature</a> to minimize the string concatenation boilerplate from the final query.</p>

<p>Some commenters on Hacker News also mentioned that the string concatenation is tedious.
I agree it’s hard to read.
We also have to make sure that the concatenation in each query exactly matches the expression used in the expression index. Otherwise the query won’t use the index, which could be a subtle and painful performance regression.</p>

<p>I’ve never used the Generated Columns feature, but I think the solution might look something like this: define a fifth generated text column, specify that the column is generated as the concatenation of the four other columns, build a standard index on that column, and reference that column in search queries. 
I think this could work.</p>
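
<p>Here’s a minimal, untested sketch of what that might look like. The column name <code class="language-plaintext highlighter-rouge">searchable_text</code> and the index name <code class="language-plaintext highlighter-rouge">reviews_searchable_text_gen_idx</code> are hypothetical, and the sketch assumes Postgres 12 or later with the <code class="language-plaintext highlighter-rouge">pg_trgm</code> extension installed:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical names: searchable_text and reviews_searchable_text_gen_idx
-- are for illustration only and are not part of the schema used in this post.

-- Define the concatenation once, as a stored generated column.
alter table reviews
  add column searchable_text text
  generated always as (
    coalesce(asin, '') || ' ' ||
    coalesce(reviewer_id, '') || ' ' ||
    coalesce(reviewer_name, '') || ' ' ||
    coalesce(summary, '')
  ) stored;

-- Build a plain (non-expression) trigram GiST index on the new column.
create index reviews_searchable_text_gen_idx
  on reviews using gist (searchable_text gist_trgm_ops);

-- The search query references the column instead of repeating the expression.
with input as (select 'Michael Lewis' as q)
select review_id,
       1 - (input.q &lt;&lt;-&gt; searchable_text) as score
from reviews, input
where input.q &lt;% searchable_text
order by input.q &lt;&lt;-&gt; searchable_text
limit 10;
</code></pre></div></div>

<p>With this shape, the query and the index can no longer drift apart, because the expression lives in exactly one place. Note that adding a stored generated column rewrites the table, so on a large table this is not a free migration.</p>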

<p>My only hesitation would be that the generated column is materialized, so it takes up additional space. 
The docs say specifically, “PostgreSQL currently implements only stored generated columns.” 
Depending on the size of the table, it might not make any difference and optimizing for readability/simplicity would be great.
But that tradeoff seems worth remembering.</p>

<h3 id="materialized-views">Materialized Views</h3>

<p>Some commenters on Hacker News mentioned that things get tricky if we have text columns on multiple tables and suggested it might be easier to move all of the text data into a materialized view.
I agree this could work, with some caveats.</p>

<p>The data model would have to allow for mapping each searchable “entity” to a single row in the materialized view.
This can get tricky with 1:N relationships.
For example, imagine a database for a blog: an article can have many comments, with articles and comments in their own separate tables.
We want to search for articles, such that our query matches against both the article text and the corresponding comment text.
Our query could match multiple comments for the same article, but we only want to return the article once.
We would have to find a way to represent an article and all its comments as a single row in a materialized view, and it’s not immediately obvious how we would do that.</p>
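
<p>One possibility, which I have not benchmarked, is to aggregate the comment text into the article’s row. The sketch below assumes hypothetical tables <code class="language-plaintext highlighter-rouge">articles(id, body)</code> and <code class="language-plaintext highlighter-rouge">comments(article_id, body)</code>, and all other names are made up for illustration:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical schema: articles(id, body) and comments(article_id, body).
-- One row per article: the article text followed by all of its comments.
create materialized view article_search as
select a.id as article_id,
       coalesce(a.body, '') || ' ' ||
       coalesce(string_agg(c.body, ' '), '') as searchable_text
from articles a
left join comments c on c.article_id = a.id
group by a.id, a.body;

-- Trigram index on the combined text, as with the reviews table.
create index article_search_trgm_gist_idx
  on article_search using gist (searchable_text gist_trgm_ops);
</code></pre></div></div>

<p>The left join keeps articles that have no comments, and the grouping guarantees each article appears once. The combined text can get long for heavily commented articles, so whether this performs acceptably is an open question.</p>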

<p>We have to account for eventual consistency.
For example, imagine the same database for a blog.
A user can delete an article or comment, but it remains in the materialized view until the next refresh.
Now we need some filtering logic to prevent returning stale results from the materialized view.
This could introduce complexity that cancels out any wins from using the materialized view in the first place.
I find that eventual consistency is a reality we should all accept in distributed systems, but we should also try to avoid introducing it within a single relational database.</p>

<p>Finally, we would also need a reliable mechanism to refresh the materialized view.
This is actually the biggest pitfall in my opinion: I’ve yet to find a satisfying mechanism for refreshing without introducing unfortunate performance dynamics, like decreasing query throughput every five minutes because the refresh is hogging resources.</p>
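
<p>For reference, the built-in option looks roughly like this. The view name <code class="language-plaintext highlighter-rouge">article_search</code> and its <code class="language-plaintext highlighter-rouge">article_id</code> column are hypothetical, and how the refresh gets scheduled is left out entirely:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical materialized view article_search with an article_id column.
-- A concurrent refresh requires a unique index on the materialized view.
create unique index article_search_article_id_idx
  on article_search (article_id);

-- Recomputes the view without blocking concurrent reads of it.
refresh materialized view concurrently article_search;
</code></pre></div></div>

<p>The concurrent variant keeps the view readable during the refresh, but it still re-runs the entire underlying query, so it does not solve the resource contention described above.</p>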

<p>This is also why I’m particularly excited about a new Postgres feature currently under development, <a href="https://wiki.postgresql.org/wiki/Incremental_View_Maintenance">Incremental View Maintenance</a> (IVM).
With IVM, the promise is that we can define a materialized view that is atomically updated on any write to the source table.
I encourage folks to look around the docs and discussions surrounding the feature – it’s quite interesting.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:gitlab-elasticsearch" role="doc-endnote">
      <p>GitLab’s evolution from Postgres trigrams to Elasticsearch: <a href="https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/">Fast Search Using PostgreSQL Trigram Text Indexes (March 2016)</a>; <a href="https://about.gitlab.com/blog/2019/03/20/enabling-global-search-elasticsearch-gitlab-com/">Lessons from our journey to enable global code search with Elasticsearch on GitLab.com (March 2019)</a>; <a href="https://about.gitlab.com/blog/2019/07/16/elasticsearch-update/">Update: The challenge of enabling Elasticsearch on GitLab.com (July 2019)</a>; <a href="https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/">Update: Elasticsearch lessons learnt for Advanced Global Search 2020-04-28 (April 2020)</a>; <a href="https://docs.gitlab.com/ee/administration/troubleshooting/elasticsearch.html">Troubleshooting Elasticsearch</a>. <a href="#fnref:gitlab-elasticsearch" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:gitlab-full-text-search" role="doc-endnote">
      <p>The conclusions in <a href="https://gitlab.com/gitlab-org/gitlab-foss/-/issues/42442#note_91045483">GitLab’s investigation of Full Text Search</a> align well with my findings. <a href="#fnref:gitlab-full-text-search" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:jmacauley-citation" role="doc-endnote">
      <p>Please see Julian McAuley’s <a href="http://jmcauley.ucsd.edu/data/amazon/links.html">Amazon Product Data Landing Page</a>, <a href="http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf">Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</a>, <a href="http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf">Image-based recommendations on styles and substitutes</a>. <a href="#fnref:jmacauley-citation" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:michael-lewis" role="doc-endnote">
      <p>Michael Lewis has an uncanny ability to make mundane, complicated topics entertaining. Some of my favorites are <em>The Big Short</em>, <em>Boomerang</em>, and <em>Flash Boys</em>. <a href="#fnref:michael-lewis" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:buffers" role="doc-endnote">
      <p>For much more about buffers, I recommend reading this excellent article from Postgres.ai: <a href="https://postgres.ai/blog/20220106-explain-analyze-needs-buffers-to-improve-the-postgres-query-optimization-process">EXPLAIN (ANALYZE) needs BUFFERS to improve the Postgres query optimization process</a> <a href="#fnref:buffers" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:buffers:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:pev2" role="doc-endnote">
      <p>Fun fact: I used to be a Javascript/React developer (ca. 2015). But I’m not anymore, and that’s why I used iframes to make this work. <a href="#fnref:pev2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[In this post, we'll implement and optimize a text search system based on Postgres Trigrams]]></summary></entry></feed>