Speed Limits for Rolling Restarts in Kubernetes

July 20, 2021

Introduction

This post shows how we can tune a Kubernetes Deployment with slow-starting pods to execute rolling restarts more gracefully. In particular, we’ll focus on a case in which each pod needs a warmup phase that is not easily captured by a probe.

For brevity, let’s assume we already have a working knowledge of Kubernetes Deployments,¹ including the rolling restart strategy, ReplicaSets,² and probes.³

TLDR: adjust maxSurge, maxUnavailable, and minReadySeconds to prevent sending traffic to too many new pods at once.

Slow-Starting Pods

Application pods can be slow to start for a variety of reasons:

In a read-heavy app, each pod might populate a local cache.
In a stateful app, each pod needs to hydrate some initial state from a datastore.
In a clustered app, each pod needs to connect to some of its peers.
In an app running an interpreted or just-in-time-compiled (JIT) language, each pod incurs some startup cost for compiling its hotspots, often called a warmup phase.⁴

The first three cases are generally solved by tuning the startup, liveness, and readiness probes.

The fourth case is subtle, because the app could pass all its probes and still not be totally ready for full traffic.

Not All at Once

The issue that led to this post involved a JVM-based web service with high traffic and low latency requirements. The story started when we rolled out a simple change. We ran a standard rolling restart to roll out the new application image, and this somehow increased latency to the point of triggering production alerts.

After inspecting metrics, we noticed a clear pattern: every new pod had a large spike in CPU usage and request latency for the first 30 seconds of its runtime. After some profiling, we were able to attribute these spikes to JVM warmup. Historical metrics suggested we had been flying close to the sun for some time.

Barring some JVM gymnastics, each new pod incurs the cost of JVM warmup.⁵ Requests sent to a pod during warmup will inevitably be slower, and if too many pods are warming up simultaneously, we end up with a significant overall spike in latency.

To summarize, our standard rolling restarts were adding too many new pods, too quickly.

The Test Bench

For the rest of this post, we’ll evaluate some options for solving this type of problem. We’ll use the Nginx deployment commonly seen in Kubernetes’ docs, running in Minikube.⁶ To be clear, Nginx doesn’t really need a warmup phase, but let’s just imagine it’s some other container that does.

For each option, we’ll execute the following steps:

Create the deployment with four pods: kubectl apply -f <file>.
Wait for all pods to be ready: kubectl rollout status deployment/<name>.
Initiate a rolling restart of the deployment: kubectl rollout restart deployment/<name>.
Observe the resulting restart behavior: kubectl get replicaset, sampled every second.

For the last step, we’ll observe three counters returned from the kubectl get replicaset command:

DESIRED is the number of pods our ReplicaSet should end up with.
CURRENT is the number of pods currently running, in any state.
READY is the number of pods that have passed their readiness probe.
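
The steps and counters above can be scripted. The following is a minimal Python sketch, not the script used for this post (that one was written in bash and isn’t shown here). The function names, the app=nginx label selector, and the injectable run parameter are assumptions for illustration; the kubectl subcommands are the ones listed above.

```python
import subprocess
import time


def rollout_commands(manifest, name):
    """Return the kubectl invocations for the first three steps, in order."""
    return [
        ["kubectl", "apply", "-f", manifest],
        ["kubectl", "rollout", "status", f"deployment/{name}"],
        ["kubectl", "rollout", "restart", f"deployment/{name}"],
    ]


def kubectl_get_replicasets(selector="app=nginx"):
    """Last step: fetch the ReplicaSet table (needs kubectl and a cluster)."""
    cmd = ["kubectl", "get", "replicaset", "-l", selector]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def sample(run=kubectl_get_replicasets, samples=30, interval=1.0):
    """Call `run` once per interval and collect the raw table text."""
    collected = []
    for _ in range(samples):
        collected.append(run())
        time.sleep(interval)
    return collected
```

Against a live cluster, print("---\n".join(sample())) would print one table per second, separated by --- lines. Passing a different run callable makes the loop easy to exercise without a cluster.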

Round 1: Default Rolling Restart Strategy

This is a common starting point for a deployment. We specify the deployment has four identical pods, each running a single container, with HTTP liveness and readiness probes. All else is left as defaults.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          readinessProbe:
            httpGet:
              path: /
              port: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
  replicas: 4

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       7s
---
nginx-old   3         3         3       8s
nginx-new   2         2         0       1s
---
nginx-old   2         2         2       9s
nginx-new   3         3         1       2s
---
nginx-old   2         2         2       10s
nginx-new   3         3         1       3s
---
nginx-old   1         1         1       11s
nginx-new   4         4         2       4s
---
nginx-old   1         1         1       12s
nginx-new   4         4         2       5s
---
nginx-old   0         0         0       13s
nginx-new   4         4         3       6s
---
nginx-old   0         0         0       14s
nginx-new   4         4         4       7s

There are two main behaviors to observe here.

First, for the majority of the restart, we have three pods in ready state. This turns out to make sense. The deployment spec has a setting called maxUnavailable. According to the docs, maxUnavailable “specifies the maximum number of Pods that can be unavailable during the update process” and defaults to 25% of the desired count. In our example, this means the overall deployment can have one pod unavailable during the rolling restart. This effectively means we need to be 25% over-provisioned to gracefully support a rolling restart.
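
To make the arithmetic concrete: when maxUnavailable is given as a percentage, Kubernetes converts it to an absolute number by rounding down. A small sketch of that calculation follows. The function names are made up for illustration and are not part of any Kubernetes API.

```python
def resolve_max_unavailable(replicas, percent):
    """Absolute maxUnavailable for a percentage value (rounded down)."""
    return replicas * percent // 100


def min_available(replicas, percent=25):
    """Fewest available pods the rollout is allowed to drop to."""
    return replicas - resolve_max_unavailable(replicas, percent)


print(resolve_max_unavailable(4, 25))  # 1
print(min_available(4))                # 3
```

With four replicas the floor is three available pods, which matches the output above.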

Second, we go from having four old pods to four new pods in just seven seconds. We should be wary of this if our pods have any significant startup costs that cannot be captured with standard probes.

Round 2: Set maxUnavailable to 0 and maxSurge to 1

Let’s tackle the first problem: we want to maintain four pods at all times.

In order to do that, we’ll set maxUnavailable to 0.

We also need to introduce a new setting: maxSurge, which “specifies the maximum number of Pods that can be created over the desired number of Pods.” Like maxUnavailable, it defaults to 25%. The default would work in our example, but I’ve found it’s better to specify explicitly. In a large deployment, adding 25% over the desired replica count could exceed resource quotas.⁷
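
As a rough illustration of why an explicit value can be safer: a percentage maxSurge is rounded up to an absolute number, so the number of extra pods grows with the deployment. The sketch below compares the peak pod count for the 25% default against a fixed surge of 1. The function names are illustrative, and the 200-replica deployment is hypothetical.

```python
def resolve_max_surge(replicas, percent):
    """Absolute maxSurge for a percentage value (rounded up)."""
    return -(-replicas * percent // 100)  # ceiling division


def peak_pods(replicas, surge):
    """Most pods (old plus new) that can exist during the rollout."""
    return replicas + surge


for replicas in (4, 200):  # 200 is a made-up "large deployment"
    default_surge = resolve_max_surge(replicas, 25)
    print(replicas, peak_pods(replicas, default_surge), peak_pods(replicas, 1))
# 4 5 5
# 200 250 201
```

For four replicas the two settings behave identically, but at 200 replicas the default would ask for 50 extra pods at the peak instead of one.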

So we’ll set maxSurge to 1.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          readinessProbe:
            httpGet:
              path: /
              port: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0 # New!
      maxSurge: 1       # New!

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       6s
---
nginx-old   4         4         4       7s
nginx-new   1         1         0       1s
---
nginx-old   3         3         3       9s
nginx-new   2         2         1       3s
---
nginx-old   3         3         3       10s
nginx-new   2         2         1       4s
---
nginx-old   3         3         3       11s
nginx-new   2         2         1       5s
---
nginx-old   2         2         2       12s
nginx-new   3         3         2       6s
---
nginx-old   2         2         2       13s
nginx-new   3         3         2       7s
---
nginx-old   1         1         1       14s
nginx-new   4         4         3       8s
---
nginx-old   1         1         1       15s
nginx-new   4         4         3       9s
---
nginx-old   1         1         1       16s
nginx-new   4         4         3       10s
---
nginx-old   1         1         1       18s
nginx-new   4         4         3       12s
---
nginx-old   0         0         0       18s
nginx-new   4         4         4       12s

This solves the first problem: the total number of ready pods never fell below four.
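
A claim like this can be checked mechanically against the sampled output. The sketch below parses samples in the format shown above (rows of NAME, DESIRED, CURRENT, READY, AGE separated by --- lines) and reports the total READY count per sample. The parser and its names are illustrative assumptions; the embedded rows are the first three samples from the Round 2 output.

```python
def total_ready_per_sample(text):
    """Sum the READY column across ReplicaSets for each `---` separated sample."""
    totals = []
    for block in text.split("---"):
        rows = [line.split() for line in block.strip().splitlines()]
        rows = [r for r in rows if r and r[0] != "NAME"]  # drop the header row
        if rows:
            totals.append(sum(int(r[3]) for r in rows))  # READY is column 4
    return totals


ROUND_2_EXCERPT = """\
NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       6s
---
nginx-old   4         4         4       7s
nginx-new   1         1         0       1s
---
nginx-old   3         3         3       9s
nginx-new   2         2         1       3s
"""

totals = total_ready_per_sample(ROUND_2_EXCERPT)
print(totals)       # [4, 4, 4]
print(min(totals))  # 4
```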

The transition from four old pods to four new pods was a bit slower (twelve seconds), but still fast enough to make us nervous if we’re concerned about something like JVM warmup.

Round 3: Set minReadySeconds, maxUnavailable to 0, and maxSurge to 1

Now let’s solve the second problem: we want a way to control the speed of our rolling restart.

It turns out there’s a setting for this as well: minReadySeconds. According to the docs, minReadySeconds “specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available” and defaults to zero.

Let’s say our application takes about three seconds to warm up and reach steady-state on key metrics.

So we’ll set minReadySeconds to 3.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        readinessProbe:
          httpGet:
            path: /
            port: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 3 # New!

Let’s observe the replicasets to see how the restart behaves:

NAME        DESIRED   CURRENT   READY   AGE
nginx-old   4         4         4       9s
---
nginx-old   4         4         4       10s
nginx-new   1         1         0       1s
---
nginx-old   4         4         4       11s
nginx-new   1         1         1       2s
---
nginx-old   4         4         4       13s
nginx-new   1         1         1       4s
---
nginx-old   4         4         4       14s
nginx-new   1         1         1       5s
---
nginx-old   3         3         3       15s
nginx-new   2         2         1       6s
---
nginx-old   3         3         3       16s
nginx-new   2         2         2       7s
---
nginx-old   3         3         3       17s
nginx-new   2         2         2       8s
---
nginx-old   2         2         2       18s
nginx-new   3         3         2       9s
---
nginx-old   2         2         2       19s
nginx-new   3         3         2       10s
---
nginx-old   2         2         2       21s
nginx-new   3         3         3       12s
---
nginx-old   2         2         2       22s
nginx-new   3         3         3       13s
---
nginx-old   2         2         2       23s
nginx-new   3         3         3       14s
---
nginx-old   1         1         1       24s
nginx-new   4         4         3       15s
---
nginx-old   1         1         1       25s
nginx-new   4         4         3       16s
---
nginx-old   1         1         1       26s
nginx-new   4         4         4       17s
---
nginx-old   1         1         1       27s
nginx-new   4         4         4       18s
---
nginx-old   1         1         1       29s
nginx-new   4         4         4       20s
---
nginx-old   0         0         0       29s
nginx-new   4         4         4       20s

Again, the total number of ready pods never fell below four.

Notice how each new replica is ready for three seconds before the desired and current counters increment. The nginx-new transitions, denoted (desired, current, ready), are:

(1, 1, 1) at 2s
(2, 2, 1) at 6s
(2, 2, 2) at 7s
(3, 3, 2) at 9s
(3, 3, 3) at 12s
(4, 4, 3) at 15s
(4, 4, 4) at 17s

This isn’t a perfectly uniform cadence – we’re sampling via bash script – but it demonstrates that we have in fact slowed down the introduction of new pods.

Crucially, each new replica keeps passing its probes, and therefore keeps receiving traffic, throughout its warmup period. This would not be the case if we simply increased the probes’ initialDelaySeconds, which would only postpone the first probe.

We don’t want this setup to be a bottleneck for new releases, so we should use metrics to select the smallest satisfactory minReadySeconds value.
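
One way to reason about that trade-off is a back-of-the-envelope lower bound on rollout time. With maxUnavailable set to 0, at most maxSurge new pods can be waiting out minReadySeconds at once, so the rollout needs at least ceil(replicas / maxSurge) × minReadySeconds seconds, plus whatever time pods take to start and pass their probes. The function below is a sketch of that estimate, not a Kubernetes API. The 30-second figure reuses the warmup time mentioned earlier, and the 40-replica case is hypothetical.

```python
def min_rollout_seconds(replicas, max_surge, min_ready_seconds):
    """Lower bound on rollout time when maxUnavailable is 0."""
    if max_surge < 1:
        raise ValueError("max_surge must be at least 1 when maxUnavailable is 0")
    batches = -(-replicas // max_surge)  # ceiling division
    return batches * min_ready_seconds


print(min_rollout_seconds(4, 1, 3))    # 12
print(min_rollout_seconds(4, 1, 30))   # 120
print(min_rollout_seconds(40, 2, 30))  # 600
```

The Round 3 run above took about 20 seconds against a bound of 12; the remainder is presumably time spent starting pods and passing probes.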

Conclusion

This post demonstrates how we can use a combination of standard Kubernetes Deployment settings to solve a subtle problem with rolling restarts. This is just one of several interesting application lifecycle edge cases we must consider as we increase traffic to an application in Kubernetes. As always, I hope this post will save someone a bit of time learning and debugging, or maybe even help anticipate and prevent a looming failure.

¹ A Kubernetes Deployment is just a set of identical pods, referred to as replicas. I’ve most commonly used deployments to run load-balanced HTTP web services.
² A Kubernetes ReplicaSet is the abstraction that maintains a set of replicas within the Deployment.
³ For a more thorough look at probes, see Colin Breck’s post on startup probes and his three-part series on liveness and readiness probes.
⁴ Baeldung has a nice article on the topic of JVM warmup.
⁵ You can get into some expert-level games to minimize warmup. This is most important with very short-lived apps (e.g., on AWS Lambda). This particular app regularly runs for hours or days, so this type of optimization is not worth the effort.
⁶ Code for this example.
⁷ A Kubernetes Resource Quota gives us a way to limit the resources allocated to each namespace.
