Advanced Features of Kubernetes’ Horizontal Pod Autoscaler

Kubernetes’ Horizontal Pod Autoscaler has features you probably don’t know about. Here’s how to use them to your advantage.

Published in Better Programming · 6 min read · Jul 4, 2022

Most people who use Kubernetes know that you can scale applications with the Horizontal Pod Autoscaler (HPA) based on their CPU or memory usage. There are, however, many more HPA features that you can use to customize the scaling behaviour of your application, such as scaling on custom application metrics or external metrics, as well as alpha/beta features like “scaling to zero” or container metrics scaling.

In this article we will explore all of these options, so that we can take full advantage of the available HPA features and get a head start on the ones coming in future Kubernetes releases.

Setup

Before we get started with scaling, we first need a testing environment. For that we will use a KinD (Kubernetes in Docker) cluster defined by the following YAML:

This manifest configures the KinD cluster with 1 control plane node and 3 workers. It also enables a couple of feature gates related to autoscaling, which will later allow us to use some alpha/beta features of HPA. To create a cluster with the above configuration, you can run:
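The cluster manifest itself isn't reproduced above. As a rough sketch, a KinD config matching that description (1 control plane node, 3 workers, autoscaling feature gates) could look like the following. The exact set of gates is an assumption based on the features covered later in this article, and which gates are accepted depends on your Kubernetes version, so check them against the node image you use.

```yaml
# Sketch of a KinD cluster config; not the author's original file.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Feature gates are applied to all Kubernetes components in the cluster.
# Assumption: these three match the features discussed in this article.
featureGates:
  HPAScaleToZero: true
  HPAContainerMetrics: true
  LogarithmicScaleDown: true
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
```

Saved as, say, cluster.yaml (a hypothetical filename), a file like this can be passed to kind create cluster --config cluster.yaml.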

Apart from the cluster, we will also need an application to scale. For that we will use the resource consumer tool and its image, which are used in Kubernetes end-to-end testing. To deploy it, you can run:

This application is very handy in this situation, as it allows us to simulate CPU and memory consumption of a Pod. It can also expose custom metrics which are needed for scaling based on custom/external metrics. To test this out we can run:

Next, we need to deploy the services that collect the metrics on which we will later scale our test application. The first of these is the Kubernetes metrics-server, which is usually available in a cluster by default. That's not the case in KinD, so to deploy it we need to run:

metrics-server allows us to monitor basic metrics such as CPU and memory usage, but we also want to implement scaling based on custom metrics, such as the ones exposed by an application on its /metrics endpoint, or even external ones like the depth of a queue running outside of the cluster. For these we will need:

Prometheus Operator to gather the custom/external metrics.
ServiceMonitor object(s) to tell Prometheus how to scrape our application’s metrics.
Prometheus adapter to get custom/external metrics from Prometheus instance into Kubernetes API.

You can refer to the end-to-end walkthrough for more details of the setup.

The above requires a lot of setup, so for the purpose of this article and for your convenience, I’ve made a script and a set of manifests that you can use to spin up a KinD cluster along with all the required components. All you need to do is run the setup.sh script from this repository.

After running the script, we can verify that everything is ready using the following commands:

More helpful commands can be found in the output of the above-mentioned script or in the repository README.

Basic Autoscaling

Now that we have our infrastructure up and running, we can start scaling the test application. The simplest way to do so is to create an HPA using a command like kubectl autoscale deploy resource-consumer --min=1 --max=5 --cpu-percent=75. This, however, creates an HPA with an apiVersion of autoscaling/v1, which lacks most of the features.

So, instead, we will create the HPA with YAML, specifying autoscaling/v2 as the apiVersion:
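A minimal sketch of such a manifest is shown below. The Deployment name resource-consumer, the replica bounds, and the 75% CPU target mirror the kubectl autoscale command above; the HPA name and the 200Mi memory target are assumptions for illustration only.

```yaml
# Sketch of a basic autoscaling/v2 HPA; values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer        # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  minReplicas: 1
  maxReplicas: 5
  metrics:
  # Scale when average CPU utilization (relative to requests) exceeds 75%.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  # Scale when average memory usage per Pod exceeds 200Mi (assumed threshold).
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 200Mi
```

When several metrics are listed, the HPA computes a desired replica count for each and uses the largest one.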

The above HPA will use basic metrics gathered from application Pod(s) by metrics-server. To test out the scaling we can simulate heavy memory usage:

Custom Metrics

Scaling based on CPU and memory usage is often enough, but we’re after the advanced scaling options. The first of them is scaling using custom metrics exposed by an application:

This HPA is configured to scale the application based on the value of custom_metric, which Prometheus scrapes from the application's /metrics endpoint. It will scale the application up if the average value of the specified metric across all Pods (.target.type: AverageValue) goes over 100.

The above uses a Pod metric to scale, but it’s possible to specify any other object that has a metric attached to it:

This snippet achieves the same as the previous one, this time using a Service instead of a Pod as the source of the metric. It also shows that you can compare the metric directly against the scaling threshold by setting .target.type to Value instead of AverageValue.
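For reference, an HPA of the kind described here could be sketched as follows. The metric name custom_metric and the threshold of 100 come from the text; the HPA name, target Deployment, and replica bounds are assumptions.

```yaml
# Sketch of an HPA scaling on a per-Pod custom metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-custom   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer        # assumed target
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: custom_metric
      target:
        # Averaged across all Pods of the target before comparison.
        type: AverageValue
        averageValue: "100"
```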
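A metrics entry using the Object type might look roughly like this. It is only the metrics part of an HPA spec, and the Service name is an assumption; the metric name and the threshold follow the earlier example.

```yaml
# Sketch of the spec.metrics section of an HPA using an Object metric.
metrics:
- type: Object
  object:
    metric:
      name: custom_metric
    # The Kubernetes object the metric is attached to.
    describedObject:
      apiVersion: v1
      kind: Service
      name: resource-consumer   # assumed Service name
    target:
      # Value compares the metric directly, without averaging over Pods.
      type: Value
      value: "100"
```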

To figure out which objects expose metrics that you can use in scaling, you can traverse the API using kubectl get --raw. For example, to look up custom_metric for either a Pod or a Service, you can use:

Also, to help you troubleshoot, the HPA object provides a status stanza that shows whether the applied metric was recognized:
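The original output isn't included here. To give an idea of what to look for, the status of an autoscaling/v2 HPA (as printed by kubectl get hpa with -o yaml) has roughly the shape below. This is an illustrative sketch with placeholder values, not captured output, and several fields such as timestamps and messages are omitted.

```yaml
# Illustrative shape of an HPA status; all values are placeholders.
status:
  currentReplicas: 1
  desiredReplicas: 1
  conditions:
  - type: AbleToScale
    status: "True"
    reason: ReadyForNewScale
  # If the metric cannot be fetched, this condition turns "False"
  # and the reason and message explain why.
  - type: ScalingActive
    status: "True"
    reason: ValidMetricFound
  - type: ScalingLimited
    status: "False"
    reason: DesiredWithinRange
  currentMetrics:
  - type: Pods
    pods:
      metric:
        name: custom_metric
      current:
        averageValue: "50"
```

The ScalingActive condition is usually the first place to check when an HPA does not react to a metric.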

Finally, to test out the behavior of the above HPA, we can bump the metric exposed by the application and see how the application scales up:

External Metrics

To show the full potential of HPA, we will also try scaling an application based on an external metric. This would require us to scrape metrics from an external system running outside of the cluster, such as Kafka or PostgreSQL. We don’t have that available, so instead we’ve configured Prometheus Adapter to treat certain metrics as external. The configuration that does this can be found here. All you need to know, though, is that with this test cluster, any application metric prefixed with external will go to the external metrics API. To test this out, we bump up such a metric and check whether the API gets populated:

To then scale our deployment based on this metric, we can use the following HPA:
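As a hedged sketch, an HPA using an External metric could look like the one below. The metric name external_queue_messages_ready is hypothetical (it only follows the "prefixed with external" convention described above), and the HPA name, target, bounds, and threshold are likewise assumptions.

```yaml
# Sketch of an HPA scaling on an external metric; names are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-external   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer          # assumed target
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: external_queue_messages_ready   # hypothetical metric name
      target:
        # External metrics accept Value or AverageValue targets.
        type: Value
        value: "100"
```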

HPAScaleToZero

Now that we’ve gone through all the well-known features of HPA, let’s also take a look at the alpha/beta ones that we enabled using feature gates. The first one is HPAScaleToZero.

As the name suggests, this will allow you to set minReplicas in HPA to zero, effectively turning the service off if there's no traffic. This can be useful for "bursty" workloads, for example when your application receives data from an external queue. In this use case the application can be safely scaled to zero when there are no messages waiting to be processed.

With the feature gate enabled we can simply run:

This sets the minimum replicas of the previously shown HPA to zero.

Be aware, though, that this will only work for metrics of type External or Object.
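The command itself isn't shown above. One way to express the same change is a small patch file like the sketch below, applied for example with kubectl patch hpa and its --patch-file option, or simply merged into the HPA manifest by hand.

```yaml
# Sketch of a patch that lowers the HPA's floor to zero replicas.
# Accepted only with the HPAScaleToZero feature gate enabled and
# at least one External or Object metric configured on the HPA.
spec:
  minReplicas: 0
```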

HPAContainerMetrics

Another feature gate that we can make use of is HPAContainerMetrics, which allows us to use metrics of type: ContainerResource:

This makes it possible to scale based on the resource utilization of individual containers rather than the whole Pod. This can be useful if you have a multi-container Pod with an application container and a sidecar, and you want to ignore the sidecar and scale the deployment based only on the application container.
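A metrics entry of this type might be sketched as follows. It is only the metrics part of an HPA spec; the container name and the 75% target are assumptions for illustration.

```yaml
# Sketch of the spec.metrics section of an HPA using a ContainerResource metric.
metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    # Only this container's usage is considered; sidecars are ignored.
    container: resource-consumer   # assumed container name
    target:
      type: Utilization
      averageUtilization: 75
```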

You can also view the breakdown of Pod/container metrics by running the following command:

LogarithmicScaleDown

Last but not least is the LogarithmicScaleDown feature flag.

Without this feature, the Pod that’s been running for the least amount of time gets deleted first during downscaling. That’s not always ideal, though, as it can create an imbalance in replica distribution, because newer Pods tend to serve less traffic than the older ones.

With this feature flag enabled, a semi-random selection of Pods is used instead when choosing the Pod to be deleted.

For a full rationale and algorithm details see KEP-2189.

Closing Thoughts

In this article, I tried to cover most of the things you can do with Kubernetes HPA to scale your application. There are, however, many more tools and options for scaling applications running in Kubernetes, such as the vertical pod autoscaler, which can help keep Pod resource requests and limits up to date.

Another option would be the predictive HPA by Digital Ocean, which will try to predict how many replicas a resource should have.

Finally, autoscaling doesn’t end with Pods. The next step after setting up Pod autoscaling is to also set up cluster autoscaling, to avoid running out of available resources in your whole cluster.
