Understanding Horizontal Pod Autoscaling

Autoscaling is an important aspect of running applications on Kubernetes at scale. Not only does it ensure your applications smoothly scale out with increasing load, it also allows better resource utilization and cost optimization.

In this post we’re going to look at Horizontal Pod Autoscaling (HPA) in detail. We’ll see how it works under the hood and then how application administrators can use it to scale their workloads seamlessly.

Primer on Kubernetes Autoscaling

There are three major autoscaling mechanisms available in Kubernetes:

  • Horizontal Pod Autoscaling: Add more pods to the application for horizontal spreading of workload.
  • Vertical Pod Autoscaling: Add more CPU / Memory to the existing pod, so it can handle higher load.
  • Cluster Autoscaler: Add more nodes to the existing cluster.

Which approach to choose generally depends on the application being scaled and on other environmental factors.

For example, a stateless application like Nginx may be better off scaling horizontally. Since Nginx is stateless, there is not much Nginx-specific data (apart from a static config file) that has to be available on a new node before Nginx can be scheduled there. It therefore takes minimal effort to scale Nginx pods out as load increases and to scale them back in when it drops.

What is HPA

As per the official Kubernetes documentation,

The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.

For administrators, this means an automated mechanism that keeps watching certain metrics from an application’s pods and, based on a threshold, triggers an increase or decrease in the total number of pods.

Note that HPA is available only for objects that can be scaled, for example ReplicaSets, Deployments, StatefulSets. HPA is not applicable to Kubernetes objects that can’t be scaled, like DaemonSets.

HPA Metrics

To get a better understanding of HPA, it is important to understand the Kubernetes metrics landscape. From an HPA perspective, there are two API endpoints of interest:

  • metrics.k8s.io: This API is served by metrics-server, which is generally launched as a cluster addon. It exposes resource data, i.e. CPU and memory metrics. This data is then used to make decisions about changes in the number of pod replicas.
  • custom.metrics.k8s.io: The default metrics from metrics-server are limited to CPU and memory. In many cases, scaling on CPU and memory alone may not be enough. Administrators may want to use application-specific metrics, for example the number of concurrent requests, or some internal metric exposed via the application’s Prometheus endpoint. Such metrics are called custom metrics and are available via the custom.metrics.k8s.io API. The custom metrics API opens this extensibility to external providers: any provider can develop an adapter API server that serves data for arbitrary metrics. Here is the list of known solutions.
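To make the resource metrics data more concrete, here is a minimal Python sketch that reads per-pod CPU usage out of a response shaped like a metrics.k8s.io/v1beta1 PodMetricsList. The payload is a hand-written, hypothetical sample (pod names and usage numbers are made up), and the helper names cpu_to_millicores and pod_cpu_millicores are illustrative, not part of any Kubernetes client library. Only the "n", "u" and "m" CPU suffixes and plain core counts are handled.

```python
import json
from decimal import Decimal

# Hypothetical sample shaped like a PodMetricsList from metrics.k8s.io/v1beta1.
SAMPLE_RESPONSE = """
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {"metadata": {"name": "sample-app-1", "namespace": "default"},
     "containers": [{"name": "app", "usage": {"cpu": "250m", "memory": "128Mi"}}]},
    {"metadata": {"name": "sample-app-2", "namespace": "default"},
     "containers": [{"name": "app", "usage": {"cpu": "150000000n", "memory": "256Mi"}}]}
  ]
}
"""

# Multipliers that convert a suffixed CPU quantity to millicores.
_TO_MILLICORES = {"n": Decimal("0.000001"), "u": Decimal("0.001"), "m": Decimal(1)}


def cpu_to_millicores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '250m', '150000000n' or '2' to millicores."""
    suffix = quantity[-1]
    if suffix in _TO_MILLICORES:
        return Decimal(quantity[:-1]) * _TO_MILLICORES[suffix]
    # No suffix: the quantity is expressed in whole (or fractional) cores.
    return Decimal(quantity) * 1000


def pod_cpu_millicores(pod_metrics_list: dict) -> dict:
    """Sum the CPU usage of all containers in each pod, in millicores."""
    return {
        item["metadata"]["name"]: sum(
            cpu_to_millicores(c["usage"]["cpu"]) for c in item["containers"]
        )
        for item in pod_metrics_list["items"]
    }


usage = pod_cpu_millicores(json.loads(SAMPLE_RESPONSE))
for pod, millicores in usage.items():
    print(f"{pod}: {millicores:.0f}m")
```

On a cluster with metrics-server installed, a real payload of this kind can be fetched with kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" and passed to the same function.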

HPA Algorithm

At a high level, HPA uses the ratio of the current metric value to the desired metric value to calculate the desired number of replicas. For example, if current memory usage is 500 MiB and the target value is 1000 MiB, HPA will try to halve the number of replicas, based on the formula

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

==> desiredReplicas = ceil[currentReplicas * (500/1000)]
==> desiredReplicas = ceil[currentReplicas * 0.5]

HPA also accepts averaged targets: targetAverageValue and targetAverageUtilization in autoscaling/v2beta1, or averageValue and averageUtilization under target in autoscaling/v2beta2. In this case, the currentMetricValue is computed by taking the average of the given metric across all Pods in the HPA's scale target.
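The formula and the averaging step can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name desired_replicas and its parameters are made up for this example, and the tolerance argument reflects the upstream documentation's note that scaling is skipped when the ratio is close enough to 1.0 (0.1 by default). Pod readiness, missing metrics and stabilization windows are ignored.

```python
import math


def desired_replicas(current_replicas, pod_metric_values, target_average_value,
                     tolerance=0.1):
    """Apply desiredReplicas = ceil[currentReplicas * (current / desired)].

    pod_metric_values holds one reading per pod; the current metric value
    is their average, as with an averaged target.
    """
    current_average = sum(pod_metric_values) / len(pod_metric_values)
    ratio = current_average / target_average_value
    # Skip scaling when the ratio is within the tolerance band around 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# The example from the text: 500 MiB used per pod against a 1000 MiB target.
print(desired_replicas(4, [500, 500, 500, 500], 1000))  # → 2

# Pods averaging twice the target double the replica count.
print(desired_replicas(3, [90, 110, 100], 50))  # → 6

# A 6% overshoot falls inside the tolerance, so nothing changes.
print(desired_replicas(2, [52, 54], 50))  # → 2
```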

HPA in Practice

HPA is implemented as a native Kubernetes resource. It can be created and deleted using kubectl or via a YAML specification. Here is a sample HPA spec:

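# autoscaling/v2 is the stable successor of this API on Kubernetes 1.23 and later.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: default
spec:
  # The workload whose replica count is managed by this HPA.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Average CPU utilization across pods, as a percentage of requests.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  # Custom per-pod metric, averaged across pods.
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  # Custom metric describing another object in the same namespace.
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1beta1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k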

Let's understand the various entries under the metrics section.

  • Resource Metric (Standard): These metrics correspond to the resources declared under resources in the container spec, which means CPU and memory. They do not change names from cluster to cluster and should always be available, as long as the metrics.k8s.io API is available.
  • Pod Metric (Custom): These metrics describe Pods, and are averaged together across Pods and compared with a target value to determine the replica count. They work much like resource metrics, except that they only support a target type of AverageValue.
  • Object Metric (Custom): These metrics describe a different object in the same namespace, instead of describing Pods. The metrics are not necessarily fetched from the object; they only describe it. Object metrics support target types of both Value and AverageValue. With Value, the target is compared directly to the metric returned from the API. With AverageValue, the value returned from the custom metrics API is divided by the number of Pods before being compared to the target.
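As a rough illustration of how the Resource, Pods and Object entries in the sample spec turn into a replica count, the Python sketch below computes one proposal per metric and then keeps the largest one, which is how the upstream documentation describes the handling of multiple metrics. All function names and input numbers are hypothetical, and the utilization calculation is simplified to total usage over total requests.

```python
import math


def from_resource_utilization(replicas, usage_millicores, request_millicores,
                              target_percent):
    """Resource metric, Utilization target: usage as a percentage of requests."""
    utilization = 100 * sum(usage_millicores) / sum(request_millicores)
    return math.ceil(replicas * utilization / target_percent)


def from_pods_average(replicas, per_pod_values, target_average_value):
    """Pods metric, AverageValue target: per-pod readings averaged across pods."""
    average = sum(per_pod_values) / len(per_pod_values)
    return math.ceil(replicas * average / target_average_value)


def from_object_value(replicas, metric_value, target_value):
    """Object metric, Value target: the metric is compared to the target directly."""
    return math.ceil(replicas * metric_value / target_value)


def from_object_average(metric_value, target_average_value):
    """Object metric, AverageValue target: the metric is divided by the pod count
    before comparison, so the replica count cancels out of the formula."""
    return math.ceil(metric_value / target_average_value)


def recommend(proposals, min_replicas, max_replicas):
    """Take the largest proposal, then clamp it to [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, max(proposals)))


replicas = 4
proposals = [
    # cpu: 300m used of 500m requested per pod is 60%, against a 50% target
    from_resource_utilization(replicas, [300] * 4, [500] * 4, 50),
    # packets-per-second: 1500 per pod against averageValue 1k
    from_pods_average(replicas, [1500] * 4, 1000),
    # requests-per-second on the Ingress: 20000 against value 10k
    from_object_value(replicas, 20000, 10000),
]
print(proposals)                    # → [5, 6, 8]
print(recommend(proposals, 1, 10))  # → 8
```

Taking the maximum means the metric under the most pressure drives the scale-out, while minReplicas and maxReplicas bound the result.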

Best Practices

When running production workloads with autoscaling enabled, there are a few best practices to keep in mind.

  • Install a metrics server: HPA reads resource metrics through the metrics.k8s.io API, so a metrics server must be installed in the cluster for CPU- and memory-based autoscaling to work.
  • Define pod requests and limits: The Kubernetes scheduler places pods according to their resource requests, and HPA computes utilization as a percentage of those requests, so a utilization target cannot be evaluated for pods that have no requests set.
  • Specify PodDisruptionBudgets for mission-critical applications: A PodDisruptionBudget limits how many pods of an application can be down at once during voluntary disruptions such as node drains.
  • Don’t mix HPA with VPA: Horizontal Pod Autoscaler and Vertical Pod Autoscaler should not act on the same workload using the same metric (CPU or memory). It is recommended to run Vertical Pod Autoscaler first, to get recommended values for CPU and memory, and then to run HPA to handle traffic spikes.
  • Right-size resource requests: Resource requests should be close to the average usage of the pods.
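The advice on requests matters because a Utilization target is evaluated relative to the pods' requests. The short Python sketch below shows how the same real usage leads to opposite scaling decisions depending on the request size. The helper name and all numbers are hypothetical, and the calculation is a simplification of the HPA formula.

```python
import math


def replicas_for_cpu(replicas, usage_millicores, request_millicores, target_percent):
    """Desired replicas when every pod has the same usage and the same request."""
    utilization = 100 * usage_millicores / request_millicores
    return math.ceil(replicas * utilization / target_percent)


# Each of 4 pods really uses 400m of CPU; the HPA target is 50% utilization.
# Request close to real usage: 80% utilization, so HPA scales out.
print(replicas_for_cpu(4, 400, 500, 50))   # → 7

# Request far above real usage: 20% utilization, so HPA scales in,
# even though the pods are exactly as busy as before.
print(replicas_for_cpu(4, 400, 2000, 50))  # → 2
```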


The Horizontal Pod Autoscaler is the most widely used and most stable mechanism in Kubernetes for scaling workloads horizontally. However, it may not be suitable for every type of workload. HPA works best when combined with Cluster Autoscaler, so that the cluster’s compute resources scale in tandem with the pods.