Understanding Horizontal Pod Autoscaling

Autoscaling is an important aspect of running applications on Kubernetes at scale. Not only does it ensure that your applications scale out smoothly with increasing load, it also enables better resource utilization and cost optimization.
In this post we’re going to look at Horizontal Pod Autoscaling (HPA) in detail. We’ll see how it works under the hood and then understand how application administrators can leverage it to deploy seamless scaling.
Primer on Kubernetes Autoscaling
There are three major autoscaling mechanisms available in Kubernetes:
- Horizontal Pod Autoscaling: Add more pods to the application for horizontal spreading of workload.
- Vertical Pod Autoscaling: Add more CPU / Memory to the existing pod, so it can handle higher load.
- Cluster Autoscaler: Add more nodes to the existing cluster.
The choice of one approach over the others generally depends on the application that needs to be scaled and other environmental factors.
For example, a stateless application like Nginx may be better off scaling horizontally. Since Nginx is stateless, there is not much Nginx-specific data (apart from a static config file) that has to be available on new nodes before Nginx can be scheduled there. This makes it straightforward to scale Nginx pods out as load increases and scale them back in once it subsides.
What is HPA
As per the official Kubernetes documentation,
The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.
For administrators this means an automated mechanism that continuously watches certain metrics from an application’s pods and, based on a threshold, triggers an increase or decrease in the total number of pods.
Note that HPA is available only for objects that can be scaled, for example ReplicaSets, Deployments, StatefulSets. HPA is not applicable to Kubernetes objects that can’t be scaled, like DaemonSets.
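The quickest way to see this in action is the imperative `kubectl autoscale` command, which creates an HPA for an existing scalable object; the deployment name below is illustrative:

```bash
# Create an HPA targeting 50% CPU utilization, scaling between 1 and 10 replicas
kubectl autoscale deployment sample-app --cpu-percent=50 --min=1 --max=10

# Watch current vs. target utilization and the resulting replica count
kubectl get hpa sample-app
```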
HPA Metrics
To get a better understanding of HPA, it is important to understand the Kubernetes metrics landscape. From an HPA perspective, there are two API endpoints of interest:
- `metrics.k8s.io`: This API is served by metrics-server, which is generally launched as a cluster add-on. It exposes resource data, i.e. CPU and memory metrics. This data is then used to make decisions about changes in the pod replica count.
- `custom.metrics.k8s.io`: The default metrics from metrics-server are limited to CPU and memory. In many cases, CPU- and memory-based scaling alone is not enough. Administrators may want to use application-specific metrics, for example the number of concurrent requests, or some internal metric exposed via the application’s Prometheus endpoint. Such metrics are called custom metrics and are available via the `custom.metrics.k8s.io` API. This API provides extensibility to external providers: any provider can develop an adapter API server that serves data for some arbitrary metric, and the Kubernetes community maintains a list of known adapter implementations (see the queries after this list for how to inspect both APIs).
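Both endpoints can be inspected directly through the API server. Here is a quick sanity check, assuming metrics-server is installed and, for the second query, that some custom metrics adapter is running (`jq` is only used for readability):

```bash
# Resource metrics (CPU / memory) served by metrics-server
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .

# Discover which custom metrics the installed adapter exposes
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
```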
HPA Algorithm
At a high level, HPA uses the ratio of the current metric value to the desired metric value to calculate the expected number of replicas. For example, if the current memory utilization is 500 MiB and the target value is 1000 MiB, HPA will try to halve the number of replicas, based on the formula:
```
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
==> desiredReplicas = ceil[currentReplicas * (500 / 1000)]
==> desiredReplicas = ceil[currentReplicas * 0.5]
```
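As a concrete illustration, here is a minimal Go sketch of that formula. This is only the core ratio calculation; the real controller additionally applies a tolerance band, stabilization windows, and the min/max replica bounds, all of which are omitted here:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the core HPA formula:
// ceil(currentReplicas * currentMetricValue / desiredMetricValue).
func desiredReplicas(currentReplicas int32, currentValue, desiredValue float64) int32 {
	ratio := currentValue / desiredValue
	return int32(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	// The example from the text: 500 MiB usage against a 1000 MiB target
	// halves the replica count (4 -> 2).
	fmt.Println(desiredReplicas(4, 500, 1000)) // 2

	// Usage above the target scales out instead (4 -> 8).
	fmt.Println(desiredReplicas(4, 2000, 1000)) // 8
}
```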
HPA also accepts fields like `targetAverageValue` and `targetAverageUtilization`. In this case, the `currentMetricValue` is computed by taking the average of the given metric across all Pods in the HPA's scale target.
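Continuing the sketch above, the averaging step for such targets is simple; the per-pod values below are hypothetical memory readings in MiB:

```go
package main

import "fmt"

// averageMetric computes currentMetricValue for average-based targets:
// the mean of the given metric across all pods in the scale target.
func averageMetric(perPodValues []float64) float64 {
	if len(perPodValues) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range perPodValues {
		sum += v
	}
	return sum / float64(len(perPodValues))
}

func main() {
	// Three pods reporting 400, 500 and 600 MiB average to 500 MiB, which
	// then plays the role of currentMetricValue in the replica formula.
	fmt.Println(averageMetric([]float64{400, 500, 600})) // 500
}
```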
HPA in Practice
HPA is implemented as a native Kubernetes resource. It can be created and deleted using `kubectl` or via a YAML specification. Here is a sample HPA spec:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1beta1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
```
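Assuming the spec above is saved as hpa.yaml (the filename is illustrative), it can be applied and observed like any other Kubernetes resource:

```bash
kubectl apply -f hpa.yaml

# Current vs. target metrics and the replica count
kubectl get hpa sample-app

# Scaling events and conditions
kubectl describe hpa sample-app
```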
Let's understand the various entries under the metrics section.
- Resource Metric (Standard): This is the type of metric specified under the container spec of a resource. It covers only CPU utilization and memory, as these are the only supported fields under the `resources` spec of a Kubernetes object. These resources do not change names from cluster to cluster, and should always be available as long as the `metrics.k8s.io` API is available.
- Pod Metric (Custom): These metrics describe Pods, and are averaged together across Pods and compared with a target value to determine the replica count. They work much like resource metrics, except that they only support a `target` type of `AverageValue`.
- Object Metric (Custom): These metrics describe a different object in the same namespace, instead of describing Pods. The metrics are not necessarily fetched from the object; they only describe it. Object metrics support `target` types of both `Value` and `AverageValue`. With `Value`, the target is compared directly to the metric returned from the API. With `AverageValue`, the value returned from the custom metrics API is divided by the number of Pods before being compared to the target.
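To verify that a custom metric such as packets-per-second from the sample spec is actually being served, the custom metrics API can be queried directly. The exact URL path depends on the installed adapter, so treat this as an illustrative pattern rather than a guaranteed endpoint:

```bash
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/packets-per-second" | jq .
```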
Best Practices
When running production workloads with autoscaling enabled, there are a few best practices to keep in mind.
- Install a metrics server: Kubernetes requires a metrics server to be installed in order for resource-based autoscaling to work.
- Define pod requests and limits: The Kubernetes scheduler makes scheduling decisions according to the requests and limits set in the pod spec, and HPA utilization targets are computed relative to those requests (see the snippet after this list).
- Specify PodDisruptionBudgets for mission-critical applications: A PodDisruptionBudget prevents disruption of critical pods running in the Kubernetes cluster.
- Don’t mix HPA with VPA: The Horizontal Pod Autoscaler and the Vertical Pod Autoscaler should not be run together. It is recommended to run the Vertical Pod Autoscaler first, to get recommendations for proper CPU and memory values, and then run HPA to handle traffic spikes.
- Keep resource requests close to the average usage of the pods.
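As a minimal sketch of the requests-and-limits practice (the names, image, and values below are illustrative): HPA’s `Utilization` target type is calculated as a percentage of the container’s request, so meaningful requests are a prerequisite for meaningful utilization-based scaling.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
  - name: app
    image: nginx:1.25        # illustrative image
    resources:
      requests:
        cpu: 100m            # a 50% Utilization target corresponds to ~50m of usage
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```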
Conclusion
The Horizontal Pod Autoscaler is the most widely used and stable autoscaling mechanism available in Kubernetes for horizontally scaling workloads. However, it may not be suitable for every type of workload. HPA works best when combined with the Cluster Autoscaler, so that your compute resources scale in tandem with the pods within the cluster.