Autoscaling is an important aspect of running applications on Kubernetes at scale. Not only does it ensure your applications smoothly scale out with increasing load, it also allows better resource utilization and cost optimization.
In this post, we’re going to look at Horizontal Pod Autoscaling (HPA) in detail. We’ll see how it works under the hood and then how application administrators can leverage it to scale their workloads seamlessly.
Primer on Kubernetes Autoscaling
There are three major autoscaling mechanisms available in Kubernetes:
- Horizontal Pod Autoscaling: Add more pods to the application to spread the workload horizontally.
- Vertical Pod Autoscaling: Add more CPU or memory to existing pods so they can handle higher load.
- Cluster Autoscaler: Add more nodes to the existing cluster.
The choice of one approach over the others generally depends on the application being scaled and other environmental factors.
For example, a stateless application like Nginx may be better off scaling horizontally. Because Nginx is stateless, there is little Nginx-specific data (other than a static config file) that must be available on a new node before an Nginx pod can be scheduled there. This makes it easy to scale Nginx pods out as load increases and back in when it drops.
What is HPA
As per the official Kubernetes documentation,
The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.
For administrators, this means an automated mechanism that continuously watches certain metrics from an application’s pods and, based on a target threshold, increases or decreases the total number of pods.
Note that HPA is available only for objects that can be scaled, for example ReplicaSets, Deployments, StatefulSets. HPA is not applicable to Kubernetes objects that can’t be scaled, like DaemonSets.
To get a better understanding of HPA, it is important to understand the Kubernetes metrics landscape. From an HPA perspective, there are two API endpoints of interest:
- metrics.k8s.io: This API is served by metrics-server, which is generally deployed as a cluster add-on. It exposes resource metrics, i.e. CPU and memory usage, which HPA then uses to decide on changes to the number of pod replicas.
- custom.metrics.k8s.io: The metrics from metrics-server are limited to CPU and memory, and in many cases scaling on CPU and memory alone is not enough. Administrators may want to scale on application-specific metrics, for example the number of concurrent requests, or an internal metric exposed via the application’s Prometheus endpoint. Such metrics are called custom metrics and are available via the custom.metrics.k8s.io API. This API makes autoscaling extensible to external providers: any provider can develop an adapter API server that serves data for arbitrary metrics. Here is the list of known solutions.
At a high level, HPA uses the ratio of the current metric value to the desired metric value to calculate the desired number of replicas. For example, if current memory usage is 500 MiB and the target value is 1000 MiB, HPA will try to halve the number of replicas, based on the formula:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
                = ceil[currentReplicas * (500 / 1000)]
                = ceil[currentReplicas * 0.5]
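To make the formula concrete, here is a minimal Python sketch of the single-metric rule. The function name desired_replicas and its default bounds are hypothetical. Two details come from our understanding of the upstream controller rather than from this post: the clamping to minReplicas/maxReplicas, and the 0.1 tolerance, which is the kube-controller-manager default set by --horizontal-pod-autoscaler-tolerance. Pod readiness, missing metrics and scale-down stabilization are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Hypothetical sketch of HPA's single-metric scaling rule.

    Ignores pod readiness, missing metrics and stabilization windows.
    """
    ratio = current_metric / desired_metric
    if abs(1.0 - ratio) <= tolerance:
        # Ratio is close enough to 1.0: keep the current replica count.
        proposed = current_replicas
    else:
        # desiredReplicas = ceil[currentReplicas * (current / desired)]
        proposed = math.ceil(current_replicas * ratio)
    # Respect the minReplicas / maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, proposed))


# The memory example from the text: 500 MiB used vs. a 1000 MiB target.
print(desired_replicas(4, 500, 1000))  # 4 replicas -> 2
```

The tolerance check is what stops HPA from flapping when the metric hovers around the target. For example, a ratio of 1.05 leaves the replica count unchanged.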
HPA also accepts average-based targets such as averageUtilization (targetAverageUtilization in the older autoscaling/v2beta1 API). In this case, the currentMetricValue is computed by taking the average of the given metric across all Pods in the HPA's scale target.
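Below is a sketch of how an average utilization value can be derived from per-pod readings. The PodSample type and the numbers are hypothetical. As we understand the upstream controller, it divides the summed usage by the summed requests across pods, so pods with larger requests weigh more. When all pods have equal requests, this matches a simple average of per-pod utilization.

```python
import math
from dataclasses import dataclass


@dataclass
class PodSample:
    """Hypothetical per-pod reading, both values in CPU millicores."""
    usage_m: float
    request_m: float


def average_utilization(pods: list[PodSample]) -> float:
    """Utilization in percent: total usage divided by total requests."""
    total_usage = sum(p.usage_m for p in pods)
    total_request = sum(p.request_m for p in pods)
    return 100.0 * total_usage / total_request


pods = [PodSample(300, 500), PodSample(400, 500), PodSample(200, 500)]
current = average_utilization(pods)   # 900m / 1500m -> 60.0 %
target = 50                           # averageUtilization: 50
replicas = math.ceil(len(pods) * current / target)
print(current, replicas)              # 60.0 4
```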
HPA in Practice
HPA is implemented as a native Kubernetes resource. It can be created or deleted using kubectl or via a YAML specification. Here is a sample HPA spec:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
```
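The spec above defines three metrics. According to the Kubernetes documentation, the controller computes a proposed replica count for each metric and then uses the largest. The sketch below illustrates this with made-up readings. The combine function and its inputs are hypothetical, and tolerance and readiness handling are omitted.

```python
import math


def combine(current_replicas: int,
            readings: list[tuple[float, float]],
            min_replicas: int = 1,
            max_replicas: int = 10) -> int:
    """Hypothetical sketch: one (current, target) pair per metric in the spec."""
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in readings
    ]
    # The largest proposal wins, then the min/max bounds are applied.
    return max(min_replicas, min(max_replicas, max(proposals)))


readings = [
    (40, 50),        # cpu: 40% average utilization vs. target 50%
    (1500, 1000),    # packets-per-second: 1.5k average vs. target 1k
    (8000, 10000),   # requests-per-second on the Ingress: 8k vs. target 10k
]
print(combine(2, readings))  # proposals 2, 3, 2 -> 3
```

Taking the maximum means a single overloaded signal is enough to scale out, even when the other metrics look idle.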
Let's understand the various entries under the metrics section.
- Resource Metric (Standard): This type of metric refers to resources specified in a container's resources spec. It covers only CPU and memory, as these are the resource metrics reported through the metrics.k8s.io API. These resource names do not change from cluster to cluster and should always be available, as long as the metrics.k8s.io API is available.
- Pod Metric (Custom): These metrics describe Pods, and are averaged together across Pods and compared with a target value to determine the replica count. They work much like resource metrics, except that they only support a target type of AverageValue.
- Object Metric (Custom): These metrics describe a different object in the same namespace, instead of describing Pods. The metrics are not necessarily fetched from the object; they only describe it. Object metrics support target types of both Value and AverageValue. With Value, the target is compared directly to the returned metric from the API. With AverageValue, the value returned from the custom metrics API is divided by the number of Pods before being compared to the target.
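The difference between the two Object target types can be sketched in a few lines. The sketch assumes a hypothetical requests-per-second reading of 12k for the Ingress, 4 current pods and a 10k target. The function object_metric_replicas is illustrative. Its AverageValue branch follows the description above: divide by the pod count, then compare per pod. Algebraically, that reduces to ceil(metric / target).

```python
import math


def object_metric_replicas(current_replicas: int, metric_value: float,
                           target: float, target_type: str) -> int:
    """Hypothetical sketch of how an Object metric target is evaluated."""
    if target_type == "Value":
        # Compare the metric directly to the target, then scale the
        # current replica count by that ratio.
        return math.ceil(current_replicas * metric_value / target)
    if target_type == "AverageValue":
        # Per-pod value = metric_value / current_replicas; comparing it to
        # the target and multiplying by current_replicas simplifies to:
        return math.ceil(metric_value / target)
    raise ValueError(f"unsupported target type: {target_type}")


print(object_metric_replicas(4, 12000, 10000, "Value"))         # 5
print(object_metric_replicas(4, 12000, 10000, "AverageValue"))  # 2
```

With Value, 12k against a 10k target means the Ingress as a whole is 20% over, so the current 4 pods grow to 5. With AverageValue, 10k is a per-pod budget, so 12k in total needs only 2 pods.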
When running production workloads with autoscaling enabled, there are a few best practices to keep in mind.
- Install a metrics server: HPA relies on metrics-server (or another implementation of the metrics.k8s.io API) for CPU- and memory-based autoscaling to work.
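- Define pod requests and limits: The Kubernetes scheduler places pods based on their resource requests, and HPA computes CPU and memory utilization as a percentage of those requests, so utilization targets cannot be evaluated for pods whose containers lack requests.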
- Specify PodDisruptionBudgets for mission-critical applications: A PodDisruptionBudget limits how many pods of an application can be down at the same time during voluntary disruptions, such as node drains when the Cluster Autoscaler removes nodes.
- Don’t mix HPA with VPA on the same metrics: The Horizontal Pod Autoscaler and Vertical Pod Autoscaler should not both scale the same workload based on CPU or memory. A common approach is to run the Vertical Pod Autoscaler first to get recommended CPU and memory values, and then run HPA to handle traffic spikes.
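- Keep resource requests close to actual usage: Resource requests should be close to the average usage of the pods, since utilization targets are measured against requests.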
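Because utilization targets are measured against requests, the same real load can produce very different scaling decisions. The small sketch below uses hypothetical numbers: 3 pods each using 400m of CPU, a 50% averageUtilization target, and two candidate request sizes. The helper names are illustrative, and the replica rule is the ceil(currentReplicas * current / target) formula without tolerance or min/max bounds.

```python
import math


def utilization_pct(usage_m: float, request_m: float) -> float:
    """CPU utilization as HPA sees it: usage as a percentage of the request."""
    return 100.0 * usage_m / request_m


def desired(replicas: int, util: float, target: float = 50) -> int:
    # ceil(currentReplicas * current / target), no tolerance or bounds.
    return math.ceil(replicas * util / target)


for request_m in (500, 2000):
    util = utilization_pct(400, request_m)
    print(f"request={request_m}m utilization={util:.0f}% "
          f"replicas={desired(3, util)}")
# request=500m utilization=80% replicas=5
# request=2000m utilization=20% replicas=2
```

With the oversized 2000m request, HPA sees the pods as mostly idle and proposes scaling in, even though each pod is doing exactly the same work.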
The Horizontal Pod Autoscaler is the most widely used and most stable mechanism available in Kubernetes for scaling workloads horizontally. However, it may not be suitable for every type of workload. HPA works best when combined with the Cluster Autoscaler, so that compute capacity scales in tandem with the pods in the cluster.