
Kubernetes Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization, memory usage, or custom metrics. This enables applications to handle increased load automatically and reduce costs during low traffic periods.

┌───────────────────────────────────────────────────────────────┐
│                        HPA Controller                         │
│                                                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │   Metrics    │───▶│   Decision   │───▶│    Scale     │     │
│  │   Server     │    │    Engine    │    │    Action    │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│         │                                       │             │
│         ▼                                       ▼             │
│  ┌──────────────┐                      ┌──────────────┐       │
│  │ Pod Metrics  │                      │  ReplicaSet  │       │
│  │  (CPU/Mem)   │                      │    Update    │       │
│  └──────────────┘                      └──────────────┘       │
└───────────────────────────────────────────────────────────────┘

The HPA controller:

  1. Polls the Metrics Server at regular intervals (default: every 15 seconds)
  2. Calculates the desired replica count from the observed metrics
  3. Updates the scale sub-resource of the target workload
  4. Repeats continuously, adjusting as load changes
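The calculation in step 2 follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal sketch of it (the 0.1 tolerance mirrors the controller's default `--horizontal-pod-autoscaler-tolerance`; the function name is illustrative, not a real client-library API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA controller would request for one metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 80% CPU against a 50% target -> ceil(4 * 1.6) = 7
print(desired_replicas(4, 80, 50))   # 7
# 6 pods averaging 20% against a 50% target -> ceil(6 * 0.4) = 3
print(desired_replicas(6, 20, 50))   # 3
```

Note that `ceil` biases toward over-provisioning: scaling up happens on any fractional excess, which is the safer direction for availability.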

The Metrics Server must be installed for HPA to work:

# Install using kubectl
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Or install via Helm
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm install metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --set args[0]="--kubelet-insecure-tls"  # only for clusters with self-signed kubelet certs (e.g. local dev)

Verify installation:

# Check metrics-server pod
kubectl get pods -n kube-system -l k8s-app=metrics-server
# Test metrics API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

Create an HPA for a Deployment:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

Key fields:

  • scaleTargetRef: The workload to scale (Deployment, ReplicaSet, or StatefulSet)
  • minReplicas / maxReplicas: Lower and upper bounds for the replica count
  • metrics: The metrics that drive scaling decisions
  • behavior: Optional tuning of scaling speed and stability

CPU utilization is the most common autoscaling metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

This keeps average CPU utilization across the pods near 50% of their CPU requests, adding replicas above that level and removing them below it.

Create and test:

# Create HPA
kubectl apply -f hpa.yaml
# View HPA status
kubectl get hpa
# Detailed information
kubectl describe hpa myapp-hpa

Scale based on memory usage:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

Note: Memory-based scaling assumes pods actually release memory as load drops; many runtimes hold on to allocated memory (or leak it), in which case a scale-down may never trigger.

Combine multiple metrics for better scaling decisions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 15
  metrics:
    # CPU metric
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Memory metric
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

No extra field is needed to combine them: the HPA always evaluates every listed metric and scales to whichever requires the highest replica count.
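The selection rule across multiple metrics can be sketched like this (values and function names are made up for illustration):

```python
import math

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Desired replicas for one metric: ceil(current * ratio)."""
    return math.ceil(current_replicas * current / target)

def combine(current_replicas: int, metrics) -> int:
    """metrics: list of (current_value, target_value) pairs.
    The HPA acts on whichever metric demands the most replicas."""
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# 5 replicas: CPU at 90% vs a 60% target, memory at 50% vs a 70% target.
# CPU asks for ceil(5 * 1.5) = 8, memory for ceil(5 * 0.72) = 4 -> scale to 8.
print(combine(5, [(90, 60), (50, 70)]))  # 8
```

Taking the maximum means a single saturated resource is enough to trigger a scale-up, while scale-down requires every metric to be comfortably under target.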

Scale based on application-specific metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Custom metric from Prometheus
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    # CPU as fallback
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

You need a custom metrics adapter (like Prometheus Adapter) for this to work.

Control how aggressively HPA scales:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: controlled-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    # Scale-down behavior
    scaleDown:
      # Wait 5 minutes before scaling down
      stabilizationWindowSeconds: 300
      policies:
        # Remove at most 10% of pods per minute
        - type: Percent
          value: 10
          periodSeconds: 60
        # Or remove at most 4 pods per minute
        - type: Pods
          value: 4
          periodSeconds: 60
      # Use the policy that results in the largest reduction
      selectPolicy: Max
    # Scale-up behavior
    scaleUp:
      # No stabilization: scale up immediately
      stabilizationWindowSeconds: 0
      policies:
        # Double the pod count every 15 seconds
        - type: Percent
          value: 100
          periodSeconds: 15
        # Or add 4 pods every 15 seconds
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
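A rough sketch of how the scaleUp policies above combine (per-policy periodSeconds accounting, which the real controller tracks, is omitted; function and parameter names are illustrative):

```python
import math

def allowed_scale_up(current_replicas: int, percent: int = 100, pods: int = 4) -> int:
    """Upper bound on replicas after one scale-up step.
    Each policy yields a cap; selectPolicy: Max picks the most permissive."""
    by_percent = current_replicas + math.ceil(current_replicas * percent / 100)
    by_pods = current_replicas + pods
    return max(by_percent, by_pods)  # selectPolicy: Max

# At 3 replicas, doubling allows 6 but "add 4 pods" allows 7 -> cap is 7.
print(allowed_scale_up(3))   # 7
# At 10 replicas, doubling (20) beats +4 pods (14) -> cap is 20.
print(allowed_scale_up(10))  # 20
```

This is why small deployments benefit from a Pods policy (percentages round to almost nothing at low counts) while large ones are governed by the Percent policy.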

Vertical Pod Autoscaling (VPA) adjusts resource requests (CPU/memory) for pods rather than the replica count:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp-deployment
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Off
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

Important: Don’t run HPA and VPA against the same pods on the same metrics (CPU/memory); the two controllers will fight each other. Pairing VPA with an HPA driven by custom or external metrics is the usual safe combination.

Always set resource requests; HPA utilization targets are percentages of the requested values, so without requests the controller cannot compute utilization at all:

containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi

Replica bounds:

  • minReplicas: At least 2 for production to ensure availability during updates
  • maxReplicas: Cap based on cluster capacity and cost budget

Choosing a metric:

  • CPU: Good for stateless, CPU-intensive workloads
  • Memory: Good for workloads with consistent memory patterns
  • Custom: For queue depth, request latency, etc.

Prevent “flapping” (rapid scale up/down):

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 minutes
  scaleUp:
    stabilizationWindowSeconds: 0    # immediate scale up
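What the scale-down window buys you, sketched: the controller remembers recent replica recommendations and only shrinks to the highest one still inside the window, so a brief dip in load removes no pods (the function name is illustrative):

```python
def stabilized_scale_down(recommendations_in_window: list[int]) -> int:
    """Scale-down target = the highest recommendation seen in the window.
    Only when high recommendations age out does the count actually drop."""
    return max(recommendations_in_window)

# Load dipped briefly: per-interval recommendations over the last 5 minutes.
print(stabilized_scale_down([8, 3, 4, 8]))  # 8 -> no scale-down yet
```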
# Generate load to test
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
# Then inside the container:
while true; do wget -q -O- http://myapp-service; done

# In another terminal, watch the HPA react
kubectl get hpa myapp-hpa --watch

Common issues and solutions:

# HPA not creating pods: inspect the HPA first
kubectl describe hpa <name>

# Check metrics availability
kubectl top pods
kubectl top nodes

# Check HPA events
kubectl get events --field-selector involvedObject.name=<hpa-name>

# Common issues:
# 1. Missing metrics-server
# 2. Missing resource requests in pods
# 3. HPA targeting wrong resource
# 4. Cluster at capacity (max nodes reached)

Horizontal Pod Autoscaling is essential for:

  • Cost optimization: Scale down during low traffic
  • Availability: Scale up during high traffic
  • Resilience: Handle traffic spikes automatically

Key considerations:

  • Always set resource requests
  • Choose appropriate metrics for your workload
  • Configure scaling behavior for stability
  • Test autoscaling under load
  • Monitor and adjust based on real-world behavior