
Kubernetes Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization, memory usage, or custom metrics. This enables applications to handle increased load automatically and reduce costs during low traffic periods.

┌───────────────────────────────────────────────────────────────┐
│                        HPA Controller                         │
│                                                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │   Metrics    │───▶│   Decision   │───▶│    Scale     │     │
│  │   Server     │    │    Engine    │    │    Action    │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│         │                                       │             │
│         ▼                                       ▼             │
│  ┌──────────────┐                      ┌──────────────┐       │
│  │ Pod Metrics  │                      │  ReplicaSet  │       │
│  │  (CPU/Mem)   │                      │    Update    │       │
│  └──────────────┘                      └──────────────┘       │
└───────────────────────────────────────────────────────────────┘

The HPA controller:

  1. Polls the Metrics Server at regular intervals (default: every 15 seconds)
  2. Calculates the desired replica count from the observed metrics
  3. Updates the scale sub-resource of the target workload
  4. Repeats continuously, adjusting as load changes
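The calculation in step 2 follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal sketch of it (the 0.1 tolerance mirrors the controller's default `--horizontal-pod-autoscaler-tolerance`; the function name is illustrative, not a real client-library API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA controller would request for one metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 80% CPU against a 50% target -> ceil(4 * 1.6) = 7
print(desired_replicas(4, 80, 50))   # 7
# 6 pods averaging 20% against a 50% target -> ceil(6 * 0.4) = 3
print(desired_replicas(6, 20, 50))   # 3
```

Note that `ceil` biases toward over-provisioning: scaling up happens on any fractional excess, which is the safer direction for availability.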

The Metrics Server must be installed for HPA to work:

# Install using kubectl
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Or install via Helm
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm install metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --set args[0]="--kubelet-insecure-tls"  # only for clusters with self-signed kubelet certs (e.g. local dev)

Verify installation:

# Check metrics-server pod
kubectl get pods -n kube-system -l k8s-app=metrics-server
# Test metrics API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

Create an HPA for a Deployment:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

Key fields:

  • scaleTargetRef: The workload to scale (Deployment, ReplicaSet, or StatefulSet)
  • minReplicas / maxReplicas: Lower and upper bounds for the replica count
  • metrics: The metrics that drive scaling decisions
  • behavior: Optional tuning of scaling speed and stability

CPU utilization is the most common autoscaling metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

This keeps average CPU utilization across the pods near 50% of their CPU requests, adding replicas above that level and removing them below it.

Create and test:

# Create HPA
kubectl apply -f hpa.yaml
# View HPA status
kubectl get hpa
# Detailed information
kubectl describe hpa myapp-hpa

Scale based on memory usage:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

Note: Memory-based scaling assumes pods actually release memory as load drops; many runtimes hold on to allocated memory (or leak it), in which case a scale-down may never trigger.

Combine multiple metrics for better scaling decisions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 15
  metrics:
    # CPU metric
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Memory metric
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

No extra field is needed to combine them: the HPA always evaluates every listed metric and scales to whichever requires the highest replica count.
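The selection rule across multiple metrics can be sketched like this (values and function names are made up for illustration):

```python
import math

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Desired replicas for one metric: ceil(current * ratio)."""
    return math.ceil(current_replicas * current / target)

def combine(current_replicas: int, metrics) -> int:
    """metrics: list of (current_value, target_value) pairs.
    The HPA acts on whichever metric demands the most replicas."""
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# 5 replicas: CPU at 90% vs a 60% target, memory at 50% vs a 70% target.
# CPU asks for ceil(5 * 1.5) = 8, memory for ceil(5 * 0.72) = 4 -> scale to 8.
print(combine(5, [(90, 60), (50, 70)]))  # 8
```

Taking the maximum means a single saturated resource is enough to trigger a scale-up, while scale-down requires every metric to be comfortably under target.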

Scale based on application-specific metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Custom metric from Prometheus
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    # CPU as fallback
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

You need a custom metrics adapter (like Prometheus Adapter) for this to work.

Control how aggressively HPA scales:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: controlled-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    # Scale-down behavior
    scaleDown:
      # Wait 5 minutes before scaling down
      stabilizationWindowSeconds: 300
      policies:
        # Remove at most 10% of pods per minute
        - type: Percent
          value: 10
          periodSeconds: 60
        # Or remove at most 4 pods per minute
        - type: Pods
          value: 4
          periodSeconds: 60
      # Use the policy that results in the largest reduction
      selectPolicy: Max
    # Scale-up behavior
    scaleUp:
      # No stabilization: scale up immediately
      stabilizationWindowSeconds: 0
      policies:
        # Double the pod count every 15 seconds
        - type: Percent
          value: 100
          periodSeconds: 15
        # Or add 4 pods every 15 seconds
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
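A rough sketch of how the scaleUp policies above combine (per-policy periodSeconds accounting, which the real controller tracks, is omitted; function and parameter names are illustrative):

```python
import math

def allowed_scale_up(current_replicas: int, percent: int = 100, pods: int = 4) -> int:
    """Upper bound on replicas after one scale-up step.
    Each policy yields a cap; selectPolicy: Max picks the most permissive."""
    by_percent = current_replicas + math.ceil(current_replicas * percent / 100)
    by_pods = current_replicas + pods
    return max(by_percent, by_pods)  # selectPolicy: Max

# At 3 replicas, doubling allows 6 but "add 4 pods" allows 7 -> cap is 7.
print(allowed_scale_up(3))   # 7
# At 10 replicas, doubling (20) beats +4 pods (14) -> cap is 20.
print(allowed_scale_up(10))  # 20
```

This is why small deployments benefit from a Pods policy (percentages round to almost nothing at low counts) while large ones are governed by the Percent policy.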

Vertical Pod Autoscaling (VPA) adjusts resource requests (CPU/memory) for pods rather than the replica count:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp-deployment
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Off
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

Important: Don’t run HPA and VPA against the same pods on the same metrics (CPU/memory); the two controllers will fight each other. Pairing VPA with an HPA driven by custom or external metrics is the usual safe combination.

Always set resource requests; HPA utilization targets are percentages of the requested values, so without requests the controller cannot compute utilization at all:

containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi

Replica bounds:

  • minReplicas: At least 2 for production to ensure availability during updates
  • maxReplicas: Cap based on cluster capacity and cost budget

Choosing a metric:

  • CPU: Good for stateless, CPU-intensive workloads
  • Memory: Good for workloads with consistent memory patterns
  • Custom: For queue depth, request latency, etc.

Prevent “flapping” (rapid scale up/down):

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 minutes
  scaleUp:
    stabilizationWindowSeconds: 0    # immediate scale up
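What the scale-down window buys you, sketched: the controller remembers recent replica recommendations and only shrinks to the highest one still inside the window, so a brief dip in load removes no pods (the function name is illustrative):

```python
def stabilized_scale_down(recommendations_in_window: list[int]) -> int:
    """Scale-down target = the highest recommendation seen in the window.
    Only when high recommendations age out does the count actually drop."""
    return max(recommendations_in_window)

# Load dipped briefly: per-interval recommendations over the last 5 minutes.
print(stabilized_scale_down([8, 3, 4, 8]))  # 8 -> no scale-down yet
```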
# Generate load to test
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
# Then inside the container:
while true; do wget -q -O- http://myapp-service; done

# In another terminal, watch the HPA react
kubectl get hpa myapp-hpa --watch

Common issues and solutions:

# HPA not creating pods: inspect the HPA first
kubectl describe hpa <name>

# Check metrics availability
kubectl top pods
kubectl top nodes

# Check HPA events
kubectl get events --field-selector involvedObject.name=<hpa-name>

# Common issues:
# 1. Missing metrics-server
# 2. Missing resource requests in pods
# 3. HPA targeting wrong resource
# 4. Cluster at capacity (max nodes reached)

Horizontal Pod Autoscaling is essential for:

  • Cost optimization: Scale down during low traffic
  • Availability: Scale up during high traffic
  • Resilience: Handle traffic spikes automatically

Key considerations:

  • Always set resource requests
  • Choose appropriate metrics for your workload
  • Configure scaling behavior for stability
  • Test autoscaling under load
  • Monitor and adjust based on real-world behavior