
Most Kubernetes Autoscaling Setups Are Silently Broken: 10 Gotchas to Watch Out For


Kubernetes autoscaling is a powerful tool, but many setups fail silently because of misleading metrics, mismatched configuration, and operational details that are easy to overlook. Here are 10 common pitfalls that can break your autoscaling, and how to avoid them.

1. CPU Percentage Doesn't Equal Performance

The Horizontal Pod Autoscaler (HPA) defaults to CPU usage, which works well for CPU-bound applications. However, async I/O, GPU, or network-bound services might show near-zero CPU usage while being overloaded. To ensure reliable scaling, choose a metric that reflects user experience, such as request latency or queue depth, rather than just CPU.

Example: Database connection pool service

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: db-pool-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: db-connection-pool
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Bad: CPU might be low while connections are maxed out
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Good: Scale based on actual connection usage
  - type: Object
    object:
      # The Object metric type requires describedObject: the Kubernetes object the
      # metric is attached to (here assumed to be the pool's Service)
      describedObject:
        apiVersion: v1
        kind: Service
        name: db-connection-pool
      metric:
        name: active_connections_percentage
      target:
        type: Value
        value: "80"

2. You Can Autoscale on Business Metrics

HPA supports custom, external, or object metrics like queries per second (QPS), queue depth, SLO latency, or even revenue per pod. Most clusters already use Prometheus—add the Prometheus Adapter to unlock smarter scaling based on metrics that matter to your business.

Example: E-commerce checkout service scaling on revenue

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-revenue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: revenue_per_minute
        selector:
          matchLabels:
            service: checkout
      target:
        type: AverageValue
        # Average metric value per pod; the HPA adds replicas when it is exceeded
        averageValue: "10"
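
If you wire this up with the Prometheus Adapter, the external metric above has to be exposed by an adapter rule. A minimal sketch, assuming a Prometheus counter named checkout_revenue_total labelled with service and namespace (the metric name and labels are assumptions):

# Prometheus Adapter configuration (the rules ConfigMap or Helm values)
externalRules:
- seriesQuery: 'checkout_revenue_total{service="checkout"}'
  resources:
    # Map the Prometheus "namespace" label to the Kubernetes namespace
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "revenue_per_minute"
  # Convert the raw counter into a per-minute rate for the HPA to consume
  metricsQuery: 'sum(rate(checkout_revenue_total{<<.LabelMatchers>>}[2m])) * 60'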

3. HPA + VPA: Mix Carefully

Running the Vertical Pod Autoscaler (VPA) in Auto mode alongside HPA on the same metric can cause thrashing, where resources conflict and scale erratically. Instead, use VPA in Off mode to recommend right-sized resource requests, while HPA scales replicas based on custom or external metrics.

Example: VPA in recommendation mode with HPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    # "Off" = recommendation only; read the suggested requests with: kubectl describe vpa web-app-vpa
    updateMode: "Off"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # illustrative: ~100 requests per second per pod

4. Cold-Start Time Can Wipe Out Benefits

If a new pod takes 90 seconds to start and the autoscaler reacts 15 seconds after a traffic spike, users will wait about 105 seconds for the pod to handle traffic. Fast scaling only helps if your pods start quickly. Optimize pod startup times to maximize autoscaling benefits.

Example: Optimizing container startup time

# Bad: slow startup (heavy base image that is slow to pull onto freshly provisioned nodes)
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
CMD ["python3", "app.py"]

# Good: Fast startup with multi-stage build
FROM python:3.9-slim AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.9-slim
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "app.py"]

5. Node Scale-Down Can Cut Capacity

The Cluster Autoscaler may drain a node before a replacement pod is fully ready, leading to traffic brownouts. Use Pod Disruption Budgets (PDBs), readiness probes, preStop hooks, and multiple replicas to ensure capacity isn't reduced prematurely.

Example: Pod Disruption Budget for API service

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myapi:v1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
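
For pods that really must not be moved mid-flight (long-running jobs, in-memory state), the Cluster Autoscaler also honours a per-pod annotation that blocks it from evicting them during scale-down. A minimal sketch; the workload name and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stateful-worker
  template:
    metadata:
      labels:
        app: stateful-worker
      annotations:
        # Tell the Cluster Autoscaler never to evict these pods when draining nodes
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: worker
        image: myworker:v1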

6. Cluster Autoscaler Isn't Instant

Provisioning a VM, attaching storage, and registering a node can take 60–120 seconds. Account for this delay by buffering or pre-scaling ahead of predictable traffic spikes, like scheduled events or peak hours.

Example: Pre-scaling before known traffic spike

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pre-scale-batch-workers
spec:
  schedule: "55 8 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # Assumes a ServiceAccount bound to RBAC that allows scaling Deployments (see the sketch after gotcha 10)
          containers:
          - name: pre-scaler
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment batch-workers --replicas=20
          restartPolicy: OnFailure
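
Another common way to buffer is to keep low-priority placeholder pods running: they hold spare nodes warm and are preempted the moment real workloads need the room, which forces the Cluster Autoscaler to add capacity ahead of demand. A rough sketch with assumed names and sizes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that any real workload may preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            # Size these requests to the headroom you want to keep warm
            cpu: "500m"
            memory: "512Mi"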

7. Cost-Aware Autoscaling Is Still New

Traditional autoscalers optimize for utilization, not cost per request. Some tools like Karpenter or CAST AI can help select cheaper instance types, leverage spot capacity, or consolidate nodes to reduce costs. Explore these options to balance performance and budget.

Example: Using node affinity for cost optimization

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["t3.medium", "t3.large"]
          - weight: 80
            preference:
              matchExpressions:
              # This label is set by Karpenter; drop this block if you don't run it
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]

8. Hope Is Not a Strategy

Don't wait for a real traffic spike to test your setup. Feed synthetic metrics into HPA to simulate high-load scenarios, like a Black Friday surge at 2 a.m. Validate that scale-out times meet your Service Level Objectives (SLOs) before production traffic hits.

Example: A simple load testing script for HPA validation

#!/bin/bash
# simulate_load_test.sh

# Drive 5 minutes of load with 50 concurrent workers using hey
kubectl run load-test --rm -i --tty \
  --image=williamyeh/hey \
  --restart=Never \
  -- -z 300s -c 50 http://my-service/api/endpoint

# Watch the HPA and pod count react while the load runs
watch kubectl get hpa,pods
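
If your HPA scales on an external Prometheus metric (as in gotcha 2), you can also rehearse scale-out without generating real traffic by publishing a synthetic value for that metric. A rough sketch, assuming a Prometheus Pushgateway reachable at pushgateway:9091 that Prometheus scrapes, and that your adapter rule matches the pushed series:

apiVersion: batch/v1
kind: Job
metadata:
  name: synthetic-metric-spike
spec:
  template:
    spec:
      containers:
      - name: pusher
        image: curlimages/curl
        command:
        - /bin/sh
        - -c
        - |
          # Publish an artificially high value so the HPA believes load is spiking
          echo 'revenue_per_minute{service="checkout"} 500' \
            | curl --data-binary @- http://pushgateway:9091/metrics/job/synthetic-load
      restartPolicy: Never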

9. Resource Requests Matter

HPA scales on the ratio of resource usage to requests, not on raw usage. Requests set too high make utilization look artificially low, so the HPA under-scales even under heavy load; requests set too low trigger premature scale-out. For example, with a 250m CPU request and a 70% target, scale-out starts around 175m of actual usage per pod, while the same load against a 2000m request reads as under 10% utilization and nothing happens. Regularly review resource requests and align them with actual usage patterns.

Example: Right-sizing resource requests

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            # HPA utilization is computed against these requests, not against limits
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"

10. Autoscaling Can Hurt SLOs Without Prediction

If traffic spikes predictably (e.g., every hour on the hour) and scale-up takes two minutes, you'll drop requests during that window. Avoid this by proactively scaling up with scheduled jobs or a simple kubectl scale command before the spike hits.

Example: Scheduled pre-scaling for predictable traffic

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-for-lunch-rush
spec:
  schedule: "55 11 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/food-ordering-api
            - --replicas=15
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-after-lunch
spec:
  schedule: "30 13 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/food-ordering-api
            - --replicas=5
          restartPolicy: OnFailure
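
One practical note on the kubectl-based CronJobs in gotchas 6 and 10: the job's pod needs a ServiceAccount that is allowed to scale Deployments. Also, if an HPA manages the same Deployment, patching the HPA's minReplicas is usually more robust than a raw kubectl scale, since the HPA will otherwise reconcile the replica count back. A minimal RBAC sketch with assumed names and namespace:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-scaler
  namespace: default   # assumption: the CronJobs run in the default namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: scale-deployments
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scale-deployments
  namespace: default
subjects:
- kind: ServiceAccount
  name: deployment-scaler
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: scale-deployments

Reference it from each CronJob's pod spec with serviceAccountName: deployment-scaler.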

Resources

👉 A big thanks to FAUN.dev sponsor, PerfectScale, for sponsoring our newsletters this week.
Download their free eBook, "Mastering Kubernetes Autoscaling", for a practical guide to effective autoscaling techniques.

👉 Want to become the best Kubernetes engineer on your team? Subscribe to Kaptain, FAUN.dev's Kubernetes weekly newsletter, for expert insights, tips, tutorials, and content you won't find elsewhere.

👉 If you want to dive deeper into HPA, VPA, or the Cluster Autoscaler, the Kubernetes docs are still the best place to start.

