Recipe: Kubernetes HPA design
A production-ready Horizontal Pod Autoscaler configuration tuned for latency-sensitive workloads running on Meridian infrastructure.
Target workload
Stateless API servers behind a Layer-7 ingress. Average request latency p99 must stay under 200ms during traffic spikes.
Metrics source
Prometheus adapter exposing http_requests_per_second per pod. Fallback to CPU utilization when custom metrics are unavailable.
Scaling thresholds
- Target: 80 requests/second per pod
- Scale-up window: 60s stabilization
- Scale-down window: 300s stabilization
- Min replicas: 3, Max replicas: 20
Manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-gateway
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "80"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300Caveats
HPA reacts to metric spikes, not queuing depth. Pair with a pod-overhead budget so cluster autoscaler has headroom. Test scale-from-zero cold starts before enabling in production.