

The complete guide to Kubernetes autoscaling
Kubernetes autoscaling automatically adjusts compute resources to match application demand. This guide covers everything from basic concepts to advanced patterns, with practical examples for implementing autoscaling in production environments.
Key takeaways:
- Kubernetes autoscaling includes HPA (horizontal), VPA (vertical), and Cluster Autoscaler
- Works for ALL workloads: web apps, APIs, data processing, GPUs
- Can reduce costs by 50-70% while maintaining performance
- Northflank simplifies configuration with visual controls and automated metric collection
- Custom metrics enable business-specific scaling (queue depth, latency, connections)
Kubernetes autoscaling dynamically adjusts resources based on real-time demand. Unlike traditional static provisioning where you guess capacity needs, autoscaling responds to actual usage patterns, scaling up during peaks and down during quiet periods.
| Workload type | Scaling trigger | Example | Northflank advantage |
|---|---|---|---|
| Web apps | Request rate | E-commerce during sales | RPS-based scaling |
| APIs | Latency/connections | Payment gateways | Custom metrics support |
| Data processing | Queue depth | ETL pipelines | Prometheus integration |
| ML/GPU | GPU utilization | Model training | Resource-aware scaling |
| Microservices | Service-specific | Order processing | Per-service configs |
🛍️ E-commerce flash sales: An online retailer experiences 20x normal traffic during Black Friday. Without autoscaling, they'd need to provision for peak capacity year-round. With Northflank's autoscaling, their platform automatically scales from 5 to 100 instances during the sale, then back down afterward.
📹 Social media viral content: A social platform's video service typically handles 1,000 requests/second. When content goes viral, traffic spikes to 50,000 requests/second within minutes. Autoscaling prevents service degradation by rapidly adding capacity.
🏦 Financial services batch processing: A bank processes transactions in nightly batches. Data volume varies from 1GB on weekends to 100GB at month-end. Autoscaling provisions resources only when needed, reducing costs by 80% compared to static provisioning.
HPA scales the number of pod replicas based on metrics. It's ideal for stateless applications where adding instances directly increases capacity.
Basic HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Northflank's simplified approach: Instead of YAML, Northflank provides toggle controls where you:
- Enable horizontal autoscaling with one click
- Set min/max instances with sliders
- Configure CPU, memory, and RPS thresholds
- View real-time scaling events in the dashboard
VPA adjusts resource requests and limits for pods, right-sizing containers based on actual usage.
VPA configuration example:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: webapp
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
```
The Cluster Autoscaler manages the cluster itself, adding or removing nodes based on pod scheduling needs.
Key behaviors:
- Adds nodes when pods can't be scheduled
- Removes underutilized nodes after grace period
- Respects pod disruption budgets
- Considers node pools and instance types
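On managed platforms these behaviors are configured for you, but on a self-managed cluster they map to flags on the cluster-autoscaler deployment. A minimal sketch, with illustrative flag values and an assumed image tag (pick one matching your Kubernetes version):
```yaml
# Illustrative cluster-autoscaler container args (self-managed clusters only)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --expander=least-waste            # prefer node groups that waste the least capacity
      - --scale-down-unneeded-time=10m    # grace period before removing an underutilized node
      - --scale-down-delay-after-add=10m  # wait after a scale-up before considering scale-down
      - --balance-similar-node-groups     # spread nodes evenly across similar pools
```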
Northflank's custom metrics support enables scaling based on application-specific indicators:
```python
# Expose custom metrics in your app
from flask import Flask, Response
from prometheus_client import Gauge, generate_latest

app = Flask(__name__)
queue_depth = Gauge('message_queue_depth', 'Pending messages')

@app.route('/metrics')
def metrics():
    queue_depth.set(get_queue_size())  # your own queue-size lookup
    return Response(generate_latest(), mimetype='text/plain')
```
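Requesting /metrics then returns Prometheus text format, along these lines (the value shown is illustrative):
```
# HELP message_queue_depth Pending messages
# TYPE message_queue_depth gauge
message_queue_depth 42.0
```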
Then in Northflank:
- Specify your Prometheus endpoint and port
- Select metric type (Gauge or Counter)
- Set threshold values
- Northflank handles the rest; no adapter configuration needed
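For comparison, on vanilla Kubernetes the same scaling rule requires a Prometheus adapter plus an HPA metric block along these lines (metric name and target value are illustrative):
```yaml
# Vanilla Kubernetes equivalent: requires a Prometheus adapter to expose the metric
metrics:
  - type: Pods
    pods:
      metric:
        name: message_queue_depth
      target:
        type: AverageValue
        averageValue: "100"   # scale so each pod averages <= 100 pending messages
```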
Northflank transforms complex Kubernetes autoscaling into a straightforward process. Instead of wrestling with YAML configurations and manual metric setup, Northflank provides an intuitive interface that makes autoscaling accessible to teams of all sizes. The platform handles the underlying complexity while giving you powerful features like custom metrics, multi-threshold scaling, and real-time monitoring, all without the operational overhead of managing Kubernetes directly.
Every 15 seconds, the autoscaling control loop:
1. Collects metrics from all pods in the deployment
2. Calculates average utilization across instances
3. Determines the required replica count using the formula: required = ceil[current * (actualMetric / targetMetric)]
4. Applies the scaling decision based on configured policies

For example, 4 replicas running at 90% actual CPU against a 70% target gives ceil(4 * 90 / 70) = 6 replicas.
Scale-up characteristics:
- Immediate response to threshold breach
- Can scale multiple instances at once
- No cooldown period by default
Scale-down characteristics:
- 5-minute stabilization window by default
- Gradual reduction to prevent flapping
- Uses the highest recommended replica count from the window
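In plain Kubernetes, both behaviors are tunable through the autoscaling/v2 behavior field; a minimal sketch with illustrative values:
```yaml
# Sketch: tuning scale velocity via the autoscaling/v2 behavior field
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react immediately to threshold breaches
    policies:
      - type: Percent
        value: 100                    # at most double the replica count
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300   # 5-minute window to prevent flapping
    policies:
      - type: Pods
        value: 2                      # remove at most 2 pods
        periodSeconds: 60             # per minute
```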
When using multiple metrics (CPU, memory, RPS), the autoscaler:
- Calculates required replicas for each metric independently
- Selects the highest requirement
- Ensures no metric is under-provisioned
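Expressed as raw Kubernetes config, a multi-metric HPA simply lists each metric, and the controller applies the highest recommendation; a sketch with illustrative thresholds:
```yaml
# Each metric yields its own replica recommendation; the HPA takes the highest
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```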
Northflank makes this seamless: enable any combination of metrics and the platform coordinates scaling decisions automatically.
Static provisioning wastes resources during low-demand periods. Autoscaling delivers significant savings:
- Development environments: 70-80% reduction (scale to zero when unused)
- Production services: 50-60% reduction (right-sized for actual load)
- Batch processing: 80-90% reduction (resources only when processing)
Autoscaling maintains performance KPIs during demand variations:
- Response times stay consistent during traffic spikes
- Queue processing times remain predictable
- User experience doesn't degrade under load
Teams save significant time by eliminating manual scaling tasks:
- No midnight interventions for traffic spikes
- Automatic response to seasonal patterns
- Focus on features, not infrastructure
Northflank amplifies these benefits with visual monitoring and one-click configuration changes that would typically require complex YAML editing and kubectl commands.
Combine reactive autoscaling with scheduled scaling for predictable patterns:
```yaml
# Morning scale-up for business hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-scaleup
spec:
  schedule: "0 7 * * MON-FRI"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scaler
              image: bitnami/kubectl
              command: ["kubectl", "patch", "hpa", "webapp-hpa",
                        "--patch", '{"spec":{"minReplicas":10}}']
```
Northflank's deployment strategies work seamlessly with autoscaling:
- Deploy new version with same autoscaling config
- Gradually shift traffic while both scale independently
- Complete cutover with zero downtime
For global applications, implement region-aware autoscaling:
- Scale regions independently based on local demand
- Use Northflank's multi-region support
- Configure different thresholds per region
Traditional Kubernetes autoscaling requires:
- Installing metrics servers
- Configuring RBAC permissions
- Writing YAML manifests
- Setting up Prometheus adapters
- Managing metric aggregation
Northflank provides:
- Visual configuration: Sliders and toggles instead of YAML
- Automatic metric collection: No manual setup required
- Integrated monitoring: See metrics and scaling events together
- Custom metrics support: Prometheus endpoints work immediately
- Multi-metric coordination: CPU, memory, and RPS in one interface
Here's how a Northflank user configures autoscaling:
- Navigate to your service's Resources page
- Expand "Advanced resource options"
- Toggle "Enable horizontal autoscaling"
- Set your parameters:
  - Minimum instances: 2
  - Maximum instances: 20
  - CPU threshold: 70%
  - Memory threshold: 80%
  - RPS threshold: 1000
That's it. No YAML, no kubectl, no manual metric server configuration.
Northflank provides integrated monitoring that shows:
- Current instance count with scaling history
- Real-time metrics for all configured thresholds
- Scaling event logs with reasons
- Cost tracking as instances scale
Q: Is autoscaling only for GPU/ML workloads? A: No! Autoscaling works for any workload: web apps, APIs, batch jobs, microservices. Northflank supports autoscaling for all deployment types.
Q: How quickly does autoscaling respond? A: Metrics are checked every 15 seconds. Scale-up can happen immediately, while scale-down uses a 5-minute window to prevent flapping.
Q: Can I use custom metrics with Northflank? A: Yes! Expose any metric via Prometheus format, and Northflank will use it for scaling decisions. Common examples include queue depth, active connections, or business metrics.
Q: What happens during deployment updates? A: Northflank maintains autoscaling configuration during updates. New pods inherit the same scaling rules, ensuring consistent behavior.
Q: How do I test autoscaling? A: Use load testing tools to simulate traffic. Monitor the Northflank dashboard to see scaling in action. Start with conservative thresholds and adjust based on observations.
Q: Can I combine different types of autoscaling? A: Yes, but carefully. HPA and VPA can conflict if not configured properly. Northflank's platform handles HPA elegantly, and you can use VPA recommendations to set initial resource requests.
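For that recommendation-only pattern, a VPA can run in "Off" mode so it suggests resource values without ever evicting pods; a minimal sketch:
```yaml
# Recommendation-only VPA: reports suggested requests without restarting pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa-recommend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; never apply them
```
You can then read the suggested requests from the VPA object's status and use them as starting values for services scaled horizontally.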
Kubernetes autoscaling transforms how we manage application resources, moving from static provisioning to dynamic, demand-based allocation. While the underlying technology is powerful, its complexity has traditionally limited adoption.
Northflank changes this by making enterprise-grade autoscaling accessible to teams of all sizes. Through intuitive interfaces, automatic metric collection, and integrated monitoring, Northflank removes the operational burden while delivering the full benefits of Kubernetes autoscaling.
Whether you're scaling web applications during traffic spikes, optimizing batch processing costs, or managing complex microservices architectures, autoscaling ensures optimal resource utilization. Start with basic CPU-based scaling, monitor real-world behavior through Northflank's dashboards, and gradually introduce custom metrics as your needs evolve.
The future of infrastructure is adaptive, not static.
With Northflank's approach to Kubernetes autoscaling, that future is accessible today, no infrastructure expertise required.