

The complete guide to Kubernetes autoscaling
Kubernetes autoscaling automatically adjusts compute resources to match application demand. This guide covers everything from basic concepts to advanced patterns, with practical examples for implementing autoscaling in production environments.
Key takeaways:
- Kubernetes autoscaling includes HPA (horizontal), VPA (vertical), and Cluster Autoscaler
- Works for ALL workloads: web apps, APIs, data processing, GPUs
- Can reduce costs by 50-70% while maintaining performance
- Northflank simplifies configuration with visual controls and automated metric collection
- Custom metrics enable business-specific scaling (queue depth, latency, connections)
Kubernetes autoscaling dynamically adjusts resources based on real-time demand. Unlike traditional static provisioning where you guess capacity needs, autoscaling responds to actual usage patterns, scaling up during peaks and down during quiet periods.
| Workload type | Scaling trigger | Example | Northflank advantage |
|---|---|---|---|
| Web apps | Request rate | E-commerce during sales | RPS-based scaling |
| APIs | Latency/connections | Payment gateways | Custom metrics support |
| Data processing | Queue depth | ETL pipelines | Prometheus integration |
| ML/GPU | GPU utilization | Model training | Resource-aware scaling |
| Microservices | Service-specific | Order processing | Per-service configs |
🛍️ E-commerce flash sales: An online retailer experiences 20x normal traffic during Black Friday. Without autoscaling, they'd need to provision for peak capacity year-round. With Northflank's autoscaling, their platform automatically scales from 5 to 100 instances during the sale, then back down afterward.
📹 Social media viral content: A social platform's video service typically handles 1,000 requests/second. When content goes viral, traffic spikes to 50,000 requests/second within minutes. Autoscaling prevents service degradation by rapidly adding capacity.
🏦 Financial services batch processing: A bank processes transactions in nightly batches. Data volume varies from 1GB on weekends to 100GB at month-end. Autoscaling provisions resources only when needed, reducing costs by 80% compared to static provisioning.
HPA scales the number of pod replicas based on metrics. It's ideal for stateless applications where adding instances directly increases capacity.
Basic HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Northflank's simplified approach: Instead of YAML, Northflank provides toggle controls where you:
- Enable horizontal autoscaling with one click
- Set min/max instances with sliders
- Configure CPU, memory, and RPS thresholds
- View real-time scaling events in the dashboard
VPA adjusts resource requests and limits for pods, right-sizing containers based on actual usage.
VPA configuration example:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: webapp
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
```
The Cluster Autoscaler manages the cluster itself, adding or removing nodes based on pod scheduling needs.
Key behaviors:
- Adds nodes when pods can't be scheduled
- Removes underutilized nodes after grace period
- Respects pod disruption budgets
- Considers node pools and instance types
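On managed platforms these behaviors are configured for you, but on a self-managed cluster they map to flags on the cluster-autoscaler deployment. A minimal sketch, with illustrative flag values and an assumed image tag (pick one matching your Kubernetes version):
```yaml
# Illustrative cluster-autoscaler container args (self-managed clusters only)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --expander=least-waste            # prefer node groups that waste the least capacity
      - --scale-down-unneeded-time=10m    # grace period before removing an underutilized node
      - --scale-down-delay-after-add=10m  # wait after a scale-up before considering scale-down
      - --balance-similar-node-groups     # spread nodes evenly across similar pools
```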
Northflank's custom metrics support enables scaling based on application-specific indicators:
```python
# Expose custom metrics in your app
from flask import Flask, Response
from prometheus_client import Gauge, generate_latest

app = Flask(__name__)
queue_depth = Gauge('message_queue_depth', 'Pending messages')

@app.route('/metrics')
def metrics():
    queue_depth.set(get_queue_size())  # your own queue-size lookup
    return Response(generate_latest(), mimetype='text/plain')
```
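Requesting /metrics then returns Prometheus text format, along these lines (the value shown is illustrative):
```
# HELP message_queue_depth Pending messages
# TYPE message_queue_depth gauge
message_queue_depth 42.0
```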
Then in Northflank:
- Specify your Prometheus endpoint and port
- Select metric type (Gauge or Counter)
- Set threshold values
- Northflank handles the rest; no adapter configuration needed
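For comparison, on vanilla Kubernetes the same scaling rule requires a Prometheus adapter plus an HPA metric block along these lines (metric name and target value are illustrative):
```yaml
# Vanilla Kubernetes equivalent: requires a Prometheus adapter to expose the metric
metrics:
  - type: Pods
    pods:
      metric:
        name: message_queue_depth
      target:
        type: AverageValue
        averageValue: "100"   # scale so each pod averages <= 100 pending messages
```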
Northflank transforms complex Kubernetes autoscaling into a straightforward process. Instead of wrestling with YAML configurations and manual metric setup, Northflank provides an intuitive interface that makes autoscaling accessible to teams of all sizes. The platform handles the underlying complexity while giving you powerful features like custom metrics, multi-threshold scaling, and real-time monitoring, all without the operational overhead of managing Kubernetes directly.
Every 15 seconds, the autoscaling control loop:
1. Collects metrics from all pods in the deployment
2. Calculates average utilization across instances
3. Determines the required replica count using the formula: required = ceil[current * (actualMetric / targetMetric)]
4. Applies the scaling decision based on configured policies

For example, 4 replicas running at 90% actual CPU against a 70% target gives ceil(4 * 90 / 70) = 6 replicas.
Scale-up characteristics:
- Immediate response to threshold breach
- Can scale multiple instances at once
- No cooldown period by default
Scale-down characteristics:
- 5-minute stabilization window by default
- Gradual reduction to prevent flapping
- Uses the highest recommended replica count from the window
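In plain Kubernetes, both behaviors are tunable through the autoscaling/v2 behavior field; a minimal sketch with illustrative values:
```yaml
# Sketch: tuning scale velocity via the autoscaling/v2 behavior field
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react immediately to threshold breaches
    policies:
      - type: Percent
        value: 100                    # at most double the replica count
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300   # 5-minute window to prevent flapping
    policies:
      - type: Pods
        value: 2                      # remove at most 2 pods
        periodSeconds: 60             # per minute
```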
When using multiple metrics (CPU, memory, RPS), the autoscaler:
- Calculates required replicas for each metric independently
- Selects the highest requirement
- Ensures no metric is under-provisioned
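Expressed as raw Kubernetes config, a multi-metric HPA simply lists each metric, and the controller applies the highest recommendation; a sketch with illustrative thresholds:
```yaml
# Each metric yields its own replica recommendation; the HPA takes the highest
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```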
Northflank makes this seamless: enable any combination of metrics and the platform coordinates scaling decisions automatically.
Static provisioning wastes resources during low-demand periods. Autoscaling delivers significant savings:
- Development environments: 70-80% reduction (scale to zero when unused)
- Production services: 50-60% reduction (right-sized for actual load)
- Batch processing: 80-90% reduction (resources only when processing)
Autoscaling maintains performance KPIs during demand variations:
- Response times stay consistent during traffic spikes
- Queue processing times remain predictable
- User experience doesn't degrade under load
Teams save significant time by eliminating manual scaling tasks:
- No midnight interventions for traffic spikes
- Automatic response to seasonal patterns
- Focus on features, not infrastructure
Northflank amplifies these benefits with visual monitoring and one-click configuration changes that would typically require complex YAML editing and kubectl commands.
Combine reactive autoscaling with scheduled scaling for predictable patterns:
```yaml
# Morning scale-up for business hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-scaleup
spec:
  schedule: "0 7 * * MON-FRI"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scaler
              image: bitnami/kubectl
              command: ["kubectl", "patch", "hpa", "webapp-hpa",
                        "--patch", '{"spec":{"minReplicas":10}}']
```
Northflank's deployment strategies work seamlessly with autoscaling:
- Deploy new version with same autoscaling config
- Gradually shift traffic while both scale independently
- Complete cutover with zero downtime
For global applications, implement region-aware autoscaling:
- Scale regions independently based on local demand
- Use Northflank's multi-region support
- Configure different thresholds per region
Traditional Kubernetes autoscaling requires:
- Installing metrics servers
- Configuring RBAC permissions
- Writing YAML manifests
- Setting up Prometheus adapters
- Managing metric aggregation
Northflank provides:
- Visual configuration: Sliders and toggles instead of YAML
- Automatic metric collection: No manual setup required
- Integrated monitoring: See metrics and scaling events together
- Custom metrics support: Prometheus endpoints work immediately
- Multi-metric coordination: CPU, memory, and RPS in one interface
Here's how a Northflank user configures autoscaling:
- Navigate to your service's Resources page
- Expand "Advanced resource options"
- Toggle "Enable horizontal autoscaling"
- Set your parameters:
  - Minimum instances: 2
  - Maximum instances: 20
  - CPU threshold: 70%
  - Memory threshold: 80%
  - RPS threshold: 1000
That's it. No YAML, no kubectl, no manual metric server configuration.
Northflank provides integrated monitoring that shows:
- Current instance count with scaling history
- Real-time metrics for all configured thresholds
- Scaling event logs with reasons
- Cost tracking as instances scale
Q: Is autoscaling only for GPU/ML workloads? A: No! Autoscaling works for any workload: web apps, APIs, batch jobs, microservices. Northflank supports autoscaling for all deployment types.
Q: How quickly does autoscaling respond? A: Metrics are checked every 15 seconds. Scale-up can happen immediately, while scale-down uses a 5-minute window to prevent flapping.
Q: Can I use custom metrics with Northflank? A: Yes! Expose any metric via Prometheus format, and Northflank will use it for scaling decisions. Common examples include queue depth, active connections, or business metrics.
Q: What happens during deployment updates? A: Northflank maintains autoscaling configuration during updates. New pods inherit the same scaling rules, ensuring consistent behavior.
Q: How do I test autoscaling? A: Use load testing tools to simulate traffic. Monitor the Northflank dashboard to see scaling in action. Start with conservative thresholds and adjust based on observations.
Q: Can I combine different types of autoscaling? A: Yes, but carefully. HPA and VPA can conflict if not configured properly. Northflank's platform handles HPA elegantly, and you can use VPA recommendations to set initial resource requests.
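For that recommendation-only pattern, a VPA can run in "Off" mode so it suggests resource values without ever evicting pods; a minimal sketch:
```yaml
# Recommendation-only VPA: reports suggested requests without restarting pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa-recommend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; never apply them
```
You can then read the suggested requests from the VPA object's status and use them as starting values for services scaled horizontally.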
Kubernetes autoscaling transforms how we manage application resources, moving from static provisioning to dynamic, demand-based allocation. While the underlying technology is powerful, its complexity has traditionally limited adoption.
Northflank changes this by making enterprise-grade autoscaling accessible to teams of all sizes. Through intuitive interfaces, automatic metric collection, and integrated monitoring, Northflank removes the operational burden while delivering the full benefits of Kubernetes autoscaling.
Whether you're scaling web applications during traffic spikes, optimizing batch processing costs, or managing complex microservices architectures, autoscaling ensures optimal resource utilization. Start with basic CPU-based scaling, monitor real-world behavior through Northflank's dashboards, and gradually introduce custom metrics as your needs evolve.
The future of infrastructure is adaptive, not static.
With Northflank's approach to Kubernetes autoscaling, that future is accessible today, no infrastructure expertise required.