

What are spot GPUs? Complete guide to cost-effective AI infrastructure
Spot GPUs are unused cloud GPU instances available at up to 90% discounts compared to on-demand pricing. They're perfect for AI inference, training jobs, and burst workloads, but can be interrupted with short notice (30 seconds to 2 minutes, depending on the provider). While traditional spot instances require complex management and quota approvals, modern orchestration platforms like Northflank handle the complexity automatically, providing seamless fallback to on-demand instances when needed. Try it out now or reach out to an engineer who can guide you.
Let me tell you a quick story.
An AI founder was scaling his voice cloning platform, but GPU costs were threatening to kill his startup.
With just two engineers and limited credits, he used spot GPUs with automated orchestration to scale to millions of users while spending 90% less on compute.
If you're running AI workloads, you've likely felt the same sticker shock. High-end GPU instances can cost tens of dollars per hour on demand, but what if you could get the same power at up to 90% off?
That's what spot GPUs offer, and understanding how to use them successfully could be the difference between burning through your budget and building a sustainable AI business.
Let’s talk about spot GPUs and how to cut your AI infrastructure costs.
Spot GPUs are high-performance graphics cards (like NVIDIA H100s or A100s) that you can rent from cloud providers at 60-90% discounts compared to regular prices.
They're essentially "leftover" GPU capacity that cloud providers offer when their data centers aren't fully utilized, but your workload can be interrupted with short notice (30 seconds to 2 minutes, depending on the provider) if they need that hardware back for full-paying customers.
It’s like booking a standby flight ticket. You get the exact same plane and destination as someone who paid full price, but you're flying for $50 instead of $500 because you're willing to get bumped if the flight fills up.
Spot GPUs work the same way, where you get enterprise-grade AI compute power at a fraction of the cost by accepting the possibility of interruption.
The key difference is that "spot instances" refer to any discounted virtual machines using excess capacity, while "spot GPUs" specifically mean those machines equipped with powerful graphics cards for AI training, machine learning inference, and other compute-intensive tasks.
So, for most AI workloads that can handle brief interruptions, you're getting the same performance as expensive on-demand instances at startup-friendly prices.
Spot pricing works like a stock market for compute resources. When cloud providers have excess GPU capacity, they auction it off at discounted rates. The price fluctuates in real time based on supply and demand; if lots of people want GPUs in a specific region, the price goes up. When demand drops, prices fall.
Let’s briefly go over how the interruption system works.
The interruption process is straightforward but happens quickly. When someone wants to pay full price for on-demand capacity and there's no spare hardware available, you're the one who gets bumped:
- Short warning period: AWS gives 2 minutes, Google Cloud and Azure give just 30 seconds
- Reclaimed for on-demand: Happens when someone pays full price, and no spare hardware exists
- Peak usage risk: Most likely during high-demand periods or popular GPU types
- Limited exit time: Enough to save work, but you need automated systems for reliability (see the detection sketch below)
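To act on that warning, something has to be watching for it. As an illustration, here's a minimal sketch of how a process on an AWS spot instance might poll the instance metadata service for the documented spot `instance-action` notice (GCP and Azure expose similar preemption signals through their own metadata endpoints; the `save_checkpoint` function here is a hypothetical placeholder for whatever state-saving your workload does):

```python
import time
import urllib.error
import urllib.request

# AWS publishes a spot termination notice at this instance metadata path.
# It returns 404 until an interruption is scheduled, then a JSON body with
# the action and termination time. (If IMDSv2 is enforced on the instance,
# you would first fetch a session token and send it as a request header.)
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404: no interruption scheduled yet
    except OSError:
        return False  # metadata service unreachable (e.g. not running on EC2)

def save_checkpoint():
    # Hypothetical placeholder: flush state, upload results to object storage, etc.
    print("Saving checkpoint before shutdown...")

while True:
    if interruption_pending():
        save_checkpoint()
        break
    time.sleep(5)  # poll every few seconds; the warning window is short
```

In practice, an orchestration layer runs this kind of watcher for you and drains traffic automatically rather than relying on each job to poll.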
Now, let’s talk about the pricing advantage across providers.
What makes this system work in your favor is that cloud providers would rather make some money from unused capacity than none at all:
- AWS spot instances: Up to 90% savings, widest GPU selection, 2-minute termination notice
- Google Cloud spot VMs: 60-91% discounts, more stable pricing, excellent A100/H100 availability
- Azure spot VMs: Up to 90% savings, deepest discounts during off-peak hours
You only pay the current spot price (not your maximum bid), often run for hours or days without interruption, and get the same performance as expensive on-demand instances. To find current prices, use each provider's built-in tools: the AWS Spot Instance Advisor, Google's pricing console, or Azure's usage recommendations.
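If you'd rather query prices programmatically than click through consoles, the cloud SDKs expose spot pricing directly. Here's a minimal sketch using AWS's boto3 SDK (the instance type, region, and time window are arbitrary examples):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Query the last hour of spot prices for one GPU instance type in one region.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge"],        # example GPU instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    MaxResults=10,
)

for record in response["SpotPriceHistory"]:
    print(record["AvailabilityZone"], record["SpotPrice"], record["Timestamp"])
```

Comparing a few availability zones this way often reveals meaningful price differences within a single region.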
The main differences come down to cost, reliability, and what happens when things go wrong.
With standard (on-demand) VMs, you pay full price but get guaranteed access, and your instance runs until you decide to shut it down.
Spot VMs flip this equation: you pay 60-90% less but accept that your instance might get terminated when someone else needs that hardware.
Let’s see the key differences at a glance:
| Feature | On-Demand VMs | Spot VMs |
| --- | --- | --- |
| Pricing | Full price, predictable costs | 60-90% savings, but prices fluctuate with demand |
| Availability | Guaranteed; starts immediately when you request it | Depends on spare capacity; may not be available in your preferred region or instance type |
| Interruptions | Never interrupted (unless you stop paying your bill) | Can be terminated with short notice (30 seconds to 2 minutes) when capacity is needed elsewhere |
| Service level agreements (SLAs) | Covered by cloud provider uptime SLAs | No SLA coverage; you're using "leftover" capacity |
The choice between them depends on your workload requirements. If you need guaranteed uptime for production databases, web servers, or customer-facing applications, on-demand instances are your safest bet.
However, if you're running AI training jobs, batch processing, development environments, or anything that can pause and resume gracefully, spot instances offer massive savings without significantly impacting your results.
Yes, spot instances are significantly cheaper, but the savings depend on several factors. Let's break down the costs so you can see how much you'll save.
The savings are substantial across all GPU types.
While exact prices fluctuate daily based on demand, the pattern is consistent: spot instances typically cost 60-90% less than on-demand rates. If you’re using H100s, A100s, or older V100 GPUs, you’ll see similar percentage savings.
For instance, if an H100 instance costs $8/hour on-demand, you might pay just $1-2/hour with spot pricing. That same pattern applies across all GPU types and cloud providers.
Now, what does this mean for your total spending?
The savings compound quickly. With 60-90% discounts on every job you run, the difference becomes significant over time if you're running multiple experiments or continuous inference workloads.
You'll save the most when:
- Running jobs during off-peak hours (nights, weekends)
- Using less popular regions (avoid us-east-1 during business hours)
- Being flexible on GPU types (A100 vs H100)
- Running batch jobs that can restart easily
And your savings shrink when:
- Working during peak demand periods (business hours in popular regions)
- Needing specific GPU requirements with limited availability
- Dealing with frequent interruptions that add restart overhead (the short model after this list quantifies this)
- Running short jobs where setup time matters more than runtime costs
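To put numbers on the restart-overhead point, here's a back-of-the-envelope model of effective spot cost once interruptions are included (all rates and overhead figures are illustrative assumptions, not real quotes):

```python
# Illustrative numbers only; substitute real quotes from your provider.
ON_DEMAND_RATE = 8.00      # $/hour for a hypothetical H100 instance
SPOT_RATE = 1.60           # $/hour at an 80% discount
JOB_HOURS = 24             # useful compute the job needs
INTERRUPTIONS = 3          # interruptions expected over the run
RESTART_OVERHEAD_H = 0.25  # hours lost per interruption (reload data, warm up)

wasted_hours = INTERRUPTIONS * RESTART_OVERHEAD_H
spot_cost = SPOT_RATE * (JOB_HOURS + wasted_hours)
on_demand_cost = ON_DEMAND_RATE * JOB_HOURS

print(f"On-demand: ${on_demand_cost:.2f}")                         # $192.00
print(f"Spot incl. restarts: ${spot_cost:.2f}")                    # $39.60
print(f"Effective savings: {1 - spot_cost / on_demand_cost:.0%}")  # 79%
```

For a long job, a few restarts barely dent the discount; for a 30-minute job with the same overhead, the math looks much worse.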
Now that you understand how spot GPUs work and their cost advantages, let's look at when they're great for your needs versus when they might create problems. Like any tool, spot GPUs have compelling benefits but also limitations you need to be aware of.
The cost savings alone make spot GPUs attractive, but they offer several other advantages that make them particularly well-suited for modern AI workloads. See some of them below:
- Massive cost savings (60-91% discounts): With spot pricing, you can access high-end GPUs like H100s at a fraction of the on-demand cost. For startups and teams with tight budgets, this makes previously unaffordable hardware accessible.
- Perfect for burst workloads: If you need to scale from 10 to 100 GPUs for a weekend training run, spot instances let you scale up quickly without long-term commitments, then scale back down when you're done.
- Ideal for inference: Most inference requests take seconds to complete. If a spot instance gets interrupted, you can simply route the next request to another instance and users won't even notice the switch (see the retry sketch after this list).
- Great for batch processing: Training jobs, data processing pipelines, and rendering tasks are naturally fault-tolerant. They can pause, save progress, and resume on a new instance without losing work.
- Enables experimentation: When GPU time costs 90% less, you can afford to try more experiments, test different model architectures, and iterate faster without burning through your budget.
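The routing trick from the inference bullet above is simple enough to sketch client-side. This assumes you run the same model behind several replica endpoints; the URLs and timeout here are hypothetical:

```python
import urllib.request

# Hypothetical replica endpoints serving the same model.
REPLICAS = [
    "http://replica-a.internal:8000/generate",
    "http://replica-b.internal:8000/generate",
    "http://replica-c.internal:8000/generate",
]

def infer(payload: bytes) -> bytes:
    last_error = None
    for url in REPLICAS:
        try:
            req = urllib.request.Request(url, data=payload, method="POST")
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()  # first healthy replica wins
        except OSError as err:  # replica interrupted, unreachable, or erroring
            last_error = err
    raise RuntimeError(f"all replicas failed: {last_error}")
```

A load balancer or service mesh does the same thing at the infrastructure layer; the point is that short, stateless requests make interruptions nearly invisible to users.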
While the benefits are compelling, spot GPUs aren't perfect. These are some of the challenges you'll face:
- Interruption risk and reliability concerns: Your workload can be terminated with 30 seconds to 2 minutes notice. For time-sensitive or long-running processes, this unpredictability can be problematic.
- Management complexity without proper tools: Handling interruptions, checkpointing, and failover manually requires significant engineering effort. You need systems that can gracefully handle shutdowns and restart elsewhere.
- Quota and approval challenges: This is where many teams get stuck. Getting access to spot GPU capacity isn't as simple as clicking "launch"; cloud providers have implemented strict approval processes that can take days or weeks.
People have asked “Why are GPU approvals so complicated?” on forums like Reddit, and the answer comes down to fraud prevention:
Cloud providers face a massive fraud problem. Bad actors use stolen account credentials to spin up hundreds of GPU instances for cryptocurrency mining, then disappear when the bill comes due. AWS, Google, and Azure have responded by heavily vetting GPU requests, especially for new accounts.
Your account type makes a huge difference in approval speed. Enterprise accounts with established billing history get faster approval, while individual developers or new startups often face lengthy review processes. Some teams wait weeks just to get permission to use spot instances they're willing to pay for.
Spot instances also aren't suitable for every workload. Real-time applications, stateful services, and mission-critical production systems that can't tolerate interruptions should stick with on-demand instances despite the higher cost.
Given everything we've covered about interruptions, cost savings, and management complexity, you could be thinking:
"Is this right for my workload?"
The answer depends on how well your application can handle brief interruptions and whether you can design around the unpredictability.
Let’s see what spot GPUs work great for:
- AI model batch inference APIs: Most batch inference processes handle multiple requests together and can tolerate brief delays. If your spot instance gets interrupted, you can restart the batch on another instance without affecting user experience. The cost savings here are massive since batch inference workloads often run continuously.
- Training jobs with checkpointing: If your training code saves progress every few minutes, getting interrupted isn't a big deal. You just resume from the last checkpoint on a new instance. This works especially well for long training runs where you're saving thousands of dollars (a minimal checkpoint-resume sketch follows this list).
- Batch data processing: ETL jobs, data analysis pipelines, and similar workloads are naturally fault-tolerant. They can pause mid-way through a dataset and pick up where they left off.
- Development and testing environments: Your dev environments don't need 99.9% uptime. If a spot instance gets terminated while you're testing model performance, you simply start another one.
- Burst computing needs: When you need to scale from 10 to 100 GPUs for a weekend project, spot instances let you access that capacity without long-term commitments.
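To make the checkpointing item concrete, here's a minimal PyTorch-flavored sketch of the save-and-resume pattern (the model, optimizer, and path are placeholders; any framework that can serialize state works the same way):

```python
import os

import torch

# In practice, write checkpoints to object storage that outlives the VM.
CHECKPOINT_PATH = "checkpoint.pt"

def save_checkpoint(model, optimizer, epoch):
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume from the last saved epoch, or start fresh if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1

# Toy training loop: an interruption now costs at most one epoch of work.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = load_checkpoint(model, optimizer)

for epoch in range(start_epoch, 100):
    # ... one epoch of training on your data ...
    save_checkpoint(model, optimizer, epoch)
```

Pair this with the interruption watcher shown earlier and the warning window is usually enough to flush one final checkpoint before the instance disappears.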
Now let’s see what you should avoid spot GPUs for:
- Real-time applications requiring guaranteed uptime: If you're serving live video processing or real-time recommendations where even a 30-second interruption affects users, stick with on-demand instances.
- Stateful services without backup strategies: If your application stores important state in memory and can't quickly save/restore that state, interruptions will cause data loss.
- Mission-critical production workloads: Your main customer-facing API probably shouldn't run on spot instances unless you have sophisticated failover systems in place.
- Long-running processes without checkpointing: If your job takes 48 hours to complete and can't save progress along the way, one interruption means starting over from scratch.
By now, you might be thinking:
"Spot GPUs sound great for my workloads, but managing all those interruptions, fallbacks, and multi-cloud complexity seems like a full-time job."
You're right, doing this manually would require a dedicated DevOps team. That's where Northflank comes in to handle all the heavy lifting automatically.
Let’s see some of the ways Northflank can help you in this case:
- Automatic spot instance management across your cloud accounts: Instead of monitoring AWS, Google Cloud, and Azure separately for the best spot prices in your own accounts, Northflank does this continuously through our Bring Your Own Cloud (BYOC) model. It automatically provisions your workloads on whichever cloud has the cheapest available capacity at that moment within your connected accounts.
- Seamless fallback options to prevent downtime costs: When your spot instance gets interrupted, Northflank immediately spins up a replacement, either another spot instance if available, or an on-demand instance if necessary. Your workloads keep running without you having to manually intervene at 2 AM.
- Perfect for inference and burst workloads: Remember those use cases we just discussed? Northflank is specifically designed for inference APIs that need to scale quickly and training jobs that can handle interruptions. It automatically routes traffic away from instances that are about to be terminated.
- Multi-cloud spot optimization to find the cheapest capacity: Rather than being locked into one cloud provider's pricing and availability, Northflank continuously finds the best deals across all major providers. If AWS spot prices spike in us-east-1, it might move your workloads to Google Cloud in us-central1 automatically.
- No manual quota management overhead: Remember those frustrating GPU approval processes we discussed? Northflank handles relationships with cloud providers directly, so you don't have to submit quota requests or wait weeks for approval to access spot capacity.
- Maximize savings while minimizing operational complexity: You get all the cost benefits of spot instances (60-90% savings), without needing a team of engineers to manage the complexity. Focus on building your AI applications while Northflank handles the infrastructure.
All of this might sound too good to be true, so let's look at a real-world example. Weights, an AI platform for voice cloning and content creation, faced the exact challenge you might be dealing with:
“How do you scale AI workloads to serve millions of users without burning through your budget or hiring an entire DevOps team?”
The challenge: Weights needed to scale from a local AI application to a platform serving millions of users, with just two engineers. They were bootstrapped, had limited cloud credits, and couldn't afford the typical infrastructure team that most Series B startups require for this kind of scale.
The solution: Instead of building their own spot instance management system or hiring DevOps engineers, they used Northflank's automated orchestration. This let them focus on building their AI product while Northflank handled all the infrastructure complexity behind the scenes.
The results speak for themselves:
- 250+ concurrent GPUs across 9 clusters: They're running more infrastructure than most well-funded startups, but managing it with a small team
- 10,000+ daily training jobs and 500,000+ daily inference runs: This is production-scale AI infrastructure that would typically require a full operations team
- Model loading time cut from 7 minutes to 55 seconds: Faster loading means less GPU time per job, which translates directly to cost savings when you're paying by the minute
- Cloud migration from weeks to hours: When they wanted to switch from Azure to GCP to use different cloud credits, it took an afternoon instead of weeks of engineering time
As JonLuca DeCaro, Weights' founder, puts it:
"We don't waste time or money on infrastructure, so we can focus on building product."
That's exactly what spot GPU orchestration should enable: more time building your AI applications, less time dealing with cloud infrastructure.
Here are some of the most common questions we see about using spot GPUs in production environments:
Q: How often do spot instances get interrupted? A: Interruption rates vary by region and instance type, typically ranging from 5-20% depending on demand. Popular GPU types in busy regions like us-east-1 see higher interruption rates, while less popular instances in quieter regions can run for days without interruption.
Q: How reliable are spot instances? A: With proper orchestration and fallback mechanisms, spot instances can be very reliable for production workloads. The key is designing your applications to handle interruptions gracefully and having automated systems that can quickly restart work elsewhere when needed.
Q: What are the drawbacks of spot instances? A: The main drawbacks are unpredictable interruptions (30 seconds to 2 minutes notice), complex management without proper tools, quota approval challenges from cloud providers, and unsuitability for workloads that can't handle brief downtime.
Q: Can spot instances be interrupted? A: Yes, spot instances can be interrupted at any time when cloud providers need the capacity for on-demand customers. You'll receive 30 seconds to 2 minutes notice depending on the provider, which is enough time to save work but requires automated systems for seamless recovery.
You've seen how spot GPUs can change your AI infrastructure costs, from the startup that scaled to millions of users with just two engineers, to the significant savings of accessing H100 instances at up to 90% off instead of paying full on-demand rates. The question isn't whether spot GPUs can save you money, but whether you want to handle the complexity of managing them yourself.
If you want the cost savings without the operational complexity, Northflank's automated orchestration handles all the heavy lifting. You get the same 60-90% discounts with automatic fallbacks, multi-cloud optimization, and no quota management overhead.
Try Northflank for free and see how much you can save on your next AI workload, or talk to an engineer who can show you how spot GPU orchestration fits into your infrastructure.