

Best GPUs for AI workloads (and how to run them on Northflank)
If you've worked with models like Qwen, DeepSeek, or LLaMA, you already know different workloads push your GPU in different ways. Some need a lot of memory just to load; others only need something that won't slow down during inference.
That’s why I started looking into which GPUs people rely on for AI workloads, and how you can run them without spending thousands on hardware upfront.
Northflank makes that possible. You can run high-end GPUs like the H100, A100, or 4090, use your own cloud setup if you have one, and get all the tools to train, serve, and deploy your models in one place.
In this article, I’ll help you figure out which GPUs are best for different AI use cases, and how to start using them without owning the hardware yourself.
If you're training, fine-tuning, or running inference on large models, you'll quickly find that VRAM tends to matter more than clock speed. The more memory you have, the larger the models and batches you can run without hitting limits.
What most people recommend:
- Inference: Start with one or two A100/H100. You’ll get enough VRAM for 8B to 32B parameter models and sub‑second latency.
- Fine‑tuning / PEFT: Scaling out to H100×8 or H200×8 offers extra FP16/TF32 horsepower for faster gradient sync.
- Memory‑intensive jobs: B200×8 (144 GB each) is unbeatable for tasks involving very large models (a rough VRAM rule of thumb follows this list).
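To turn those recommendations into numbers, here's a rough, illustrative way to estimate how much VRAM an FP16 model needs for inference. The constants are common rules of thumb, not exact figures; real usage depends on batch size, context length, and the runtime you use.

```python
# Rough VRAM estimate for serving a model in FP16 (rule of thumb only).
def estimate_inference_vram_gb(params_billions: float,
                               bytes_per_param: int = 2,
                               overhead_factor: float = 1.2) -> float:
    """Weights plus ~20% headroom for activations and KV cache."""
    weights_gb = params_billions * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * overhead_factor

for size in (8, 32, 70):
    print(f"{size}B params -> ~{estimate_inference_vram_gb(size):.0f} GB VRAM (FP16)")
```

By this estimate, an 8B model fits comfortably on a single 40 GB A100, a 32B model wants an 80 GB card, and a 70B model needs multiple GPUs or quantization.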
What if you don’t want to buy hardware?
That’s where Northflank comes in. You can:
- Run A100, H100, H200, and B200 workloads
- Pay by the hour (starting at $2.74/hour for an H100, with spot pricing available)
- Train, serve, and deploy models (it's a full-stack platform, not just raw compute)
- Use your own cloud (BYOC) if you already have GPU infrastructure
With Northflank, you get the performance of high-end GPUs without the upfront cost, cooling setup, or constant maintenance. It's designed for teams and solo developers who prioritize speed, pricing, and flexibility.
Not every GPU is built for the same kind of work. Some are perfect for fast, lightweight inference. Others are built to handle massive fine-tuning runs or full model training.
I'll give you a quick breakdown to help you match the right GPU to the job.
I've grouped them by workload type, so you can find what fits best based on what you're building or running:
If you're serving models in production or building an API that responds in real time, you’ll want something that balances speed, energy use, and cost.
- A100×1 (40 GB) will serve 8B‑parameter LLMs at ~1,000 tokens/sec in FP16.
- H100×1 (80 GB) boosts that to ~1,500 tokens/sec under optimized runtimes.
- Scale to more cards when you need larger context windows (e.g., 32K tokens) or batch inference; a minimal single-GPU serving sketch follows this list.
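Here's a minimal single-GPU serving sketch using vLLM. The model name, sampling settings, and prompt are illustrative placeholders, not tied to any specific benchmark above.

```python
# Minimal single-GPU inference sketch with vLLM (illustrative; adjust the
# model and settings to your own deployment).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-7B-Chat",   # a ~7-8B model fits comfortably in 40-80 GB at FP16
    dtype="float16",
    tensor_parallel_size=1,          # raise this when you add more cards
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Raising tensor_parallel_size is the usual first step when you scale out for longer context windows or batch inference.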
When you’re customizing open-source models or experimenting with parameter-efficient tuning, memory becomes a bigger factor, particularly for attention-heavy models.
- A100×8 (40 GB) gives you 320 GB aggregate VRAM.
- H100×8 extends that to 640 GB with NVLink.
- H200×8 doubles NVLink bandwidth and adds tensor‑core improvements, slashing sync overhead.
The right option depends on the size of the base model you're fine-tuning; a minimal LoRA setup is sketched below.
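For context, a typical parameter-efficient setup looks like the sketch below, using Hugging Face PEFT. The base model and LoRA hyperparameters are illustrative assumptions, not a tuned recipe.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",              # shards the model across available GPUs
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```

Even though only the LoRA adapters are trained, most of the VRAM still goes to holding the frozen base model and optimizer state, which is why aggregate memory across cards is the number that matters here.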
Training an 8B‑parameter transformer on ~15 trillion tokens requires ~200,000 GPU‑hours (H100‑equivalent). Capping out at eight GPUs, you’d see:
Configuration | GPU‑Hours | Wall‑Clock Time (continuous) | Notes |
---|---|---|---|
H100×8 | 200,000 | ~25,000 h (~2.85 years) | Full training at this scale is multi‑year work |
A100×8 | 200,000 | ~25,000 h (~2.85 years) | Similar performance in FP16/TF32 |
H200×8 | ~160,000¹ | ~20,000 h (~2.3 years) | Faster tensor cores shorten time by ~20% |
B200×8 | 200,000² | ~25,000 h (~2.85 years) | Massive 144 GB VRAM, but ~80% of H100’s FLOPS |
¹ H200’s tensor cores are ~1.25× faster than H100 in FP16/TF32.
² B200 is memory‑optimized (Blackwell architecture), ideal for huge batches.
Key takeaway: If you need full pretraining, eight GPUs (even B200) mean multi‑year runs. Most teams either pretrain at hyperscale or fine‑tune existing checkpoints.
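The wall-clock figures in the table are just GPU-hours divided by the number of cards; here is the arithmetic as a quick sketch, using the same rough estimates as above.

```python
# Back-of-the-envelope wall-clock estimate from the table above (illustrative only).
gpu_hours = 200_000        # ~H100-equivalent hours for an 8B model on ~15T tokens
num_gpus = 8

wall_clock_hours = gpu_hours / num_gpus          # 25,000 h
wall_clock_years = wall_clock_hours / (24 * 365)

print(f"{wall_clock_hours:,.0f} hours ≈ {wall_clock_years:.2f} years of continuous training")
# 25,000 hours ≈ 2.85 years, which is why full pretraining on 8 GPUs is rarely practical.
```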
By now, you’ve seen that different workloads need different kinds of GPUs. But what about specific models? If you’re working with popular OSS projects like Qwen, DeepSeek, or Stable Diffusion, here’s a quick cheat sheet to help you choose the right setup (with a worked example after the table).
Model | Task | Northflank GPU | Why It Fits |
---|---|---|---|
Qwen 1.5 7B | Inference | A100×1–2, H100×1 | Sub‑second responses, fits in 80 GB VRAM |
DeepSeek Coder 6.7B | Fine‑tuning | A100×4–8, H100×4–8 | Perfect for LoRA and adapter workflows |
LLaMA 3 8B | All stages | A100×2 for inference, ×4–8 for fine‑tuning, ×8 for small‑scale training | Flexible across tasks |
Mixtral 8×7B | Fine‑tuning | H100×4–8, H200×8 | Handles MoE gating and memory spikes |
Stable Diffusion XL | Inference/FT | A100×2, H100×2, H200×8 | Large image batches and fast sampling |
Whisper | Streaming inference | A100×1 | Low‑latency audio streams |
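As a worked example for one row of the table, here is a minimal SDXL inference sketch using the diffusers library on a single GPU. The prompt and settings are illustrative assumptions.

```python
# Single-GPU Stable Diffusion XL inference sketch (illustrative settings).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,       # halves VRAM usage versus FP32
).to("cuda")

image = pipe(
    "an isometric illustration of a GPU data center, soft lighting",
    num_inference_steps=30,
).images[0]
image.save("sample.png")
```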
Once you’ve matched your model to the right GPU, the next step is getting access and running your workloads without extra setup. That’s where Northflank stands out: it goes beyond GPU access, offering a full environment designed around AI workloads.
You can:
- Run the same project with or without a GPU. Whether you're deploying a CPU-based API or a GPU-heavy training job, it's the same setup and the same platform (see the device-selection sketch after this list).
- Access A100, H100, H200, and B200 directly, and switch between them as your workloads grow.
- Bring your own infrastructure, whether from providers like CoreWeave and Lambda Labs or your on-prem hardware.
- Tap into spot GPUs with automatic fallback, so your jobs don’t fail when spot capacity runs out.
- Provision in under 30 minutes, whether it's a single-node API or a multi-node distributed job.
- Scale up and down automatically, with cost tracking and resource isolation already built in.
- Use ready-to-go templates for Jupyter, Qwen, LLaMA, and others, including GitOps support.
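As a small illustration of the first point, device selection in plain PyTorch is all it takes for the same service code to run with or without a GPU; nothing here is Northflank-specific.

```python
# The same service code runs on CPU or GPU; pick the device at startup.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 2).to(device)   # stand-in for your real model

def predict(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model(x.to(device))

print(f"Serving on {device}")
```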
In a nutshell, Northflank goes beyond providing compute: it gives you a full environment to build, train, fine-tune, and serve models without switching tools.
While Northflank gives you full-stack GPU environments, it’s also useful to compare it with other AI infrastructure platforms.
Most tools focus on giving you GPU access alone, but if you also care about things like autoscaling APIs, managed databases, or bringing your own cloud, the differences become clear.
Here’s a quick comparison:
Feature | Northflank | Modal | Baseten | Together AI |
---|---|---|---|---|
GPU access (A100, H100, L4, etc.) | Full range of GPU options, including cloud and BYOC | Serverless GPU jobs | GPU access only | GPU clusters with H100, GB200 |
Microservices & APIs | Built-in support | Basic API runtimes only | Not supported | Managed API endpoints |
Databases (Postgres, Redis) | Integrated managed services | Not available | Not available | No full-service DB support |
BYOC support | Full self-service BYOC across AWS, GCP, Azure | Not supported unless enterprise | No | Enterprise-only option |
Secure multi-tenancy | Strong isolation and RBAC support | Limited sandboxing | Unknown | Limited visibility |
Jobs & Jupyter support | Background jobs, scheduled tasks, notebooks | Jupyter + batch jobs only | Not supported | Jupyter and endpoints only |
CI/CD & Git-native workflows | Git-based pipelines, preview environments | Minimal integration | Not integrated | Basic workflow support |
Once you’ve seen how different platforms compare, you might still have a few lingering questions, particularly if you're choosing a GPU for the first time or running into VRAM bottlenecks. Here’s a quick FAQ covering the most common questions developers ask.
- Which GPUs are best for AI? It depends on your workload. Use L4 for inference, A100 for fine-tuning, and H100 for large-scale model training.
- What is the most powerful AI GPU? NVIDIA's GB200 Grace Blackwell, designed for massive AI training and inference at scale, is currently the most powerful.
- Do you need a powerful GPU for AI? Only if you're training or fine-tuning large models. For inference, GPUs like L4 or A10G are usually enough.
- Which GPU is better for AI, NVIDIA or AMD? NVIDIA is preferred because of its CUDA ecosystem and better support across AI frameworks like PyTorch and TensorFlow.
- Which GPU is best for Stable Diffusion? A100 or RTX 4090. Models like SDXL and DreamBooth benefit from having at least 24GB of VRAM.
- How much RAM do you need for AI? It depends on the model size. A general rule is to have 3–4× the model’s parameter size in RAM to account for training overhead. For many use cases, 32GB of system RAM and 16–24GB of GPU VRAM is a good starting point.
- Why is NVIDIA best for AI? Tools like CUDA, cuDNN, and widespread framework support make it the default choice for most AI workloads.
- Does AI need CPUs or GPUs? Both. CPUs handle orchestration and I/O; GPUs handle model training and inference.
- What is the minimum GPU for deep learning? At least 16GB of VRAM. A10G or L4 GPUs are a practical starting point for small to medium workloads.
- Is 8GB of VRAM enough for deep learning? It can handle small models or inference jobs, but you’ll likely hit memory limits during training.
- Do AI companies use GPUs? Yes, most AI companies rely on GPUs for both training and inference. Northflank supports running these across multiple cloud providers or on your own infrastructure.
Remember that you don’t always need to manage your own hardware to train, fine-tune, or serve models at scale. If you're working with models like Qwen, DeepSeek, LLaMA, or Stable Diffusion, Northflank gives you an easier way to get started.
See what you can do:
- Deploy in minutes (no local setup or manual provisioning)
- Scale across clouds, use spot GPUs, or bring your own infrastructure
- Run everything from CI pipelines to APIs, databases, notebooks, and AI jobs