Deborah Emeni
Published 25th July 2025

Best GPUs for AI workloads (and how to run them on Northflank)

If you've worked with models like Qwen, DeepSeek, or LLaMA, you already know different workloads push your GPU in different ways. Some need high memory to even start, others only need something that won't slow down during inference.

That’s why I started looking into which GPUs people rely on for AI workloads, and how you can run them without spending thousands on hardware upfront.

Northflank makes that possible. You can run high-end GPUs like the H100, A100, or 4090, use your own cloud setup if you have one, and get all the tools to train, serve, and deploy your models in one place.

In this article, I’ll help you figure out which GPUs are best for different AI use cases, and how to start using them without owning the hardware yourself.

TL;DR: Best GPUs by use case + how to run them without owning hardware

If you're working on training, fine-tuning, or running inference on large models, VRAM usually matters more than clock speed: the more memory you have, the larger the models and batch sizes you can run before hitting limits (there's a rough sizing sketch after the list below).

What most people recommend:

  • Inference: Start with one or two A100/H100. You’ll get enough VRAM for 8B to 32B parameter models and sub‑second latency.
  • Fine‑tuning / PEFT: Scaling out to H100×8 or H200×8 offers extra FP16/TF32 horsepower for faster gradient sync.
  • Memory‑intensive jobs: B200×8 (144 GB each) is unbeatable for tasks involving very large models.
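
If you want a rough sense of whether a model fits in a given amount of VRAM, a common back-of-the-envelope estimate is parameters × bytes per parameter, plus runtime overhead. Here's a minimal sketch in Python; the overhead multiplier is an assumption for illustration, not a measured figure:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for serving a model.

    params_billions: model size, e.g. 8 for an 8B model
    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for 4-bit quantization
    overhead:        assumed multiplier for KV cache, activations, and runtime buffers
    """
    return params_billions * bytes_per_param * overhead

# Example: an 8B model in FP16 lands around 19 GB, so a single 40 GB A100 is comfortable;
# a 32B model in FP16 lands around 77 GB, which is why 80 GB H100s come up so often.
for size in (8, 32):
    print(f"{size}B model in FP16: ~{estimate_vram_gb(size):.0f} GB VRAM")
```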

What if you don’t want to buy hardware?

That’s where Northflank comes in. You can:

  1. Run A100, H100, H200, and B200 workloads
  2. Pay by the hour (starting at $2.74/hour for an H100, with spot pricing available)
  3. Train, serve, and deploy models (it's a full-stack platform, not just compute)
  4. Use your own cloud (BYOC) if you already have GPU infrastructure

With Northflank, you get the performance of high-end GPUs without the upfront cost, cooling setup, or constant maintenance. It's designed for teams and solo developers who prioritize speed, pricing, and flexibility.

Which GPU is best for your AI workload?

Not every GPU is built for the same kind of work. Some are perfect for fast, lightweight inference. Others are built to handle massive fine-tuning runs or full model training.

I'll give you a quick breakdown to help you match the right GPU to the job.

I've grouped them by workload type, so you can find what fits best based on what you're building or running:

1. Running inference (lightweight or low-latency)

If you're serving models in production or building an API that responds in real time, you’ll want something that balances speed, energy use, and cost.

  • A100×1 (40 GB) will serve 8B‑parameter LLMs at ~1,000 tokens/sec in FP16.
  • H100×1 (80 GB) boosts that to ~1,500 tokens/sec under optimized runtimes.
  • Scale to more cards when you need larger context windows (e.g., 32K tokens) or batch inference.
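
As a concrete example, here's a minimal sketch of serving an 8B-class model with the open-source vLLM runtime on a single A100 or H100. The model name, sampling settings, and tensor_parallel_size are illustrative assumptions; adjust them to your own deployment:

```python
from vllm import LLM, SamplingParams

# Load an 8B-class model in FP16; set tensor_parallel_size=2 to shard across two GPUs
# when you need longer context windows or larger batches.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16", tensor_parallel_size=1)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Batch inference: vLLM schedules these prompts together to keep the GPU busy.
prompts = [
    "Summarize the trade-offs between A100 and H100 for inference.",
    "Write a haiku about GPU memory.",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```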

2. Fine-tuning smaller models (LoRA, adapters, PEFT)

When you’re customizing open-source models or experimenting with parameter-efficient tuning, memory becomes a bigger factor, particularly for attention-heavy models.

  • A100×8 (40 GB) gives you 320 GB aggregate VRAM.
  • H100×8 extends that to 640 GB of aggregate VRAM, linked with NVLink.
  • H200×8 doubles NVLink bandwidth and adds tensor‑core improvements, slashing sync overhead.

The right option depends on the size of the base model you're fine-tuning.
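
To make that concrete, here's a minimal LoRA setup using Hugging Face's transformers and peft libraries. The base model, target modules, and hyperparameters are illustrative assumptions; the point is that only the small adapter weights are trained, which is what keeps the VRAM footprint manageable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed base model, for illustration only

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,  # half precision keeps the frozen base weights small
    device_map="auto",           # spread layers across the available GPUs
)

# LoRA: train small low-rank adapters on the attention projections instead of the full weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
```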

3. Full‑model pretraining (8B parameters)

Training an 8B‑parameter transformer on ~15 trillion tokens requires roughly 200,000 GPU‑hours (H100‑equivalent). If you cap out at eight GPUs, you'd see:

| Configuration | GPU‑Hours | Wall‑Clock Time (continuous) | Notes |
| --- | --- | --- | --- |
| H100×8 | 200,000 | ~25,000 h (~2.85 years) | Full training at this scale is multi‑year work |
| A100×8 | 200,000 | ~25,000 h (~2.85 years) | Similar performance in FP16/TF32 |
| H200×8 | ~160,000¹ | ~20,000 h (~2.3 years) | Faster tensor cores shorten time by ~20% |
| B200×8 | 200,000² | ~25,000 h (~2.85 years) | Massive 144 GB VRAM, but ~80% of H100's FLOPS |

¹ H200's tensor cores are ~1.25× faster than H100 in FP16/TF32.

² B200 is memory‑optimized (Grace Blackwell), ideal for huge batches.
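
The wall-clock estimates in the table come straight from dividing total GPU-hours by the number of GPUs running in parallel. A quick sketch of that arithmetic:

```python
def wall_clock_years(total_gpu_hours: float, num_gpus: int) -> float:
    """Continuous wall-clock time, in years, assuming perfect scaling across GPUs."""
    hours = total_gpu_hours / num_gpus
    return hours / (24 * 365)

# ~200,000 H100-hours on 8 GPUs ≈ 25,000 hours ≈ 2.85 years of continuous training.
print(f"{wall_clock_years(200_000, 8):.2f} years")  # -> ~2.85
# H200's faster tensor cores cut the total to ~160,000 GPU-hours, or roughly 2.3 years.
print(f"{wall_clock_years(160_000, 8):.2f} years")  # -> ~2.28
```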

Key takeaway: If you need full pretraining, eight GPUs (even B200) mean multi‑year runs. Most teams either pretrain at hyperscale or fine‑tune existing checkpoints.

Match open-source models to GPU needs (cheat sheet)

By now, you've seen that different workloads need different kinds of GPUs. But what about specific models? If you're working with popular OSS projects like Qwen, DeepSeek, or Stable Diffusion, here's a quick cheat sheet to help you choose the right setup.

| Model | Task | Northflank GPU | Why It Fits |
| --- | --- | --- | --- |
| Qwen 1.5 7B | Inference | A100×1–2, H100×1 | Sub‑second responses, fits in 80 GB VRAM |
| DeepSeek Coder 6.7B | Fine‑tuning | A100×4–8, H100×4–8 | Perfect for LoRA and adapter workflows |
| LLaMA 3 8B | All stages | A100×2 for inference, ×4–8 for tuning, ×8 for small‑scale training | Flexible across tasks |
| Mixtral 8×7B | Fine‑tuning | H100×4–8, H200×8 | Handles MoE gating and memory spikes |
| Stable Diffusion XL | Inference / fine‑tuning | A100×2, H100×2, H200×8 | Large image batches and fast sampling |
| Whisper | Streaming inference | A100×1 | Low‑latency audio streams |

Why Northflank is built for GPU workloads beyond basic access

Once you've matched your model to the right GPU, the next step is getting access and running your workloads without extra setup. That's where Northflank stands out: it goes beyond raw GPU access, offering a full environment designed around AI workloads.

You can:

  • Run the same project with or without a GPU. Whether you're deploying a CPU-based API or a GPU-heavy training job, you use the same setup and the same platform.
  • Access A100, H100, H200, and B200 directly, and switch between them as your workloads grow.
  • Bring your own infrastructure, whether that's from providers like CoreWeave and Lambda Labs or your own on-prem hardware.
  • Tap into spot GPUs with automatic fallback, so your jobs don’t fail when spot capacity runs out.
  • Provision in under 30 minutes, whether it's a single-node API or a multi-node distributed job.
  • Scale up and down automatically, with cost tracking and resource isolation already built in.
  • Use ready-to-go templates for Jupyter, Qwen, LLaMA, and others, including GitOps support.

In a nutshell, Northflank goes beyond providing compute: it gives you a full environment to build, train, fine-tune, and serve models without switching tools.

Platform comparison: Northflank vs other AI infrastructure tools

While Northflank gives you full-stack GPU environments, it’s also useful to compare it with other AI infrastructure platforms.

Most tools focus on giving you GPU access alone, but if you also care about things like autoscaling APIs, managed databases, or bringing your own cloud, the differences become clear.

Here's a quick comparison:

| Feature | Northflank | Modal | Baseten | Together AI |
| --- | --- | --- | --- | --- |
| GPU access (A100, H100, L4, etc.) | Full range of GPU options, including cloud and BYOC | Serverless GPU jobs | GPU access only | GPU clusters with H100, GB200 |
| Microservices & APIs | Built-in support | Basic API runtimes only | Not supported | Managed API endpoints |
| Databases (Postgres, Redis) | Integrated managed services | Not available | Not available | No full-service DB support |
| BYOC support | Full self-service BYOC across AWS, GCP, Azure | Not supported unless enterprise | No | Enterprise-only option |
| Secure multi-tenancy | Strong isolation and RBAC support | Limited sandboxing | Unknown | Limited visibility |
| Jobs & Jupyter support | Background jobs, scheduled tasks, notebooks | Jupyter + batch jobs only | Not supported | Jupyter and endpoints only |
| CI/CD & Git-native workflows | Git-based pipelines, preview environments | Minimal integration | Not integrated | Basic workflow support |

Common questions about AI GPUs

Once you've seen how different platforms compare, you might still have a few lingering questions, particularly if you're choosing a GPU for the first time or running into VRAM bottlenecks. Here's a quick FAQ covering the most common questions developers ask.

  1. Which GPUs are best for AI?

    It depends on your workload. Use L4 for inference, A100 for fine-tuning, and H100 for large-scale model training.

  2. What is the most powerful AI GPU?

    NVIDIA's GB200 Grace Blackwell, designed for massive AI training and inference at scale, is currently the most powerful.

  3. Do you need a powerful GPU for AI?

    Only if you're training or fine-tuning large models. For inference, GPUs like L4 or A10G are usually enough.

  4. Which GPU is better for AI, NVIDIA or AMD?

    NVIDIA is preferred because of its CUDA ecosystem and better support across AI frameworks like PyTorch and TensorFlow.

  5. Which GPU is best for Stable Diffusion?

    A100 or RTX 4090. Models like SDXL and DreamBooth benefit from having at least 24GB of VRAM.

  6. How much RAM do you need for AI?

    It depends on the model size. A general rule is to have 3–4× the model’s parameter size in RAM to account for training overhead. For many use cases, 32GB of system RAM and 16–24GB of GPU VRAM is a good starting point.

  7. Why is NVIDIA best for AI?

    Tools like CUDA, cuDNN, and widespread framework support make it the default choice for most AI workloads.

  8. Does AI need CPUs or GPUs?

    Both. CPUs handle orchestration and I/O; GPUs handle model training and inference.

  9. What is the minimum GPU for deep learning?

    At least 16GB of VRAM. A10G or L4 GPUs are a practical starting point for small to medium workloads.

  10. Is 8GB of VRAM enough for deep learning?

    It can handle small models or inference jobs, but you’ll likely hit memory limits during training.

  11. Do AI companies use GPUs?

    Yes, most AI companies rely on GPUs for both training and inference. Northflank supports running these across multiple cloud providers or on your own infrastructure.

Start running GPU workloads without the usual complexity

Remember that you don’t always need to manage your own hardware to train, fine-tune, or serve models at scale. If you're working with models like Qwen, DeepSeek, LLaMA, or Stable Diffusion, Northflank gives you an easier way to get started.

See what you can do:

  • Deploy in minutes (no local setup or manual provisioning)
  • Scale across clouds, use spot GPUs, or bring your own infrastructure
  • Run everything from CI pipelines to APIs, databases, notebooks, and AI jobs

Start building with GPUs on Northflank
