Deborah Emeni
Published 10th September 2025

How to run AI workloads on cloud GPUs (without buying hardware)

If you've worked with models like Qwen, DeepSeek, or LLaMA, you know different workloads push your GPU in different ways. Some need high VRAM just to load the model; others need steady throughput so inference doesn't slow down under load.

The challenge is this:

Getting access to the right GPU for your specific workload without spending thousands upfront on hardware, cooling, and maintenance.

That's where cloud GPU platforms like Northflank come in. You can run enterprise-grade GPUs, use your own cloud setup if you have one, and get all the tools to train, serve, and deploy your models in one place.

In this guide, I'll show you how to match your AI workload to the right cloud GPU setup and get started without owning the hardware yourself.

TL;DR: Match your AI workload to cloud GPUs + start in minutes

The key insight most developers learn quickly: VRAM matters more than clock speed for AI workloads. If a model's weights, activations, and KV cache don't fit in memory, the job fails or slows to a crawl, and no amount of extra compute makes up for it.

Quick workload matching:

  • Inference: A100 or H100 for 8B-32B parameter models with sub-second latency
  • Fine-tuning/PEFT: H100×8 or H200×8 for faster gradient sync
  • Memory-intensive jobs: B200×8 for large models requiring massive memory

Why cloud GPUs make sense:

  1. Access A100, H100, H200, and B200 on demand
  2. Pay hourly (starting at $2.74/hour for H100)
  3. Full platform for training, serving, and deployment
  4. Bring your own cloud (BYOC) if you have existing infrastructure

With Northflank, you get enterprise GPU performance without upfront costs, cooling setup, or maintenance. Built for teams and solo developers who need speed, flexibility, and cost control.

Match your AI workload to the right cloud GPU

Not every workload needs the same type of GPU. Here's how to choose the right cloud GPU configuration for what you're building:

Running inference (production APIs, real-time responses)

For serving models in production or building APIs that need real-time responses (a minimal serving sketch follows this list):

  • A100×1 (40GB): Serves 8B-parameter LLMs at approx 1,000 tokens/sec in FP16
  • H100×1 (80GB): Boosts performance to approx 1,500 tokens/sec with optimized runtimes
  • Scale up: Add more cards for larger context windows (32K tokens) or batch inference
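
To make this concrete, here's a minimal single-GPU inference sketch using vLLM, one popular serving runtime. The library choice and model name are illustrative assumptions, not Northflank requirements:

```python
# Minimal single-GPU inference sketch with vLLM (illustrative).
# Assumes the model fits in one A100/H100's VRAM in FP16.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Batching prompts together makes better use of the GPU.
outputs = llm.generate(["Summarize KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For a production API you'd typically run vLLM's OpenAI-compatible server instead of calling the library directly, but the memory math is the same.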

Fine-tuning and PEFT (LoRA, adapters, customization)

When customizing open-source models or experimenting with parameter-efficient tuning (a LoRA sketch follows this list):

  • A100×8 (40GB): 320GB aggregate VRAM for medium models
  • H100×8: 640GB with NVLink for larger base models
  • H200×8: Enhanced tensor cores and bandwidth for reduced sync overhead
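
To show what the PEFT workflow actually looks like, here's a minimal LoRA setup using Hugging Face's peft library. The library, model name, and hyperparameters are illustrative assumptions:

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")

# Train small low-rank adapters on the attention projections
# instead of updating all 6.7B base weights.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```

Because only the adapter weights take gradients, the optimizer state stays tiny; most of the multi-GPU VRAM above goes to the frozen base model and activations.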

Full model training (when you need to train from scratch)

Training large models requires significant compute time. For an 8B-parameter transformer:

| Configuration | Estimated time | Best for |
| --- | --- | --- |
| H100×8 | approx 2.85 years continuous | Research projects |
| H200×8 | approx 2.3 years continuous | 20% faster with improved tensor cores |
| B200×8 | approx 2.85 years continuous | Memory-intensive large-batch training |

Reality check: Most teams fine-tune existing checkpoints rather than training from scratch due to time and cost requirements.
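
Where do estimates like these come from? A common back-of-envelope heuristic is that training costs roughly 6 FLOPs per parameter per token. The token count, peak throughput, and utilization below are assumptions chosen to land near the table's H100×8 figure; real runs vary widely:

```python
# Back-of-envelope training time via the ~6 * params * tokens FLOPs rule.
# All inputs are assumptions for illustration; real runs vary widely.
params = 8e9              # 8B-parameter transformer
tokens = 6e12             # assumed training tokens
flops_needed = 6 * params * tokens

h100_bf16 = 989e12        # per-GPU dense BF16 peak, FLOP/s
mfu = 0.40                # assumed model FLOPs utilization
cluster_flops = 8 * h100_bf16 * mfu

years = flops_needed / cluster_flops / (365 * 24 * 3600)
print(f"~{years:.2f} years on H100×8")  # ~2.9 years with these assumptions
```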

If you're working with specific open-source models, here's how to match them to cloud GPU setups:

| Model | Task | Recommended setup | Why this works |
| --- | --- | --- | --- |
| Qwen 1.5 7B | Inference | A100×1-2, H100×1 | Fits in 80GB VRAM, sub-second responses |
| DeepSeek Coder 6.7B | Fine-tuning | A100×4-8, H100×4-8 | Perfect for LoRA and adapter workflows |
| LLaMA 3 8B | All stages | A100×2 (inference), 4-8 (tuning) | Flexible across different tasks |
| Mixtral 8×7B | Fine-tuning | H100×4-8, H200×8 | Handles MoE gating and memory spikes |
| Stable Diffusion XL | Inference/fine-tuning | A100×2, H100×2 | Large image batches, fast sampling |
| Whisper | Real-time inference | A100×1 | Low-latency audio processing |
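
A quick way to sanity-check the "Why this works" column: in FP16/BF16, weights take about 2 bytes per parameter, plus headroom for the KV cache and activations. The 20% overhead factor below is a rough assumption:

```python
# Rough serving-VRAM estimate: 2 bytes/param in FP16 plus an assumed ~20%
# headroom for KV cache, activations, and runtime buffers.
def serving_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    weights_gb = params_billion * bytes_per_param
    return weights_gb * 1.2

for name, size_b in [("Qwen 1.5 7B", 7), ("LLaMA 3 8B", 8), ("Mixtral 8x7B (~47B total)", 47)]:
    print(f"{name}: ~{serving_vram_gb(size_b):.0f} GB")
# Qwen and LLaMA fit on one 40-80GB card; Mixtral's ~113 GB needs multiple GPUs.
```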

Getting started with Northflank for AI workloads

Northflank goes beyond raw GPU access: it's a complete platform designed for AI development workflows.

Immediate access:

  • Deploy GPU workloads in under 30 minutes
  • Switch between A100, H100, H200, and B200 as needs change
  • Access through web interface, CLI, or API

Cost optimization:

  • Hourly pricing with spot GPU options
  • Automatic scaling up and down
  • Resource isolation and usage tracking
  • Hibernation for long-running jobs

Full development environment:

  • Integrated databases (Postgres, Redis)
  • CI/CD pipelines with Git integration
  • Jupyter notebooks and development tools
  • Templates for popular frameworks (PyTorch, TensorFlow); a quick GPU sanity check follows this list
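
Once a workload is deployed, a quick check from any of those tools confirms the GPU is actually visible to your framework (PyTorch shown here; the idea is the same in any framework):

```python
# Quick check that the container actually sees the GPU it was allocated.
import torch

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print(torch.cuda.get_device_name(0))           # e.g. "NVIDIA H100 80GB HBM3"
free, total = torch.cuda.mem_get_info()
print(f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```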

Flexibility options:

  • Use Northflank's managed cloud
  • Bring your own cloud (AWS, GCP, Azure)
  • Connect existing GPU infrastructure
  • Automatic fallback when spot capacity runs out

Platform comparison: Why Northflank for AI workloads

While many platforms offer GPU access, Northflank provides a complete development environment:

| Need | Northflank solution | Alternative platforms |
| --- | --- | --- |
| Quick GPU access | A100, H100, H200, B200 on demand | Most provide basic GPU access |
| Development tools | Integrated Jupyter, databases, APIs | Usually requires separate services |
| Cost control | Spot pricing, auto-scaling, hibernation | Limited cost optimization |
| Your own infrastructure | Full BYOC across all major clouds | Enterprise-only or not available |
| Production deployment | Built-in CI/CD, monitoring, scaling | Requires additional tooling |

Common questions about running AI on cloud GPUs

  1. How much does it cost to run AI workloads on cloud GPUs?

    Starting at $2.74/hour for H100 access, with spot pricing available for additional savings. You only pay for actual usage.

  2. Can I bring my own cloud infrastructure?

    Yes, Northflank supports BYOC across AWS, GCP, and Azure, letting you use existing credits or infrastructure while getting the platform benefits.

  3. What if I need to scale beyond single GPUs?

    Northflank handles multi-GPU setups automatically, with NVLink support for high-bandwidth communication between GPUs.

  4. How quickly can I get started?

    Most workloads can be deployed within 30 minutes, including environment setup and initial model deployment.

  5. Do I need to manage infrastructure?

    No, Northflank handles provisioning, scaling, monitoring, and maintenance automatically.

Start running your AI workloads today

Instead of waiting weeks for hardware procurement or dealing with setup complexity, you can start developing with enterprise-grade GPUs immediately.

Get started with Northflank:

  • Choose your GPU type based on your workload
  • Deploy using templates or bring your existing code
  • Scale automatically as your needs grow
  • Pay only for what you use

Whether you're fine-tuning your first model or deploying production AI services, Northflank gives you the infrastructure you need without the operational overhead.

Start building with GPUs on Northflank →
