Arjun Narula
Published 25th June 2025

An engineer’s guide to open source AI models

Open source AI models give you cost-effective alternatives to proprietary solutions with full control over your stack.

From Llama 4 for chat to Whisper for speech, these models offer enterprise-grade capabilities without vendor lock-in.

The challenge? Moving from notebooks to production requires proper infrastructure with autoscaling, APIs, and observability.

Northflank simplifies this with container-based deployment, built-in CI/CD, and GPU support, letting small teams scale to millions of users without a dedicated DevOps team.

⏳ TL;DR

Open source AI models are downloadable ML models with open weights and code. You can run, fine-tune, and deploy them on your own infra, with no vendor lock-in or per-token pricing.

Popular types:

  • LLMs: Llama 4, DeepSeek-V3, Phi-3 Mini (chat, reasoning)
  • Speech: Whisper, XTTS-v2 (transcription, voice cloning)
  • Video: AnimateDiff, CogVideoX (image animation, text-to-video)
  • Multimodal: Llama 4 Maverick (text + image)

Here's how to self-host DeepSeek.

Why use them: Full control, lower cost, better data privacy, custom tuning.

What matters:

  • License (check for commercial use)
  • Model size vs. performance (3B–70B is the sweet spot)
  • Hardware needs (GPU VRAM)
  • Ecosystem (vLLM, HF Transformers)

Deploying is the hard part.

You need autoscaling, APIs, GPU orchestration, CI/CD, observability.

Northflank handles it:

Deploy open source models in containers with built-in CI/CD, GPU support, BYOC, and full observability. Run LLMs, APIs, schedulers, and vector DBs, all on one platform. No infra team required.

What are open source AI models?

Open source AI models are machine learning models whose weights, architecture, and often training code are freely available for use, modification, and distribution. Unlike proprietary models locked behind APIs, these models can be downloaded, fine-tuned, and deployed on your own infrastructure.

Key benefits of open source AI models:

  • Cost control: No per-token pricing or usage limits beyond your infrastructure costs
  • Data sovereignty: Your data never leaves your systems
  • Customization freedom: Fine-tune models for your specific use cases
  • No vendor lock-in: Switch providers or go fully self-hosted anytime
  • Transparency: Full visibility into model architecture and training procedures

Types of open source AI models:

  • Large Language Models (LLMs): Text generation, chat, and reasoning
  • Speech models: Text-to-speech synthesis and speech recognition
  • Vision models: Image generation, analysis, and processing
  • Video models: Video generation and editing capabilities
  • Multimodal models: Combined text, image, and audio understanding

What to look for when choosing open source AI models

Selecting the right open source model requires evaluating several critical factors beyond just performance benchmarks.

License considerations are paramount. Models like Llama 3.3 use custom licenses that allow commercial use but impose restrictions on very large-scale services. MIT and Apache 2.0 licensed models offer more permissive terms. Always verify license compatibility with your intended use case.

Hardware requirements directly impact your deployment costs. A 7B parameter model might run efficiently on consumer GPUs, while 70B+ models require enterprise hardware or multi-GPU setups. Consider memory requirements, inference speed, and whether the model supports optimization techniques like quantization.
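
To make the hardware point concrete, here is a minimal sketch of loading a 7B model in 4-bit precision with Hugging Face Transformers and bitsandbytes so it fits on a consumer GPU. The model ID is illustrative, and actual memory use depends on context length and batch size.

```python
# Sketch: load a 7B model in 4-bit so it fits on a consumer GPU.
# Assumes transformers, accelerate, and bitsandbytes are installed;
# the model ID is illustrative - swap in whichever model you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the available GPU(s)
)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```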

Ecosystem support determines how quickly you can move from experimentation to production. Models with strong community backing typically offer better documentation, deployment tools, and troubleshooting resources. Integration with frameworks like Hugging Face Transformers, vLLM, or TensorRT can significantly accelerate development.
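
As an illustration of that ecosystem leverage, serving a model with vLLM's offline inference API takes only a few lines. This is a minimal sketch; the model ID is an assumption rather than a recommendation.

```python
# Minimal vLLM offline-inference sketch. Assumes the vllm package is
# installed and a GPU is available; the model ID is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize why open weights matter."], params)
for out in outputs:
    print(out.outputs[0].text)
```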

Model size versus performance represents the classic tradeoff in AI deployment. Larger models generally provide better quality but at higher computational and latency costs. The key is finding the smallest model that meets your quality requirements.
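
A quick back-of-envelope calculation helps frame that tradeoff. The heuristic below (weights-only memory plus rough headroom) is an assumption, not a precise sizing tool; real usage also depends on context length, batch size, and KV cache.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0, headroom: float = 1.2) -> float:
    """Rough VRAM needed to hold model weights (fp16 = 2 bytes/param)
    with ~20% headroom for activations and KV cache."""
    return params_billion * bytes_per_param * headroom

print(estimate_vram_gb(7))        # ~16.8 GB for a 7B model in fp16
print(estimate_vram_gb(7, 0.5))   # ~4.2 GB with 4-bit quantization
print(estimate_vram_gb(70))       # ~168 GB: multi-GPU territory
```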

Best open source AI models by category

Below is a breakdown of the best open source AI models, organized by category.

Large Language Models (LLMs)

Llama 4 Scout and Maverick represent the cutting edge of open source multimodal AI. Meta's latest release introduces Scout (17B active parameters, 109B total) and Maverick (17B active parameters, 400B total), both built on a mixture-of-experts architecture. Llama 4 Scout dramatically increases the supported context length from 128K tokens in Llama 3 to an industry-leading 10 million tokens, while Llama 4 Maverick exceeds comparable models like GPT-4o and Gemini 2.0 on coding, reasoning, multilingual, long-context, and image benchmarks. Both models feature native multimodality with early fusion, seamlessly integrating text and vision capabilities.

DeepSeek-V3 represents the pinnacle of open source language modeling. It is a 671B-parameter mixture-of-experts LLM (roughly 37B parameters active per token) that rivals closed-source heavyweights like Claude 3.5 Sonnet and GPT-4o. While resource-intensive, it delivers frontier-level performance for applications requiring maximum capability.

Phi-3 Mini excels in resource-constrained environments. Phi-3 Mini is an open source instruct-tuned LLM from Microsoft that achieves state-of-the-art performance for models of its size at just 3.8 billion parameters. Despite its compact size, it offers impressive capabilities with both 4k and 128k context variants.

Mixtral 8x7B provides an excellent balance of performance and efficiency through its Mixture of Experts architecture. The model offers strong multilingual capabilities and function calling support while maintaining reasonable resource requirements.

Qwen3 32B delivers advanced reasoning with hybrid thinking modes. Qwen3 models introduce a hybrid approach to problem-solving, supporting two modes: Thinking Mode, where the model reasons step by step before delivering a final answer to complex problems, and Non-Thinking Mode, for quick, near-instant responses to simpler questions. Pre-trained on approximately 36 trillion tokens covering 119 languages and dialects, Qwen3-32B demonstrates competitive performance against larger models while offering flexible reasoning budget control and improved agentic capabilities, including enhanced MCP support.
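
A brief sketch of toggling those modes, based on the enable_thinking switch described in the Qwen3 model card; treat the exact kwarg as an assumption and check the card for your model revision.

```python
# Sketch: toggle Qwen3's thinking mode via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Is 9.11 larger than 9.9? Reason it through."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for quick, non-reasoning replies
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```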

Speech AI models

Whisper remains the gold standard for speech recognition. OpenAI's model offers robust multilingual support and handles various audio conditions effectively, making it ideal for transcription services and voice interfaces.
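
Getting started is straightforward with the open source openai-whisper package. A minimal transcription sketch, with a placeholder audio file:

```python
# Transcription sketch with the openai-whisper package.
import whisper

model = whisper.load_model("base")        # "small", "medium", "large" trade speed for accuracy
result = model.transcribe("meeting.mp3")  # placeholder audio file
print(result["text"])
```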

XTTS-v2 excels at voice cloning applications. XTTS-v2 is capable of cloning voices into different languages with just a quick 6-second audio sample. This efficiency eliminates the need for extensive training data, making it an attractive solution for voice cloning and multilingual speech generation.
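
A minimal voice-cloning sketch using the Coqui TTS package, which ships XTTS-v2; the reference clip and output path are placeholders.

```python
# Voice-cloning sketch with Coqui TTS and XTTS-v2.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello! This voice was cloned from a six-second sample.",
    speaker_wav="reference_clip.wav",  # ~6 seconds of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```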

ChatTTS focuses on conversational applications. ChatTTS is a voice generation model designed for conversational applications, particularly for dialogue tasks in LLM assistants, offering natural speech synthesis optimized for interactive use cases.

MeloTTS provides multilingual capabilities with real-time performance. MeloTTS offers a broad range of languages and accents. A key highlight is that its Chinese voice can handle mixed Chinese and English speech, making it valuable for international applications.

Video AI models

CogVideoX leads open source video generation with its ability to create high-quality video sequences from text prompts. The model offers various parameter sizes to balance quality and computational requirements.
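
For reference, CogVideoX is available through Hugging Face Diffusers. The sketch below assumes a recent diffusers release with the CogVideoX pipeline and a GPU with enough memory; the prompt and output path are placeholders.

```python
# Text-to-video sketch with CogVideoX via Diffusers.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

frames = pipe(prompt="A paper boat drifting down a rainy street", num_frames=49).frames[0]
export_to_video(frames, "boat.mp4", fps=8)
```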

Stable Video Diffusion extends Stability AI's diffusion approach to video generation, providing controllable video synthesis capabilities for creative applications.

AnimateDiff specializes in animating static images, offering an accessible entry point for video generation without requiring complex video training data.

Performance tradeoffs: Size, speed, and accuracy

| Model category | Model | Parameters | Speed | Quality | GPU memory | Use case |
| --- | --- | --- | --- | --- | --- | --- |
| LLMs | Phi-3 Mini | 3.8B | Fast | Good | 8GB | Edge/mobile apps |
| LLMs | Qwen3 32B | 32B | Fast | Very Good | 64GB | Reasoning/multilingual |
| LLMs | Llama 4 Scout | 17B active (109B total) | Fast | Very Good | 24GB | General chat/long context |
| LLMs | Llama 4 Maverick | 17B active (400B total) | Moderate | Excellent | 80GB+ | Multimodal production |
| LLMs | DeepSeek-V3 | 671B | Slow | Frontier | 80GB+ | Research/premium |
| Speech | Whisper Base | 74M | Very Fast | Good | 1GB | Real-time transcription |
| Speech | Whisper Large | 1.55B | Moderate | Excellent | 6GB | High-quality transcription |
| Speech | XTTS-v2 | ~2B | Moderate | Very Good | 8GB | Voice cloning |
| Video | AnimateDiff | ~860M | Moderate | Good | 12GB | Image animation |
| Video | CogVideoX-2B | 2B | Slow | Very Good | 18GB | Text-to-video |

The general pattern shows that smaller models offer faster inference and lower resource requirements but sacrifice some quality. The sweet spot for most production applications falls in the 7B-70B range for LLMs, where you get strong performance without requiring specialized infrastructure.

From notebooks to production: The deployment challenge

Running a model in a Jupyter notebook bears little resemblance to production deployment. Production AI applications require:

Scalable infrastructure that can handle varying loads without manual intervention. Your application might see 10 requests per minute during quiet periods and 1,000 requests per minute during peak times.

Robust APIs with proper error handling, rate limiting, and monitoring. A simple model inference becomes complex when you add authentication, logging, and health checks.
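
As a rough illustration of what "robust" means in practice, here is a minimal sketch of an inference API with a health check, input validation, and error handling. FastAPI is one common choice, and run_inference is a placeholder for your actual model call.

```python
# Sketch: a minimal production-style inference API.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

def run_inference(prompt: str, max_tokens: int) -> str:
    # Placeholder: call your loaded model (vLLM, Transformers, etc.) here.
    return f"echo: {prompt[:50]}"

@app.get("/healthz")
def health() -> dict:
    # Liveness/readiness probe target for the orchestrator.
    return {"status": "ok"}

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    try:
        text = run_inference(req.prompt, req.max_tokens)
    except Exception as exc:  # surface model failures as a clean 500
        raise HTTPException(status_code=500, detail=str(exc)) from exc
    return {"text": text}
```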

Observability and monitoring to track model performance, latency, and resource utilization. You need visibility into both technical metrics and business outcomes.

CI/CD pipelines for updating models and deploying new versions without downtime. Model updates shouldn't require manual server management.

Resource optimization including GPU utilization, autoscaling, and cost management. GPUs are expensive, and inefficient usage directly impacts your bottom line.

Most teams underestimate this complexity. What works for experimentation often breaks under production load, leading to weeks of infrastructure work instead of product development.

Northflank: Production ready AI deployment

This is where Northflank transforms your AI deployment experience. Instead of spending months building infrastructure, you get production-ready deployment in minutes.

Container-based deployment with GPU support means you can package your AI models with their dependencies and deploy across multiple cloud providers. Whether you're running Mistral with vLLM or setting up a custom text-generation-webui, Northflank handles the orchestration.

Built-in CI/CD eliminates deployment friction. Connect your GitHub repository, configure your Dockerfile, and Northflank automatically builds and deploys your models. Updates become as simple as pushing code.

Autoscaling responds to demand automatically. Your AI services scale up during traffic spikes and scale down during quiet periods, optimizing both performance and costs without manual intervention.

Comprehensive observability provides insight into your AI workloads. Track inference latency, GPU utilization, and error rates through integrated monitoring and logging.

The Weights case study demonstrates this in action. JonLuca DeCaro, founder of Weights and former engineer at Citadel and Pinterest, could have built his own infrastructure from scratch. Instead, he used Northflank to scale Weights into a multi-cloud, GPU-optimized AI platform serving millions.

The results speak for themselves: with 9 clusters across AWS, GCP, and Azure, 40+ microservices, 250+ concurrent GPUs, 10,000+ AI training jobs, and half a million inference runs per day, Weights operates at a scale most Series B+ startups would envy, and does it seamlessly.

Practical deployment example: Deploying Mistral 7B with vLLM on Northflank requires just a Dockerfile and configuration. The platform handles GPU scheduling, load balancing, and scaling automatically.
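
Once the service is live, vLLM exposes an OpenAI-compatible endpoint you can call like any other HTTP API. This is a minimal client sketch; the base URL is a placeholder for whatever hostname your Northflank service gets, and the model name must match what the server was started with.

```python
# Sketch: calling a deployed vLLM service through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service.example.com/v1",  # placeholder endpoint
    api_key="not-needed-for-self-hosted",            # vLLM ignores this unless configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
    messages=[{"role": "user", "content": "Give me three taglines for a coffee shop."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```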

"We cut our model loading time from 7 minutes to 55 seconds with Northflank's multi-read-write cache layer - that's direct savings on our GPU costs."

Why you should deploy your AI workloads on Northflank

Bring Your Own Cloud (BYOC) gives you the flexibility to deploy on AWS, GCP, or Azure while maintaining control over your infrastructure and data. You get the benefits of managed services without vendor lock-in.

Full workload support means Northflank isn't just for AI models. You can run your entire application stack - databases, APIs, background jobs, and AI services - on the same platform with unified monitoring and management.

Stateless and stateful applications side-by-side let you build complete AI-powered applications. Run your vector databases alongside your embedding models, your chat APIs next to your LLMs, all with consistent deployment and scaling patterns.

As JonLuca puts it:

"If we didn't have Northflank managing everything, just keeping track of the Kubernetes clusters, setting up registries, actually running all of it - I think it's three to five people at this point."

For AI teams, this translates to faster time-to-market, lower operational overhead, and the ability to focus on what matters: building AI applications that solve real problems.

Speed is everything in AI development.

"Now that something like Northflank exists, there's no reason not to use it. It'll let you move faster, figure out what your company is doing, save you money, and save you time."

Deploy your first open source AI model on Northflank today.
