

Open source LLMs: The complete developer's guide to choosing and deploying LLMs
Running your own language models lets you avoid per-token API costs and take control of your AI infrastructure. This guide shows you exactly how to select, deploy, and scale open source LLMs for production use.
- Open source LLMs are language models with publicly available weights you can run on your own hardware
- Top models: Llama 4, DeepSeek-V3, Qwen 3, and Mistral offer different performance/efficiency tradeoffs
- Deployment complexity ranges from simple inference servers to full production stacks with autoscaling
- Infrastructure is critical: The right platform can reduce deployment time from weeks to minutes
An open source Large Language Model (LLM) is a neural network trained for natural language processing tasks whose model weights, architecture specifications, and often training code are freely available for download and use. Unlike proprietary models accessed through APIs (like GPT-4 or Claude), open source LLMs can be:
- Downloaded and run locally on your own hardware
- Modified and fine-tuned for specific use cases
- Deployed without usage restrictions (depending on license)
- Integrated directly into your applications without API dependencies
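Pulling a model's weights for local use, for example, is typically a one-liner with the huggingface_hub library. A minimal sketch (Phi-3 Mini is used here only because it is small enough to download quickly):

```python
# Download a model's weights into the local Hugging Face cache
from huggingface_hub import snapshot_download

# Works for any public (or license-accepted) repo ID on the Hugging Face Hub
local_path = snapshot_download("microsoft/Phi-3-mini-4k-instruct")
print(f"Model files downloaded to: {local_path}")
```

Once the files are on disk, every framework discussed below (Transformers, vLLM, TGI) can load them without further network access.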
Read more: Deploy DeepSeek R1 on Northflank in minutes.
Complete data control: Your prompts and responses never leave your infrastructure. For industries handling sensitive data (healthcare, finance, legal) this isn't optional.
Predictable costs: No surprise bills from token usage spikes. Your costs are tied to infrastructure, not API calls.
Customization freedom: Fine-tune models on your domain-specific data. A base model becomes your specialized AI assistant.
Latency optimization: Deploy models close to your users. No round trips to external APIs mean faster response times.
No vendor dependencies: API changes, rate limits, or service outages won't break your application.
Choosing the best open source LLM depends on your specific requirements. Here's a comprehensive breakdown of leading models:
Llama 4 Maverick:
- Parameters: 17B active (400B total, MoE)
- Strengths: Multimodal capabilities, excellent reasoning, strong coding
- Context: 1M tokens (the lighter Llama 4 Scout variant extends to 10M, currently industry leading)
- Use case: Production applications requiring frontier performance

Qwen 3 32B:
- Parameters: 32B
- Strengths: Hybrid thinking modes, multilingual support
- Context: 32K tokens
- Use case: Complex reasoning tasks with moderate hardware

Phi 3 Mini:
- Parameters: 3.8B
- Strengths: Runs on consumer GPUs, fast inference
- Context: 4K or 128K variants
- Use case: Edge deployment, mobile applications

DeepSeek-V3:
- Parameters: 671B total (MoE, 37B active)
- Strengths: Matches closed-source performance
- Context: 128K tokens
- Use case: Research and premium applications

Mixtral 8x7B:
- Parameters: 46.7B (MoE architecture)
- Strengths: Efficient inference, function calling
- Context: 32K tokens
- Use case: API services, chatbots
| Model | Size | MMLU (%) | HumanEval pass@1 (%) | Inference speed | VRAM required |
|---|---|---|---|---|---|
| Llama 4 Maverick | 400B MoE | 88.6 | 92.8 | Moderate | 80GB+ |
| DeepSeek-V3 | 671B | 87.2 | 89.9 | Slow | 160GB+ |
| Qwen 3 32B | 32B | 82.3 | 81.2 | Fast | 64GB |
| Mixtral 8x7B | 46.7B | 75.3 | 74.4 | Fast | 48GB |
| Phi 3 Mini | 3.8B | 68.8 | 62.2 | Very fast | 8GB |
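To sanity-check VRAM figures like those in the table, a rough rule of thumb is parameter count times bytes per parameter for the weights alone; the real footprint is higher once the KV cache and runtime overhead are added. A quick sketch of that arithmetic:

```python
# Back-of-the-envelope weight memory: parameter count x bytes per parameter
# (ignores KV cache, activations, and framework overhead, so treat it as a floor)
def weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    return params_billions * bits_per_param / 8

print(f"Phi 3 Mini (3.8B), fp16:  ~{weight_memory_gb(3.8, 16):.1f} GB")   # ~7.6 GB
print(f"Phi 3 Mini (3.8B), 4-bit: ~{weight_memory_gb(3.8, 4):.1f} GB")    # ~1.9 GB
print(f"Qwen 3 32B, fp16:         ~{weight_memory_gb(32, 16):.1f} GB")    # ~64 GB
```

This also shows why quantization (covered below) matters: dropping from 16 bits to 4 bits per parameter cuts weight memory by roughly 4x.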
Running an open source LLM involves several key steps. Here's a practical guide to get you started:
Local development (quick testing):
```python
# Using Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Run a short prompt to confirm the model loads and generates
inputs = tokenizer("Explain LLM quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```
Production inference server (recommended):
```bash
# Using vLLM for optimized serving
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1
```
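Because the vLLM container exposes an OpenAI-compatible API, you can test it with the standard openai Python client pointed at your own endpoint. A quick sketch, assuming the server above is reachable on localhost:8000 (the placeholder API key is ignored unless you start vLLM with --api-key):

```python
# Query the vLLM server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    prompt="Write a haiku about GPUs.",
    max_tokens=64,
)
print(response.choices[0].text)
```

Keeping the OpenAI wire format means existing client code can be pointed at your self-hosted endpoint with a one-line base URL change.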
Quantization reduces model size and speeds up inference:
```python
# 4-bit quantization with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Pass the config at load time so the weights are quantized as they are loaded
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", quantization_config=quantization_config, device_map="auto"
)
```
Batching improves throughput:
- Use inference servers like vLLM or TGI
- Configure dynamic batching for optimal GPU utilization
- Monitor queue depths and adjust batch sizes
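Inference servers such as vLLM handle this continuous batching for you; knobs like gpu_memory_utilization and max_num_seqs control how many sequences get packed onto the GPU per scheduling step. A minimal sketch using vLLM's offline Python API (the values shown are illustrative, not tuned):

```python
# Continuous batching with vLLM: many prompts are scheduled onto the GPU together
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may reserve for weights + KV cache
    max_num_seqs=128,             # upper bound on sequences batched per scheduling step
)

prompts = [f"Summarize item {i} in one sentence." for i in range(64)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.7))
for out in outputs[:3]:
    print(out.outputs[0].text)
```

The same batching behavior applies when vLLM runs as the OpenAI-compatible server shown earlier; concurrent HTTP requests are batched transparently.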
Create a production-ready API wrapper:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7

@app.post("/completions")
async def generate(request: CompletionRequest):
    # Example backend: forward to the vLLM OpenAI-compatible server started earlier
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://localhost:8000/v1/completions",
            json={"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "prompt": request.prompt,
                  "max_tokens": request.max_tokens, "temperature": request.temperature},
        )
    generated_text = resp.json()["choices"][0]["text"]
    return {"completion": generated_text}
```
Memory management:
- Use model sharding for large models
- Implement proper garbage collection
- Monitor VRAM usage continuously
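For the sharding point above, Hugging Face Accelerate can split a checkpoint across several GPUs (and spill to CPU RAM) via device_map and max_memory. A minimal sketch; the per-device caps are placeholders you would size to your actual hardware:

```python
# Shard a large model across multiple GPUs with Accelerate's device_map
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",                                     # let Accelerate place layers across devices
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "96GiB"},   # per-device caps; overflow spills to CPU RAM
    torch_dtype="auto",
)
```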
Performance optimization:
- Enable Flash Attention for longer contexts
- Use tensor parallelism for multi-GPU setups
- Implement caching for repeated queries
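The first two points above are usually a one-line change each. In Transformers, Flash Attention 2 is opted into at load time (it needs the flash-attn package and a recent NVIDIA GPU), and in vLLM tensor parallelism is a single argument. The sketch below assumes a node with two GPUs:

```python
# Flash Attention 2 in Transformers (requires the flash-attn package and an Ampere+ GPU)
from transformers import AutoModelForCausalLM
from vllm import LLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    attn_implementation="flash_attention_2",
    torch_dtype="auto",
    device_map="auto",
)

# Tensor parallelism in vLLM: shard each layer's weights across the GPUs on one node
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
```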
Reliability:
- Add health checks and automatic restarts
- Implement request queuing and backpressure
- Set up proper logging and monitoring
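A lightweight version of the first two points can live directly in the FastAPI wrapper above: a /health route for platform probes, plus a semaphore that caps in-flight requests so excess load fails fast instead of queuing indefinitely. A sketch (the limit of 8 is an arbitrary placeholder; app, CompletionRequest, and generate are the objects defined in the wrapper earlier):

```python
# Additions to the FastAPI wrapper above: a health probe and a cap on in-flight requests
import asyncio
from fastapi import HTTPException

inference_slots = asyncio.Semaphore(8)   # placeholder limit; size to your GPU's real concurrency

@app.get("/health")                      # `app` is the FastAPI instance defined above
async def health():
    return {"status": "ok"}

@app.post("/completions-guarded")
async def generate_guarded(request: CompletionRequest):
    # Fail fast with 503 when every slot is busy instead of letting the queue grow unbounded
    if inference_slots.locked():
        raise HTTPException(status_code=503, detail="Server at capacity, retry shortly")
    async with inference_slots:
        return await generate(request)   # reuse the handler defined earlier
```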
Here's where most teams hit a wall. You've selected your model, optimized inference, and built your API, but production deployment introduces new complexities:
Autoscaling: Your lunch-hour traffic might be 10x your morning load. Manual scaling isn't sustainable.
Multi-region deployment: Users expect low latency. Deploying models globally requires sophisticated orchestration.
Zero-downtime updates: Model improvements shouldn't mean service interruptions.
Cost optimization: GPUs are expensive. Inefficient utilization directly impacts your bottom line.
Observability: You need visibility into inference latency, error rates, and resource utilization.
Building this infrastructure from scratch typically requires:
- Kubernetes expertise for orchestration
- Custom CI/CD pipelines for model updates
- Monitoring stack setup and maintenance
- Load balancer configuration
- GPU scheduling optimization
This represents months of engineering work before you can focus on your actual AI application.
Modern platforms eliminate this infrastructure complexity. Here's how a production-ready deployment actually looks:
Package your LLM with its dependencies:
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir vllm transformers
COPY . /app
WORKDIR /app
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "mistralai/Mixtral-8x7B-Instruct-v0.1"]
```
Deploy with automatic GPU provisioning, load balancing, and monitoring, all handled by the platform.
Consider how Weights scaled their AI platform:
- 250+ concurrent GPUs across multiple clouds
- 500,000+ inference runs daily
- Multi-cloud deployment (AWS, GCP, Azure)
- Managed by a small team without dedicated DevOps
This scale would typically require a full infrastructure team. With the right platform, it's achievable by a few engineers focused on product development.
Start small, scale smart: Northflank's platform adapts to your growth:
- Begin with a single GPU for prototyping
- Scale to hundreds of GPUs across regions
- Mix on-demand and spot instances for cost optimization
- No infrastructure rewrite as you grow
Global GPU availability: Deploy where your users are:
- GPUs available in 15+ regions worldwide
- Automatic failover between availability zones
- No vendor lock-in or regional contracts
- Same deployment experience everywhere
The combination of BYOC (bring your own cloud) flexibility and managed GPU access means you can optimize for both compliance requirements and cost efficiency without compromise.
- ✅ Verify license compatibility with your use case
- ✅ Benchmark on your specific tasks, not just general benchmarks
- ✅ Consider total cost of ownership, not just model performance
- ✅ Test quantized versions for better efficiency
- ✅ Evaluate ecosystem support and documentation
The gap between experimenting with open source LLMs and running them in production doesn't have to be insurmountable. Here's your action plan:
- Start small: Deploy Phi 3 Mini for initial testing
- Measure everything: Establish baselines for latency and cost
- Scale gradually: Move to larger models as needed
- Optimize continuously: Monitor usage patterns and adjust
For teams ready to move beyond notebooks, platforms like Northflank provide the infrastructure layer that makes production deployment accessible. No Kubernetes expertise is required: just your model and a Dockerfile.
Open source LLMs represent a fundamental shift in how we build AI applications. The technology is ready; the models rival proprietary alternatives; the tooling ecosystem is maturing rapidly.
The question isn't whether to use open source LLMs; it's how quickly you can move from experimentation to production. With the right approach and infrastructure, that journey is shorter than ever.
Ready to deploy your first open source LLM? Try Northflank today.