Deploy Ollama on Northflank
Ollama is a self-hosted AI system for running large language models such as Llama 3, Mistral, and Qwen directly on your own infrastructure. It provides a simple HTTP API and supports GPU acceleration, making it easy to integrate advanced AI models into applications without relying on external providers.
With Northflank, you can deploy Ollama in minutes using the Ollama stack template. The template handles GPU configuration, storage, networking, and service management, allowing you to focus on building AI features rather than maintaining servers.
Ollama is a lightweight model runtime that enables you to run open-source language models locally or in the cloud. It includes a model registry, supports quantized formats such as GGUF, and exposes a REST API for generating text or running embeddings. Many developers use Ollama for building chatbots, automations, RAG systems, and backend AI agents that require full control over performance and data.
The Ollama stack template provisions everything needed to run a GPU-accelerated model server with persistent storage.
It includes:
- A GPU-backed (NVIDIA A100) service running the official `ollama/ollama:latest` image
- Persistent storage mounted at `/root/.ollama` so models are retained across restarts
This setup provides a stable environment for running models efficiently, ensuring that downloads, weights, and runtime data persist across deployments.
- Create an account on Northflank
- Click `deploy Ollama now`
- Click `deploy stack` to create the project and GPU service
- Wait for the deployment to complete
- Open the service and copy your public URL to begin using the Ollama API
You can now pull a model using the API. For example, to download Qwen 2.5:
curl https://YOUR_PUBLIC_URL/api/pull -d '{
"name": "qwen2.5"
}'
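To confirm the download completed, and to check that models persist on the volume after a restart or redeploy, you can list the installed models through the standard Ollama list endpoint (using the same public URL as above):
curl https://YOUR_PUBLIC_URL/api/tags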
After the model is installed, you can generate text:
curl https://YOUR_PUBLIC_URL/api/generate -d '{
"model": "qwen2.5",
"prompt": "What is Northflank?",
"stream": false
}'
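The same API can also return embeddings, one of the common Ollama use cases mentioned earlier. A minimal sketch against the standard embeddings endpoint; the model name here is an assumption and should be one you have already pulled, with dedicated embedding models such as nomic-embed-text being the usual choice:
curl https://YOUR_PUBLIC_URL/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "What is Northflank?"
}'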
This template gives you a production-ready setup for AI workloads:
- Run open-source models with full GPU acceleration
- Persist downloaded model weights with a dedicated volume
- Interact with models through a public or private API endpoint
- Scale resources vertically or horizontally as needed
- Manage service configuration through environment settings
- Integrate the API directly into Python, JavaScript, or backend systems
The environment follows recommended practices for model hosting and provides a repeatable deployment process for development and production use cases.
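For chatbot-style integrations, Ollama also exposes a conversational endpoint that accepts a message history rather than a single prompt. A brief sketch, assuming the same public URL and the qwen2.5 model pulled above (the message content is illustrative):
curl https://YOUR_PUBLIC_URL/api/chat -d '{
  "model": "qwen2.5",
  "messages": [
    { "role": "user", "content": "Summarise what Northflank does in one sentence." }
  ],
  "stream": false
}'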
- Ollama Service - Runs the official Ollama image, manages the model runtime, exposes the `/api` endpoints, and provides GPU-accelerated execution for text generation and embeddings.
- Persistent Volume - Stores downloaded models, ensuring they remain available after restarts or updates.
- Public HTTP Endpoint - Allows you to send generation and pull requests to the Ollama API.
- GPU Configuration - Allocates an NVIDIA A100 (80GB VRAM) for high-performance inference.
- Project Workspace - Organises the deployment, networking, configuration, and monitoring inside Northflank.
Deploying Ollama on Northflank gives you a streamlined and reliable way to run open-source language models at scale. With GPU acceleration, persistent storage, and a fully managed environment, you can experiment, prototype, or run production workloads without managing servers manually.
You now have a complete setup for hosting models, integrating AI features into applications, and building systems that require private and high-performance inference.