Deploy Ollama on Northflank
Ollama is a self-hosted AI system for running large language models such as Llama 3, Mistral, and Qwen directly on your own infrastructure. It provides a simple HTTP API and supports GPU acceleration, making it easy to integrate advanced AI models into applications without relying on external providers.
With Northflank, you can deploy Ollama in minutes using the Ollama stack template. The template handles GPU configuration, storage, networking, and service management, allowing you to focus on building AI features rather than maintaining servers.
Ollama is a lightweight model runtime that enables you to run open-source language models locally or in the cloud. It includes a model registry, supports quantized formats such as GGUF, and exposes a REST API for generating text or running embeddings. Many developers use Ollama for building chatbots, automations, RAG systems, and backend AI agents that require full control over performance and data.
The Ollama stack template provisions everything needed to run a GPU-accelerated model server with persistent storage.
It includes:
- A GPU-backed (NVIDIA A100) service running the official `ollama/ollama:latest` image
- Persistent storage mounted at `/root/.ollama` so models are retained across restarts
This setup provides a stable environment for running models efficiently, ensuring that downloads, weights, and runtime data persist across deployments.
- Create an account on Northflank
- Click `deploy Ollama now`
- Click `deploy stack` to create the project and GPU service
- Wait for the deployment to complete
- Open the service and copy your public URL to begin using the Ollama API
You can now pull a model using the API. For example, to download Qwen 2.5:
curl https://YOUR_PUBLIC_URL/api/pull -d '{
"name": "qwen2.5"
}'
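To confirm the download completed, and to check that models persist on the volume after a restart or redeploy, you can list the installed models through the standard Ollama list endpoint (using the same public URL as above):
curl https://YOUR_PUBLIC_URL/api/tags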
After the model is installed, you can generate text:
curl https://YOUR_PUBLIC_URL/api/generate -d '{
"model": "qwen2.5",
"prompt": "What is Northflank?",
"stream": false
}'
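The same API can also return embeddings, one of the common Ollama use cases mentioned earlier. A minimal sketch against the standard embeddings endpoint; the model name here is an assumption and should be one you have already pulled, with dedicated embedding models such as nomic-embed-text being the usual choice:
curl https://YOUR_PUBLIC_URL/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "What is Northflank?"
}'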
This template gives you a production-ready setup for AI workloads:
- Run open-source models with full GPU acceleration
- Persist downloaded model weights with a dedicated volume
- Interact with models through a public or private API endpoint
- Scale resources vertically or horizontally as needed
- Manage service configuration through environment settings
- Integrate the API directly into Python, JavaScript, or backend systems
The environment follows recommended practices for model hosting and provides a repeatable deployment process for development and production use cases.
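For chatbot-style integrations, Ollama also exposes a conversational endpoint that accepts a message history rather than a single prompt. A brief sketch, assuming the same public URL and the qwen2.5 model pulled above (the message content is illustrative):
curl https://YOUR_PUBLIC_URL/api/chat -d '{
  "model": "qwen2.5",
  "messages": [
    { "role": "user", "content": "Summarise what Northflank does in one sentence." }
  ],
  "stream": false
}'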
- Ollama Service - Runs the official Ollama image, manages the model runtime, exposes the `/api` endpoints, and provides GPU-accelerated execution for text generation and embeddings.
- Persistent Volume - Stores downloaded models, ensuring they remain available after restarts or updates.
- Public HTTP Endpoint - Allows you to send generation and pull requests to the Ollama API.
- GPU Configuration - Allocates an NVIDIA A100 (80GB VRAM) for high-performance inference.
- Project Workspace - Organises the deployment, networking, configuration, and monitoring inside Northflank.
Deploying Ollama on Northflank gives you a streamlined and reliable way to run open-source language models at scale. With GPU acceleration, persistent storage, and a fully managed environment, you can experiment, prototype, or run production workloads without managing servers manually.
You now have a complete setup for hosting models, integrating AI features into applications, and building systems that require private and high-performance inference.