

How to deploy and self-host DeepSeek-V3.1 on Northflank
This guide shows you how to deploy and self-host DeepSeek-V3.1 on Northflank using our one-click template or by setting it up manually. The model runs with vLLM for high-throughput inference and includes an OpenAI-compatible endpoint plus a full Open WebUI interface.
DeepSeek-V3.1 supports both thinking and non-thinking chat modes and features a 128K context window, large enough to hold a 300-page book.
- DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts model with 128K context, hybrid thinking modes, and improved reasoning speed.
- Runs best on 8× NVIDIA H200 GPUs with vLLM.
- Deploy / Self-host on Northflank in minutes with our one-click template or configure manually for full control.
- Once deployed, you get a rate-limit-free, OpenAI-compatible API and a user-friendly web interface.
DeepSeek-V3.1 is the latest upgrade in the DeepSeek family of large language models. It builds on V3 and R1 with better reasoning speed, hybrid inference modes, and agentic improvements.
Key details:
- Architecture: Mixture-of-Experts (671B total parameters, ~37B active per token)
- Context window: 128K tokens
- Modes: Chat vs. Think (toggleable in the WebUI with the “DeepThink” button)
- Efficiency: FP8 (UE8M0) optimizations for H200 GPUs and upcoming domestic accelerators
- Inference: Faster than R1 and V3 in thinking mode, higher throughput in non-thinking mode
These improvements make DeepSeek-V3.1 one of the most capable open-weight LLMs available today.
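To see why the 8× H200 recommendation holds, a rough back-of-envelope helps (a sketch only; it assumes FP8 weights at about one byte per parameter and ignores activation memory and runtime overhead):

```python
# Back-of-envelope: why DeepSeek-V3.1 wants a full 8x H200 node.
# Assumptions: FP8 weights (~1 byte per parameter), 141 GB HBM per H200.
total_params = 671e9                      # total MoE parameters
weight_gb = total_params * 1 / 1e9        # ~671 GB of weights at FP8
node_hbm_gb = 8 * 141                     # ~1128 GB across 8x H200

print(f"weights: ~{weight_gb:.0f} GB, node HBM: ~{node_hbm_gb} GB")
# The ~450 GB left over is what vLLM can devote to the KV cache that
# makes the 128K context window usable in practice.
```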
- Hybrid inference: Choose between standard chat or reasoning-heavy “Think” mode.
- Faster reasoning: V3.1-Think responds quicker than R1 and earlier DeepSeek releases.
- Agent improvements: Stronger tool use and multi-step planning.
- 128K context: Enough space for large documents, codebases, or entire books.
- Open weights: Can be run on your own infra with no API restrictions.
On Northflank, you can deploy it securely, scale on demand, and avoid rate limits.
You have two options: deploy with the one-click template or set everything up manually.
- Create a Northflank account: Sign up and enable GPU regions.
- Select the template: From the template catalog, click Deploy DeepSeek-V3.1 128K on 8×H200 Now.
- Deploy the stack:
  - Creates a vLLM service with a mounted volume for the 671B model.
  - Deploys Open WebUI with persistent storage for user data.
- Wait for the model to load: The vLLM service downloads the weights and shards them across the GPUs. The first load takes roughly 45–60 minutes (a readiness-check sketch follows these steps).
- Open the WebUI: Navigate to the assigned `code.run` domain, create your account, and start interacting with DeepSeek-V3.1 in chat or think mode.
You’ll also get an OpenAI-compatible endpoint to use with any client library.
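While the weights are loading, you can poll the OpenAI-compatible endpoint until it responds. A minimal sketch against the standard `/v1/models` route that vLLM exposes (the base URL and key are placeholders to replace with your deployment's values):

```python
import os
import time

import requests

BASE_URL = "https://<vllm-service>.code.run/v1"   # placeholder: your service domain
API_KEY = os.environ["OPENAI_API_KEY"]


def wait_until_ready(poll_seconds: int = 60) -> None:
    """Poll /v1/models until the vLLM server has finished loading the model."""
    while True:
        try:
            r = requests.get(
                f"{BASE_URL}/models",
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=10,
            )
            if r.ok:
                print("Ready:", r.json()["data"][0]["id"])
                return
        except requests.RequestException:
            pass  # server not reachable yet; keep waiting
        time.sleep(poll_seconds)


wait_until_ready()
```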
- In the Northflank dashboard → Create Project.
- Name: `deepseek-v31`.
- Region: select one with H200 GPUs.
- Create a new Deployment service → `deepseek-v31-vllm`.
- Source: External image `vllm/vllm-openai:deepseek`.
- Runtime variable: `OPENAI_API_KEY` → generate a 128-character random key (a key-generation sketch follows this configuration list).
- Networking:
  - Add port 8000, protocol HTTP, expose publicly.
- Compute:
  - 8 × NVIDIA H200 GPUs.
- Advanced → command: `sleep 1d`
- Add a volume `deepseek-models`.
- Size: 1TB.
- Mount path: `/root/.cache/huggingface`.
- Attach it to the vLLM service.
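The 128-character key can be produced with any random generator you trust; a minimal sketch using Python's standard library (the output is what you paste into the `OPENAI_API_KEY` runtime variable):

```python
import secrets

# 64 random bytes -> 128 hex characters for the OPENAI_API_KEY runtime variable.
print(secrets.token_hex(64))
```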
In the service shell:

```bash
export HF_HUB_ENABLE_HF_TRANSFER=1
pip install --upgrade transformers torch hf-transfer
vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8
```

To automate this as the service command:

```bash
bash -c "export HF_HUB_ENABLE_HF_TRANSFER=1 && pip install --upgrade transformers torch hf-transfer && vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8"
```
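If the shard step stalls, it can help to confirm from the same shell that all eight GPUs are visible to PyTorch (a quick sketch; run it after the pip install above):

```python
import torch

# Expect 8 devices and an H200 device name on the configured instance.
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
```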
- New service: `deepseek-v31-webui`.
- Image: `ghcr.io/open-webui/open-webui:latest`
- Volume: persistent storage for sessions.
- Port: 8080, expose publicly.
- Env vars:

```
OPENAI_API_BASE=https://<vllm-service>.code.run/v1
OPENAI_API_KEY=<same key>
```
Example (Python):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://<vllm-service>.code.run/v1",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "user", "content": "Explain DeepSeek-V3.1's benefits"}
    ],
)

print(resp.choices[0].message)
```
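To switch between chat and think modes over the API, vLLM can forward per-request chat-template arguments via `extra_body`. A hedged sketch, assuming the `thinking` flag exposed by DeepSeek-V3.1's chat template at the time of writing (verify against the template shipped with the weights):

```python
# Reuses the `client` from the example above.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Outline a 3-step debugging plan"}],
    # vLLM passes chat_template_kwargs through to the model's chat template;
    # the `thinking` flag is an assumption based on DeepSeek-V3.1's template.
    extra_body={"chat_template_kwargs": {"thinking": True}},  # False = plain chat
)
print(resp.choices[0].message.content)
```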
How much does it cost to self-host DeepSeek-V3.1?
Many teams choose to self-host DeepSeek-V3.1 for cost efficiency and data privacy. Northflank makes it easy to deploy or self-host DeepSeek-V3.1 without infrastructure headaches.
Running DeepSeek-V3.1 at production scale requires 8× H200 GPUs.
Northflank GPU pricing (as of August 2025):
- H200: ~$3.20/hour per GPU
- 8×H200 = ~$25.60/hour
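For budgeting, that hourly rate translates roughly as follows (a sketch assuming the node runs 24/7 at the listed on-demand price; scheduled scale-downs or spot capacity change the picture):

```python
# Rough monthly cost for an always-on 8x H200 node at the listed rate.
hourly = 8 * 3.20               # ~$25.60/hour
monthly = hourly * 24 * 30      # ~$18,432/month
print(f"~${hourly:.2f}/hour -> ~${monthly:,.0f}/month")
```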
Token cost equivalent with vLLM optimizations:
- Input: ~$0.10 per 1M tokens
- Output: ~$2.20 per 1M tokens
You pay only for the GPUs and storage you run; there are no hidden charges.
This makes Northflank one of the most cost-efficient platforms for MoE inference at scale.
DeepSeek has iterated quickly, with each release pushing reasoning, speed, and usability forward.
- Architecture: Both use a 671B Mixture-of-Experts design with ~37B active parameters per forward pass.
- Context window: V3 had 64K tokens, while V3.1 doubles this to 128K tokens.
- Performance: V3.1 runs more efficiently on H200 GPUs thanks to FP8 (UE8M0) optimizations.
- Inference modes: V3 supported standard chat-style inference only. V3.1 introduces hybrid inference with both chat and think modes.
- Reasoning: V3 was capable but slower at multi-step reasoning. V3.1 improves both speed and accuracy in reasoning-heavy tasks.
👉 Verdict: DeepSeek-V3.1 is a direct upgrade: more context, faster reasoning, and flexible inference modes.
- Purpose: R1 was tuned specifically for chain-of-thought reasoning using reinforcement learning. V3.1 integrates those improvements into a general-purpose model.
- Context window: R1 was limited to 64K. V3.1 expands this to 128K tokens.
- Speed: R1 reasoning was accurate but often slower. V3.1’s “Think” mode is faster while maintaining quality.
- Flexibility: R1 forced reasoning-heavy outputs. V3.1 gives you a toggle between fast chat and deep reasoning.
- Agent performance: V3.1 shows stronger results on tool use and multi-step tasks compared to R1.
👉 Verdict: DeepSeek-V3.1 replaces R1 by offering reasoning at higher speed, with the option to switch back to standard inference.
- Deploy DeepSeek-V3.1 on Northflank
- Deploy DeepSeek’s older versions in your own cloud
- Deploy Qwen3 on Northflank
- Self-host gpt-oss on Northflank
DeepSeek-V3.1 represents a leap forward in open-weight reasoning models: hybrid inference, faster chain-of-thought, and a 128K context.
On Northflank, you can run it securely, scale across H200 GPUs, and interact through an OpenAI-compatible API or a user-friendly WebUI, with no rate limits.