

How to deploy and self-host DeepSeek-V3.1 on Northflank
This guide shows you how to deploy and self-host DeepSeek-V3.1 on Northflank using our one-click template or by setting it up manually. The model runs with vLLM for high-throughput inference and includes an OpenAI-compatible endpoint plus a full Open WebUI interface.
DeepSeek-V3.1 supports both thinking and non-thinking chat modes and features a 128K context window, large enough to hold a 300-page book.
- DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts model with 128K context, hybrid thinking modes, and improved reasoning speed.
- Runs best on 8× NVIDIA H200 GPUs with vLLM.
- Deploy / Self-host on Northflank in minutes with our one-click template or configure manually for full control.
- Once deployed, you get a rate-limit-free, OpenAI-compatible API and a user-friendly web interface.
DeepSeek-V3.1 is the latest upgrade in the DeepSeek family of large language models. It builds on V3 and R1 with better reasoning speed, hybrid inference modes, and agentic improvements.
Key details:
- Architecture: Mixture-of-Experts (671B total parameters, ~37B active per token)
- Context window: 128K tokens
- Modes: Chat vs. Think (toggleable in the WebUI with the “DeepThink” button)
- Efficiency: FP8 (UE8M0) optimizations for H200 GPUs and upcoming domestic accelerators
- Inference: Faster than R1 and V3 in thinking mode, higher throughput in non-thinking mode
These improvements make DeepSeek-V3.1 one of the most capable open-weight LLMs available today.
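To see why the 8× H200 recommendation holds, a rough back-of-envelope helps (a sketch only; it assumes FP8 weights at about one byte per parameter and ignores activation memory and runtime overhead):

```python
# Back-of-envelope: why DeepSeek-V3.1 wants a full 8x H200 node.
# Assumptions: FP8 weights (~1 byte per parameter), 141 GB HBM per H200.
total_params = 671e9                      # total MoE parameters
weight_gb = total_params * 1 / 1e9        # ~671 GB of weights at FP8
node_hbm_gb = 8 * 141                     # ~1128 GB across 8x H200

print(f"weights: ~{weight_gb:.0f} GB, node HBM: ~{node_hbm_gb} GB")
# The ~450 GB left over is what vLLM can devote to the KV cache that
# makes the 128K context window usable in practice.
```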
- Hybrid inference: Choose between standard chat or reasoning-heavy “Think” mode.
- Faster reasoning: V3.1-Think responds quicker than R1 and earlier DeepSeek releases.
- Agent improvements: Stronger tool use and multi-step planning.
- 128K context: Enough space for large documents, codebases, or entire books.
- Open weights: Can be run on your own infra with no API restrictions.
On Northflank, you can deploy it securely, scale on demand, and avoid rate limits.
You have two options: deploy with the one-click template or set everything up manually.
- Create a Northflank account: Sign up and enable GPU regions.
- Select the template: From the template catalog, click Deploy DeepSeek-V3.1 128K on 8×H200 Now.
- Deploy the stack:
  - Creates a vLLM service with a mounted volume for the 671B model.
  - Deploys Open WebUI with persistent storage for user data.
- Wait for the model to load: The vLLM service downloads the weights and shards them across the GPUs. The first load takes roughly 45–60 minutes (a readiness-check sketch follows these steps).
- Open the WebUI: Navigate to the assigned `code.run` domain, create your account, and start interacting with DeepSeek-V3.1 in chat or think mode.
You’ll also get an OpenAI-compatible endpoint to use with any client library.
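While the weights are loading, you can poll the OpenAI-compatible endpoint until it responds. A minimal sketch against the standard `/v1/models` route that vLLM exposes (the base URL and key are placeholders to replace with your deployment's values):

```python
import os
import time

import requests

BASE_URL = "https://<vllm-service>.code.run/v1"   # placeholder: your service domain
API_KEY = os.environ["OPENAI_API_KEY"]


def wait_until_ready(poll_seconds: int = 60) -> None:
    """Poll /v1/models until the vLLM server has finished loading the model."""
    while True:
        try:
            r = requests.get(
                f"{BASE_URL}/models",
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=10,
            )
            if r.ok:
                print("Ready:", r.json()["data"][0]["id"])
                return
        except requests.RequestException:
            pass  # server not reachable yet; keep waiting
        time.sleep(poll_seconds)


wait_until_ready()
```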
- In the Northflank dashboard → Create Project.
- Name: `deepseek-v31`.
- Region: select one with H200 GPUs.
- Create a new Deployment service → `deepseek-v31-vllm`.
- Source: External image `vllm/vllm-openai:deepseek`.
- Runtime variable: `OPENAI_API_KEY` → generate a 128-character random key (a key-generation sketch follows this configuration list).
- Networking:
  - Add port 8000, protocol HTTP, expose publicly.
- Compute:
  - 8 × NVIDIA H200 GPUs.
- Advanced → command: `sleep 1d`
- Add a volume `deepseek-models`.
- Size: 1TB.
- Mount path: `/root/.cache/huggingface`.
- Attach it to the vLLM service.
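The 128-character key can be produced with any random generator you trust; a minimal sketch using Python's standard library (the output is what you paste into the `OPENAI_API_KEY` runtime variable):

```python
import secrets

# 64 random bytes -> 128 hex characters for the OPENAI_API_KEY runtime variable.
print(secrets.token_hex(64))
```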
In the service shell:

```bash
export HF_HUB_ENABLE_HF_TRANSFER=1
pip install --upgrade transformers torch hf-transfer
vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8
```

To automate this as the service command:

```bash
bash -c "export HF_HUB_ENABLE_HF_TRANSFER=1 && pip install --upgrade transformers torch hf-transfer && vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8"
```
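If the shard step stalls, it can help to confirm from the same shell that all eight GPUs are visible to PyTorch (a quick sketch; run it after the pip install above):

```python
import torch

# Expect 8 devices and an H200 device name on the configured instance.
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
```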
- New service: `deepseek-v31-webui`.
- Image: `ghcr.io/open-webui/open-webui:latest`
- Volume: persistent storage for sessions.
- Port: 8080, expose publicly.
- Env vars:

```
OPENAI_API_BASE=https://<vllm-service>.code.run/v1
OPENAI_API_KEY=<same key>
```
Example (Python):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://<vllm-service>.code.run/v1",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "user", "content": "Explain DeepSeek-V3.1's benefits"}
    ],
)

print(resp.choices[0].message)
```
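To switch between chat and think modes over the API, vLLM can forward per-request chat-template arguments via `extra_body`. A hedged sketch, assuming the `thinking` flag exposed by DeepSeek-V3.1's chat template at the time of writing (verify against the template shipped with the weights):

```python
# Reuses the `client` from the example above.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Outline a 3-step debugging plan"}],
    # vLLM passes chat_template_kwargs through to the model's chat template;
    # the `thinking` flag is an assumption based on DeepSeek-V3.1's template.
    extra_body={"chat_template_kwargs": {"thinking": True}},  # False = plain chat
)
print(resp.choices[0].message.content)
```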
How much does it cost to self-host DeepSeek-V3.1?
Many teams choose to self-host DeepSeek-V3.1 for cost efficiency and data privacy. Northflank makes it easy to deploy or self-host DeepSeek-V3.1 without infrastructure headaches.
Running DeepSeek-V3.1 at production scale requires 8× H200 GPUs.
Northflank GPU pricing (as of August 2025):
- H200: ~$3.20/hour per GPU
- 8×H200 = ~$25.60/hour
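For budgeting, that hourly rate translates roughly as follows (a sketch assuming the node runs 24/7 at the listed on-demand price; scheduled scale-downs or spot capacity change the picture):

```python
# Rough monthly cost for an always-on 8x H200 node at the listed rate.
hourly = 8 * 3.20               # ~$25.60/hour
monthly = hourly * 24 * 30      # ~$18,432/month
print(f"~${hourly:.2f}/hour -> ~${monthly:,.0f}/month")
```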
Token cost equivalent with vLLM optimizations:
- Input: ~$0.10 per 1M tokens
- Output: ~$2.20 per 1M tokens
You pay only for the GPUs and storage you run; there are no hidden charges.
This makes Northflank one of the most cost-efficient platforms for MoE inference at scale.
DeepSeek has iterated quickly, with each release pushing reasoning, speed, and usability forward.
- Architecture: Both use a 671B Mixture-of-Experts design with ~37B active parameters per forward pass.
- Context window: V3 had 64K tokens, while V3.1 doubles this to 128K tokens.
- Performance: V3.1 runs more efficiently on H200 GPUs thanks to FP8 (UE8M0) optimizations.
- Inference modes: V3 supported standard chat-style inference only. V3.1 introduces hybrid inference with both chat and think modes.
- Reasoning: V3 was capable but slower at multi-step reasoning. V3.1 improves both speed and accuracy in reasoning-heavy tasks.
👉 Verdict: DeepSeek-V3.1 is a direct upgrade: more context, faster reasoning, and flexible inference modes.
- Purpose: R1 was tuned specifically for chain-of-thought reasoning using reinforcement learning. V3.1 integrates those improvements into a general-purpose model.
- Context window: R1 was limited to 64K. V3.1 expands this to 128K tokens.
- Speed: R1 reasoning was accurate but often slower. V3.1’s “Think” mode is faster while maintaining quality.
- Flexibility: R1 forced reasoning-heavy outputs. V3.1 gives you a toggle between fast chat and deep reasoning.
- Agent performance: V3.1 shows stronger results on tool use and multi-step tasks compared to R1.
👉 Verdict: DeepSeek-V3.1 replaces R1 by offering reasoning at higher speed, with the option to switch back to standard inference.
- Deploy DeepSeek-V3.1 on Northflank
- Deploy DeepSeek’s older versions in your own cloud
- Deploy Qwen3 on Northflank
- Self-host gpt-oss on Northflank
DeepSeek-V3.1 represents a leap forward in open-weight reasoning models: hybrid inference, faster chain-of-thought, and a 128K context.
On Northflank, you can run it securely, scale across H200 GPUs, and interact through an OpenAI-compatible API or a user-friendly WebUI, with no rate limits.