Will Stewart
Published 21st August 2025

How to deploy and self-host DeepSeek-V3.1 on Northflank

This guide shows you how to deploy and self-host DeepSeek-V3.1 on Northflank using our one-click template or by setting it up manually. The model runs with vLLM for high-throughput inference and includes an OpenAI-compatible endpoint plus a full Open WebUI interface.

DeepSeek-V3.1 supports both thinking and non-thinking chat modes and features a 128K context window, large enough to hold a 300-page book.

📌 TL;DR

  • DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts model with 128K context, hybrid thinking modes, and improved reasoning speed.
  • Runs best on 8× NVIDIA H200 GPUs with vLLM.
  • Deploy / Self-host on Northflank in minutes with our one-click template or configure manually for full control.
  • Once deployed, you get a rate-limit-free, OpenAI-compatible API and a user-friendly web interface.

👉 Deploy DeepSeek-V3.1 (128K) on Northflank now

What is DeepSeek-V3.1?

DeepSeek-V3.1 is the latest upgrade in the DeepSeek family of large language models. It builds on V3 and R1 with better reasoning speed, hybrid inference modes, and agentic improvements.

Key details:

  • Architecture: Mixture-of-Experts (671B total parameters, ~37B active per token)
  • Context window: 128K tokens
  • Modes: Chat vs Think (toggleable in WebUI with “DeepThink” button)
  • Efficiency: FP8 (UE8M0) optimizations for H200 and domestic chips
  • Inference: Faster than R1 and V3 in thinking mode, higher throughput in non-thinking mode

These improvements make DeepSeek-V3.1 one of the most capable open-weight LLMs available today.

Why DeepSeek-V3.1 matters

  • Hybrid inference: Choose between standard chat or reasoning-heavy “Think” mode.
  • Faster reasoning: V3.1-Think responds quicker than R1 and earlier DeepSeek releases.
  • Agent improvements: Stronger tool use and multi-step planning.
  • 128K context: Enough space for large documents, codebases, or entire books.
  • Open weights: Can be run on your own infra with no API restrictions.

On Northflank, you can deploy it securely, scale on demand, and avoid rate limits.

How to deploy DeepSeek-V3.1 on Northflank

You have two options: one-click template or manual setup.

1️⃣ Option 1: One-click deploy

  1. Create a Northflank account

    Sign up and enable GPU regions.

  2. Select the template

    From the template catalog, click Deploy DeepSeek-V3.1 128K on 8×H200 Now.

  3. Deploy stack

    • Creates a vLLM service with a mounted volume for the 671B model.
    • Deploys Open WebUI with persistent storage for user data.
  4. Wait for load

    The vLLM service downloads and shards the model across GPUs. First load takes ~45–60 minutes.

  5. Open WebUI

    Navigate to the assigned code.run domain.

  6. Create your account and start interacting with DeepSeek-V3.1 in chat or think mode.

You’ll also get an OpenAI-compatible endpoint to use with any client library.
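
If you want a quick smoke test of that endpoint before wiring up an application, you can list the models the vLLM service exposes. This is a minimal sketch; the hostname is a placeholder for your service's code.run domain, and the key is the one generated by the template.

from openai import OpenAI

client = OpenAI(
    api_key="<your OPENAI_API_KEY>",
    base_url="https://<vllm-service>.code.run/v1",
)

# vLLM's OpenAI-compatible server lists the served model(s) under /v1/models
print([m.id for m in client.models.list()])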

2️⃣ Option 2: Manual deployment

1. Create a GPU-enabled project

  • In Northflank dashboard → Create Project.
  • Name: deepseek-v31.
  • Region: select one with H200 GPUs.

2. Deploy vLLM service

  • Create a new Deployment service → deepseek-v31-vllm.

  • Source: External image

    vllm/vllm-openai:deepseek
  • Runtime variable:

    • OPENAI_API_KEY → generate a 128-character random key (see the snippet after this step).
  • Networking:

    • Add port 8000, protocol HTTP, expose publicly.
  • Compute:

    • 8 × NVIDIA H200 GPUs.
  • Advanced → command:

    sleep 1d

    (This keeps the container idle so you can shell in and start the model download manually in step 4.)
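
A quick way to generate the 128-character key for the OPENAI_API_KEY runtime variable is Python's standard library (any sufficiently long random string works):

import secrets

# Prints a 128-character hex string to paste into the runtime variable
print(secrets.token_hex(64))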

3. Attach persistent storage

  • Add volume deepseek-models.
  • Size: 1TB.
  • Mount path: /root/.cache/huggingface.
  • Attach to vLLM service.

4. Download and serve model

In the service shell, run:

export HF_HUB_ENABLE_HF_TRANSFER=1   # enable accelerated downloads via hf-transfer
pip install --upgrade transformers torch hf-transfer
vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8   # shard the model across all 8 GPUs

To run the same steps automatically as the service's start command (replacing sleep 1d):

bash -c "export HF_HUB_ENABLE_HF_TRANSFER=1 && pip install --upgrade transformers torch hf-transfer && vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8"
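
The first load can take a while, so it helps to know when the server is actually ready. Here is a minimal sketch that polls vLLM's /v1/models endpoint until it responds; it assumes port 8000 is exposed on the service's code.run domain and that OPENAI_API_KEY is set in your local environment.

import os
import time

import requests

BASE_URL = "https://<vllm-service>.code.run"  # replace with your service's domain
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

while True:
    try:
        r = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=10)
        if r.ok:
            print("Serving:", [m["id"] for m in r.json()["data"]])
            break
    except requests.RequestException:
        pass  # service not reachable yet
    print("Model still loading, retrying in 60s...")
    time.sleep(60)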

5. Deploy Open WebUI

  • New service: deepseek-v31-webui.

  • Image:

    ghcr.io/open-webui/open-webui:latest
  • Volume: attach persistent storage for user accounts and chat history.

  • Port: 8080, expose publicly.

  • Env vars:

    • OPENAI_API_BASE_URL=https://<vllm-service>.code.run/v1
    • OPENAI_API_KEY=<same key you set on the vLLM service>

6. Test via API

Example (Python):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://<vllm-service>.code.run/v1",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "user", "content": "Explain DeepSeek-V3.1's benefits"}
    ]
)

print(resp.choices[0].message.content)
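
To try the hybrid thinking mode over the same API, vLLM can forward extra arguments to the model's chat template through extra_body. The exact flag name depends on DeepSeek-V3.1's chat template; the sketch below assumes a thinking switch, so treat it as a starting point rather than a guaranteed interface.

# Hedged sketch: request "Think" mode for a single call, reusing the client above.
# Assumes the chat template accepts a `thinking` flag via chat_template_kwargs.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "user", "content": "Outline a 3-step plan to debug a flaky test"}
    ],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
print(resp.choices[0].message.content)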

Cost of deploying DeepSeek-V3.1

How much does it cost to self-host DeepSeek-V3.1?

Many teams choose to self-host DeepSeek-V3.1 for cost efficiency and data privacy. Northflank makes it easy to deploy or self-host DeepSeek-V3.1 without infrastructure headaches.

Running DeepSeek-V3.1 at production scale requires 8× H200 GPUs.

Northflank GPU pricing (as of August 2025):

  • H200: ~$3.20/hour per GPU
  • 8×H200 = ~$25.60/hour

Token cost equivalent with vLLM optimizations:

  • Input: ~$0.10 per 1M tokens
  • Output: ~$2.20 per 1M tokens
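
As a sanity check, you can back out the aggregate throughput those prices imply from the hourly GPU cost above. This is a back-of-envelope sketch that attributes the whole 8×H200 bill to output tokens, so the real requirement is somewhat lower once input tokens are counted too.

# Back-of-envelope: throughput implied by the quoted prices
hourly_cost = 8 * 3.20        # ~$25.60/hour for 8x H200
output_price = 2.20           # ~$2.20 per 1M output tokens

tokens_per_hour = hourly_cost / output_price * 1_000_000
print(f"~{tokens_per_hour / 1e6:.1f}M output tokens/hour, "
      f"~{tokens_per_hour / 3600:,.0f} tokens/sec aggregate")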

You pay only for the GPUs and storage you run; there are no hidden charges.

This makes Northflank one of the most cost-efficient platforms for MoE inference at scale.

DeepSeek-V3.1 vs earlier versions

DeepSeek has iterated quickly, with each release pushing reasoning, speed, and usability forward.

DeepSeek-V3.1 vs DeepSeek-V3

  • Architecture: Both use a 671B Mixture-of-Experts design with ~37B active parameters per forward pass.
  • Context window: V3 had 64K tokens, while V3.1 doubles this to 128K tokens.
  • Performance: V3.1 runs more efficiently on H200 GPUs thanks to FP8 (UE8M0) optimizations.
  • Inference modes: V3 supported standard chat-style inference only. V3.1 introduces hybrid inference with both chat and think modes.
  • Reasoning: V3 was capable but slower at multi-step reasoning. V3.1 improves both speed and accuracy in reasoning-heavy tasks.

👉 Verdict: DeepSeek-V3.1 is a direct upgrade: more context, faster reasoning, and flexible inference modes.

DeepSeek-V3.1 vs DeepSeek-R1

  • Purpose: R1 was tuned specifically for chain-of-thought reasoning using reinforcement learning. V3.1 integrates those improvements into a general-purpose model.
  • Context window: R1 was limited to 64K. V3.1 expands this to 128K tokens.
  • Speed: R1 reasoning was accurate but often slower. V3.1’s “Think” mode is faster while maintaining quality.
  • Flexibility: R1 forced reasoning-heavy outputs. V3.1 gives you a toggle between fast chat and deep reasoning.
  • Agent performance: V3.1 shows stronger results on tool use and multi-step tasks compared to R1.

👉 Verdict: DeepSeek-V3.1 replaces R1 by offering reasoning at higher speed, with the option to switch back to standard inference.

Final thoughts

DeepSeek-V3.1 represents a leap forward in open-weight reasoning models: hybrid inference, faster chain-of-thought, and a 128K context.

On Northflank, you can run it securely, scale across H200 GPUs, and interact through an OpenAI-compatible API or a user-friendly WebUI, with no rate limits.

👉 Deploy DeepSeek-V3.1 on Northflank now
