Daniel Cosby
Published 10th March 2025

Deploy DeepSeek R1 with vLLM on Northflank

If you're deploying LLMs like DeepSeek R1, you need high-performance inference, scalability, and fast iteration. vLLM provides an optimized, OpenAI-compatible serving engine, making it easy to deploy and interact with models using familiar API patterns.

This guide walks you through deploying DeepSeek R1 Qwen 1.5B - and other LLMs - on Northflank using vLLM, a high-performance serving engine for large language models. Because vLLM exposes an OpenAI-compatible API endpoint, you can integrate with your deployment using the same OpenAI client libraries and tooling you already use.

With Northflank's streamlined deployment process and on-demand GPU infrastructure, you can have your model up and running in minutes rather than spending hours on configuration and setup. The platform handles the complex infrastructure requirements, allowing you to focus on developing your AI applications and shipping new features.

The key benefits of deploying on Northflank include:

  • Simplified GPU resource management
  • Easy scaling capabilities
  • Usage-based billing
  • Persistent storage for models
  • Integrated monitoring and logging

You can deploy GPU-enabled workloads on Northflank’s cloud, or bring your own cloud account and deploy into GCP, AWS, Azure, and other providers.

First we’ll cover deploying vLLM manually, then look at how to deploy vLLM into your own cloud provider, and finally how to adapt our stack template to deploy other models and target other clouds.

Prerequisites

  1. A Northflank account
  2. Python installed on your local machine (optional)

One-click deploy

You can deploy vLLM and run DeepSeek - or any other model - using our stack template with one click.

Deploy vLLM

Create a new GPU-enabled project

  1. In your Northflank account create a new project
  2. Give it a name, select a GPU-enabled region, and click create

Create a vLLM deployment

First we’ll deploy vLLM using the image from Docker Hub. This contains all the functionality we need out of the box, and we can configure it with environment variables, by changing the startup command, and by running commands in the shell.

  1. Deploy a new service in your project, and select deployment
  2. Enter the name openai-vllm for the service and select external image as the deployment source
  3. Enter the path vllm/vllm-openai:latest
  4. Add a runtime variable with the key OPENAI_API_KEY and click the key to generate a random value. Select length 128 or greater and copy this to the environment variable value.
  5. In networking, add a port, set it to 8000 with HTTP as the protocol. For this guide, choose to publicly expose the port.
  6. Choose a deployment plan with sufficient resources for your requirements. For example, to run DeepSeek R1 Qwen 1.5B we’ve chosen nf-compute-400, which gives us 4 dedicated vCPUs and 8GB memory
  7. Select A100 from the GPU dropdown with a count of 1, to give instances access to an NVIDIA A100 GPU
  8. In advanced options, change the Docker runtime mode and select custom command. Enter sleep 1d so the container starts without launching vLLM and loading the default model.
  9. Finalise with create service

Persist models

After creating your service you’ll be taken to its overview. Since containers on Northflank are ephemeral, any data will be lost when a redeployment is triggered.

To avoid having to re-download models you can add a volume to your service. To do this:

  1. Click volumes in the left-hand menu in your service dashboard
  2. Click add volume, and name it something like vllm-models
  3. Adjust the storage size to meet your requirements; for this example, 10GB will be sufficient
  4. For Hugging Face model downloads, set the mount path as /root/.cache/huggingface/hub
  5. Finalise with create & attach volume

Download and serve models

Next, we need to download the model and serve it using vLLM.

  1. Open the shell in one of your service’s running instances from the service overview or observability dashboard. You can click the terminal button, or select an instance and open the shell page.
  2. In the shell, download the model using the Hugging Face CLI command: huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  3. After the model is downloaded you can serve it with the vLLM command: vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
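
For reference, here are the download and serve commands together as you would run them in the instance shell:

# Downloads the model weights into the Hugging Face cache (persisted by the volume mounted earlier)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# Starts the OpenAI-compatible server on port 8000 with the downloaded model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B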

Use vLLM

Now we can test our vLLM instance to make sure it’s working as intended. To connect to it you can use the public code.run URL generated by Northflank, found in the header of the service.

Copy the URL from your service header and paste it into your browser address bar. Append the /v1/models endpoint to it (for example https://api--openai-vllm--lbn9y6p7pz7t.code.run/v1/models) and request it to see a list of available models.
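
If you prefer the command line, a quick check with curl looks like the sketch below. The hostname is the example code.run URL from above, so substitute your own service's URL, and only send the Authorization header if you've configured your vLLM server to require an API key:

# List the models served by your vLLM instance (example URL; replace with your own)
curl https://api--openai-vllm--lbn9y6p7pz7t.code.run/v1/models
# If your server enforces an API key, pass it as a bearer token:
# curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api--openai-vllm--lbn9y6p7pz7t.code.run/v1/models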

Interact with models

You can interact with models served by your vLLM instance using OpenAI API endpoints. To achieve this locally you can write applications using Python, TypeScript, JavaScript, .NET, Java, or Golang with the official OpenAI REST API libraries. Community libraries also exist for many more popular languages.

For this example we’ll use Python. Follow along with the guide, or clone our example repo. You can run this locally, or create a Git repository and build and run it on Northflank as a combined service.

  1. Create a new directory for your project and add the following files:

    ai-project/
    ├── .env
    ├── Dockerfile
    └── src/
        └── main.py
  2. In the .env file add your environment variables:

    # The API key in your service's runtime variables
    OPENAI_API_KEY=your_api_key_here
    # The URL from your vLLM service's header
    OPENAI_API_BASE="https://your-vllm-instance-url/v1"
    # The model you downloaded and served with vLLM
    MODEL="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
  3. Create a Dockerfile to build your application:

    FROM python:3.9-slim
    
    WORKDIR /app
    
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    COPY . .
    
    CMD ["python", "src/main.py"]
  4. Add a requirements.txt file:

    openai>=1.65.1
    python-dotenv>=1.0.0
  5. Add the following to your main.py:

    import os
    from dotenv import load_dotenv
    from openai import OpenAI
    
    load_dotenv()
    
    client = OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY"),
        base_url=os.environ.get("OPENAI_API_BASE"),
    )
    
    completion = client.completions.create(
        model=os.environ.get("MODEL"),
        prompt="San Francisco is a"
    )
    
    print("Completion result:", completion)
    
    chat_response = client.chat.completions.create(
        model=os.environ.get("MODEL"),
        messages=[
            {"role": "user", "content": "Why is the sky blue?"}
        ]
    )
    
    print("Chat response:", chat_response)

You can use python src/main.py to run the application in your terminal, and see the response generated by the model running in your vLLM instance.
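
If you're running the example locally rather than building the container image, install the dependencies before running it:

# Install the OpenAI client and python-dotenv, then run the example
pip install -r requirements.txt
python src/main.py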

When you restart your vLLM service, you’ll need to access the shell and serve the model again. To avoid this you can set a custom entrypoint and command.

Set your entrypoint to /bin/bash -c and your command to vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B to run the model when your instances deploy.

Access vLLM using private networking

If you don’t want to publicly expose your instance, you can leave the port private so that only your other Northflank project resources will be able to access it using its private endpoint (openai-vllm:8000 for example).
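
For example, another workload in the same project could reach the server over the private network like this (a sketch, assuming the service name openai-vllm and port 8000 used earlier):

# Calls the vLLM service over Northflank's private network; no public exposure required
curl http://openai-vllm:8000/v1/models
# The Python example works the same way if you point OPENAI_API_BASE at "http://openai-vllm:8000/v1"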

Alternatively you can forward your service using the Northflank CLI.

Optimise and serve other models

vLLM is highly configurable so that you can optimise your settings and run models efficiently.

You can use arguments when starting vLLM to make use of multiple GPUs and change your configuration, for example:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --tensor-parallel-size 8 \
  --max-model-len 1024 \
  --gpu-memory-utilization 0.8 \
  --dtype bfloat16 \
  --enforce-eager

And finally, you can add the following environment variables to help with debugging:

VLLM_LOGGING_LEVEL="DEBUG" # sets vLLM's logging level to DEBUG
VLLM_TRACE_FUNCTION="1" # function-level tracing within vLLM
CUDA_LAUNCH_BLOCKING="1" # executes CUDA operations synchronously, makes execution slow
NCCL_DEBUG="TRACE" # for multi-GPU debugging

Deploy other models

You can change the commands to download the model from Hugging Face and serve it using vLLM. To run larger models you will need to increase the resources available to your services such as the compute plan, ephemeral disks, and persistent volume storage.
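
For example, to serve the 7B distill instead of the 1.5B model, only the model ID changes (expect to need a larger compute plan, more GPU memory, and more volume storage than the 10GB used above):

# Download and serve a larger model by swapping the Hugging Face model ID
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B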

Deploy GPU workloads in your AWS, GCP, Azure cloud accounts

You can bring your own cloud account (BYOC) to Northflank and gain all the advantages of the Northflank platform for infrastructure management and developer experience. See the guide to self-host vLLM with BYOC.

BYOC allows you to maintain control over your data residency, networking, security, and cloud expenses. You can deploy the same Northflank workloads and projects across any cloud provider without having to change a single configuration detail.

Simply integrate your cloud account and create a cluster to get started.

You can add a node pool of spot or on-demand GPU nodes and choose resources that suit your requirements. Check the list of available GPU node types from other providers.

Northflank allows you to deploy your code and databases within minutes. Sign up for a Northflank account and create a free project to get started.

  • Deployment of Docker containers
  • On-demand GPU instances
  • Observe & monitor with real-time metrics & logs
  • Low latency and high performance
  • Managed databases, storage, and persistent volumes