
Self-host vLLM in your own cloud account with Northflank BYOC
If you're building AI-powered applications and services, you need high-performance inference, scalability, and fast iteration. vLLM provides an optimized inference engine that serves models with an OpenAI-compatible API endpoint, allowing you to seamlessly integrate AI using familiar API patterns and tooling.
In this guide we’ll cover deploying vLLM into your own cloud provider account using Northflank. You can deploy CPU and GPU-enabled workloads into your own GCP, AWS, Azure, and other accounts, with all the advantages of the Northflank platform for infrastructure management and developer experience.
Northflank’s bring your own cloud (BYOC) allows you to maintain control over your data residency, networking, security, and cloud expenses. You can deploy the same Northflank workloads and projects across any cloud provider without having to change a single configuration detail.
This example deploys DeepSeek R1 Qwen 1.5B, but you can run any model you want as a CPU or GPU workload.
Prerequisites
- A Northflank account
- An account with a supported cloud provider
- Python installed on your local machine (optional)
One-click deploy
You can deploy vLLM in your own cloud provider account with one click using our stack templates. It deploys a new cluster into your cloud account with GPU nodes, and then deploys vLLM into a new project, with commands to download and run DeepSeek R1 Qwen 1.5B.
Set up your cluster
First you’ll need to integrate your cloud provider account with Northflank. This will let you deploy Northflank-managed clusters into your chosen cloud provider and fine-tune your infrastructure and networking to suit your requirements.
- Create a new integration in your team
- Give it a name and select your provider
- Follow the instructions and create the integration
Next you’ll need to provision a new cluster in your cloud account using Northflank.
- Give it a name, select the provider, and choose your credentials
- Choose the region to deploy your cluster into. You’ll need to have the quota to deploy your desired node types in the region.
- Add your node pools. Each cluster requires at least one node pool, and a combined minimum of 4 vCPU and 8GB memory across all node pools. To enable the deployment of GPU workloads, add a node pool with GPU-enabled node types.
- Create your new cluster and wait for it to provision (this can take 15-30 minutes)
Learn more about configuring clusters on Northflank.
Deploy vLLM
Now we can create a new project on our cluster and deploy vLLM into it.
- In your Northflank account create a new project
- Select bring your own cloud and enter a name
- Choose your cluster and click create
We’ll deploy vLLM using the image from Docker Hub. This contains all the functionality we need out of the box, and we can configure it using environment variables, by changing the startup command, and by running commands in the shell.
- Deploy a new service in your project, and select deployment
- Enter the name openai-vllm for the service and select external image as the deployment source
- Enter the path vllm/vllm-openai:latest
- (Optional) Add a runtime variable with the key VLLM_API_KEY and click the key to generate a random value. Select a length of 128 or greater and copy this to the environment variable value. This key will be required to access API endpoints for your service.
- In networking, add a port, set it to 8000, and select http as the protocol. For this guide, choose to publicly expose the port.
- Choose a deployment plan with sufficient resources for your requirements. For example, to run DeepSeek R1 Qwen 1.5B we’ve chosen nf-compute-400, which gives us 4 dedicated vCPU and 8GB memory
- Select an available GPU from the dropdown with a count of 1 to give your workloads access to a GPU
- In advanced options, change the Docker runtime mode and select custom command. Enter sleep 1d so the container starts without vLLM loading the default model
- Finalise with create service
After creating your service you’ll be taken to its overview. Since containers on Northflank are ephemeral, any data will be lost when a redeployment is triggered.
To avoid having to re-download models you can add a volume to your service. To do this:
- Click volumes in the left-hand menu in your service dashboard
- Click add volume, and name it something like vllm-models
- Adjust the storage size to meet your requirements; for this example 10GB will be sufficient
- For Hugging Face model downloads, set the mount path to /root/.cache/huggingface/hub
- Finalise with create & attach volume
Next, we need to download the model and serve it using vLLM:
- Open the shell in one of your service’s running instances from the service overview or observability dashboard. You can click the terminal button, or select an instance and open the shell page.
- In the shell, download the model using the Hugging Face CLI command:
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- After the model is downloaded you can serve it with the vLLM command:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Use vLLM
Now we can test our vLLM instance to make sure it’s working as intended. To connect to it you can use the public code.run URL generated by Northflank, found in the header of the service.
Copy the URL from your service header and paste it into your browser address bar. Append the /v1/models endpoint to it (for example https://<your-endpoint>.code.run/v1/models) and request it to see a list of available models.
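The same request can be made from a terminal with curl (replace the hostname with your own code.run domain):
curl https://<your-endpoint>.code.run/v1/models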
You can use curl to test requests to other endpoints as well. For example, for chat completions you can try:
curl https://<your-endpoint>.code.run/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
If you added an API key, you'll need to include it when querying the endpoint.
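vLLM's OpenAI-compatible server expects the key as a bearer token in the Authorization header, so an authenticated request looks something like this (with <your-api-key> standing in for the value you generated for VLLM_API_KEY):
curl https://<your-endpoint>.code.run/v1/models \
-H "Authorization: Bearer <your-api-key>"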
If you don’t want to publicly expose your instance, you can leave the port private so that only your other Northflank project resources will be able to access it using its private endpoint (openai-vllm:8000, for example).
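For example, another workload running in the same project could reach it over the private network with a request like this (using the service name and port from this guide):
curl http://openai-vllm:8000/v1/models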
Alternatively you can forward your service using the Northflank CLI.
vLLM is highly configurable so that you can optimise your settings and run models efficiently.
You can use arguments when starting vLLM to make use of multiple GPUs and change your configuration, for example:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--tensor-parallel-size 8 \
--max-model-len 1024 \
--gpu-memory-utilization 0.8 \
--dtype bfloat16 \
--enforce-eager
And you can add the following environment variables to the service to help with debugging:
VLLM_LOGGING_LEVEL="DEBUG" # sets vLLM's logging level to DEBUG
VLLM_TRACE_FUNCTION="1" # function-level tracing within vLLM
CUDA_LAUNCH_BLOCKING="1" # executes CUDA operations synchronously, which slows execution
NCCL_DEBUG="TRACE" # for multi-GPU debugging
You can change the download and serve commands to run any model available on Hugging Face with vLLM. To run larger models you will need to increase the resources available to your services, such as the compute plan, ephemeral disks, and persistent volume storage.
You can override the command to launch the container and run the model you want with the parameters you require on startup. In the stack templates the vLLM service includes an entrypoint script mounted using a secret file, which is then run to start the container.
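As a rough sketch, an entrypoint script of this kind could download the model and then start the server (the model and arguments below are the ones used in this guide, not necessarily those in the stack template):
#!/bin/bash
set -e

# Download the model into the mounted Hugging Face cache (reused if already present)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# Start the OpenAI-compatible vLLM server on the port exposed by the service
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 8000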
You can deploy larger node types to your cluster, and create custom plans to make full use of a node's compute resources.
Northflank allows you to deploy your code and databases within minutes. Sign up for a Northflank account and create a free project to get started.
- Bring your own cloud account
- Deployment of Docker containers
- On-demand GPU instances
- Observe & monitor with real-time metrics & logs
- Low latency and high performance
- Managed databases, storage, and persistent volumes