
Self-host vLLM in your own cloud account with Northflank BYOC
If you're building AI-powered applications and services, you need high-performance inference, scalability, and fast iteration. vLLM provides an optimized inference engine that serves models with an OpenAI-compatible API endpoint, allowing you to seamlessly integrate AI using familiar API patterns and tooling.
In this guide we’ll cover deploying vLLM into your own cloud provider account using Northflank. You can deploy CPU and GPU-enabled workloads into your own GCP, AWS, Azure, and other accounts, with all the advantages of the Northflank platform for infrastructure management and developer experience.
Northflank’s bring your own cloud (BYOC) allows you to maintain control over your data residency, networking, security, and cloud expenses. You can deploy the same Northflank workloads and projects across any cloud provider without having to change a single configuration detail.
This example deploys DeepSeek R1 Qwen 1.5B, but you can run any model you want as a CPU or GPU workload.
Prerequisites
- A Northflank account
- An account with a supported cloud provider
- Python installed on your local machine (optional)
One-click deploy
You can deploy vLLM in your own cloud provider account with one click using our stack templates. It deploys a new cluster into your cloud account with GPU nodes, and then deploys vLLM into a new project, with commands to download and run DeepSeek R1 Qwen 1.5B.
Set up your cluster
First you’ll need to integrate your cloud provider account with Northflank. This will let you deploy Northflank-managed clusters into your chosen cloud provider and fine-tune your infrastructure and networking to suit your requirements.
- Create a new integration in your team
- Give it a name and select your provider
- Follow the instructions and create the integration
Next you’ll need to provision a new cluster in your cloud account using Northflank.
- Give it a name, select the provider, and choose your credentials
- Choose the region to deploy your cluster into. You’ll need to have the quota to deploy your desired node types in the region.
- Add your node pools. Each cluster requires at least one node pool, and a combined minimum of 4 vCPU and 8GB memory across all node pools. To enable the deployment of GPU workloads, add a node pool with GPU-enabled node types.
- Create your new cluster and wait for it to provision (this can take 15-30 minutes)
Learn more about configuring clusters on Northflank.
Deploy vLLM
Now we can create a new project on our cluster and deploy vLLM into it.
- In your Northflank account create a new project
- Select bring your own cloud and enter a name
- Choose your cluster and click create
We’ll deploy vLLM using the image from Docker Hub. This contains all the functionality we need out of the box, and we can configure it using environment variables, by changing the startup command, and by running commands in the shell.
- Deploy a new service in your project, and select deployment
- Enter the name openai-vllm for the service and select external image as the deployment source
- Enter the path vllm/vllm-openai:latest
- (Optional) Add a runtime variable with the key VLLM_API_KEY and click the key to generate a random value. Select a length of 128 or greater and copy this to the environment variable value. This key will be required to access API endpoints for your service.
- In networking, add a port, set it to 8000, and select http as the protocol. For this guide, choose to publicly expose the port.
- Choose a deployment plan with sufficient resources for your requirements. For example, to run DeepSeek R1 Qwen 1.5B we’ve chosen nf-compute-400, which gives us 4 dedicated vCPU and 8GB memory
- Select an available GPU from the dropdown with a count of 1 to give your workloads access to a GPU
- In advanced options, change the Docker runtime mode and select custom command. Enter sleep 1d so the container starts without vLLM loading the default model
- Finalise with create service
After creating your service you’ll be taken to its overview. Since containers on Northflank are ephemeral, any data will be lost when a redeployment is triggered.
To avoid having to re-download models you can add a volume to your service. To do this:
- Click volumes in the left-hand menu in your service dashboard
- Click add volume, and name it something like vllm-models
- Adjust the storage size to meet your requirements; for this example 10GB will be sufficient
- For Hugging Face model downloads, set the mount path to /root/.cache/huggingface/hub
- Finalise with create & attach volume
Next, we need to download the model and serve it using vLLM:
- Open the shell in one of your service’s running instances from the service overview or observability dashboard. You can click the terminal button, or select an instance and open the shell page.
- In the shell, download the model using the Hugging Face CLI command:
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- After the model is downloaded you can serve it with the vLLM command:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Use vLLM
Now we can test our vLLM instance to make sure it’s working as intended. To connect to it you can use the public code.run URL generated by Northflank, found in the header of the service.
Copy the URL from your service header and paste it into your browser address bar. Append the /v1/models endpoint to it (for example https://<your-endpoint>.code.run/v1/models) and request it to see a list of available models.
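The same request can be made from a terminal with curl (replace the hostname with your own code.run domain):
curl https://<your-endpoint>.code.run/v1/models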
You can use curl to test requests to other endpoints as well. For example, for chat completions you can try:
curl https://<your-endpoint>.code.run/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
If you added an API key, you'll need to include it when querying the endpoint.
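vLLM's OpenAI-compatible server expects the key as a bearer token in the Authorization header, so an authenticated request looks something like this (with <your-api-key> standing in for the value you generated for VLLM_API_KEY):
curl https://<your-endpoint>.code.run/v1/models \
-H "Authorization: Bearer <your-api-key>"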
If you don’t want to publicly expose your instance, you can leave the port private so that only your other Northflank project resources will be able to access it using its private endpoint (openai-vllm:8000, for example).
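For example, another workload running in the same project could reach it over the private network with a request like this (using the service name and port from this guide):
curl http://openai-vllm:8000/v1/models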
Alternatively you can forward your service using the Northflank CLI.
vLLM is highly configurable so that you can optimise your settings and run models efficiently.
You can use arguments when starting vLLM to make use of multiple GPUs and change your configuration, for example:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--tensor-parallel-size 8 \
--max-model-len 1024 \
--gpu-memory-utilization 0.8 \
--dtype bfloat16 \
--enforce-eager
And you can add the following environment variables to the service to help with debugging:
VLLM_LOGGING_LEVEL="DEBUG" # sets vLLM's logging level to DEBUG
VLLM_TRACE_FUNCTION="1" # function-level tracing within vLLM
CUDA_LAUNCH_BLOCKING="1" # executes CUDA operations synchronously, which slows execution
NCCL_DEBUG="TRACE" # for multi-GPU debugging
You can change the download and serve commands to run any model available on Hugging Face with vLLM. To run larger models you will need to increase the resources available to your services, such as the compute plan, ephemeral disks, and persistent volume storage.
You can override the command to launch the container and run the model you want with the parameters you require on startup. In the stack templates the vLLM service includes an entrypoint script mounted using a secret file, which is then run to start the container.
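As a rough sketch, an entrypoint script of this kind could download the model and then start the server (the model and arguments below are the ones used in this guide, not necessarily those in the stack template):
#!/bin/bash
set -e

# Download the model into the mounted Hugging Face cache (reused if already present)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# Start the OpenAI-compatible vLLM server on the port exposed by the service
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 8000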
You can deploy larger node types to your cluster, and create custom plans to make full use of a node's compute resources.
Northflank allows you to deploy your code and databases within minutes. Sign up for a Northflank account and create a free project to get started.
- Bring your own cloud account
- Deployment of Docker containers
- On-demand GPU instances
- Observe & monitor with real-time metrics & logs
- Low latency and high performance
- Managed databases, storage, and persistent volumes