Will Stewart
Published 11th June 2025

How to deploy machine learning models: Step-by-step guide to ML model deployment in production

Deploying a machine learning model is the last, and hardest, step in the ML lifecycle. You’ve trained your model, tuned your hyperparameters, and now it’s time to move from experimentation to production. This guide walks through the full process of ML model deployment, including containerization, CI/CD, and infrastructure setup, with examples using Northflank.

💡 TL;DR (if you’re in a rush)

Model deployment means taking a trained ML model and making it available in a production environment, usually as an API or part of a larger application. The challenge isn’t the model. It’s everything else: infra, security, CI/CD, observability, latency guarantees, rollout strategies, and update pipelines.

Platforms like Northflank give you a framework to manage that complexity without giving up control. You still own the model, the logic, the lifecycle, while offloading the infrastructure burden.

What is model deployment in machine learning?

Model deployment is the process of serving your trained machine learning model so it can actually be used by users, apps, or systems. It usually means:

  • Packaging the model (e.g., as a Python app, container, or microservice)
  • Deploying it somewhere users or applications can access (e.g., behind an API)
  • Making sure it runs consistently and reliably in production

You might deploy the model as a REST API, a batch job, a streaming service, or embed it in an existing product. Either way, deployment is about making the model useful, turning your .pkl file into something real.

Then there’s everything you need to orchestrate:

  • Model artifact versioning
  • Model-serving logic (loading, preprocessing, inference, postprocessing)
  • Runtime dependencies (CUDA, transformers, custom layers)
  • Interfaces (HTTP APIs, batch queues, streaming sinks)
  • Resource isolation and scaling
  • Monitoring and alerting
  • Deployment rollback mechanisms

Running inference isn't enough. You need infrastructure that can handle it at scale and under real-world constraints.

Why is ML model deployment hard in production environments?

Most ML work happens in notebooks, not systems. Your model might depend on a random seed or NumPy version. Your preprocessing code might live in a different repo. Your training pipeline might hardcode file paths to an S3 bucket in your personal AWS account. All of this is invisible until something breaks.

The typical pain points:

  • Non-deterministic builds (e.g., package resolution in pip install causing subtle shifts in behavior)
  • No hash-based versioning for model artifacts (see the sketch below)
  • Fragile dependency trees that assume a local dev setup
  • Inference that silently degrades under high load (e.g., batch size changes model output due to floating-point artifacts)

And most ML “deployments” are glue scripts running in a VM with no alerting, no retries, no rollback.
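Some of these are cheap to fix. For the versioning gap, a minimal sketch of content-addressed model artifacts (the artifact path here is hypothetical) might look like this:

import hashlib

def artifact_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a model artifact, streamed in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record this hash wherever you track deployments (logs, tags, a registry)
print(artifact_hash("model.pt"))

Logging that hash at startup and alongside predictions makes it possible to tie any output back to the exact weights that produced it.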

How to deploy a machine learning model step-by-step

  1. Train your model and export it (e.g., using torch.save; see the sketch below this list).
  2. Create an inference script (e.g., a FastAPI server).
  3. Containerize the inference app using Docker.
  4. Set up CI/CD pipelines for versioned deployment.
  5. Add monitoring/logging (e.g., request/response logging, latency tracking).
  6. Deploy to a cloud environment.
  7. Automate testing, validation, and rollback.
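For step 1, a minimal export might look like the following (the model architecture and file name are illustrative; in practice you save whatever you trained):

import torch
import torch.nn as nn

# Illustrative model; in practice this is the model you trained
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

# Save only the weights; the serving code reconstructs the architecture
torch.save(model.state_dict(), "model.pt")

# At serving time: rebuild the architecture, load the weights, switch to eval mode
serving_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
serving_model.load_state_dict(torch.load("model.pt", map_location="cpu"))
serving_model.eval()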

This stack is fairly universal, but still leaves you to build out all the infra: Kubernetes manifests, deployment configs, TLS setup, scaling policies, metrics dashboards.

Best practices for ML model deployment (with examples)

  1. Containerize everything

    • Wrap your model in a Docker image so it runs the same everywhere.

    💡 Northflank builds from your repo and containerizes it automatically.

  2. Use CI/CD for ML model deployments

    • Don’t manually upload models. Use Git pushes to trigger builds.

    💡 Northflank handles builds, deploys, and rollbacks from your commits.

  3. Expose models behind clean, versioned APIs

    • Make it callable from other services. Version it. Don’t break consumers. (A minimal sketch follows this list.)

    💡 Northflank gives you automatic TLS, subdomains, and deploy previews.

  4. Automate retraining and redeployments

    • Set up pipelines for data drift detection, retraining, and redeployment.

    💡 You can trigger jobs in Northflank via Git or API.

  5. Monitor everything

    • Log inputs and outputs. Track latency, errors, usage.

    💡 Northflank comes with built-in logging and metrics dashboards.

  6. Secure it properly

    • Use fine-grained access controls, secret management, and sandboxing.

    💡 Northflank has secure environments, encrypted secrets, and microVM support.

  7. Don’t glue infra together with duct tape

    • Avoid bespoke scripts or managing your own Kubernetes for one model.

    💡 Northflank abstracts the infra but still gives you full control when you need it.
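As a minimal sketch of practices 3 and 5 combined (the model call is stubbed out and the names are illustrative), a versioned endpoint with request and latency logging might look like:

import logging
import time

from fastapi import FastAPI
from pydantic import BaseModel

logger = logging.getLogger("inference")
app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def run_model(text: str) -> dict:
    # Stand-in for your real inference call
    return {"label": "positive", "confidence": 0.98}

@app.post("/v1/predict")
async def predict_v1(req: PredictRequest):
    start = time.perf_counter()
    result = run_model(req.text)
    latency_ms = (time.perf_counter() - start) * 1000
    # Log input size, output, and latency so regressions are visible in production
    logger.info("v1 predict chars=%d latency_ms=%.1f result=%s", len(req.text), latency_ms, result)
    return result

Keeping the version in the path (/v1/predict) means you can ship /v2 alongside it and migrate consumers on their own schedule.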

Step-by-step: Deploying an ML model with Northflank

Here’s how to deploy a machine learning model using Northflank.

Prerequisites

  • A trained model (we’ll use a Hugging Face Transformers model with PyTorch for this example)
  • Codebase in GitHub or GitLab
  • Dockerfile in the root of your repo
  • Basic understanding of containerized apps (FastAPI, Flask, etc.)

1. Prepare the inference app

Write a FastAPI app that loads your model and handles POST requests:

from fastapi import FastAPI, Request
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

app = FastAPI()

# Load the model and tokenizer once at startup, not on every request
model = AutoModelForSequenceClassification.from_pretrained("my-bert-model")
tokenizer = AutoTokenizer.from_pretrained("my-bert-model")
model.eval()

@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    # Tokenize the input text and run a single forward pass without gradients
    inputs = tokenizer(body["text"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
    return {"confidence": probs.tolist()}

2. Write a Dockerfile

FROM python:3.10-slim

# Install the runtime dependencies for the inference server
RUN pip install fastapi uvicorn transformers torch python-multipart

WORKDIR /app
COPY . /app

# Serve the FastAPI app on the port the platform exposes
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

3. Push code to GitHub

Make sure your repo is structured correctly and includes all dependencies and configs. Northflank integrates directly with your Git provider.

4. Create a Northflank service


  • Log in to Northflank
  • Create a new service and connect your Git repo
  • Select Dockerfile as the build method
  • (Optional: if your app loads weights from disk or uses local caching, set environment variables like MODEL_PATH or TRANSFORMERS_CACHE)

5. Configure builds and deployments

  • Enable auto-deploy on push to main
  • Use preview environments for PRs
  • (Optional: add a /predict health check with a known input to catch silent failures; see the sketch below.)
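That health check can be as small as a script that sends a fixed input and checks the response shape. A minimal sketch, assuming the requests package and a hypothetical service URL:

import requests

# Hypothetical public URL for your service, plus a known, fixed input
URL = "https://my-model.example.com/predict"
KNOWN_INPUT = {"text": "this product is great"}

resp = requests.post(URL, json=KNOWN_INPUT, timeout=10)
resp.raise_for_status()

payload = resp.json()
# Fail loudly if the response no longer has the shape consumers depend on
assert "confidence" in payload, f"unexpected response: {payload}"
print("health check passed:", payload)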

6. Monitor the deployment


  • Access logs in real time via the dashboard
  • Track CPU/memory usage and request latency
  • Use Prometheus-compatible metrics for observability
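If you want the service itself to expose Prometheus-compatible metrics (rather than relying only on platform-level dashboards), a minimal sketch using the prometheus_client package might look like this; the inference call is stubbed out:

import time

from fastapi import FastAPI, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

REQUESTS = Counter("predict_requests_total", "Total prediction requests")
LATENCY = Histogram("predict_latency_seconds", "Prediction latency in seconds")

@app.get("/metrics")
def metrics():
    # Prometheus scrapes this endpoint in its text exposition format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
async def predict():
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # stand-in for the real inference call
        return {"confidence": [[0.02, 0.98]]}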

7. Roll out new model versions

  • Push a new version with updated weights or code
  • Use preview environments to validate behavior
  • Promote the new version to production with one click
  • Roll back if metrics regress

8. Schedule batch jobs (optional)

For batch inference or retraining workflows:

  • Create a job service in Northflank
  • Trigger on cron schedule or via webhook
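A batch job is usually just a script with a clear start and end: read inputs, run the model, write results. A minimal sketch (file paths and the model call are illustrative; on Northflank this would run as a job on a cron schedule or webhook trigger):

import csv

def run_model(text: str) -> float:
    # Stand-in for loading and calling your real model
    return float(len(text) % 2)

def main(in_path: str = "inputs.csv", out_path: str = "predictions.csv") -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "text", "score"])
        writer.writeheader()
        for row in reader:
            # Score each record and persist the result for downstream consumers
            writer.writerow({"id": row["id"], "text": row["text"], "score": run_model(row["text"])})

if __name__ == "__main__":
    main()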

Common mistakes to avoid

❌ Serving models directly from notebooks

❌ Ignoring dependency management (your requirements.txt will betray you)

❌ Hardcoding secrets (they will leak)

❌ Skipping monitoring (“it works” is not a metric)

❌ Building a one-off deployment pipeline you forget how to maintain

💡 FAQs

Machine learning deployment

  1. What is model deployment in machine learning?
    It’s the process of making a trained model available for use, typically by wrapping it in an API or embedding it in a product or service.

  2. How do you deploy a machine learning model?
    Typically, you export the model, wrap it in a server (e.g., FastAPI), containerize it with Docker, deploy it to a platform (like Northflank), and then monitor and maintain it.

  3. What’s the best way to keep environments consistent?
    Use container builds with pinned dependency versions. Northflank builds from Git so the same image gets tested, previewed, and shipped.

  4. Can I use Northflank for batch inference?
    Yes. You can run jobs or services depending on your use case.

  5. What about GPUs?
    You can deploy to GPU-enabled nodes in your own cloud using Northflank’s BYOC model.

  6. How does Northflank compare to managed ML platforms?
    It gives you infrastructure primitives (builds, deploys, environments) without locking you into an ML-specific abstraction.

Final thoughts

ML deployment needs versioned control over code, dependencies, data, and rollout strategy. If you can’t reproduce your model or trace its outputs, it’s not production.

Northflank integrates with your Git repo, builds clean containers, offers isolated deploys, and supports GPU jobs in your own cloud. It’s infrastructure for teams who want to ship machine learning models without reinventing the backend.

Deploy your first ML model on Northflank today.
