

Running reinforcement learning (RL) agents in secure sandboxes
Running reinforcement learning (RL) agents in secure sandboxes means isolating each training episode in its own containerised environment, so agent actions affect only that episode's state and cannot interfere with other concurrent rollouts.
At production scale, this requires infrastructure that can manage large numbers of these environments in parallel, spin them up and reset them quickly between episodes, and maintain hard isolation boundaries while keeping latency overhead to a minimum.
If you are building or evaluating infrastructure for reinforcement learning (RL) training, this article covers what your sandbox environments need to handle and how to think about each requirement at production scale.
- Production-scale reinforcement learning (RL) training typically runs large numbers of isolated, stateful sandbox environments concurrently. Each environment holds per-episode state, resets between rollouts, and must not affect other concurrently running environments.
- Running RL agents in secure sandboxes at scale involves several infrastructure considerations: container spin-up and reset speed, support for ephemeral and persistent environment modes, orchestration at high concurrency, lightweight isolation between rollouts, and data residency controls for proprietary training data.
- Common infrastructure pitfalls include state leakage between rollouts, reset latency accumulating into throughput loss, resource contention when inference and environment execution are co-located on the same machine, non-deterministic environments, and orchestration overhead that becomes harder to predict and diagnose as rollout counts increase.
Platforms like Northflank are built for this workload. Northflank supports 100,000+ concurrent sandbox environments, environment creation in around 1-2 seconds, both ephemeral and persistent filesystem state per environment, and microVM-based isolation using Kata, Firecracker, and gVisor depending on workload. It also offers self-serve BYOC (Bring Your Own Cloud) deployment that is production-ready, on-demand GPUs without quota requests, and API, CLI, and SSH access.
A reinforcement learning (RL) agent is a system that learns to make decisions by interacting with an environment and receiving a reward signal based on the outcomes of its actions.
The agent observes the current state of the environment, takes an action, receives a reward, and observes the resulting new state. It repeats this loop across many episodes, with the policy being updated periodically based on the collected experience.
Each episode produces a rollout, a complete sequence of agent actions and environment responses from start to termination. In production RL training, many rollouts are collected concurrently to generate enough data to update the agent's policy at each training step.
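The observe-act-reward loop above can be sketched in a few lines. This is a minimal illustration, not a real environment or Northflank API: `env`, its `reset`/`step` methods, and `policy` are hypothetical placeholders for whatever your training stack provides.

```python
# Minimal sketch of collecting one rollout: observe state, act, record reward,
# repeat until the episode terminates. All names here are illustrative.

def collect_rollout(env, policy, max_steps=100):
    """Run one episode and return the trajectory as (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()          # reset the sandbox to a clean baseline first
    for _ in range(max_steps):
        action = policy(state)   # policy maps the observed state to an action
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:                 # environment signalled episode termination
            break
    return trajectory
```

In production, many such loops run concurrently, each against its own isolated environment, and the collected trajectories are batched for the next policy update.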
This is where sandboxes come in. An RL sandbox is an isolated execution environment that maintains its own state for the duration of a rollout and can be reset to a clean baseline between rollouts without affecting other concurrently running environments.
In stateful reinforcement learning (RL) environments, the agent's actions directly modify the environment's state.
Depending on the task, the agent might write files, execute code, modify a database, call external tools, or navigate a simulated interface. That state needs to persist within an episode so the agent experiences the consequences of its actions across multiple steps.
At the same time, that state needs to be isolated from other concurrent rollouts and reset cleanly at the start of each new episode.
Without isolation, a single failed rollout can corrupt the state of other concurrently running environments. Without clean resets, training data becomes harder to reproduce reliably. Without fast spin-up and reset, environment overhead accumulates across episode boundaries and reduces the rate at which training data is collected.
Production RL sandbox environments should provide three properties:
- Isolation: Each rollout runs in its own environment with no shared filesystem, process namespace, or network state between concurrent episodes.
- Stateful resets: The environment holds per-episode state during a rollout, then resets to a known baseline at episode end.
- Reproducibility: Given the same initial state and the same sequence of actions, an environment should produce the same observations and rewards. This is important for debugging, for comparing policy versions, and for training stability.
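The reproducibility property can be made concrete with seeding: if all of an environment's randomness flows from a single seed, resetting with the same seed and replaying the same actions yields identical observations and rewards. The sketch below uses a toy stand-in environment, not a real sandbox API.

```python
import random

# Toy illustration of reproducible resets: all stochasticity is derived from
# one seed, so identical (seed, actions) pairs produce identical trajectories.

class SeededEnv:
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def reset(self, seed):
        self.rng = random.Random(seed)   # reset re-derives state from the seed
        return 0

    def step(self, action):
        # observation and reward depend only on the seeded RNG and the action
        obs = self.rng.randint(0, 100) + action
        reward = float(obs % 10)
        return obs, reward

def replay(env, seed, actions):
    """Reset with a seed, apply a fixed action sequence, return the trajectory."""
    env.reset(seed)
    return [env.step(a) for a in actions]
```

Two calls to `replay(env, 42, [1, 2, 3])` return identical trajectories, which is what makes debugging and policy comparison tractable.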
Northflank for reinforcement learning (RL) sandbox infrastructure
Northflank provides container orchestration for high-concurrency workloads. It supports 100,000+ concurrent sandbox environments, environment creation in around 1-2 seconds, and both persistent and ephemeral filesystem state per environment.
Isolation is microVM-based, using Kata, Firecracker, and gVisor depending on the workload. Bring Your Own Cloud (BYOC) deployment is self-serve and production-ready, with support for deploying inside your own cloud or VPC. On-demand GPUs are available without quota requests, and access is via API, CLI, or SSH.
Northflank has been in production since 2021 across startups, public companies, and government deployments. See how Northflank handles sandbox infrastructure.
The main infrastructure considerations when running RL agents at scale are container lifecycle speed, stateful reset management, CPU and GPU resource separation, high-concurrency orchestration, isolation model selection, and data residency controls.
Full environment creation includes container scheduling, image pulling, runtime initialisation, and application startup. Environment creation time is a factor in training throughput, particularly in high-frequency training loops where environments are created and reset repeatedly across many rollouts.
Reset latency is a separate concern. Even if your container starts quickly, a slow reset routine adds overhead to every episode boundary and compounds into a measurable reduction in training data collection at high rollout counts.
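A back-of-envelope calculation shows how reset latency compounds. The numbers below are illustrative assumptions, not measurements from any particular platform.

```python
# Sketch: how per-episode reset overhead translates into lost throughput
# across a training run. All figures are assumed for illustration.

episode_seconds = 30.0      # assumed useful rollout time per episode
reset_seconds = 2.0         # assumed reset overhead at each episode boundary
num_envs = 10_000           # concurrent environments
wall_clock_hours = 1.0

cycle = episode_seconds + reset_seconds                    # full episode cycle
episodes_per_env = wall_clock_hours * 3600 / cycle
total_episodes = episodes_per_env * num_envs

ideal_episodes = wall_clock_hours * 3600 / episode_seconds * num_envs
overhead_pct = 100 * (1 - total_episodes / ideal_episodes)

print(f"{total_episodes:,.0f} episodes/hour vs {ideal_episodes:,.0f} ideal "
      f"({overhead_pct:.1f}% lost to resets)")
```

With these assumptions, a 2-second reset on a 30-second episode costs roughly 6% of total throughput; shorter episodes or slower resets push that figure up quickly.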
Ephemeral environments are created fresh for each rollout and discarded after the episode ends. They provide a strong isolation guarantee since no state persists between rollouts. See Ephemeral sandbox environments for how this works in practice.
Persistent environments hold state across multiple steps within an episode. Some implementations snapshot that state at checkpoints, allowing the environment to fork from a known point rather than rebuilding from scratch. This suits long-horizon tasks where the agent builds up filesystem state or database records over many actions.
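The ephemeral pattern is commonly wrapped so teardown is guaranteed even when a rollout crashes. The sketch below assumes hypothetical `create_env` and `destroy_env` callables standing in for whatever provisioning API your platform exposes.

```python
from contextlib import contextmanager

# Sketch of the ephemeral mode: a fresh environment per rollout, always
# destroyed afterwards, so no state can leak into the next episode.
# `create_env` and `destroy_env` are illustrative placeholders.

@contextmanager
def ephemeral_env(create_env, destroy_env):
    env = create_env()            # fresh environment, no inherited state
    try:
        yield env
    finally:
        destroy_env(env)          # discarded even if the episode raised
```

Usage is a `with ephemeral_env(create, destroy) as env:` block around each rollout; the `finally` clause is what makes the isolation guarantee hold under failure, not just on the happy path.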
Environment execution is CPU-bound in most cases. Policy inference is GPU-bound and latency-sensitive. Running both on the same machine can create resource contention that degrades inference latency and environment throughput. A common pattern at production scale is to separate inference nodes from environment nodes and connect them via an async API so each scales independently. See Kata Containers vs Firecracker vs gVisor for more on isolation technologies used in multi-tenant sandbox clusters.
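The separation described above can be sketched with `asyncio`: each environment runs as an independent task on a CPU node and awaits actions from a remote inference service, so a slow inference call blocks only its own environment. `infer_action` here is a dummy stand-in for a real HTTP or gRPC call to a GPU inference node.

```python
import asyncio

# Sketch of decoupling environment workers from policy inference via an
# async API. `infer_action` simulates a remote call; in practice this would
# be an HTTP/gRPC request to a separately scaled inference cluster.

async def infer_action(state):
    await asyncio.sleep(0.001)    # stands in for network + GPU inference latency
    return state % 4              # dummy action

async def run_env(env_id, steps=5):
    state, total_reward = env_id, 0.0
    for _ in range(steps):
        action = await infer_action(state)   # only this env waits on inference
        state, reward = state + action, 1.0  # dummy environment transition
        total_reward += reward
    return total_reward

async def main(num_envs=100):
    # each environment is an independent task; inference calls interleave,
    # so total wall-clock time is far less than num_envs * steps * latency
    return await asyncio.gather(*(run_env(i) for i in range(num_envs)))
```

Because the environment workers hold no GPU resources and the inference service holds no per-episode state, each side can scale on its own hardware profile.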
Running thousands of environments introduces orchestration challenges beyond what container runtime alone handles. You need a control plane that can manage that many concurrent containers without becoming the bottleneck itself. Kubernetes-based orchestration is a common approach at this scale. See Kubernetes multi-tenancy for how this is typically structured.
For production RL training on untrusted or unpredictable code, hard isolation between rollouts is a common requirement. Full VMs take too long to boot for high-reset-frequency workloads. Process-level namespacing alone is not sufficient. The practical options are microVMs (Firecracker, Kata Containers) or syscall sandboxing (gVisor), each with different trade-offs around boot time, memory overhead, and compatibility. See Firecracker vs gVisor for a detailed comparison.
Teams building RL pipelines on proprietary data often need training environments to run inside their own cloud or on-premise infrastructure. Rollout trajectories, reward signals, and environment states can contain sensitive information that should not leave the company's network boundary. See Top BYOC AI sandboxes and Self-hosted AI sandboxes for how different platforms approach this.
The infrastructure requirements covered in this article are what Northflank is built to handle: fast environment creation, clean stateful resets, hard isolation between rollouts, support for ephemeral and persistent modes, high-concurrency orchestration, and BYOC deployment.
Northflank supports 100,000+ concurrent sandbox environments, environment creation in around 1-2 seconds, microVM-based isolation using Kata, Firecracker, and gVisor depending on workload, and self-serve Bring Your Own Cloud (BYOC) deployment that is production-ready. On-demand GPUs are available without quota requests. Access is via API, CLI, or SSH. Northflank has been in production since 2021 across startups, public companies, and government deployments.

To see how this works in practice, the guide below walks through spinning up a secure sandbox with Firecracker, gVisor, and Kata on Northflank:
How to spin up a secure code sandbox and microVM with Northflank
For an overview of Northflank's sandbox infrastructure, see Northflank sandboxes.
To get started or speak with the team about your infrastructure requirements: Get started or Book a demo
An RL agent is a system that learns by interacting with an environment. It observes state, takes actions, receives rewards, and updates its policy to maximise cumulative reward over time. RL agents are used in robotics, game AI, LLM post-training pipelines, and agentic AI systems that learn through environment interaction.
An LLM agent uses a large language model to reason and take actions, typically through prompting, tool calls, and multi-turn reasoning. A reinforcement learning (RL) agent learns a policy through trial-and-error interaction with an environment. The two are often combined in modern LLM training pipelines: approaches like RLHF use human preference signals, while GRPO and PPO-based post-training use verifiable environment feedback to improve decision-making.
Many agentic AI systems use reinforcement learning (RL) during training. RLHF is used to align LLMs with human preferences. More recent approaches like GRPO and PPO-based post-training fine-tune LLMs using verifiable reward signals from environment interaction. The inference-time behaviour of these agents may not look like classical RL, but the training pipeline often involves RL environment infrastructure, particularly for approaches that use verifiable environment feedback.
It depends on the task and training algorithm. Production-scale reinforcement learning (RL) training commonly runs hundreds to tens of thousands of parallel environments per training step. Larger experiments or those using async RL architectures can require 100,000 or more concurrent environments. The number of parallel environments directly affects how quickly the model collects training data and how efficiently the inference cluster is utilised.
MicroVMs (Firecracker, Kata Containers) and syscall sandboxing (gVisor) are the main practical options. Each has different trade-offs around boot time, memory overhead, and compatibility. The right option depends on your task profile, reset frequency, and the nature of the code running inside environments. Northflank, for example, uses Kata, Firecracker, and gVisor, applying the appropriate isolation model based on the workload.
A rollout is a complete sequence of interactions between a reinforcement learning (RL) agent and its environment for a single training episode. It consists of state observations, agent actions, environment transitions, and reward signals from episode start to termination. RL training typically collects multiple rollouts to generate sufficient data for each policy update step.
- What is an AI sandbox: A foundational overview of what sandbox environments are, how they provide isolation, and where they are used across AI workloads.
- Ephemeral sandbox environments: How ephemeral execution environments work, when to use them, and how they differ from persistent sandboxes.
- How to sandbox AI agents: A practical guide to sandboxing AI agents, covering isolation methods, filesystem controls, and network restrictions.
- Kubernetes multi-tenancy: How Kubernetes handles multi-tenant workloads and what that means for running large numbers of isolated environments on shared infrastructure.
- Kata Containers vs Firecracker vs gVisor: A detailed comparison of the three main container isolation technologies used in secure sandbox environments.
- Firecracker vs gVisor: A focused comparison of Firecracker microVMs and gVisor syscall sandboxing for workloads that need fast, lightweight isolation.
- What is AWS Firecracker: How Firecracker works, what it provides over standard containers, and where it fits in a sandbox infrastructure stack.
- Top BYOC AI sandboxes: A look at sandbox platforms that support bring-your-own-cloud deployment for teams with data residency requirements.
- Self-hosted AI sandboxes: Options for teams that need to run sandbox infrastructure entirely within their own infrastructure.


