

What comes after the code? Event recap: Northflank x Augment Code x Zed
In January we brought together engineers and founders from Augment Code, Zed, and Human Layer for an evening in San Francisco, co-hosted with our friends at Kindred Ventures.
The panel, moderated by Steve Jang, covered what coding agents can realistically do today, what the new developer workflows look like when they’re in the loop, and where the tooling still falls short. Then Will, Northflank’s co-founder and CEO, talked about what Northflank is seeing from the infrastructure side.
You can watch the full recording below. This post recaps the ground we covered on the night.
The opening question was a familiar one: when will agents write all the software? Nobody on stage took the bait. The more useful framing, which came from Dex at Human Layer, is the distinction between typing code and writing code. Models have been doing most of his typing since late last year. But typing is not the same as deciding what to build, what the API should feel like, or what trade-off is right for a particular system. Those decisions still belong to the engineer.

Chris from Augment Code put it in terms of the underlying technology. An LLM is a function that takes context and tells you what the next likely token is. The job of the engineer is to constrain the probability space aggressively enough, through tests, type systems, good prompts, and well-scoped tasks, that the only possible next token is the right one.
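That framing is concrete enough to sketch. The toy Python below (made-up token names and logit values, not anything from a real model) shows what "constraining the probability space" means mechanically: a verifier such as a type checker masks out continuations that cannot be correct, and the distribution collapses toward the right token.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# Hypothetical next-token scores from a model: three plausible calls.
logits = {"foo(": 2.1, "bar(": 1.9, "baz(": 0.3}

# Unconstrained, the model hedges across several continuations.
unconstrained = softmax(logits)

# A type checker knows only foo() exists in scope, so the other
# candidates are masked out before sampling.
allowed = {"foo("}
constrained = softmax({t: v for t, v in logits.items() if t in allowed})
print(constrained)  # {'foo(': 1.0}
```

Tests, types, and well-scoped prompts all play the same role as `allowed` here: they shrink the set of tokens the model can emit until the right one is the only one left.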
Mikayla from Zed added a point that reframed a lot of the hype. The reason someone could build a programming language in a few weeks with an agent is that programming languages are highly verifiable. You run it and it either works or it does not. A product that a human has to navigate and form an opinion about is much harder for an agent to validate on its own. The more a system can check its own output, the more autonomy you can safely give it.
Chris described his workflow: give the agent a reasonably scoped task, review the diff in source control, stage the changes that look right, go back to the agent for the next step. Repeat until you have a commit worth making. It is not the version of agentic development that gets shared on social media. But it produces code he can vouch for.
The shared view across the panel was that scope matters more than almost anything else. Large, ambiguous tasks fail. Well-specified, bounded tasks with clear verification criteria tend to work. The agents that perform best are the ones with access to the same tools a good engineer would use to check their own work: linters, type checkers, test suites, CI logs. As Dex said, this is just good engineering. None of it is new.
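The loop Chris and the panel described can be written down in a few lines. This is a sketch, not anyone's actual implementation: `propose` stands in for the agent call and `check` for whatever verification you already trust (linters, type checkers, the test suite, CI logs), both hypothetical placeholders.

```python
def agent_loop(propose, check, max_rounds=5):
    """Bounded agent loop: a change only counts as done when verification
    passes. Failures are fed back to the agent as context for the next
    attempt, and the loop is capped so an ill-scoped task fails fast."""
    feedback = None
    for _ in range(max_rounds):
        patch = propose(feedback)   # agent tries again, seeing prior failures
        feedback = check(patch)     # None means every check passed
        if feedback is None:
            return patch            # verified; now worth a human review
    return None                     # never converged: the task was too broad
```

The cap is the point. A well-scoped task converges in a round or two; a vague one burns through `max_rounds` and hands the problem back to the engineer, which is exactly where it belongs.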
What is new is how badly it is missed when it is not there.
Steve polled the room on tooling. Claude Code had a strong showing, yet very few people seemed to be using Cursor.
Chris made the case for why third-party coding tools are not simply waiting to be acquired or undercut. The model is not the whole product. Context is. Getting the right information from a large codebase into the system before the agent starts working is a hard problem, and retrieval alone does not solve it. Augment built a context engine from the ground up, which is why their evals beat Claude Code on the same tasks, using the same underlying model, at lower token cost.
What they cannot compete on is price. Anthropic subsidises Claude Code's inference because the usage data feeds back into training. That creates a structural cost advantage that is very difficult to match. It has also set a price expectation in the market that pressures every tool built on top of API pricing. For now that tension is manageable. It is worth watching.
Code generation is roughly 30% of the software development lifecycle. Northflank automates the other 70%
Northflank started out as a game server hosting platform. Now we help teams deploy and run their most critical production software. The connection between those two things is very direct.
Will framed it like this: if you think of writing code as gaming, then these LLMs and coding tools have made everyone into an eSports professional. StarCraft players are measured in actions per minute. The words per minute in a codebase have gone up by an order of magnitude. But words per minute is not the same as shipping something that works.
Code generation is roughly 30% of the software development lifecycle. Northflank lives in the other 70%: build, deploy, release, autoscaling, disaster recovery, metrics, alerting. The part that runs when you are not looking at it. That part has not changed. It has just been asked to move much faster than it was designed to.

When you are not reviewing every line, and most people are not, you are deploying untrusted code to production. The fact that a model wrote it does not change that. In some ways it makes it harder, because the code can look fluent and still be wrong.
Northflank's approach to this problem grew out of the game server era. Workloads run in micro VMs, isolated from the node and the cluster, so that a compromised or buggy workload cannot affect anything outside its boundary. It turns out to be exactly what teams deploying agent-generated code need.
One government agency customer has built a workflow that reflects where a lot of teams are heading. Engineers write software using Claude Code inside Northflank sandboxes, push to version control, and loop back in. The sandbox is the safe surface. The agent operates within it. That separation lets the team move fast without the security exposure that comes from pointing an agent at a live environment.

On the startup side, some teams do not even have dev environments. They push from agent output to production and iterate when something breaks. It works until it does not, and when it does not, it tends to be very visible.
On the enterprise side, most deployments require a manual approval from a senior engineer before anything touches production. The cost of a bad deployment is too high. What is changing is the volume of things queued up for that approval. Agents generate pull requests faster than any review process was designed to handle, and that mismatch is only going to grow.
The gap between those two worlds is where most of the interesting infrastructure problems are right now. How do you move fast without removing the checkpoints that catch what a code review missed? Preview environments, canary deployments, test runners that can tell you whether the deployed thing actually does what the ticket described. These are not glamorous. They are what makes high-velocity development sustainable rather than just fast.
And they are what the Northflank platform excels at.
There is a compliance case for running models inside your own VPC that most enterprise teams already understand. Sending your codebase to a third-party API is a risk that would have triggered an incident five years ago. The fact that it became normal does not mean it is a good idea.
But there is a cost case emerging too. H100s that were around $0.90 an hour not long ago are now trading at $2 to $3 depending on region. Data centre capacity takes years to build. The subsidised API pricing that the major labs are currently absorbing is not a permanent feature of the market. Will's read is that real compute costs are more likely to go up through 2027 than down, once that subsidy starts to compress.
For teams thinking about self-hosting, whether that is DeepSeek, Qwen, something fine-tuned on their own codebase, or a combination, the question is no longer purely technical. It is starting to be a cost question too.
The way Northflank thinks about this: LLMs are not a special category of thing. They are a component of an infrastructure stack, like a database or a job runner. You should be able to deploy a model, a Postgres instance, a GPU workload, and a Node service from the same control plane, with the same observability, in the same place.

We see customers move in that direction naturally. They come to Northflank for one thing, a deployment or a database, and within a few weeks they are running GPU workloads alongside it, then self-hosting a model, then asking about preview environments for their agent-generated PRs. The stack keeps getting wider. The control plane needs to keep up.
The two infrastructure primitives that matter most for teams building with agents right now are sandboxes and preview environments. They solve different parts of the same problem: how do you run a lot of agent-generated code safely, and how do you know whether it works before it reaches production.

On the sandbox side, the thing Northflank is built for is scale and isolation. We can run 100,000+ concurrent sandboxes, in your VPC or ours, each fully isolated at the micro VM level. That isolation is not a configuration option or a best-effort boundary. It is the foundation the whole thing is built on. An agent operating inside a Northflank sandbox cannot reach anything outside it, which means you can run thousands of them in parallel without the security surface growing with the number of agents.
Northflank Sandboxes boot fast, which matters when agents are spinning up environments on demand. This is obviously important, but the part that tends to surprise people is the concurrency. Most teams do not start thinking about running agents at that scale until they are already blocked on it.
Preview environments are the other piece. The question they answer is whether the thing works the way a real user would encounter it, connected to real services, running in an environment that reflects production. For teams with complex multi-service setups, getting that right has historically been painful. That is where we spend a lot of our time.

Thanks to Dex, Chris, and Mikayla for a great panel, and to Steve and the whole Kindred Ventures team for co-hosting and keeping the conversation smart.
If any of the infrastructure questions here are ones your team is working through, we would be glad to talk.