
The Four Stages of AI: LLMs, Agents, Multi-Agents, and the Harness
A field note on how AI assistance for engineers actually evolves in practice: from a model that can only talk, to agents that act, to teams of agents, to the harness around all of it, and why that last layer is where the real senior engineering work has moved.
A few months back I went through the commit history of this portfolio looking for something specific, and ended up reading much further than I needed to. What struck me was not the code itself but how differently I had clearly been working at different points. The earliest commits are small and single file, written in the voice of someone who typed every character after asking a chat window for a starting point. The later ones touch a dozen files at once, arrive with descriptive messages, and read less like notes to myself and more like the output of a process. The model behind the work had not changed that much in the same window of time. What changed was everything wrapped around it.
That gap is why I think about AI assistance as four distinct stages rather than one continuous blur: the LLM, the Agent, Multi-Agents, and the Harness. Each one is a real shift in what the tool can do without me, and each one moves what a senior engineer actually spends their day on. The first two get most of the attention online. The last two are where most of the actual engineering is happening right now, and they are also the two that are easiest to get badly wrong.
Stage One: The LLM
The first stage is the one everyone has used: a model in a chat window, or an autocomplete suggestion in an editor. It is a stateless text predictor. It does not know your repository exists, it has no memory of yesterday's conversation unless you paste it back in, and it cannot run anything. What it is genuinely good at is explaining things: an unfamiliar error message, a config option you have never seen, a regex someone else wrote three years ago. It is also a capable first draft generator, for boilerplate, for a commit message, for a rough pass at a function whose shape you already know.
The thing that defines this stage is not the model's intelligence. It is that you are the entire runtime. Every interaction is a round trip through you: copy the error, paste it in, read the explanation, copy the suggested fix, paste it into your editor, run it, and if it is wrong, copy the new error and start again. The model can be right about nearly everything it says and you will still feel the tax of being the wire between it and your machine. That tax is the actual ceiling on this stage, not the model's accuracy.
A useful test for which stage a tool is actually in: can it find out it is wrong without you telling it? If every correction loop runs through a copy and paste back into the chat, you are still in Stage One, no matter how good the model's answers are.
Stage Two: The Agent
An agent is an LLM given two things a chat window does not have: a set of tools, read a file, edit a file, run a command, search a codebase, and a loop that keeps going until some condition is met rather than stopping after one reply. The qualitative shift is not that the model got smarter. It is that the model can now see the consequence of its own action instead of only describing one.
The first time I watched this happen end to end, it was almost mundane. An agent made a small change, ran the test suite, read a failing assertion, opened the file the failure pointed at, fixed an off by one, and ran the suite again before reporting back. None of those individual steps were impressive on their own. What was different was that the loop which used to run through me, run code, read the error, paste it back in, ask again, was now running entirely on the other side of the screen. I only saw the start and the end of it.
This is the point where a senior engineer's job changes shape for the first time. The work shifts from writing the function to writing the brief: what the task is, what counts as done, what the constraints are, and what is explicitly out of scope. The bottleneck stops being typing speed and becomes how precisely you can describe correctness, and how carefully you read what comes back.
Agents without a tight scope wander. Left unconstrained, they will rename things, add error handling for cases that cannot occur, or touch a file you never mentioned, because nothing told them not to. Reviewing the diff, not just whether the build is green, is the part of the job that does not go away at this stage. It just moves.
Stage Three: Multi-Agents
Multi-agent setups split a task across several agent instances, each with its own context window and sometimes its own narrower set of tools. The pitch you hear most often is some version of collective intelligence, several models thinking together producing something better than any one of them could alone. In practice, the value I actually get from this is much more mundane, and much more useful: context isolation and specialization.
Context isolation looks like this. If I ask an agent to find every place a particular piece of styling logic is used across a dozen components in this codebase, answering that question requires reading all dozen files. If that reading happens in my main conversation, my context window is now full of file contents I do not need for the rest of the work. If instead a separate, search-focused agent does that reading and returns a short answer, the main thread stays clean and I can keep working in it for much longer before anything needs to be summarized away.
Specialization looks like giving different agents different jobs and different permissions to match: one that can only read and is asked to produce a plan, one that can edit files and is asked to implement that plan, one that only reads diffs and is asked to find problems with them. Each of those roles benefits from a narrower, more focused context than a single generalist agent juggling all three at once.
- A read-only agent for research and codebase exploration, kept separate so its findings do not fill up your main context
- A planning agent that proposes an approach before any file gets touched
- An implementation agent with write access, scoped to the plan it was given
- A review agent that reads the diff with fresh eyes and no investment in having written it
The "swarm of agents builds your product overnight" framing is mostly marketing. What reliably works today looks much smaller and much more boring: you, or a lead agent acting on your behalf, decompose a task into bounded pieces, hand each piece to the right kind of specialist, and review what comes back before any of it gets integrated. That is not a swarm. That is a small team with a lead, which is a structure most senior engineers already recognize, because it is the structure of the teams they have led.
Stage Four: The Harness
The harness is everything that sits around the model and the agent loop. Which tools it is allowed to call, and which of those calls need your approval before they happen. What gets remembered between sessions and what gets forgotten the moment the window closes. How the project's own conventions get enforced, automatically or by a human. How the available context gets managed as a task grows past what fits in one window. None of this is the model being clever. All of it is plain engineering, and it is the layer that decides whether an agent is a liability or something you can actually trust with a codebase that has real consequences attached to it.
- Permission rules that distinguish a safe read from a destructive write, and ask before the second one
- Project convention files, loaded automatically at the start of every session, so standards do not need to be repeated
- Persistent memory, so a correction given once does not need to be given again next week
- Hooks that run linters, type checks, or tests automatically after an edit, without anyone asking
- Packaged, repeatable workflows for multi-step tasks that would otherwise need re-explaining each time
A small, concrete example. A permission configuration might say that read-only operations are allowed without asking, edits and writes prompt for confirmation, and genuinely destructive commands are blocked outright.
{
"permissions": {
"allow": ["Read", "Grep", "Glob"],
"ask": ["Edit", "Write", "Bash(git push:*)"],
"deny": ["Bash(rm -rf:*)"]
}
}None of those three lines involve the model reasoning about anything. They are a fixed boundary, decided in advance, that the model operates inside no matter how it is feeling about a particular task that day. That fixedness is the point.
This article was produced inside exactly this kind of harness. Before a sentence of it existed, the model read this project's conventions file and a memory file recording earlier decisions about copy style and animation rules, and it operated under a permission system that decided which file changes needed a sign off before they happened. None of that made the writing better on its own. It made the process repeatable, and it meant decisions made weeks ago did not have to be re-explained.
The highest leverage work for a senior engineer is increasingly here, not in the code itself. Deciding what an agent can touch unsupervised. Deciding what gets caught by an automated check versus a human review. Deciding what gets written down so it does not have to be re-explained. Deciding where the human checkpoint has to stay regardless of how good the model gets. That is systems design applied to your own workflow, and it is a more senior skill than writing the function was, not less.
Running All Four at Once
These four stages are not a ladder you climb once and leave behind. On an ordinary day I might paste a stack trace into a plain LLM because it is faster than searching, hand a scoped bug fix to an agent, send a research question to a subagent so it does not clutter my main session, and all of it happens inside a harness that decided in advance which of those actions needed my approval and which did not. The stages stack. They do not replace each other.
What stays constant across all four is the part nobody has automated: someone has to know what correct looks like, someone has to decide what is worth building, and someone has to set the boundaries the system operates inside. As the first three stages get more capable, that responsibility does not disappear. It concentrates, more and more, into the harness, designed by people who a few years ago spent most of their working day writing the code the harness now writes for them.
That is the actual shift, and it is easy to miss if you are only watching how good the model's answers are. Engineers are not writing less code because they got lazy or replaceable. They are writing less code because the code most worth writing carefully now is the code that decides how the rest of the code gets written, and reviewed, and remembered. That is not a smaller job. It is a different one, and for a senior engineer, it maps remarkably well onto skills they already had.