Workflows10 min read

The 5 levels of AI agents: from simple LLM calls to multi-agent systems

A working framework for the five levels of AI agent complexity, from augmented LLM calls and DAG workflows to agent harnesses and multi-agent orchestration, and how to pick the simplest level that solves the problem.

The most critical decision in any AI build happens at the start. It's the level of complexity the problem actually needs. Is a simple LLM call or a workflow enough, or do you need a full agent system? Get it wrong in one direction and the system can't handle the messy cases. Get it wrong in the other and you've built something you can't debug, can't unit test, and can't afford to run.

I think about this as five levels of AI agent complexity. The framework comes from the client systems we build at Datalumina. The short version is that the boring levels still do most of the work in production, and the exciting levels are where the field is heading. You want a mental model of all five, both to pick the right one per problem and to see through the hype when the next agent tool blows up.

Level 1: the augmented LLM

The simplest level is a single LLM API call, augmented with structured outputs. You send a prompt, you get typed data back, and you can engineer a system around it because the output follows a schema instead of free text.

Nothing agentic is happening yet. This is the building block everything else is made of.

Level 2: prompt chaining and routing with DAGs

Level two chains those calls together and routes between them, usually as a directed acyclic graph. For the past two years, this is what I've been preaching. Don't reach for fancy agent systems. Take the process you want to automate and ask whether you can classify the first step, the data coming in.

A customer care ticket is the classic case. Is it a billing question, a technical question, or a general question? An LLM can classify that, and from there deterministic if-else rules decide how each ticket gets handled. As cool as the agent frameworks are, DAGs are still the bread and butter of automating B2B systems reliably, especially at scale.

The catch is growth. Every project starts with a small graph. It's maintainable, everyone understands it, the codebase stays lean. Then you automate more. And more. What you end up with can turn into a Frankenstein of a directed acyclic graph, 20 or 30 times the size of the tutorial diagram, often with multiple developers working on different branches. You can decompose it. You can split it into microservices. The complexity still grows, and finding the right pathway when something breaks gets harder every month. For large organizations with big teams and lots of data, this is one of the biggest challenges in production AI work.

Level 3: tool calling at the edge nodes

Now some decisions move into the model. Instead of hard if-else rules or a dictionary lookup where your code dictates the next step, you give the LLM a set of tools to look something up in a database, fetch a policy, or request information. The model decides what to call, can loop, and can call multiple tools. At this point the system is actually agentic, an agent reasoning over tools in a loop.

Most people pick a camp here. One side says give the model tools and let it figure everything out. The other side says everything must be a deterministic DAG. The best systems use both, and that's exactly what we do now. The models are getting more capable, but we still start from a structured place. Route and classify as much as possible, and use tools as a last resort.

What the combination looks like in production

One of our client systems shows where this lands. It's a customer care support system we've been optimizing for over a year and a half. It automates almost the entire customer care team for the company, escalating only the tickets that still need human intervention. It just runs.

In Langfuse, the monitoring tool we use for all our client systems, you can pull up the full trace of a single workflow run. Before the ticket even reaches the analysis step, five deterministic steps have already executed. Then, at an edge node, an agent takes over.

That edge node is a dinnerware defect agent. To resolve a defect ticket it sometimes needs to request missing information, like how many pieces were in the dinnerware set. It can also pull the specific product rules, what the knowledge base says about company policy for an exchange or a refund on that product. Giving that one node a small set of tools works really well, and in the trace you can watch it decide that this particular ticket needed missing information requested.

That trace is the evolution of our production systems in one picture. Start by solving the problem as simply as possible. More edge cases show up, so you build a graph around them that you control. The project grows into real complexity, so you focus on the edge nodes and introduce tool calls there.

Level 4: agent harnesses

Level four is where it gets more interesting. Agent harnesses are what power tools like Claude Code, Codex CLI, and OpenClaw. Here the model gets a complete runtime with bash execution, file system access, grep and web search, and external APIs through MCP servers or plain scripts.

You can run this inside your own applications. The Claude Agent SDK is a Python package (pip install claude-agent-sdk) that gives you the same functionality Claude Code has, in your own backend. You configure allowed tools, system prompts, MCP servers, permissions, a max budget, even subagents and environment variables, and you get a Claude Code-like environment running behind your own API.

My example setup points the agent at a knowledge folder of markdown files, with read, glob, and grep as allowed tools plus a few custom MCP tools that call an API. Hand it a customer survey request and it goes into the agentic loop. It runs a glob pattern to see which files it can access, reads through them, finds the refund confirmation document, and produces the actions to take. A few lines of configuration, and the model behaves like a research agent over your own files.

This is also where you should feel the danger. Running Claude Code on your own machine with you in the loop is one thing. Putting the same capability in a production system where it can crawl the file system, change files, and reach the internet is something else. My demo runs with permissions wide open, basically yolo mode. You can make it safer by putting it in a container, blocking internet access, and keeping the file system read-only. The honest label for this level is still powerful but experimental.

The harness itself is the part to study. The models getting better is half the story; the harness around them is the other half, and it's why Claude Code is so freaking good right now. If you want to see how one is built, OpenClaw runs on pi-mono, an open-source TypeScript library. Most of you are Python engineers, but you can read the patterns, or ask Claude Code to produce a Python version of them. You can also build harness-style systems with Pydantic AI or LangGraph.

GenAI Accelerator

The gap between a demo and production

Anyone can wire up an LLM call. The real skill is designing, evaluating, and shipping systems that hold up.

See Curriculum

Level 5: multi-agent orchestration

One level deeper sits multi-agent orchestration. The Claude Agent SDK has this built in through an agents parameter where each agent gets its own prompt, its own tools, and optionally its own model. The orchestrator decides which one to call.

The point of the pattern is context isolation. On longer tasks, an agent that digs through a knowledge base bloats its own context window; by the time it finds the answer it might be at 70 to 80 percent of the window. With subagents, each one spawns with a separate context window, goes off to do its research, and reports back. The orchestrator's context stays clean.

You can do the same with Pydantic AI or LangGraph, with more control over whether agents share context or start fresh. The Claude Agent SDK just makes it easy to set up and fun to experiment with. But it's early. We don't use this in any of our production systems yet. I test it internally, and in some cases it's unreliable, and it can get expensive.

Use the simplest level that gets the job done

The final picture combines all five, with augmented LLM nodes inside a DAG, tool calls at the edges, an agent harness where a full runtime earns its keep, and an orchestrator above subagents when context isolation matters. We decide per problem which level we need, then put the levels together in one system.

The default matters, though. DAGs with simple augmented LLM nodes are still the bread and butter of reliable AI engineering. Tool calls in edge nodes are fine now that the models can handle them, but add them only when the complexity forces you to, because an almost deterministic DAG is easier to maintain and write unit tests around than an LLM node with five tool calls. That same logic carries into how to build reliable AI agents once the agentic pieces are in place.

Then weigh cost and latency:

LevelWhat it isCost and latencyWhere it fits
Augmented LLMSingle API call with structured outputsCheap, fastClassification, extraction, single decisions
DAG workflowChained calls with deterministic routingPredictableB2B automation, especially at scale
Tool callingAgent reasoning over tools in a loopHigher per runEdge nodes where rules get too complex
Agent harnessFull runtime: bash, files, search, MCPsHard to capCoding and research agents, experimental backends
Multi-agent orchestrationOrchestrator plus subagents in separate context windowsRuns long, costs moneyLong-horizon tasks where context bloats

A coding agent can take ten minutes and spend money if it solves the problem. A cheap, fast automation needs to stay quick and as deterministic as possible. Same toolbox, different ends of the spectrum.

The mental model pays off when the next tool blows up

A mental model of these levels guides what you pick for your own projects, and it also tells you how to read new tooling. Last month the hype was OpenClaw, and most people had simply no idea what abstractions sit underneath it, so they concluded it was an entirely new AI paradigm where suddenly everything can be automated.

Don't get me wrong, OpenClaw is an awesome project. It's well built. But it stands on principles we already know. An LLM, the right system prompts, a set of tools. It took the coding-agent capabilities from tools like Claude Code, creating files, crawling the file system, searching, and hooked them up to WhatsApp. The result feels like magic. As an engineer, you can break it down. It's another agent harness. An LLM, a bunch of tools, files, and prompts. That's all it is.

The deeper point is that beyond the shifts described here, the principles of AI engineering have barely moved in the past two years. A lot is happening, but the foundations hold. Learn the five levels once and you can place every new tool, trend, and weekly blow-up on the map.

This framework is one slice of the bigger discipline. For everything that surrounds the model, the APIs, evals, deployment, and monitoring, start with how to build production AI systems. And for the diagrams and code walkthroughs of all five levels, watch the full video, 5 levels of AI agents.

FAQ

What is the difference between an AI workflow and an AI agent?

A workflow routes between LLM calls with deterministic code; you decide the paths in advance, usually as a directed acyclic graph. An agent hands that control to the model. It reasons over a set of tools in a loop and decides what to call next. Workflows are easier to maintain and unit test, which is why they still carry most production automation.

What is an agent harness?

An agent harness is the runtime around the model. It handles bash execution, file system access, grep and web search, permissions, budgets, and tool management. It's what powers Claude Code, Codex CLI, and OpenClaw, and it's a big part of why those tools work as well as they do. You can run one inside your own application with the Claude Agent SDK.

When should you use a multi-agent system?

When a single agent's context window becomes the bottleneck. On long tasks an agent can fill 70 to 80 percent of its context before finding an answer; subagents with separate context windows do the research and report back, keeping the orchestrator clean. For most automations it's overkill. It runs long, costs money, and is still unreliable in places, which is why we keep it out of client production systems for now.

Are DAG workflows still relevant for AI automation?

Yes. Directed acyclic graphs with simple LLM nodes still carry most reliable B2B automation, especially at scale. The models are now capable enough to handle tool calls at edge nodes, but the structured, mostly deterministic graph remains the default starting point.

Written by

Dave Ebbelaar

Dave Ebbelaar

Senior AI Engineer

AI engineer and founder of Datalumina. Dave helps developers build production AI systems and turn technical skills into client work.