Workflows9 min read
How to build reliable AI agents
A field guide to reliable AI agents. Map the workflow first, use structured outputs for decisions, keep tool calling at the edges, and add recovery and human approval paths.
Reliable AI agents start with a workflow you understand, not with an agent framework.
That is the part most tutorials skip. They show a large language model with a bunch of tools, the model goes off, something magical happens, and now you have an "agent." It looks great in a demo. Then you try to put the same pattern into a business process and suddenly you are debugging why the model chose the wrong tool, why it ignored one instruction, why the context window is full, and why the same customer ticket works on Monday but fails on Tuesday.
My opinion on this has not changed much. Most reliable AI agents are not that agentic. They are normal software systems with a few LLM calls placed exactly where they help.
If you are building production AI systems, this is the mental model that keeps you sane.
Start with the workflow, not the framework
The easiest way to make an AI agent unreliable is to let the LLM own the whole process.
You give it a long prompt, 12 tools, access to a database, maybe some files, and ask it to "figure it out." This can work when the user is sitting there watching the output, like in ChatGPT, Cursor, Claude Code, or Codex. The user sees the mistake and corrects it.
That is a personal assistant pattern.
A background automation is different. A webhook comes in. A support ticket arrives. A document needs to be processed. A message may be sent to a customer. In that case, you usually do not have a person watching every step. The system needs more control.
So before picking LangChain, LlamaIndex, Pydantic AI, the OpenAI SDK, the Claude Agent SDK, or anything else, draw the workflow:
- What comes in?
- What needs to be decided?
- What can regular code handle?
- Where does language reasoning help?
- What should happen when the system is not sure?
Spend the most time on question five.
Most production agent work is deciding where not to use the agent.
Use the simplest agent level that solves the problem
There are levels to this. A single LLM call is not the same thing as a multi-agent system. A workflow with one classifier step is not the same thing as an autonomous agent with file system access.
For most business automations, I would think in this order:
| Level | What it means | When I would use it |
|---|---|---|
| LLM call | One model call with a prompt and maybe structured output | Extraction, classification, rewriting, summarization |
| Workflow | Fixed steps with routing and model calls where needed | Support triage, document processing, CRM automation |
| Tool-using agent | The model chooses tools inside a loop | Edge cases where rules become too complex |
| Agent harness | A runtime with tools, files, shell, web, and permissions | Coding agents, research agents, internal power tools |
| Multi-agent orchestration | Separate agents with separate context windows | Long-horizon work where context needs to stay isolated |
The mistake is jumping straight to the bottom because it looks more advanced.
In client systems, the boring workflow is still the bread and butter. Start with routing. Start with a DAG if that fits the process. Use normal functions. Use normal database calls. Use a model when the input is fuzzy and the system needs language understanding.
Then, when a branch of the workflow grows too complex, you can add a tool-using agent at that edge node. That is very different from making the whole system one open-ended agent.
The seven building blocks
The reliable-agents pattern can be broken into seven building blocks. You can implement these in Python, TypeScript, Java, n8n, or whatever stack you already use. The concepts barely change.
| Building block | Job |
|---|---|
| Intelligence | Make the LLM API call when language reasoning is needed |
| Memory | Store and pass the state the model needs for the current task |
| Tools | Let the model request external actions, while your code executes them |
| Validation | Force outputs into a schema and reject bad data |
| Control | Route the workflow with deterministic code |
| Recovery | Retry, back off, fail safely, and return useful fallback behavior |
| Feedback | Stop for human approval when the decision is too risky |
That table is the whole game.
If you understand these pieces, new frameworks become much less intimidating. You can look at a tool and ask a few questions. Is this just a wrapper around tool calling? Is this an agent harness? Is this workflow orchestration? Does it help with context? Does it make validation easier?
Now you are evaluating the abstraction instead of being dragged around by it.
Make the LLM return data, not vibes
If you want the system to make a decision, ask the LLM for structured output.
For example, say a customer message comes in. Do not ask the model, "What should I do next?" and let it pick from a pile of tools. Ask it to classify the message into a small set of categories:
| Field | Example |
|---|---|
intent | question, request, complaint, other |
confidence | 0.82 |
reasoning | "The user is unhappy with product quality and asks for help." |
Then validate that output with Pydantic, Zod, a data class, or whatever your stack uses. If the schema fails, send the validation error back to the model or route the case to a fallback.
Once you have typed data, your code can take over:
if result.intent == "question":
answer_question(ticket)
elif result.intent == "request":
handle_request(ticket)
elif result.intent == "complaint":
escalate_or_resolve(ticket)
else:
send_to_human(ticket)This is easier to debug than tool calling for many production workflows. You have the input, the category, the confidence, and the reasoning. If the workflow fails, you can inspect the exact decision that caused the problem.
With tool calling, the bug can be harder to see. Did the model misunderstand the tool? Did the descriptions overlap? Did the context push the right tool too far down? Did the model decide not to call anything?
Structured output gives you a clean log.
Keep tool calling narrow
Tool calling is powerful. It is also where systems start to get messy.
A tool is just a contract. It has a name, a description, an input schema, and a function your code executes. The model decides whether it needs the tool, fills the arguments, and your application runs the function.
That loop is what powers most agentic applications:
- Send the model the user message and tool list.
- Let the model choose a tool or answer directly.
- Execute the tool in your code.
- Send the tool result back to the model.
- Ask for the final answer or the next tool call.
That pattern is useful when the model needs to reason over what to do next. It is less useful when you already know the next step.
If the workflow can classify the ticket and call the correct function with normal code, do that. If the agent needs to inspect a policy, request missing information, or choose between a few context-dependent actions, tool calling may fit.
I like tool calling most at the edges of a workflow. The main path stays controlled. The hard edge case gets a small set of tools.
GenAI Accelerator
The gap between a demo and production
Anyone can wire up an LLM call. The real skill is designing, evaluating, and shipping systems that hold up.
Engineer the context like a limited resource
Most agent failures are context failures.
The model is not always the problem. The loop is not always the problem. Often the system is giving the model bad context, like too much information, outdated information, overlapping tools, a bloated message history, vague instructions, or a prompt full of negative examples.
Context is everything the model sees during inference:
- System instructions.
- User messages.
- Tool descriptions.
- Tool outputs.
- Retrieved documents, memory, and conversation history.
Do not try to stuff all of it into the context window. Send the smallest set of high-signal information for the current step.
That means your context strategy changes by problem. For documents, use retrieval and maybe a reranker instead of pasting the whole file. For tools, keep descriptions short and make sure tools do not overlap. For memory, prune or summarize long conversations. For multi-step flows, store state in the database and inject only the instructions for the current phase.
One useful smell is when you look at the full trace in Langfuse or another tracing tool and think, "damn, that is a lot of information." The model is probably struggling too.
Add recovery before you call it reliable
Things will break.
APIs go down. LLMs return weird output. Rate limits hit. A tool times out. A user sends input you did not expect. The model produces JSON that almost matches your schema but not quite.
Normal production software, in other words.
Reliable agents need recovery paths:
- Retry with backoff when the failure is temporary.
- Validate outputs before downstream code touches them.
- Return a fallback response when the system cannot answer safely.
- Log the full trace so you can inspect the exact failure.
- Send ambiguous or high-risk cases to a human.
Do not hide this behind prompts. "Always be accurate" is not an error-handling strategy.
If a customer-facing support agent cannot find the right policy, it should not improvise. It should say it cannot resolve the case and hand it off. If an internal workflow cannot parse an invoice, it should mark the event for review. If a tool call fails three times, it should stop and alert.
A reliable system still fails. It fails in a way you can see, handle, and recover from.
Put humans in the loop where mistakes are expensive
There is a big difference between an assistant and an automation.
With an assistant, the user is already in the loop. You ask Cursor to edit a file. You inspect the diff. You reject the bad change. That works because the feedback is immediate.
With a background agent, the system may act without anyone watching. That is where you need approval points.
Use a human stop when the agent is about to:
- Send a sensitive customer email.
- Make a purchase or refund.
- Modify important records.
- Publish content.
- Escalate a decision with legal, financial, or customer impact.
This does not have to be complex. The workflow can pause, send a Slack message with approve and reject buttons, and continue only after a human responds. If rejected, the human can add feedback and the system can regenerate or route the case manually.
Sometimes that approval step is the correct product decision. You do not need to keep optimizing the prompt until it works 80% of the time and fails badly 20% of the time.
Just stop the workflow.
A practical build order
If you are building your first reliable agent system, keep it small.
Start with one workflow. Not a platform. Not a multi-agent architecture. One workflow with a clear input and output.
Build it in this order:
- Write the deterministic path in code.
- Add one LLM call for the step that needs language reasoning.
- Make that LLM call return structured output.
- Add tracing, validation, retries, and fallbacks.
- Add tool calling or human approval only where the workflow proves it needs it.
That order matters because it keeps the system inspectable. You can see the input, the prompt, the structured output, the routing decision, the downstream action, and the final result.
Once that works, you can add a second branch. Then a third. Eventually you may have a proper agent layer, but it grows out of a working system instead of replacing one.
How this fits into production AI systems
Reliable agents are one layer of a larger production AI system.
You still need APIs, auth, queues, databases, retrieval, evals, deployment, monitoring, and sometimes human review. The agent is one part inside that system, which is why I keep coming back to boring software engineering.
The model providers are improving. Tooling will keep changing. New frameworks will show up every week. But the systems that survive are the ones where you understand every moving part and can replace the abstraction when needed.
Start simple. Use the model where it helps. Keep the rest in code.
Then your agent has a real chance of working outside the demo.
Resources
- Watch the full video: How to build reliable AI agents.
- Review the companion explanation: How AI agents actually work.
- Study the complexity model: 5 levels of AI agents.
- Go deeper on context: Effective context engineering for AI agents.
- Read the engineering hub: How to build production AI systems.
