Workflows8 min read
How to build human-in-the-loop for AI agents
Practical human-in-the-loop for AI agents, covering two ways to halt a process in your Python backend, plus the SSE streaming and async production patterns that save state and resume after approval.
Human-in-the-loop is a strategic halt point in an agentic system. You stop the process and wait for approval or feedback from a human before the flow of operations continues. Halting a process inside your Python backend is the easy part, and I'll show you two ways to do it. The hard part is everything around it, because in a real application the frontend, the API, the agent, and the database all need to talk to each other before an approval step actually works.
Take a banking agent. Checking a balance is safe. A deposit is fine too. But transferring money, especially higher amounts, is exactly where you want a human to approve before the agent goes off and processes the transaction. In almost any serious LLM-based application that grows in scale, you'll find a branch or route like this where human-in-the-loop helps.
This guide covers two ways to halt a process in your backend code, and the two production patterns that tie the whole application stack together.
Why approval steps get you to production faster
You don't need the LLM to handle every case before going live. If the model can handle 50% of cases fully automated, and the other 50% can be clearly classified as tricky, route the tricky half to humans and ship. You get to production quicker, and the approval queue shows you what the edge cases are, which is how the system gets more reliable over time.
A lot of people still find this hard to implement in practice. Not because the halting logic is difficult, but because human-in-the-loop done correctly is a full-stack concern, not a backend feature. The approval path is also one of the building blocks I cover in how to build reliable AI agents. When an action is too risky to automate, stop the workflow instead of prompting your way to safety.
Two ways to halt a process in your backend
There are two common ways to control the flow of an LLM application, and the halt mechanics differ between them.
The first uses structured output with the LLM as a router. Your application moves through defined steps, the model classifies something, and depending on the label you execute a function or enter a particular node. Somewhere in that flow you reach the action that needs approval.
The second is tool calling, the more agentic version. You give the LLM tools and a prompt, and the model decides on its own whether it needs to get the balance, deposit money, or transfer it.
The difference matters because of where you halt. In the workflow approach, you check the structured output your LLM returns. In the tool calling approach, you watch for specific tool calls before executing them. Same idea, different interception point.
Halting a structured-output workflow
The banking agent starts with a balance of 1,000 and three functions to get the balance, transfer money, and deposit money. When a user question comes in, the first step is an action model, not a tool call.
It's a Pydantic model whose literal action type is check the balance, transfer, or deposit. The LLM decides which actions it needs, possibly several, and returns them as an action plan, which is just a list of actions.
The important move is in the validation. Instead of trusting the LLM to figure out the hard rules for when approval is required, a model validator enforces confirmation whenever a transfer is above 100:
class Action(BaseModel):
action_type: Literal["check_balance", "transfer", "deposit"]
amount: float | None = None
recipient: str | None = None
requires_confirmation: bool = False
@model_validator(mode="after")
def enforce_confirmation(self):
if self.action_type == "transfer" and (self.amount or 0) > 100:
self.requires_confirmation = True
return selfCreate a transfer of 500 to Alice with requires_confirmation set to false, which is the kind of mistake an LLM could make, and the validator flips it back to true. The decision logic now lives in your backend rather than in the model's judgment.
The agent itself is small. It's literally the OpenAI client with the Responses API, some simple instructions, and a text format that forces structured output into the action plan. The run function takes the user input as a string, runs the model, gets the plan back, and filters the actions with plain if-else logic. When an action has requires_confirmation set to true, the script prints what it's about to do and waits.
In the demo, "check my balance and transfer $50" produces a two-step plan that checks the balance first to confirm there's enough money, then transfers. Neither step needs confirmation, so the 50 bucks goes to Alice and the new balance is 950. A deposit also runs straight through. A larger transfer halts the process. A y/n prompt appears, y executes the transfer, n cancels it.
That halt uses Python's native input() function. Fine for a demo, wrong for production. We'll fix that in a moment.
Halting a tool-calling agent
The tool calling version is almost exactly the same, with one nuance. You define a tool schema that tells the model which functions exist and what parameters they take, then run a while loop instead of collecting the full action plan up front.
If the model's response contains no tool call, you return the text output directly. That's the case where a user asks the agent a question and it just replies. When a tool call is present, you drop into the checks. Balance lookups and deposits pass through, and a transfer above the threshold triggers the same manual confirmation before your code executes the function.
Both versions are meant as playground material. Look at the process you're automating, figure out which architecture you're using, and identify where to halt. That's the starting point.
Production pattern 1: SSE streaming for chat applications
SSE streaming fits real-time chat applications and assistants, anywhere the user is actively engaged in the process.
Follow the full path. The user types "can we transfer 150 to Alice?" into a chat window. The frontend sends a POST request to your API. The API hands the message to the agent, your backend, where the validator flags that 150 is above the 100 threshold and approval is required.
Python's input() would halt the backend process while communicating nothing back to the user. So instead of halting in place, you save the state of the system to a state store, usually a database. What the user asked, what the context was, which tool call is pending. A JSON schema, a database model, whatever; the requirement is that you can store it and retrieve it.
With the state saved, the agent signals back through the API that a tool call is pending and needs approval. The frontend shows that to the user, for example as approve and deny buttons in the chat. And then the connection closes.
Closing the connection is the point of saving state. The user might grab a coffee first, or never come back. Leave a streaming connection open during that window and any latency spike or network error loses everything, and the user has to ask the question again.
When the user clicks approve, the frontend sends a new POST request, but not to the chat endpoint. It hits a separate execute endpoint with approved set to true (or false for a denial) and an ID that identifies the saved state. Your backend loads that state into memory, resumes with the approval, and executes the action, an ordinary Python function. The transaction actually happens, and you stream the response back. Transfer complete. The loop is finished.
GenAI Accelerator
The gap between a demo and production
Anyone can wire up an LLM call. The real skill is designing, evaluating, and shipping systems that hold up.
Production pattern 2: async workflows with a notification layer
The async pattern covers everything where no user is actively waiting. That includes a process triggered by an API call, a webhook, or a queue, and event-driven architectures generally. The flow is similar to streaming with one structural difference. You need a notification layer, because this is a backend process and nobody is sitting in a chat window.
There are infinite ways to build that layer. A simple version is email or Slack. Say an invoice comes in and the system processes it automatically, any time of day. When the amount is above a certain threshold, ping a Slack channel with the proposed action, "we're about to transfer 500 USD to this account," and two buttons, approve and deny.
The rest follows the same shape, just starting from the API layer instead of a frontend. A process gets started and a worker picks it up; whether that's multiple event-based workers or a single FastAPI worker doesn't matter. The agent layer is identical to the streaming pattern. Then, instead of streaming back to a frontend, you hit the notification service. You save the state and the process simply stops.
In that sense it's a little simpler, since there's no direct frontend communication. The user clicks approve inside Slack, which calls another API endpoint that resumes the workflow. You load the state, get the agent context and pending actions back in memory, resume with the approval, and execute. A final Slack notification carries the completion notice that the money is transferred.
Save the state instead of executing
Three concepts make every version of this work, regardless of what you're building:
- Deferred execution. When approval is needed, save the state instead of executing. This is lesson number one.
- State serialization. Persist the agent context, message history, and pending actions. Everything the follow-up run needs, including every parameter that goes into the functions, should be easy to store and retrieve.
- Stateless resume. Load everything from storage and hold nothing in memory. If the user takes days to hit the approval button, everything still works. Even a week later, the state is in the database, the original connection closed long ago, and nothing is waiting from an execution point of view.
Get those three right and human-in-the-loop stops being a prompting problem and becomes a state management problem.
This post is part of the broader guide to building production AI systems. For the diagrams and the live demo, watch the full video; the code examples are in the GitHub repository linked under it.
FAQ
How do you pause an AI agent and wait for human approval?
Halt the process at the point where the agent proposes a risky action, then defer execution. In a structured-output workflow you check the action plan for a requires_confirmation flag; in a tool-calling agent you intercept the tool call before executing it. In production, save the agent state to a database and notify the human instead of blocking the process.
Why should the connection close while you wait for approval?
Because the wait is unbounded. The user might respond in seconds or never, and an open streaming connection loses everything on a timeout or network error. Saving state first means the approval can arrive whenever it arrives and the workflow resumes from the database.
What state do you need to save when you halt an agent?
Everything the follow-up run needs. That means the user's original request, the message history, the agent context, and the pending actions with their parameters. If the resume step can rebuild the exact situation from storage alone, you've saved enough.
Do you need an agent framework to build human-in-the-loop?
No. The examples in this guide use the plain OpenAI client with the Responses API, Pydantic for validation, and ordinary if-else logic. The production patterns add a database and a notification channel like Slack. The concepts transfer to any framework, but none of them require one.
