EngineeringDec 19, 2025Updated Jun 25, 202610 min read

Context engineering for AI agents: why they still fail in production

Context engineering for AI agents in practice, covering why bad context, not the model, causes most production failures, and how to fix system prompts, tools, and message history.

Dave EbbelaarSenior AI Engineer

TL;DR

Context engineering is the set of strategies for curating and maintaining the optimal set of tokens during LLM inference. That means instructions, tool descriptions, tool outputs, retrieved documents, and message history. It matters because model performance degrades as context grows, so the goal is the smallest set of high-signal tokens that maximizes the likelihood of your desired outcome. In practice that means calibrating system prompts between too vague and too specific, using positive few-shot examples instead of lists of don'ts, tracing full conversations with a tool like Langfuse, retrieving 50 chunks and reranking to the top 8 instead of pasting whole documents, keeping tool sets short and focused, and pruning or summarizing message history so the system still behaves correctly on turn 10 and turn 20.

Most AI agents that fail in production are reasoning over bad context. The model can reason fine. The loop works. But the engineers behind the system set it up to feed the model too much information, too little, contradicting instructions, or outdated data, and behavior falls apart from there.

The past year made the pattern hard to miss. Agents look promising in research demos, then some of the biggest companies in the world struggle to put them into real products. Apple and Amazon had to pull back AI products, and Microsoft has run into the same problem. The gap between demo and product is rarely the model, the tools, or the agent loop. It is context engineering at scale, and most of us are still bad at it.

In the consulting work we do at Datalumina, we see this all the time when a client's AI product degrades. This post covers what context engineering means, why agent behavior breaks down as context grows, and the strategies that hold up in the systems we build, drawing on Anthropic's effective context engineering post throughout.

What context engineering actually means

Ask 10 different people what context engineering is and you will get 10 different answers. The definition I use comes from Anthropic's engineering blog. Context engineering is the set of strategies for curating and maintaining the optimal set of tokens during LLM inference, including everything that lands in the window outside the prompt itself.

That last clause is the point. Prompt engineering was last year's skill. Write a system prompt, take a user message, get an assistant message back. Working with LLMs 101. An agent sees far more than that. Documents, tool descriptions, tool outputs, memory files, domain knowledge, and the full message history all compete for the same window, and every one of them shapes behavior.

The tool loop multiplies the problem. A model with tools can answer directly, or it can generate input parameters for a tool, which your application executes and feeds back into the context for another iteration. That can be one tool call or ten before the model decides it has reached its goal. Everything generated along the way becomes context for the next step.

So context engineering is a constant curation problem. Out of everything you could give the model at this moment, what do you actually send?

Context is a finite resource

Models keep shipping with bigger context windows, and stuffing them still degrades performance. Needle in a haystack studies show this clearly. Give a model a small amount of information and it extracts every fact perfectly. Give it an entire book and the results get less reliable. Some models degrade more gently than others, but the curve shows up across all of them.

This mirrors how people work. We have limited working memory, a cap on how much we can handle at once, and so do these systems. That leads to Anthropic's recommendation to treat context as a finite resource with diminishing marginal returns.

The whole discipline then compresses into one sentence. Find the smallest set of high-signal tokens that maximizes the likelihood of your desired outcome. Easy to write down, hard to live by. Entire teams spend weeks and months on it, and the largest companies in the world still get it wrong.

Calibrate the system prompt between too vague and too specific

System prompts drift toward two failure modes, and almost every project walks the same path between them. An engineer starts with a vague base prompt describing roughly what the system should do. They ship it. Users complain that the tone should be more casual, that it should say this when I ask that. The developer takes every piece of feedback and hardcodes it into the prompt as another rule, until the prompt reads like a stack of if-else statements.

That works for a while. With a small application, a handful of cases, and a recent model, you will get away with it. It does not scale. As the window fills up with the system prompt, user messages, and data, the model starts ignoring rules you wrote down explicitly.

Calibration means being specific enough to steer behavior while leaving the model room to be creative. If you keep failing to get it right, stop growing the prompt. Split the problem into two sub-problems and put a router in front, where an LLM call decides which branch handles the request. A smaller problem means a smaller prompt, and that beats one giant prompt powering a single agent that has to solve everything.

The structural side is mostly solved at this point. Anthropic and OpenAI both publish prompting guides. Use clear sections in XML or markdown for background information, instructions, tool guidance, and output format. Short and focused.

Give positive examples, not a list of don'ts

A lot of clients come to us with an existing product that no longer behaves. Usually it is an assistant that walks users through a flow toward some goal, like an assessment, an onboarding, a coaching or therapy session. It worked early on. Then performance degraded as conversations got deeper, until at some point the outputs stopped making sense.

When we open the codebase, the same patterns show up. Prompts are too restrictive, packed with negative examples. Don't do this, don't say that. The history is easy to reconstruct. Users complained, and each complaint became another don't.

LLMs are bad at handling negative examples. They thrive on positive ones. Showing the model what a good answer looks like beats telling it what to avoid; it is the prompting version of a picture saying more than a thousand words. Take the feedback, analyze where the failure actually comes from, and encode the fix as a few-shot positive example.

This connects to a wider gap in developers transitioning into AI engineering. They are often strong software engineers with years of experience, but little of it on nondeterministic systems. They don't look at the data, and they don't take prompts seriously. Client feedback goes straight into Cursor or Claude Code with the instruction "fix the prompt." Do the analysis yourself first, then let AI help you optimize the prompt. Not the other way around.

Read the full trace before you blame the model

From inside your IDE it is very hard to see what the message history looks like when real people interact with your app. That is why we integrate a tracing tool early on every project; we use Langfuse for all of ours. A trace gives you the whole tree of a conversation, with the system prompt, the first user message, every tool call, and everything that came back.

When an LLM system misbehaves, reading that full trace almost always exposes the error immediately. With the better models, the failure is rarely intelligence. The context stopped making sense, because it was engineered in a way where it no longer adds up for the model.

A simple smell is reading an entire trace and thinking, damn, that is a lot of information. That is a signal the model struggles with it too.

Stuck building prototypes?

Get our production stack

What you're missing is the architecture, evaluation, and judgment to ship a real project.

See Curriculum

Know when a workflow beats an agent

The word agent gets slapped on pretty much every piece of software that uses an LLM, but there is a real distinction. An agent, as the term is used now, is an LLM autonomously using tools in a loop to solve a problem. Claude Code and Cursor are truly agentic. I made an entire video on this distinction, walking through Anthropic's patterns for what works in production, and it got almost half a million views.

Most engineers, myself included, are not building the next ChatGPT or Cursor. We are building AI automations for business processes, and for most of those, simple workflows with routing and prompt chaining are more reliable than agents. You still use LLM calls; you just keep them inside a structure you control.

The deciding factor is who catches mistakes. Tool loops work well in chat assistants because the user is in the loop. You ask a question, the model goes off, and when the output is wrong you course correct in the chat window. A backend automation triggered by a webhook has no human watching, because removing the human was the point of automating it. And a support agent facing your customers should not think out loud about which tool it needs and correct itself mid-conversation. In both cases you want the system right in one shot, which means you want more control over what it is doing.

Models are getting more capable, and I do see a future where we solve business problems by handing LLMs tools and letting them run in an environment. Right now it is still a trade-off. I cover how to build the controlled version in how to build reliable AI agents.

Strategies for documents, tools, and memory

There is no single context engineering hack that works all the time. There are only practices per type of information you feed the model.

Large documents don't belong in the context whole. Use RAG, and pay attention to how many chunks you retrieve from the vector database. A pattern that works well is to go broad first with 50 chunks, run a reranker, and keep the top eight results. If you are setting up retrieval, start with hybrid search.

Tools count against the window too, descriptions included. Don't bloat the agent. Tools should be short, descriptive, focused, and they should never be confusing or overlapping. When the tool set grows past that, split it. Expose a sub-agent as a single tool with its own tools underneath, or fall back to the workflow approach with a router and a separate LLM call.

The message history is the tricky one, because it hides during development. You test with short, isolated exchanges and everything passes. Users go deeper, and then they complain that on turn 10 the assistant completely forgot what you were talking about and stopped following instructions. Keep an eye on the history at all times. Prune earlier messages, or take the first 20, summarize them, and put the summary back into the conversation.

Engineer the history with a state machine

Context engineering doesn't have to be linear, where you watch one growing message history and prune or summarize it whenever it bloats. For flows with phases, a state machine works better. Track where the user is in a database, and give each state its own system prompt, or at least its own section within it.

The signal that you need this is a prompt that captures an entire assessment in one block, with instructions like first complete this phase, and once it is done move to that phase. Break it up. Store the state, and on every incoming message pull the right context from the database into memory at runtime. You can rebuild the message history strategically, drop it entirely, or inject specific messages at specific points in the conversation. It depends on the problem you are solving and the application you are building, and that is what makes context engineering creative work.

The bar is turn 20, not turn one

The hardest part of all of this is that context problems stay hidden during development, because context only builds up as people interact with the system. Engineers are used to writing a feature, writing a unit test, and watching it pass. AI systems need more than that. The system doesn't need to pass once or twice; it needs to pass on turn 10 and on turn 20. That is where all the complexity lives.

So treat context as the scarce resource it is, read your traces, and assume the model is fine until the trace proves otherwise. Context engineering is one layer of building production AI systems; the rest of the stack matters just as much.

For the full walkthrough with the visuals from Anthropic's post, watch the video, Effective context engineering for AI agents.

FAQ

What is context engineering?

Context engineering is the set of strategies for curating and maintaining the optimal set of tokens an LLM sees during inference. That covers the system prompt, tool descriptions, tool outputs, retrieved documents, memory, and message history. The goal is the smallest set of high-signal tokens that maximizes the likelihood of the outcome you want.

How is context engineering different from prompt engineering?

Context engineering is the natural progression of prompt engineering. Prompt engineering focuses on writing effective instructions, mostly the system prompt. Context engineering extends the problem to everything else that lands in the model's window while an agent runs, including tool outputs and conversation history that accumulate across iterations.

Why do AI agents fail in production?

Most fail because they reason over bad context. That means too much information, too little, contradicting instructions, or outdated data. The model is rarely the bottleneck, especially with recent models. Reading the full trace of a failing conversation almost always shows where the context stopped making sense.

How do I stop an agent from forgetting instructions in long conversations?

Manage the message history. Prune earlier messages, or summarize the first 20 turns and inject the summary back into the conversation. For multi-phase flows, store conversation state in a database and swap in the system prompt for the current phase instead of carrying everything forward.

Should I build an agent or a workflow?

For most business process automation, a workflow. Routing and prompt chaining with LLM calls inside a structure you control is more reliable when no human is watching. Save real agents, an LLM using tools in a loop, for chat-style applications where the user can course correct.