How to set up LLM evals
A practical setup for LLM evals using raw events, unit tests, human review, LLM-as-a-judge, and A/B tests for production AI systems.
LLM evals are how you stop guessing.
When you build a normal software feature, you can write tests and know if the function still behaves the way you expect. AI systems are messier. The same prompt can behave differently with slightly different context. One answer can be factually correct but still wrong for the tone. A classification can work on your happy-path examples and then fall apart when real users show up.
That is why evals matter in production AI systems. They give you a feedback cycle. You can change a prompt, swap a model, adjust retrieval, or fix one failure without creating three new ones somewhere else.
Skip the fancy eval platform on day one. First, understand your system well enough that you know what to measure.
The three-level framing in this post builds on Hamel Husain's original guide, "Your AI Product Needs Evals", which separates eval work into unit tests, model and human evaluation, and A/B testing. I use that framing for the eval setup in my own production AI workflows.
What LLM evals are for
An eval is a systematic measurement of quality in an AI system. One eval measures one specific thing, like factual accuracy, output format, classification quality, tone, safety, retrieval quality, or whether the model followed a business rule.
That sounds simple, but it changes how you build. Instead of looking at one output and saying "this feels better", you create a way to answer questions like:
- What happens if I adjust the system prompt?
- Is the RAG pipeline retrieving the right context?
- Are we classifying the ticket into the right category?
- Does the response match the tone the business wants?
- Did this change fix one failure while breaking another?
These are the questions every AI engineer runs into once a workflow leaves the demo stage.
Langfuse or another tracing tool can help you see what happened after something goes wrong. That is useful. But traces alone do not tell you whether a change made the system better across the cases you care about. Evals turn those cases into a repeatable feedback loop.
The loop: analyze, measure, improve
The eval workflow I use is simple. Analyze, measure, improve.
Analyze means you look at real data. User inputs, model outputs, traces, failures, support complaints, strange edge cases, broken JSON, bad classifications. You collect examples and group the failure modes.
Measure means you turn what you found into a metric. Sometimes that metric is just pass or fail. Sometimes it is a score from 0 to 1. Sometimes it is a human label with a short critique.
Improve means you change the system. You adjust the prompt, retrieval step, model, tool call, output schema, routing logic, or overall architecture. Then you run the evals again.
That loop matters because an AI application is never really finished. You can usually make the prompt a little better. You can usually improve retrieval a little. You can usually find one more edge case hiding in production data. The question is whether you can improve the system systematically, or whether every fix turns into trial and error.
Level one: unit tests
The first level is boring, and that is why it is useful.
Unit tests are fast, cheap assertions you can run on every code or prompt change. If you have a workflow that classifies customer messages, start there. Take a real or realistic input, run it through the workflow, and assert the things that must be true.
For a customer support ticket, the tests might check:
- The category is one of the allowed values.
- The confidence score is a float.
- The confidence score sits between 0 and 1.
- A known billing question is categorized as billing.
- The generated response is not empty.
Even that small set is useful. It will catch broken structured output, bad routing, missing responses, and regressions in examples you already understand.
In my example project, there is an evals folder with an events folder inside it. The events are raw JSON tickets. The test script loads an event, runs the customer-message workflow, and checks the result with simple Python assertions.
You do not need a full testing framework to start. A plain Python file with assert statements is enough. Later, you can move the same logic into pytest and run the whole eval folder from the terminal.
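A minimal sketch of such a file could look like this. It is an illustration, not the example project's actual code: `run_workflow` is a hypothetical keyword stub standing in for the real workflow so the file runs without a model, and the field names `category`, `confidence`, and `response`, along with the allowed label set, are assumptions about the output schema.

```python
# Assumed label set; use whatever categories your workflow really allows.
ALLOWED_CATEGORIES = {"billing", "shipping", "technical", "other"}


def run_workflow(ticket: dict) -> dict:
    """Hypothetical stand-in for the real workflow. Replace with the actual LLM call."""
    text = ticket["message"].lower()
    category = "billing" if "invoice" in text or "charge" in text else "other"
    return {
        "category": category,
        "confidence": 0.9,
        "response": "Thanks for reaching out. We are looking into this.",
    }


def check_ticket(ticket: dict, expected_category: str) -> dict:
    """Run one ticket through the workflow and assert what must be true."""
    result = run_workflow(ticket)
    assert result["category"] in ALLOWED_CATEGORIES
    assert isinstance(result["confidence"], float)
    assert 0.0 <= result["confidence"] <= 1.0
    assert result["category"] == expected_category
    assert result["response"].strip() != ""
    return result


# A known billing question should be categorized as billing.
billing_ticket = {"message": "I was charged twice on my last invoice."}
result = check_ticket(billing_ticket, expected_category="billing")
print("ok:", result["category"])
```

Because `check_ticket` is an ordinary function full of assertions, moving it into pytest later mostly means renaming it to start with `test_` and feeding it the saved events.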
The important habit is to save the raw event and add a test around it every time you find a failure mode. If you fixed a billing classification bug this week, that same bug should not silently return next week.
Level two: human review and LLM-as-a-judge
Most people want to skip straight to an LLM judge. I would not start there.
An LLM-as-a-judge can be useful, but only after you know what good looks like. If you do not understand the data yourself, and a domain expert has not reviewed real examples, your judge prompt is just a nice-looking guess.
Start with human review. Take input and output pairs from the system. Put them in a spreadsheet if that is the easiest tool. Look at the user input, the model response, and ask a human reviewer to mark whether the response is good or bad. Then ask for a short critique.
That critique is the important part. "Bad" is too thin. A useful critique says something like: "The response is too casual for a delayed order. It should acknowledge the frustration, show more urgency, and offer a compensation path if the policy allows it."
Now you have the beginning of an evaluation standard.
Once you have that, run an LLM judge on the same examples. Ask it to critique the output against the same criteria and return a structured result, for example pass or fail plus a reason. Then compare the model outcome with the human outcome.
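One way to sketch that structured judge call is shown below. Everything here is an assumption for illustration: the prompt wording, the `Verdict` shape, and the `call_model` parameter, which is any function that takes a prompt string and returns the model's text. The `fake_model` stub exists only so the example runs without an API key; swap in your real client.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Assumed prompt wording; the criteria should come from your human critiques.
JUDGE_TEMPLATE = """You are reviewing a customer support reply.

Criteria:
{criteria}

User input:
{user_input}

Model response:
{response}

Return only JSON with the keys "outcome" ("pass" or "fail") and "reason"."""


@dataclass
class Verdict:
    outcome: str
    reason: str


def judge(user_input: str, response: str, criteria: str,
          call_model: Callable[[str], str]) -> Verdict:
    """Build the judge prompt, call the model, and validate the structured reply."""
    prompt = JUDGE_TEMPLATE.format(
        criteria=criteria, user_input=user_input, response=response
    )
    data = json.loads(call_model(prompt))  # raises ValueError on non-JSON output
    if data["outcome"] not in {"pass", "fail"}:
        raise ValueError(f"unexpected outcome: {data['outcome']!r}")
    return Verdict(outcome=data["outcome"], reason=str(data["reason"]))


def fake_model(prompt: str) -> str:
    """Hypothetical stub that always fails the reply; stands in for a real LLM call."""
    return json.dumps({"outcome": "fail", "reason": "Too casual for a delayed order."})


verdict = judge(
    user_input="My order is two weeks late.",
    response="Oops, sorry about that! It should show up soon.",
    criteria="Acknowledge the frustration and show urgency for delayed orders.",
    call_model=fake_model,
)
print(verdict)
```

Validating the outcome against a fixed set keeps a malformed judge reply from silently counting as a pass or a fail.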
My first alignment sheet had 10 examples and a 70 percent agreement score. That means the judge matched the human label on 7 out of 10 rows. Is that good enough? It depends on the system. For many production workflows, you would keep iterating.
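The agreement score itself is simple arithmetic. The sketch below uses made-up labels (not the rows from my sheet) chosen so that 7 of 10 match, and it also lists the disagreeing rows, since those are the ones worth reading.

```python
def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of rows where the judge outcome equals the human outcome."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled row")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def mismatched_rows(human_labels: list, judge_labels: list) -> list:
    """Zero-based row indices where the judge disagreed with the human."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]


# Illustrative labels only.
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "fail", "fail"]

rate = agreement_rate(human, judge)
disagreements = mismatched_rows(human, judge)
print(f"agreement: {rate:.0%}, rows to review: {disagreements}")
```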
The way to improve the judge is also practical. Keep the human labels fixed. Look at where the model disagreed. Feed the examples, the current judge prompt, the human critiques, and the mismatches into a strong model, then ask it to improve the judge prompt. Run the sheet again and see if agreement goes up.
This is the work. It is not glamorous. But it is where the eval starts to reflect your domain instead of generic internet criteria.
Do not outsource judgment before you have done judgment. Human evaluation and LLM-as-a-judge are layers in the same process. The human standard comes first.
Level three: A/B tests
A/B testing is the third level, and it is the most expensive one.
This is where you compare two prompts, two models, or two workflows against real outcomes. For a support system, you might look at user satisfaction, time to resolution, task completion, retention, or another business metric tied to the workflow.
You can also run smaller internal A/B tests before you have a full user-feedback system. For example, run the same dataset through prompt A and prompt B, then compare the human scores or judge scores. That can be enough to decide whether a prompt change is worth shipping.
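A rough sketch of that offline comparison, assuming you already have one pass or fail outcome per row for each prompt (from human review or an aligned judge). The outcome lists are invented for illustration. Besides the two pass rates, it reports which rows improved and which regressed, because a higher average can hide a new failure.

```python
def pass_rate(outcomes: list) -> float:
    """Share of rows labeled "pass"."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(o == "pass" for o in outcomes) / len(outcomes)


def compare_variants(outcomes_a: list, outcomes_b: list) -> dict:
    """Row-by-row comparison of two variants scored on the same dataset."""
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("both variants must be scored on the same rows")
    pairs = list(enumerate(zip(outcomes_a, outcomes_b)))
    return {
        "pass_rate_a": pass_rate(outcomes_a),
        "pass_rate_b": pass_rate(outcomes_b),
        # Rows that failed under A and pass under B.
        "improved": [i for i, (a, b) in pairs if a == "fail" and b == "pass"],
        # Rows that passed under A and fail under B.
        "regressed": [i for i, (a, b) in pairs if a == "pass" and b == "fail"],
    }


# Illustrative outcomes for the same eight rows under prompt A and prompt B.
prompt_a = ["pass", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]
prompt_b = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "pass"]

report = compare_variants(prompt_a, prompt_b)
print(report)
```

In this made-up run, prompt B wins on pass rate but breaks one row that prompt A handled, which is exactly the kind of trade-off you want to see before shipping.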
The trap is pretending that five examples are a serious experiment. They are not. Five examples can tell you where to look. They cannot prove that a new workflow is better.
Use A/B tests when the system is mature enough to make the cost worthwhile. For most early AI applications, level one and level two will already expose the problems that matter.
Choose metrics based on the system
There are two broad types of metrics: reference-based and reference-free.
Reference-based metrics are easier. You know the correct answer, so you compare the model output against it. This works well for exact string matching, SQL correctness, code execution, structured data validation, and classifications with a known label.
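For a classification with known labels, a reference-based metric can be as small as the sketch below. The normalization step (trimming whitespace and lowercasing) is an assumption; drop it if your labels must match character for character. The predictions and references are invented.

```python
def exact_match_accuracy(predictions: list, references: list) -> float:
    """Share of predictions that equal the reference label after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")

    def normalize(label: str) -> str:
        return label.strip().lower()

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data: the third prediction is wrong, the second differs only in formatting.
predictions = ["billing", "Shipping ", "technical", "other"]
references = ["billing", "shipping", "billing", "other"]

accuracy = exact_match_accuracy(predictions, references)
print(f"accuracy: {accuracy:.2f}")
```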
Reference-free metrics are harder. You do not have one perfect answer. Customer support replies, research summaries, sales emails, analyst notes, and many agent outputs can be good in multiple ways. In those cases, you measure qualities such as tone, helpfulness, safety, hallucination risk, format compliance, or whether the answer uses the retrieved context correctly.
Start simple. Good or bad is enough in the beginning. A binary label forces you to decide what matters. Once that works, you can add more detailed scores.
The mistake is starting with ten generic metrics because a platform gave them to you. Helpfulness 4.2 and truthfulness 4.5 do not mean much if nobody can explain what a good answer looks like for this exact workflow.
The setup I would start with
If I were adding evals to a new LLM application, I would keep the first version small.
I would create an evals folder in the repo. Inside it, I would store raw events from the workflow, ideally the exact JSON payloads the system receives. If there is no production data yet, I would generate around 100 synthetic examples that look like the inputs I expect.
Then I would write a simple script that does three things:
- Loads each event.
- Runs it through the real workflow.
- Checks assertions against the result.
Three steps, and you have a regression suite for the known cases. You can run it locally before changing prompts. You can expand it every time production teaches you something new.
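Those three steps could be sketched as follows. Treat it as one possible layout, not the project's actual script: `run_workflow` is a hypothetical keyword stub, the `expected_category` field stored alongside each event is an assumption, and the demo writes three invented events into a temporary `evals/events` folder so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path

ALLOWED_CATEGORIES = {"billing", "shipping", "other"}  # assumed label set


def run_workflow(event: dict) -> dict:
    """Hypothetical stub; swap in the real workflow call."""
    text = event["message"].lower()
    if "charged" in text:
        category = "billing"
    elif "package" in text:
        category = "shipping"
    else:
        category = "other"
    return {"category": category, "confidence": 0.8, "response": "We are on it."}


def check_result(event: dict, result: dict) -> None:
    """Assertions that must hold for every event."""
    assert result["category"] in ALLOWED_CATEGORIES, "category not allowed"
    assert 0.0 <= result["confidence"] <= 1.0, "confidence out of range"
    assert result["response"].strip(), "empty response"
    expected = event.get("expected_category")
    if expected is not None:
        assert result["category"] == expected, (
            f"expected {expected}, got {result['category']}"
        )


def run_suite(events_dir: Path) -> dict:
    """Load each event, run it through the workflow, and collect failures by file name."""
    failures = {}
    for path in sorted(events_dir.glob("*.json")):
        event = json.loads(path.read_text(encoding="utf-8"))
        try:
            check_result(event, run_workflow(event))
        except AssertionError as exc:
            failures[path.name] = str(exc)
    return failures


# Invented demo events; in a real repo these would be saved production payloads.
demo_events = {
    "billing_double_charge.json": {"message": "I was charged twice.", "expected_category": "billing"},
    "late_package.json": {"message": "My package never arrived.", "expected_category": "shipping"},
    "refund_request.json": {"message": "I want my money back.", "expected_category": "billing"},
}

with tempfile.TemporaryDirectory() as tmp:
    events_dir = Path(tmp) / "evals" / "events"
    events_dir.mkdir(parents=True)
    for name, event in demo_events.items():
        (events_dir / name).write_text(json.dumps(event), encoding="utf-8")
    failures = run_suite(events_dir)

print(failures)
```

The stub deliberately misses the refund request, so the suite reports one failure by file name. Collecting failures instead of stopping at the first one shows the full picture after each prompt change.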
After that, I would create a review sheet for level two. Columns for input, output, model critique, model outcome, human critique, human outcome, and alignment. Nothing fancy. Just enough structure to compare human and model judgment over the same data.
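If you prefer to generate the sheet from code instead of filling it in by hand, a sketch with the standard `csv` module might look like this. The snake_case column names mirror the columns listed above, the "match" and "mismatch" values are an assumed convention, and both rows are invented examples.

```python
import csv
import io

COLUMNS = [
    "input", "output", "model_critique", "model_outcome",
    "human_critique", "human_outcome", "alignment",
]


def add_alignment(row: dict) -> dict:
    """Fill the alignment column by comparing the judge outcome with the human outcome."""
    aligned = row["model_outcome"] == row["human_outcome"]
    return {**row, "alignment": "match" if aligned else "mismatch"}


# Invented review rows.
rows = [
    {
        "input": "My order is two weeks late.",
        "output": "Oops, sorry about that! It should show up soon.",
        "model_critique": "Too casual, no urgency.",
        "model_outcome": "fail",
        "human_critique": "Too casual for a delayed order.",
        "human_outcome": "fail",
    },
    {
        "input": "How do I update my card?",
        "output": "Go to Settings, then Billing, and choose Update card.",
        "model_critique": "Missing a greeting.",
        "model_outcome": "fail",
        "human_critique": "Clear and correct.",
        "human_outcome": "pass",
    },
]

reviewed = [add_alignment(row) for row in rows]

# Write to an in-memory buffer; pass a real file handle to save the sheet to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(reviewed)
sheet_text = buffer.getvalue()
print(sheet_text)
```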
Once the agreement is good enough for the risk level of the system, I would automate the judge and track the score over time in Langfuse or another observability tool.
Mistakes to avoid
The first mistake is tool-first thinking. If RAG quality is bad, people immediately want a different vector database. If accuracy is bad, they want a bigger model. If they need metrics, they buy an eval tool. Sometimes those are the right moves, but not before you understand the failure.
The second mistake is avoiding your data. You cannot improve an AI system you never inspect. Look at traces. Read real inputs. Read bad outputs. Save examples. Build your intuition around what users are doing.
The third mistake is unaligned judges. An LLM judge that has never been compared against human review is not a source of truth. It is another model output.
The fourth mistake is trying to make the first eval system too perfect. Start with the biggest failure mode, turn it into a test, then repeat. That compounds faster than a dashboard full of metrics nobody trusts.
Where this fits in production AI engineering
Evals sit next to retrieval, agents, APIs, deployment, monitoring, and human review. They are one of the core layers that separate a demo from a production AI system.
A demo works when the happy path works. A production system keeps improving after real users hit it with messy inputs.
That is the real goal. Not perfect scores. Not a beautiful dashboard. A workflow where failures turn into examples, examples turn into tests, tests make changes safer, and the system gets better over time.
For the broader engineering skill set this belongs to, start with how to build production AI systems. If you want the full training behind this post, watch the LLM evals setup video.
