EngineeringOct 29, 202511 min read

How to build production AI systems

How to build production AI systems with APIs, retrieval, agents, evals, deployment, monitoring, and human review around the model.

Dave EbbelaarSenior AI Engineer

Most AI demos are one model call away from looking useful.

Production AI systems are different. They need to survive real users, messy data, latency, weird questions, bad inputs, permissions, cost, and the moment someone says, "This answer is wrong. Why?"

That is where AI engineering starts.

The model matters, of course. But in most real projects the model is not the hard part. The hard part is everything around it. That means the workflow, the API, the data layer, retrieval, background jobs, evals, monitoring, deployment, and the human loop when the system is not confident enough to act on its own.

This is the practical version of the AI engineering roadmap. It is also the way I think about client builds at Datalumina. We have done more than 50 custom B2B AI solutions, and the pattern keeps coming back.

Start with the workflow. Then build the smallest reliable system around it.

Video version

What a production AI system actually is

An AI engineer is basically a software engineer who builds production-ready systems with pre-trained AI models and APIs. That means you are usually working on the applied side, taking model capabilities and turning them into usable software rather than training a foundation model from scratch.

That software has a lot more inside it than a prompt. The product workflow defines who uses the system and what job it does. The API boundary receives requests, validates input, and controls access. The data layer stores users, messages, documents, jobs, events, and results. Retrieval pulls private or external knowledge into the model context. The model layer generates, classifies, extracts, reasons, or calls tools. Evals and monitoring catch regressions and trace failures. Deployment makes the system available, secure, and maintainable.

If one of those layers is missing, the system can still work as a demo. It just gets fragile fast.

The more useful framing is that production AI is software engineering with a probabilistic component in the middle. You decide which parts should be deterministic and which parts should be handled by the LLM.

Why demos fail when users touch them

A demo usually has a clean path. You know the input. You know the happy output. You can retry until it looks good.

Users do not behave like that.

They ask vague questions. They paste broken data. They upload the wrong file. They ask the same thing in five different ways. They click before the background job finishes. They expect the system to remember something it was never given.

On client projects, the failure is usually scope, data, integration, evaluation, or unclear ownership of the last 20 percent. Almost never "the model is not smart enough."

That last part matters. Your first build might get to 70 or 80 percent. After iteration, you can often get to 90 percent. But for some workflows, 80 percent is not useful yet. A support triage assistant can escalate uncertain tickets. A financial analysis tool that invents a number from a filing is a different story.

So you need to ask this early:

What happens if the AI is wrong?
Can the mistake be caught later?
Does a human need to approve the output?
Can we measure success on real examples?
Is there enough data to test the workflow?

If you cannot answer those questions, do not start with a platform. Start with the workflow.

Start with one workflow

The best AI systems start narrow.

Pick one repeatable workflow and describe it in normal language. What comes in? What should go out? Who uses it? What does "good" look like? What happens when it fails?

For a document processing workflow, the input might be a PDF or email attachment. The output might be a validated JSON object, a decision, and an escalation flag. For a knowledge chatbot, the input is a user question and the output is an answer with citations. For customer support, the input is a ticket and the output might be a category, priority, draft response, and human handoff.

This is the part people skip because they want to build with agents immediately.

Don't.

If the workflow is unclear, agents make it worse. They add movement before you know the direction.

On client work, I like to define the single core workflow before touching the stack. Must-haves, nice-to-haves, success criteria, technical risk, and scope boundaries. The system gets much easier once that is clear.

Build the API boundary

At some point your local script needs to become a backend.

This is where FastAPI, Pydantic, background workers, Postgres, and Docker start to matter. You need a clean boundary where requests come in, inputs get validated, users are checked, jobs are created, and the rest of the system can run without depending on someone watching a terminal.

The API layer should be boring. That is a good thing.

For most AI applications, the API should do a small set of jobs:

Authenticate the user or integration.
Validate the request with typed schemas.
Persist the event or message before processing.
Dispatch slow work to a worker.
Return a clear status to the client.

Do not put the whole AI workflow inside the request handler. LLM calls are slow. Retrieval can be slow. Tool loops can be slow. If the request times out or the process dies, you do not want to lose the input.

The pattern I use a lot is verify, persist, dispatch.

That works for webhooks, chat messages, document uploads, and scheduled jobs. It also makes debugging much easier because the raw event is stored before the clever part starts.

Add retrieval when private knowledge matters

RAG is one of the core skills of AI engineering because most useful systems need knowledge that is not inside the model.

A private knowledge chatbot is the cleanest example. In the Document Copilot project, analysts ask questions over SEC filings. The system ingests documents, chunks them, embeds them, stores them in Postgres, retrieves relevant passages, and answers with citations.

The important part is the trust layer.

If the system answers a question about a filing, the user needs to see where the answer came from. Otherwise it is just a confident paragraph. That might be fine for brainstorming. It is weak proof for real work.

Basic semantic search is a start, but it is rarely enough for serious use cases. In the full-stack build, the retrieval layer combines semantic search with keyword search and gives the agent tools to read surrounding chunks. That matters because exact terms, dates, tickers, and section names often decide whether the answer is grounded or made up.

"Can I get some chunks back?" is the easy part.

The question worth asking is, "Can a user inspect the evidence and decide whether the answer is trustworthy?"

Add agents when the workflow needs tools

Agents are useful when the system needs to make decisions, call tools, inspect results, and continue.

They are not the default starting point.

A normal RAG pipeline can take a question, retrieve context, and answer. An agentic version can search, inspect the first result, decide it needs more context, read neighboring chunks, call another tool, and then answer. That loop is powerful. It is also slower, more expensive, and harder to debug.

Use agents when the workflow earns them.

Good agent use cases usually have tool calls, like searching a database, reading a file, updating a CRM, creating a draft, checking a policy, routing a ticket, or asking for approval. If the task is just "take this input and classify it," a normal function call or structured LLM output may be enough.

This is where a lot of builders overcomplicate things. They build a multi-agent setup before they can explain the simple data flow.

My approach is simpler. First, draw the workflow. Then decide where an LLM belongs. Then decide whether it needs tools. Then decide whether it needs a loop.

GenAI Accelerator

The gap between a demo and production

Anyone can wire up an LLM call. The real skill is designing, evaluating, and shipping systems that hold up.

See Curriculum

Add evals before users depend on it

You cannot improve what you cannot inspect.

For production AI systems, evals are how you stop guessing. You create a dataset of inputs and expected behavior, then run the workflow against it whenever prompts, retrieval, models, or code change.

This can start very simple. Unit tests. A few representative examples. Structured outputs checked against Pydantic models. Later, you can add human-labeled examples, regression datasets, and LLM-as-judge checks where they make sense.

On client projects, a useful eval often starts with raw records from the database. Take the actual payload that came in. Run it through the workflow. Check whether step five produced the output you expected. If a ticket should escalate, the eval should catch it when a prompt change stops that from happening.

This is why structured outputs matter. If every processing step returns a typed object, you can test the workflow like software. You can inspect where it failed instead of staring at one giant answer and hoping it feels right.

Evals are how you ship changes without breaking the ten cases that worked yesterday.

Monitor the system after deployment

Local errors are easy. They shout at you in the terminal.

Production errors are quieter. A webhook fails. A worker crashes on one event. A model call costs more than expected. A user gets a bad answer and never tells you why.

Observability is part of the product for exactly this reason.

For LLM calls, I like Langfuse because it gives you traces with inputs, outputs, latency, token cost, prompts, and model behavior. For application errors, Sentry is the boring useful tool. When something breaks in FastAPI, a worker, or a background job, you need to see it without SSH-ing into a server and guessing.

A production AI system should give you answers to basic questions:

What did the model see?
What did retrieval return?
How much did the call cost?
Where did the workflow fail?
Which user or event triggered it?

Without that, you are flying blind. And if you are building for clients, flying blind is not an acceptable operating mode.

Deploy the smallest reliable version

Deployment is where the demo becomes real.

You do not need Kubernetes to start. Most people underestimate how far a simple stack can go. A VM, Docker Compose, Caddy for HTTPS, FastAPI, Redis, Celery, Postgres, and a clear CI/CD path.

We deploy many client solutions almost exactly like that. Work locally. Push to GitHub. Run a deployment script or pipeline. Keep secrets out of random places. Block access by default. Open only what needs to be open. Use private networks where possible.

For a learning project, Railway or a managed platform can be fine because it gets you to the full loop quickly. For client work, be stricter. The difference between "this runs online" and "this is production-ready" is security, logging, access control, secret handling, backups, and a deployment process you can repeat.

Ship the narrow version first. Then watch what users do.

They will show you the next system requirement. Maybe the agent always searches even when someone just says hello. Maybe retrieval needs query classification. Maybe the citations are hard to inspect. Maybe the table parser is wrong. Maybe the UI hides the failure state.

That is normal. The first deployment is where feedback finally becomes real, not the finish line.

Example builds from Datalumina

The easiest way to understand production AI engineering is to look at the projects that force the layers to connect.

Document Copilot is a private knowledge chatbot for analysts. It has a browser app, FastAPI backend, Supabase Postgres, auth, document ingestion, pgvector, full-text search, citations, chat history, and deployment. The brief uses a pilot group of about 40 analysts and a target of saving at least 3 hours per analyst per week.

How to build your own AI platform is the infrastructure side. It covers triggers, schedules, agents, shared context, workers, and deployment. That kind of system is useful when you want one place for webhooks, cron jobs, and agent workflows instead of a pile of disconnected repos.

The client delivery process adds the commercial reality. Discovery, ROI, scope, success criteria, human review, two-week sprints, testing, handoff, and maintenance. This is what turns the system from "cool project" into something a company can actually use.

The pattern is always the same. Build around a real workflow. Keep the stack understandable. Add AI where it earns its place. Then instrument the system so you can improve it.

What to learn next

If you want to build production AI systems, follow the roadmap in order. Skipping the boring parts is what makes projects fragile.

Start with Python and the model API. Authentication, requests, structured outputs, prompt design, project structure, environment variables, testing, debugging, and logging. None of it is glamorous. All of it shows up again the first time something breaks.

System design comes next. Learn how data flows through an AI application, where the LLM fits, and how to explain the architecture on a whiteboard.

Then the backend. FastAPI, Pydantic, async work, background workers, Postgres, migrations, Docker, and clean API routes. A lot of people skip this stage and jump straight to RAG, and you can tell, because when retrieval returns garbage they have no system underneath to inspect. The backend is what makes the AI layer debuggable.

Once that foundation holds, learn RAG properly. Ingestion, chunking, embeddings, hybrid search, reranking, retrieval evals, and failure cases.

Observability and evals follow. Langfuse, Sentry, regression datasets, guardrails, output validation, and cost tracking. Skip this layer and every prompt change becomes a coin flip you find out about from users.

Finally, deploy real projects. Use one cloud provider or VM setup until you understand it. Set up HTTPS, logs, secrets, health checks, alerts, and CI/CD.

Build three complete projects if you want proof. Not notebooks. Real systems with a frontend or API, data, deployment, and enough instrumentation to explain how they work. That is the skill.

Where to go next

If you want to turn this roadmap into a concrete build, start with the private knowledge chatbot guide. It connects the browser, FastAPI, Supabase, retrieval, citations, chat history, and deployment in one full-stack project.

If your problem is infrastructure, read how to build your own AI platform, then the webhooks guide. Those two pieces cover triggers, schedules, workers, durable events, and the platform layer around agents.

Then strengthen the part of the stack that is weakest:

Reliable AI agents if the workflow needs tools, decisions, and recovery paths.
LLM evals if you need to catch regressions before users do.
Hybrid search for RAG if your answers need better grounding and citations.
AI platform engineering if you are coming from data science and want to understand the role shift.

And if you want the full path in order, the GenAI Accelerator AI engineering course is the structured version. We go through the stack, prompts, tools, design patterns, Docker, deployment, and production workflow with more room than a YouTube video can give you.