Data10 min read

How to build agentic RAG in pure Python

Build agentic RAG from scratch with three file tools that list, grep, and read. The same loop coding agents run, applied to your own knowledge, plus the production hardening that makes it hold up.

Classic semantic RAG gets one shot at retrieval. You embed the query, pull the top chunks, and make a single LLM call with whatever came back. If the right document didn't make the cut, the model answers from the wrong context and nobody notices until the answer is wrong.

Agentic RAG removes that single point of failure. You give the model three tools over your knowledge, one that lists files, one that searches them, one that reads them, and you let it run a loop. Search, inspect the results, search again with better terms, read what looks promising, answer. Because the model sees what each call returned, it can self-correct mid-retrieval.

This is the exact pattern the popular coding agents are built on. Claude Code, Cursor, and Codex navigate entire codebases with these same three primitives. The difference is that here we point them at company data and private notes instead of source code. In this post I'll build the whole thing from scratch in pure Python over plain markdown files, then add the production practices we apply at Datalumina for client systems. The code lives in the knowledge folder of my AI cookbook on GitHub, and the only thing you need to follow along is an OpenAI API key, or any other provider you prefer.

When agentic RAG beats semantic RAG

Semantic RAG is not dead, whatever people on the internet claim. If you need raw speed or you're optimizing hard for cost, the classic embed-and-retrieve stack is probably still the better starting point. I covered that stack, BM25, dense embeddings, fusion, and reranking, in the hybrid search for RAG post.

But if you have the time and the budget, agentic RAG will generally outperform it, and the reason is structural. Semantic RAG is a linear process. You use the model's intelligence exactly once, in a single API call, after all the retrieval has already happened. Agentic RAG puts that intelligence inside the loop. The model picks the search terms, reads the results, and decides what to do next, multiple times per question. When the first search misses, it reformulates and tries again instead of answering from bad context.

The tradeoff is latency and cost. Every loop iteration is another model call. Keep that in mind before you replace a retrieval system that needs to respond in milliseconds.

Three tools and a folder of markdown files

The simplest way to start is markdown files on the file system, combined with three tools that list, search, and read files. No database, no embeddings, no chunking strategy.

The walkthrough uses a folder of hypothetical engineering team notes, runbooks, architecture decisions, that kind of material. For your own project you swap in whatever knowledge you want the model to have. The format matters less than you'd think, but markdown is the easiest for this loop to work over.

Everything starts from a clean path to that folder:

Python
from pathlib import Path
 
NOTES_DIR = (Path(__file__).parent / "notes").resolve()

Path(__file__).parent puts you next to the script, appending notes points at the knowledge folder, and .resolve() is a safety precaution that cleans out symlinks and other oddities so every tool works from the same absolute path.

Build the three tools in pure Python

The point of building these by hand, rather than importing them from a framework, is that you end up understanding exactly what your agent does. That understanding is what you'll lean on later when you're debugging why a search came back empty.

List files

glob finds files matching a pattern, and relative_to strips the path down to just the filename:

Python
def list_files() -> list[str]:
    return [str(p.relative_to(NOTES_DIR)) for p in sorted(NOTES_DIR.glob("*.md"))]

The relative_to call matters more than it looks. The agent doesn't need the full absolute path of every file, and shorter paths save tokens on every single loop iteration. Without touching a database, this is how an AI agent finds files on your system.

Search files

Grep is the most involved of the three. Compile a case-insensitive pattern, split each file into lines, and collect every match with its file and line number:

Python
import re
 
def grep(pattern: str) -> list[dict]:
    regex = re.compile(pattern, re.IGNORECASE)
    results = []
    for path in sorted(NOTES_DIR.glob("*.md")):
        lines = path.read_text().splitlines()
        for line_number, line in enumerate(lines, start=1):
            if regex.search(line):
                results.append({
                    "file": str(path.relative_to(NOTES_DIR)),
                    "line": line_number,
                    "text": line,
                })
    return results

enumerate starts at 1 because line numbers are for humans, and IGNORECASE means "Connection Pool" and "connection pool" both match. Searching the example notes for "connection pool" finds it on line 46 of the billing runbook, plus a second hit in the architecture decisions file. That output format, file, line number, line text, is exactly how Claude Code or Cursor navigates a codebase to find what to edit.

Read files

Reading is the easy one, with one check that matters:

Python
def read_file(filename: str) -> str:
    target = (NOTES_DIR / filename).resolve()
    if not target.is_relative_to(NOTES_DIR):
        raise ValueError(f"{filename} is not inside the notes directory")
    return target.read_text()

is_relative_to confirms the target actually sits inside the notes directory. It's a containment boundary. The agent stays inside the folder you gave it, no matter what path the model generates.

Wire the tools into an agent loop

With the three functions in a tools.py, the agent setup is a few lines. The walkthrough uses Pydantic AI because it simplifies exposing the tools to the model and running the execution loop, but nothing here depends on it. Any agent framework works, and so does building your own loop directly against the OpenAI or Claude APIs.

You register list_files, grep, and read_file as tools, then ask a question over the engineering wiki. Why does our nightly deploy job run at this specific time? The agent goes off, loops, and comes back with the answer, an overlap with the European batch ETL window, including where that information came from. In this run it took five tool calls, five trips through list, search, and read before the model had what it needed.

Look behind the scenes at every tool call

Knowing the run took five tool calls doesn't tell you what the agent actually did. To see that, you intercept the agent's steps. Pydantic AI exposes this through agent.iter, and every framework has some equivalent. The mechanics are framework-specific; what you're after is the visibility.

Now you can watch the run unfold. The agent starts with a grep, and the parameters of that call are the interesting part. They're the search terms the language model decided to generate. This is what makes agentic search powerful. The model writes its own queries, looks at what came back, and queries again. With a debug flag turned on you also see the results of each call, and in the walkthrough one grep returned 14 matches, each with the document, the line number, and the line itself, which the model then used to decide which file to read.

When you're optimizing one of these systems and trying to work out whether it's finding the right documents, this view is where the answers are. Look at what the agent searched for. Check whether the right documents came up. Then adjust.

The main lever for adjusting is the tool definition itself. Everything in the function signature and the docstring is information the model uses. The docstring tells it what the tool is for and what the parameters mean, and you can load it with extra instructions or domain knowledge to steer what the model puts into the call. The frontier labs are optimizing their models hard to be good coding agents, and a good coding agent is one that has mastered this loop of searching over information with tools. We're borrowing that capability and pointing it at markdown files.

GenAI Accelerator

The gap between a demo and production

Anyone can wire up an LLM call. The real skill is designing, evaluating, and shipping systems that hold up.

See Curriculum

Return structured answers with citations

A plain text answer is fine for a demo. For a real system you want a contract. Adding structured output means defining an output model, a search answer with the plain-English answer plus a list of citations, where each citation carries the file, the quote, and the line number. Same agent, same tools, one extra output_type parameter.

Running it, the answer takes 10 to 15 seconds. The agent still needs its five tool calls, and there's a capable model behind it. What you get back is a typed object that downstream code can rely on. Show the answer to the user, render the citations as clickable references in the front end, or hand the whole thing to the next function in a pipeline.

That latency number is also your reality check. This loop is for questions where answer quality matters more than response time.

Harden it for production

The learning version and the production version are the same system. The production file in the cookbook keeps the identical structure and adds the safety layers, modeled directly on what Codex, OpenCode, Cursor, and Claude Code do in their harnesses. Four changes carry most of the weight.

First, limits. An agent request limit stops the loop from running away, and a max-lines cap on the read tool stops one huge file from blowing up the context window. Both are cheap parameters that prevent expensive failure modes.

Second, ripgrep. The production grep drops the Python re loop and shells out to ripgrep through subprocess. Ripgrep is built in Rust, it's fast, it skips hidden files, and it respects your gitignore out of the box, which is why pretty much every modern agent harness uses it. On macOS it installs via Homebrew, and it becomes a real dependency of your deployment environment, so treat it like one. The subprocess call passes flags for line numbers and case-insensitive matching, the same output contract the pure Python version had.

Third, errors are returned, not raised. If the agent passes a file that doesn't exist, a raised exception kills the process. A returned human-readable error message goes back into the loop, the model reads it, and it course-corrects on the next call. This one change is the difference between an agent that crashes on edge cases and one that recovers from them.

Fourth, logging, so the visibility you had in the debug walkthrough survives into the deployed system.

To use this in your own project, take the production file, refactor it to fit your codebase, point it at a folder of markdown files, and combine it with whichever framework and model you already use.

Run it beyond local files

You don't have to stay on the local file system. The questions I get are always the same. Does this work on a VPS, in a container app, on serverless? It all works. The concepts stay identical and the functions need small adjustments for the environment.

You can put the markdown files in a PostgreSQL database instead of on disk; the list, search, and read tools then query the database rather than the file system, and the loop doesn't change. The same goes for the deployment targets. At Datalumina this is routine client work. Some clients want container apps, some want a VPS, some run serverless functions, and we take the same working pattern and adjust it to fit.

The retrieval loop is one layer of a larger system. For how it connects to the rest, the data layer, evals, and deployment, start with the production AI systems hub. And if you want to watch the whole build end to end, the agentic RAG walkthrough covers every file in the cookbook, from the first glob call to the production version.

FAQ

Is semantic RAG dead?

No. For low-latency scenarios or when you're optimizing for cost, classic semantic RAG is still probably the better starting point, because it answers in one model call. Agentic RAG generally outperforms it on answer quality when you can afford the extra calls.

Do I need a vector database for agentic RAG?

No. The whole system in this post runs over markdown files with list, search, and read tools. Embeddings and vector databases solve a different problem, fast similarity search at scale, and you can add them later as another tool if the corpus demands it.

Does this only work with local markdown files?

No. You can store the same markdown in a PostgreSQL database and point the three tools at it, and the loop runs unchanged on a VPS, in container apps, or on serverless functions. The concepts stay the same; only the function internals adjust to the environment.

Why use ripgrep instead of Python's re module?

Speed and defaults. Ripgrep is a Rust-based search tool that ignores hidden and gitignored files without any configuration, which is why the modern agent harnesses standardized on it. The pure Python version is for understanding the mechanics; in production you shell out to ripgrep via subprocess.

How slow is agentic RAG?

Expect seconds, not milliseconds. The example question took five tool calls and 10 to 15 seconds to come back with a cited answer. Every loop iteration is a model call, so reserve this pattern for questions where accuracy is worth the wait.

Written by

Dave Ebbelaar

Dave Ebbelaar

Senior AI Engineer

AI engineer and founder of Datalumina. Dave helps developers build production AI systems and turn technical skills into client work.