How to prepare data for AI agents
Build an open-source pipeline that prepares documents, PDFs, and websites for AI agents, using Docling extraction, hybrid chunking, embeddings, and vector search in Python.
One of the first things you do when building AI agents is give them access to your own data. Documents, PDFs, websites, anything that gives the agent specific knowledge about your company or the problem you're solving. Plenty of tools will handle this for you, and most share the same catch. They're closed source. You get an API key, you send your documents to their platform, they do the parsing, and you get the data back.
You don't need that. This post walks through a fully open-source document extraction pipeline in Python, built on Docling. Docling is an IBM project, and based on my experiments and what I hear from other AI engineers working with serious companies, it is by far the open-source document extraction library of choice right now. It's already strong as is, and the roadmap has more coming.
The pipeline runs in five steps. You extract document content, chunk it, embed the chunks into a vector database, test the search, and bring everything together in a chat application that answers questions with sources as citations. I'll use a specific set of tools, but the concepts hold no matter which vector database, embedding model, or AI model you pick. The extraction part is fully open source and runs locally. I still use OpenAI for embeddings and chat here, but that part is optional: you can swap in open-source models there too.
Parse PDFs into one unified document model
Docling's core abstraction is the DocumentConverter. Install the library with pip install docling, import the converter, and point it at a file:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("docling-technical-report.pdf")
document = result.document
The first run takes a while because Docling downloads local extraction models. After that, it works through the PDF, analyzes the blocks and components on each page, runs OCR, and hands back a data model, the DoclingDocument.
That data model is the reason I like this library. You can throw PDFs, PowerPoint decks, Word documents, or websites at the converter and every one of them comes back as the same DoclingDocument object. Your downstream pipeline stops caring whether the source was a PDF or a web page.
From there you export. document.export_to_markdown() gives you clean markdown, and there's a JSON export as well. For the Docling technical report I used as a test file, the markdown output had every heading at the correct level and no weird characters.
Tables are where Docling pulls ahead. A lot of open-source Python libraries that parse PDFs struggle with table extraction. Docling returned a perfectly formatted markdown table from the technical report, headers included, straight out of the box.
The library also takes custom extraction parameters for documents that are trickier, but for most files the defaults get you there.
Extract entire websites with the sitemap trick
The same converter handles HTML. Call convert with a URL instead of a file path and you get a DoclingDocument back for that page. This is fast, since it only parses HTML rather than running layout models, and the markdown export is an exact replica of the page content.
One page is rarely enough. For a full website, use the sitemap. Most websites expose a sitemap.xml at the root (check by appending /sitemap.xml to the domain in your browser), and it lists the URLs for every page on the site. I wrote a small helper called get_sitemap_urls that fetches the sitemap and returns those URLs, and Docling's convert_all method takes the whole list:
sitemap_urls = get_sitemap_urls("https://docling-project.github.io/docling/")
result_iter = converter.convert_all(sitemap_urls)
docs = [result.document for result in result_iter]
Now you have a DoclingDocument for every page on the site. A PDF, a single web page, or an entire website all end up as the same object in the same pipeline. That's the first step of a knowledge extraction system done.
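The helper itself can stay small. Here is a minimal sketch using only the standard library, assuming the site serves a plain urlset sitemap at the given base URL plus sitemap.xml. The name parse_sitemap_xml and the timeout value are hypothetical choices for this illustration, and the helper used in the walkthrough may look different.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Union
from urllib.parse import urljoin


def parse_sitemap_xml(xml_data: Union[str, bytes]) -> List[str]:
    """Return every <loc> URL found in a sitemap document, in order."""
    root = ET.fromstring(xml_data)
    urls = []
    for element in root.iter():
        # Tags carry the sitemap namespace ("{namespace}loc"),
        # so compare only the local part of the tag name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def get_sitemap_urls(base_url: str, sitemap_name: str = "sitemap.xml") -> List[str]:
    """Fetch <base_url>/sitemap.xml and return the page URLs it lists."""
    if not base_url.endswith("/"):
        base_url += "/"
    sitemap_url = urljoin(base_url, sitemap_name)
    # The timeout is an arbitrary choice for this sketch.
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        return parse_sitemap_xml(response.read())
```

This sketch does not follow sitemap index files (sitemaps that point at other sitemaps), and it collects every loc element, so a sitemap with image entries would need extra filtering.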
Chunk by document structure, not character counts
Chunking splits documents into pieces before they go into the database, so a query returns the specific parts relevant to the question instead of an entire book. The goal is logical splits, components that fit well together. That's harder than splitting the text every X words or characters, and Docling handles it out of the box with two chunkers you can combine.
The hierarchical chunker splits the document along its actual structure of paragraphs, lists, and other logical groups, with their children attached. That alone is a good starting point, and it runs automatically. The hybrid chunker builds on top of it and makes the chunks fit your embedding model.
That fit matters because every embedding model has a max input. OpenAI documents the limits for text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 (8,191 tokens each at the time of writing), and your chunks have to stay within the limit for the model you're using. The hybrid chunker splits chunks that are too large for the model and stitches together ones that are too small, like a chunk that's just a lone header or a short paragraph. It does both with your actual tokenizer, so the chunks fit the exact model you chose.
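To make the split-and-merge idea concrete, here is a toy sketch. It is not Docling's algorithm: it counts whitespace-separated words instead of real tokens, splits on word boundaries rather than document structure, and merges any neighbouring pieces, whereas the real chunker respects headings and structure. All function names are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def split_oversized(piece: str, max_tokens: int) -> List[str]:
    # Cut a piece into consecutive windows of at most max_tokens words.
    words = piece.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def fit_chunks(pieces: List[str], max_tokens: int) -> List[str]:
    # Pass 1: break up anything above the limit.
    sized = []
    for piece in pieces:
        if not piece.strip():
            continue  # drop empty pieces
        if count_tokens(piece) > max_tokens:
            sized.extend(split_oversized(piece, max_tokens))
        else:
            sized.append(piece)

    # Pass 2: greedily merge neighbours while the result still fits.
    merged = []
    for piece in sized:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


pieces = ["Intro", "short paragraph here", "a b c d e f g h i j k l"]
chunks = fit_chunks(pieces, max_tokens=5)
# chunks == ["Intro short paragraph here", "a b c d e", "f g h i j", "k l"]
```

The lone header and the short paragraph end up in one chunk, and the long piece is cut into parts that each stay within the limit.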
One catch is that Docling's documentation shows this setup with an open-source tokenizer from Hugging Face. I'm embedding with OpenAI, so I wrote a small OpenAI tokenizer wrapper that follows the exact API spec the chunker expects. With that in place:
from docling.chunking import HybridChunker
chunker = HybridChunker(
    tokenizer=tokenizer,  # the OpenAI tokenizer wrapper
    max_tokens=MAX_TOKENS,  # your embedding model's max input
    merge_peers=True,  # default; merges undersized chunks
)
chunks = list(chunker.chunk(result.document))
I ran the full Docling technical report through this and got 36 chunks back. The entire PDF, condensed into 36 text blobs that are each guaranteed to fit the context of text-embedding-3-large. That step normally takes real work to get right. Here it's a few lines.
Embed and store the chunks in a vector database
For the vector database I'm using LanceDB. The specific implementation doesn't really matter; in our own projects I typically use PostgreSQL with the pgvector extension. LanceDB is just easy to work with for an example like this because the database lives in persistent storage, like a SQLite database. The file literally shows up in your file system.
Two parts of the LanceDB API do real work here. First, you can register an embedding model as a function on the table itself. I point it at OpenAI's text-embedding-3-large, and from then on the table manages embedding internally. Second, you define the table schema with a Pydantic model that inherits from LanceModel, marking which field is the text source for the embedding function and which column holds the vector.
My schema's three parts are a text field, a vector field for search, and a metadata object carrying the file name, the page numbers the chunk came from, and the section title. Those three metadata fields come straight out of the Docling chunk objects, and they're what lets the chat app cite sources later.
One warning, because this cost me about an hour. If your Pydantic schema has a sub-model, its fields must be in alphabetical order. Otherwise you get weird errors. That's probably a bug in the current code, but you'll hit it.
Processing the chunks is a loop. Extract the text from each chunk, pull the file name, page numbers, and title out of the metadata, skip everything else. Then table.add(processed_chunks) writes it all to the database, and because the embedding function lives at the table level, the embeddings are created in the background during the add. No separate embedding step to manage. A row count afterwards confirms exactly 36 records.
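As a rough sketch, the per-chunk step could look like the function below. The attribute names (chunk.text, chunk.meta.origin.filename, chunk.meta.doc_items with their prov entries and page_no, chunk.meta.headings) are assumptions about how Docling chunk objects expose their metadata and may differ between versions. The SimpleNamespace object is only a stand-in so the example runs without Docling installed, and chunk_to_row is a hypothetical name.

```python
from types import SimpleNamespace


def chunk_to_row(chunk) -> dict:
    """Reduce one chunk to the text plus the three metadata fields we store."""
    meta = chunk.meta
    # Collect the distinct page numbers of every item that fed into this chunk.
    page_numbers = sorted(
        {prov.page_no for item in meta.doc_items for prov in item.prov}
    )
    headings = meta.headings or []
    return {
        "text": chunk.text,
        # Keys are kept in alphabetical order to match the sub-model schema.
        "metadata": {
            "filename": meta.origin.filename,
            "page_numbers": page_numbers or None,
            "title": headings[0] if headings else None,
        },
    }


# Stand-in for a real chunk object, for illustration only.
fake_chunk = SimpleNamespace(
    text="Example chunk text.",
    meta=SimpleNamespace(
        origin=SimpleNamespace(filename="docling-technical-report.pdf"),
        headings=["Introduction"],
        doc_items=[
            SimpleNamespace(prov=[SimpleNamespace(page_no=2)]),
            SimpleNamespace(prov=[SimpleNamespace(page_no=1), SimpleNamespace(page_no=2)]),
        ],
    ),
)
row = chunk_to_row(fake_chunk)
```

A list of such rows, one per chunk, is the kind of input table.add expects in this setup.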
Test retrieval before you touch the agent
The data is now ready for an AI system to use, and the next move is to query it directly.
With LanceDB this takes a few lines. Connect to the database by its local path, load the table, and call search with a query string, which is essentially the user's question. I set the query type to vector, which runs a similarity search over the embeddings, and the limit to 5 results. Searching for "pdf" pulls back the five most relevant chunks with their text, vector, metadata, and a distance score. Change the query to "what's Docling" and different chunks come back. Change the limit to 3 and you get three.
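If the distance score and the limit feel abstract, this toy version shows what a vector search does conceptually. It is not LanceDB's implementation: the two-dimensional vectors are made up, the brute-force scan stands in for a real index, and Euclidean distance is just one of the metrics a vector database can be configured to use. In the real pipeline the query string is embedded first; here the query is already a vector.

```python
import math
from typing import Dict, List


def l2_distance(a: List[float], b: List[float]) -> float:
    # Euclidean distance: smaller means the vectors are closer.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vector_search(query_vector: List[float], rows: List[Dict], limit: int = 5) -> List[Dict]:
    # Score every row against the query, then keep the closest `limit` rows.
    scored = [
        {**row, "_distance": l2_distance(query_vector, row["vector"])}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["_distance"])[:limit]


# Made-up rows standing in for embedded chunks.
rows = [
    {"text": "chunk about PDF parsing", "vector": [0.9, 0.1]},
    {"text": "chunk about table extraction", "vector": [0.6, 0.4]},
    {"text": "chunk about website crawling", "vector": [0.1, 0.9]},
]

results = vector_search([1.0, 0.0], rows, limit=2)
for r in results:
    print(round(r["_distance"], 3), r["text"])
```

Lowering the limit trims the tail of the sorted list, which is why changing it from 5 to 3 returns the same top results minus the last two.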
LanceDB also supports keyword search and a hybrid mode that combines both. For why that combination matters in production, and how to measure whether it helps on your data, see the post on hybrid search for RAG.
At this point the key components are all covered. Parsing, chunking, embedding, retrieval. That's everything a knowledge system for an AI agent needs.
Bring it together in a chat application
The last file wires the database into a Streamlit app. Streamlit's chat elements give you a simple interactive chat interface in pure Python that you can spin up locally, which makes it great for demos and examples. At a high level the app does three things. It connects to the database, it searches for context relevant to the user's question, and it streams the model's answer back through the chat components.
Run it from the folder the chat file lives in, with your environment active:
streamlit run 5-chat.py
Ask it "what's Docling" and it searches the documents first, shows you what it retrieved, and answers with the section name, file name, and page number attached. Citations, built from the metadata we stored.
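The citation step comes down to formatting. Below is a minimal sketch of how retrieved rows could be turned into a context string with sources attached, assuming each row is a dict with a text field and a metadata dict holding filename, page_numbers, and title as described earlier. The function name and the exact output layout are hypothetical; the app in the walkthrough may format things differently.

```python
from typing import Dict, List


def format_context(rows: List[Dict]) -> str:
    """Join retrieved chunks into one context string, each with its source."""
    blocks = []
    for row in rows:
        meta = row["metadata"]
        source_parts = [meta.get("filename") or "unknown file"]
        if meta.get("page_numbers"):
            pages = ", ".join(str(p) for p in meta["page_numbers"])
            source_parts.append("p. " + pages)
        header = "Source: " + " - ".join(source_parts)
        if meta.get("title"):
            header += "\nSection: " + meta["title"]
        blocks.append(row["text"] + "\n" + header)
    # A blank line separates chunks so the model can tell them apart.
    return "\n\n".join(blocks)


retrieved = [
    {
        "text": "Example chunk text.",
        "metadata": {"filename": "report.pdf", "page_numbers": [3, 4], "title": "Introduction"},
    },
    {
        "text": "Another chunk.",
        "metadata": {"filename": "report.pdf", "page_numbers": None, "title": None},
    },
]
context = format_context(retrieved)
print(context)
```

The resulting string goes into the prompt alongside the user's question, which lets the model reference the file, pages, and section in its answer.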
Right now that's one PDF. The system doesn't care. Add more documents through the same pipeline and the agent's knowledge grows with the table. The retrieval settings are live too. I changed the number of retrieved results to five, saved the file, asked again, and the answer was grounded in five chunks.
A Streamlit demo is the proof of concept, not the product. For the production version of this idea, with authentication, chat history, and citation validation, read the private knowledge chatbot architecture.
This pipeline is the data layer of the broader skill set in how to build production AI systems. The model call is the easy part. Whether your agent knows anything useful is decided here, in extraction and chunking, before a prompt ever runs. The full walkthrough, including the repository with all five files, is in the video.
FAQ
What is Docling?
Docling is an open-source document extraction library from IBM. It converts PDFs, Word documents, PowerPoint files, and websites into a unified DoclingDocument object that you can export as markdown or JSON. In my experience it's currently the strongest open-source option, especially for table extraction, where most Python PDF libraries fall apart.
Can I run this pipeline without OpenAI?
Yes. The document extraction is fully open source and runs locally. I use OpenAI for embeddings and chat in the walkthrough, but you can swap in an open-source embedding model and language model without changing the pipeline's structure.
Why can't I just split documents every 500 characters?
Because the goal of chunking is logical splits, pieces that fit well together, and a fixed character count ignores the document's structure. Docling's hierarchical chunker splits along that structure (paragraphs, lists, groups with their children), and the hybrid chunker then sizes those pieces for your embedding model's max input using its tokenizer.
Do I need LanceDB for this?
No. The same principles work with any vector database. I use LanceDB here because the database is a local file, like SQLite, which keeps the example simple. In client projects I typically use PostgreSQL with pgvector; you only swap the code that creates the embeddings and writes the table.
