DataOct 1, 2024Updated Jun 11, 20269 min read

How to build RAG with PostgreSQL

Build RAG with PostgreSQL using pgvector and pgvectorscale, keeping relational data and embeddings in one database, with similarity search, structured output, and advanced filtering.

Dave EbbelaarSenior AI Engineer

TL;DR

To build RAG with PostgreSQL, run a TimescaleDB image with Docker Compose, enable the pgvector and pgvectorscale extensions, and store embeddings in a table with a vector column. Embed your documents with text-embedding-3-small (1,536 dimensions), upsert them, and create a DiskANN index to speed up queries. Retrieval is a cosine similarity search. pgvector computes cosine distance, which is 1 minus cosine similarity, so lower scores mean closer matches. Then pass the retrieved context to an LLM with structured output (the instructor library plus Pydantic) and have the model return an answer along with an enough_context boolean, so questions your data cannot answer get routed to a human instead of hallucinated. At Datalumina this setup replaced Pinecone two years ago. Unless you scale into billions of records, one Postgres database handles both your relational data and your vectors.

We run RAG on PostgreSQL at Datalumina. Two years ago we started with Pinecone, and we moved everything to Postgres with the pgvector extension not long after. Today the setup runs on pgvectorscale, and it is our preferred method by far. For the projects we work on, there has been no need for a dedicated vector database at all.

This post walks through the full pipeline from my video tutorial. You spin up the database with Docker, embed and insert your data, run the similarity search, synthesize answers with structured output, and sharpen retrieval with metadata filters, predicates, and time ranges. To follow along you need Docker, Python, an OpenAI key, and something to view your Postgres database in.

Why Postgres instead of a dedicated vector database

Simplicity. With this approach, one database manages your regular data, so chats, messages, or the content your application generates, and a table in that same database stores the embeddings for your RAG system. One database means one deployment and simpler APIs. You are not syncing records between your application store and a separate vector service.

Speed used to be the argument for dedicated vector databases like Pinecone, Weaviate, or Qdrant. It no longer holds for most workloads. By the time we adopted pgvector, the latency was already better than or as good as a dedicated vector database in most cases, which makes a separate system pretty much redundant for most applications. I would even call it overkill.

There is a real caveat. If you scale into billions of records, a dedicated vector database might make sense at some point. But for most applications, a regular Postgres database with this setup is more than enough, and it will be fast enough to cover the needs of your AI application. You get an open source database that manages both relational data and vector data, and it is fast. What else do you want from a system like this?

What pgvector and pgvectorscale actually do

pgvector is the extension that lets a regular Postgres database work as a vector database. It adds the vector data type and similarity search. That alone gets you a long way, and I covered it in an earlier pgvector guide.

pgvectorscale builds on top of it. It comes from Timescale, a company that makes products to extend Postgres (their hypertables for time series data are also worth knowing). The extension complements pgvector with a new index type, StreamingDiskANN, inspired by Microsoft's DiskANN research, which speeds up vector queries beyond what pgvector does on its own. Timescale's blog post "How we made PostgreSQL a better vector database" walks through the speed comparisons if you want the numbers.

At the core you are still running a normal Postgres database. pgvector enables the vector capabilities, and pgvectorscale adds the index that makes querying fast.

Set up the database with Docker Compose

The whole database layer is one Docker Compose file. It runs the TimescaleDB image on Postgres 16, names the container, sets the default database and password, exposes the default port, and mounts a volume:

YAML

services:
  timescaledb:
    image: timescale/timescaledb-ha:pg16
    container_name: timescaledb
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_PASSWORD=password
    ports:
      - "5432:5432"
    volumes:
      - timescaledb_data:/var/lib/postgresql/data
 
volumes:
  timescaledb_data:

The volume matters. It gives Docker a separate place to store the data, so when you stop the container, your records persist. The postgres/password credentials are fine locally, but change them before this goes anywhere near production.

Start it in detached mode:

Shell

docker compose up -d

To inspect what lands in the database, connect a GUI client. I use TablePlus, but DataGrip, pgAdmin, or a database extension in VS Code or Cursor all work. Host is localhost, port and user stay on the Postgres defaults, password is whatever you set in the compose file. Test the connection and you should see an empty database.

Embed and insert your data

The demo dataset is an FAQ for a hypothetical e-commerce store building an internal knowledge base, questions like "what are the shipping options" and "how can I track my order", each with an answer and a category. These records become the rows in the vector table.

Preparation is one step. Take the question and the answer columns and concatenate them into a single string per record. That string is what gets embedded.

For configuration you need two environment variables, your OpenAI API key and a service URL that points at the local Postgres instance. If you did not change the compose file, the default URL from the example env file works as is. The settings file also pins the embedding model, text-embedding-3-small, at 1,536 dimensions. Whatever model you pick, the embedding dimension must match the dimension configured for the table, or inserts will fail.

The embedding call itself is unspectacular. The project wraps the timescale-vector Python library in a small vector store class, and get_embedding is a plain request to the OpenAI embeddings API. I like working with pandas DataFrames here, so the script applies that function across every row and produces a frame with four columns (id, metadata, content, and embedding). That structure works out of the box with the timescale-vector client.

Three calls finish the job. Create the tables, create the index, and upsert the DataFrame. For the index, start with DiskANN. It is the piece pgvectorscale adds and the reason querying stays fast as the table grows. The demo inserts 20 records, and refreshing the GUI shows each one with its metadata, content, and embedding vector.

The Python wrapper is convenience, not magic. Behind the scenes it runs regular SQL, and you can do all of this directly when you need a custom table structure:

SQL

CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
 
CREATE TABLE embeddings (
    id UUID PRIMARY KEY,
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
);
 
CREATE INDEX ON embeddings
USING diskann (embedding);

Any table with a vector column works. The library defaults are just a sensible starting shape.

Run the similarity search

Retrieval is one call. Embed the question, search the table, get the closest records back.

Python

results = vec.search("What are your shipping options?", limit=3)

The limit controls how many results come back, so this returns the three most relevant records. The search uses cosine similarity, currently the only supported metric in this setup. Behind the scenes the SQL operator computes cosine distance, which is 1 minus the cosine similarity. The score ranges from 0 to 2, and lower is better. The closer to zero, the more similar the vectors.

Now the trap. Ask "what's the weather in Tokyo" and you still get three results back. They are the nearest neighbors in the FAQ table, and they have nothing to do with the question. Vector search has no concept of "no good match"; it always returns the closest thing it has. Managing that gap is one of the real challenges of scaling RAG applications.

Stuck building prototypes?

Get our production stack

What you're missing is the architecture, evaluation, and judgment to ship a real project.

See Curriculum

Synthesize answers with structured output

The synthesizer closes that gap. It is a separate class that takes the question and the retrieved context, sends both to OpenAI, and returns the response in a structured way using the instructor library. The three fields on the Pydantic response model are a thought process that lists the reasoning steps (a simple way to promote chain-of-thought), the answer itself, and an enough_context boolean.

That boolean is the part I would never skip. The model has to decide whether it actually used the provided context or made the answer up, and the system prompt instructs it accordingly. For the shipping question, you get a grounded answer and enough_context comes back true. For the Tokyo weather question, the model answers "I'm sorry, but I cannot provide current weather information" and enough_context comes back false.

Your application can branch on that value with plain conditional logic. In a customer support setup, false routes the conversation to a real customer care representative instead of letting the system improvise. This is a simple, effective way to reduce hallucinations, and it still excites me every time a system returns that structured response reliably. In my experience, structured output is the key to building reliable AI solutions.

Filter with metadata, predicates, and time ranges

Plain similarity search is the baseline. The search function takes three more parameters that make retrieval precise.

Metadata filtering restricts the search to records whose metadata matches. Filter on the category shipping and the same query returns only 2 results, even with the limit set to 3, because only two records in the dataset carry that category.

Predicates go further. They express conditional logic over the metadata and combine with boolean operators:

Python

from timescale_vector import client
 
predicates = client.Predicates("category", "==", "shipping") | client.Predicates(
    "category", "==", "services"
)
results = vec.search(query, limit=3, predicates=predicates)

That returns both the shipping and the services records. The operators adapt to the data type. Strings compare as strings, while integers, floats, and datetimes support comparisons like greater-than. Filter shipping records on a created_at of September or later and you get results; move the cutoff to October and you get an empty DataFrame back.

Time-based filtering is the third option, and it works without any metadata at all. The default ids are UUID version 1, which embeds a timestamp in the id itself. Pass a time_range as a tuple of two datetime objects and the search only considers records created in that window. The whole month of September returns the demo records; August returns nothing.

Every dedicated vector database offers some version of this. The difference is that here it all lives inside the open source Postgres database that already holds the rest of your data. If you want to push retrieval further with keyword search and reranking on top of vectors, that is the territory of hybrid search for RAG.

Put an analysis step in front of retrieval

Right now the pipeline sends every question straight to vector search. The upgrade that makes this setup powerful is one extra step before retrieval, an LLM call with structured output that analyzes the incoming question and assigns it a category. That category then feeds the metadata filter, the filtered search retrieves the right slice of the data, and the synthesizer produces the final answer.

You are building a chain of actions. Classify, filter, retrieve, synthesize. Once you start designing pipelines like that, the sky is the limit, and it is how the pros integrate AI into real systems. That pipeline design matters more than the choice of vector store, and it belongs to the broader skill set of building production AI systems.

If you want to see every step on screen, watch the full Postgres RAG tutorial. The repository with the vector store class, the synthesizer, and the FAQ dataset is linked in the video description.

FAQ

Can PostgreSQL be used as a vector database?

Yes. The pgvector extension adds a vector data type and similarity search to a regular Postgres database, and pgvectorscale adds a DiskANN index that speeds up queries further. You store embeddings in a table next to your relational data and query them with SQL or a Python client.

Do I need a dedicated vector database for RAG?

For most applications, no. Postgres with pgvector matches or beats dedicated vector databases on latency for typical workloads, and you skip running a second system. If you scale into billions of records, a dedicated vector database might make sense at some point.

What is the difference between pgvector and pgvectorscale?

pgvector provides the vector data type and the similarity search operators. pgvectorscale, built by Timescale, complements it with the StreamingDiskANN index, which makes vector queries faster. You enable both extensions and still work with a normal Postgres database.

How do I read the similarity scores in the results?

The search returns a cosine distance, which is 1 minus the cosine similarity. The value ranges from 0 to 2. Lower is better, and a score close to zero means the vectors are nearly identical.

How do I stop the system from answering questions outside my data?

Have the model return an enough_context boolean alongside the answer using structured output. When retrieval returns context that does not cover the question, the boolean comes back false, and your application can route the conversation to a human instead of serving a made-up answer.