What is RAG and why does it exist?
LLMs have a knowledge cutoff and no access to your private data. RAG (Retrieval-Augmented Generation) solves both:
Without RAG: "What does our Q3 contract with Acme Corp say?"
→ Model: "I don't have access to your contracts."
With RAG:
1. Embed query → find relevant contract chunks → inject into prompt
2. Model: "According to the Q3 2024 Acme Corp contract (clause 4.2), ..."
The full pipeline:
Ingestion (offline):
Documents → Chunk → Embed → Store in vector DB
Query (online):
User query → Embed → Retrieve top-K chunks → Augment prompt → LLM → Answer
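The two stages above can be sketched end to end with a toy in-memory store. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector DB" is a Python list, chunking is one paragraph per chunk, and the LLM call is left out so the function stops at the augmented prompt.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def toy_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion (offline): documents -> chunk -> embed -> store
def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    store = []
    for doc in documents:
        for chunk in doc.split("\n\n"):  # naive chunking: one paragraph per chunk
            if chunk.strip():
                store.append((chunk.strip(), toy_embed(chunk)))
    return store

# Query (online): embed query -> retrieve top-K chunks -> augment prompt
def build_prompt(query: str, store: list[tuple[str, Counter]], top_k: int = 2) -> str:
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: toy_similarity(q, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"

store = ingest(["Refunds are issued within 14 days.\n\nShipping takes 3 to 5 business days."])
prompt = build_prompt("how long does shipping take", store, top_k=1)
print(prompt)
```

In a real system each stand-in is swapped for the component named in the pipeline: an embedding model, a vector database, and an LLM call that consumes the prompt.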
Embeddings: the math you need to know
An embedding is a fixed-size vector of floats that represents semantic meaning. Texts with similar meaning have vectors that are close in high-dimensional space.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    # The Anthropic SDK has no embeddings endpoint; voyage-3 is served by Voyage AI
    response = vo.embed(texts, model="voyage-3")
    return response.embeddings

# Two semantically similar sentences
vecs = embed(["How do I cancel my subscription?",
              "What's the process for ending my plan?"])
# Their cosine similarity should be high; the exact value depends on the model
Cosine similarity measures the angle between vectors (not their magnitude):
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Returns 1.0 = identical direction, 0.0 = orthogonal, -1.0 = opposite
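A small worked example makes the "angle, not magnitude" point concrete. The sketch below is a pure-Python version of the same formula (the function name `cosine` and the sample vectors are chosen for illustration only): scaling a vector leaves the similarity unchanged, while perpendicular and reversed vectors land at 0 and -1.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2, 3], [2, 4, 6])  # second vector is the first scaled by 2
orthogonal = cosine([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine([1, 2], [-1, -2])            # same line, reversed direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because magnitude drops out, a long document and a short query can still score as close neighbors if they point in the same direction.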
Common embedding models:
| Model | Dimension | Context | Good for |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | 8K tokens | High accuracy |
| text-embedding-3-small (OpenAI) | 1536 | 8K tokens | Cost/quality balance |
| voyage-3 (Voyage AI) | 1024 | 32K tokens | Long documents |
| nomic-embed-text (open source) | 768 | 8K tokens | Self-hosted |
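Chunking: the most important RAG decision
Why it matters: chunks that are too small lose the context needed to interpret them, and chunks that are too large dilute relevance because each embedding has to cover several topics at once. Bad chunking is one of the most common causes of RAG failures.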
Fixed-size chunking (baseline, rarely best)
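def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
    return chunks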
Sentence-boundary chunking (better)
import re

def chunk_by_sentences(text: str, max_chars: int = 1000) -> list[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) > max_chars and current:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    if current:
        chunks.append(current.strip())
    return chunks
Semantic chunking (best for complex documents)
Split where the topic changes, not at fixed character counts:
# Aliased so it does not shadow the cosine_similarity() helper defined earlier
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

def chunk_semantic(sentences: list[str], threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    embeddings = embed(sentences)
    chunks, current_chunk = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = pairwise_cosine([embeddings[i-1]], [embeddings[i]])[0][0]
        if similarity < threshold:  # Topic shift detected
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
Practical rules:
- Chunk size ≈ 512–1000 tokens (longer = less precise retrieval)
- Include document title / section header in each chunk (adds context)
- Overlap 10–15% of chunk size to avoid splitting mid-thought
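The rules above can be combined in one small helper. This is a sketch under stated assumptions, not a reference implementation: `chunk_with_context` is a hypothetical name, whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), and the title is simply prepended to every chunk.

```python
def chunk_with_context(text: str, title: str, size: int = 512,
                       overlap_ratio: float = 0.1) -> list[str]:
    """Split text into overlapping word windows, each prefixed with the title."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    words = text.split()
    overlap = round(size * overlap_ratio)  # e.g. 10% of the chunk size
    step = max(1, size - overlap)          # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(f"{title}\n\n{' '.join(window)}")
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(25))
demo_chunks = chunk_with_context(sample, title="Refund policy", size=10, overlap_ratio=0.2)
for c in demo_chunks:
    print(repr(c))
```

With `size=10` and a 20% overlap, each chunk repeats the last two words of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.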
pgvector: vector storage in Postgres
If you already run Postgres, you may not need a separate vector database: pgvector adds a vector column type and similarity search operators.
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Table with embedding column
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source_url TEXT,
    section TEXT,
    metadata JSONB,
    embedding vector(1024),  -- must match your model's dimension (voyage-3 = 1024)
    created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
-- HNSW usually gives a better speed/recall trade-off at query time than IVFFlat,
-- at the cost of slower index builds and more memory. Both are approximate:
-- they trade a little recall for speed compared with an exact scan.
-- Cosine similarity search: find top 5 most similar chunks
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1 -- <=> = cosine distance
LIMIT 5;
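To see what the query above computes, here is a brute-force equivalent in plain Python. It is only a sketch: the `rows` list of dicts is a hypothetical in-memory stand-in for the `documents` table, and the scan is exact, whereas the HNSW index returns approximate results.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Mirrors pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec: list[float], rows: list[dict], k: int = 5) -> list[dict]:
    """Exact nearest-neighbor search: score every row, sort by distance, keep k."""
    scored = []
    for row in rows:
        distance = cosine_distance(row["embedding"], query_vec)
        scored.append({"id": row["id"], "content": row["content"],
                       "similarity": 1.0 - distance})
    # ORDER BY embedding <=> $1 means smallest distance (highest similarity) first
    scored.sort(key=lambda r: r["similarity"], reverse=True)
    return scored[:k]

rows = [
    {"id": 1, "content": "cancel a subscription", "embedding": [1.0, 0.0]},
    {"id": 2, "content": "office opening hours", "embedding": [0.0, 1.0]},
    {"id": 3, "content": "change or end a plan", "embedding": [1.0, 1.0]},
]
results = top_k([1.0, 0.1], rows, k=2)
print([r["id"] for r in results])
```

The exact scan costs one distance computation per row, which is why an approximate index becomes worthwhile once the table grows large.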
Python ingestion:
import psycopg2
import json

def ingest_chunks(chunks: list[str], source_url: str, conn):
    embeddings = embed(chunks)  # batch embed: much cheaper than one call per chunk
    with conn.cursor() as cur:
        for chunk, vec in zip(chunks, embeddings):
            cur.execute(
                """INSERT INTO documents (content, source_url, embedding)
                VALUES (%s, %s, %s)""",
                (chunk, source_url, json.dumps(vec))
            )
    conn.commit()
Full RAG pipeline
import anthropic
client = anthropic.Anthropic()
def retrieve(query: str, conn, top_k: int = 5) -> list[dict]:
    query_vec = embed([query])[0]
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, source_url, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > 0.7  -- similarity threshold
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """, (query_vec, query_vec, query_vec, top_k))
        return [
            {"content": row[0], "source": row[1], "similarity": row[2]}
            for row in cur.fetchall()
        ]
def rag_answer(user_question: str, conn) -> str:
    # 1. Retrieve
    chunks = retrieve(user_question, conn)
    if not chunks:
        return "I couldn't find relevant information to answer your question."
    # 2. Format context
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['content']}" for c in chunks
    )
    # 3. Augment prompt
    prompt = f"""Answer the user's question based only on the provided context.
If the context doesn't contain enough information, say so clearly.
<context>
{context}
</context>
Question: {user_question}"""
    # 4. Generate
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant. Answer based on the provided context only.",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
Hybrid search: better than pure vector search
Pure vector search can miss exact keyword matches (for example, "invoice #INV-2847"). Keyword full-text search (BM25, or ts_rank in Postgres) misses semantic similarity. Hybrid search combines both.
-- Enable full-text search on content
ALTER TABLE documents ADD COLUMN fts_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON documents USING gin(fts_vector);
-- Hybrid search: RRF (Reciprocal Rank Fusion)
WITH
vector_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1  -- without this, LIMIT may keep an arbitrary 20 rows
    LIMIT 20
),
text_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
    -- plainto_tsquery accepts raw user text; to_tsquery expects tsquery syntax
    FROM documents, plainto_tsquery('english', $2) query
    WHERE fts_vector @@ query
    ORDER BY ts_rank(fts_vector, query) DESC
    LIMIT 20
)
SELECT
    COALESCE(v.id, t.id) AS id,
    COALESCE(v.content, t.content) AS content,
    -- RRF score: 1/(k + rank); k=60 is the commonly used default
    COALESCE(1.0/(60 + v.rank), 0) + COALESCE(1.0/(60 + t.rank), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 5;
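The fusion step in the query above is easy to check by hand. The following sketch computes the same RRF score in Python for two already-ranked lists of document ids; the function name and the ids `a` through `d` are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 5) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) for every list a document appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # A document missing from one list simply contributes nothing for that list,
    # which matches COALESCE(..., 0) in the SQL version
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

vector_ranking = ["a", "b", "c"]  # best-first ids from vector search
text_ranking = ["c", "a", "d"]    # best-first ids from full-text search
fused = reciprocal_rank_fusion([vector_ranking, text_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (`a` and `c` here) outrank documents that appear in only one, even when that one list ranks them highly. RRF uses ranks rather than raw scores, so the two searches do not need comparable score scales.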
Reranking: precision over recall
Retrieve 20 chunks with fast vector search, then re-rank with a slower but more accurate cross-encoder model:
# Install: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, conn, initial_k: int = 20, final_k: int = 5) -> list[dict]:
    # Step 1: fast retrieval (recall-optimized)
    candidates = retrieve(query, conn, top_k=initial_k)
    # Step 2: slow reranking (precision-optimized)
    pairs = [(query, c["content"]) for c in candidates]
    scores = reranker.predict(pairs)
    # Step 3: sort by reranker score, take top final_k
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in ranked[:final_k]]
Evaluating RAG quality
| Metric | What it measures | Tool |
|---|---|---|
| Retrieval precision | Are retrieved chunks actually relevant? | Manual labels or LLM judge |
| Retrieval recall | Are all relevant chunks being retrieved? | Manual labels |
| Answer faithfulness | Is every claim in the answer supported by the context? | RAGAS, LLM judge |
| Answer relevancy | Does the answer address the question? | RAGAS, LLM judge |
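The two retrieval metrics in the table can be computed directly once you have manual labels. A minimal sketch, assuming each chunk has a string id and that `relevant` is the hand-labeled set of ids for one query (both the function names and the ids are illustrative):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of all relevant chunks that show up in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

retrieved_ids = ["c1", "c7", "c3", "c9", "c4"]  # retriever output, best first
relevant_ids = {"c1", "c3", "c5"}               # manual labels for this query

p5 = precision_at_k(retrieved_ids, relevant_ids, k=5)
r5 = recall_at_k(retrieved_ids, relevant_ids, k=5)
print(p5, r5)
```

Averaging both numbers over a labeled set of queries gives a baseline to compare against after each change to chunking, hybrid search, or reranking.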
import re

# LLM-as-judge for faithfulness
def evaluate_faithfulness(question: str, context: str, answer: str) -> float:
    prompt = f"""Rate whether the ANSWER is faithfully grounded in the CONTEXT.
Score 1-5: 1=completely contradicts context, 5=fully supported by context.
Return only the number.
CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}
Score:"""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use cheap model for evals
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    # The judge may add stray text, so extract the first digit in the 1-5 range
    match = re.search(r"[1-5]", response.content[0].text)
    if match is None:
        raise ValueError("judge did not return a score between 1 and 5")
    return float(match.group())
Instruct the model to answer from the provided context only. Chunking is usually the biggest quality lever: most RAG failures trace back to chunks that lack enough context or are too large to be semantically precise. Evaluate faithfulness and relevancy with an LLM judge, and improve retrieval by tracking precision@5 on a labeled eval set.