Phase 2 — Embeddings, Vector Search & RAG

What is RAG and why does it exist?

LLMs have a knowledge cutoff and no access to your private data. RAG (Retrieval-Augmented Generation) solves both:

Without RAG: "What does our Q3 contract with Acme Corp say?"
  → Model: "I don't have access to your contracts."

With RAG:
  1. Embed query → find relevant contract chunks → inject into prompt
  2. Model: "According to the Q3 2024 Acme Corp contract (clause 4.2), ..."

The full pipeline:

Ingestion (offline):
  Documents → Chunk → Embed → Store in vector DB

Query (online):
  User query → Embed → Retrieve top-K chunks → Augment prompt → LLM → Answer

Embeddings — the math you need to know

An embedding is a fixed-size vector of floats that represents semantic meaning. Texts with similar meaning have vectors that are close in high-dimensional space.

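import voyageai

# voyage-3 is served by Voyage AI; the Anthropic SDK has no embeddings endpoint.
# The client reads VOYAGE_API_KEY from the environment.
vo = voyageai.Client()

def embed(texts: list[str], input_type: str = "document") -> list[list[float]]:
    # input_type is "document" for stored chunks and "query" for search queries
    response = vo.embed(texts, model="voyage-3", input_type=input_type)
    return response.embeddings

# Two semantically similar sentences
vecs = embed(["How do I cancel my subscription?",
              "What's the process for ending my plan?"])
# Their cosine similarity should be high (much closer to 1.0 than for unrelated sentences)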

Cosine similarity — measures angle between vectors (not magnitude):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Returns 1.0 = identical direction, 0.0 = orthogonal, -1.0 = opposite
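To make the "angle, not magnitude" point concrete, here is a small sketch with hand-written 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions). The toy vectors and the `top_k` helper are illustrative assumptions, and the code sticks to the standard library so it runs without numpy.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    # Exact (brute-force) nearest-neighbour search: score everything, keep the best k
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]

query = [1.0, 2.0, 3.0]
corpus = {
    "same_direction": [2.0, 4.0, 6.0],    # the query scaled by 2: longer, same angle
    "orthogonal": [3.0, 0.0, -1.0],       # dot product with the query is 3 + 0 - 3 = 0
    "opposite": [-1.0, -2.0, -3.0],       # the query negated
}

for name, vec in corpus.items():
    print(name, round(cosine(query, vec), 3))
print(top_k(query, corpus, k=2))
```

Scaling a vector changes its length but not its score, which is why cosine similarity is a good fit for comparing embeddings.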

Common embedding models:

Model	Dimension	Context	Good for
`text-embedding-3-large` (OpenAI)	3072	8K tokens	High accuracy
`text-embedding-3-small` (OpenAI)	1536	8K tokens	Cost/quality balance
`voyage-3` (Voyage AI)	1024	32K tokens	Long documents
`nomic-embed-text` (open source)	768	8K tokens	Self-hosted
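Dimension also drives storage cost. A rough sketch, assuming 4-byte floats plus a fixed 8-byte overhead per vector (the figure pgvector documents for its `vector` type), and ignoring index and row overhead. The one-million-chunk corpus is an arbitrary illustration, and both function names are made up for this example.

```python
def vector_bytes(dimensions: int, overhead_bytes: int = 8) -> int:
    # 4 bytes per float32 component plus a small fixed header
    return 4 * dimensions + overhead_bytes

def corpus_gib(dimensions: int, num_chunks: int) -> float:
    # Raw vector storage only; indexes and other columns come on top
    return vector_bytes(dimensions) * num_chunks / 1024**3

models = [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("voyage-3", 1024),
    ("nomic-embed-text", 768),
]
for name, dims in models:
    print(f"{name}: {corpus_gib(dims, 1_000_000):.2f} GiB per million chunks")
```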

Chunking — the most important RAG decision

Why it matters: if your chunks are too small, they lose context. Too large, they dilute relevance. Bad chunking is the #1 cause of RAG failures.

Fixed-size chunking (baseline, rarely best)

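def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # size and overlap are measured in characters, not tokens
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each chunk starts `step` characters after the previous one
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
    return chunks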
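One quirk of the loop above: the last chunk can be a short tail that is already fully covered by the previous chunk's overlap. A minimal variant that stops once a chunk reaches the end of the text might look like this (`chunk_fixed_no_tail` is a hypothetical name, not part of the original code):

```python
def chunk_fixed_no_tail(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

# With size=4 and overlap=1 the chunks start at 0, 3 and 6
print(chunk_fixed_no_tail("abcdefghij", size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Without the early `break`, the same input would produce a fourth chunk, `'j'`, which duplicates the end of `'ghij'`.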

Sentence-boundary chunking (better)

import re

def chunk_by_sentences(text: str, max_chars: int = 1000) -> list[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    
    for sentence in sentences:
        if len(current) + len(sentence) > max_chars and current:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    
    if current:
        chunks.append(current.strip())
    return chunks

Semantic chunking (best for complex documents)

Split where the topic changes, not at fixed character counts:

# Aliased so it does not shadow the cosine_similarity() helper defined earlier
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

def chunk_semantic(sentences: list[str], threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    embeddings = embed(sentences)
    chunks, current_chunk = [], [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Compare each sentence with the one immediately before it
        similarity = pairwise_cosine([embeddings[i-1]], [embeddings[i]])[0][0]
        if similarity < threshold:  # Topic shift detected
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    chunks.append(" ".join(current_chunk))  # flush the final chunk
    return chunks
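The splitting logic can be tried without an embedding API by passing in precomputed vectors. The sketch below uses made-up 2-dimensional "embeddings" purely to show where the split lands; `split_on_topic_shift` and the toy data are assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def split_on_topic_shift(sentences: list[str],
                         embeddings: list[list[float]],
                         threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A low similarity between neighbouring sentences marks a topic boundary
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Refunds take five days.",
    "Refunds go to the original card.",
    "Our office is in Berlin.",
]
# Hand-written vectors: the first two point the same way, the third does not
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]

print(split_on_topic_shift(sentences, toy_embeddings))
```

The first two sentences stay together and the third starts a new chunk, because only the second pair of neighbours falls below the 0.7 threshold.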

Practical rules:

Chunk size ≈ 512–1000 tokens (longer = less precise retrieval)
Include document title / section header in each chunk (adds context)
Overlap 10–15% of chunk size to avoid splitting mid-thought
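A small sketch of the last two rules: deriving an overlap from the chunk size and prefixing each chunk with its document title and section. The helper names, the 12.5% default, and the `Title > Section` header format are assumptions chosen for illustration.

```python
def overlap_for(chunk_size: int, fraction: float = 0.125) -> int:
    # 12.5% sits inside the 10-15% range suggested above
    return round(chunk_size * fraction)

def with_header(chunks: list[str], title: str, section: str) -> list[str]:
    # Prepend the same context line to every chunk so it survives retrieval
    header = f"{title} > {section}"
    return [f"{header}\n{chunk}" for chunk in chunks]

print(overlap_for(512))  # → 64
for chunk in with_header(["Payment is due within 30 days."],
                         title="Acme Corp Q3 contract", section="Clause 4.2"):
    print(chunk)
```

The header is embedded together with the chunk text, so a query that mentions the document or section name has something to match against.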

pgvector — vector storage in Postgres

You already run Postgres. pgvector adds a vector column type and similarity search operators.

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE documents (
  id           BIGSERIAL PRIMARY KEY,
  content      TEXT NOT NULL,
  source_url   TEXT,
  section      TEXT,
  metadata     JSONB,
  embedding    vector(1024),     -- must match your model's dimension (1024 for voyage-3)
  created_at   TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
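-- HNSW usually gives a better speed/recall trade-off at query time than IVFFlat,
-- but builds more slowly and uses more memory. Both are approximate: they trade a little recall for speed.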

-- Cosine similarity search: find top 5 most similar chunks
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1  -- <=> = cosine distance
LIMIT 5;

Python ingestion:

import json
import psycopg2

def ingest_chunks(chunks: list[str], source_url: str, conn):
    embeddings = embed(chunks)  # one batched request instead of one API call per chunk
    
    with conn.cursor() as cur:
        for chunk, vec in zip(chunks, embeddings):
            # json.dumps yields "[0.1, 0.2, ...]", which pgvector parses as a vector literal
            cur.execute(
                """INSERT INTO documents (content, source_url, embedding)
                   VALUES (%s, %s, %s::vector)""",
                (chunk, source_url, json.dumps(vec))
            )
    conn.commit()
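Embedding APIs generally cap how many inputs one request may carry, so a large document set needs to be split into batches before embedding. A minimal sketch, where the batch size of 64 is an arbitrary assumption (check your provider's limit) and `fake_embed` is a stand-in so the example runs offline:

```python
from typing import Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunks: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 64) -> list[list[float]]:
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))  # one API call per batch
    return vectors

def fake_embed(texts: list[str]) -> list[list[float]]:
    # Stand-in embedder for demonstration only: a 1-dimensional "vector" holding the text length
    return [[float(len(t))] for t in texts]

print(list(batched(["a", "b", "c", "d", "e"], 2)))
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

In real use, `embed_fn` would be the `embed` function from earlier, and the order of the returned vectors matches the order of the input chunks.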

Full RAG pipeline

import anthropic

client = anthropic.Anthropic()

def retrieve(query: str, conn, top_k: int = 5) -> list[dict]:
    query_vec = embed([query])[0]
    
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, source_url, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > 0.7  -- similarity threshold
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_vec, query_vec, query_vec, top_k))
        
        return [
            {"content": row[0], "source": row[1], "similarity": row[2]}
            for row in cur.fetchall()
        ]

def rag_answer(user_question: str, conn) -> str:
    # 1. Retrieve
    chunks = retrieve(user_question, conn)
    
    if not chunks:
        return "I couldn't find relevant information to answer your question."
    
    # 2. Format context
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['content']}" for c in chunks
    )
    
    # 3. Augment prompt
    prompt = f"""Answer the user's question based only on the provided context.
If the context doesn't contain enough information, say so clearly.

<context>
{context}
</context>

Question: {user_question}"""
    
    # 4. Generate
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant. Answer based on the provided context only.",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
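Prompt assembly is plain string work, so it can be pulled out and tested without a database or an API key. The sketch below also caps the context with a character budget, which is an added assumption and not something the pipeline above does; the chunk contents are made-up sample data.

```python
def select_within_budget(chunks: list[dict], max_chars: int) -> list[dict]:
    # Chunks are assumed to arrive best-first, as retrieve() returns them
    selected, used = [], 0
    for chunk in chunks:
        cost = len(chunk["content"])
        if used + cost > max_chars:
            break
        selected.append(chunk)
        used += cost
    return selected

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['content']}" for c in chunks
    )
    return (
        "Answer the user's question based only on the provided context.\n"
        "If the context doesn't contain enough information, say so clearly.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )

sample_chunks = [
    {"content": "Clause 4.2 sets payment terms.", "source": "contract.pdf", "similarity": 0.91},
    {"content": "x" * 500, "source": "appendix.pdf", "similarity": 0.75},
]
kept = select_within_budget(sample_chunks, max_chars=100)
prompt = build_prompt("What does clause 4.2 cover?", kept)
print(prompt)
```

Here the 500-character chunk is dropped because it would exceed the 100-character budget, so only the first source appears in the prompt.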

Hybrid search — better than pure vector search

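Pure vector search misses exact keyword matches (“invoice #INV-2847”). Keyword full-text search misses semantic similarity (BM25 is the classic ranking function here; Postgres uses its own ts_rank, which is not BM25 but fills the same role). Hybrid search combines both.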

-- Enable full-text search on content
ALTER TABLE documents ADD COLUMN fts_vector tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON documents USING gin(fts_vector);

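-- Hybrid search: RRF (Reciprocal Rank Fusion)
-- $1 = query embedding, $2 = the user's query as plain text
WITH 
vector_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1  -- LIMIT without ORDER BY would keep an arbitrary 20 rows
    LIMIT 20
),
text_results AS (
    -- plainto_tsquery accepts free text; to_tsquery raises a syntax error on unquoted multi-word input
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
    FROM documents, plainto_tsquery('english', $2) query
    WHERE fts_vector @@ query
    ORDER BY ts_rank(fts_vector, query) DESC
    LIMIT 20
)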
SELECT 
    COALESCE(v.id, t.id) AS id,
    COALESCE(v.content, t.content) AS content,
    -- RRF score: 1/(k + rank) — k=60 is standard
    COALESCE(1.0/(60 + v.rank), 0) + COALESCE(1.0/(60 + t.rank), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 5;
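The same fusion step is easy to follow in Python. This sketch takes two already-ranked lists of chunk ids (best first) and applies the 1/(k + rank) formula from the query above; the ids and the `rrf_fuse` name are illustrative.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # An id missing from a ranking simply contributes nothing for it
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

vector_ranking = ["a", "b", "c"]   # order from vector search
text_ranking = ["c", "a", "d"]     # order from full-text search

fused = rrf_fuse([vector_ranking, text_ranking])
print([doc_id for doc_id, _ in fused])
# → ['a', 'c', 'b', 'd']
```

Chunks `a` and `c` appear in both lists and rise to the top. RRF only uses rank positions, so the raw vector distances and ts_rank scores never need to be put on a common scale.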

Reranking — precision over recall

Retrieve 20 chunks with fast vector search, then re-rank with a slower but more accurate cross-encoder model:

# Install: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, conn, initial_k: int = 20, final_k: int = 5) -> list[dict]:
    # Step 1: fast retrieval (recall-optimized)
    candidates = retrieve(query, conn, top_k=initial_k)
    
    # Step 2: slow reranking (precision-optimized)
    pairs = [(query, c["content"]) for c in candidates]
    scores = reranker.predict(pairs)
    
    # Step 3: sort by reranker score, take top final_k
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in ranked[:final_k]]
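The rerank step can be exercised without downloading a model by making the scorer a parameter. In this sketch, `word_overlap` is a deliberately crude stand-in for a cross-encoder and the candidate texts are made up; only the "score every (query, passage) pair, sort, truncate" shape carries over to the real thing.

```python
from typing import Callable

def rerank(query: str,
           candidates: list[dict],
           score_fn: Callable[[str, str], float],
           final_k: int = 5) -> list[dict]:
    scored = [(score_fn(query, c["content"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return [c for _, c in scored[:final_k]]

def word_overlap(query: str, passage: str) -> float:
    # Toy scorer: fraction of query words that also occur in the passage
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

candidates = [
    {"content": "Shipping usually takes three days"},
    {"content": "You can cancel your subscription in account settings"},
    {"content": "Subscription prices changed last year"},
]
top = rerank("cancel my subscription", candidates, word_overlap, final_k=2)
print([c["content"] for c in top])
```

With a real cross-encoder, `score_fn` would wrap the model's prediction for a single (query, passage) pair, or the function could be adapted to score all pairs in one batch as the code above does.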

Evaluating RAG quality

Metric	What it measures	Tool
Retrieval precision	Are retrieved chunks actually relevant?	Manual labels or LLM judge
Retrieval recall	Are all relevant chunks being retrieved?	Manual labels
Answer faithfulness	Is every claim in the answer supported by the context?	RAGAS, LLM judge
Answer relevancy	Does the answer address the question?	RAGAS, LLM judge

import re

# LLM-as-judge for faithfulness
def evaluate_faithfulness(question: str, context: str, answer: str) -> float:
    prompt = f"""Rate whether the ANSWER is faithfully grounded in the CONTEXT.
Score 1-5: 1=completely contradicts context, 5=fully supported by context.
Return only the number.

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}

Score:"""
    
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use cheap model for evals
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply may contain stray text, so extract the first digit 1-5 instead of calling float() on it
    match = re.search(r"[1-5]", response.content[0].text)
    if match is None:
        raise ValueError("judge did not return a score from 1 to 5")
    return float(match.group())
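The two retrieval metrics in the table need only a labeled set: for each test question, the ids of the chunks a human marked as relevant. A minimal sketch with made-up chunk ids follows. Note that conventions differ on whether precision@k divides by k or by the number of results actually returned; this version uses the latter.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    return hits / len(top)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)

retrieved = ["c1", "c7", "c3", "c9", "c4"]   # what the retriever returned, best first
relevant = {"c1", "c3", "c5"}                # what a human labeled as relevant

print(precision_at_k(retrieved, relevant, k=5))  # 2 of the 5 retrieved chunks are relevant
print(recall_at_k(retrieved, relevant, k=5))     # 2 of the 3 relevant chunks were found
```

Averaging these over the whole eval set gives one number to track as chunking, search, or reranking settings change.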

Say it out loud

“The pipeline is: chunk documents at natural boundaries (not fixed character counts), embed with a model that matches your context-length needs, store in pgvector with an HNSW index, retrieve the top 20 with hybrid search (vector + keyword full-text search, fused with RRF), rerank with a cross-encoder to get the top 5, and inject those into a prompt that tells the model to answer from the context only. The #1 quality lever is chunking: most RAG failures trace back to chunks that lack enough context or are too large to be semantically precise. Evaluate faithfulness and relevancy with an LLM judge; improve retrieval by checking precision@5 on a labeled eval set.”
