Phase 2: Embeddings, Vector Search & RAG

Build a production RAG pipeline: chunking, embedding, pgvector retrieval, reranking, and the quality levers that actually matter.

Priority: must · Difficulty: hard · Time: 40 min · Tags: rag, embeddings, pgvector, vector-search, chunking, reranking, hybrid-search
Why interviewers ask this
RAG is the backbone of most real AI applications. AI engineering interviews routinely expect you to know the full pipeline, its common failure modes, and the techniques for improving quality.

What is RAG and why does it exist?

LLMs have a knowledge cutoff and no access to your private data. RAG (Retrieval-Augmented Generation) solves both:

Without RAG: "What does our Q3 contract with Acme Corp say?"
  → Model: "I don't have access to your contracts."

With RAG:
  1. Embed query → find relevant contract chunks → inject into prompt
  2. Model: "According to the Q3 2024 Acme Corp contract (clause 4.2), ..."

The full pipeline:

Ingestion (offline):
  Documents → Chunk → Embed → Store in vector DB

Query (online):
  User query → Embed → Retrieve top-K chunks → Augment prompt → LLM → Answer
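
To make the two stages concrete, here is a minimal in-memory sketch. The bag-of-words `toy_embed`, the `VOCAB` list, and the `ToyStore` class are hypothetical stand-ins for a real embedding model and vector DB; only the shape of the pipeline carries over to production.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCAB = ["refund", "cancel", "subscription", "invoice", "shipping", "password"]

def toy_embed(text: str) -> list[float]:
    """Count vocabulary words: a crude stand-in for a learned embedding."""
    words = text.lower().replace("?", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class ToyStore:
    """In-memory stand-in for a vector DB."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def ingest(self, chunks: list[str]) -> None:
        # Offline stage: embed each chunk once and store it.
        for chunk in chunks:
            self.rows.append((chunk, toy_embed(chunk)))

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        # Online stage: embed the query, then rank stored chunks by similarity.
        q = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    # Augment stage: retrieved chunks go into the prompt ahead of the question.
    context = "\n---\n".join(chunks)
    return f"<context>\n{context}\n</context>\n\nQuestion: {question}"

store = ToyStore()
store.ingest([
    "To cancel a subscription open Settings and choose Cancel.",
    "Shipping takes five business days.",
    "Reset your password from the login page.",
])
question = "How do I cancel my subscription?"
top = store.retrieve(question, top_k=1)
prompt = build_prompt(question, top)
```

Ingestion runs once per document, while the query path runs on every request, which is why the expensive work (chunking and embedding the corpus) sits on the offline side.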

Embeddings: the math you need to know

An embedding is a fixed-size vector of floats that represents semantic meaning. Texts with similar meaning have vectors that are close in high-dimensional space.

import voyageai

# voyage-3 is served by Voyage AI (the provider Anthropic recommends);
# the Anthropic SDK itself has no embeddings endpoint.
voyage_client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    response = voyage_client.embed(texts, model="voyage-3")
    return response.embeddings

# Two semantically similar sentences
vecs = embed(["How do I cancel my subscription?",
              "What's the process for ending my plan?"])
# Their cosine similarity should be high (close to 1.0); the exact value depends on the model

Cosine similarity measures the angle between vectors (not their magnitude):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Returns 1.0 = identical direction, 0.0 = orthogonal, -1.0 = opposite
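
A small worked example helps pin down the three reference values and the "not magnitude" point. This is a dependency-free sketch using the standard library only; the vectors are made up for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v: list[float]) -> list[float]:
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# A scaled copy points the same way, so magnitude does not change the score.
same_direction = cosine([1.0, 2.0], [10.0, 20.0])
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

# For unit-length vectors, the plain dot product equals the cosine similarity.
u, v = normalize([3.0, 4.0]), normalize([4.0, 3.0])
dot_of_units = sum(x * y for x, y in zip(u, v))
```

The last two lines are why many systems normalize embeddings once at ingestion time: after that, a dot product gives the same ranking as cosine similarity.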

Common embedding models:

Model                             Dimension   Context      Good for
text-embedding-3-large (OpenAI)   3072        8K tokens    High accuracy
text-embedding-3-small (OpenAI)   1536        8K tokens    Cost/quality balance
voyage-3 (Voyage AI)              1024        32K tokens   Long documents
nomic-embed-text (open source)    768         8K tokens    Self-hosted
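
Dimension also drives storage cost. The sketch below is a back-of-the-envelope estimate, assuming 4-byte floats per dimension (the way pgvector's `vector` type stores values) and ignoring per-row overhead and index size. The one-million-chunk corpus is a made-up figure.

```python
def raw_vector_bytes(num_chunks: int, dimension: int, bytes_per_float: int = 4) -> int:
    """Raw embedding payload only; ignores per-row overhead and index size."""
    return num_chunks * dimension * bytes_per_float

# Dimensions taken from the table above.
DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "nomic-embed-text": 768,
}

for model, dim in DIMENSIONS.items():
    gib = raw_vector_bytes(1_000_000, dim) / 2**30
    print(f"{model}: {gib:.2f} GiB per million chunks")
```

Under these assumptions a 3072-dimension model needs four times the raw storage of a 768-dimension one for the same corpus, which is worth weighing against any accuracy gain.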

Chunking: the most important RAG decision

Why it matters: if your chunks are too small, they lose context. Too large, they dilute relevance. Bad chunking is the #1 cause of RAG failures.

Fixed-size chunking (baseline, rarely best)

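def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # size and overlap are counted in characters, not tokens
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break  # the window reached the end; a further chunk would only repeat the overlap
    return chunks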

Sentence-boundary chunking (better)

import re

def chunk_by_sentences(text: str, max_chars: int = 1000) -> list[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    
    for sentence in sentences:
        if len(current) + len(sentence) > max_chars and current:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    
    if current.strip():  # skip a whitespace-only remainder (e.g. empty input)
        chunks.append(current.strip())
    return chunks

Semantic chunking (best for complex documents)

Split where the topic changes, not at fixed character counts:

# Aliased so it does not shadow the two-vector cosine_similarity helper defined earlier
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

def chunk_semantic(sentences: list[str], threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    embeddings = embed(sentences)
    chunks, current_chunk = [], [sentences[0]]
    
    for i in range(1, len(sentences)):
        similarity = pairwise_cosine([embeddings[i-1]], [embeddings[i]])[0][0]
        if similarity < threshold:  # Topic shift detected
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
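
The split rule is easier to verify when the embeddings are passed in rather than fetched from an API. The sketch below is a variant under that assumption: `split_on_topic_shift` is a hypothetical name, and the 2-D vectors are hand-made so the topic shift is obvious.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def split_on_topic_shift(sentences: list[str], vectors: list[list[float]],
                         threshold: float = 0.7) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))  # close the chunk at the topic shift
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Refunds are issued within 5 days.",
    "Refunds go back to the original card.",
    "Our office is in Berlin.",
]
# Hand-made vectors: the first two point nearly the same way, the third does not.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
chunks = split_on_topic_shift(sentences, vectors)
```

Separating the embedding call from the splitting logic also makes the threshold cheap to tune, since the vectors can be cached and the split re-run with different values.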

Practical rules:

  • Chunk size ≈ 512–1000 tokens (longer = less precise retrieval)
  • Include document title / section header in each chunk (adds context)
  • Overlap 10–15% of chunk size to avoid splitting mid-thought
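
The second and third rules can be sketched together. This is a minimal illustration, not a recommended implementation: `chunk_with_context` and its header format are hypothetical, and sizes are counted in characters for simplicity (a tokenizer-based version would count tokens instead).

```python
def chunk_with_context(text: str, title: str, section: str,
                       size: int = 800, overlap_ratio: float = 0.125) -> list[str]:
    """Character windows with proportional overlap, each prefixed by a header."""
    if size < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = int(size * overlap_ratio)   # e.g. 12.5% of 800 = 100 characters
    step = size - overlap
    header = f"[{title} > {section}]\n"   # hypothetical header format
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(header + text[start:start + size])
        if start + size >= len(text):
            break  # the window reached the end of the text
    return chunks

sample_text = "".join(str(i % 10) for i in range(2000))
sample_chunks = chunk_with_context(sample_text, "Handbook", "Refunds")
```

Because the header is embedded along with the body, a chunk that only says "this is non-refundable after 30 days" still carries the document and section it belongs to.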

pgvector: vector storage in Postgres

You already run Postgres. pgvector adds a vector column type and similarity search operators.

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE documents (
  id           BIGSERIAL PRIMARY KEY,
  content      TEXT NOT NULL,
  source_url   TEXT,
  section      TEXT,
  metadata     JSONB,
  embedding    vector(1024),     -- must match your model's dimension (1024 for voyage-3, 1536 for text-embedding-3-small)
  created_at   TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
-- HNSW usually gives a better query-time speed/recall trade-off than IVFFlat, but builds more slowly and uses more memory; both are approximate

-- Cosine similarity search: find top 5 most similar chunks
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1  -- <=> = cosine distance
LIMIT 5;
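
The `<=>` operator returns cosine distance, that is, 1 minus cosine similarity, which is why the query computes `1 - (embedding <=> $1)`. The sketch below performs the same ranking exactly in plain Python by scanning every row; an HNSW index approximates this result without the full scan. The `rows` data is made up.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def exact_top_k(query: list[float], rows: list[tuple[int, list[float]]],
                k: int = 5) -> list[tuple[int, float]]:
    """Full-scan equivalent of ORDER BY embedding <=> query LIMIT k."""
    scored = [(row_id, cosine_distance(vec, query)) for row_id, vec in rows]
    scored.sort(key=lambda item: item[1])  # smallest distance first
    return [(row_id, 1.0 - dist) for row_id, dist in scored[:k]]  # (id, similarity)

rows = [(1, [1.0, 0.0]), (2, [0.7, 0.7]), (3, [0.0, 1.0]), (4, [-1.0, 0.0])]
results = exact_top_k([1.0, 0.0], rows, k=2)
```

A full scan like this is a useful ground truth when measuring how much recall an approximate index gives up.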

Python ingestion:

import psycopg2
import json

def ingest_chunks(chunks: list[str], source_url: str, conn):
    embeddings = embed(chunks)  # batch embed: much cheaper than one call per chunk
    
    with conn.cursor() as cur:
        for chunk, vec in zip(chunks, embeddings):
            cur.execute(
                """INSERT INTO documents (content, source_url, embedding)
                   VALUES (%s, %s, %s)""",
                (chunk, source_url, json.dumps(vec))
            )
    conn.commit()
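
For larger documents, the ingestion step usually needs two small helpers: one to split chunks into API-sized batches and one to build insert rows. The sketch below is a database-free illustration; `batched`, `to_vector_literal`, and `build_rows` are hypothetical helper names, and the batch size is an assumption to adjust to your provider's request limits.

```python
import json
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices so each embedding request stays within limits."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def to_vector_literal(vec: list[float]) -> str:
    """Text form passed to the vector column, matching json.dumps in the ingestion code."""
    return json.dumps(vec)

def build_rows(chunks: list[str], vectors: list[list[float]],
               source_url: str) -> list[tuple[str, str, str]]:
    """One (content, source_url, embedding) tuple per chunk, ready for an INSERT."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding")
    return [(chunk, source_url, to_vector_literal(vec))
            for chunk, vec in zip(chunks, vectors)]

example_batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
example_rows = build_rows(["hello"], [[0.1, 0.2]], "https://example.com/doc")
```

The resulting rows can be passed to psycopg2's `cursor.executemany` with the same INSERT statement, which keeps the SQL in one place while the loop handles batching.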

Full RAG pipeline

import anthropic

client = anthropic.Anthropic()

def retrieve(query: str, conn, top_k: int = 5) -> list[dict]:
    query_vec = embed([query])[0]
    
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, source_url, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > 0.7  -- similarity threshold
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_vec, query_vec, query_vec, top_k))
        
        return [
            {"content": row[0], "source": row[1], "similarity": row[2]}
            for row in cur.fetchall()
        ]

def rag_answer(user_question: str, conn) -> str:
    # 1. Retrieve
    chunks = retrieve(user_question, conn)
    
    if not chunks:
        return "I couldn't find relevant information to answer your question."
    
    # 2. Format context
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['content']}" for c in chunks
    )
    
    # 3. Augment prompt
    prompt = f"""Answer the user's question based only on the provided context.
If the context doesn't contain enough information, say so clearly.

<context>
{context}
</context>

Question: {user_question}"""
    
    # 4. Generate
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant. Answer based on the provided context only.",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
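
The context-formatting step is where prompt size gets out of hand if many long chunks are retrieved. Here is a hedged sketch of one way to cap it: `format_context` is a hypothetical helper, the numbered `[n]` labels are an illustrative convention for citations, and the budget is in characters rather than tokens.

```python
def format_context(chunks: list[dict], max_chars: int = 6000) -> str:
    """Join chunks in rank order, stopping before the character budget is exceeded."""
    parts: list[str] = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] Source: {chunk['source']}\n{chunk['content']}"
        if used + len(block) > max_chars and parts:
            break  # keep at least one chunk even if it alone exceeds the budget
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)

sample = [
    {"source": "a.md", "content": "x" * 50},
    {"source": "b.md", "content": "y" * 50},
]
tight = format_context(sample, max_chars=80)    # room for the first chunk only
roomy = format_context(sample, max_chars=1000)  # both chunks fit
```

Since chunks arrive sorted by similarity, truncating from the end drops the least relevant material first.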

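Hybrid search

Pure vector search misses exact keyword matches ("invoice #INV-2847"). Keyword full-text search (BM25 and similar lexical ranking) misses semantic similarity. Hybrid search combines both. The SQL in this section uses Postgres's built-in ts_rank, which is not BM25 but plays the same lexical role.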

-- Enable full-text search on content
ALTER TABLE documents ADD COLUMN fts_vector tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON documents USING gin(fts_vector);

-- Hybrid search: RRF (Reciprocal Rank Fusion)
WITH 
vector_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1  -- without this, LIMIT keeps an arbitrary 20 rows
    LIMIT 20
),
text_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
    FROM documents, plainto_tsquery('english', $2) query  -- plainto_tsquery accepts raw user text
    WHERE fts_vector @@ query
    ORDER BY ts_rank(fts_vector, query) DESC
    LIMIT 20
)
SELECT 
    COALESCE(v.id, t.id) AS id,
    COALESCE(v.content, t.content) AS content,
    -- RRF score: 1/(k + rank); k=60 is the commonly used constant
    COALESCE(1.0/(60 + v.rank), 0) + COALESCE(1.0/(60 + t.rank), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 5;
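
The fusion step itself is small enough to express outside SQL, which is handy when the two result lists come from different systems. Below is a minimal sketch of Reciprocal Rank Fusion over ranked ID lists; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Sum 1/(k + rank) per document across all rankings; rank starts at 1."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_ranking = [101, 102, 103]   # best-first IDs from vector search
keyword_ranking = [103, 101, 104]  # best-first IDs from keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only rank positions, so it needs no score normalization between the vector and keyword result sets. Documents that appear in both lists (101 and 103 here) rise above those that appear in only one.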

Reranking: precision over recall

Retrieve 20 chunks with fast vector search, then re-rank with a slower but more accurate cross-encoder model:

# Install: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, conn, initial_k: int = 20, final_k: int = 5) -> list[dict]:
    # Step 1: fast retrieval (recall-optimized)
    candidates = retrieve(query, conn, top_k=initial_k)
    if not candidates:
        return []  # nothing to rerank
    
    # Step 2: slow reranking (precision-optimized)
    pairs = [(query, c["content"]) for c in candidates]
    scores = reranker.predict(pairs)
    
    # Step 3: sort by reranker score, take top final_k
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [c for _, c in ranked[:final_k]]
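
To unit-test the reranking logic without downloading a model, the scorer can be passed in as a function. This is a sketch under that assumption: `rerank` and `word_overlap_scorer` are hypothetical names, and the word-overlap scorer is a toy stand-in, not a cross-encoder.

```python
from typing import Callable, Sequence

PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]

def rerank(query: str, candidates: list[dict], score_pairs: PairScorer,
           final_k: int = 5) -> list[dict]:
    """Score (query, passage) pairs and keep the final_k best candidates."""
    if not candidates:
        return []
    scores = score_pairs([(query, c["content"]) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:final_k]]

def word_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer: fraction of query words present in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        p_words = set(passage.lower().split())
        scores.append(len(q_words & p_words) / len(q_words) if q_words else 0.0)
    return scores

candidates = [
    {"content": "shipping times for europe"},
    {"content": "how to reset a forgotten password"},
    {"content": "password rules and length"},
]
top = rerank("reset password", candidates, word_overlap_scorer, final_k=2)
```

In production the same function could be called with `reranker.predict` as `score_pairs`, since it also takes a list of pairs and returns one score per pair.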

Evaluating RAG quality

Metric                What it measures                          Tool
Retrieval precision   Are retrieved chunks actually relevant?   Manual labels or LLM judge
Retrieval recall      Are all relevant chunks being retrieved?  Manual labels
Answer faithfulness   Does the answer contradict the context?   RAGAS, LLM judge
Answer relevancy      Does the answer address the question?     RAGAS, LLM judge

# LLM-as-judge for faithfulness
def evaluate_faithfulness(question: str, context: str, answer: str) -> float:
    prompt = f"""Rate whether the ANSWER is faithfully grounded in the CONTEXT.
Score 1-5: 1=completely contradicts context, 5=fully supported by context.
Return only the number.

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}

Score:"""
    
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use cheap model for evals
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.content[0].text.strip())
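
The two retrieval metrics need no LLM at all once you have labels. Below is a minimal sketch of precision@k and recall@k over a labeled eval set; the chunk IDs and the `eval_set` structure are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved chunks that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical labeled set: retrieved IDs per question plus human relevance labels.
eval_set = [
    {"retrieved": ["c1", "c7", "c3", "c9", "c2"], "relevant": {"c1", "c3", "c4"}},
    {"retrieved": ["c5", "c6", "c8", "c0", "c2"], "relevant": {"c5"}},
]
mean_precision = sum(precision_at_k(e["retrieved"], e["relevant"]) for e in eval_set) / len(eval_set)
mean_recall = sum(recall_at_k(e["retrieved"], e["relevant"]) for e in eval_set) / len(eval_set)
```

Tracking both numbers matters because they move in opposite directions: raising top-K tends to lift recall while lowering precision, and a reranker is one way to win some precision back.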

Say it out loud
"The pipeline is: chunk documents at natural boundaries (not fixed character counts), embed with a model that matches your context-length needs, store in pgvector with an HNSW index, retrieve the top 20 with hybrid search (vector + BM25 via RRF), rerank with a cross-encoder to get the top 5, and inject them into a prompt that tells the model to answer from the context only. The #1 quality lever is chunking: most RAG failures trace back to chunks that lack enough context or are too large to be semantically precise. Evaluate faithfulness and relevancy with an LLM judge; improve retrieval by checking precision@5 on a labeled eval set."

Likely follow-up questions
  • What is an embedding and what does cosine similarity measure?
  • How do you choose chunk size?
  • What is hybrid search?
  • How do you evaluate RAG quality?
  • When does RAG fail and what do you do?
