Retrieval-Augmented Generation (RAG) is one of the most practical techniques for building AI applications that need access to specific knowledge. Instead of fine-tuning a model, you retrieve relevant documents and include them in the prompt.
RAG combines retrieval (finding relevant information) with generation (producing an answer with an LLM). It addresses a key limitation of LLMs: out of the box, they only know what was in their training data.
| Approach | Cost | Freshness | Accuracy | Complexity |
|---|---|---|---|---|
| Prompt only | Very low | Static | Limited | Very low |
| Fine-tuning | High | Snapshot | Good | High |
| RAG | Medium | Real-time | High | Medium |
| RAG + Fine-tune | High | Real-time | Highest | High |
```
┌─────────────────────────────────────────────────┐
│                  RAG Pipeline                   │
│                                                 │
│ ┌──────────┐    ┌──────────┐    ┌───────────┐   │
│ │ Document │───▶│ Chunking │───▶│ Embedding │   │
│ │ Loading  │    │          │    │           │   │
│ └──────────┘    └──────────┘    └─────┬─────┘   │
│                                       │         │
│                                       ▼         │
│                                 ┌───────────┐   │
│                                 │  Vector   │   │
│                                 │ Database  │   │
│                                 └─────┬─────┘   │
│                                       │         │
│ ┌──────────┐    ┌──────────┐    ┌─────┴─────┐   │
│ │   LLM    │◀───│  Prompt  │◀───│ Retrieval │   │
│ │ Response │    │ Assembly │    │           │   │
│ └──────────┘    └──────────┘    └───────────┘   │
└─────────────────────────────────────────────────┘
```
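To make the stages concrete, here is a toy end-to-end sketch of the diagram. The bag-of-words "embedding" and list-backed "vector database" are deliberate stand-ins for illustration only; real pipelines use learned embedding models and a dedicated vector store:

```python
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count (stand-in for a real model)."""
    return Counter(text.lower().split())

def similarity(query_vec: Counter, doc_vec: Counter) -> float:
    """Word-overlap score (stand-in for cosine similarity)."""
    return sum((query_vec & doc_vec).values())

# "Vector database": just a list of (vector, chunk) pairs.
chunks = ["RAG retrieves relevant documents at query time.",
          "Fine-tuning updates the model's weights offline."]
index = [(embed(c), c) for c in chunks]

# Retrieval + prompt assembly.
query = "how does rag use documents"
best_vec, best_chunk = max(index, key=lambda pair: similarity(embed(query), pair[0]))
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"
```

Each box in the diagram maps onto one step here; the rest of this lesson replaces the toy pieces one at a time, starting with document loading.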
Load your source documents into a common format:
```python
from pathlib import Path

def load_documents(directory: str) -> list[dict]:
    """Load text files from a directory."""
    documents = []
    for path in Path(directory).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8")
        documents.append({
            "content": text,
            "metadata": {"source": str(path), "filename": path.name},
        })
    return documents

docs = load_documents("./knowledge_base")
print(f"Loaded {len(docs)} documents")
```
| Format | Library |
|---|---|
| PDF | PyMuPDF, pdfplumber |
| HTML | BeautifulSoup |
| DOCX | python-docx |
| CSV | pandas |
| JSON | Built-in json |
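As one example beyond plain text, here is a sketch of a JSON loader using only the standard library. `load_json_documents` and the `"text"` key are assumptions for illustration; adjust the key to match your files' schema:

```python
import json
from pathlib import Path

def load_json_documents(directory: str) -> list[dict]:
    """Load JSON files, assuming each holds an object with a "text" field."""
    documents = []
    for path in Path(directory).glob("**/*.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        documents.append({
            "content": data.get("text", ""),
            "metadata": {"source": str(path), "filename": path.name},
        })
    return documents
```

Whatever the source format, the goal is the same: normalize everything into the common `{"content": ..., "metadata": ...}` shape before chunking.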
Whole documents are usually too long to embed as a single vector, and retrieval works best over focused passages. Split them into smaller, meaningful chunks.
```python
def chunk_by_tokens(text: str, chunk_size: int = 500,
                    overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by approximate token count."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk; avoid emitting a pure-overlap duplicate
        start = end - overlap  # overlap for context continuity
    return chunks
```
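Fixed-size chunking can cut a sentence in half, which hurts retrieval quality. A sentence-aware variant is a common refinement; this is a sketch using a naive regex split rather than a real sentence tokenizer, and `chunk_by_sentences` is an illustrative name, not a library function:

```python
import re

def chunk_by_sentences(text: str, max_words: int = 500) -> list[str]:
    """Group sentences into chunks of at most max_words words each."""
    # Naive split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks end at sentence boundaries, each one is a self-contained passage, at the cost of slightly uneven chunk sizes.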