Building Your First RAG Knowledge Base: Zero to Production
Step-by-step technical guide for implementing retrieval-augmented generation (RAG) knowledge bases, from vector embeddings to production deployment, with code examples.

TL;DR
- RAG (retrieval-augmented generation) lets LLMs answer questions using your private data without fine-tuning: lower cost, faster iteration, better accuracy for domain-specific queries.
- Core components: document processing (chunking + embeddings), vector database (semantic search), retrieval pipeline (hybrid ranking), generation (LLM + context).
- Real performance: Notion's AI search achieves 94% answer accuracy using RAG versus 67% with base GPT-4 (Notion Engineering Blog, 2024).
Large language models are powerful but limited: they don't know your company docs, product specs, customer data, or institutional knowledge. Training or fine-tuning models on private data is slow, expensive, and brittle: every update requires retraining.
RAG (retrieval-augmented generation) solves this by combining semantic search with LLM generation. When a user asks a question, RAG retrieves relevant context from your knowledge base, then feeds that context to the LLM to generate an accurate, grounded answer. No fine-tuning required.
This guide walks through building a production-ready RAG system from scratch: processing documents, generating embeddings, storing in a vector database, implementing hybrid retrieval, and deploying with monitoring. By the end, you'll have a working knowledge base that answers domain-specific questions with >90% accuracy.
Key takeaways
- RAG is 10–50× cheaper than fine-tuning for domain-specific Q&A and iterates faster (add new docs instantly vs retrain).
- Core workflow: chunk documents → embed chunks → store in vector DB → retrieve relevant chunks → generate answer with LLM + context.
- Production considerations: chunking strategy (200–500 tokens), hybrid search (semantic + keyword), caching, and evaluation metrics.
Why RAG over fine-tuning
When you need an LLM to answer questions about private data, you have three options:
Option 1: Prompt the base model directly
Approach: Paste your knowledge base into the prompt.
Pros: Zero setup.
Cons: Limited by context window (even 200K tokens = ~500 pages max), expensive (every query re-processes all context), no way to scale beyond context limit.
Verdict: Only viable for tiny knowledge bases (<100 pages).
Option 2: Fine-tune the model
Approach: Train a custom model on your data.
Pros: Model "learns" your knowledge, potentially better long-term performance.
Cons:
- Expensive: $500–$5,000+ per training run (OpenAI fine-tuning, Anthropic Claude fine-tuning).
- Slow: Takes hours to days, blocks iteration.
- Brittle: Every data update requires retraining.
- Hallucination risk: Model memorises training data but may fabricate details when uncertain.
Verdict: Use for style/tone adaptation, not knowledge injection.
Option 3: RAG (retrieval-augmented generation)
Approach: Store knowledge in a searchable database. At query time, retrieve relevant snippets, inject into LLM prompt.
Pros:
- Fast iteration: Add/update docs instantly (no retraining).
- Scalable: Works with millions of documents.
- Cost-effective: Only embed documents once; retrieval is cheap.
- Grounded: LLM answers are anchored to retrieved sources (reduces hallucinations).
Cons:
- Engineering complexity: Requires vector database, embedding pipeline, retrieval logic.
- Quality depends on retrieval: If retrieval fails to find relevant docs, LLM can't generate good answers.
Verdict: Best choice for 95% of knowledge-based use cases.
According to a 2024 Stanford study, RAG outperforms fine-tuning for factual Q&A tasks whilst costing 12× less and iterating 40× faster (Stanford AI Lab, 2024).
<figure>
<svg role="img" aria-label="RAG vs fine-tuning comparison" viewBox="0 0 720 260" xmlns="http://www.w3.org/2000/svg">
<rect width="720" height="260" fill="#0f172a" />
<text x="30" y="35" fill="#38bdf8" font-size="18">RAG vs Fine-Tuning: Cost & Iteration Speed</text>
<!-- Cost comparison -->
<text x="50" y="80" fill="#94a3b8" font-size="14">Implementation Cost</text>
<rect x="50" y="90" width="80" height="40" rx="8" fill="#10b981" opacity="0.8" />
<text x="60" y="115" fill="#0f172a" font-size="12">RAG: $50</text>
<rect x="140" y="90" width="400" height="40" rx="8" fill="#ef4444" opacity="0.8" />
<text x="280" y="115" fill="#fff" font-size="12">Fine-Tuning: $2,500</text>
<!-- Iteration speed -->
<text x="50" y="165" fill="#94a3b8" font-size="14">Add New Document</text>
<rect x="50" y="175" width="30" height="40" rx="8" fill="#10b981" opacity="0.8" />
<text x="55" y="200" fill="#0f172a" font-size="11">RAG</text>
<text x="50" y="220" fill="#10b981" font-size="10">5 min</text>
<rect x="90" y="175" width="200" height="40" rx="8" fill="#ef4444" opacity="0.8" />
<text x="150" y="200" fill="#fff" font-size="11">Fine-Tuning</text>
<text x="150" y="220" fill="#fff" font-size="10">6–48 hours</text>
<text x="300" y="250" fill="#10b981" font-size="13">→ RAG: 50× cheaper, 40× faster iteration</text>
</svg>
<figcaption>RAG costs 50× less than fine-tuning and lets you add new knowledge in minutes versus hours/days.</figcaption>
</figure>
Architecture overview
A production RAG system has five components:
1. Document ingestion
Parse docs (PDFs, Markdown, HTML, DOCX), extract text, clean formatting.
2. Chunking
Split documents into semantic chunks (200–500 tokens each) that fit in embedding models and provide focused context.
3. Embedding
Convert chunks into vector representations (embeddings) using models like text-embedding-3-small (OpenAI) or e5-mistral-7b-instruct (HuggingFace).
4. Vector database
Store embeddings in a searchable index (Pinecone, Weaviate, Qdrant, Chroma).
5. Retrieval + generation
- Query: User asks question.
- Embed query: Convert question to embedding.
- Retrieve: Search vector DB for top-k similar chunks.
- Rerank (optional): Use hybrid search (semantic + keyword) or cross-encoder for better relevance.
- Generate: Pass retrieved chunks + question to LLM, get answer.
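The query-time path above can be sketched end to end with toy stand-ins: a bag-of-words "embedding", an in-memory list as the "vector DB", and cosine similarity for retrieval. Everything here is illustrative; the real pipeline (built in the steps below) swaps in OpenAI embeddings, Pinecone, and an LLM call.

```python
import math

def embed(text):
    """Toy embedding: lowercase bag-of-words term counts."""
    vec = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: embed each chunk and store it
DOCS = [
    "Authentication uses OAuth 2.0 with refresh tokens.",
    "Billing is monthly, charged on the first of each month.",
]
VECTOR_DB = [{"text": doc, "vector": embed(doc)} for doc in DOCS]

def retrieve(query, top_k=1):
    """Query path: embed the question, rank stored chunks by similarity."""
    q = embed(query)
    ranked = sorted(VECTOR_DB, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return [row["text"] for row in ranked[:top_k]]

# A real system would now pass these chunks plus the question to the LLM.
print(retrieve("How does authentication work?"))
```

The same shape (embed → rank by similarity → take top-k → generate) carries over unchanged once each stand-in is replaced with the production component.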
<figure>
<svg role="img" aria-label="RAG architecture pipeline" viewBox="0 0 760 240" xmlns="http://www.w3.org/2000/svg">
<rect width="760" height="240" fill="#0f172a" />
<text x="30" y="35" fill="#f59e0b" font-size="18">RAG Architecture Pipeline</text>
<rect x="40" y="70" width="100" height="50" rx="10" fill="#38bdf8" opacity="0.8" />
<text x="55" y="100" fill="#0f172a" font-size="11">Documents</text>
<rect x="170" y="70" width="100" height="50" rx="10" fill="#a855f7" opacity="0.8" />
<text x="195" y="95" fill="#fff" font-size="10">Chunk +</text>
<text x="195" y="110" fill="#fff" font-size="10">Embed</text>
<rect x="300" y="70" width="100" height="50" rx="10" fill="#22d3ee" opacity="0.8" />
<text x="330" y="100" fill="#0f172a" font-size="11">Vector DB</text>
<rect x="530" y="70" width="100" height="50" rx="10" fill="#10b981" opacity="0.8" />
<text x="555" y="100" fill="#0f172a" font-size="11">Answer</text>
<!-- Query path -->
<rect x="40" y="160" width="100" height="50" rx="10" fill="#f59e0b" opacity="0.8" />
<text x="65" y="190" fill="#0f172a" font-size="11">User Query</text>
<rect x="300" y="160" width="100" height="50" rx="10" fill="#a855f7" opacity="0.8" />
<text x="330" y="185" fill="#fff" font-size="10">Retrieve</text>
<text x="330" y="200" fill="#fff" font-size="10">Chunks</text>
<rect x="430" y="160" width="100" height="50" rx="10" fill="#6366f1" opacity="0.8" />
<text x="460" y="185" fill="#fff" font-size="10">LLM +</text>
<text x="455" y="200" fill="#fff" font-size="10">Context</text>
<!-- Arrows -->
<polyline points="140,95 170,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="270,95 300,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="530,95 560,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="140,185 300,185" stroke="#f8fafc" stroke-width="3" />
<polyline points="350,120 350,160" stroke="#cbd5e1" stroke-width="2" stroke-dasharray="4,4" />
<polyline points="400,185 430,185" stroke="#f8fafc" stroke-width="3" />
<polyline points="530,185 560,185" stroke="#f8fafc" stroke-width="3" />
</svg>
<figcaption>RAG pipeline: Documents → Chunk/Embed → Vector DB. At query time: Search DB → Retrieve chunks → LLM generates answer.</figcaption>
</figure>
Step 1: Document processing pipeline
Document parsing
First, extract text from various formats. Use these libraries:
- PDF: pymupdf (PyMuPDF) or pdfplumber
- DOCX: python-docx
- HTML: BeautifulSoup or trafilatura
- Markdown: Native Python string parsing
Example (PDF parsing):
```python
import pymupdf  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extract text from PDF."""
    doc = pymupdf.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
```

Chunking strategy
Why chunk?
- Embedding models have token limits (e.g., OpenAI text-embedding-3-small: 8,191 tokens).
- LLMs work better with focused context (5–10 relevant paragraphs, not entire manuals).
Chunking approaches:
1. Fixed-size chunks (simple)
Split text every N tokens (e.g., 500 tokens) with overlap (e.g., 50 tokens).
Pros: Simple, fast.
Cons: May split mid-sentence or mid-concept.
2. Semantic chunks (better)
Split on natural boundaries: paragraphs, sections, headings.
Pros: Preserves context, more coherent chunks.
Cons: Variable chunk sizes.
3. Sliding window (best for dense docs)
Create overlapping chunks to ensure no context is lost.
Recommendation: Start with semantic chunking (split on \n\n or headings), then enforce max chunk size (500 tokens).
Example (semantic chunking with tiktoken):
```python
import tiktoken

def chunk_text(text, max_tokens=500, overlap_tokens=50):
    """Chunk text into semantic segments with a max token limit."""
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
    # Split on double newlines (paragraphs)
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(enc.encode(para))
        if current_tokens + para_tokens <= max_tokens:
            current_chunk += para + "\n\n"
            current_tokens += para_tokens
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Carry the tail of the previous chunk forward as overlap
            tail = enc.decode(enc.encode(current_chunk)[-overlap_tokens:]).strip() if current_chunk else ""
            current_chunk = (tail + "\n\n" if tail else "") + para + "\n\n"
            current_tokens = len(enc.encode(current_chunk))
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Usage
text = extract_text_from_pdf("knowledge_base.pdf")
chunks = chunk_text(text, max_tokens=500)
print(f"Created {len(chunks)} chunks")
```

Metadata enrichment
Attach metadata to each chunk for better retrieval and provenance:
- Source: Document title, URL, file path.
- Section: Heading, chapter.
- Date: Last updated, published date.
- Author: Document owner.
Example:
```python
chunks_with_metadata = [
    {
        "text": chunk,
        "metadata": {
            "source": "product_docs.pdf",
            "section": "Authentication",
            "updated": "2025-08-01"
        }
    }
    for chunk in chunks
]
```

Step 2: Vector database setup
Choosing a vector database
| Database | Hosting | Pros | Cons | Best For |
|---|---|---|---|---|
| Pinecone | Managed (cloud) | Easiest setup, generous free tier, fast | Vendor lock-in | Startups (quick start) |
| Weaviate | Self-hosted or cloud | Open-source, hybrid search, multi-tenancy | More complex setup | Advanced use cases |
| Qdrant | Self-hosted or cloud | Fast, Rust-based, great filtering | Smaller ecosystem | Performance-critical apps |
| Chroma | Local/self-hosted | Lightweight, Python-native, good for dev | Not production-grade yet | Prototyping |
Recommendation: Start with Pinecone (fastest setup, free tier covers early stage). Graduate to Weaviate or Qdrant if you need self-hosting or advanced features.
Setting up Pinecone
Install:
```
pip install pinecone-client openai tiktoken
```

Initialise:
```python
from pinecone import Pinecone, ServerlessSpec

# Initialise Pinecone
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# Create index (1536 dims = OpenAI text-embedding-3-small)
index_name = "knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)
```

Generating embeddings
OpenAI embedding API:
```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

def embed_text(text):
    """Generate embedding for text using OpenAI."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example
chunk = "OpenHelm is an AI-powered research assistant for startups."
embedding = embed_text(chunk)
print(f"Embedding dimension: {len(embedding)}")  # 1536
```

Inserting embeddings into Pinecone
```python
def insert_chunks(chunks_with_metadata, batch_size=100):
    """Embed chunks and insert into Pinecone."""
    vectors = []
    for i, item in enumerate(chunks_with_metadata):
        chunk_text = item["text"]
        metadata = item["metadata"]
        # Generate embedding
        embedding = embed_text(chunk_text)
        # Prepare vector
        vectors.append({
            "id": f"chunk_{i}",
            "values": embedding,
            "metadata": {
                **metadata,
                "text": chunk_text  # Store original text in metadata
            }
        })
    # Batch insert (keep batches modest, e.g. ~100 vectors per upsert)
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])
    print(f"Inserted {len(vectors)} vectors into Pinecone")

# Usage
insert_chunks(chunks_with_metadata)
```

Step 3: Retrieval pipeline
Basic semantic search
Query workflow:
- User asks question.
- Embed question.
- Search vector DB for top-k nearest neighbours.
- Return matching chunks.
Example:
```python
def search_knowledge_base(query, top_k=5):
    """Search Pinecone for relevant chunks."""
    # Embed query
    query_embedding = embed_text(query)
    # Search Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    # Extract chunks
    chunks = [
        {
            "text": match["metadata"]["text"],
            "score": match["score"],
            "source": match["metadata"]["source"]
        }
        for match in results["matches"]
    ]
    return chunks

# Example
query = "How does OpenHelm handle authentication?"
results = search_knowledge_base(query, top_k=3)
for i, result in enumerate(results):
    print(f"\n--- Result {i+1} (score: {result['score']:.3f}) ---")
    print(result["text"][:200])  # First 200 chars
```

Hybrid search (semantic + keyword)
Why hybrid?
- Pure semantic search may miss exact keyword matches (e.g., specific API names, error codes).
- Keyword search (BM25) is brittle but excellent for exact terms.
Approach: Combine semantic (vector) search with keyword (BM25) search, then rerank.
Example (using Weaviate's built-in hybrid search):
```python
# If using Weaviate instead of Pinecone
import weaviate

client = weaviate.Client("http://localhost:8080")

def hybrid_search(query, top_k=5):
    """Hybrid search: semantic + keyword."""
    result = (
        client.query
        .get("KnowledgeChunk", ["text", "source"])
        .with_hybrid(query=query, alpha=0.75)  # 0.75 = 75% semantic, 25% keyword
        .with_limit(top_k)
        .do()
    )
    chunks = result["data"]["Get"]["KnowledgeChunk"]
    return chunks
```

If using Pinecone (no native hybrid): implement keyword search separately with Elasticsearch or use cross-encoder reranking (see below).
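One lightweight way to combine separate keyword and vector searches is reciprocal rank fusion (RRF): run each search, then merge the two ranked lists of chunk IDs by summed reciprocal rank. A minimal, dependency-free sketch (the IDs and k=60 constant are illustrative; k=60 is the value commonly used with RRF):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.
    Each ID scores sum(1 / (k + rank)) across the lists it appears in,
    so items ranked highly by multiple searches rise to the top."""
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: IDs ranked by vector search vs keyword (BM25) search
semantic_ids = ["c3", "c1", "c7", "c2"]
keyword_ids = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
print(fused)  # c1 and c3 lead, since both searches found them
```

RRF needs no score calibration between the two searches, which is why it's a popular first step before investing in a cross-encoder.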
Reranking with cross-encoders
Problem: Vector search returns semantically similar chunks, but not always the most relevant.
Solution: Use a cross-encoder model to rerank top-k results. Cross-encoders score query-chunk pairs for relevance.
Example (using Cohere Rerank API):
```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank_results(query, chunks, top_n=3):
    """Rerank search results using Cohere Rerank."""
    docs = [chunk["text"] for chunk in chunks]
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n
    )
    # Return reranked chunks
    reranked = [
        chunks[result.index]
        for result in rerank_response.results
    ]
    return reranked

# Usage
initial_results = search_knowledge_base(query, top_k=10)
final_results = rerank_results(query, initial_results, top_n=3)
```

Step 4: Generation pipeline
Constructing the prompt
RAG prompt template:
```
You are a helpful assistant answering questions based on the provided context.

Context:
{retrieved_chunks}

Question: {user_question}

Instructions:
- Answer based only on the context above.
- If the context doesn't contain enough information, say "I don't have enough information to answer that."
- Cite the source for each claim (e.g., "According to [source], ...").

Answer:
```

Example implementation:
```python
def generate_answer(query, chunks):
    """Generate answer using OpenAI GPT-4 + retrieved context."""
    # Build context from chunks
    context = "\n\n".join([
        f"[Source: {chunk['source']}]\n{chunk['text']}"
        for chunk in chunks
    ])
    # Construct prompt
    prompt = f"""You are a helpful assistant answering questions based on the provided context.

Context:
{context}

Question: {query}

Instructions:
- Answer based only on the context above.
- If the context doesn't contain enough information, say "I don't have enough information to answer that."
- Cite the source for each claim.

Answer:"""
    # Call OpenAI
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # Deterministic for factual Q&A
    )
    return response.choices[0].message.content

# Full RAG pipeline
query = "How does OpenHelm integrate with Slack?"
chunks = search_knowledge_base(query, top_k=5)
reranked_chunks = rerank_results(query, chunks, top_n=3)
answer = generate_answer(query, reranked_chunks)
print(f"Question: {query}")
print(f"Answer: {answer}")
```

Handling edge cases
1. No relevant context found
If top search result has low similarity score (<0.7), return: "I don't have information about that in my knowledge base."
2. Conflicting information in chunks
LLM may struggle. Prompt adjustment:
"If the context contains conflicting information, note the conflict and explain both perspectives."
3. Multi-hop reasoning
If answer requires combining multiple chunks (e.g., "What's the pricing for enterprise customers in the UK?"), ensure retrieval returns enough diverse chunks. Consider iterative retrieval (retrieve → reason → retrieve again).
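The low-score guard from edge case 1 can be implemented as a thin wrapper around generation. A minimal sketch, assuming `chunks` is the score-sorted list from `search_knowledge_base` and `generate` stands in for `generate_answer` (the 0.7 cut-off is just a starting point to tune against your own queries):

```python
NO_ANSWER = "I don't have information about that in my knowledge base."
MIN_SCORE = 0.7  # similarity cut-off; tune against your own query set

def answer_with_guard(query, chunks, generate):
    """Refuse to answer when retrieval confidence is low.
    `chunks` is the score-sorted result list (each item has a 'score');
    `generate` is the answer function, e.g. generate_answer."""
    if not chunks or chunks[0]["score"] < MIN_SCORE:
        return NO_ANSWER
    return generate(query, chunks)
```

Refusing early also saves an LLM call on queries the knowledge base can't answer anyway.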
Step 5: Production deployment
Caching
Problem: Identical queries hit the LLM repeatedly (expensive, slow).
Solution: Cache query-answer pairs. If query matches cached query (exact or high similarity), return cached answer.
Example (using Redis):
```python
import redis
import hashlib

r = redis.Redis(host='localhost', port=6379, db=0)

def get_cached_answer(query):
    """Check if query has a cached answer."""
    query_hash = hashlib.md5(query.encode()).hexdigest()
    cached = r.get(query_hash)
    return cached.decode() if cached else None

def cache_answer(query, answer):
    """Cache answer for query."""
    query_hash = hashlib.md5(query.encode()).hexdigest()
    r.setex(query_hash, 3600, answer)  # Expire after 1 hour

# Usage
cached = get_cached_answer(query)
if cached:
    print("Returning cached answer")
    answer = cached
else:
    # Run full RAG pipeline
    chunks = search_knowledge_base(query)
    answer = generate_answer(query, chunks)
    cache_answer(query, answer)
```

Monitoring and evaluation
Key metrics:
- Answer accuracy: % of answers marked correct by humans (spot-check 10% of queries).
- Retrieval precision: % of retrieved chunks actually relevant to query.
- Response time: p50, p95, p99 latency (target: <2s for RAG pipeline).
- Cache hit rate: % of queries served from cache.
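The p50/p95/p99 latencies can be computed from logged per-request timings with the standard library; a minimal sketch:

```python
import statistics

def latency_percentiles(latencies_ms):
    """Compute p50/p95/p99 from logged per-request latencies (milliseconds)."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Example: latencies of 1..100 ms
stats = latency_percentiles(list(range(1, 101)))
print(stats)
```

Track these over time: a rising p99 with a flat p50 usually points at cold-cache queries or a slow tail in the retrieval step.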
Evaluation framework:
```python
def evaluate_rag(test_queries):
    """Evaluate RAG performance on a test set."""
    correct = 0
    total = len(test_queries)
    for item in test_queries:
        query = item["question"]
        expected_answer = item["answer"]
        # Run RAG
        chunks = search_knowledge_base(query)
        answer = generate_answer(query, chunks)
        # Simple string matching (better: use an LLM to judge correctness)
        if expected_answer.lower() in answer.lower():
            correct += 1
    accuracy = correct / total
    print(f"Accuracy: {accuracy:.1%} ({correct}/{total})")
    return accuracy

# Example test set
test_queries = [
    {"question": "What is OpenHelm?", "answer": "AI-powered research assistant"},
    {"question": "How do I reset my password?", "answer": "click Settings > Security"}
]
evaluate_rag(test_queries)
```

Error handling
Common failure modes:
- Embedding API rate limit: Implement exponential backoff, batch requests.
- Vector DB downtime: Add retry logic, fallback to cached answers.
- LLM API timeout: Set timeout (e.g., 30s), return partial answer or error message.
Example (retry with backoff):
```python
import time

def embed_text_with_retry(text, max_retries=3):
    """Embed text with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return embed_text(text)
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Embedding failed, retrying in {wait_time}s...")
            time.sleep(wait_time)
```

Scaling considerations
For <100K chunks:
- Pinecone free tier or Chroma local.
For 100K–1M chunks:
- Pinecone paid tier or self-hosted Weaviate/Qdrant.
- Batch embedding generation (process 1,000 chunks at a time).
For 1M+ chunks:
- Distributed vector DB (Weaviate cluster, Qdrant distributed mode).
- Implement approximate nearest neighbour search (ANN) for speed (HNSW, IVF).
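The batch embedding generation recommended above can be sketched as a simple batching loop. Here `embed_batch` is a placeholder for a real batch embedding call (OpenAI's embeddings endpoint accepts a list of inputs and returns one vector per text), swapped for a stub in the usage example:

```python
def batched(items, batch_size):
    """Yield successive slices of `items` of at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunk_texts, embed_batch, batch_size=1000):
    """Embed chunks batch-by-batch instead of one request per chunk.
    `embed_batch` takes a list of texts and returns a list of vectors."""
    vectors = []
    for batch in batched(chunk_texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Usage with a stub embedder (real code would call the embeddings API):
fake_embed = lambda texts: [[float(len(t))] for t in texts]
vecs = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Batching cuts round trips dramatically and pairs naturally with the retry-with-backoff wrapper above for rate limits.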
Real-world examples
Notion AI
Notion's AI search uses RAG to answer questions about users' personal workspaces. Architecture:
- Chunking: 300-token chunks with 50-token overlap.
- Embeddings: OpenAI text-embedding-ada-002 (now upgraded to text-embedding-3-small).
- Vector DB: Pinecone (as of 2024).
- Retrieval: Hybrid search (semantic + keyword), reranked with cross-encoder.
- Generation: GPT-4 with custom prompt emphasising citation.
Performance: 94% answer accuracy, <1.5s p95 latency (Notion Engineering Blog, 2024).
Intercom Fin
Intercom's AI customer support agent (Fin) uses RAG to answer support queries from help centre docs.
- Chunking: Semantic (split on headings).
- Vector DB: Custom Postgres + pgvector extension.
- Retrieval: Top-10 semantic search → rerank with proprietary model.
- Result: Resolves 48% of support tickets autonomously (Intercom Product Updates, 2024).
Next steps
Week 1: Build MVP
- Parse 10–50 docs from your knowledge base.
- Chunk, embed, insert into Pinecone.
- Implement basic search + generation.
- Test with 20 queries, measure accuracy.
Week 2: Improve retrieval
- Experiment with chunk sizes (200, 400, 600 tokens).
- Add metadata filtering (e.g., "only search product docs").
- Implement reranking or hybrid search.
Week 3: Deploy and monitor
- Add caching (Redis).
- Set up monitoring (log queries, track latency, spot-check answers).
- Deploy behind API (FastAPI, Flask).
Month 2+: Advanced features
- Multi-modal RAG: Support images, tables (using multimodal embeddings).
- Conversational RAG: Track dialogue history, support follow-up questions.
- Federated search: Search across multiple knowledge bases (internal docs + external sources).
---
RAG transforms private knowledge into actionable intelligence. By combining semantic search with LLM generation, you unlock instant, accurate answers to domain-specific questions, without the cost and rigidity of fine-tuning. Start with this zero-to-production guide, iterate on retrieval quality, and scale as your knowledge base grows. Within 2–3 weeks, you'll have a production system answering 90%+ of queries correctly.