Building Your First RAG Knowledge Base: Zero to Production
Step-by-step technical guide for implementing retrieval-augmented generation (RAG) knowledge bases, from vector embeddings to production deployment, with code examples.

TL;DR
- RAG (retrieval-augmented generation) lets LLMs answer questions using your private data without fine-tuning: lower cost, faster iteration, better accuracy for domain-specific queries.
- Core components: document processing (chunking + embeddings), vector database (semantic search), retrieval pipeline (hybrid ranking), generation (LLM + context).
- Real performance: Notion's AI search achieves 94% answer accuracy using RAG versus 67% with base GPT-4 (Notion Engineering Blog, 2024).
Large language models are powerful but limited: they don't know your company docs, product specs, customer data, or institutional knowledge. Training or fine-tuning models on private data is slow, expensive, and brittle: every update requires retraining.
RAG (retrieval-augmented generation) solves this by combining semantic search with LLM generation. When a user asks a question, RAG retrieves relevant context from your knowledge base, then feeds that context to the LLM to generate an accurate, grounded answer. No fine-tuning required.
This guide walks through building a production-ready RAG system from scratch: processing documents, generating embeddings, storing in a vector database, implementing hybrid retrieval, and deploying with monitoring. By the end, you'll have a working knowledge base that answers domain-specific questions with >90% accuracy.
Key takeaways
- RAG is 10–50× cheaper than fine-tuning for domain-specific Q&A and iterates faster (add new docs instantly vs retrain).
- Core workflow: chunk documents → embed chunks → store in vector DB → retrieve relevant chunks → generate answer with LLM + context.
- Production considerations: chunking strategy (200–500 tokens), hybrid search (semantic + keyword), caching, and evaluation metrics.
Why RAG over fine-tuning
When you need an LLM to answer questions about private data, you have three options:
Option 1: Prompt the base model directly
Approach: Paste your knowledge base into the prompt.
Pros: Zero setup.
Cons: Limited by context window (even 200K tokens = ~500 pages max), expensive (every query re-processes all context), no way to scale beyond context limit.
Verdict: Only viable for tiny knowledge bases (<100 pages).
Option 2: Fine-tune the model
Approach: Train a custom model on your data.
Pros: Model "learns" your knowledge, potentially better long-term performance.
Cons:
- Expensive: $500–$5,000+ per training run (OpenAI fine-tuning, Anthropic Claude fine-tuning).
- Slow: Takes hours to days, blocks iteration.
- Brittle: Every data update requires retraining.
- Hallucination risk: Model memorises training data but may fabricate details when uncertain.
Verdict: Use for style/tone adaptation, not knowledge injection.
Option 3: RAG (retrieval-augmented generation)
Approach: Store knowledge in a searchable database. At query time, retrieve relevant snippets, inject into LLM prompt.
Pros:
- Fast iteration: Add/update docs instantly (no retraining).
- Scalable: Works with millions of documents.
- Cost-effective: Only embed documents once; retrieval is cheap.
- Grounded: LLM answers are anchored to retrieved sources (reduces hallucinations).
Cons:
- Engineering complexity: Requires vector database, embedding pipeline, retrieval logic.
- Quality depends on retrieval: If retrieval fails to find relevant docs, LLM can't generate good answers.
Verdict: Best choice for 95% of knowledge-based use cases.
According to a 2024 Stanford study, RAG outperforms fine-tuning for factual Q&A tasks whilst costing 12× less and iterating 40× faster (Stanford AI Lab, 2024).
<figure>
<svg role="img" aria-label="RAG vs fine-tuning comparison" viewBox="0 0 720 260" xmlns="http://www.w3.org/2000/svg">
<rect width="720" height="260" fill="#0f172a" />
<text x="30" y="35" fill="#38bdf8" font-size="18">RAG vs Fine-Tuning: Cost & Iteration Speed</text>
<!-- Cost comparison -->
<text x="50" y="80" fill="#94a3b8" font-size="14">Implementation Cost</text>
<rect x="50" y="90" width="80" height="40" rx="8" fill="#10b981" opacity="0.8" />
<text x="60" y="115" fill="#0f172a" font-size="12">RAG: $50</text>
<rect x="140" y="90" width="400" height="40" rx="8" fill="#ef4444" opacity="0.8" />
<text x="280" y="115" fill="#fff" font-size="12">Fine-Tuning: $2,500</text>
<!-- Iteration speed -->
<text x="50" y="165" fill="#94a3b8" font-size="14">Add New Document</text>
<rect x="50" y="175" width="30" height="40" rx="8" fill="#10b981" opacity="0.8" />
<text x="55" y="200" fill="#0f172a" font-size="11">RAG</text>
<text x="50" y="220" fill="#10b981" font-size="10">5 min</text>
<rect x="90" y="175" width="200" height="40" rx="8" fill="#ef4444" opacity="0.8" />
<text x="150" y="200" fill="#fff" font-size="11">Fine-Tuning</text>
<text x="150" y="220" fill="#fff" font-size="10">6–48 hours</text>
<text x="300" y="250" fill="#10b981" font-size="13">→ RAG: 50× cheaper, 40× faster iteration</text>
</svg>
<figcaption>RAG costs 50× less than fine-tuning and lets you add new knowledge in minutes versus hours/days.</figcaption>
</figure>
Architecture overview
A production RAG system has five components:
1. Document ingestion
Parse docs (PDFs, Markdown, HTML, DOCX), extract text, clean formatting.
2. Chunking
Split documents into semantic chunks (200–500 tokens each) that fit in embedding models and provide focused context.
3. Embedding
Convert chunks into vector representations (embeddings) using models like text-embedding-3-small (OpenAI) or e5-mistral-7b-instruct (HuggingFace).
4. Vector database
Store embeddings in a searchable index (Pinecone, Weaviate, Qdrant, Chroma).
5. Retrieval + generation
- Query: User asks question.
- Embed query: Convert question to embedding.
- Retrieve: Search vector DB for top-k similar chunks.
- Rerank (optional): Use hybrid search (semantic + keyword) or cross-encoder for better relevance.
- Generate: Pass retrieved chunks + question to LLM, get answer.
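The query-time path above can be sketched end to end with toy stand-ins: a bag-of-words "embedding", an in-memory list as the "vector DB", and cosine similarity for retrieval. Everything here is illustrative; the real pipeline (built in the steps below) swaps in OpenAI embeddings, Pinecone, and an LLM call.

```python
import math

def embed(text):
    """Toy embedding: lowercase bag-of-words term counts."""
    vec = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: embed each chunk and store it
DOCS = [
    "Authentication uses OAuth 2.0 with refresh tokens.",
    "Billing is monthly, charged on the first of each month.",
]
VECTOR_DB = [{"text": doc, "vector": embed(doc)} for doc in DOCS]

def retrieve(query, top_k=1):
    """Query path: embed the question, rank stored chunks by similarity."""
    q = embed(query)
    ranked = sorted(VECTOR_DB, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return [row["text"] for row in ranked[:top_k]]

# A real system would now pass these chunks plus the question to the LLM.
print(retrieve("How does authentication work?"))
```

The same shape (embed → rank by similarity → take top-k → generate) carries over unchanged once each stand-in is replaced with the production component.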
<figure>
<svg role="img" aria-label="RAG architecture pipeline" viewBox="0 0 760 240" xmlns="http://www.w3.org/2000/svg">
<rect width="760" height="240" fill="#0f172a" />
<text x="30" y="35" fill="#f59e0b" font-size="18">RAG Architecture Pipeline</text>
<rect x="40" y="70" width="100" height="50" rx="10" fill="#38bdf8" opacity="0.8" />
<text x="55" y="100" fill="#0f172a" font-size="11">Documents</text>
<rect x="170" y="70" width="100" height="50" rx="10" fill="#a855f7" opacity="0.8" />
<text x="195" y="95" fill="#fff" font-size="10">Chunk +</text>
<text x="195" y="110" fill="#fff" font-size="10">Embed</text>
<rect x="300" y="70" width="100" height="50" rx="10" fill="#22d3ee" opacity="0.8" />
<text x="330" y="100" fill="#0f172a" font-size="11">Vector DB</text>
<rect x="530" y="70" width="100" height="50" rx="10" fill="#10b981" opacity="0.8" />
<text x="555" y="100" fill="#0f172a" font-size="11">Answer</text>
<!-- Query path -->
<rect x="40" y="160" width="100" height="50" rx="10" fill="#f59e0b" opacity="0.8" />
<text x="65" y="190" fill="#0f172a" font-size="11">User Query</text>
<rect x="300" y="160" width="100" height="50" rx="10" fill="#a855f7" opacity="0.8" />
<text x="330" y="185" fill="#fff" font-size="10">Retrieve</text>
<text x="330" y="200" fill="#fff" font-size="10">Chunks</text>
<rect x="430" y="160" width="100" height="50" rx="10" fill="#6366f1" opacity="0.8" />
<text x="460" y="185" fill="#fff" font-size="10">LLM +</text>
<text x="455" y="200" fill="#fff" font-size="10">Context</text>
<!-- Arrows -->
<polyline points="140,95 170,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="270,95 300,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="530,95 560,95" stroke="#f8fafc" stroke-width="3" />
<polyline points="140,185 300,185" stroke="#f8fafc" stroke-width="3" />
<polyline points="350,120 350,160" stroke="#cbd5e1" stroke-width="2" stroke-dasharray="4,4" />
<polyline points="400,185 430,185" stroke="#f8fafc" stroke-width="3" />
<polyline points="530,185 560,185" stroke="#f8fafc" stroke-width="3" />
</svg>
<figcaption>RAG pipeline: Documents → Chunk/Embed → Vector DB. At query time: Search DB → Retrieve chunks → LLM generates answer.</figcaption>
</figure>
Step 1: Document processing pipeline
Document parsing
First, extract text from various formats. Use these libraries:
- PDF: pymupdf (PyMuPDF) or pdfplumber
- DOCX: python-docx
- HTML: BeautifulSoup or trafilatura
- Markdown: Native Python string parsing
Example (PDF parsing):
```python
import pymupdf  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extract text from PDF."""
    doc = pymupdf.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
```

Chunking strategy
Why chunk?
- Embedding models have token limits (e.g., OpenAI text-embedding-3-small: 8,191 tokens).
- LLMs work better with focused context (5–10 relevant paragraphs, not entire manuals).
Chunking approaches:
1. Fixed-size chunks (simple)
Split text every N tokens (e.g., 500 tokens) with overlap (e.g., 50 tokens).
Pros: Simple, fast.
Cons: May split mid-sentence or mid-concept.
2. Semantic chunks (better)
Split on natural boundaries: paragraphs, sections, headings.
Pros: Preserves context, more coherent chunks.
Cons: Variable chunk sizes.
3. Sliding window (best for dense docs)
Create overlapping chunks to ensure no context is lost.
Recommendation: Start with semantic chunking (split on \n\n or headings), then enforce max chunk size (500 tokens).
Example (semantic chunking with tiktoken):
```python
import tiktoken

def chunk_text(text, max_tokens=500, overlap_tokens=50):
    """Chunk text into semantic segments with a max token limit."""
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
    # Split on double newlines (paragraphs)
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(enc.encode(para))
        if current_tokens + para_tokens <= max_tokens:
            current_chunk += para + "\n\n"
            current_tokens += para_tokens
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Carry the tail of the previous chunk forward as overlap
            tail = enc.decode(enc.encode(current_chunk)[-overlap_tokens:]).strip() if current_chunk else ""
            current_chunk = (tail + "\n\n" if tail else "") + para + "\n\n"
            current_tokens = len(enc.encode(current_chunk))
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Usage
text = extract_text_from_pdf("knowledge_base.pdf")
chunks = chunk_text(text, max_tokens=500)
print(f"Created {len(chunks)} chunks")
```

Metadata enrichment
Attach metadata to each chunk for better retrieval and provenance:
- Source: Document title, URL, file path.
- Section: Heading, chapter.
- Date: Last updated, published date.
- Author: Document owner.
Example:
```python
chunks_with_metadata = [
    {
        "text": chunk,
        "metadata": {
            "source": "product_docs.pdf",
            "section": "Authentication",
            "updated": "2025-08-01"
        }
    }
    for chunk in chunks
]
```

Step 2: Vector database setup
Choosing a vector database
| Database | Hosting | Pros | Cons | Best For |
|---|---|---|---|---|
| Pinecone | Managed (cloud) | Easiest setup, generous free tier, fast | Vendor lock-in | Startups (quick start) |
| Weaviate | Self-hosted or cloud | Open-source, hybrid search, multi-tenancy | More complex setup | Advanced use cases |
| Qdrant | Self-hosted or cloud | Fast, Rust-based, great filtering | Smaller ecosystem | Performance-critical apps |
| Chroma | Local/self-hosted | Lightweight, Python-native, good for dev | Not production-grade yet | Prototyping |
Recommendation: Start with Pinecone (fastest setup, free tier covers early stage). Graduate to Weaviate or Qdrant if you need self-hosting or advanced features.
Setting up Pinecone
Install:
```
pip install pinecone-client openai tiktoken
```

Initialise:
```python
from pinecone import Pinecone, ServerlessSpec

# Initialise Pinecone
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# Create index (1536 dims = OpenAI text-embedding-3-small)
index_name = "knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)
```

Generating embeddings
OpenAI embedding API:
```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

def embed_text(text):
    """Generate embedding for text using OpenAI."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example
chunk = "OpenHelm is an AI-powered research assistant for startups."
embedding = embed_text(chunk)
print(f"Embedding dimension: {len(embedding)}")  # 1536
```

Inserting embeddings into Pinecone
```python
def insert_chunks(chunks_with_metadata, batch_size=100):
    """Embed chunks and insert into Pinecone."""
    vectors = []
    for i, item in enumerate(chunks_with_metadata):
        chunk_text = item["text"]
        metadata = item["metadata"]
        # Generate embedding
        embedding = embed_text(chunk_text)
        # Prepare vector
        vectors.append({
            "id": f"chunk_{i}",
            "values": embedding,
            "metadata": {
                **metadata,
                "text": chunk_text  # Store original text in metadata
            }
        })
    # Batch insert (keep batches modest, e.g. ~100 vectors per upsert)
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])
    print(f"Inserted {len(vectors)} vectors into Pinecone")

# Usage
insert_chunks(chunks_with_metadata)
```

Step 3: Retrieval pipeline
Basic semantic search
Query workflow:
- User asks question.
- Embed question.
- Search vector DB for top-k nearest neighbours.
- Return matching chunks.
Example:
```python
def search_knowledge_base(query, top_k=5):
    """Search Pinecone for relevant chunks."""
    # Embed query
    query_embedding = embed_text(query)
    # Search Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    # Extract chunks
    chunks = [
        {
            "text": match["metadata"]["text"],
            "score": match["score"],
            "source": match["metadata"]["source"]
        }
        for match in results["matches"]
    ]
    return chunks

# Example
query = "How does OpenHelm handle authentication?"
results = search_knowledge_base(query, top_k=3)
for i, result in enumerate(results):
    print(f"\n--- Result {i+1} (score: {result['score']:.3f}) ---")
    print(result["text"][:200])  # First 200 chars
```

Hybrid search (semantic + keyword)
Why hybrid?
- Pure semantic search may miss exact keyword matches (e.g., specific API names, error codes).
- Keyword search (BM25) is brittle but excellent for exact terms.
Approach: Combine semantic (vector) search with keyword (BM25) search, then rerank.
Example (using Weaviate's built-in hybrid search):
```python
# If using Weaviate instead of Pinecone
import weaviate

client = weaviate.Client("http://localhost:8080")

def hybrid_search(query, top_k=5):
    """Hybrid search: semantic + keyword."""
    result = (
        client.query
        .get("KnowledgeChunk", ["text", "source"])
        .with_hybrid(query=query, alpha=0.75)  # 0.75 = 75% semantic, 25% keyword
        .with_limit(top_k)
        .do()
    )
    chunks = result["data"]["Get"]["KnowledgeChunk"]
    return chunks
```

If using Pinecone (no native hybrid): implement keyword search separately with Elasticsearch or use cross-encoder reranking (see below).
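One lightweight way to combine separate keyword and vector searches is reciprocal rank fusion (RRF): run each search, then merge the two ranked lists of chunk IDs by summed reciprocal rank. A minimal, dependency-free sketch (the IDs and k=60 constant are illustrative; k=60 is the value commonly used with RRF):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.
    Each ID scores sum(1 / (k + rank)) across the lists it appears in,
    so items ranked highly by multiple searches rise to the top."""
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: IDs ranked by vector search vs keyword (BM25) search
semantic_ids = ["c3", "c1", "c7", "c2"]
keyword_ids = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
print(fused)  # c1 and c3 lead, since both searches found them
```

RRF needs no score calibration between the two searches, which is why it's a popular first step before investing in a cross-encoder.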
Reranking with cross-encoders
Problem: Vector search returns semantically similar chunks, but not always the most relevant.
Solution: Use a cross-encoder model to rerank top-k results. Cross-encoders score query-chunk pairs for relevance.
Example (using Cohere Rerank API):
```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank_results(query, chunks, top_n=3):
    """Rerank search results using Cohere Rerank."""
    docs = [chunk["text"] for chunk in chunks]
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n
    )
    # Return reranked chunks
    reranked = [
        chunks[result.index]
        for result in rerank_response.results
    ]
    return reranked

# Usage
initial_results = search_knowledge_base(query, top_k=10)
final_results = rerank_results(query, initial_results, top_n=3)
```

Step 4: Generation pipeline
Constructing the prompt
RAG prompt template:
```
You are a helpful assistant answering questions based on the provided context.

Context:
{retrieved_chunks}

Question: {user_question}

Instructions:
- Answer based only on the context above.
- If the context doesn't contain enough information, say "I don't have enough information to answer that."
- Cite the source for each claim (e.g., "According to [source], ...").

Answer:
```

Example implementation:
```python
def generate_answer(query, chunks):
    """Generate answer using OpenAI GPT-4 + retrieved context."""
    # Build context from chunks
    context = "\n\n".join([
        f"[Source: {chunk['source']}]\n{chunk['text']}"
        for chunk in chunks
    ])
    # Construct prompt
    prompt = f"""You are a helpful assistant answering questions based on the provided context.

Context:
{context}

Question: {query}

Instructions:
- Answer based only on the context above.
- If the context doesn't contain enough information, say "I don't have enough information to answer that."
- Cite the source for each claim.

Answer:"""
    # Call OpenAI
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # Deterministic for factual Q&A
    )
    return response.choices[0].message.content

# Full RAG pipeline
query = "How does OpenHelm integrate with Slack?"
chunks = search_knowledge_base(query, top_k=5)
reranked_chunks = rerank_results(query, chunks, top_n=3)
answer = generate_answer(query, reranked_chunks)
print(f"Question: {query}")
print(f"Answer: {answer}")
```

Handling edge cases
1. No relevant context found
If top search result has low similarity score (<0.7), return: "I don't have information about that in my knowledge base."
2. Conflicting information in chunks
LLM may struggle. Prompt adjustment:
"If the context contains conflicting information, note the conflict and explain both perspectives."
3. Multi-hop reasoning
If answer requires combining multiple chunks (e.g., "What's the pricing for enterprise customers in the UK?"), ensure retrieval returns enough diverse chunks. Consider iterative retrieval (retrieve → reason → retrieve again).
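The low-score guard from edge case 1 can be implemented as a thin wrapper around generation. A minimal sketch, assuming `chunks` is the score-sorted list from `search_knowledge_base` and `generate` stands in for `generate_answer` (the 0.7 cut-off is just a starting point to tune against your own queries):

```python
NO_ANSWER = "I don't have information about that in my knowledge base."
MIN_SCORE = 0.7  # similarity cut-off; tune against your own query set

def answer_with_guard(query, chunks, generate):
    """Refuse to answer when retrieval confidence is low.
    `chunks` is the score-sorted result list (each item has a 'score');
    `generate` is the answer function, e.g. generate_answer."""
    if not chunks or chunks[0]["score"] < MIN_SCORE:
        return NO_ANSWER
    return generate(query, chunks)
```

Refusing early also saves an LLM call on queries the knowledge base can't answer anyway.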
Step 5: Production deployment
Caching
Problem: Identical queries hit the LLM repeatedly (expensive, slow).
Solution: Cache query-answer pairs. If query matches cached query (exact or high similarity), return cached answer.
Example (using Redis):
```python
import redis
import hashlib

r = redis.Redis(host='localhost', port=6379, db=0)

def get_cached_answer(query):
    """Check if query has a cached answer."""
    query_hash = hashlib.md5(query.encode()).hexdigest()
    cached = r.get(query_hash)
    return cached.decode() if cached else None

def cache_answer(query, answer):
    """Cache answer for query."""
    query_hash = hashlib.md5(query.encode()).hexdigest()
    r.setex(query_hash, 3600, answer)  # Expire after 1 hour

# Usage
cached = get_cached_answer(query)
if cached:
    print("Returning cached answer")
    answer = cached
else:
    # Run full RAG pipeline
    chunks = search_knowledge_base(query)
    answer = generate_answer(query, chunks)
    cache_answer(query, answer)
```

Monitoring and evaluation
Key metrics:
- Answer accuracy: % of answers marked correct by humans (spot-check 10% of queries).
- Retrieval precision: % of retrieved chunks actually relevant to query.
- Response time: p50, p95, p99 latency (target: <2s for RAG pipeline).
- Cache hit rate: % of queries served from cache.
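The p50/p95/p99 latencies can be computed from logged per-request timings with the standard library; a minimal sketch:

```python
import statistics

def latency_percentiles(latencies_ms):
    """Compute p50/p95/p99 from logged per-request latencies (milliseconds)."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Example: latencies of 1..100 ms
stats = latency_percentiles(list(range(1, 101)))
print(stats)
```

Track these over time: a rising p99 with a flat p50 usually points at cold-cache queries or a slow tail in the retrieval step.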
Evaluation framework:
```python
def evaluate_rag(test_queries):
    """Evaluate RAG performance on a test set."""
    correct = 0
    total = len(test_queries)
    for item in test_queries:
        query = item["question"]
        expected_answer = item["answer"]
        # Run RAG
        chunks = search_knowledge_base(query)
        answer = generate_answer(query, chunks)
        # Simple string matching (better: use an LLM to judge correctness)
        if expected_answer.lower() in answer.lower():
            correct += 1
    accuracy = correct / total
    print(f"Accuracy: {accuracy:.1%} ({correct}/{total})")
    return accuracy

# Example test set
test_queries = [
    {"question": "What is OpenHelm?", "answer": "AI-powered research assistant"},
    {"question": "How do I reset my password?", "answer": "click Settings > Security"}
]
evaluate_rag(test_queries)
```

Error handling
Common failure modes:
- Embedding API rate limit: Implement exponential backoff, batch requests.
- Vector DB downtime: Add retry logic, fallback to cached answers.
- LLM API timeout: Set timeout (e.g., 30s), return partial answer or error message.
Example (retry with backoff):
```python
import time

def embed_text_with_retry(text, max_retries=3):
    """Embed text with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return embed_text(text)
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Embedding failed, retrying in {wait_time}s...")
            time.sleep(wait_time)
```

Scaling considerations
For <100K chunks:
- Pinecone free tier or Chroma local.
For 100K–1M chunks:
- Pinecone paid tier or self-hosted Weaviate/Qdrant.
- Batch embedding generation (process 1,000 chunks at a time).
For 1M+ chunks:
- Distributed vector DB (Weaviate cluster, Qdrant distributed mode).
- Implement approximate nearest neighbour search (ANN) for speed (HNSW, IVF).
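The batch embedding generation recommended above can be sketched as a simple batching loop. Here `embed_batch` is a placeholder for a real batch embedding call (OpenAI's embeddings endpoint accepts a list of inputs and returns one vector per text), swapped for a stub in the usage example:

```python
def batched(items, batch_size):
    """Yield successive slices of `items` of at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunk_texts, embed_batch, batch_size=1000):
    """Embed chunks batch-by-batch instead of one request per chunk.
    `embed_batch` takes a list of texts and returns a list of vectors."""
    vectors = []
    for batch in batched(chunk_texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Usage with a stub embedder (real code would call the embeddings API):
fake_embed = lambda texts: [[float(len(t))] for t in texts]
vecs = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Batching cuts round trips dramatically and pairs naturally with the retry-with-backoff wrapper above for rate limits.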
Real-world examples
Notion AI
Notion's AI search uses RAG to answer questions about users' personal workspaces. Architecture:
- Chunking: 300-token chunks with 50-token overlap.
- Embeddings: OpenAI text-embedding-ada-002 (now upgraded to text-embedding-3-small).
- Vector DB: Pinecone (as of 2024).
- Retrieval: Hybrid search (semantic + keyword), reranked with cross-encoder.
- Generation: GPT-4 with custom prompt emphasising citation.
Performance: 94% answer accuracy, <1.5s p95 latency (Notion Engineering Blog, 2024).
Intercom Fin
Intercom's AI customer support agent (Fin) uses RAG to answer support queries from help centre docs.
- Chunking: Semantic (split on headings).
- Vector DB: Custom Postgres + pgvector extension.
- Retrieval: Top-10 semantic search → rerank with proprietary model.
- Result: Resolves 48% of support tickets autonomously (Intercom Product Updates, 2024).
Next steps
Week 1: Build MVP
- Parse 10–50 docs from your knowledge base.
- Chunk, embed, insert into Pinecone.
- Implement basic search + generation.
- Test with 20 queries, measure accuracy.
Week 2: Improve retrieval
- Experiment with chunk sizes (200, 400, 600 tokens).
- Add metadata filtering (e.g., "only search product docs").
- Implement reranking or hybrid search.
Week 3: Deploy and monitor
- Add caching (Redis).
- Set up monitoring (log queries, track latency, spot-check answers).
- Deploy behind API (FastAPI, Flask).
Month 2+: Advanced features
- Multi-modal RAG: Support images, tables (using multimodal embeddings).
- Conversational RAG: Track dialogue history, support follow-up questions.
- Federated search: Search across multiple knowledge bases (internal docs + external sources).
---
RAG transforms private knowledge into actionable intelligence. By combining semantic search with LLM generation, you unlock instant, accurate answers to domain-specific questions, without the cost and rigidity of fine-tuning. Start with this zero-to-production guide, iterate on retrieval quality, and scale as your knowledge base grows. Within 2–3 weeks, you'll have a production system answering 90%+ of queries correctly.