You've probably seen RAG mentioned in every AI product launch for the past two years. The pitch is always the same: "we use RAG to give the AI access to your data." But what does that actually mean? And why should you care?
This guide is for developers who want a clear, no-hype explanation of RAG -- what it is, how it works, why naive implementations fail, and what a production-grade pipeline actually looks like.
## The Problem RAG Solves
Large language models (GPT-4, Claude, Gemini) are trained on internet-scale data. They know a lot. But they don't know:
- Your company's internal documentation
- Your product's latest features (released after training cutoff)
- Your customers' support tickets
- Anything behind a login wall
When you ask an LLM about something it doesn't know, it does one of two things:
- Admits it doesn't know (rare, getting better)
- Confidently makes something up -- this is called a hallucination
Hallucinations are the fundamental problem. Your chatbot tells a customer the wrong refund policy. Your internal tool gives a developer outdated API docs. Your support bot invents a feature that doesn't exist.
RAG prevents this by giving the LLM your actual data to work with.
## How RAG Works
RAG adds a retrieval step before the LLM generates a response:
```
User Question → Retrieve Relevant Documents → Feed to LLM → Generate Answer
```

Here's the breakdown:
### 1. Indexing (happens once, at ingestion time)
Before you can search your data, you need to index it:
- Chunk your documents into smaller pieces (~500 tokens each)
- Embed each chunk -- convert it into a vector (a list of numbers) using an embedding model
- Store the vectors in a searchable index
The embedding model (like Jina v4) captures the meaning of text. Similar concepts get similar vectors. "automobile" and "car" end up near each other in vector space, even though they share no characters.
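The chunking step above can be sketched in a few lines. This is a simplified character-based version (a production chunker splits on tokens and respects paragraph and heading boundaries); the overlap makes sure a sentence cut at a boundary appears in both neighboring chunks:

```typescript
// Naive fixed-size chunking with overlap. `size` and `overlap` are in
// characters here for simplicity; real pipelines count tokens instead.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  // Advance by (size - overlap) so consecutive chunks share `overlap`
  // characters of context.
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```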
### 2. Retrieval (happens at query time)
When a user asks a question:
- Embed the query using the same embedding model
- Search the vector index for chunks whose embeddings are similar to the query embedding
- Return the top-K most relevant chunks
This is called semantic search -- it finds results based on meaning, not just keyword matches.
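At its core, the retrieval step is a nearest-neighbor search by cosine similarity. Here is a minimal in-memory sketch with toy 3-dimensional vectors; real systems use an approximate-nearest-neighbor index (e.g. HNSW) instead of a linear scan:

```typescript
type Chunk = { id: string; embedding: number[] };

// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. 1.0 means identical direction.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the K chunks whose embeddings are closest to the query.
function topK(query: number[], index: Chunk[], k: number): Chunk[] {
  return [...index]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```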
### 3. Generation (the LLM step)
Take the retrieved chunks and pass them to the LLM as context:
```
System: Answer the user's question based on the following context.
Do not make up information. If the context doesn't contain the answer,
say you don't know.

Context:
[chunk 1: "All employees receive 20 days of paid vacation..."]
[chunk 2: "Vacation requests must be submitted 2 weeks in advance..."]

User: How many vacation days do I get?
```

The LLM now generates an answer grounded in your actual data instead of guessing.
## Why Naive RAG Fails
The 3-step version (embed, search, return) works for demos. It breaks in production for several reasons:
### Problem 1: Semantic Search Misses Keywords
Semantic search is great at understanding meaning. But it can miss exact terms. If your document contains "error code ERR_AUTH_FAILED" and a user searches for exactly that string, semantic search might rank it lower than a conceptually similar but wrong result.
Fix: Hybrid search. Run both semantic search (vectors) and keyword search (BM25) in parallel, then merge results using Reciprocal Rank Fusion (RRF). You get the best of both worlds.
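RRF itself is simple: each result list contributes `1 / (k + rank)` per document, and the fused score is the sum across lists. A minimal sketch (using the conventional constant k = 60):

```typescript
// Reciprocal Rank Fusion: merge multiple ranked lists of document ids.
// A document ranked highly in either list floats to the top; a document
// ranked moderately in both can beat one ranked highly in only one.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      // rank is 1-based, so position i contributes 1 / (k + i + 1)
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Feed it the semantic ranking and the BM25 ranking and take the top of the fused list.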
### Problem 2: Vague Queries
Users ask vague questions. "How does auth work?" could mean authentication flow, authorization rules, API key management, or session handling. A single embedding of this vague query won't match any of these well.
Fix: Query expansion and decomposition. Use an LLM to break the query into sub-questions or expand it with related terms before searching. "How does auth work?" becomes:
- "What authentication methods are supported?"
- "How are sessions managed?"
- "What authorization rules are applied?"
Each sub-query is searched independently, and results are merged.
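A minimal sketch of that flow, with `decompose` standing in for the LLM call that splits the question and `search` standing in for any retrieval function (both are placeholder parameters, not a real API):

```typescript
// Run each sub-query independently, then merge the results,
// de-duplicating chunk ids while preserving first-seen order.
function decomposedSearch(
  query: string,
  decompose: (q: string) => string[],
  search: (q: string) => string[]
): string[] {
  const merged: string[] = [];
  const seen = new Set<string>();
  for (const sub of decompose(query)) {
    for (const chunkId of search(sub)) {
      if (!seen.has(chunkId)) {
        seen.add(chunkId);
        merged.push(chunkId);
      }
    }
  }
  return merged;
}
```

In practice you would merge with RRF rather than simple concatenation, so a chunk retrieved by several sub-queries ranks higher.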
### Problem 3: Short Queries Match Poorly
Short queries produce vague embeddings. "refund" as a 1024-dimensional vector doesn't carry much signal.
Fix: HyDE (Hypothetical Document Embeddings). Have the LLM generate a hypothetical answer to the query, then embed that instead of the raw query. The hypothetical answer is longer and more specific, producing a better embedding for search.
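The HyDE flow is only a few lines once the model calls are abstracted away. In this sketch `generate` and `embed` are placeholder parameters for your LLM and embedding model (shown synchronous for brevity; real calls are async):

```typescript
// HyDE: embed a hypothetical answer instead of the raw query.
// Even if the generated answer is wrong in its details, its *shape*
// (length, vocabulary, specificity) produces a much better search vector.
function hydeEmbed(
  query: string,
  generate: (prompt: string) => string,
  embed: (text: string) => number[]
): number[] {
  const hypothetical = generate(
    `Write a short passage that answers this question: ${query}`
  );
  return embed(hypothetical);
}
```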
### Problem 4: One Ranking Isn't Enough
Vector similarity is a rough ranking. The top-10 results by cosine similarity are not necessarily the 10 most relevant results.
Fix: Reranking. After the initial retrieval, pass the top results through a reranker model (like Jina Reranker v3) that scores each result against the query with much higher precision. This second pass dramatically improves result quality.
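Structurally, reranking is score-then-sort over the candidate set. In this sketch `scoreRelevance` is a stand-in parameter for the reranker model's scoring call:

```typescript
// Second-pass reranking: score every (query, chunk) pair with a more
// precise model, then keep the top N. This is affordable because it
// only runs over the few dozen candidates from the first pass.
function rerank(
  query: string,
  candidates: string[],
  scoreRelevance: (query: string, chunk: string) => number,
  topN: number
): string[] {
  return candidates
    .map((chunk) => ({ chunk, score: scoreRelevance(query, chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((c) => c.chunk);
}
```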
### Problem 5: Context Is Missing
A chunk is ~500 tokens. Sometimes the answer spans two adjacent chunks, or the chunk only makes sense with its surrounding context.
Fix: Chunk expansion. After identifying relevant chunks, fetch the chunks immediately before and after them to provide more context to the LLM.
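This works when chunks are stored with their position in the source document, so neighbors can be looked up by index. A sketch:

```typescript
type StoredChunk = { docId: string; position: number; text: string };

// For each hit, also fetch the chunks within `window` positions of it
// in the same document, de-duplicated and returned in document order.
function expandChunks(
  hits: StoredChunk[],
  all: StoredChunk[],
  window = 1
): StoredChunk[] {
  const wanted = new Set<string>();
  for (const hit of hits) {
    for (let p = hit.position - window; p <= hit.position + window; p++) {
      wanted.add(`${hit.docId}:${p}`);
    }
  }
  return all
    .filter((c) => wanted.has(`${c.docId}:${c.position}`))
    .sort((a, b) =>
      a.docId === b.docId ? a.position - b.position : a.docId < b.docId ? -1 : 1
    );
}
```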
### Problem 6: Isolated Documents
Documents don't exist in isolation. Your refund policy references your customer service team, which is described in a different document. The engineering handbook references the code review process, which lives in a different wiki.
Fix: Knowledge graph traversal (GraphRAG). At ingestion time, extract entities and relationships from your documents. At query time, traverse the graph to find related entities from other documents, even if they don't match the query directly.
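The traversal itself is a bounded breadth-first walk over the extracted entity graph. A minimal sketch, with entities as node names and relationships as an adjacency map:

```typescript
// Collect every entity reachable within `maxHops` of the entities
// mentioned in the query. Chunks linked to these entities are then
// added to the context, even if they never matched the query text.
function relatedEntities(
  start: string[],
  edges: Map<string, string[]>,
  maxHops: number
): Set<string> {
  const seen = new Set(start);
  let frontier = start;
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of edges.get(node) ?? []) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return seen;
}
```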
## What a Production RAG Pipeline Looks Like
Here's what a real pipeline handles, beyond the naive 3-step version:
| Step | Purpose |
|---|---|
| Cache check | Skip re-embedding for repeated queries |
| Query routing | Classify query complexity to skip unnecessary steps |
| Query decomposition | Break compound queries into sub-questions |
| Query expansion | Add related terms to improve recall |
| HyDE | Generate hypothetical answer for better embedding |
| Dual search | Semantic + BM25 in parallel |
| RRF fusion | Merge results from both search methods |
| Access control | Filter results the user isn't allowed to see |
| Deduplication | Remove redundant chunks |
| GraphRAG | Traverse knowledge graph for related context |
| Reranking | Second-pass precision scoring |
| Chunk expansion | Fetch surrounding context |
| Temporal boost | Favor recent content when relevant |
| Confidence scoring | Tell the UI how confident the result is |
| Episodic memory | Incorporate user-specific context |
| Formatting | Citations, audit logging, caching |
That's 16 steps. Building this from scratch takes weeks. Maintaining it takes longer.
## Metrics That Matter
When evaluating a RAG pipeline, these are the metrics to track:
- Recall@K -- Of all relevant documents, what fraction did we retrieve in the top K results?
- Precision@K -- Of the top K results, what fraction are actually relevant?
- MRR (Mean Reciprocal Rank) -- How high up is the first relevant result?
- Latency -- How long does the full pipeline take? (Target: less than 1s for most queries)
- Hallucination rate -- How often does the LLM generate answers not grounded in the retrieved context?
A good production pipeline should hit >0.85 recall and >0.75 precision on your evaluation set while staying under 800ms for complex queries.
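The retrieval metrics above are straightforward to compute over a labelled evaluation set, where `relevant` is the set of ground-truth document ids for a query and `retrieved` is the ranked list the pipeline returned:

```typescript
// Recall@K: fraction of all relevant docs found in the top K.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Precision@K: fraction of the top K that are actually relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.length === 0 ? 0 : top.filter((id) => relevant.has(id)).length / top.length;
}

// MRR: average of 1 / (rank of first relevant result) across queries.
function mrr(queries: { retrieved: string[]; relevant: Set<string> }[]): number {
  const total = queries.reduce((sum, q) => {
    const rank = q.retrieved.findIndex((id) => q.relevant.has(id));
    return sum + (rank === -1 ? 0 : 1 / (rank + 1));
  }, 0);
  return queries.length === 0 ? 0 : total / queries.length;
}
```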
## Building a RAG Pipeline on Convex
If you're building on Convex, you can use Memcity to get a production-grade 16-step RAG pipeline without building it from scratch.
```shell
npx shadcn@latest add @memcity/community
```

Memcity installs as source code into your project. The full pipeline runs on Convex's serverless infrastructure -- no separate vector database, no infrastructure to manage.
```typescript
import { memory } from "./memory";

// Ingest documents
await memory.ingestText(ctx, {
  orgId: "org_123",
  knowledgeBaseId: "kb_456",
  text: "Your document content...",
  source: "policy.md",
});

// Search with the full pipeline
const results = await memory.getContext(ctx, {
  orgId: "org_123",
  knowledgeBaseId: "kb_456",
  query: "How many vacation days do I get?",
});
```

The community tier includes hybrid search, BM25, RRF fusion, and confidence scoring. Pro adds the full 16-step pipeline with query routing, HyDE, GraphRAG, reranking, and episodic memory.
## Further Reading
- How to Build a RAG Pipeline with Convex -- Step-by-step implementation guide
- Search Pipeline -- Deep dive into each of the 16 pipeline steps
- Knowledge Graph -- How entity extraction and graph traversal improve retrieval
- Getting Started -- Install Memcity and run your first search in 5 minutes