You've probably seen RAG mentioned in every AI product launch for the past two years. The pitch is always the same: "we use RAG to give the AI access to your data." But what does that actually mean? And why should you care?
This guide is for developers who want a clear, no-hype explanation of RAG -- what it is, how it works, why naive implementations fail, and what a production-grade pipeline actually looks like.
## The Problem RAG Solves
Large language models (GPT-4, Claude, Gemini) are trained on internet-scale data. They know a lot. But they don't know:
- Your company's internal documentation
- Your product's latest features (released after training cutoff)
- Your customers' support tickets
- Anything behind a login wall
When you ask an LLM about something it doesn't know, it does one of two things:
- Admits it doesn't know (rare, getting better)
- Confidently makes something up -- this is called a hallucination
Hallucinations are the fundamental problem. Your chatbot tells a customer the wrong refund policy. Your internal tool gives a developer outdated API docs. Your support bot invents a feature that doesn't exist.
RAG prevents this by giving the LLM your actual data to work with.
## How RAG Works
RAG adds a retrieval step before the LLM generates a response:
```
User Question → Retrieve Relevant Documents → Feed to LLM → Generate Answer
```

Here's the breakdown:
### 1. Indexing (happens once, at ingestion time)
Before you can search your data, you need to index it:
- Chunk your documents into smaller pieces (~500 tokens each)
- Embed each chunk -- convert it into a vector (a list of numbers) using an embedding model
- Store the vectors in a searchable index
The embedding model (like Jina v4) captures the meaning of text. Similar concepts get similar vectors. "automobile" and "car" end up near each other in vector space, even though they share no characters.
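The chunking step above can be sketched in a few lines. This is a simplified character-based version (a production chunker splits on tokens and respects paragraph and heading boundaries); the overlap makes sure a sentence cut at a boundary appears in both neighboring chunks:

```typescript
// Naive fixed-size chunking with overlap. `size` and `overlap` are in
// characters here for simplicity; real pipelines count tokens instead.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  // Advance by (size - overlap) so consecutive chunks share `overlap`
  // characters of context.
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```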
### 2. Retrieval (happens at query time)
When a user asks a question:
- Embed the query using the same embedding model
- Search the vector index for chunks whose embeddings are similar to the query embedding
- Return the top-K most relevant chunks
This is called semantic search -- it finds results based on meaning, not just keyword matches.
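At its core, the retrieval step is a nearest-neighbor search by cosine similarity. Here is a minimal in-memory sketch with toy 3-dimensional vectors; real systems use an approximate-nearest-neighbor index (e.g. HNSW) instead of a linear scan:

```typescript
type Chunk = { id: string; embedding: number[] };

// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. 1.0 means identical direction.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the K chunks whose embeddings are closest to the query.
function topK(query: number[], index: Chunk[], k: number): Chunk[] {
  return [...index]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```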
### 3. Generation (the LLM step)
Take the retrieved chunks and pass them to the LLM as context:
```
System: Answer the user's question based on the following context.
Do not make up information. If the context doesn't contain the answer,
say you don't know.

Context:
[chunk 1: "All employees receive 20 days of paid vacation..."]
[chunk 2: "Vacation requests must be submitted 2 weeks in advance..."]

User: How many vacation days do I get?
```

The LLM now generates an answer grounded in your actual data instead of guessing.
## Why Naive RAG Fails
The 3-step version (embed, search, return) works for demos. It breaks in production for several reasons:
### Problem 1: Semantic Search Misses Keywords
Semantic search is great at understanding meaning. But it can miss exact terms. If your document contains "error code ERR_AUTH_FAILED" and a user searches for exactly that string, semantic search might rank it lower than a conceptually similar but wrong result.
Fix: Hybrid search. Run both semantic search (vectors) and keyword search (BM25) in parallel, then merge results using Reciprocal Rank Fusion (RRF). You get the best of both worlds.
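RRF itself is simple: each result list contributes `1 / (k + rank)` per document, and the fused score is the sum across lists. A minimal sketch (using the conventional constant k = 60):

```typescript
// Reciprocal Rank Fusion: merge multiple ranked lists of document ids.
// A document ranked highly in either list floats to the top; a document
// ranked moderately in both can beat one ranked highly in only one.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      // rank is 1-based, so position i contributes 1 / (k + i + 1)
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Feed it the semantic ranking and the BM25 ranking and take the top of the fused list.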
### Problem 2: Vague Queries
Users ask vague questions. "How does auth work?" could mean authentication flow, authorization rules, API key management, or session handling. A single embedding of this vague query won't match any of these well.
Fix: Query expansion and decomposition. Use an LLM to break the query into sub-questions or expand it with related terms before searching. "How does auth work?" becomes:
- "What authentication methods are supported?"
- "How are sessions managed?"
- "What authorization rules are applied?"
Each sub-query is searched independently, and results are merged.
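A minimal sketch of that flow, with `decompose` standing in for the LLM call that splits the question and `search` standing in for any retrieval function (both are placeholder parameters, not a real API):

```typescript
// Run each sub-query independently, then merge the results,
// de-duplicating chunk ids while preserving first-seen order.
function decomposedSearch(
  query: string,
  decompose: (q: string) => string[],
  search: (q: string) => string[]
): string[] {
  const merged: string[] = [];
  const seen = new Set<string>();
  for (const sub of decompose(query)) {
    for (const chunkId of search(sub)) {
      if (!seen.has(chunkId)) {
        seen.add(chunkId);
        merged.push(chunkId);
      }
    }
  }
  return merged;
}
```

In practice you would merge with RRF rather than simple concatenation, so a chunk retrieved by several sub-queries ranks higher.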
### Problem 3: Short Queries Match Poorly
Short queries produce vague embeddings. "refund" as a 1024-dimensional vector doesn't carry much signal.
Fix: HyDE (Hypothetical Document Embeddings). Have the LLM generate a hypothetical answer to the query, then embed that instead of the raw query. The hypothetical answer is longer and more specific, producing a better embedding for search.
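The HyDE flow is only a few lines once the model calls are abstracted away. In this sketch `generate` and `embed` are placeholder parameters for your LLM and embedding model (shown synchronous for brevity; real calls are async):

```typescript
// HyDE: embed a hypothetical answer instead of the raw query.
// Even if the generated answer is wrong in its details, its *shape*
// (length, vocabulary, specificity) produces a much better search vector.
function hydeEmbed(
  query: string,
  generate: (prompt: string) => string,
  embed: (text: string) => number[]
): number[] {
  const hypothetical = generate(
    `Write a short passage that answers this question: ${query}`
  );
  return embed(hypothetical);
}
```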
### Problem 4: One Ranking Isn't Enough
Vector similarity is a rough ranking. The top-10 results by cosine similarity are not necessarily the 10 most relevant results.
Fix: Reranking. After the initial retrieval, pass the top results through a reranker model (like Jina Reranker v3) that scores each result against the query with much higher precision. This second pass dramatically improves result quality.
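Structurally, reranking is score-then-sort over the candidate set. In this sketch `scoreRelevance` is a stand-in parameter for the reranker model's scoring call:

```typescript
// Second-pass reranking: score every (query, chunk) pair with a more
// precise model, then keep the top N. This is affordable because it
// only runs over the few dozen candidates from the first pass.
function rerank(
  query: string,
  candidates: string[],
  scoreRelevance: (query: string, chunk: string) => number,
  topN: number
): string[] {
  return candidates
    .map((chunk) => ({ chunk, score: scoreRelevance(query, chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((c) => c.chunk);
}
```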
### Problem 5: Context Is Missing
A chunk is ~500 tokens. Sometimes the answer spans two adjacent chunks, or the chunk only makes sense with its surrounding context.
Fix: Chunk expansion. After identifying relevant chunks, fetch the chunks immediately before and after them to provide more context to the LLM.
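This works when chunks are stored with their position in the source document, so neighbors can be looked up by index. A sketch:

```typescript
type StoredChunk = { docId: string; position: number; text: string };

// For each hit, also fetch the chunks within `window` positions of it
// in the same document, de-duplicated and returned in document order.
function expandChunks(
  hits: StoredChunk[],
  all: StoredChunk[],
  window = 1
): StoredChunk[] {
  const wanted = new Set<string>();
  for (const hit of hits) {
    for (let p = hit.position - window; p <= hit.position + window; p++) {
      wanted.add(`${hit.docId}:${p}`);
    }
  }
  return all
    .filter((c) => wanted.has(`${c.docId}:${c.position}`))
    .sort((a, b) =>
      a.docId === b.docId ? a.position - b.position : a.docId < b.docId ? -1 : 1
    );
}
```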
### Problem 6: Isolated Documents
Documents don't exist in isolation. Your refund policy references your customer service team, which is described in a different document. The engineering handbook references the code review process, which lives in a different wiki.
Fix: Knowledge graph traversal (GraphRAG). At ingestion time, extract entities and relationships from your documents. At query time, traverse the graph to find related entities from other documents, even if they don't match the query directly.
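The traversal itself is a bounded breadth-first walk over the extracted entity graph. A minimal sketch, with entities as node names and relationships as an adjacency map:

```typescript
// Collect every entity reachable within `maxHops` of the entities
// mentioned in the query. Chunks linked to these entities are then
// added to the context, even if they never matched the query text.
function relatedEntities(
  start: string[],
  edges: Map<string, string[]>,
  maxHops: number
): Set<string> {
  const seen = new Set(start);
  let frontier = start;
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of edges.get(node) ?? []) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return seen;
}
```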
## What a Production RAG Pipeline Looks Like
Here's what a real pipeline handles, beyond the naive 3-step version:
| Step | Purpose |
|---|---|
| Cache check | Skip re-embedding for repeated queries |
| Query routing | Classify query complexity to skip unnecessary steps |
| Query decomposition | Break compound queries into sub-questions |
| Query expansion | Add related terms to improve recall |
| HyDE | Generate hypothetical answer for better embedding |
| Dual search | Semantic + BM25 in parallel |
| RRF fusion | Merge results from both search methods |
| Access control | Filter results the user isn't allowed to see |
| Deduplication | Remove redundant chunks |
| GraphRAG | Traverse knowledge graph for related context |
| Reranking | Second-pass precision scoring |
| Chunk expansion | Fetch surrounding context |
| Temporal boost | Favor recent content when relevant |
| Confidence scoring | Tell the UI how confident the result is |
| Episodic memory | Incorporate user-specific context |
| Formatting | Citations, audit logging, caching |
That's 16 steps. Building this from scratch takes weeks. Maintaining it takes longer.
## Metrics That Matter
When evaluating a RAG pipeline, these are the metrics to track:
- Recall@K -- Of all relevant documents, what fraction did we retrieve in the top K results?
- Precision@K -- Of the top K results, what fraction are actually relevant?
- MRR (Mean Reciprocal Rank) -- How high up is the first relevant result?
- Latency -- How long does the full pipeline take? (Target: less than 1s for most queries)
- Hallucination rate -- How often does the LLM generate answers not grounded in the retrieved context?
A good production pipeline should hit >0.85 recall and >0.75 precision on your evaluation set while staying under 800ms for complex queries.
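The retrieval metrics above are straightforward to compute over a labelled evaluation set, where `relevant` is the set of ground-truth document ids for a query and `retrieved` is the ranked list the pipeline returned:

```typescript
// Recall@K: fraction of all relevant docs found in the top K.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Precision@K: fraction of the top K that are actually relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.length === 0 ? 0 : top.filter((id) => relevant.has(id)).length / top.length;
}

// MRR: average of 1 / (rank of first relevant result) across queries.
function mrr(queries: { retrieved: string[]; relevant: Set<string> }[]): number {
  const total = queries.reduce((sum, q) => {
    const rank = q.retrieved.findIndex((id) => q.relevant.has(id));
    return sum + (rank === -1 ? 0 : 1 / (rank + 1));
  }, 0);
  return queries.length === 0 ? 0 : total / queries.length;
}
```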
## Building a RAG Pipeline on Convex
If you're building on Convex, you can use Memcity to get a production-grade 16-step RAG pipeline without building it from scratch.
```shell
npx shadcn@latest add @memcity/community
```

Memcity installs as source code into your project. The full pipeline runs on Convex's serverless infrastructure -- no separate vector database, no infrastructure to manage.
```typescript
import { memory } from "./memory";

// Ingest documents
await memory.ingestText(ctx, {
  orgId: "org_123",
  knowledgeBaseId: "kb_456",
  text: "Your document content...",
  source: "policy.md",
});

// Search with the full pipeline
const results = await memory.getContext(ctx, {
  orgId: "org_123",
  knowledgeBaseId: "kb_456",
  query: "How many vacation days do I get?",
});
```

The community tier includes hybrid search, BM25, RRF fusion, and confidence scoring. Pro adds the full 16-step pipeline with query routing, HyDE, GraphRAG, reranking, and episodic memory.
## Further Reading
- How to Build a RAG Pipeline with Convex -- Step-by-step implementation guide
- Search Pipeline -- Deep dive into each of the 16 pipeline steps
- Knowledge Graph -- How entity extraction and graph traversal improve retrieval
- Getting Started -- Install Memcity and run your first search in 5 minutes