Search Architecture
How we tested 40 configurations to find the optimal search pipeline — and how to tune it for your use case.
Why This Page Exists
Most RAG libraries give you one configuration and say "good luck." We didn't think that was good enough.
We ran a 40-scenario benchmark lab against our own documentation — testing 5 different ways to prepare content and 8 different retrieval strategies. Then we used those results to implement 7 targeted improvements to the pipeline.
This page explains what we found and, more importantly, how to use those findings to configure Memcity for your specific use case. You don't need to understand AI to follow along — we'll break down every concept along the way.
The Key Insight: How You Store Content Matters More Than How You Search It
This was the most surprising finding from our research:
The way you prepare and store your content is the #1 factor in search quality — not the search algorithm, not the AI model, not the number of pipeline steps.
Think of it like a library. You could build the world's best search engine, but if the books are randomly shelved with no labels, the search engine can't help you. The same content, organized with titles, section labels, and consistent structure, becomes dramatically easier to find.
Here's the proof from our lab:
| How content was stored | Hit@1 accuracy | Description |
|---|---|---|
| Full wrapper (title + section + slug + content) | 87.5% | Each chunk includes the page title, which section it belongs to, and a URL-safe identifier |
| Minimal wrapper (title + content) | 75.0% | Only the page title is included |
| No wrapper (raw content only) | 75.0% | Content is stored as-is with no metadata |
| Smaller chunks (384 tokens) | 68.8% | Content split into smaller pieces |
| Larger chunks (768 tokens) | 68.8% | Content split into larger pieces |
The full wrapper won by a landslide — a 12.5-percentage-point improvement over the next-best approach. And it didn't cost any extra money or add any latency. It's just better data preparation.
What is a "wrapper"?
When Memcity ingests a document, it can wrap each chunk of text with metadata:
```
# Vacation Policy
Section: HR Policies
Slug: vacation-policy

Employees receive 20 days of paid vacation per year...
```

The `# Title`, `Section:`, and `Slug:` lines aren't part of the original document — they're metadata that Memcity adds during ingestion. This metadata serves as lexical anchors — when someone searches for "vacation policy," the BM25 keyword search matches both the content ("vacation," "paid") and the title ("Vacation Policy"), making it much more likely to rank the right chunk first.
You get this for free. Memcity adds full wrappers by default. You don't need to change anything.
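For illustration, a full wrapper could be assembled with a helper like the one below. The `wrapChunk` function and its `ChunkMeta` fields are hypothetical names for this sketch, not Memcity's actual internals:

```typescript
// Hypothetical sketch of "full wrapper" chunk preparation.
// Names and fields are illustrative, not Memcity's real API.
interface ChunkMeta {
  title: string;
  section: string;
  slug: string;
}

function wrapChunk(meta: ChunkMeta, content: string): string {
  // Prepend metadata lines so BM25 keyword search can match
  // title and section terms as well as the body text.
  return [
    `# ${meta.title}`,
    `Section: ${meta.section}`,
    `Slug: ${meta.slug}`,
    "",
    content,
  ].join("\n");
}

const wrapped = wrapChunk(
  { title: "Vacation Policy", section: "HR Policies", slug: "vacation-policy" },
  "Employees receive 20 days of paid vacation per year..."
);
```

The point of the sketch is only that the wrapper is plain text prepended at ingestion time, which is why it adds no query-time latency.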
The Retrieval Configurations We Tested
We tested 8 retrieval strategies against each ingestion format. Here's what they mean in plain English:
| Strategy | What it does | Latency | Accuracy |
|---|---|---|---|
| Semantic-heavy fusion | Weights "meaning" search higher than keyword search | ~800ms | 87.5% Hit@1 (best) |
| Balanced (no LLM) | Equal weight for meaning and keyword search | ~1,100ms | 87.5% Hit@1 |
| No reranker | Skips the "interview" step that re-scores results | ~1,050ms | 87.5% Hit@1 |
| Graph-enabled | Follows entity connections in the knowledge graph | ~900ms | 87.5% Hit@1 |
| High recall | Fetches more candidates, re-scores more results | ~1,500ms | 87.5% Hit@1 |
| Keyword-heavy | Weights exact keyword matching higher | ~1,100ms | 81.3% Hit@1 |
| HyDE + expansion | Generates hypothetical answers + query variations | ~8,500ms | 81.3% Hit@1 |
| Max accuracy | Everything turned on, including graph + rewrites | ~14,800ms | 87.5% Hit@1 |
What these results tell us
1. Semantic-heavy fusion is the sweet spot. It gave the highest accuracy (87.5%) at the lowest latency (~800ms). This means weighting the "meaning-based" search slightly higher than the keyword search produces the best results for most queries.
2. HyDE and query expansion don't help (and they're slow). These features generate hypothetical answers and query variations using an LLM call. They sound smart in theory, but in practice they added 8-10x latency while delivering worse accuracy (81.3% vs 87.5%). The reason: with well-structured content (full wrappers), the basic search is already very good. The LLM rewrites sometimes introduce drift — they change the query in ways that miss the exact content you're looking for.
3. The reranker is NOT the bottleneck we expected. Disabling the reranker (the "interview" step that re-evaluates results) produced the same 87.5% accuracy. This doesn't mean the reranker is useless — it improves ordering within the results — but the biggest quality lever is content preparation, not reranking.
4. Chunk size matters. Both smaller (384 tokens) and larger (768 tokens) chunks degraded accuracy to 68.8%. The default 512-token size is optimal for most content.
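To make "semantic-heavy fusion" concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion. The function, its signature, and the damping constant `k = 60` are illustrative (a common textbook choice), not Memcity's actual implementation, though the default weights mirror the lab's best configuration:

```typescript
// Sketch of weighted Reciprocal Rank Fusion (RRF): each retriever
// returns an ordered list of chunk IDs; each ID earns weight/(k + rank)
// per list, and the fused order sorts by the summed score.
function rrfFuse(
  semanticRanking: string[],
  keywordRanking: string[],
  wSemantic = 1.4, // semantic-heavy, per the lab finding
  wKeyword = 0.8,
  k = 60 // standard RRF damping constant (illustrative)
): string[] {
  const scores = new Map<string, number>();
  const add = (ranking: string[], weight: number) => {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  add(semanticRanking, wSemantic);
  add(keywordRanking, wKeyword);
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF works on ranks rather than raw scores, the two retrievers don't need comparable scoring scales — only their orderings matter.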
The 7 Improvements We Made
Based on the lab data, we implemented 7 targeted improvements (we call them "Design Proposals" or DPs). Here's what each one does and when you'd use it.
DP-1: Document-First Ranking
The problem: When you search for something, multiple chunks from the same document might match. Without document-first ranking, your top-5 results could all be from the same page — crowding out other relevant pages.
The solution: After scoring all chunks, Memcity groups them by document, ranks the documents by their best chunk score, then picks the best chunk(s) from each top document.
Analogy: Imagine searching a library catalog. You'd want to see 5 different books, not 5 chapters from the same book.
```ts
search: {
  documentFirstRanking: true, // Group by document first (default: true for Pro+)
  maxChunksPerDocument: 1,    // Return 1 chunk per document (default: 1)
}
```

When to change `maxChunksPerDocument`: If your documents are long and a single chunk isn't enough context, increase to 2-3. For short documents (like FAQ pages), 1 is fine.
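The grouping logic can be sketched in a few lines. The `documentFirstRank` function and its types are illustrative stand-ins, not the library's actual code:

```typescript
// Sketch of document-first ranking: group scored chunks by document,
// order documents by their best chunk, then emit up to N chunks per doc.
interface ScoredChunk {
  docId: string;
  text: string;
  score: number;
}

function documentFirstRank(
  chunks: ScoredChunk[],
  maxChunksPerDocument = 1
): ScoredChunk[] {
  const byDoc = new Map<string, ScoredChunk[]>();
  for (const c of chunks) {
    const list = byDoc.get(c.docId) ?? [];
    list.push(c);
    byDoc.set(c.docId, list);
  }
  // Best chunk first within each document, then documents by best chunk.
  const docs = [...byDoc.values()].map((list) =>
    list.sort((a, b) => b.score - a.score)
  );
  docs.sort((a, b) => b[0].score - a[0].score);
  return docs.flatMap((list) => list.slice(0, maxChunksPerDocument));
}
```

With `maxChunksPerDocument: 1`, a document that matched five times still occupies only one result slot, leaving room for other relevant pages.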
DP-2: Canonical Slug Aliases
The problem: Different search results might refer to the same page with different names. "Access Control" and "ACL" are the same page. "Usage Quotas" and "Quotas" are the same page. Without normalization, these show up as different results.
The solution: Memcity maintains a mapping of common aliases to canonical page identifiers, so results are always consistent.
This is an internal improvement — you don't need to configure anything.
DP-3: Structured Metadata Boost
The problem: A chunk might match the query semantically, but the chunk about "setting up access control" from the ACL page should rank higher than a passing mention of "access" in the Getting Started page.
The solution: After the main search, Memcity checks if the chunk's title, section name, or headings match words in the query. Matching chunks get a small score boost.
```ts
search: {
  metadataBoost: 0.15, // Boost factor (0 = disabled, default: 0.15)
}
```

How it works: If you search for "access control," chunks from a document titled "Access Control" get a boost because the title words match the query. It's additive (not multiplicative), so it nudges results rather than completely reshuffling them.
When to change: Set to 0 if your documents don't have meaningful titles. Increase to 0.25 if your documents have very descriptive titles and you want title matches to matter more.
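An additive boost of this kind could look like the sketch below; `applyMetadataBoost` and its tokenization are illustrative, not the production logic:

```typescript
// Sketch of an additive metadata boost: if any title token appears in
// the query, nudge the chunk's score up by a fixed amount.
function applyMetadataBoost(
  score: number,
  query: string,
  title: string,
  boost = 0.15
): number {
  const queryTokens = new Set(
    query.toLowerCase().split(/\W+/).filter(Boolean)
  );
  const titleTokens = title.toLowerCase().split(/\W+/).filter(Boolean);
  const titleMatches = titleTokens.some((t) => queryTokens.has(t));
  // Additive, not multiplicative: a nudge, not a reshuffle.
  return titleMatches ? score + boost : score;
}
```

Because the boost is a constant added on top of the fused score, a weak match with a matching title still cannot leapfrog a strong match by much more than `boost`.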
DP-4: Heuristic Intent Routing
The problem: Not all queries have the same intent. "How do I install Memcity?" is clearly asking about setup. "memory.getContext parameters" is an API reference lookup. "What does the pro tier include?" is about pricing. Each intent should prioritize different pages.
The solution: A zero-latency regex classifier detects the intent of each query and applies a targeted boost to matching results.
| Detected intent | Example queries | Boost targets |
|---|---|---|
| API symbol | memory.getContext, rlmQuery parameters, new Memory class | API reference, SDK docs |
| Install/setup | how do I install, npx convex setup, getting started | Getting started, registry |
| Tier/pricing | pro tier features, pricing, what does team include | Tiers, pricing pages |
| Config | MemoryConfig options, cacheTtlMs, what does maxCandidates do | Configuration docs |
| General | Everything else | No boost applied |
```ts
search: {
  heuristicIntentBoost: true, // Enable intent detection (default: true for Pro+)
}
```

Why "heuristic"? It uses pattern matching (regex), not an AI call. This means it's instant (0ms added latency) and free (no API costs). The tradeoff is that it only catches common patterns — unusual phrasings fall through to "general" intent, which is fine.
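A regex intent router of this kind can be sketched as a small cascade. The patterns below are simplified illustrations of the idea, not Memcity's production rules:

```typescript
// Sketch of a zero-latency regex intent classifier. First matching
// bucket wins; anything unmatched falls through to "general".
type Intent =
  | "api_symbol"
  | "install_setup"
  | "tier_pricing"
  | "config"
  | "general";

function classifyIntent(query: string): Intent {
  const q = query.toLowerCase();
  // Dotted identifiers, constructor calls, or parameter lookups read as API queries.
  if (/\b\w+\.\w+\b|\bnew\s+\w+\b|\bparameters\b/.test(q)) return "api_symbol";
  if (/\binstall\b|\bsetup\b|\bgetting started\b/.test(q)) return "install_setup";
  if (/\bpricing\b|\btier\b|\bplan\b/.test(q)) return "tier_pricing";
  if (/config|\boptions\b/.test(q)) return "config";
  return "general"; // unusual phrasing: no boost applied
}
```

Since the classifier is just a handful of regex tests, it runs in microseconds and costs nothing per query, which is exactly the tradeoff the table describes.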
DP-5: Adaptive Two-Stage Retrieval
The problem: Sometimes the fast search path returns low-confidence results. Maybe the query uses unusual terminology, or the answer is buried in an unexpected page. You want a fallback that tries harder.
The solution: When the first pass returns results below a confidence threshold, Memcity automatically triggers a second pass with query expansion — generating semantic variations of the query to cast a wider net.
```ts
search: {
  adaptiveRetrieval: false, // Off by default (opt-in)
  adaptiveMinScore: 0.25,   // Trigger threshold (default: 0.25)
}
```

The key insight: We found that query expansion hurts accuracy when results are already good (see the lab data above). But it helps when the first pass fails. Adaptive retrieval gives you the best of both worlds — fast path for easy queries, expensive fallback only when needed.
When to enable: If your content is diverse and users ask unpredictable questions. Don't enable for focused knowledge bases where queries are predictable.
Important detail: The second pass preserves enrichments from the first pass (citations, expanded context, temporal boosts). It doesn't throw away work — it only adds genuinely new results.
DP-6: Score Calibration
The problem: Raw search scores are meaningless to users. A reranker score of 0.72 doesn't mean "72% confident." Score ranges vary wildly between different pipeline configurations — a score of 0.02 might be excellent for RRF-only (no reranker), but terrible for reranked results.
The solution: Memcity maps raw scores to human-meaningful confidence levels based on benchmark-calibrated thresholds.
| Confidence level | Reranked score | RRF-only score | What it means |
|---|---|---|---|
| High | ≥ 0.50 | ≥ 75% of top score | Strong match — the result is almost certainly relevant |
| Medium | ≥ 0.15 | ≥ 40% of top score | Relevant — the result likely answers the query |
| Low | under 0.15 | under 40% of top score | Possibly relevant — consider with caution |
| None | ≤ 0 | ≤ 0 | No match |
These thresholds come from our benchmark data: 0.50 is the 75th percentile of correct rank-1 scores, and 0.15 is the 25th percentile. They're not arbitrary — they're calibrated against real search results.
This is an internal improvement applied to the confidence score in search results. You can use the `confidence` field in `getContext` results to make UI decisions (e.g., show a "low confidence" warning when confidence is below 0.3).
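The reranked-score column of the table above translates directly into a small mapping; `calibrateRerankedScore` is an illustrative name for this sketch, not a Memcity export:

```typescript
// Sketch of the reranked-score → confidence mapping from the table,
// using the benchmark-calibrated thresholds (0.50 and 0.15).
type Confidence = "high" | "medium" | "low" | "none";

function calibrateRerankedScore(score: number): Confidence {
  if (score <= 0) return "none"; // no match at all
  if (score >= 0.5) return "high"; // 75th percentile of correct rank-1 scores
  if (score >= 0.15) return "medium"; // 25th percentile of correct rank-1 scores
  return "low";
}
```

The RRF-only column would use relative thresholds instead (a fraction of the top score), since raw RRF scores live on a much smaller scale.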
DP-7: CI Evaluation Gate
The problem: Search quality can degrade silently when code changes. A refactor might accidentally break the ranking logic, and you won't know until users complain.
The solution: An automated quality gate that runs on every pull request touching the search pipeline. It executes real queries against the production deployment and verifies:
- Hit@1 stays above 85%
- MRR@5 stays above 0.85
- P95 latency stays below 2 seconds
- Boilerplate contamination stays below 60%
If any threshold is breached, the CI check fails and the PR is blocked.
This runs automatically — no configuration needed. It's defined in `.github/workflows/search-quality-gate.yml`.
Configuration Recipes
"Documentation Search" — This Site's Config
This is the exact configuration powering memcity.dev's search. Optimized for the best accuracy/latency ratio based on our 40-scenario lab.
```ts
const memory = new Memory(components.memory, {
  tier: "pro",
  ai: {
    gateway: "openrouter",
    model: "google/gemini-2.5-flash",
    embeddingModel: "google/gemini-embedding-001",
  },
  search: {
    maxCandidates: 35,
    rerankTopN: 10,
    rrfWeightSemantic: 1.4, // Semantic-heavy (lab finding)
    rrfWeightKeyword: 0.8,
    reranking: true,
    documentFirstRanking: true, // DP-1
    maxChunksPerDocument: 1,    // DP-1
    metadataBoost: 0.15,        // DP-3
    heuristicIntentBoost: true, // DP-4
    queryRouting: false,       // Disabled — adds latency, no accuracy gain
    queryExpansion: false,     // Disabled — adds latency, hurts accuracy
    hyde: false,               // Disabled — adds latency, hurts accuracy
    queryDecomposition: false, // Not needed for single-question queries
    chunkExpansion: true,
    maxChunkExpansions: 3,
    semanticDedup: true,
    temporalAwareness: false, // Docs don't change often enough to matter
    citations: true,
    recursiveSummaries: false, // Not enough content to benefit from RAPTOR
  },
  chunking: {
    strategy: "recursive",
    chunkSize: 512, // Lab-validated optimal size
    chunkOverlap: 50,
  },
});
```

Results: 87.5% Hit@1, 100% Hit@3, ~800ms average latency.
"Customer Support Bot" — High Recall, Forgiving
Support queries are often vague ("it's not working"), misspelled, or multi-part. Enable more recall features and be generous with results.
```ts
const memory = new Memory(components.memory, {
  tier: "pro",
  search: {
    maxCandidates: 50,
    rerankTopN: 12,
    rrfWeightSemantic: 1.2,
    rrfWeightKeyword: 1.0,
    reranking: true,
    documentFirstRanking: true,
    maxChunksPerDocument: 2, // Support answers often span sections
    metadataBoost: 0.15,
    heuristicIntentBoost: true,
    queryRouting: true,      // Lets simple queries be fast
    queryExpansion: true,    // Helps with vague/misspelled queries
    maxQueryExpansions: 3,
    hyde: false,             // Still not worth the latency tradeoff
    adaptiveRetrieval: true, // Fallback for hard queries
    adaptiveMinScore: 0.2,
    chunkExpansion: true,
    maxChunkExpansions: 3,
    citations: true, // Users need to verify answers
  },
});
```

"Internal Knowledge Base" — Speed Over Precision
For internal tools where users can refine their search, speed matters more than perfect first-result accuracy.
```ts
const memory = new Memory(components.memory, {
  tier: "pro",
  search: {
    maxCandidates: 25,
    rerankTopN: 8,
    rrfWeightSemantic: 1.0,
    rrfWeightKeyword: 1.0, // Balanced — internal docs are more keyword-heavy
    reranking: true,
    documentFirstRanking: true,
    metadataBoost: 0.15,
    heuristicIntentBoost: false, // Internal queries are less predictable
    queryRouting: false,
    queryExpansion: false,
    hyde: false,
    chunkExpansion: true,
    maxChunkExpansions: 2,
    citations: false, // Less important for internal use
  },
});
```

"Compliance / Legal" — Maximum Recall, Citations Required
For regulated industries where missing a relevant document is unacceptable, and every answer needs a paper trail.
```ts
const memory = new Memory(components.memory, {
  tier: "team",
  search: {
    maxCandidates: 70,
    rerankTopN: 20,
    rrfWeightSemantic: 1.0,
    rrfWeightKeyword: 1.2, // Keyword-heavy — exact legal terms matter
    reranking: true,
    documentFirstRanking: true,
    maxChunksPerDocument: 3, // Legal docs need more context per source
    metadataBoost: 0.2,
    heuristicIntentBoost: true,
    queryRouting: true,
    queryExpansion: true,
    maxQueryExpansions: 5,
    adaptiveRetrieval: true,
    adaptiveMinScore: 0.15, // Lower threshold — try harder before giving up
    chunkExpansion: true,
    maxChunkExpansions: 5,
    citations: true, // Non-negotiable for compliance
  },
  enterprise: {
    acl: true,      // Per-document access control
    auditLog: true, // Immutable operation logging
    quotas: true,   // Rate limiting per organization
  },
});
```

Understanding the Numbers
What is Hit@1?
Hit@1 means "the correct answer was the #1 result." If you search for "vacation policy" and the top result is from the vacation policy page, that's a hit. If it's from the PTO overview page (close, but not the specific page), that's a miss.
- 87.5% Hit@1 means that out of every 8 test queries, 7 returned the correct page as the top result.
- 100% Hit@3 means the correct answer was always in the top 3 results.
What is MRR@5?
MRR (Mean Reciprocal Rank) accounts for where the correct answer appears:
- If the correct answer is result #1: score = 1/1 = 1.0
- If it's result #2: score = 1/2 = 0.5
- If it's result #3: score = 1/3 = 0.33
- If it's not in the top 5: score = 0
MRR@5 averages these scores across all queries. Our pipeline achieves 0.90 MRR@5, meaning the correct answer is almost always in the top 1-2 results.
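Both metrics are a few lines of code to compute over a labeled query set. The `evaluate` function and its `EvalCase` shape are illustrative, not part of the Memcity API:

```typescript
// Sketch of Hit@1 and MRR@5 over an evaluation set: each case pairs an
// expected document ID with the ranked list a search returned.
interface EvalCase {
  expected: string;
  ranked: string[];
}

function evaluate(cases: EvalCase[]): { hitAt1: number; mrrAt5: number } {
  let hits = 0;
  let mrrSum = 0;
  for (const { expected, ranked } of cases) {
    const rank = ranked.slice(0, 5).indexOf(expected); // only top 5 count
    if (rank === 0) hits += 1; // correct answer was the #1 result
    if (rank >= 0) mrrSum += 1 / (rank + 1); // reciprocal rank, 0 if absent
  }
  return { hitAt1: hits / cases.length, mrrAt5: mrrSum / cases.length };
}
```

Running something like this against your own labeled queries is the most direct way to compare configurations on your content rather than ours.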
What is "Boilerplate@1"?
This measures how often the top result contains metadata text like `Section: Features` or `Slug: search-pipeline` that was added during ingestion. A 50% boilerplate rate sounds bad, but it's actually expected — the full wrapper deliberately includes this metadata because it dramatically improves accuracy (see the lab results above). The metadata is in the stored chunk text, but you can strip it in your UI if you don't want users to see it.
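If you do want to hide the wrapper from users, stripping it at display time is straightforward. This sketch assumes the wrapper format shown earlier (`# Title`, `Section:`, and `Slug:` header lines); the function name is illustrative:

```typescript
// Sketch: strip wrapper metadata lines from a chunk before display.
// Assumes the "# Title / Section: / Slug:" header format shown earlier.
function stripWrapper(chunkText: string): string {
  return chunkText
    .split("\n")
    .filter((line) => !/^(# |Section: |Slug: )/.test(line))
    .join("\n")
    .trim();
}
```

The metadata still does its job during retrieval; only the rendered text changes.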
What about latency?
| Tier | Typical latency | Steps active |
|---|---|---|
| Community | 100-200ms | Cache, embed, search, fuse, dedup, format |
| Pro (fast path) | 400-900ms | Above + reranking, chunk expansion, citations, DPs |
| Pro (with expansion) | 1-2s | Above + query expansion |
| Team | 450-950ms | Pro + ACL filtering, audit logging |
The main latency costs are:
- Embedding generation (~50ms, or 0ms if cached)
- Reranking (~100ms) — the biggest single cost, but also the biggest quality win
- Query expansion (~80ms per expansion) — often not worth it
- HyDE (~150ms) — usually not worth it for well-structured content
Frequently Asked Questions
"Should I enable HyDE?"
Probably not. Our lab found that HyDE adds 8-10x latency while producing equal or worse accuracy on well-structured content. HyDE was designed for scenarios where queries and documents are in completely different "languages" — e.g., searching medical papers with patient-written questions. If your content is well-organized with clear titles and headings, the basic search is already very good.
Exception: If your users frequently ask very short, ambiguous queries ("refund?") against long, detailed documents, HyDE might help. Test it with your own data before enabling in production.
"Should I enable query expansion?"
For most use cases, no. Our lab showed it doesn't improve accuracy and adds latency. However, it can help in two specific scenarios:
- Diverse vocabulary — Your users use very different words than your documents. E.g., users say "cancel" but docs say "terminate."
- Adaptive retrieval — Enable `adaptiveRetrieval` instead of `queryExpansion`. This uses expansion as a fallback only when the first pass fails, avoiding the latency penalty on easy queries.
"What chunk size should I use?"
512 tokens for most content. Our lab tested 384, 512, and 768 tokens:
- 384 (smaller chunks): More chunks per document, but each chunk has less context. Accuracy dropped to 68.8%.
- 512 (default): Best accuracy at 87.5%.
- 768 (larger chunks): Fewer chunks, more context per chunk, but less precise matching. Accuracy dropped to 68.8%.
Exception: FAQ-style content with very short answers benefits from smaller chunks (256-384). Very long narrative content (novels, transcripts) might benefit from larger chunks (768-1024).
"Do I need the reranker?"
The reranker (Jina Reranker v3) adds ~100ms of latency. Our lab showed it doesn't significantly change Hit@1 on well-structured content (87.5% with and without). However, it improves the ordering of results — the difference between having the "pretty good" answer at #1 and having the "best" answer at #1. For user-facing applications, the ~100ms is usually worth it.
When to skip: Autocomplete or typeahead search where latency matters more than precision. Internal tools where users will click through multiple results anyway.
"Why is my Hit@1 lower than expected?"
Common causes, in order of likelihood:
- Content preparation — Are your documents ingested with meaningful titles? Missing or generic titles ("Untitled Document") dramatically reduce accuracy.
- Chunk size — Try the default 512 if you've changed it.
- Fusion weights — For documentation/factual content, try `rrfWeightSemantic: 1.4, rrfWeightKeyword: 0.8`. For code/technical content, try equal weights.
- Missing content — The answer might not be in your knowledge base at all. Check the `confidence` field — low confidence suggests the content doesn't exist.