memcity

Features

Search Architecture

How we tested 40 configurations to find the optimal search pipeline — and how to tune it for your use case.

Why This Page Exists

Most RAG libraries give you one configuration and say "good luck." We didn't think that was good enough.

We ran a 40-scenario benchmark lab against our own documentation — testing 5 different ways to prepare content and 8 different retrieval strategies. Then we used those results to implement 7 targeted improvements to the pipeline.

This page explains what we found and, more importantly, how to use those findings to configure Memcity for your specific use case. You don't need to understand AI to follow along — we'll break down every concept along the way.

The Key Insight: How You Store Content Matters More Than How You Search It

This was the most surprising finding from our research:

The way you prepare and store your content is the #1 factor in search quality — not the search algorithm, not the AI model, not the number of pipeline steps.

Think of it like a library. You could build the world's best search engine, but if the books are randomly shelved with no labels, the search engine can't help you. The same content, organized with titles, section labels, and consistent structure, becomes dramatically easier to find.

Here's the proof from our lab:

| How content was stored | Hit@1 accuracy | Description |
| --- | --- | --- |
| Full wrapper (title + section + slug + content) | 87.5% | Each chunk includes the page title, which section it belongs to, and a URL-safe identifier |
| Minimal wrapper (title + content) | 75.0% | Only the page title is included |
| No wrapper (raw content only) | 75.0% | Content is stored as-is with no metadata |
| Smaller chunks (384 tokens) | 68.8% | Content split into smaller pieces |
| Larger chunks (768 tokens) | 68.8% | Content split into larger pieces |

The full wrapper won decisively — a 12.5 percentage point improvement over the next-best approach, and 18.7 points over the chunk-size variants. And it didn't cost any extra money or add any latency. It's just better data preparation.

What is a "wrapper"?

When Memcity ingests a document, it can wrap each chunk of text with metadata:

markdown
# Vacation Policy
 
Section: HR Policies
Slug: vacation-policy
 
Employees receive 20 days of paid vacation per year...

The # Title, Section:, and Slug: lines aren't part of the original document — they're metadata that Memcity adds during ingestion. This metadata serves as lexical anchors — when someone searches for "vacation policy," the BM25 keyword search matches both the content ("vacation," "paid") and the title ("Vacation Policy"), making it much more likely to rank the right chunk first.

You get this for free. Memcity adds full wrappers by default. You don't need to change anything.
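To make the wrapper concrete, here is a minimal sketch of building one. The wrapChunk function and the Chunk shape are illustrative, not Memcity's actual API:

```typescript
// Illustrative only: a sketch of assembling a "full wrapper".
// wrapChunk and Chunk are hypothetical names, not the Memcity API.
interface Chunk {
  title: string;
  section: string;
  slug: string;
  content: string;
}

function wrapChunk(chunk: Chunk): string {
  // The metadata lines become lexical anchors for BM25 keyword matching.
  return [
    `# ${chunk.title}`,
    ``,
    `Section: ${chunk.section}`,
    `Slug: ${chunk.slug}`,
    ``,
    chunk.content,
  ].join("\n");
}
```

Wrapping the vacation-policy example from above would reproduce the chunk shown earlier: title heading first, then the Section and Slug lines, then the original content.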

The Retrieval Configurations We Tested

We tested 8 retrieval strategies against each ingestion format. Here's what they mean in plain English:

| Strategy | What it does | Latency | Accuracy |
| --- | --- | --- | --- |
| Semantic-heavy fusion | Weights "meaning" search higher than keyword search | ~800ms | 87.5% Hit@1 (best) |
| Balanced (no LLM) | Equal weight for meaning and keyword search | ~1,100ms | 87.5% Hit@1 |
| No reranker | Skips the "interview" step that re-scores results | ~1,050ms | 87.5% Hit@1 |
| Graph-enabled | Follows entity connections in the knowledge graph | ~900ms | 87.5% Hit@1 |
| High recall | Fetches more candidates, re-scores more results | ~1,500ms | 87.5% Hit@1 |
| Keyword-heavy | Weights exact keyword matching higher | ~1,100ms | 81.3% Hit@1 |
| HyDE + expansion | Generates hypothetical answers + query variations | ~8,500ms | 81.3% Hit@1 |
| Max accuracy | Everything turned on, including graph + rewrites | ~14,800ms | 87.5% Hit@1 |

What these results tell us

1. Semantic-heavy fusion is the sweet spot. It gave the highest accuracy (87.5%) at the lowest latency (~800ms). This means weighting the "meaning-based" search slightly higher than the keyword search produces the best results for most queries.

2. HyDE and query expansion don't help (and they're slow). These features generate hypothetical answers and query variations using an LLM call. They sound smart in theory, but in practice they added 8-10x latency while delivering worse accuracy (81.3% vs 87.5%). The reason: with well-structured content (full wrappers), the basic search is already very good. The LLM rewrites sometimes introduce drift — they change the query in ways that miss the exact content you're looking for.

3. The reranker is NOT the bottleneck we expected. Disabling the reranker (the "interview" step that re-evaluates results) produced the same 87.5% accuracy. This doesn't mean the reranker is useless — it improves ordering within the results — but the biggest quality lever is content preparation, not reranking.

4. Chunk size matters. Both smaller (384 tokens) and larger (768 tokens) chunks degraded accuracy to 68.8%. The default 512-token size is optimal for most content.
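The fusion step behind point 1 is reciprocal rank fusion (RRF) with per-signal weights. This sketch shows the idea only; the k constant of 60 and the function name are common-practice assumptions, not confirmed Memcity internals:

```typescript
// Sketch of weighted Reciprocal Rank Fusion. Each chunk's score is the
// weighted sum of 1/(k + rank) over every ranking it appears in, so a
// higher weight amplifies that signal's contribution.
function rrfFuse(
  semanticRanking: string[], // chunk ids, best first
  keywordRanking: string[],
  wSemantic = 1.4,           // "semantic-heavy" weights from the lab config
  wKeyword = 0.8,
  k = 60,                    // conventional RRF smoothing constant
): Map<string, number> {
  const scores = new Map<string, number>();
  const add = (ranking: string[], weight: number) => {
    ranking.forEach((id, i) => {
      const contribution = weight / (k + i + 1); // rank is 1-based
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  };
  add(semanticRanking, wSemantic);
  add(keywordRanking, wKeyword);
  return scores;
}
```

With these weights, a chunk that tops the semantic ranking edges out one that tops the keyword ranking, which is exactly the "semantic-heavy" behavior the lab favored.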

The 7 Improvements We Made

Based on the lab data, we implemented 7 targeted improvements (we call them "Design Proposals" or DPs). Here's what each one does and when you'd use it.

DP-1: Document-First Ranking

The problem: When you search for something, multiple chunks from the same document might match. Without document-first ranking, your top-5 results could all be from the same page — crowding out other relevant pages.

The solution: After scoring all chunks, Memcity groups them by document, ranks the documents by their best chunk score, then picks the best chunk(s) from each top document.

Analogy: Imagine searching a library catalog. You'd want to see 5 different books, not 5 chapters from the same book.

ts
search: {
  documentFirstRanking: true,   // Group by document first (default: true for Pro+)
  maxChunksPerDocument: 1,      // Return 1 chunk per document (default: 1)
}

When to change maxChunksPerDocument: If your documents are long and a single chunk isn't enough context, increase to 2-3. For short documents (like FAQ pages), 1 is fine.
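The grouping logic can be sketched as follows; the names and types are illustrative, not the actual implementation:

```typescript
// Sketch of DP-1 style document-first ranking: group chunks by document,
// rank documents by their best chunk, then take the top chunk(s) per doc.
interface ScoredChunk {
  docId: string;
  chunkId: string;
  score: number;
}

function documentFirstRank(
  chunks: ScoredChunk[],
  maxChunksPerDocument = 1,
): ScoredChunk[] {
  const byDoc = new Map<string, ScoredChunk[]>();
  for (const c of chunks) {
    const list = byDoc.get(c.docId) ?? [];
    list.push(c);
    byDoc.set(c.docId, list);
  }
  return [...byDoc.values()]
    .map((list) => list.sort((x, y) => y.score - x.score))   // best chunk first
    .sort((x, y) => y[0].score - x[0].score)                 // best document first
    .flatMap((list) => list.slice(0, maxChunksPerDocument)); // cap per document
}
```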

DP-2: Canonical Slug Aliases

The problem: Different search results might refer to the same page with different names. "Access Control" and "ACL" are the same page. "Usage Quotas" and "Quotas" are the same page. Without normalization, these show up as different results.

The solution: Memcity maintains a mapping of common aliases to canonical page identifiers, so results are always consistent.

This is an internal improvement — you don't need to configure anything.

DP-3: Structured Metadata Boost

The problem: A chunk might match the query semantically, but the chunk about "setting up access control" from the ACL page should rank higher than a passing mention of "access" in the Getting Started page.

The solution: After the main search, Memcity checks if the chunk's title, section name, or headings match words in the query. Matching chunks get a small score boost.

ts
search: {
  metadataBoost: 0.15,  // Boost factor (0 = disabled, default: 0.15)
}

How it works: If you search for "access control," chunks from a document titled "Access Control" get a boost because the title words match the query. It's additive (not multiplicative), so it nudges results rather than completely reshuffling them.

When to change: Set to 0 if your documents don't have meaningful titles. Increase to 0.25 if your documents have very descriptive titles and you want title matches to matter more.
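To see why an additive boost nudges rather than reshuffles, here is a sketch. The overlap heuristic is an assumption for illustration; Memcity's exact scoring may differ:

```typescript
// Sketch of an additive title-match boost (DP-3). The boost is scaled by
// how much of the title the query covers, then added to the base score.
function metadataBoost(
  baseScore: number,
  query: string,
  title: string,
  boost = 0.15, // the default boost factor from the config above
): number {
  const queryWords = new Set(query.toLowerCase().split(/\s+/));
  const titleWords = title.toLowerCase().split(/\s+/);
  const overlap = titleWords.filter((w) => queryWords.has(w)).length;
  return titleWords.length === 0
    ? baseScore
    : baseScore + boost * (overlap / titleWords.length);
}
```

Because the adjustment is bounded by the boost factor, a mediocre chunk with a matching title cannot leapfrog a strongly relevant chunk; it only breaks near-ties.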

DP-4: Heuristic Intent Routing

The problem: Not all queries have the same intent. "How do I install Memcity?" is clearly asking about setup. "memory.getContext parameters" is an API reference lookup. "What does the pro tier include?" is about pricing. Each intent should prioritize different pages.

The solution: A zero-latency regex classifier detects the intent of each query and applies a targeted boost to matching results.

| Detected intent | Example queries | Boost targets |
| --- | --- | --- |
| API symbol | memory.getContext, rlmQuery parameters, new Memory class | API reference, SDK docs |
| Install/setup | how do I install, npx convex setup, getting started | Getting started, registry |
| Tier/pricing | pro tier features, pricing, what does team include | Tiers, pricing pages |
| Config | MemoryConfig options, cacheTtlMs, what does maxCandidates do | Configuration docs |
| General | Everything else | No boost applied |

ts
search: {
  heuristicIntentBoost: true,  // Enable intent detection (default: true for Pro+)
}

Why "heuristic"? It uses pattern matching (regex), not an AI call. This means it's instant (0ms added latency) and free (no API costs). The tradeoff is that it only catches common patterns — unusual phrasings fall through to "general" intent, which is fine.
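A regex intent classifier of this kind can be sketched in a few lines. The patterns below are simplified illustrations, not the real rule set:

```typescript
// Sketch of a zero-latency intent classifier: first matching pattern wins,
// and anything unmatched falls through to "general" (no boost applied).
type Intent = "api" | "install" | "pricing" | "config" | "general";

const INTENT_PATTERNS: [Intent, RegExp][] = [
  ["api", /\b\w+\.\w+\(|\bnew\s+[A-Z]\w+\b|\bparameters\b/], // dotted calls, constructors
  ["install", /\binstall\b|\bsetup\b|\bgetting started\b/i],
  ["pricing", /\btier\b|\bpricing\b/i],
  ["config", /Config\b|\b[a-z]+[A-Z]\w*\b/],                 // camelCase option names
];

function classifyIntent(query: string): Intent {
  for (const [intent, pattern] of INTENT_PATTERNS) {
    if (pattern.test(query)) return intent;
  }
  return "general";
}
```

Because this is plain pattern matching, classification costs effectively nothing per query, which is the whole point of the heuristic approach.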

DP-5: Adaptive Two-Stage Retrieval

The problem: Sometimes the fast search path returns low-confidence results. Maybe the query uses unusual terminology, or the answer is buried in an unexpected page. You want a fallback that tries harder.

The solution: When the first pass returns results below a confidence threshold, Memcity automatically triggers a second pass with query expansion — generating semantic variations of the query to cast a wider net.

ts
search: {
  adaptiveRetrieval: false,     // Off by default (opt-in)
  adaptiveMinScore: 0.25,       // Trigger threshold (default: 0.25)
}

The key insight: We found that query expansion hurts accuracy when results are already good (see the lab data above). But it helps when the first pass fails. Adaptive retrieval gives you the best of both worlds — fast path for easy queries, expensive fallback only when needed.

When to enable: If your content is diverse and users ask unpredictable questions. Don't enable for focused knowledge bases where queries are predictable.

Important detail: The second pass preserves enrichments from the first pass (citations, expanded context, temporal boosts). It doesn't throw away work — it only adds genuinely new results.
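The two-stage flow can be sketched as below. firstPassSearch and expandedSearch are hypothetical stand-ins for the real pipeline stages:

```typescript
// Sketch of adaptive two-stage retrieval (DP-5): fast path when the first
// pass is confident, expansion fallback (merging only new results) when not.
interface Result {
  id: string;
  score: number;
}

async function adaptiveSearch(
  query: string,
  firstPassSearch: (q: string) => Promise<Result[]>,
  expandedSearch: (q: string) => Promise<Result[]>,
  adaptiveMinScore = 0.25, // the default trigger threshold from above
): Promise<Result[]> {
  const first = await firstPassSearch(query);
  const topScore = first[0]?.score ?? 0;
  // Fast path: confident results, skip the expensive expansion entirely.
  if (topScore >= adaptiveMinScore) return first;
  // Fallback: expand the query, then append genuinely new results only,
  // preserving the first pass's work rather than discarding it.
  const second = await expandedSearch(query);
  const seen = new Set(first.map((r) => r.id));
  return [...first, ...second.filter((r) => !seen.has(r.id))];
}
```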

DP-6: Score Calibration

The problem: Raw search scores are meaningless to users. A reranker score of 0.72 doesn't mean "72% confident." Score ranges vary wildly between different pipeline configurations — a score of 0.02 might be excellent for RRF-only (no reranker), but terrible for reranked results.

The solution: Memcity maps raw scores to human-meaningful confidence levels based on benchmark-calibrated thresholds.

| Confidence level | Reranked score | RRF-only score | What it means |
| --- | --- | --- | --- |
| High | ≥ 0.50 | ≥ 75% of top score | Strong match — the result is almost certainly relevant |
| Medium | ≥ 0.15 | ≥ 40% of top score | Relevant — the result likely answers the query |
| Low | under 0.15 | under 40% of top score | Possibly relevant — consider with caution |
| None | ≤ 0 | ≤ 0 | No match |

These thresholds come from our benchmark data: 0.50 is the 75th percentile of correct rank-1 scores, and 0.15 is the 25th percentile. They're not arbitrary — they're calibrated against real search results.

This is an internal improvement applied to the confidence score in search results. You can use the confidence field in getContext results to make UI decisions (e.g., show a "low confidence" warning when confidence is below 0.3).
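Applying the reranked-score thresholds from the table above looks roughly like this (the function name is illustrative):

```typescript
// Sketch of DP-6 confidence mapping for reranked scores, using the
// benchmark-calibrated thresholds from the table above.
type Confidence = "high" | "medium" | "low" | "none";

function calibrateRerankedScore(score: number): Confidence {
  if (score <= 0) return "none";
  if (score >= 0.5) return "high";    // 75th percentile of correct rank-1 scores
  if (score >= 0.15) return "medium"; // 25th percentile
  return "low";
}
```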

DP-7: CI Evaluation Gate

The problem: Search quality can degrade silently when code changes. A refactor might accidentally break the ranking logic, and you won't know until users complain.

The solution: An automated quality gate that runs on every pull request touching the search pipeline. It executes real queries against the production deployment and verifies:

  • Hit@1 stays above 85%
  • MRR@5 stays above 0.85
  • P95 latency stays below 2 seconds
  • Boilerplate contamination stays below 60%

If any threshold is breached, the CI check fails and the PR is blocked.

This runs automatically — no configuration needed. It's defined in .github/workflows/search-quality-gate.yml.
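The gate's pass/fail logic amounts to a handful of threshold checks, sketched here with illustrative names (the real gate is the CI workflow itself):

```typescript
// Sketch of the threshold checks the quality gate performs. The metric
// names mirror the bullet list above; field names are assumptions.
interface BenchmarkRun {
  hitAt1: number;          // fraction, e.g. 0.875
  mrrAt5: number;
  p95LatencyMs: number;
  boilerplateRate: number; // fraction of top results containing wrapper text
}

function gatePasses(run: BenchmarkRun): boolean {
  return (
    run.hitAt1 > 0.85 &&
    run.mrrAt5 > 0.85 &&
    run.p95LatencyMs < 2000 &&
    run.boilerplateRate < 0.6
  );
}
```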

Configuration Recipes

"Documentation Search" — This Site's Config

This is the exact configuration powering memcity.dev's search. Optimized for the best accuracy/latency ratio based on our 40-scenario lab.

ts
const memory = new Memory(components.memory, {
  tier: "pro",
  ai: {
    gateway: "openrouter",
    model: "google/gemini-2.5-flash",
    embeddingModel: "google/gemini-embedding-001",
  },
  search: {
    maxCandidates: 35,
    rerankTopN: 10,
    rrfWeightSemantic: 1.4,       // Semantic-heavy (lab finding)
    rrfWeightKeyword: 0.8,
    reranking: true,
    documentFirstRanking: true,    // DP-1
    maxChunksPerDocument: 1,       // DP-1
    metadataBoost: 0.15,           // DP-3
    heuristicIntentBoost: true,    // DP-4
    queryRouting: false,           // Disabled — adds latency, no accuracy gain
    queryExpansion: false,         // Disabled — adds latency, hurts accuracy
    hyde: false,                   // Disabled — adds latency, hurts accuracy
    queryDecomposition: false,     // Not needed for single-question queries
    chunkExpansion: true,
    maxChunkExpansions: 3,
    semanticDedup: true,
    temporalAwareness: false,      // Docs don't change often enough to matter
    citations: true,
    recursiveSummaries: false,     // Not enough content to benefit from RAPTOR
  },
  chunking: {
    strategy: "recursive",
    chunkSize: 512,                // Lab-validated optimal size
    chunkOverlap: 50,
  },
});

Results: 87.5% Hit@1, 100% Hit@3, ~800ms average latency.

"Customer Support Bot" — High Recall, Forgiving

Support queries are often vague ("it's not working"), misspelled, or multi-part. Enable more recall features and be generous with results.

ts
const memory = new Memory(components.memory, {
  tier: "pro",
  search: {
    maxCandidates: 50,
    rerankTopN: 12,
    rrfWeightSemantic: 1.2,
    rrfWeightKeyword: 1.0,
    reranking: true,
    documentFirstRanking: true,
    maxChunksPerDocument: 2,       // Support answers often span sections
    metadataBoost: 0.15,
    heuristicIntentBoost: true,
    queryRouting: true,            // Lets simple queries be fast
    queryExpansion: true,          // Helps with vague/misspelled queries
    maxQueryExpansions: 3,
    hyde: false,                   // Still not worth the latency tradeoff
    adaptiveRetrieval: true,       // Fallback for hard queries
    adaptiveMinScore: 0.2,
    chunkExpansion: true,
    maxChunkExpansions: 3,
    citations: true,               // Users need to verify answers
  },
});

"Internal Knowledge Base" — Speed Over Precision

For internal tools where users can refine their search, speed matters more than perfect first-result accuracy.

ts
const memory = new Memory(components.memory, {
  tier: "pro",
  search: {
    maxCandidates: 25,
    rerankTopN: 8,
    rrfWeightSemantic: 1.0,
    rrfWeightKeyword: 1.0,         // Balanced — internal docs are more keyword-heavy
    reranking: true,
    documentFirstRanking: true,
    metadataBoost: 0.15,
    heuristicIntentBoost: false,   // Internal queries are less predictable
    queryRouting: false,
    queryExpansion: false,
    hyde: false,
    chunkExpansion: true,
    maxChunkExpansions: 2,
    citations: false,              // Less important for internal use
  },
});

"Compliance Knowledge Base" — Maximum Recall, Full Audit Trail

For regulated industries where missing a relevant document is unacceptable and every answer needs a paper trail.

ts
const memory = new Memory(components.memory, {
  tier: "team",
  search: {
    maxCandidates: 70,
    rerankTopN: 20,
    rrfWeightSemantic: 1.0,
    rrfWeightKeyword: 1.2,         // Keyword-heavy — exact legal terms matter
    reranking: true,
    documentFirstRanking: true,
    maxChunksPerDocument: 3,       // Legal docs need more context per source
    metadataBoost: 0.2,
    heuristicIntentBoost: true,
    queryRouting: true,
    queryExpansion: true,
    maxQueryExpansions: 5,
    adaptiveRetrieval: true,
    adaptiveMinScore: 0.15,        // Lower threshold — try harder before giving up
    chunkExpansion: true,
    maxChunkExpansions: 5,
    citations: true,               // Non-negotiable for compliance
  },
  enterprise: {
    acl: true,                     // Per-document access control
    auditLog: true,                // Immutable operation logging
    quotas: true,                  // Rate limiting per organization
  },
});

Understanding the Numbers

What is Hit@1?

Hit@1 means "the correct answer was the #1 result." If you search for "vacation policy" and the top result is from the vacation policy page, that's a hit. If it's from the PTO overview page (close, but not the specific page), that's a miss.

  • 87.5% Hit@1 means that out of every 8 test queries, 7 returned the correct page as the top result.
  • 100% Hit@3 means the correct answer was always in the top 3 results.

What is MRR@5?

MRR (Mean Reciprocal Rank) accounts for where the correct answer appears:

  • If the correct answer is result #1: score = 1/1 = 1.0
  • If it's result #2: score = 1/2 = 0.5
  • If it's result #3: score = 1/3 = 0.33
  • If it's not in the top 5: score = 0

MRR@5 averages these scores across all queries. Our pipeline achieves 0.90 MRR@5, meaning the correct answer is almost always in the top 1-2 results.
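The computation is simple enough to sketch directly; here each rank is the 1-based position of the correct answer, or undefined when it falls outside the top 5:

```typescript
// Sketch of MRR@5: average of 1/rank over all queries, counting 0 for
// any query whose correct answer is missing from the top 5.
function mrrAt5(ranks: (number | undefined)[]): number {
  if (ranks.length === 0) return 0;
  const total = ranks.reduce<number>(
    (sum, rank) => sum + (rank !== undefined && rank <= 5 ? 1 / rank : 0),
    0,
  );
  return total / ranks.length;
}
```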

What is "Boilerplate@1"?

This measures how often the top result contains metadata text like Section: Features or Slug: search-pipeline that was added during ingestion. A 50% boilerplate rate sounds bad, but it's actually expected — the full wrapper deliberately includes this metadata because it dramatically improves accuracy (see the lab results above). The metadata is in the stored chunk text, but you can strip it in your UI if you don't want users to see it.
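If you do want to strip the wrapper before display, a simple line filter works. This sketch assumes the wrapper format shown earlier; adjust the patterns to match your own data:

```typescript
// Sketch of removing wrapper metadata lines (title heading, Section:,
// Slug:) from a stored chunk before showing it to users.
function stripWrapper(chunkText: string): string {
  return chunkText
    .split("\n")
    .filter(
      (line) =>
        !/^# /.test(line) &&
        !/^Section: /.test(line) &&
        !/^Slug: /.test(line),
    )
    .join("\n")
    .trim();
}
```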

What about latency?

| Tier | Typical latency | Steps active |
| --- | --- | --- |
| Community | 100-200ms | Cache, embed, search, fuse, dedup, format |
| Pro (fast path) | 400-900ms | Above + reranking, chunk expansion, citations, DPs |
| Pro (with expansion) | 1-2s | Above + query expansion |
| Team | 450-950ms | Pro + ACL filtering, audit logging |

The main latency costs are:

  1. Embedding generation (~50ms, or 0ms if cached)
  2. Reranking (~100ms) — the biggest single cost, but also the biggest quality win
  3. Query expansion (~80ms per expansion) — often not worth it
  4. HyDE (~150ms) — usually not worth it for well-structured content

Frequently Asked Questions

"Should I enable HyDE?"

Probably not. Our lab found that HyDE adds 8-10x latency while producing equal or worse accuracy on well-structured content. HyDE was designed for scenarios where queries and documents are in completely different "languages" — e.g., searching medical papers with patient-written questions. If your content is well-organized with clear titles and headings, the basic search is already very good.

Exception: If your users frequently ask very short, ambiguous queries ("refund?") against long, detailed documents, HyDE might help. Test it with your own data before enabling in production.

"Should I enable query expansion?"

For most use cases, no. Our lab showed it doesn't improve accuracy and adds latency. However, it can help in two specific scenarios:

  1. Diverse vocabulary — Your users use very different words than your documents. E.g., users say "cancel" but docs say "terminate."
  2. Adaptive retrieval — Enable adaptiveRetrieval instead of queryExpansion. This uses expansion as a fallback only when the first pass fails, avoiding the latency penalty on easy queries.

"What chunk size should I use?"

512 tokens for most content. Our lab tested 384, 512, and 768 tokens:

  • 384 (smaller chunks): More chunks per document, but each chunk has less context. Accuracy dropped to 68.8%.
  • 512 (default): Best accuracy at 87.5%.
  • 768 (larger chunks): Fewer chunks, more context per chunk, but less precise matching. Accuracy dropped to 68.8%.

Exception: FAQ-style content with very short answers benefits from smaller chunks (256-384). Very long narrative content (novels, transcripts) might benefit from larger chunks (768-1024).

"Do I need the reranker?"

The reranker (Jina Reranker v3) adds ~100ms of latency. Our lab showed it doesn't significantly change Hit@1 on well-structured content (87.5% with and without). However, it improves the ordering of results — the difference between having the "pretty good" answer at #1 and having the "best" answer at #1. For user-facing applications, the ~100ms is usually worth it.

When to skip: Autocomplete or typeahead search where latency matters more than precision. Internal tools where users will click through multiple results anyway.

"Why is my Hit@1 lower than expected?"

Common causes, in order of likelihood:

  1. Content preparation — Are your documents ingested with meaningful titles? Missing or generic titles ("Untitled Document") dramatically reduce accuracy.
  2. Chunk size — Try the default 512 if you've changed it.
  3. Fusion weights — For documentation/factual content, try rrfWeightSemantic: 1.4, rrfWeightKeyword: 0.8. For code/technical content, try equal weights.
  4. Missing content — The answer might not be in your knowledge base at all. Check the confidence field — low confidence suggests the content doesn't exist.