memcity

Features

RLM (Recursive Language Models)

Deep recursive retrieval that processes ALL your context through an iterative LLM-driven code execution loop instead of just returning top-K results.

What is RLM?

Standard RAG returns the top 5-10 most relevant chunks for a query. That works well for simple questions, but what if the answer requires synthesizing information scattered across hundreds of documents? What if the AI needs to explore your data, form hypotheses, run analyses, and iterate?

RLM (Recursive Language Models) is a fundamentally different approach based on the algorithm from arXiv:2512.24601. Instead of retrieving a handful of chunks:

  1. All your context (every memory, every document chunk, entire conversation histories) is loaded into a sandbox as a variable
  2. An LLM writes JavaScript code to explore that context
  3. The code executes in a sandboxed REPL, and results feed back to the LLM
  4. The LLM iterates — writing more code, calling sub-LLMs for analysis, filtering and transforming data
  5. When it has an answer, it terminates with FINAL() or FINAL_VAR()

The key insight: the context lives in the sandbox, not in the LLM's context window. The LLM only sees metadata about the context (type, length) and explores it programmatically. This means you can process millions of characters of context without hitting token limits.

Why Use RLM Instead of getContext?

| | getContext (Standard RAG) | rlmQuery (RLM) |
|---|---|---|
| Context used | Top-K chunks (5-10) | ALL chunks / ALL memories |
| Approach | Vector similarity + reranking | LLM-driven iterative code execution |
| Best for | Specific factual questions | Complex analytical questions |
| Speed | Fast (1-5 seconds) | Slower (10-120 seconds) |
| Cost | 1 LLM call + embeddings | Multiple LLM calls + sub-LLM calls |
| Example query | "What is the return policy?" | "Compare all product mentions across our docs and identify inconsistencies" |

Use getContext for most queries. Use rlmQuery when you need deep analysis that requires seeing the full picture.

How It Works

The REPL Loop

The RLM algorithm works like a data scientist with a Jupyter notebook:

````text
1. LLM receives: system prompt + context metadata + user query
2. LLM writes JavaScript code in a ```repl block
3. Sandbox executes the code, captures stdout/stderr
4. Results are fed back to the LLM
5. LLM writes more code (or calls FINAL to terminate)
6. Repeat until FINAL() / FINAL_VAR() or max iterations
````
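In miniature, the loop looks like the sketch below. Everything in it is a stand-in: the "LLM" is a scripted list of turns and the "sandbox" is an in-process `new Function` call, whereas the real engine talks to a model and executes inside Daytona. The point it illustrates is that the prompt only ever carries metadata about the context; the context variable itself stays on the sandbox side.

```typescript
// Toy sketch of the RLM REPL loop (illustrative only, not memcity's code).
type Turn = { code: string };

// The context lives OUTSIDE the LLM's prompt; the prompt only carries metadata.
const context: string[] = ["alice likes hiking", "bob likes chess", "alice likes tea"];
const metadata = { type: "string[]", length: context.length };
console.log("prompt metadata:", metadata); // all the root LLM sees up front

// Scripted "LLM": first it explores, then it answers.
const scriptedTurns: Turn[] = [
  { code: `context.filter(m => m.startsWith("alice")).length` },
  { code: `FINAL("alice appears in 2 memories")` },
];

function runReplLoop(turns: Turn[], maxIterations = 20): { answer: string; iterations: number } {
  const state: { final: string | null } = { final: null };
  const FINAL = (answer: string) => { state.final = answer; };
  let iterations = 0;
  for (const turn of turns) {
    if (iterations++ >= maxIterations) break;
    // "Sandbox" execution; in memcity this happens inside Daytona, not in-process.
    const observation = new Function("context", "FINAL", `return (${turn.code})`)(context, FINAL);
    if (state.final !== null) return { answer: state.final, iterations };
    console.log("observation:", observation); // fed back into the next LLM prompt
  }
  return { answer: state.final ?? "(max iterations reached)", iterations };
}

console.log(runReplLoop(scriptedTurns)); // → answer: "alice appears in 2 memories", iterations: 2
```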

Scaffold Functions

Inside the sandbox, the LLM has access to these globals:

| Name | Description |
|---|---|
| context | The full context data (string or array) |
| llm_query(prompt) | Call a sub-LLM for analysis (returns a string) |
| llm_query_batched(prompts) | Call multiple sub-LLMs in parallel |
| FINAL_VAR(varName) | Declare a REPL variable as the final answer |
| SHOW_VARS() | List all user-created variables |
| console.log(...) | Print to stdout (visible to the LLM) |
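To make these concrete, here is the flavor of code the root LLM might emit inside the sandbox. The scaffold globals (context, llm_query, FINAL_VAR) are stubbed with canned local versions so the sketch runs standalone; in the real sandbox the engine provides them, and llm_query calls an actual sub-LLM.

```typescript
// Illustrative sandbox-side snippet; scaffold globals are local stubs.
const context: string[] = [
  "ticket: login page 500 error",
  "ticket: password reset email missing",
  "ticket: login button unresponsive",
];
// Stub: the real llm_query calls a sub-LLM; this canned version just tags tickets.
const llm_query = (prompt: string): string =>
  prompt.includes("login") ? "theme: login issues" : "theme: email delivery";
// Stub: the real FINAL_VAR tells the engine which REPL variable holds the answer.
const FINAL_VAR = (varName: string): void => { console.log("final variable:", varName); };

// The kind of exploration code the LLM writes:
const themes = context.map(t => llm_query(`Extract the theme of: ${t}`));
const counts: Record<string, number> = {};
for (const theme of themes) counts[theme] = (counts[theme] ?? 0) + 1;
const top_theme = Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
console.log(top_theme); // "theme: login issues"
FINAL_VAR("top_theme");
```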

Example Execution

For the query "What are the top 3 themes across all user memories?":

```text
Iteration 1: LLM writes code to split context into individual memories
Iteration 2: LLM batches memories into groups, calls llm_query on each group
             to extract themes
Iteration 3: LLM aggregates themes, counts frequencies
Iteration 4: LLM calls FINAL_VAR("top_themes") with the result
```

Usage

Basic Usage

```ts
const result = await memory.rlmQuery(ctx, {
  orgId: "org_abc123",
  query: "Summarize the key themes across all documents in this knowledge base",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
});

console.log(result.answer);        // The synthesized answer
console.log(result.iterations);    // How many REPL iterations it took
console.log(result.totalSubCalls); // How many sub-LLM calls were made
```

Context Types

RLM supports four context types, each gathering different data:

| Context Type | What It Gathers | Required Params |
|---|---|---|
| memories | All episodic memories for a user (up to 1,000) | userId |
| documents | All chunks from a knowledge base (up to 5,000) | knowledgeBaseId |
| conversation | Full conversation history for a user (up to 50 sessions) | userId |
| custom | Raw string you provide directly | customContext |
```ts
// Analyze all user memories
const memoryResult = await memory.rlmQuery(ctx, {
  orgId, userId: "user_456",
  query: "What patterns do you see in this user's preferences?",
  contextType: "memories",
});

// Analyze custom data
const customResult = await memory.rlmQuery(ctx, {
  orgId,
  query: "Find contradictions in these reports",
  contextType: "custom",
  customContext: myLargeDataset,
});
```

Configuration

You can tune the RLM engine via the rlmConfig parameter:

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
  rlmConfig: {
    runtime: "python",           // "python" (default) or "bun"
    maxIterations: 30,           // Default: 20
    model: "anthropic/claude-sonnet-4-20250514", // Root LLM (default: your ai.model)
    subModel: "google/gemini-2.0-flash-001",     // Sub-LLM for llm_query calls
    maxBudget: 1.00,             // USD budget cap per query (default: null / unlimited)
    maxTimeout: 600,             // Max seconds per query (default: 600)
  },
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| runtime | "python" \| "bun" | "python" | Execution runtime (see Runtimes below) |
| runtimeFallbackToPython | boolean | true | If the Bun runtime fails, automatically retry with Python |
| maxIterations | number | 20 | Maximum REPL loop iterations before forced termination |
| maxDepth | number | 1 | Maximum recursion depth (sub-LLM calls are depth 1) |
| model | string | config.ai.model | Model for the root LLM (the one writing code) |
| subModel | string | Same as model | Model for sub-LLM calls inside the sandbox |
| customSystemPrompt | string | (built-in) | Override the default RLM system prompt (advanced) |
| maxTimeout | number | 600 | Max seconds for a single RLM call |
| maxBudget | number \| null | null | USD budget cap per query. The rlms package stops with a partial answer if exceeded. Recommended: set this to avoid runaway costs. |
| maxErrors | number \| null | null | Max consecutive REPL errors before stopping |

Sandbox Configuration

When you need to tune the Daytona sandbox resources:

| Option | Type | Default | Description |
|---|---|---|---|
| sandboxMode | "shared" \| "per_org" | "shared" | One sandbox for all orgs (cost-efficient) or dedicated per-org (stronger isolation) |
| sandboxSnapshot | string | "memcity-rlm-v1" | Daytona snapshot name (pre-created via create_rlm_snapshot.ts) |
| sandboxAutoStopMinutes | number | 30 | Minutes of inactivity before the sandbox auto-stops |
| sandboxAutoArchiveMinutes | number | 1440 | Minutes after stop before archiving to cold storage |
| sandboxCpu | number | 2 | vCPUs allocated to the sandbox |
| sandboxMemory | number | 4 | GiB of RAM allocated |
| sandboxDisk | number | 8 | GiB of disk allocated |

Runtimes

All RLM execution runs in ephemeral Daytona cloud sandboxes. You choose which runtime executes inside the sandbox.

Setup

  1. Get a Daytona API key from daytona.io
  2. Get an OpenRouter API key from openrouter.ai
  3. Pass them through your config:

```ts
const memory = new Memory(components.memory, {
  tier: "team",
  apiKeys: {
    openrouter: process.env.OPENROUTER_API_KEY,
    daytona: process.env.DAYTONA_API_KEY,
  },
});
```

  4. Create the Daytona snapshot (one-time setup):

```bash
DAYTONA_API_KEY=your_key npx tsx scripts/create_rlm_snapshot.ts
```

This builds a sandbox image with Python 3.12, the rlms package, numpy, pandas, and scipy pre-installed.

Python (Default)

The Python runtime runs the official rlms package (from arXiv:2512.24601) inside the Daytona sandbox. This is the most battle-tested path.

Advantages:

  • Faithful implementation of the paper's algorithm
  • Concurrent llm_query_batched execution
  • Cost tracking via the rlms package's usage_summary
  • No action timeout limit (runs outside Convex)

Bun (Alternative)

The Bun runtime runs a native JavaScript implementation of the same algorithm using Bun + node:vm for sandboxed code execution.

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: { runtime: "bun" },
});
```

Advantages:

  • Faster startup (no Python interpreter boot)
  • Single universal script handles all task types

Limitations:

  • llm_query_batched runs sequentially (not concurrent)
  • Cost tracking not available (cost always returns null)

When runtimeFallbackToPython is true (the default), a Bun failure automatically retries with the Python backend.
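The fallback behavior can be pictured as a simple try/retry wrapper. The helper below is a hypothetical sketch mirroring what runtimeFallbackToPython: true does, not memcity's internal API; the mock backend stands in for a real rlmQuery execution.

```typescript
// Hypothetical sketch of the Bun -> Python fallback (names are illustrative).
type Runtime = "bun" | "python";

async function runWithFallback(
  run: (runtime: Runtime) => Promise<string>,
  fallbackToPython = true, // mirrors runtimeFallbackToPython
): Promise<{ answer: string; runtime: Runtime }> {
  try {
    return { answer: await run("bun"), runtime: "bun" };
  } catch (err) {
    if (!fallbackToPython) throw err;
    // Bun failed; retry the same query on the Python backend.
    return { answer: await run("python"), runtime: "python" };
  }
}

// Demo with a mock backend where the Bun path crashes:
const mockRun = async (runtime: Runtime): Promise<string> => {
  if (runtime === "bun") throw new Error("bun runtime crashed");
  return "answer from python";
};
runWithFallback(mockRun).then(r => console.log(r)); // falls back to the Python backend
```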

Sandbox cost: ~$0.00056 per 30-second execution (2 vCPUs, 4 GiB RAM default).

Cost and Budget Control

RLM queries are significantly more expensive than standard getContext calls because each query involves multiple LLM calls -- one per REPL iteration plus sub-LLM calls made via llm_query() inside the sandbox.

Estimated Costs Per Query

| Scenario | Iterations | Sub-calls | Estimated Cost |
|---|---|---|---|
| Simple analysis (small context) | 3-5 | 0-2 | $0.01 - $0.05 |
| Moderate analysis (medium context) | 5-10 | 5-10 | $0.05 - $0.30 |
| Deep analysis (large context, many sub-calls) | 10-20 | 10-30+ | $0.30 - $2.00+ |

Costs depend on your chosen models, context size, and how many iterations/sub-calls the LLM uses. Cheaper sub-models (via subModel) can significantly reduce costs.
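One way to budget ahead of time is a per-call back-of-envelope. The rates in the sketch below are made-up placeholders, not real prices; substitute average per-call costs for your chosen model and subModel.

```typescript
// Back-of-envelope RLM cost estimate. The default rates are placeholders;
// plug in averages for your actual root model and sub-model.
function estimateRlmCost(
  iterations: number,
  subCalls: number,
  rootCostPerCall = 0.02, // assumed avg USD per root-LLM iteration (placeholder)
  subCostPerCall = 0.002, // assumed avg USD per sub-LLM call (placeholder)
): number {
  return iterations * rootCostPerCall + subCalls * subCostPerCall;
}

// A "moderate analysis" query: 8 iterations, 8 sub-calls.
console.log(estimateRlmCost(8, 8).toFixed(3)); // "0.176"
```

This also shows why a cheap subModel matters: sub-calls often outnumber root iterations, so shrinking subCostPerCall dominates the total.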

Budget Controls

We strongly recommend setting a maxBudget to prevent runaway costs:

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: {
    maxBudget: 1.00,       // Stop at $1.00 and return partial answer
    maxIterations: 15,     // Cap iterations (default: 20)
    maxTimeout: 300,       // Cap at 5 minutes (default: 600s / 10 min)
    maxErrors: 5,          // Stop after 5 consecutive REPL errors
    subModel: "google/gemini-2.0-flash-001", // Use a cheaper model for sub-calls
  },
});

// Check if the result was truncated due to limits
if (result.warning) {
  console.warn("RLM returned a partial result:", result.warning);
}
```

| Control | Default | Recommendation |
|---|---|---|
| maxBudget | null (unlimited) | Set to $0.50 - $2.00 for most use cases |
| maxIterations | 20 | 10-15 is sufficient for most queries |
| maxTimeout | 600s (10 min) | 300s is a good default for user-facing queries |
| maxErrors | null (unlimited) | Set to 3-5 to fail fast on bad queries |

When any limit is reached, RLM returns the best answer it has so far (a partial result) along with a warning field explaining what happened. It does not throw an error.

Cost-Saving Tips

  1. Use a cheap subModel -- Sub-LLM calls (llm_query()) inside the REPL are often simple extraction/summarization tasks. A fast model like gemini-2.0-flash works well and costs a fraction of larger models.
  2. Lower maxIterations -- Most queries resolve in 3-10 iterations. Setting the cap to 15 instead of 20 prevents edge cases from running too long.
  3. Use documents context sparingly -- Document context can be very large (up to 5,000 chunks). For targeted analysis, consider filtering to a specific knowledge base rather than loading everything.
  4. Monitor via session logs -- Check the rlmSessions table to track execution times and iteration counts across your queries. This helps you tune budgets based on real usage.

When to Use RLM

Good Use Cases

  • Cross-document analysis — "What are the common themes across all our support tickets?"
  • Contradiction detection — "Find any conflicting information in our documentation"
  • Comprehensive summarization — "Give me a complete overview of this user's history"
  • Pattern extraction — "What trends do you see in the last 6 months of conversations?"
  • Deep Q&A — Questions that require information from many different sources

Stick with getContext For

  • Simple factual lookups ("What is the refund policy?")
  • Queries where top-K results are sufficient
  • Latency-sensitive applications (RLM is 10-100x slower)
  • High-throughput scenarios (RLM uses many more LLM tokens)

Session Logging

Every RLM query is automatically logged to the rlmSessions table for auditing and debugging. Each session records:

  • Organization and user IDs
  • The query and context type
  • Context size (characters)
  • Backend used
  • Number of iterations and sub-LLM calls
  • Execution time
  • Status (completed, failed, timeout)
  • Error message (if failed)
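Those fields make it straightforward to tune budgets from real usage. The sketch below assumes a hypothetical record shape mirroring the list above; adapt the field names to the actual rlmSessions schema.

```typescript
// Sketch of summarizing RLM session logs; the record shape is an assumption.
interface RlmSession {
  orgId: string;
  status: "completed" | "failed" | "timeout";
  iterations: number;
  subCalls: number;
  executionMs: number;
}

function summarize(sessions: RlmSession[]) {
  const done = sessions.filter(s => s.status === "completed");
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return {
    completionRate: done.length / Math.max(sessions.length, 1),
    avgIterations: avg(done.map(s => s.iterations)),
    avgExecutionMs: avg(done.map(s => s.executionMs)),
  };
}

const sample: RlmSession[] = [
  { orgId: "org_a", status: "completed", iterations: 6, subCalls: 4, executionMs: 18000 },
  { orgId: "org_a", status: "completed", iterations: 10, subCalls: 12, executionMs: 45000 },
  { orgId: "org_a", status: "timeout", iterations: 20, subCalls: 30, executionMs: 600000 },
];
console.log(summarize(sample)); // completionRate ≈ 0.67, avgIterations: 8, avgExecutionMs: 31500
```

If timeouts dominate, raise maxTimeout or lower maxIterations; if average iterations sit well under your cap, you can tighten maxIterations safely.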

RLM Enrichment Pipeline

Beyond Q&A queries, the RLM engine also powers an enrichment pipeline that runs during document ingestion. Instead of querying your data at search time, it analyzes every document and chunk at ingestion time -- extracting entities, resolving pronouns, identifying key terms, and producing enriched metadata that improves search quality.

The enrichment pipeline adds three phases to ingestion:

  1. Pre-chunk structuring -- Analyze the full document for entities, relationships, and section boundaries
  2. Guided chunking -- Use section hints from Phase 1 to split text at logical boundaries
  3. Chunk-level enrichment -- Resolve pronouns, extract key terms, and summarize each chunk with doc-wide context

Enrichment data flows into embeddings, BM25 indexes, and reranker inputs -- so every search benefits from the analysis.
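As a rough illustration of phase 3, the sketch below enriches one chunk using a doc-wide entity map. The helper name, the enrichment shape, and the string-based pronoun resolution are all stand-ins for illustration; the real pipeline drives this step with LLM calls inside the RLM engine.

```typescript
// Illustrative phase-3 sketch (names and shapes are assumptions, not the real API).
interface EnrichedChunk {
  text: string;          // original chunk text
  resolvedText: string;  // pronouns replaced using doc-wide context
  keyTerms: string[];    // terms that would feed BM25 and embeddings
}

// Stub resolver: the real pipeline would ask an LLM, passing the document's
// entity map as context; here we do a naive string substitution.
function enrichChunk(chunk: string, entityMap: Record<string, string>): EnrichedChunk {
  let resolvedText = chunk;
  for (const [pronoun, entity] of Object.entries(entityMap)) {
    resolvedText = resolvedText.split(pronoun).join(entity);
  }
  const keyTerms = [...new Set(resolvedText.toLowerCase().match(/[a-z]{5,}/g) ?? [])];
  return { text: chunk, resolvedText, keyTerms };
}

const enriched = enrichChunk("It supports budget caps.", { "It": "The RLM engine" });
console.log(enriched.resolvedText); // "The RLM engine supports budget caps."
```

The payoff is that a chunk like "It supports budget caps." becomes searchable for "RLM engine" even though the entity name never appeared in the chunk itself.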

For full details, see the dedicated RLM Enrichment Pipeline documentation.

Availability

RLM is available on the Team tier only. It is not included in Community or Pro distributions.