Features
RLM (Recursive Language Models)
Deep recursive retrieval that processes ALL your context through an iterative LLM-driven code execution loop instead of just returning top-K results.
What is RLM?
Standard RAG returns the top 5-10 most relevant chunks for a query. That works well for simple questions, but what if the answer requires synthesizing information scattered across hundreds of documents? What if the AI needs to explore your data, form hypotheses, run analyses, and iterate?
RLM (Recursive Language Models) is a fundamentally different approach based on the algorithm from arXiv:2512.24601. Instead of retrieving a handful of chunks:
- All your context (every memory, every document chunk, entire conversation histories) is loaded into a sandbox as a variable
- An LLM writes JavaScript code to explore that context
- The code executes in a sandboxed REPL, and results feed back to the LLM
- The LLM iterates — writing more code, calling sub-LLMs for analysis, filtering and transforming data
- When it has an answer, it terminates with `FINAL()` or `FINAL_VAR()`
The key insight: the context lives in the sandbox, not in the LLM's context window. The LLM only sees metadata about the context (type, length) and explores it programmatically. This means you can process millions of characters of context without hitting token limits.
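To make this concrete, here is an illustrative sketch (not the library's actual internals) of the metadata-only view the root LLM gets: it sees the context's type and size, never the raw text, so context size is decoupled from the token budget.

```typescript
// Illustrative: the root LLM receives a summary like this, not the raw context.
type ContextMeta = { type: "string" | "array"; length: number; preview: string };

function describeContext(context: string | string[]): ContextMeta {
  if (Array.isArray(context)) {
    // For arrays (e.g. memories or chunks), report the item count and a sample.
    return { type: "array", length: context.length, preview: String(context[0] ?? "").slice(0, 80) };
  }
  // For strings, report character length and a short prefix.
  return { type: "string", length: context.length, preview: context.slice(0, 80) };
}

// A ten-million-character corpus costs the root LLM only a few dozen tokens:
const meta = describeContext("x".repeat(10_000_000));
console.log(meta.type, meta.length); // "string" 10000000
```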
Why Use RLM Instead of getContext?
| | getContext (Standard RAG) | rlmQuery (RLM) |
|---|---|---|
| Context used | Top-K chunks (5-10) | ALL chunks / ALL memories |
| Approach | Vector similarity + reranking | LLM-driven iterative code execution |
| Best for | Specific factual questions | Complex analytical questions |
| Speed | Fast (1-5 seconds) | Slower (10-120 seconds) |
| Cost | 1 LLM call + embeddings | Multiple LLM calls + sub-LLM calls |
| Example query | "What is the return policy?" | "Compare all product mentions across our docs and identify inconsistencies" |
Use getContext for most queries. Use rlmQuery when you need deep analysis that requires seeing the full picture.
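One way to apply this guidance in code is a routing heuristic that sends broad analytical queries to `rlmQuery` and everything else to the cheaper path. The keyword list and threshold logic below are illustrative assumptions, not part of the library's API:

```typescript
// Hypothetical router: broad, cross-corpus queries go to rlmQuery;
// everything else uses standard RAG. Tune the hints for your domain.
const ANALYTICAL_HINTS = [
  "compare", "across all", "themes", "patterns", "contradict", "trends",
];

function shouldUseRlm(query: string): boolean {
  const q = query.toLowerCase();
  return ANALYTICAL_HINTS.some((hint) => q.includes(hint));
}

shouldUseRlm("What is the return policy?");                   // false -> getContext
shouldUseRlm("Compare all product mentions across our docs"); // true  -> rlmQuery
```

A real deployment might also consider expected latency budgets: even a clearly analytical query may belong on the getContext path if the caller needs an answer in under a few seconds.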
How It Works
The REPL Loop
The RLM algorithm works like a data scientist with a Jupyter notebook:
1. LLM receives: system prompt + context metadata + user query
2. LLM writes JavaScript code in a `` ```repl `` block
3. Sandbox executes the code, captures stdout/stderr
4. Results are fed back to the LLM
5. LLM writes more code (or calls FINAL to terminate)
6. Repeat until `FINAL()` / `FINAL_VAR()` or max iterations

Scaffold Functions
Inside the sandbox, the LLM has access to these functions:
| Function | Description |
|---|---|
| `context` | The full context data (string or array) |
| `llm_query(prompt)` | Call a sub-LLM for analysis (returns a string) |
| `llm_query_batched(prompts)` | Call multiple sub-LLMs in parallel |
| `FINAL_VAR(varName)` | Declare a REPL variable as the final answer |
| `SHOW_VARS()` | List all user-created variables |
| `console.log(...)` | Print to stdout (visible to the LLM) |
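To show how these pieces fit together, here is a sketch of the kind of code the root LLM might emit inside the sandbox. The real sandbox injects `context` and `llm_query` as globals; they are stubbed here so the snippet runs standalone:

```typescript
// Stand-in for the sandbox-provided context (here: an array of memories).
const context: string[] = [
  "User prefers dark mode.",
  "User asked about refunds twice.",
  "User works in healthcare.",
];

// Stub: the real llm_query sends the prompt to a sub-LLM and returns its reply.
async function llm_query(prompt: string): Promise<string> {
  return `theme of: ${prompt.slice(0, 30)}`;
}

async function analyze(): Promise<string[]> {
  // Typical first step: inspect the context's shape before diving in.
  console.log(`context is an array of ${context.length} items`);

  // Fan out one sub-LLM call per memory, then collect the extracted themes.
  return Promise.all(context.map((m) => llm_query(`Extract the theme of: ${m}`)));
}

analyze().then((themes) => console.log(themes.length)); // logs 3
```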
Example Execution
For the query "What are the top 3 themes across all user memories?":
Iteration 1: LLM writes code to split context into individual memories
Iteration 2: LLM batches memories into groups, calls llm_query on each group to extract themes
Iteration 3: LLM aggregates themes, counts frequencies
Iteration 4: LLM calls FINAL_VAR("top_themes") with the result

Usage
Basic Usage
```ts
const result = await memory.rlmQuery(ctx, {
  orgId: "org_abc123",
  query: "Summarize the key themes across all documents in this knowledge base",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
});

console.log(result.answer);        // The synthesized answer
console.log(result.iterations);    // How many REPL iterations it took
console.log(result.totalSubCalls); // How many sub-LLM calls were made
```

Context Types
RLM supports four context types, each gathering different data:
| Context Type | What It Gathers | Required Params |
|---|---|---|
| `memories` | All episodic memories for a user (up to 1,000) | `userId` |
| `documents` | All chunks from a knowledge base (up to 5,000) | `knowledgeBaseId` |
| `conversation` | Full conversation history for a user (up to 50 sessions) | `userId` |
| `custom` | Raw string you provide directly | `customContext` |
```ts
// Analyze all user memories
const result = await memory.rlmQuery(ctx, {
  orgId,
  userId: "user_456",
  query: "What patterns do you see in this user's preferences?",
  contextType: "memories",
});
```

```ts
// Analyze custom data
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "Find contradictions in these reports",
  contextType: "custom",
  customContext: myLargeDataset,
});
```

Configuration
You can tune the RLM engine via the rlmConfig parameter:
```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
  rlmConfig: {
    runtime: "python",                           // "python" (default) or "bun"
    maxIterations: 30,                           // Default: 20
    model: "anthropic/claude-sonnet-4-20250514", // Root LLM (default: your ai.model)
    subModel: "google/gemini-2.0-flash-001",     // Sub-LLM for llm_query calls
    maxBudget: 1.00,                             // USD budget cap per query (default: null / unlimited)
    maxTimeout: 600,                             // Max seconds per query (default: 600)
  },
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `runtime` | `"python" \| "bun"` | `"python"` | Execution runtime (see Runtimes below) |
| `runtimeFallbackToPython` | `boolean` | `true` | If the Bun runtime fails, automatically retry with Python |
| `maxIterations` | `number` | `20` | Maximum REPL loop iterations before forced termination |
| `maxDepth` | `number` | `1` | Maximum recursion depth (sub-LLM calls are depth 1) |
| `model` | `string` | `config.ai.model` | Model for the root LLM (the one writing code) |
| `subModel` | `string` | Same as `model` | Model for sub-LLM calls inside the sandbox |
| `customSystemPrompt` | `string` | (built-in) | Override the default RLM system prompt (advanced) |
| `maxTimeout` | `number` | `600` | Max seconds for a single RLM call |
| `maxBudget` | `number \| null` | `null` | USD budget cap per query. The rlms package stops with a partial answer if exceeded. Recommended: set this to avoid runaway costs. |
| `maxErrors` | `number \| null` | `null` | Max consecutive REPL errors before stopping |
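The defaults in the table compose with user overrides in the usual "spread over defaults" pattern. The sketch below is illustrative (the actual merge logic lives inside the component), covering only a subset of the options:

```typescript
// Illustrative defaults-merging sketch for a subset of rlmConfig options.
interface RlmConfigSubset {
  runtime: "python" | "bun";
  maxIterations: number;
  maxTimeout: number;
  maxBudget: number | null;
}

const RLM_DEFAULTS: RlmConfigSubset = {
  runtime: "python",
  maxIterations: 20,
  maxTimeout: 600,
  maxBudget: null, // unlimited unless you opt in to a cap
};

function resolveRlmConfig(overrides: Partial<RlmConfigSubset> = {}): RlmConfigSubset {
  // Later keys win, so any user-supplied value replaces the default.
  return { ...RLM_DEFAULTS, ...overrides };
}

const cfg = resolveRlmConfig({ runtime: "bun", maxBudget: 1.0 });
// cfg.runtime === "bun" (override); cfg.maxIterations === 20 (default)
```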
Sandbox Configuration
When you need to tune the Daytona sandbox resources:
| Option | Type | Default | Description |
|---|---|---|---|
| `sandboxMode` | `"shared" \| "per_org"` | `"shared"` | One sandbox for all orgs (cost-efficient) or dedicated per-org (stronger isolation) |
| `sandboxSnapshot` | `string` | `"memcity-rlm-v1"` | Daytona snapshot name (pre-created via `create_rlm_snapshot.ts`) |
| `sandboxAutoStopMinutes` | `number` | `30` | Minutes of inactivity before the sandbox auto-stops |
| `sandboxAutoArchiveMinutes` | `number` | `1440` | Minutes after stop before archiving to cold storage |
| `sandboxCpu` | `number` | `2` | vCPUs allocated to the sandbox |
| `sandboxMemory` | `number` | `4` | GiB of RAM allocated |
| `sandboxDisk` | `number` | `8` | GiB of disk allocated |
Runtimes
All RLM execution runs in ephemeral Daytona cloud sandboxes. You choose which runtime executes inside the sandbox.
Setup
- Get a Daytona API key from daytona.io
- Get an OpenRouter API key from openrouter.ai
- Pass them through your config:
```ts
const memory = new Memory(components.memory, {
  tier: "team",
  apiKeys: {
    openrouter: process.env.OPENROUTER_API_KEY,
    daytona: process.env.DAYTONA_API_KEY,
  },
});
```

- Create the Daytona snapshot (one-time setup):

```sh
DAYTONA_API_KEY=your_key npx tsx scripts/create_rlm_snapshot.ts
```

This builds a sandbox image with Python 3.12, the rlms package, numpy, pandas, and scipy pre-installed.
Python (Default)
The Python runtime runs the official rlms package (from arXiv:2512.24601) inside the Daytona sandbox. This is the most battle-tested path.
Advantages:
- Faithful implementation of the paper's algorithm
- Concurrent `llm_query_batched` execution
- Cost tracking via the rlms package's `usage_summary`
- No action timeout limit (runs outside Convex)
Bun (Alternative)
The Bun runtime runs a native JavaScript implementation of the same algorithm using Bun + node:vm for sandboxed code execution.
```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: { runtime: "bun" },
});
```

Advantages:
- Faster startup (no Python interpreter boot)
- Single universal script handles all task types
Limitations:
- `llm_query_batched` runs sequentially (not concurrent)
- Cost tracking not available (`cost` always returns `null`)
When runtimeFallbackToPython is true (the default), a Bun failure automatically retries with the Python backend.
Sandbox cost: ~$0.00056 per 30-second execution (2 vCPUs, 4 GiB RAM default).
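The fallback behavior described above can be sketched as a simple try/catch around the Bun path. The `runBun`/`runPython` functions below are stand-ins for the real execution backends:

```typescript
// Stand-in backends: the real ones execute the RLM loop inside Daytona.
async function runBun(query: string): Promise<string> {
  throw new Error("bun runtime failed"); // simulate a Bun-side failure
}

async function runPython(query: string): Promise<string> {
  return `python answer for: ${query}`;
}

async function execute(query: string, fallbackToPython = true): Promise<string> {
  try {
    return await runBun(query);
  } catch (err) {
    if (!fallbackToPython) throw err; // runtimeFallbackToPython: false surfaces the error
    return runPython(query);          // default: transparent retry on Python
  }
}

execute("q").then((answer) => console.log(answer)); // "python answer for: q"
```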
Cost and Budget Control
RLM queries are significantly more expensive than standard getContext calls because each query involves multiple LLM calls -- one per REPL iteration plus sub-LLM calls made via llm_query() inside the sandbox.
Estimated Costs Per Query
| Scenario | Iterations | Sub-calls | Estimated Cost |
|---|---|---|---|
| Simple analysis (small context) | 3-5 | 0-2 | $0.01 - $0.05 |
| Moderate analysis (medium context) | 5-10 | 5-10 | $0.05 - $0.30 |
| Deep analysis (large context, many sub-calls) | 10-20 | 10-30+ | $0.30 - $2.00+ |
Costs depend on your chosen models, context size, and how many iterations/sub-calls the LLM uses. Cheaper sub-models (via subModel) can significantly reduce costs.
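A back-of-the-envelope model for the table above: total cost is roughly (iterations × root-call cost) + (sub-calls × sub-call cost). The per-call prices below are illustrative assumptions; substitute your models' actual pricing when sizing budgets.

```typescript
// Rough cost model: each REPL iteration is one root-LLM call, plus however
// many sub-LLM calls the generated code makes via llm_query().
function estimateQueryCost(opts: {
  iterations: number;
  subCalls: number;
  rootCostPerCall: number; // USD per root-LLM iteration (assumed)
  subCostPerCall: number;  // USD per sub-LLM call (assumed)
}): number {
  return opts.iterations * opts.rootCostPerCall + opts.subCalls * opts.subCostPerCall;
}

// A "moderate" query: 8 iterations at ~$0.02 each, 8 sub-calls at ~$0.002 each.
estimateQueryCost({ iterations: 8, subCalls: 8, rootCostPerCall: 0.02, subCostPerCall: 0.002 });
// ≈ $0.176, inside the $0.05 - $0.30 "moderate" band above
```

The formula also makes the savings lever obvious: sub-call cost scales with `subModel` pricing, which is why swapping in a cheap sub-model cuts totals so effectively.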
Budget Controls
We strongly recommend setting a maxBudget to prevent runaway costs:
```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: {
    maxBudget: 1.00,    // Stop at $1.00 and return a partial answer
    maxIterations: 15,  // Cap iterations (default: 20)
    maxTimeout: 300,    // Cap at 5 minutes (default: 600s / 10 min)
    maxErrors: 5,       // Stop after 5 consecutive REPL errors
    subModel: "google/gemini-2.0-flash-001", // Use a cheaper model for sub-calls
  },
});

// Check if the result was truncated due to limits
if (result.warning) {
  console.warn("RLM returned a partial result:", result.warning);
}
```

| Control | Default | Recommendation |
|---|---|---|
| `maxBudget` | `null` (unlimited) | Set to $0.50 - $2.00 for most use cases |
| `maxIterations` | 20 | 10-15 is sufficient for most queries |
| `maxTimeout` | 600s (10 min) | 300s is a good default for user-facing queries |
| `maxErrors` | `null` (unlimited) | Set to 3-5 to fail fast on bad queries |
When any limit is reached, RLM returns the best answer it has so far (a partial result) along with a warning field explaining what happened. It does not throw an error.
Cost-Saving Tips
- Use a cheap `subModel` -- Sub-LLM calls (`llm_query()`) inside the REPL are often simple extraction/summarization tasks. A fast model like gemini-2.0-flash works well and costs a fraction of larger models.
- Lower `maxIterations` -- Most queries resolve in 3-10 iterations. Setting the cap to 15 instead of 20 prevents edge cases from running too long.
- Use `documents` context sparingly -- Document context can be very large (up to 5,000 chunks). For targeted analysis, consider filtering to a specific knowledge base rather than loading everything.
- Monitor via session logs -- Check the `rlmSessions` table to track execution times and iteration counts across your queries. This helps you tune budgets based on real usage.
When to Use RLM
Good Use Cases
- Cross-document analysis — "What are the common themes across all our support tickets?"
- Contradiction detection — "Find any conflicting information in our documentation"
- Comprehensive summarization — "Give me a complete overview of this user's history"
- Pattern extraction — "What trends do you see in the last 6 months of conversations?"
- Deep Q&A — Questions that require information from many different sources
Stick with getContext For
- Simple factual lookups ("What is the refund policy?")
- Queries where top-K results are sufficient
- Latency-sensitive applications (RLM is 10-100x slower)
- High-throughput scenarios (RLM uses many more LLM tokens)
Session Logging
Every RLM query is automatically logged to the rlmSessions table for auditing and debugging. Each session records:
- Organization and user IDs
- The query and context type
- Context size (characters)
- Backend used
- Number of iterations and sub-LLM calls
- Execution time
- Status (completed, failed, timeout)
- Error message (if failed)
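For consumers of these logs, a session row might look like the shape below. The field names here are assumptions for illustration (check the component's actual schema), and the aggregation helper shows the kind of budget-tuning analysis the logs enable:

```typescript
// Hypothetical shape of an rlmSessions row; field names are illustrative.
interface RlmSession {
  orgId: string;
  userId?: string;
  query: string;
  contextType: "memories" | "documents" | "conversation" | "custom";
  contextSizeChars: number;
  backend: "python" | "bun";
  iterations: number;
  subCalls: number;
  executionTimeMs: number;
  status: "completed" | "failed" | "timeout";
  error?: string;
}

// Example analysis: average iteration count, useful for tuning maxIterations.
function averageIterations(sessions: RlmSession[]): number {
  if (sessions.length === 0) return 0;
  return sessions.reduce((sum, s) => sum + s.iterations, 0) / sessions.length;
}
```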
RLM Enrichment Pipeline
Beyond Q&A queries, the RLM engine also powers an enrichment pipeline that runs during document ingestion. Instead of querying your data at search time, it analyzes every document and chunk at ingestion time -- extracting entities, resolving pronouns, identifying key terms, and producing enriched metadata that improves search quality.
The enrichment pipeline adds three phases to ingestion:
- Pre-chunk structuring -- Analyze the full document for entities, relationships, and section boundaries
- Guided chunking -- Use section hints from Phase 1 to split text at logical boundaries
- Chunk-level enrichment -- Resolve pronouns, extract key terms, and summarize each chunk with doc-wide context
Enrichment data flows into embeddings, BM25 indexes, and reranker inputs -- so every search benefits from the analysis.
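The three phases above can be sketched as a pipeline of plain functions. The analyses here are trivial stubs (paragraph breaks as section hints, long words as "key terms"); the real pipeline uses RLM calls for each step, and the function names are hypothetical:

```typescript
interface StructureHints { sectionBoundaries: number[] }

// Phase 1 (stub): treat blank lines as candidate section boundaries.
function preChunkStructure(doc: string): StructureHints {
  const boundaries: number[] = [];
  let idx = doc.indexOf("\n\n");
  while (idx !== -1) {
    boundaries.push(idx);
    idx = doc.indexOf("\n\n", idx + 2);
  }
  return { sectionBoundaries: boundaries };
}

// Phase 2: split at the hinted boundaries instead of fixed character offsets.
function guidedChunk(doc: string, hints: StructureHints): string[] {
  const cuts = [0, ...hints.sectionBoundaries, doc.length];
  const chunks: string[] = [];
  for (let i = 0; i < cuts.length - 1; i++) {
    chunks.push(doc.slice(cuts[i], cuts[i + 1]).trim());
  }
  return chunks.filter((c) => c.length > 0);
}

// Phase 3 (stub): "key term" extraction as a placeholder for LLM enrichment.
function enrichChunk(chunk: string): { text: string; keyTerms: string[] } {
  const terms = [...new Set(chunk.toLowerCase().match(/\w{7,}/g) ?? [])];
  return { text: chunk, keyTerms: terms };
}
```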
For full details, see the dedicated RLM Enrichment Pipeline documentation.
Availability
RLM is available on the Team tier only. It is not included in Community or Pro distributions.