memcity

Features

RLM (Recursive Language Models)

Deep recursive retrieval that processes ALL your context through an iterative LLM-driven code execution loop instead of just returning top-K results.

What is RLM?

Standard RAG returns the top 5-10 most relevant chunks for a query. That works well for simple questions, but what if the answer requires synthesizing information scattered across hundreds of documents? What if the AI needs to explore your data, form hypotheses, run analyses, and iterate?

RLM (Recursive Language Models) is a fundamentally different approach based on the algorithm from arXiv:2512.24601. Instead of retrieving a handful of chunks:

  1. All your context (every memory, every document chunk, entire conversation histories) is loaded into a sandbox as a variable
  2. An LLM writes JavaScript code to explore that context
  3. The code executes in a sandboxed REPL, and results feed back to the LLM
  4. The LLM iterates — writing more code, calling sub-LLMs for analysis, filtering and transforming data
  5. When it has an answer, it terminates with FINAL() or FINAL_VAR()

The key insight: the context lives in the sandbox, not in the LLM's context window. The LLM only sees metadata about the context (type, length) and explores it programmatically. This means you can process millions of characters of context without hitting token limits.

Why Use RLM Instead of getContext?

| | getContext (Standard RAG) | rlmQuery (RLM) |
|---|---|---|
| Context used | Top-K chunks (5-10) | ALL chunks / ALL memories |
| Approach | Vector similarity + reranking | LLM-driven iterative code execution |
| Best for | Specific factual questions | Complex analytical questions |
| Speed | Fast (1-5 seconds) | Slower (10-120 seconds) |
| Cost | 1 LLM call + embeddings | Multiple LLM calls + sub-LLM calls |
| Example query | "What is the return policy?" | "Compare all product mentions across our docs and identify inconsistencies" |

Use getContext for most queries. Use rlmQuery when you need deep analysis that requires seeing the full picture.

How It Works

The REPL Loop

The RLM algorithm works like a data scientist with a Jupyter notebook:

````text
1. LLM receives: system prompt + context metadata + user query
2. LLM writes JavaScript code in a ```repl block
3. Sandbox executes the code, captures stdout/stderr
4. Results are fed back to the LLM
5. LLM writes more code (or calls FINAL to terminate)
6. Repeat until FINAL() / FINAL_VAR() or max iterations
````
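In miniature, the loop looks like the sketch below. Everything in it is a stand-in: the "LLM" is a scripted list of turns and the "sandbox" is an in-process `new Function` call, whereas the real engine talks to a model and executes inside Daytona. The point it illustrates is that the prompt only ever carries metadata about the context; the context variable itself stays on the sandbox side.

```typescript
// Toy sketch of the RLM REPL loop (illustrative only, not memcity's code).
type Turn = { code: string };

// The context lives OUTSIDE the LLM's prompt; the prompt only carries metadata.
const context: string[] = ["alice likes hiking", "bob likes chess", "alice likes tea"];
const metadata = { type: "string[]", length: context.length };
console.log("prompt metadata:", metadata); // all the root LLM sees up front

// Scripted "LLM": first it explores, then it answers.
const scriptedTurns: Turn[] = [
  { code: `context.filter(m => m.startsWith("alice")).length` },
  { code: `FINAL("alice appears in 2 memories")` },
];

function runReplLoop(turns: Turn[], maxIterations = 20): { answer: string; iterations: number } {
  const state: { final: string | null } = { final: null };
  const FINAL = (answer: string) => { state.final = answer; };
  let iterations = 0;
  for (const turn of turns) {
    if (iterations++ >= maxIterations) break;
    // "Sandbox" execution; in memcity this happens inside Daytona, not in-process.
    const observation = new Function("context", "FINAL", `return (${turn.code})`)(context, FINAL);
    if (state.final !== null) return { answer: state.final, iterations };
    console.log("observation:", observation); // fed back into the next LLM prompt
  }
  return { answer: state.final ?? "(max iterations reached)", iterations };
}

console.log(runReplLoop(scriptedTurns)); // → answer: "alice appears in 2 memories", iterations: 2
```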

Scaffold Functions

Inside the sandbox, the LLM has access to these globals:

| Name | Description |
|---|---|
| context | The full context data (string or array) |
| llm_query(prompt) | Call a sub-LLM for analysis (returns a string) |
| llm_query_batched(prompts) | Call multiple sub-LLMs in parallel |
| FINAL_VAR(varName) | Declare a REPL variable as the final answer |
| SHOW_VARS() | List all user-created variables |
| console.log(...) | Print to stdout (visible to the LLM) |
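To make these concrete, here is the flavor of code the root LLM might emit inside the sandbox. The scaffold globals (context, llm_query, FINAL_VAR) are stubbed with canned local versions so the sketch runs standalone; in the real sandbox the engine provides them, and llm_query calls an actual sub-LLM.

```typescript
// Illustrative sandbox-side snippet; scaffold globals are local stubs.
const context: string[] = [
  "ticket: login page 500 error",
  "ticket: password reset email missing",
  "ticket: login button unresponsive",
];
// Stub: the real llm_query calls a sub-LLM; this canned version just tags tickets.
const llm_query = (prompt: string): string =>
  prompt.includes("login") ? "theme: login issues" : "theme: email delivery";
// Stub: the real FINAL_VAR tells the engine which REPL variable holds the answer.
const FINAL_VAR = (varName: string): void => { console.log("final variable:", varName); };

// The kind of exploration code the LLM writes:
const themes = context.map(t => llm_query(`Extract the theme of: ${t}`));
const counts: Record<string, number> = {};
for (const theme of themes) counts[theme] = (counts[theme] ?? 0) + 1;
const top_theme = Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
console.log(top_theme); // "theme: login issues"
FINAL_VAR("top_theme");
```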

Example Execution

For the query "What are the top 3 themes across all user memories?":

```text
Iteration 1: LLM writes code to split context into individual memories
Iteration 2: LLM batches memories into groups, calls llm_query on each group
             to extract themes
Iteration 3: LLM aggregates themes, counts frequencies
Iteration 4: LLM calls FINAL_VAR("top_themes") with the result
```

Usage

Basic Usage

```ts
const result = await memory.rlmQuery(ctx, {
  orgId: "org_abc123",
  query: "Summarize the key themes across all documents in this knowledge base",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
});

console.log(result.answer);        // The synthesized answer
console.log(result.iterations);    // How many REPL iterations it took
console.log(result.totalSubCalls); // How many sub-LLM calls were made
```

Context Types

RLM supports four context types, each gathering different data:

| Context Type | What It Gathers | Required Params |
|---|---|---|
| memories | All episodic memories for a user (up to 1,000) | userId |
| documents | All chunks from a knowledge base (up to 5,000) | knowledgeBaseId |
| conversation | Full conversation history for a user (up to 50 sessions) | userId |
| custom | Raw string you provide directly | customContext |
```ts
// Analyze all user memories
const memoryResult = await memory.rlmQuery(ctx, {
  orgId, userId: "user_456",
  query: "What patterns do you see in this user's preferences?",
  contextType: "memories",
});

// Analyze custom data
const customResult = await memory.rlmQuery(ctx, {
  orgId,
  query: "Find contradictions in these reports",
  contextType: "custom",
  customContext: myLargeDataset,
});
```

Configuration

You can tune the RLM engine via the rlmConfig parameter:

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  knowledgeBaseId: "kb_xyz",
  contextType: "documents",
  rlmConfig: {
    runtime: "python",           // "python" (default) or "bun"
    maxIterations: 30,           // Default: 20
    model: "anthropic/claude-sonnet-4-20250514", // Root LLM (default: your ai.model)
    subModel: "google/gemini-2.0-flash-001",     // Sub-LLM for llm_query calls
    maxBudget: 1.00,             // USD budget cap per query (default: null / unlimited)
    maxTimeout: 600,             // Max seconds per query (default: 600)
  },
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| runtime | "python" \| "bun" | "python" | Execution runtime (see Runtimes below) |
| runtimeFallbackToPython | boolean | true | If the Bun runtime fails, automatically retry with Python |
| maxIterations | number | 20 | Maximum REPL loop iterations before forced termination |
| maxDepth | number | 1 | Maximum recursion depth (sub-LLM calls are depth 1) |
| model | string | config.ai.model | Model for the root LLM (the one writing code) |
| subModel | string | Same as model | Model for sub-LLM calls inside the sandbox |
| customSystemPrompt | string | (built-in) | Override the default RLM system prompt (advanced) |
| maxTimeout | number | 600 | Max seconds for a single RLM call |
| maxBudget | number \| null | null | USD budget cap per query. The rlms package stops with a partial answer if exceeded. Recommended: set this to avoid runaway costs. |
| maxErrors | number \| null | null | Max consecutive REPL errors before stopping |

Sandbox Configuration

When you need to tune the Daytona sandbox resources:

| Option | Type | Default | Description |
|---|---|---|---|
| sandboxMode | "shared" \| "per_org" | "shared" | One sandbox for all orgs (cost-efficient) or dedicated per-org (stronger isolation) |
| sandboxSnapshot | string | "memcity-rlm-v1" | Daytona snapshot name (pre-created via create_rlm_snapshot.ts) |
| sandboxAutoStopMinutes | number | 30 | Minutes of inactivity before the sandbox auto-stops |
| sandboxAutoArchiveMinutes | number | 1440 | Minutes after stop before archiving to cold storage |
| sandboxCpu | number | 2 | vCPUs allocated to the sandbox |
| sandboxMemory | number | 4 | GiB of RAM allocated |
| sandboxDisk | number | 8 | GiB of disk allocated |

Runtimes

All RLM execution runs in ephemeral Daytona cloud sandboxes. You choose which runtime executes inside the sandbox.

Setup

  1. Get a Daytona API key from daytona.io
  2. Get an OpenRouter API key from openrouter.ai
  3. Pass them through your config:

```ts
const memory = new Memory(components.memory, {
  tier: "team",
  apiKeys: {
    openrouter: process.env.OPENROUTER_API_KEY,
    daytona: process.env.DAYTONA_API_KEY,
  },
});
```

  4. Create the Daytona snapshot (one-time setup):

```bash
DAYTONA_API_KEY=your_key npx tsx scripts/create_rlm_snapshot.ts
```

This builds a sandbox image with Python 3.12, the rlms package, numpy, pandas, and scipy pre-installed.

Python (Default)

The Python runtime runs the official rlms package (from arXiv:2512.24601) inside the Daytona sandbox. This is the most battle-tested path.

Advantages:

  • Faithful implementation of the paper's algorithm
  • Concurrent llm_query_batched execution
  • Cost tracking via the rlms package's usage_summary
  • No action timeout limit (runs outside Convex)

Bun (Alternative)

The Bun runtime runs a native JavaScript implementation of the same algorithm using Bun + node:vm for sandboxed code execution.

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: { runtime: "bun" },
});
```

Advantages:

  • Faster startup (no Python interpreter boot)
  • Single universal script handles all task types

Limitations:

  • llm_query_batched runs sequentially (not concurrent)
  • Cost tracking not available (cost always returns null)

When runtimeFallbackToPython is true (the default), a Bun failure automatically retries with the Python backend.
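The fallback behavior can be pictured as a simple try/retry wrapper. The helper below is a hypothetical sketch mirroring what runtimeFallbackToPython: true does, not memcity's internal API; the mock backend stands in for a real rlmQuery execution.

```typescript
// Hypothetical sketch of the Bun -> Python fallback (names are illustrative).
type Runtime = "bun" | "python";

async function runWithFallback(
  run: (runtime: Runtime) => Promise<string>,
  fallbackToPython = true, // mirrors runtimeFallbackToPython
): Promise<{ answer: string; runtime: Runtime }> {
  try {
    return { answer: await run("bun"), runtime: "bun" };
  } catch (err) {
    if (!fallbackToPython) throw err;
    // Bun failed; retry the same query on the Python backend.
    return { answer: await run("python"), runtime: "python" };
  }
}

// Demo with a mock backend where the Bun path crashes:
const mockRun = async (runtime: Runtime): Promise<string> => {
  if (runtime === "bun") throw new Error("bun runtime crashed");
  return "answer from python";
};
runWithFallback(mockRun).then(r => console.log(r)); // falls back to the Python backend
```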

Sandbox cost: ~$0.00056 per 30-second execution (2 vCPUs, 4 GiB RAM default).

Cost and Budget Control

RLM queries are significantly more expensive than standard getContext calls because each query involves multiple LLM calls -- one per REPL iteration plus sub-LLM calls made via llm_query() inside the sandbox.

Estimated Costs Per Query

| Scenario | Iterations | Sub-calls | Estimated Cost |
|---|---|---|---|
| Simple analysis (small context) | 3-5 | 0-2 | $0.01 - $0.05 |
| Moderate analysis (medium context) | 5-10 | 5-10 | $0.05 - $0.30 |
| Deep analysis (large context, many sub-calls) | 10-20 | 10-30+ | $0.30 - $2.00+ |

Costs depend on your chosen models, context size, and how many iterations/sub-calls the LLM uses. Cheaper sub-models (via subModel) can significantly reduce costs.
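One way to budget ahead of time is a per-call back-of-envelope. The rates in the sketch below are made-up placeholders, not real prices; substitute average per-call costs for your chosen model and subModel.

```typescript
// Back-of-envelope RLM cost estimate. The default rates are placeholders;
// plug in averages for your actual root model and sub-model.
function estimateRlmCost(
  iterations: number,
  subCalls: number,
  rootCostPerCall = 0.02, // assumed avg USD per root-LLM iteration (placeholder)
  subCostPerCall = 0.002, // assumed avg USD per sub-LLM call (placeholder)
): number {
  return iterations * rootCostPerCall + subCalls * subCostPerCall;
}

// A "moderate analysis" query: 8 iterations, 8 sub-calls.
console.log(estimateRlmCost(8, 8).toFixed(3)); // "0.176"
```

This also shows why a cheap subModel matters: sub-calls often outnumber root iterations, so shrinking subCostPerCall dominates the total.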

Budget Controls

We strongly recommend setting a maxBudget to prevent runaway costs:

```ts
const result = await memory.rlmQuery(ctx, {
  orgId,
  query: "...",
  contextType: "documents",
  knowledgeBaseId: "kb_xyz",
  rlmConfig: {
    maxBudget: 1.00,       // Stop at $1.00 and return partial answer
    maxIterations: 15,     // Cap iterations (default: 20)
    maxTimeout: 300,       // Cap at 5 minutes (default: 600s / 10 min)
    maxErrors: 5,          // Stop after 5 consecutive REPL errors
    subModel: "google/gemini-2.0-flash-001", // Use a cheaper model for sub-calls
  },
});

// Check if the result was truncated due to limits
if (result.warning) {
  console.warn("RLM returned a partial result:", result.warning);
}
```

| Control | Default | Recommendation |
|---|---|---|
| maxBudget | null (unlimited) | Set to $0.50 - $2.00 for most use cases |
| maxIterations | 20 | 10-15 is sufficient for most queries |
| maxTimeout | 600s (10 min) | 300s is a good default for user-facing queries |
| maxErrors | null (unlimited) | Set to 3-5 to fail fast on bad queries |

When any limit is reached, RLM returns the best answer it has so far (a partial result) along with a warning field explaining what happened. It does not throw an error.

Cost-Saving Tips

  1. Use a cheap subModel -- Sub-LLM calls (llm_query()) inside the REPL are often simple extraction/summarization tasks. A fast model like gemini-2.0-flash works well and costs a fraction of larger models.
  2. Lower maxIterations -- Most queries resolve in 3-10 iterations. Setting the cap to 15 instead of 20 prevents edge cases from running too long.
  3. Use documents context sparingly -- Document context can be very large (up to 5,000 chunks). For targeted analysis, consider filtering to a specific knowledge base rather than loading everything.
  4. Monitor via session logs -- Check the rlmSessions table to track execution times and iteration counts across your queries. This helps you tune budgets based on real usage.

When to Use RLM

Good Use Cases

  • Cross-document analysis — "What are the common themes across all our support tickets?"
  • Contradiction detection — "Find any conflicting information in our documentation"
  • Comprehensive summarization — "Give me a complete overview of this user's history"
  • Pattern extraction — "What trends do you see in the last 6 months of conversations?"
  • Deep Q&A — Questions that require information from many different sources

Stick with getContext For

  • Simple factual lookups ("What is the refund policy?")
  • Queries where top-K results are sufficient
  • Latency-sensitive applications (RLM is 10-100x slower)
  • High-throughput scenarios (RLM uses many more LLM tokens)

Session Logging

Every RLM query is automatically logged to the rlmSessions table for auditing and debugging. Each session records:

  • Organization and user IDs
  • The query and context type
  • Context size (characters)
  • Backend used
  • Number of iterations and sub-LLM calls
  • Execution time
  • Status (completed, failed, timeout)
  • Error message (if failed)
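Those fields make it straightforward to tune budgets from real usage. The sketch below assumes a hypothetical record shape mirroring the list above; adapt the field names to the actual rlmSessions schema.

```typescript
// Sketch of summarizing RLM session logs; the record shape is an assumption.
interface RlmSession {
  orgId: string;
  status: "completed" | "failed" | "timeout";
  iterations: number;
  subCalls: number;
  executionMs: number;
}

function summarize(sessions: RlmSession[]) {
  const done = sessions.filter(s => s.status === "completed");
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return {
    completionRate: done.length / Math.max(sessions.length, 1),
    avgIterations: avg(done.map(s => s.iterations)),
    avgExecutionMs: avg(done.map(s => s.executionMs)),
  };
}

const sample: RlmSession[] = [
  { orgId: "org_a", status: "completed", iterations: 6, subCalls: 4, executionMs: 18000 },
  { orgId: "org_a", status: "completed", iterations: 10, subCalls: 12, executionMs: 45000 },
  { orgId: "org_a", status: "timeout", iterations: 20, subCalls: 30, executionMs: 600000 },
];
console.log(summarize(sample)); // completionRate ≈ 0.67, avgIterations: 8, avgExecutionMs: 31500
```

If timeouts dominate, raise maxTimeout or lower maxIterations; if average iterations sit well under your cap, you can tighten maxIterations safely.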

RLM Enrichment Pipeline

Beyond Q&A queries, the RLM engine also powers an enrichment pipeline that runs during document ingestion. Instead of querying your data at search time, it analyzes every document and chunk at ingestion time -- extracting entities, resolving pronouns, identifying key terms, and producing enriched metadata that improves search quality.

The enrichment pipeline adds three phases to ingestion:

  1. Pre-chunk structuring -- Analyze the full document for entities, relationships, and section boundaries
  2. Guided chunking -- Use section hints from Phase 1 to split text at logical boundaries
  3. Chunk-level enrichment -- Resolve pronouns, extract key terms, and summarize each chunk with doc-wide context

Enrichment data flows into embeddings, BM25 indexes, and reranker inputs -- so every search benefits from the analysis.
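As a rough illustration of phase 3, the sketch below enriches one chunk using a doc-wide entity map. The helper name, the enrichment shape, and the string-based pronoun resolution are all stand-ins for illustration; the real pipeline drives this step with LLM calls inside the RLM engine.

```typescript
// Illustrative phase-3 sketch (names and shapes are assumptions, not the real API).
interface EnrichedChunk {
  text: string;          // original chunk text
  resolvedText: string;  // pronouns replaced using doc-wide context
  keyTerms: string[];    // terms that would feed BM25 and embeddings
}

// Stub resolver: the real pipeline would ask an LLM, passing the document's
// entity map as context; here we do a naive string substitution.
function enrichChunk(chunk: string, entityMap: Record<string, string>): EnrichedChunk {
  let resolvedText = chunk;
  for (const [pronoun, entity] of Object.entries(entityMap)) {
    resolvedText = resolvedText.split(pronoun).join(entity);
  }
  const keyTerms = [...new Set(resolvedText.toLowerCase().match(/[a-z]{5,}/g) ?? [])];
  return { text: chunk, resolvedText, keyTerms };
}

const enriched = enrichChunk("It supports budget caps.", { "It": "The RLM engine" });
console.log(enriched.resolvedText); // "The RLM engine supports budget caps."
```

The payoff is that a chunk like "It supports budget caps." becomes searchable for "RLM engine" even though the entity name never appeared in the chunk itself.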

For full details, see the dedicated RLM Enrichment Pipeline documentation.

Availability

RLM is available on the Team tier only. It is not included in Community or Pro distributions.