First Log - REALM Experiments
Context Curation: Preliminary REALM Tests
February 3, 2026
Estimated read: 5 min
This log evaluates a core hypothesis: an agent should be able to navigate a document by reading a table of contents in small steps, instead of loading the full text into every prompt. I tested a minimal read loop across multiple document sizes and two queries. In these runs, both cloud models reached the correct section within two iterations across all sizes. That is the baseline signal needed before expanding toward the broader REALM loop for production context curation.
- Run 1: gpt-4o-mini - February 3, 2026 - 04:33:52 UTC
- Run 2: gpt-5-nano - February 3, 2026 - 04:40:19 UTC
- Scope: read loop only (navigate + accumulate context)
- Queries: Authentication and rate limiting
- Max iterations: 5
REALM as Context Curation
This post treats REALM as a context curation system. The model does not receive the full document up front. Instead, it progressively pulls in only the sections that appear relevant to the current question. The mechanism is intentionally simple: a table of contents, a constrained list of candidate next sections, and a loop that appends selected sections into the working context.
This matters because common alternatives do not scale cleanly. Full-text prompting scales linearly with content size. Embedding-based retrieval can flatten or ignore the hierarchy that documentation already provides. Manual curation does not scale when both content volume and query volume increase. REALM attempts to preserve document structure while keeping per-call context bounded.
This first log focuses on navigation accuracy and context growth. The next phase is to expand toward the full loop and its operational pitfalls, including stop conditions, confidence estimation, guardrails, and richer tool and document graphs.
What I Tested
I compared two approaches across five document sizes (9 to 260 sections) and two queries, with a maximum of 5 iterations.
- Iterative REALM: select a section, read it, and repeat until the relevant section is located
- Full-text: attempt a one-shot selection from the entire document
Queries Used
- How do I authenticate API requests?
- What are the rate limits?
Data source: multi-size-experiment-2026-02-03.json
Basic Loop
available_sections = toc(document)
selected_sections = []

for iteration in range(1, 6):  # max 5 iterations
    prompt = build_prompt(query, available_sections, selected_sections)
    next_section = llm_select(prompt)
    selected_sections.append(next_section)
    available_sections.remove(next_section)
    if has_answer(selected_sections, query):
        # early-stop candidate: record the first iteration with enough evidence
        record(iteration, selected_sections)
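For concreteness, here is a minimal sketch of what the two helpers in the loop might look like. The section dicts (id, title, text), the prompt wording, and the OpenAI-style client are assumptions for illustration, not the exact setup behind these runs.

    from openai import OpenAI

    client = OpenAI()

    def build_prompt(query, available_sections, selected_sections):
        # Constrained menu: only the remaining ToC entries are offered as candidates.
        menu = "\n".join(f"- {s['id']}: {s['title']}" for s in available_sections)
        # Accumulated context: the text of every section read so far.
        context = "\n\n".join(s["text"] for s in selected_sections) or "(nothing read yet)"
        return (
            f"Question: {query}\n\n"
            f"Sections read so far:\n{context}\n\n"
            f"Pick the single most relevant next section from this list:\n{menu}\n\n"
            "Reply with the section id only."
        )

    def llm_select(prompt):
        # Any chat-completion endpoint works; the loop only needs the raw reply back.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

In the loop, the raw id reply still has to be mapped back to the matching entry in available_sections before it is appended; the validation sketch later in this post does that lookup explicitly.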
The focus of this post is the selection step. Can the model consistently choose a useful next section from a constrained menu, and continue doing so as the document grows?
Result Snapshot
gpt-4o-mini Token Usage by Doc Size
[Chart: iterative vs. full-text average total tokens (tokensTotal) across both queries, by document size from Small to XXXLarge.]
gpt-5-nano Token Usage by Doc Size
[Chart: iterative vs. full-text average total tokens (tokensTotal) across both queries, by document size from Small to XXXLarge.]
What This Suggests
- Both approaches scale with document size: totals increase predictably from Small to XXXLarge.
- gpt-4o-mini converges: iterative and full-text totals become similar at larger sizes.
- gpt-5-nano shows a larger gap: iterative remains higher than full-text on total tokens under a fixed 5-iteration budget.
Full-Text Is a Fragile Interface
Full-text prompting appears simpler, but its failure mode is difficult to manage operationally. The model must scan thousands of tokens, select one section, and return it in a strict format. In practice, it can select an irrelevant section, return an invalid identifier, or produce a plausible answer that is not grounded in the document. Because the call is single-shot, there is no built-in recovery step.
Reliability Comparison
| Aspect | Full-Text | Iterative (REALM) |
|---|---|---|
| Task complexity | Scan the entire document and select one section in a single step | Select from a constrained menu at each step |
| Error recovery | None. Single-shot selection | Iterative correction across steps |
| Failure predictability | Harder to anticipate across documents and queries | More predictable because each step is constrained |
| Fix difficulty | Harder. Prompt tuning has diminishing returns at scale | Easier. Improvements are mostly constraint clarity and validation |
| Production posture | Higher operational risk | More reliable through bounded steps and graceful degradation |
Current view: iterative selection tends to be more operationally reliable, even when it is not always the lowest-token option for very small documents.
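To make the validation point concrete: because the iterative loop offers a constrained menu, an off-menu or malformed reply is detectable and can simply be retried. A minimal sketch, reusing the hypothetical build_prompt and llm_select helpers from the basic loop above:

    def select_with_validation(query, available_sections, selected_sections, retries=1):
        # Anything not in the remaining ToC is rejected; the call is retried instead
        # of silently accepting an invalid identifier.
        valid = {s["id"]: s for s in available_sections}
        reply = None
        for _ in range(retries + 1):
            prompt = build_prompt(query, available_sections, selected_sections)
            reply = llm_select(prompt)
            if reply in valid:
                return valid[reply]
        raise ValueError(f"No valid section id after {retries + 1} attempts: {reply!r}")

This is the graceful-degradation row in practice: a bad step is caught at the step, not at the end of the pipeline.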
Early Stopping Is the Biggest Multiplier
The clearest signal is not the 5-iteration total. It is the iteration at which the loop first reaches the correct section. If the loop stops as soon as it has enough evidence to answer, token usage can drop substantially. To keep the comparison aligned with the earlier charts, the view below compares one-shot full-text against stop-at-first-correct, averaged across both queries.
gpt-4o-mini Early Stopping Savings by Doc Size
[Chart: average full-text tokensTotal vs. average tokensToCorrect (stop at first correct) across both queries, by document size from Small to XXXLarge.]
gpt-5-nano Early Stopping Savings by Doc Size
[Chart: average full-text tokensTotal vs. average tokensToCorrect (stop at first correct) across both queries, by document size from Small to XXXLarge.]
What This Suggests
- Small documents can still favor full-text: one-shot can be cheaper at the smallest input sizes.
- As documents grow, early stopping tends to win: stop-at-first-correct reduces token usage sharply for larger inputs.
- Early stopping is architectural: it should be a first-class control, not an optional optimization (see the sketch below).
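A minimal sketch of what a first-class stop condition could look like, assuming has_answer is backed by a cheap yes/no sufficiency check (illustrative only, not the exact check behind these runs):

    from openai import OpenAI

    client = OpenAI()

    def has_answer(selected_sections, query):
        # Ask whether the accumulated sections already contain enough evidence;
        # the loop breaks on the first YES instead of spending the full budget.
        context = "\n\n".join(s["text"] for s in selected_sections)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {query}\n\nContext:\n{context}\n\n"
                    "Can the question be answered from this context alone? Reply YES or NO."
                ),
            }],
        )
        return response.choices[0].message.content.strip().upper().startswith("YES")

With a check like this in place, the record call in the basic loop becomes record-and-break, which is the behavior the stop-at-first-correct comparison above approximates.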
Per-Iteration Growth Stays Manageable
Context grows each iteration because selected sections accumulate. In a representative medium run, growth remained controlled while output tokens stayed small, since the model was not producing long explanations during selection.
This is promising for smaller local models. Keeping each call bounded to a much smaller context window can make the loop more practical on constrained hardware, especially when stop conditions keep the number of iterations low.
Per-Iteration Token Growth (Medium, Auth Query, gpt-4o-mini)
| Iter | Input | Output | Total | Cumulative |
|---|---|---|---|---|
| 1 | 816 | 12 | 828 | 828 |
| 2 | 1,018 | 7 | 1,025 | 1,853 |
| 3 | 1,068 | 9 | 1,077 | 2,930 |
| 4 | 1,179 | 9 | 1,188 | 4,118 |
| 5 | 1,336 | 7 | 1,343 | 5,461 |
Input grows as selected sections accumulate, but each step remains within a narrow band. That makes the loop easier to reason about and budget.
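One way to turn that into an explicit budget is a per-call guard. The limit and the rough four-characters-per-token estimate below are assumptions, not values taken from these runs:

    MAX_INPUT_TOKENS = 4_000  # hypothetical per-call budget

    def within_budget(prompt, limit=MAX_INPUT_TOKENS):
        # Rough estimate (~4 characters per token) so the loop can refuse to grow
        # a call past its budget without depending on a specific tokenizer.
        return len(prompt) / 4 <= limit

If a step would exceed the budget, the loop can stop and answer from what it has already read rather than letting context grow unbounded.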
Per-Call Context Windows Stay Small
Max Single-Iteration Input (gpt-4o-mini)
| Document Size | Full-Text Input | REALM Max Single Iteration | Reduction |
|---|---|---|---|
| Small (9 sections) | 443 | 398 | 10% |
| Medium (78 sections) | 5,288 | 1,336 | 75% |
| XLarge (124 sections) | 8,878 | 1,804 | 80% |
| XXLarge (195 sections) | 12,052 | 2,538 | 79% |
| XXXLarge (260 sections) | 15,081 | 3,245 | 78% |
This is the practical value of the loop. It converts a large-document problem into a sequence of smaller, bounded calls.
Takeaways from This First Pass
- The baseline loop is viable. Iterative section selection remains stable as content grows.
- For gpt-4o-mini, iterative becomes token-competitive around 30 KB documents and remains competitive beyond that.
- Full-text selection is operationally fragile because it provides no recovery path when it selects the wrong section.
- Early stopping should be a first-class control. In these runs, it typically reduced a 5-iteration budget to about 2 iterations.
- Next step: evaluate a local small language model on the same loop, then extend toward the full REALM design with explicit stop criteria, confidence signals, and validation.