• Continuation of: Context Curation: Preliminary REALM Tests
  • Run: gemma3:1b local SLM (Ollama), executed February 3, 2026; write-up finalized February 4, 2026
  • Scope: same read-loop benchmark shape as 02-03, focused on Medium to XXXLarge documents
  • Queries: authentication and rate-limiting
  • Max iterations: 5
REALM basic loop diagram showing document, context state, prompt, and next-section selection
Same baseline loop as the 02-03 experiment: constrained section selection with iterative context accumulation.
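
For concreteness, here is a minimal sketch of that loop in Python against the Ollama API. The harness names (`run_read_loop`, `sections`, the prompt wording) are illustrative assumptions, not the actual benchmark code; the model name and the 5-iteration cap match this run, and `prompt_eval_count`/`eval_count` are Ollama's input/output token counters.

```python
# Minimal sketch of the REALM read loop (illustrative; not the benchmark harness).
import ollama

MAX_ITERATIONS = 5  # cap used in this run


def run_read_loop(sections: dict[str, str], query: str, model: str = "gemma3:1b"):
    """sections maps section IDs to section text; returns chosen IDs and token totals."""
    context_state: list[str] = []  # accumulated text of sections read so far
    visited: list[str] = []
    tokens_in = tokens_out = 0

    for _ in range(MAX_ITERATIONS):
        allowed = [sid for sid in sections if sid not in visited]
        # The prompt carries the query, the context accumulated so far, and the
        # allowed section IDs; the model must reply with exactly one allowed ID.
        prompt = (
            f"Query: {query}\n\n"
            f"Context so far:\n{''.join(context_state) or '(empty)'}\n\n"
            f"Allowed section IDs: {', '.join(allowed)}\n"
            "Reply with exactly one ID from the list."
        )
        resp = ollama.generate(model=model, prompt=prompt)
        tokens_in += resp.get("prompt_eval_count") or 0
        tokens_out += resp.get("eval_count") or 0

        choice = resp["response"].strip()
        if choice not in sections:  # invalid ID: the Small-document failure mode below
            break
        visited.append(choice)
        context_state.append(sections[choice])

    return visited, tokens_in, tokens_out
```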

Why This Follow-up

The 02-03 post established that the loop mechanics work on cloud models. This continuation asks a narrower question: can a very small local model run the same loop in a useful way for larger documents?

The answer from this first pass is yes for Medium through XXXLarge. On those sizes, gemma3:1b is consistently cheaper than full-text in token terms, and still follows the same constrained section-navigation pattern.

Data for This Run

Results and extracted chart payload:

gemma3:1b vs Full-Text (Same Loop Setup)

gemma3:1b Token Usage by Document Size

| Document | Size | Sections | Iterative Avg Tokens | Full-Text Avg Tokens | Difference |
|---|---|---|---|---|---|
| Medium | 18.1KB | 78 | 4,260 | 5,364 | -1,104 (-20.6%) |
| XLarge | 30.1KB | 124 | 5,793 | 8,800 | -3,008 (-34.2%) |
| XXLarge | 41.6KB | 195 | 8,214 | 12,327 | -4,113 (-33.4%) |
| XXXLarge | 52.6KB | 260 | 10,524 | 15,679 | -5,155 (-32.9%) |

Average across both queries. Negative difference means iterative used fewer tokens than full-text.
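
As a quick check on how the Difference column is derived, the percentage is the token delta relative to the full-text average; reproducing the Medium row:

```python
# Medium row of the table above.
iterative, full_text = 4_260, 5_364
diff = iterative - full_text       # -1104
pct = 100 * diff / full_text       # -20.58...
print(f"{diff:+,} ({pct:+.1f}%)")  # -1,104 (-20.6%)
```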

gemma3:1b Token Usage by Doc Size

[Chart: average tokensTotal across both queries, full-text vs iterative, for Medium, XLarge, XXLarge, and XXXLarge.]

gemma3:1b Early Stopping Savings

[Chart: average full-text tokensTotal vs average tokensToCorrect, for Medium, XLarge, XXLarge, and XXXLarge.]

Comparison with the 02-03 Models

Iterative Tokens: Local SLM vs Prior Cloud Runs

| Document | gemma3:1b Iterative | gpt-4o-mini Iterative | gpt-5-nano Iterative | gemma3:1b Position |
|---|---|---|---|---|
| Medium | 4,260 | 5,326 | 8,874 | 20% lower than gpt-4o-mini; 52% lower than gpt-5-nano |
| XLarge | 5,793 | 7,654 | 12,255 | 24% lower than gpt-4o-mini; 53% lower than gpt-5-nano |
| XXLarge | 8,214 | 11,275 | 15,613 | 27% lower than gpt-4o-mini; 47% lower than gpt-5-nano |
| XXXLarge | 10,524 | 14,742 | 20,515 | 29% lower than gpt-4o-mini; 49% lower than gpt-5-nano |

Cross-model view uses average iterative tokens across both queries on the shared size range (Medium to XXXLarge).

Model-Level Efficiency on Shared Sizes

| Model | Iterative Avg Tokens | Full-Text Avg Tokens | Iterative - Full-Text |
|---|---|---|---|
| gpt-4o-mini | 9,749 | 10,327 | -578 (-5.6%) |
| gpt-5-nano | 14,314 | 10,646 | +3,668 (+34.5%) |
| gemma3:1b | 7,198 | 10,542 | -3,344 (-31.7%) |

Model-level averages across Medium to XXXLarge only.

This does not claim that gemma3:1b is stronger overall than the larger cloud models. It does show that, within this constrained navigation task, a 1B local model can be operationally useful and token-competitive when paired with the REALM loop design.

Convergence Behavior

First-Correct and Early-Stop Comparison

| Model | Avg First-Correct Iteration | Avg Tokens to Correct | Observation |
|---|---|---|---|
| gpt-4o-mini | 2.00 | 3,592 | Stable and early across sizes in this range |
| gpt-5-nano | 2.00 | 4,719 | Early convergence with higher token overhead |
| gemma3:1b | 2.75 | 3,921 | Fast on auth; slower on rate-limit queries at larger sizes |

Averages computed from iterative runs with available first-correct markers.

Two patterns stand out:

  1. gemma3:1b is very strong on the authentication query (first correct at iteration 2 across sizes).
  2. It takes longer on the rate-limiting query at larger sizes (first correct at iteration 4), which lowers early-stop savings there (see the sketch below).
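
As a sketch of how the tokens-to-correct numbers behave, assuming each iterative run logs per-iteration token totals plus a 1-indexed first-correct marker (the field names here are assumptions):

```python
def tokens_to_correct(iteration_totals: list[int], first_correct: int) -> int:
    """Tokens consumed up to and including the first correct selection."""
    return sum(iteration_totals[:first_correct])

# Representative Medium/auth run (see the per-iteration growth table below):
totals = [675, 825, 866, 938, 972]
print(tokens_to_correct(totals, first_correct=2))  # 1500, vs 4,276 for all 5 iterations
```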

Per-Call Context Remains Small

Per-Call Context Window (gemma3:1b)

| Document | Full-Text Input | gemma3:1b Max Single-Iteration Input | Reduction |
|---|---|---|---|
| Medium | 5,363 | 931 | 83% |
| XLarge | 8,796 | 1,237 | 86% |
| XXLarge | 12,318 | 1,729 | 86% |
| XXXLarge | 15,672 | 2,199 | 86% |

Per-call context stays substantially smaller than full-text, which is the key SLM usability signal.

The loop keeps each call bounded: every iteration sends only the accumulated context state plus the selection prompt rather than the whole document, which is what makes CPU-local inference practical here.
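
The Reduction column above is just the relative size of the largest single call; a quick reproduction:

```python
# Reproduce the Reduction column: how much smaller the largest single-call
# input is than the full-text input.
runs = {  # document: (full_text_input, max_single_iteration_input)
    "Medium":   (5_363,    931),
    "XLarge":   (8_796,  1_237),
    "XXLarge":  (12_318, 1_729),
    "XXXLarge": (15_672, 2_199),
}
for doc, (full, peak) in runs.items():
    print(f"{doc}: {1 - peak / full:.0%} smaller per call")  # 83%, 86%, 86%, 86%
```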

Representative Iteration Growth

Per-Iteration Token Growth (Medium, Auth Query, gemma3:1b)

| Iter | Input | Output | Total | Cumulative |
|---|---|---|---|---|
| 1 | 622 | 53 | 675 | 675 |
| 2 | 789 | 36 | 825 | 1,500 |
| 3 | 825 | 41 | 866 | 2,366 |
| 4 | 899 | 39 | 938 | 3,304 |
| 5 | 931 | 41 | 972 | 4,276 |

Representative run: Medium document, auth query, gemma3:1b iterative. First correct section appears at iteration 2.
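
The Total and Cumulative columns follow directly from the per-iteration input/output counts:

```python
from itertools import accumulate

# Rebuild the Total and Cumulative columns of the growth table above.
inputs  = [622, 789, 825, 899, 931]
outputs = [ 53,  36,  41,  39,  41]
totals  = [i + o for i, o in zip(inputs, outputs)]
print(totals)                    # [675, 825, 866, 938, 972]
print(list(accumulate(totals)))  # [675, 1500, 2366, 3304, 4276]
```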

Known Failure Mode in This Pass

The Small document case (9 sections) was excluded from this JSON run because iterative selection failed with invalid section IDs (for example, "1.2"). The failure looks like a prompt/schema-clarity problem for tiny local models rather than a flaw in the loop design.
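
A cheap guard, as a sketch (the function name and retry policy are assumptions): accept the reply only on an exact match against the allowed IDs, and otherwise retry with a corrective hint.

```python
def parse_section_id(reply: str, allowed: set[str]) -> str | None:
    """Accept the model's reply only if it is exactly one allowed section ID.

    Invented IDs like "1.2" (seen in the Small-document runs) are rejected,
    and the caller can retry with a hint that re-lists the allowed IDs.
    """
    candidate = reply.strip().strip('"\'`').strip()
    return candidate if candidate in allowed else None
```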

Full-Text Reliability in This Local SLM Run

| Document | Full-Text Attempts | Incorrect Selections | Example Full-Text Output |
|---|---|---|---|
| Medium | 2 | 2 | `3` |
| XLarge | 2 | 2 | `--- Section: 1 ---` |
| XXLarge | 2 | 2 | `--- Section: initialization ---` |
| XXXLarge | 2 | 2 | `--- Section: data-sources` |

In this local SLM run set, full-text responses were incorrect for all eight attempts.

Next Steps from This SLM Pass

  1. Tighten section-ID constraints in the prompt (explicit allowed IDs + exact-match requirement); see the prompt sketch after this list.
  2. A/B test prompt variants on Small documents to recover reliability.
  3. Move to the next loop component, the Editor/Analyzer: can the local SLM not only pick sections but also extract and summarize them effectively for downstream use?
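
For item 1, the tightened selection prompt could look something like the sketch below; the wording is an assumption, not the current benchmark prompt.

```python
def build_selection_prompt(query: str, context: str, allowed_ids: list[str]) -> str:
    """Illustrative tightened prompt: allowed IDs are enumerated explicitly
    and the output format is pinned to a single exact-match token."""
    id_list = "\n".join(f"- {sid}" for sid in allowed_ids)
    return (
        f"Query: {query}\n\n"
        f"Context read so far:\n{context or '(none)'}\n\n"
        "Choose the ONE section to read next. Valid section IDs:\n"
        f"{id_list}\n\n"
        "Rules:\n"
        "- Output exactly one ID, copied character-for-character from the list.\n"
        "- Do not invent IDs (e.g. '1.2' is invalid unless it is listed).\n"
        "- No prose, no punctuation, no 'Section:' prefix."
    )
```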