Fifth Log - Writing Experiments
Writing Experiment Rerun: gemma3:1b at 82.0% Fair+Good
February 25, 2026
Estimated read: 8 min
I reran gemma3:1b on 50 story scenes with 2 samples each, then rescored with a documentation-ready rubric: If I appended this sentence to the scene, would I keep it?
This post is a follow-up to the earlier baseline write-up.
The goal here is practical drafting throughput. I want to know whether gemma3:1b can produce a usable next sentence often enough to keep momentum when writing scene-by-scene.
Run summary
- Run timestamp: February 25, 2026 (2026-02-25T02:27:32.328Z)
- Model: gemma3:1b
- Scope: 50 story cases, 2 samples per case (100 total outputs)
- Main goal: maximize usable continuations (fair + good) for next-sentence drafting
Data
Main result: Fair+Good ratio
Quality Distribution (Rerun)
| Quality | Count | Percent |
|---|---|---|
| Good | 5 / 100 | 5.0% |
| Fair | 77 / 100 | 77.0% |
| Bad | 18 / 100 | 18.0% |
| Incoherent | 0 / 100 | 0.0% |
| Fair + Good | 82 / 100 | 82.0% |
Primary decision metric in this rerun is Fair + Good (usable continuation sentence).
The decision threshold for this run is simple: if a generated sentence is fair or good, it is usable in draft flow. On that measure, this rerun lands at 82.0% usable.
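As a quick sketch of how the headline number falls out of the per-output labels (label names match the rubric below; counts are the ones in the table above):

```python
# Tally per-output quality labels into the Fair+Good decision metric.
# Counts here are taken from this rerun's quality distribution table.
from collections import Counter

labels = ["good"] * 5 + ["fair"] * 77 + ["bad"] * 18  # 100 outputs
counts = Counter(labels)

usable = counts["good"] + counts["fair"]
print(f"Fair+Good: {usable}/{len(labels)} = {usable / len(labels):.1%}")
# Fair+Good: 82/100 = 82.0%
```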
Compared with the previous gemma3:1b run
Baseline vs Rerun
| Run | Model | Good + Fair | Bad | One-Sentence Violations | One-Sentence Compliant |
|---|---|---|---|---|---|
| 2026-02-23 baseline | gemma3:1b | 7 / 24 (29.2%) | 17 / 24 (70.8%) | 15 / 24 (62.5%) | 9 / 24 (37.5%) |
| 2026-02-25 rerun | gemma3:1b | 82 / 100 (82.0%) | 18 / 100 (18.0%) | 0 / 100 (0.0%) | 100 / 100 (100.0%) |
Baseline is from the previous post and uses a smaller set; the prompt setup and scoring rubric were updated in this rerun.
The baseline reference is the post linked above. This rerun expands the test suite and updates scoring, so the comparison is directional, not perfectly apples-to-apples. The deltas are still large enough to be meaningful for workflow planning:
- fair+good: 29.2% -> 82.0% (+52.8 points)
- bad: 70.8% -> 18.0% (-52.8 points)
- one-sentence violations: 62.5% -> 0.0% (-62.5 points)
- one-sentence compliance: 37.5% -> 100.0% (+62.5 points)
Case-level reliability (2 shots per scene)
Case-Level Usability
| Case-Level Outcome (2 samples per case) | Count | Percent |
|---|---|---|
| At least 1 Fair/Good sentence | 44 / 50 | 88.0% |
| Both sentences Fair/Good | 38 / 50 | 76.0% |
| Exactly 1 Fair/Good sentence | 6 / 50 | 12.0% |
| No usable sentence (both bad) | 6 / 50 | 12.0% |
This is the practical reliability view for writing flow when generating two options per scene.
For story production, this is the ratio that matters most. With two attempts per scene, 44 out of 50 scenes produced at least one usable next sentence.
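One sanity check worth doing here (my own back-of-envelope, not part of the run tooling): if the two samples per scene failed independently at the per-sentence Bad rate of 18%, both samples would be bad far less often than observed.

```python
# If the two samples per scene failed independently at the per-sentence
# Bad rate, how often would BOTH be bad? Compare with the observed rate.

bad_rate = 0.18                 # per-sentence Bad rate from this rerun
both_bad_if_independent = bad_rate ** 2
observed_both_bad = 6 / 50      # "No usable sentence" row above

print(f"predicted both-bad (independent): {both_bad_if_independent:.1%}")
print(f"observed both-bad:                {observed_both_bad:.1%}")
```

The independent-failure prediction is about 3.2%, well below the observed 12.0%, which suggests failures cluster on hard scenes rather than falling randomly across samples. That is consistent with the reversal-type breakdown later in this post, where negative-turn prompts carry most of the Bad outputs.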
Good-scene likelihood after N sentences (same math as previous post)
Using the same compounding-risk framing from the previous write-up:
P(good scene after N sentences) = (0.82)^N
P(at least 1 bad sentence by N) = 1 - (0.82)^N
Compounding Scene Quality Over N Sentences
| Sentences (N) | P(good scene, all Fair/Good) | P(at least 1 Bad) |
|---|---|---|
| 10 | 13.74% | 86.26% |
| 20 | 1.89% | 98.11% |
| 40 | 0.036% | 99.964% |
Same compounding method as the previous post, using this run's per-sentence rates: Fair+Good = 82.0%, Bad = 18.0%.
Even with much better single-sentence quality, long scenes still compound failure risk if each sentence is generated independently.
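The table above can be reproduced in a couple of lines from this run's per-sentence rate:

```python
# Reproduce the compounding table from this run's per-sentence usable rate.

fair_good = 0.82  # Fair+Good rate per sentence in this rerun

for n in (10, 20, 40):
    p_good_scene = fair_good ** n   # every sentence in the scene is usable
    p_any_bad = 1 - p_good_scene    # at least one bad sentence slipped in
    print(f"N={n:2d}  good scene: {p_good_scene:.3%}  at least 1 bad: {p_any_bad:.3%}")
```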
If we pick the best of two generations per sentence
This is the practical workflow version of the same question.
- Observed per-step failure after best-of-2 in this run: 6/50 = 12.0%

P(good scene after N sentences) = (0.88)^N
P(at least 1 bad sentence that ruins the scene by N) = 1 - (0.88)^N
Compounding Risk with Best-of-2 Selection
| Sentences (N) | P(good scene with best-of-2 each step) | P(at least 1 Bad after best-of-2) |
|---|---|---|
| 10 | 27.85% | 72.15% |
| 20 | 7.76% | 92.24% |
| 40 | 0.602% | 99.40% |
Uses observed two-shot case reliability from this run: 44/50 had at least one Fair/Good, so per-step failure after best-of-2 is 6/50 = 12.0%.
Picking the better of two helps a lot versus single-shot generation, but the risk still compounds over longer scenes.
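The same loop with the observed best-of-2 success rate gives the table above:

```python
# Same compounding math, with each step using the observed best-of-2
# success rate (44/50 scenes produced at least one usable sentence).

best_of_2 = 44 / 50  # 0.88 per-step success after picking the better of two

for n in (10, 20, 40):
    p_good = best_of_2 ** n
    print(f"N={n:2d}  good scene: {p_good:.2%}  at least 1 bad: {1 - p_good:.2%}")
```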
Breakdown by reversal type
Quality by Reversal Type
| Reversal Type | Good | Fair | Bad | Fair + Good |
|---|---|---|---|---|
| Positive (34 outputs) | 3 (8.8%) | 28 (82.4%) | 3 (8.8%) | 31 / 34 (91.2%) |
| Neutral (32 outputs) | 1 (3.1%) | 25 (78.1%) | 6 (18.8%) | 26 / 32 (81.3%) |
| Negative (34 outputs) | 1 (2.9%) | 24 (70.6%) | 9 (26.5%) | 25 / 34 (73.5%) |
Negative-turn prompts are still the hardest slice in this rerun.
Positive and neutral turns are performing well. Negative-turn prompts remain the main error bucket.
Sample good sentences from this rerun
Note - Harbor return (positive)
At the end she found only black water, a loop of severed rope, and her own breath pooling white in the cold. The salt spray carried the faint scent of woodsmoke from a distant, abandoned shack.
Note - Friendly dragon (positive)
It stared at the cookie jar on the top shelf and made a hopeful noise. It made a tiny, wheezing sound, like a rusty hinge, as it nibbled on a crumb.
Note - Wildfire evacuation (negative)
A siren wails somewhere close, then warps as the sound hits the smoke and comes back wrong. The smell of burnt plastic hangs heavy, thick enough to taste, and a distant siren screams a fractured melody.
Rerun rubric for quality
This rerun uses a keep-or-cut standard:
If I appended this sentence to the scene, would I keep it?
Rule compliance only matters when it breaks the scene.
quality = good
Meaning: I would keep this sentence in a draft with little or no editing.
How it was evaluated:
- Scene fit: clearly continues what is happening now in the provided scene.
- POV + tense + voice: consistent with requested POV and tense, and the story voice.
- Readability: flows cleanly when appended to sceneSoFar.
- Specificity: concrete, grounded details instead of generic filler.
- Narrative function: adds momentum or sharpens the moment without premature resolution.
quality = fair
Meaning: usable draft sentence; likely needs light revision or could be replaced by a stronger option.
How it was evaluated:
- mostly fits scene continuity, but can be generic, low-energy, or slightly off-emphasis.
- does not break POV or tense, though it may drift toward abstraction.
- reads fine appended, but does not land a strong next beat.
- any rule issues are minor and do not harm continuity.
quality = bad
Meaning: not usable as-is because it harms scene continuity or misses prompt intent.
How it was evaluated:
- breaks requested POV or tense, or violates limited POV.
- contradicts scene details or introduces implausible derailments.
- voice mismatch is noticeable for the prompt.
- becomes abstract or meta instead of story continuation.
- hard rule violations that materially change story behavior.
quality = incoherent
Meaning: completely unusable.
How it was evaluated:
- grammatically or logically unreadable in context.
- does not connect to the scene at all.
- contradictions are severe enough that action cannot be parsed.
Reserved for true failures, not merely weak writing.
Notes
This rerun is designed around practical writing workflow. The question is whether gemma3:1b can produce a decent next sentence often enough to keep drafting momentum. On this pass, the fair+good ratio indicates it can in most scenes, while negative-turn scenes remain the next improvement target.