This post is a follow-up to the earlier baseline write-up:

The goal here is practical drafting throughput. I want to know whether gemma3:1b can produce a usable next sentence often enough to keep momentum when writing scene-by-scene.

Run summary

  • Run timestamp: February 25, 2026 (2026-02-25T02:27:32.328Z)
  • Model: gemma3:1b
  • Scope: 50 story cases, 2 samples per case (100 total outputs)
  • Main goal: maximize usable continuations (fair + good) for next-sentence drafting

Data

Main result: Fair+Good ratio

Quality Distribution (Rerun)

Quality | Count | Percent
Good | 5 / 100 | 5.0%
Fair | 77 / 100 | 77.0%
Bad | 18 / 100 | 18.0%
Incoherent | 0 / 100 | 0.0%
Fair + Good | 82 / 100 | 82.0%

Primary decision metric in this rerun is Fair + Good (usable continuation sentence).

The decision threshold for this run is simple: if a generated sentence is fair or good, it is usable in draft flow. On that measure, this rerun lands at 82.0% usable.
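As a rough sketch of how the distribution reduces to the decision metric, the snippet below tallies the per-label counts from the table and computes the usable share. The names `counts`, `USABLE`, and `usable_rate` are illustrative and are not taken from the actual evaluation harness.

```python
# Counts copied from the Quality Distribution table.
# All names here are illustrative, not the real harness.
counts = {"good": 5, "fair": 77, "bad": 18, "incoherent": 0}

# Labels that count as a usable continuation under this run's threshold.
USABLE = {"good", "fair"}


def usable_rate(counts):
    """Share of outputs labeled fair or good."""
    total = sum(counts.values())
    usable = sum(n for label, n in counts.items() if label in USABLE)
    return usable / total


print(f"Fair + Good: {usable_rate(counts):.1%}")
```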

Compared with the previous gemma3:1b run

Baseline vs Rerun

Run | Model | Good + Fair | Bad | One-Sentence Violations | One-Sentence Compliant
2026-02-23 baseline | gemma3:1b | 7 / 24 (29.2%) | 17 / 24 (70.8%) | 15 / 24 (62.5%) | 9 / 24 (37.5%)
2026-02-25 rerun | gemma3:1b | 82 / 100 (82.0%) | 18 / 100 (18.0%) | 0 / 100 (0.0%) | 100 / 100 (100.0%)

Baseline is from the previous post and uses a smaller set; the prompt setup and scoring rubric were updated in this rerun.

The baseline reference is the post linked above. This rerun expands the test suite and updates scoring, so the comparison is directional, not perfectly apples-to-apples. The deltas are still large enough to be meaningful for workflow planning:

  • fair+good: 29.2% -> 82.0% (+52.8 points)
  • bad: 70.8% -> 18.0% (-52.8 points)
  • one-sentence violations: 62.5% -> 0.0% (-62.5 points)
  • one-sentence compliance: 37.5% -> 100.0% (+62.5 points)
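The deltas above are percentage-point differences. A minimal sketch of that arithmetic, working from the raw counts in the comparison table, follows. The dictionary keys and the `point_delta` helper are hypothetical names chosen for illustration.

```python
# (numerator, denominator) pairs copied from the Baseline vs Rerun table.
baseline = {"usable": (7, 24), "bad": (17, 24),
            "violations": (15, 24), "compliant": (9, 24)}
rerun = {"usable": (82, 100), "bad": (18, 100),
         "violations": (0, 100), "compliant": (100, 100)}


def point_delta(before, after):
    """Change in percentage points between two (count, total) pairs."""
    b = before[0] / before[1]
    a = after[0] / after[1]
    return round((a - b) * 100, 1)


deltas = {key: point_delta(baseline[key], rerun[key]) for key in baseline}
for key, delta in deltas.items():
    print(f"{key}: {delta:+.1f} points")
```

Because the two runs use different sample sizes (24 vs 100), the comparison has to go through rates rather than raw counts.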

Case-level reliability (2 shots per scene)

Case-Level Usability

Case-Level Outcome (2 samples per case) | Count | Percent
At least 1 Fair/Good sentence | 44 / 50 | 88.0%
Both sentences Fair/Good | 38 / 50 | 76.0%
Exactly 1 Fair/Good sentence | 6 / 50 | 12.0%
No usable sentence (both bad) | 6 / 50 | 12.0%

This is the practical reliability view for writing flow when generating two options per scene.

For story production, this is the ratio that matters most. With two attempts per scene, 44 out of 50 scenes produced at least one usable next sentence.
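The case-level view can be sketched as a tally of usable samples per case. The raw per-case labels are not shown in this post, so the list below is rebuilt from the table counts (38 cases with two usable samples, 6 with one, 6 with none). `usable_per_case` and `case_summary` are illustrative names.

```python
from collections import Counter

# Number of usable (Fair/Good) samples in each case, rebuilt from the
# Case-Level Usability table rather than from the raw labels.
usable_per_case = [2] * 38 + [1] * 6 + [0] * 6


def case_summary(usable_per_case):
    """Summarize two-sample cases by how many usable sentences they produced."""
    tally = Counter(usable_per_case)
    cases = len(usable_per_case)
    return {
        "at_least_one": (tally[2] + tally[1]) / cases,
        "both": tally[2] / cases,
        "exactly_one": tally[1] / cases,
        "none": tally[0] / cases,
        "usable_outputs": sum(usable_per_case),
    }


summary = case_summary(usable_per_case)
print(summary)
```

As a cross-check, the usable outputs implied by the case table (38 x 2 + 6 = 82) match the 82 / 100 in the quality distribution.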

Good-scene likelihood after N sentences (same math as previous post)

Using the same compounding-risk framing from the previous write-up, and treating each sentence as an independent draw with a 0.82 chance of being Fair/Good:

  • P(good scene after N sentences) = (0.82)^N
  • P(at least 1 bad sentence by N) = 1 - (0.82)^N

Compounding Scene Quality Over N Sentences

Sentences (N) | P(good scene, all Fair/Good) | P(at least 1 Bad)
10 | 13.74% | 86.26%
20 | 1.89% | 98.11%
40 | 0.036% | 99.964%

Same compounding method as the previous post, using this run's per-sentence rates: Fair+Good = 82.0%, Bad = 18.0%.

Even with much better single-sentence quality, long scenes still compound failure risk if each sentence is generated independently.
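A minimal sketch of the compounding calculation, assuming each sentence is an independent draw with the run's 82.0% usable rate. The function name `p_all_usable` is illustrative.

```python
def p_all_usable(p_step, n):
    """Probability that n independent sentences are all Fair/Good."""
    return p_step ** n


PER_SENTENCE = 0.82  # Fair + Good rate from this rerun

for n in (10, 20, 40):
    p = p_all_usable(PER_SENTENCE, n)
    print(f"N={n}: good scene {p:.3%}, at least 1 bad {1 - p:.3%}")
```

The independence assumption is a simplification: real scenes feed earlier sentences back into the prompt, so one weak sentence may raise or lower the odds for the next one.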

If we pick the best of two generations per sentence

This is the practical workflow version of the same question.

  • observed per-step failure after best-of-2 in this run: 6/50 = 12.0%, so per-step success is 44/50 = 88.0%
  • P(good scene after N sentences) = (0.88)^N
  • P(at least 1 bad sentence that ruins the scene by N) = 1 - (0.88)^N

Compounding Risk with Best-of-2 Selection

Sentences (N) | P(good scene with best-of-2 each step) | P(at least 1 Bad after best-of-2)
10 | 27.85% | 72.15%
20 | 7.76% | 92.24%
40 | 0.602% | 99.398%

Uses observed two-shot case reliability from this run: 44/50 had at least one Fair/Good, so per-step failure after best-of-2 is 6/50 = 12.0%.

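Picking the better of two roughly doubles the chance of a clean 10-sentence scene versus single-shot generation (13.74% -> 27.85%), but the risk still compounds: by 40 sentences, fewer than 1 in 100 scenes survives without a bad step.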
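The sketch below puts the single-shot and best-of-2 rates side by side, and also computes what best-of-2 would look like if the two samples for a case were statistically independent. That independent figure is a hypothetical comparison point, not something measured in this run, and all names are illustrative.

```python
SINGLE_SHOT = 82 / 100  # per-sentence Fair + Good rate
BEST_OF_2 = 44 / 50     # observed: cases with at least one usable sample


def scene_survival(p_step, n):
    """Probability that all n steps succeed, assuming independent steps."""
    return p_step ** n


# Hypothetical: success rate of best-of-2 if both samples failed independently.
independent_best_of_2 = 1 - (1 - SINGLE_SHOT) ** 2

for n in (10, 20, 40):
    single = scene_survival(SINGLE_SHOT, n)
    best = scene_survival(BEST_OF_2, n)
    print(f"N={n}: single-shot {single:.3%}, best-of-2 {best:.3%}")

print(f"observed best-of-2 success: {BEST_OF_2:.1%}")
print(f"independent-sample estimate: {independent_best_of_2:.2%}")
```

The observed 88.0% sits below the independent-sample estimate, which may mean failures cluster on particular cases: when one sample for a scene is bad, the second is more likely to be bad too. That would be consistent with the reversal-type breakdown, though two samples per case is thin evidence.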

Breakdown by reversal type

Quality by Reversal Type

Reversal Type | Good | Fair | Bad | Fair + Good
Positive (34 outputs) | 3 (8.8%) | 28 (82.4%) | 3 (8.8%) | 31 / 34 (91.2%)
Neutral (32 outputs) | 1 (3.1%) | 25 (78.1%) | 6 (18.8%) | 26 / 32 (81.3%)
Negative (34 outputs) | 1 (2.9%) | 24 (70.6%) | 9 (26.5%) | 25 / 34 (73.5%)

Negative-turn prompts are still the hardest slice in this rerun.

Positive and neutral turns perform well, at 91.2% and 81.3% usable. Negative-turn prompts remain the main error bucket: they account for 9 of the 18 bad outputs and land at 73.5% usable.
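For anyone re-deriving the slice numbers, here is a small sketch that computes the usable share per reversal type and each type's share of all bad outputs. The counts come from the table; `by_type`, `usable_share`, and `bad_share` are hypothetical names.

```python
# Counts copied from the Quality by Reversal Type table.
by_type = {
    "positive": {"good": 3, "fair": 28, "bad": 3},
    "neutral": {"good": 1, "fair": 25, "bad": 6},
    "negative": {"good": 1, "fair": 24, "bad": 9},
}


def usable_share(row):
    """Fair + Good as a fraction of all outputs in one slice."""
    return (row["good"] + row["fair"]) / sum(row.values())


def bad_share(by_type, name):
    """Fraction of all bad outputs that fall in the named slice."""
    total_bad = sum(row["bad"] for row in by_type.values())
    return by_type[name]["bad"] / total_bad


for name, row in by_type.items():
    print(f"{name}: usable {usable_share(row):.1%}, "
          f"share of bad outputs {bad_share(by_type, name):.1%}")
```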

Sample good sentences from this rerun

Note - Harbor return (positive)
At the end she found only black water, a loop of severed rope, and her own breath pooling white in the cold. The salt spray carried the faint scent of woodsmoke from a distant, abandoned shack.

Note - Friendly dragon (positive)
It stared at the cookie jar on the top shelf and made a hopeful noise. It made a tiny, wheezing sound, like a rusty hinge, as it nibbled on a crumb.

Note - Wildfire evacuation (negative)
A siren wails somewhere close, then warps as the sound hits the smoke and comes back wrong. The smell of burnt plastic hangs heavy, thick enough to taste, and a distant siren screams a fractured melody.

Rerun rubric for quality

This rerun uses a keep-or-cut standard:

If I appended this sentence to the scene, would I keep it?

Rule compliance matters only when a violation breaks the scene.

quality = good

Meaning: I would keep this sentence in a draft with little or no editing.

How it was evaluated:

  • Scene fit: clearly continues what is happening now in the provided scene.
  • POV + tense + voice: consistent with requested POV and tense, and the story voice.
  • Readability: flows cleanly when appended to sceneSoFar.
  • Specificity: concrete, grounded details instead of generic filler.
  • Narrative function: adds momentum or sharpens the moment without premature resolution.

quality = fair

Meaning: usable draft sentence; likely needs light revision or could be replaced by a stronger option.

How it was evaluated:

  • mostly fits scene continuity, but can be generic, low-energy, or slightly off-emphasis.
  • does not break POV or tense, though it may drift toward abstraction.
  • reads fine appended, but does not land a strong next beat.
  • any rule issues are minor and do not harm continuity.

quality = bad

Meaning: not usable as-is because it harms scene continuity or misses prompt intent.

How it was evaluated:

  • breaks requested POV or tense, or violates limited POV.
  • contradicts scene details or introduces implausible derailments.
  • voice mismatch is noticeable for the prompt.
  • becomes abstract or meta instead of story continuation.
  • hard rule violations that materially change story behavior.

quality = incoherent

Meaning: completely unusable.

How it was evaluated:

  • grammatically or logically unreadable in context.
  • does not connect to the scene at all.
  • contradictions are severe enough that action cannot be parsed.

Reserved for true failures, not merely weak writing.
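One way to encode the four labels so that downstream tallies and best-of-2 selection agree on what counts as usable is an ordered enum. This is a sketch under the assumption that the labels rank incoherent < bad < fair < good. The labels themselves are still assigned by hand against the rubric, and `Quality`, `is_usable`, and `pick_best` are illustrative names.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Rubric labels, ordered from worst to best (ordering is an assumption)."""
    INCOHERENT = 0
    BAD = 1
    FAIR = 2
    GOOD = 3


def is_usable(quality):
    """Fair or good counts as a usable continuation sentence."""
    return quality >= Quality.FAIR


def pick_best(candidates):
    """Return the highest-rated sentence, or None if no candidate is usable.

    candidates is a non-empty list of (sentence, Quality) pairs.
    """
    sentence, quality = max(candidates, key=lambda pair: pair[1])
    return sentence if is_usable(quality) else None


options = [("First draft sentence.", Quality.BAD),
           ("Second draft sentence.", Quality.FAIR)]
print(pick_best(options))
```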

Notes

This rerun is designed around practical writing workflow. The question is whether gemma3:1b can produce a decent next sentence often enough to keep drafting momentum. On this pass it can in most scenes: 82.0% of individual outputs were fair or good, and 88.0% of scenes got at least one usable sentence from two attempts. Negative-turn scenes remain the next improvement target.