Fifth Log - Writing Experiments
Writing Experiment Rerun: gemma3:1b at 82.0% Fair+Good
February 25, 2026
Estimated read: 8 min
I reran gemma3:1b on 50 story scenes with 2 samples each, then rescored with a documentation-ready rubric: If I appended this sentence to the scene, would I keep it?
This post is a follow-up to the earlier baseline write-up.
The goal here is practical drafting throughput. I want to know whether gemma3:1b can produce a usable next sentence often enough to keep momentum when writing scene-by-scene.
Run summary
- Run timestamp: February 25, 2026 (2026-02-25T02:27:32.328Z)
- Model: gemma3:1b
- Scope: 50 story cases, 2 samples per case (100 total outputs)
- Main goal: maximize usable continuations (fair + good) for next-sentence drafting
Data
Main result: Fair+Good ratio
Quality Distribution (Rerun)
| Quality | Count | Percent |
|---|---|---|
| Good | 5 / 100 | 5.0% |
| Fair | 77 / 100 | 77.0% |
| Bad | 18 / 100 | 18.0% |
| Incoherent | 0 / 100 | 0.0% |
| Fair + Good | 82 / 100 | 82.0% |
Primary decision metric in this rerun is Fair + Good (usable continuation sentence).
The decision threshold for this run is simple: if a generated sentence is fair or good, it is usable in draft flow. On that measure, this rerun lands at 82.0% usable.
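As a quick sketch of how the headline number falls out of the per-output labels (label names match the rubric below; counts are the ones in the table above):

```python
# Tally per-output quality labels into the Fair+Good decision metric.
# Counts here are taken from this rerun's quality distribution table.
from collections import Counter

labels = ["good"] * 5 + ["fair"] * 77 + ["bad"] * 18  # 100 outputs
counts = Counter(labels)

usable = counts["good"] + counts["fair"]
print(f"Fair+Good: {usable}/{len(labels)} = {usable / len(labels):.1%}")
# Fair+Good: 82/100 = 82.0%
```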
Compared with the previous gemma3:1b run
Baseline vs Rerun
| Run | Model | Good + Fair | Bad | One-Sentence Violations | One-Sentence Compliant |
|---|---|---|---|---|---|
| 2026-02-23 baseline | gemma3:1b | 7 / 24 (29.2%) | 17 / 24 (70.8%) | 15 / 24 (62.5%) | 9 / 24 (37.5%) |
| 2026-02-25 rerun | gemma3:1b | 82 / 100 (82.0%) | 18 / 100 (18.0%) | 0 / 100 (0.0%) | 100 / 100 (100.0%) |
Baseline is from the previous post and uses a smaller set; the prompt setup and scoring rubric were updated in this rerun.
The baseline reference is the post linked above. This rerun expands the test suite and updates scoring, so the comparison is directional, not perfectly apples-to-apples. The deltas are still large enough to be meaningful for workflow planning:
- fair+good: 29.2% -> 82.0% (+52.8 points)
- bad: 70.8% -> 18.0% (-52.8 points)
- one-sentence violations: 62.5% -> 0.0% (-62.5 points)
- one-sentence compliance: 37.5% -> 100.0% (+62.5 points)
Case-level reliability (2 shots per scene)
Case-Level Usability
| Case-Level Outcome (2 samples per case) | Count | Percent |
|---|---|---|
| At least 1 Fair/Good sentence | 44 / 50 | 88.0% |
| Both sentences Fair/Good | 38 / 50 | 76.0% |
| Exactly 1 Fair/Good sentence | 6 / 50 | 12.0% |
| No usable sentence (both bad) | 6 / 50 | 12.0% |
This is the practical reliability view for writing flow when generating two options per scene.
For story production, this is the ratio that matters most. With two attempts per scene, 44 out of 50 scenes produced at least one usable next sentence.
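One sanity check worth doing here (my own back-of-envelope, not part of the run tooling): if the two samples per scene failed independently at the per-sentence Bad rate of 18%, both samples would be bad far less often than observed.

```python
# If the two samples per scene failed independently at the per-sentence
# Bad rate, how often would BOTH be bad? Compare with the observed rate.

bad_rate = 0.18                 # per-sentence Bad rate from this rerun
both_bad_if_independent = bad_rate ** 2
observed_both_bad = 6 / 50      # "No usable sentence" row above

print(f"predicted both-bad (independent): {both_bad_if_independent:.1%}")
print(f"observed both-bad:                {observed_both_bad:.1%}")
```

The independent-failure prediction is about 3.2%, well below the observed 12.0%, which suggests failures cluster on hard scenes rather than falling randomly across samples. That is consistent with the reversal-type breakdown later in this post, where negative-turn prompts carry most of the Bad outputs.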
Good-scene likelihood after N sentences (same math as previous post)
Using the same compounding-risk framing from the previous write-up:
P(good scene after N sentences) = (0.82)^N
P(at least 1 bad sentence by N) = 1 - (0.82)^N
Compounding Scene Quality Over N Sentences
| Sentences (N) | P(good scene, all Fair/Good) | P(at least 1 Bad) |
|---|---|---|
| 10 | 13.74% | 86.26% |
| 20 | 1.89% | 98.11% |
| 40 | 0.036% | 99.964% |
Same compounding method as the previous post, using this run's per-sentence rates: Fair+Good = 82.0%, Bad = 18.0%.
Even with much better single-sentence quality, long scenes still compound failure risk if each sentence is generated independently.
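The table above can be reproduced in a couple of lines from this run's per-sentence rate:

```python
# Reproduce the compounding table from this run's per-sentence usable rate.

fair_good = 0.82  # Fair+Good rate per sentence in this rerun

for n in (10, 20, 40):
    p_good_scene = fair_good ** n   # every sentence in the scene is usable
    p_any_bad = 1 - p_good_scene    # at least one bad sentence slipped in
    print(f"N={n:2d}  good scene: {p_good_scene:.3%}  at least 1 bad: {p_any_bad:.3%}")
```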
If we pick the best of two generations per sentence
This is the practical workflow version of the same question.
- Observed per-step failure after best-of-2 in this run: 6/50 = 12.0%

P(good scene after N sentences) = (0.88)^N
P(at least 1 bad sentence that ruins the scene by N) = 1 - (0.88)^N
Compounding Risk with Best-of-2 Selection
| Sentences (N) | P(good scene with best-of-2 each step) | P(at least 1 Bad after best-of-2) |
|---|---|---|
| 10 | 27.85% | 72.15% |
| 20 | 7.76% | 92.24% |
| 40 | 0.602% | 99.40% |
Uses observed two-shot case reliability from this run: 44/50 had at least one Fair/Good, so per-step failure after best-of-2 is 6/50 = 12.0%.
Picking the better of two helps a lot versus single-shot generation, but the risk still compounds over longer scenes.
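The same loop with the observed best-of-2 success rate gives the table above:

```python
# Same compounding math, with each step using the observed best-of-2
# success rate (44/50 scenes produced at least one usable sentence).

best_of_2 = 44 / 50  # 0.88 per-step success after picking the better of two

for n in (10, 20, 40):
    p_good = best_of_2 ** n
    print(f"N={n:2d}  good scene: {p_good:.2%}  at least 1 bad: {1 - p_good:.2%}")
```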
Breakdown by reversal type
Quality by Reversal Type
| Reversal Type | Good | Fair | Bad | Fair + Good |
|---|---|---|---|---|
| Positive (34 outputs) | 3 (8.8%) | 28 (82.4%) | 3 (8.8%) | 31 / 34 (91.2%) |
| Neutral (32 outputs) | 1 (3.1%) | 25 (78.1%) | 6 (18.8%) | 26 / 32 (81.3%) |
| Negative (34 outputs) | 1 (2.9%) | 24 (70.6%) | 9 (26.5%) | 25 / 34 (73.5%) |
Negative-turn prompts are still the hardest slice in this rerun.
Positive and neutral turns are performing well. Negative-turn prompts remain the main error bucket.
Sample good sentences from this rerun
Note - Harbor return (positive)
At the end she found only black water, a loop of severed rope, and her own breath pooling white in the cold. The salt spray carried the faint scent of woodsmoke from a distant, abandoned shack.
Note - Friendly dragon (positive)
It stared at the cookie jar on the top shelf and made a hopeful noise. It made a tiny, wheezing sound, like a rusty hinge, as it nibbled on a crumb.
Note - Wildfire evacuation (negative)
A siren wails somewhere close, then warps as the sound hits the smoke and comes back wrong. The smell of burnt plastic hangs heavy, thick enough to taste, and a distant siren screams a fractured melody.
Rerun rubric for quality
This rerun uses a keep-or-cut standard:
If I appended this sentence to the scene, would I keep it?
Rule compliance only matters when it breaks the scene.
quality = good
Meaning: I would keep this sentence in a draft with little or no editing.
How it was evaluated:
- Scene fit: clearly continues what is happening now in the provided scene.
- POV + tense + voice: consistent with requested POV and tense, and the story voice.
- Readability: flows cleanly when appended to sceneSoFar.
- Specificity: concrete, grounded details instead of generic filler.
- Narrative function: adds momentum or sharpens the moment without premature resolution.
quality = fair
Meaning: usable draft sentence; likely needs light revision or could be replaced by a stronger option.
How it was evaluated:
- mostly fits scene continuity, but can be generic, low-energy, or slightly off-emphasis.
- does not break POV or tense, though it may drift toward abstraction.
- reads fine appended, but does not land a strong next beat.
- any rule issues are minor and do not harm continuity.
quality = bad
Meaning: not usable as-is because it harms scene continuity or misses prompt intent.
How it was evaluated:
- breaks requested POV or tense, or violates limited POV.
- contradicts scene details or introduces implausible derailments.
- voice mismatch is noticeable for the prompt.
- becomes abstract or meta instead of story continuation.
- hard rule violations that materially change story behavior.
quality = incoherent
Meaning: completely unusable.
How it was evaluated:
- grammatically or logically unreadable in context.
- does not connect to the scene at all.
- contradictions are severe enough that action cannot be parsed.
Reserved for true failures, not merely weak writing.
Notes
This rerun is designed around practical writing workflow. The question is whether gemma3:1b can produce a decent next sentence often enough to keep drafting momentum. On this pass, the fair+good ratio indicates it can in most scenes, while negative-turn scenes remain the next improvement target.