Sixth Log - Writing Experiments
Writing Experiment Rerun: gemma3:1b with 3 Samples per Story
February 27, 2026
Estimated read: 7 min
I reran the same 50-story setup with gemma3:1b, but moved from 2 samples per story to 3. The key question is story reliability: how often do we avoid an all-bad outcome and get at least one fair/good line to keep drafting?
This is a direct follow-up to:
Same general setup and scoring approach, but this run uses 3 generations per story instead of 2.
Run summary
- Run timestamp: February 27, 2026 (`2026-02-27T06:26:21.050Z`)
- Model: `gemma3:1b`
- Scope: 50 story cases, 3 samples per case (150 outputs)
- Primary goal: reduce story-level all-bad outcomes and increase the odds of at least one usable (fair or good) option per story step
Data
- `writing-experiment-2026-02-27T06-26-21-050Z.json`
- Prior run for comparison: `writing-experiment-2026-02-25T02-27-32-328Z.json`
Current run quality distribution
2026-02-27 Quality Distribution
| Quality | Count | Percent |
|---|---|---|
| Good | 4 / 150 | 2.7% |
| Fair | 137 / 150 | 91.3% |
| Bad | 9 / 150 | 6.0% |
| Fair + Good | 141 / 150 | 94.0% |
Current run: 50 stories x 3 samples = 150 outputs.
Percentage comparison vs the previous run
Output-Level Quality Percentage Comparison
| Run | Samples per Story | Good | Fair | Bad | Fair + Good |
|---|---|---|---|---|---|
| 2026-02-25 | 2 | 5.0% | 77.0% | 18.0% | 82.0% |
| 2026-02-27 | 3 | 2.7% | 91.3% | 6.0% | 94.0% |
Same case set and rubric family; only sampling strategy changed from 2 to 3 per story.
Key deltas from 2026-02-25 -> 2026-02-27:
- fair + good: 82.0% -> 94.0% (+12.0 points)
- bad: 18.0% -> 6.0% (-12.0 points)
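The output-level percentages and deltas can be re-derived from raw counts with a few lines of Python. This is a minimal sketch, not the script used for the run: the function names are hypothetical, and the prior-run counts (5 / 77 / 18) are back-computed from the reported percentages on the assumption that 50 stories x 2 samples gives 100 outputs.

```python
def quality_percentages(counts):
    """Convert label counts into percentages rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def usable_percentage(counts):
    """Share of outputs rated fair or good, computed from raw counts."""
    total = sum(counts.values())
    return round(100 * (counts["fair"] + counts["good"]) / total, 1)


# Current-run counts come from the quality distribution table.
current_counts = {"good": 4, "fair": 137, "bad": 9}
# Assumption: prior-run counts derived from 5.0% / 77.0% / 18.0% of 100 outputs.
prior_counts = {"good": 5, "fair": 77, "bad": 18}

current_pct = quality_percentages(current_counts)
prior_pct = quality_percentages(prior_counts)

# Differences are in percentage points, not relative change.
usable_delta = round(usable_percentage(current_counts) - usable_percentage(prior_counts), 1)
bad_delta = round(current_pct["bad"] - prior_pct["bad"], 1)

print(current_pct)
print(f"fair + good: {usable_delta:+.1f} points, bad: {bad_delta:+.1f} points")
```

Computing the fair + good share from counts, rather than adding two rounded percentages, avoids small rounding drift in the combined figure.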
Main metric: story reliability across samples
This is the main decision metric for workflow: per story step, did we get at least one usable option?
Story-Level Reliability (Primary Metric)
| Run | Samples per Story | At Least 1 Fair/Good | All Bad | At-Least-One : All-Bad Ratio |
|---|---|---|---|---|
| 2026-02-25 | 2 | 44 / 50 (88.0%) | 6 / 50 (12.0%) | 44:6 (7.33:1) |
| 2026-02-27 | 3 | 48 / 50 (96.0%) | 2 / 50 (4.0%) | 48:2 (24.0:1) |
This is the primary reliability lens for scene drafting throughput.
What changed with 3 samples per story:
- At least one fair/good: 88.0% -> 96.0% (+8.0 points)
- All bad: 12.0% -> 4.0% (-8.0 points)
- All-bad relative reduction: 6 stories -> 2 stories (-66.7%)
- Usable-to-failed story ratio: 7.33:1 -> 24.0:1 (3.27x better)
Read this as: for every 24 stories that yield at least one fair/good line, only 1 story comes back all bad and leaves nothing usable. In the prior run, there was 1 all-bad story for every 7.33 stories with at least one fair/good line.
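The story-level figures above follow directly from the two story counts per run. Here is a small sketch of that arithmetic; `story_reliability` is a hypothetical helper name, and the inputs are the counts from the reliability table.

```python
def story_reliability(usable_stories, all_bad_stories):
    """Summarize story-level reliability from two story counts.

    usable_stories: stories with at least one fair/good sample.
    all_bad_stories: stories where every sample was bad (must be > 0 here,
    since the ratio divides by it).
    """
    total = usable_stories + all_bad_stories
    return {
        "at_least_one_pct": round(100 * usable_stories / total, 1),
        "all_bad_pct": round(100 * all_bad_stories / total, 1),
        "ratio": usable_stories / all_bad_stories,
    }


prior = story_reliability(44, 6)    # 2026-02-25, 2 samples per story
current = story_reliability(48, 2)  # 2026-02-27, 3 samples per story

# Relative drop in all-bad stories: (6 - 2) / 6.
all_bad_reduction_pct = round(100 * (6 - 2) / 6, 1)
# How many times better the usable-to-failed ratio became.
ratio_improvement = round(current["ratio"] / prior["ratio"], 2)

print(f"ratio: {prior['ratio']:.2f}:1 -> {current['ratio']:.1f}:1")
print(f"all-bad reduction: {all_bad_reduction_pct}%, improvement: {ratio_improvement}x")
```

Note that the +8.0 point change and the 66.7% relative reduction describe the same 4-story difference on different scales, which is why both appear in the list.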
Story outcome distribution details
Story-Level Outcome Breakdown
| Run | All Fair/Good | Exactly 2 Fair/Good | Exactly 1 Fair/Good | All Bad |
|---|---|---|---|---|
| 2026-02-25 (2 samples) | 38 / 50 (76.0%) | N/A | 6 / 50 (12.0%) | 6 / 50 (12.0%) |
| 2026-02-27 (3 samples) | 46 / 50 (92.0%) | 1 / 50 (2.0%) | 1 / 50 (2.0%) | 2 / 50 (4.0%) |
For the 2-sample run, 'All Fair/Good' means both samples were Fair/Good.
The 3-sample run leaves only 2 of 50 stories in the all-bad bucket. That is the practical improvement we wanted: more chances to keep moving with at least one fair/good continuation.
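The breakdown buckets can be produced by counting usable samples per story. The sketch below assumes a hypothetical shape for the per-story labels (a dict from story id to a list of quality labels); the actual JSON layout of the run files is not shown here, and the toy input is illustrative, not real run data. The last part cross-checks the 3-sample breakdown against the 9 bad outputs in the quality distribution.

```python
from collections import Counter

USABLE = {"fair", "good"}


def usable_count_histogram(stories):
    """Map 'number of usable samples in a story' -> 'number of stories'."""
    return Counter(
        sum(label in USABLE for label in labels) for labels in stories.values()
    )


# Hypothetical toy input: story ids and labels are made up for illustration.
toy_stories = {
    "story-a": ["fair", "fair", "good"],  # all fair/good
    "story-b": ["bad", "fair", "bad"],    # exactly 1 fair/good
    "story-c": ["bad", "bad", "bad"],     # all bad
}
toy_hist = usable_count_histogram(toy_stories)

# Cross-check using the 3-sample row of the breakdown table:
# keys are usable samples per story, values are story counts.
breakdown = {3: 46, 2: 1, 1: 1, 0: 2}
samples_per_story = 3
bad_outputs = sum((samples_per_story - k) * n for k, n in breakdown.items())

print(dict(toy_hist))
print(f"stories: {sum(breakdown.values())}, bad outputs implied: {bad_outputs}")
```

The implied bad-output count (0 + 1 + 2 + 6) matches the 9 bad outputs reported at the output level, so the story-level and output-level tables agree with each other.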
Remaining all-bad stories in this run:
- `loop-doorbell-negative`
- `river-bargain-mixed`
Example kept lines (with story prefix)
Note - Noir alley (negative)
I tried another match and listened to my own breath, counting the seconds between passing cars. It felt like a shadow, a cold, wet breath on my neck, holding the echo of a single, deliberate step.
Note - Tea shop clue (positive)
On the office desk, the ledger lay open to a blank page, and the teacup beside it was still warm. A faint, unfamiliar scent of sandalwood lingered near the chipped porcelain teacup, a small inconsistency amidst the usual comforting aroma of bergamot and clove.
Note - Dinner unstated (negative)
Across from me, two pairs of eyes lift and drop, lift and drop, waiting for whatever I promised I would say. The fork scrapes against the porcelain, a slow, deliberate motion that amplifies the stillness of the room.
Notes
The scoring intent is unchanged from the prior rerun: fair and good are usable drafting options, bad is not. In this run, the 3-sample strategy raised the share of story steps with at least one usable option from 88.0% to 96.0% and cut all-bad dead ends from 6 stories to 2.