Writing Experiment: gemma3:1b vs qwen3:1.7b Quality Mix

Run timestamp: February 23, 2026 (2026-02-23T06:18:08.267Z)
Models: gemma3:1b and qwen3:1.7b
Samples: 24 generations per model (4 cases x 6 samples)
Evaluator: ChatGPT 5.2
Goal: more good and fair; fewer bad

Data

writing-experiment-2026-02-23T06-18-08-267Z.json

Quality Mix (Stacked Percentage Columns)

gemma3:1b

Good: 0 / 24 (0.0%)
Fair: 7 / 24 (29.2%)
Bad: 17 / 24 (70.8%)

qwen3:1.7b

Good: 1 / 24 (4.2%)
Fair: 8 / 24 (33.3%)
Bad: 15 / 24 (62.5%)

Legend

Good = definitely should be used in the story
Fair = could be used in the story
Bad = definitely should not be used in the story

Summary Table

Quality Distribution by Model

Model	Good	Fair	Bad	Good + Fair
gemma3:1b	0 / 24 (0.0%)	7 / 24 (29.2%)	17 / 24 (70.8%)	7 / 24 (29.2%)
qwen3:1.7b	1 / 24 (4.2%)	8 / 24 (33.3%)	15 / 24 (62.5%)	9 / 24 (37.5%)

Ratings were assigned by ChatGPT 5.2.

First-Sentence Quality Stats

These ratings score only the first sentence of each generation.

gemma3:1b

Good: 7 / 24 (29.2%)
Fair: 9 / 24 (37.5%)
Bad: 8 / 24 (33.3%)

qwen3:1.7b

Good: 4 / 24 (16.7%)
Fair: 11 / 24 (45.8%)
Bad: 9 / 24 (37.5%)

First-Sentence Quality Distribution by Model

Model	Good	Fair	Bad	Good + Fair
gemma3:1b	7 / 24 (29.2%)	9 / 24 (37.5%)	8 / 24 (33.3%)	16 / 24 (66.7%)
qwen3:1.7b	4 / 24 (16.7%)	11 / 24 (45.8%)	9 / 24 (37.5%)	15 / 24 (62.5%)

Uses the `firstSentenceQuality` field from each result row.
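
As a rough illustration, the tally behind a table like this could be sketched as follows. Only the `firstSentenceQuality` field name comes from this report; the `model` key, the lowercase label values, and the sample rows are assumptions made for the example, not the actual schema or contents of the run's JSON file.

```python
from collections import Counter

LABELS = ("good", "fair", "bad")


def tally_first_sentence_quality(rows):
    """Count firstSentenceQuality labels per model.

    Assumes each row is a dict with a hypothetical "model" key plus the
    `firstSentenceQuality` field mentioned in the report.
    """
    counts = {}
    for row in rows:
        model_counts = counts.setdefault(row["model"], Counter())
        model_counts[row["firstSentenceQuality"]] += 1
    return counts


def format_share(count, total):
    """Render a count in the report's 'k / n (x.x%)' style."""
    return f"{count} / {total} ({100 * count / total:.1f}%)"


# Illustrative rows only; these are not taken from the run's data file.
example_rows = [
    {"model": "gemma3:1b", "firstSentenceQuality": "good"},
    {"model": "gemma3:1b", "firstSentenceQuality": "bad"},
    {"model": "qwen3:1.7b", "firstSentenceQuality": "fair"},
]

for model, model_counts in tally_first_sentence_quality(example_rows).items():
    total = sum(model_counts.values())
    shares = {label: format_share(model_counts[label], total) for label in LABELS}
    print(model, shares)
```

A `Counter` returns zero for labels that never occur, so a model with no "good" rows still gets a `0 / n (0.0%)` cell.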

One-Sentence Request Violations

Each model was asked to add a single sentence. Longer outputs were accepted but marked as request violations.

One-Sentence Compliance

Model	One-Sentence Violations	One-Sentence Compliant
gemma3:1b	15 / 24 (62.5%)	9 / 24 (37.5%)
qwen3:1.7b	14 / 24 (58.3%)	10 / 24 (41.7%)

If the model went over one sentence, it was marked as a request violation.
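
The report does not say how sentences were counted. One simple way to flag a violation, shown here purely as an assumption, is a naive split on sentence-ending punctuation followed by whitespace. The function names are hypothetical, and the splitter will miscount abbreviations such as "Dr." and sentences that end inside closing quotes.

```python
import re

# Split wherever ., ! or ? is followed by whitespace. This is a deliberately
# naive heuristic, not the method used in the run.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_sentences(text):
    """Return the number of non-empty chunks after the naive split."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return len([part for part in parts if part])


def violates_one_sentence_request(text):
    """True when the output runs past a single sentence."""
    return count_sentences(text) > 1


print(violates_one_sentence_request("She reached for the panel."))
print(violates_one_sentence_request("She reached for the panel. It was cold."))
```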

Best Excerpts by Case

Best picks are selected by firstSentenceQuality for each model and case. Only the first sentence is shown below so the excerpt matches the scoring basis. When multiple samples have the same firstSentenceQuality, the lowest sample number is used.
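
The selection rule above can be sketched as a sort key: rank by quality first, then break ties with the lowest sample number. The `case`, `model`, and `sample` keys and the example rows are hypothetical; only `firstSentenceQuality` is named in this report.

```python
# Lower rank is better, so "good" sorts ahead of "fair" and "bad".
QUALITY_RANK = {"good": 0, "fair": 1, "bad": 2}


def pick_best_samples(rows):
    """Return the best row for each (case, model) pair.

    Ties on quality go to the lowest sample number, because the
    comparison key is the tuple (quality rank, sample number).
    """
    best = {}
    for row in rows:
        key = (row["case"], row["model"])
        rank = (QUALITY_RANK[row["firstSentenceQuality"]], row["sample"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return {key: row for key, (_, row) in best.items()}


# Illustrative rows only; these are not taken from the run's data file.
example_rows = [
    {"case": 3, "model": "qwen3:1.7b", "sample": 1, "firstSentenceQuality": "bad"},
    {"case": 3, "model": "qwen3:1.7b", "sample": 3, "firstSentenceQuality": "fair"},
    {"case": 3, "model": "qwen3:1.7b", "sample": 5, "firstSentenceQuality": "fair"},
]

picks = pick_best_samples(example_rows)
print(picks[(3, "qwen3:1.7b")]["sample"])
```

Comparing tuples keeps the tie-break in one place and makes the result independent of row order.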

1) Harbor return with optimistic turn

Scene: Fog drifted between the pilings and softened the warning lights to bruised orange smears. Mara counted the empty slips along the pier where the cutter should have been tied. At the end she found only black water, a loop of severed rope, and her own breath clouding in the cold.

gemma3:1b

Best sample: #1 · First sentence quality: fair

The salt spray tasted of iron on her lips as she scanned the water.

qwen3:1.7b

Best sample: #1 · First sentence quality: good

She traced the rope's path to the dock's edge, where the ledger's leather scent clung to the wet sand.

2) Skybridge dead-drop with cautious relief

Scene: The skybridge hummed with mag-rail vibration, and the glass beneath her palm held a faint chill that smelled of ozone and disinfectant. Ad panels along the ceiling cycled through breakfast spots and rent-by-the-hour pods, washing the crowd in pale blues and pinks. Mara kept to the edge, matching the pace of commuters with lanyards and coffee bulbs, while her thumb worried the cracked corner of her old ID card. A maintenance drone hovered near the access hatch, its camera iris clicking as it tracked faces; she angled her chin down and let a couple arguing about surge fares drift between them. “If you want to file a complaint, take it up with the algorithm,” one of them snapped, and Mara used the raised voices as cover to slip behind a structural rib. The dead-drop panel was still there, a narrow rectangle of brushed steel with a tiny service notch, and the security seal she expected to find was missing.

gemma3:1b

Best sample: #1 · First sentence quality: good

The metal panel yielded with a sharp groan, and Mara slid beneath it, the air immediately colder despite the mag-rail's residual warmth.

qwen3:1.7b

Best sample: #2 · First sentence quality: good

She reached for the panel, her fingers brushing the cold steel, then pulled back, the scent of ozone clinging to her skin.

3) Clocktower confession with hopeful risk

Scene: The clocktower stairs narrowed to a ribbed iron ladder, slick with condensation that left her gloves smelling of pennies. Above, the gears ticked and clacked like teeth, each turn sending a tremor through the stone. Celia paused beneath the dial face where moonlight seeped through the numerals, and she felt the draft tug at the papers inside her coat. Someone had been here recently; fresh boot grit freckled the landing, and a lantern-wick stink still hung in the air.

gemma3:1b

Best sample: #1 · First sentence quality: good

The air felt thick and cool against my skin as I ascended the narrow staircase, each step a muffled echo against the stone.

qwen3:1.7b

Best sample: #3 · First sentence quality: fair

The gears hissed like a fevered beast, and the scent of oil mixed with the tang of rust clung to the air.

4) Conservatory retrieval with alarmed setback

Scene: The conservatory’s glass panes bead with condensation, and the air tastes like wet soil and coppery fertilizer. Jun moves between tall rows of citrus trees under grow lights that buzz with a faint electric heat against his cheeks. A maintenance cart sits abandoned beside the central fountain, its battery panel open and a coil of insulated wire spilling onto the tile. He reaches the propagation station where the gene-safe drawer should be sealed with a numbered tie, but the tie hangs in two clean halves. Inside, foam slots cradle empty vials, all except one with a white label stamped: KESTREL-03.

gemma3:1b

Best sample: #1 · First sentence quality: good

Jun runs his fingers along the cool, smooth surface of the drawer's wooden frame, the scent of stagnant water and something floral intensifying as he pulls at the tarnished metal tie.

qwen3:1.7b

Best sample: #2 · First sentence quality: good

The air sharpens with the tang of ozone, and the metal of the drawer feels cool against his palm as it slides open, revealing a vial cradled in the crook of his fingers.

Compounding Failure Risk Across Iterations

If any continuation step is unusable, the full draft needs repair work at that point. This has the same structure as a series reliability model: the run is clean only if every step is usable, so the per-step success probabilities multiply.

Let:

p_bad = probability a single generated step is bad (unusable)
N = number of generation steps in the full story

Then:

P(all usable) = (1 - p_bad)^N
P(at least one bad step) = 1 - (1 - p_bad)^N (by the complement rule)
E[# bad steps] = N * p_bad

Using this run’s first-sentence bad rates (gemma3:1b = 33.3%, qwen3:1.7b = 37.5%):

At Least One Bad Step vs Iteration Count

Steps (N)	P(at least 1 bad) gemma3:1b	P(at least 1 bad) qwen3:1.7b
10	98.27%	99.09%
20	99.97%	99.99%
40	99.999991%	99.999999%

Computed from first-sentence bad rates in this run: gemma3:1b = 33.3%, qwen3:1.7b = 37.5%.
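
For readers who want to recompute these figures, here is a minimal sketch of the formulas above. The function names are illustrative. The bad rates are entered as the exact fractions 8 / 24 and 9 / 24 from the first-sentence table rather than the rounded percentages, and independence between steps is assumed, as in the formulas.

```python
def p_at_least_one_bad(p_bad, n_steps):
    """Complement of an all-usable run: 1 - (1 - p_bad)^N."""
    return 1 - (1 - p_bad) ** n_steps


def expected_bad_steps(p_bad, n_steps):
    """Expected number of bad steps: N * p_bad."""
    return n_steps * p_bad


# First-sentence bad rates from this run, as exact fractions.
BAD_RATES = {"gemma3:1b": 8 / 24, "qwen3:1.7b": 9 / 24}

for n_steps in (10, 20, 40):
    cells = [
        f"{100 * p_at_least_one_bad(p_bad, n_steps):.6f}%"
        for p_bad in BAD_RATES.values()
    ]
    print(n_steps, *cells, sep="\t")

for model, p_bad in BAD_RATES.items():
    print(model, f"E[# bad] at N = 40: {expected_bad_steps(p_bad, 40):.1f}")
```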

At N = 40, a clean one-shot run is effectively non-viable at these rates, and expected rewrite load is high:

gemma3:1b: E[# bad] = 40 * 0.333 = 13.3
qwen3:1.7b: E[# bad] = 40 * 0.375 = 15.0

You can also invert this to set a target bad rate for one-shot generation quality:

p_bad <= 1 - R^(1/N), where R is the desired clean-run probability.
Example: with N = 40 and R = 50%, you need p_bad <= 1.72%.
Example: with N = 500 and R = 50%, you need p_bad <= 0.14%.
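
The inversion can be checked with a short sketch. The function name is illustrative, and the same independence assumption applies.

```python
def max_bad_rate(n_steps, clean_run_probability):
    """Largest per-step bad rate that still gives a clean run with
    probability R over N steps: 1 - R^(1/N)."""
    return 1 - clean_run_probability ** (1 / n_steps)


for n_steps in (40, 500):
    p_bad = max_bad_rate(n_steps, 0.5)
    # Plugging the bound back in should recover the target R.
    recovered = (1 - p_bad) ** n_steps
    print(f"N = {n_steps}: p_bad <= {100 * p_bad:.2f}% (recovers R = {recovered:.3f})")
```

The round trip through `(1 - p_bad) ** n_steps` is a quick way to confirm that the bound and the forward formula agree.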

These calculations assume independent errors. In autoregressive generation, mistakes can feed forward and amplify, so real long-form failure risk is often worse than this baseline.

Takeaways

On full-output ratings, qwen3:1.7b leads this run on good + fair (37.5% vs 29.2%, or 9 vs 7 of 24 generations).
On first-sentence-only scoring, gemma3:1b leads on good + fair (66.7% vs 62.5%) and on good alone (29.2% vs 16.7%).
Full-output bad rates remain high for both models (70.8% for gemma3:1b, 62.5% for qwen3:1.7b), so end-to-end continuation quality is not yet where we want it.
One-sentence violations are high for both models (62.5% for gemma3:1b, 58.3% for qwen3:1.7b), which is a clear prompt-following issue to address in the next run.