Fourth Log - Writing Experiments
Writing Experiment: gemma3:1b vs qwen3:1.7b Quality Mix
February 23, 2026
Estimated read: 7 min
This run compares gemma3:1b and qwen3:1.7b on short story continuation quality. The goal is to maximize good and fair, minimize bad, and keep one-sentence compliance high.
- Run timestamp: February 23, 2026 (2026-02-23T06:18:08.267Z)
- Models: `gemma3:1b` and `qwen3:1.7b`
- Samples: 24 generations per model (4 cases x 6 samples)
- Evaluator: ChatGPT 5.2
- Goal: more `good` and `fair`; fewer `bad`
Data
Quality Mix (Stacked Percentage Columns)
gemma3:1b
- Good: 0 / 24 (0.0%)
- Fair: 7 / 24 (29.2%)
- Bad: 17 / 24 (70.8%)
qwen3:1.7b
- Good: 1 / 24 (4.2%)
- Fair: 8 / 24 (33.3%)
- Bad: 15 / 24 (62.5%)
Legend
- Good = definitely should be used in the story
- Fair = could be used in the story
- Bad = definitely should not be used in the story
Summary Table
Quality Distribution by Model
| Model | Good | Fair | Bad | Good + Fair |
|---|---|---|---|---|
| gemma3:1b | 0 / 24 (0.0%) | 7 / 24 (29.2%) | 17 / 24 (70.8%) | 7 / 24 (29.2%) |
| qwen3:1.7b | 1 / 24 (4.2%) | 8 / 24 (33.3%) | 15 / 24 (62.5%) | 9 / 24 (37.5%) |
Ratings were assigned by ChatGPT 5.2.
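The percentages in the table are plain category shares over the 24 generations per model. A minimal sketch of that tally follows, assuming each result row is a dict with `model` and `quality` keys and lowercase labels. Those field names and the row shape are illustrative assumptions, not the actual run output, and the rows below are rebuilt from the table counts.

```python
from collections import Counter

LABELS = ("good", "fair", "bad")


def quality_mix(rows, field="quality"):
    """Return {model: {label: (count, percent)}} for the given rating field."""
    counts = {}
    for row in rows:
        counts.setdefault(row["model"], Counter())[row[field]] += 1

    mix = {}
    for model, counter in counts.items():
        total = sum(counter.values())
        mix[model] = {
            label: (counter[label], round(100 * counter[label] / total, 1))
            for label in LABELS
        }
    return mix


# Hypothetical rows reconstructed from the counts in the table above.
rows = (
    [{"model": "gemma3:1b", "quality": "fair"}] * 7
    + [{"model": "gemma3:1b", "quality": "bad"}] * 17
    + [{"model": "qwen3:1.7b", "quality": "good"}] * 1
    + [{"model": "qwen3:1.7b", "quality": "fair"}] * 8
    + [{"model": "qwen3:1.7b", "quality": "bad"}] * 15
)

mix = quality_mix(rows)
for model, shares in mix.items():
    print(model, shares)
```

Passing `field="firstSentenceQuality"` would produce the first-sentence breakdown the same way, provided the rows carry that field.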
First-Sentence Quality Stats
This is scored only on the first sentence each generation produced.
gemma3:1b
- Good: 7 / 24 (29.2%)
- Fair: 9 / 24 (37.5%)
- Bad: 8 / 24 (33.3%)
qwen3:1.7b
- Good: 4 / 24 (16.7%)
- Fair: 11 / 24 (45.8%)
- Bad: 9 / 24 (37.5%)
First-Sentence Quality Distribution by Model
| Model | Good | Fair | Bad | Good + Fair |
|---|---|---|---|---|
| gemma3:1b | 7 / 24 (29.2%) | 9 / 24 (37.5%) | 8 / 24 (33.3%) | 16 / 24 (66.7%) |
| qwen3:1.7b | 4 / 24 (16.7%) | 11 / 24 (45.8%) | 9 / 24 (37.5%) | 15 / 24 (62.5%) |
Uses the `firstSentenceQuality` field from each result row.
One-Sentence Request Violations
The models were requested to add a single sentence. Going over was allowed but marked as a request violation.
One-Sentence Compliance
| Model | One-Sentence Violations | One-Sentence Compliant |
|---|---|---|
| gemma3:1b | 15 / 24 (62.5%) | 9 / 24 (37.5%) |
| qwen3:1.7b | 14 / 24 (58.3%) | 10 / 24 (41.7%) |
If the model went over one sentence, it was marked as a request violation.
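The log does not say how sentences were counted, so the following is only one possible way to flag a violation: a naive splitter that treats `.`, `!`, or `?` followed by whitespace as a sentence boundary. The function names are hypothetical, and the heuristic will miscount abbreviations and some dialogue.

```python
import re

# Naive boundary: whitespace that directly follows ., !, or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences using a simple punctuation heuristic."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def is_one_sentence(text):
    """True when the generation contains exactly one sentence."""
    return len(split_sentences(text)) == 1


def first_sentence(text):
    """Return the first sentence, or an empty string for empty input."""
    sentences = split_sentences(text)
    return sentences[0] if sentences else ""


compliant = "The salt spray tasted of iron on her lips as she scanned the water."
violation = "She reached for the panel. Then she pulled back."

print(is_one_sentence(compliant))
print(is_one_sentence(violation))
print(first_sentence(violation))
```

The same `first_sentence` helper could feed first-sentence-only scoring, so the compliance check and the scoring basis share one definition of a sentence.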
Best Excerpts by Case
Best picks are selected by `firstSentenceQuality` for each model and case. Only the first sentence is shown below so the excerpt matches the scoring basis. When multiple samples have the same `firstSentenceQuality`, the lowest sample number is used.
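The selection rule can be sketched as a sort key: rank by first-sentence quality, then break ties on the lowest sample number. Only `firstSentenceQuality` comes from the result rows described here. The `model`, `case`, and `sample` keys, the lowercase label values, and the example rows are assumptions for illustration.

```python
# Lower rank is better; label spelling is an assumption.
RANK = {"good": 0, "fair": 1, "bad": 2}


def best_picks(rows):
    """Return {(model, case): row}, preferring the best firstSentenceQuality
    and, on ties, the lowest sample number."""
    best = {}
    for row in rows:
        key = (row["model"], row["case"])
        score = (RANK[row["firstSentenceQuality"]], row["sample"])
        if key not in best or score < best[key][0]:
            best[key] = (score, row)
    return {key: row for key, (_, row) in best.items()}


# Hypothetical rows, not taken from the actual run.
rows = [
    {"model": "gemma3:1b", "case": 1, "sample": 1, "firstSentenceQuality": "fair"},
    {"model": "gemma3:1b", "case": 1, "sample": 2, "firstSentenceQuality": "good"},
    {"model": "gemma3:1b", "case": 1, "sample": 3, "firstSentenceQuality": "good"},
    {"model": "qwen3:1.7b", "case": 1, "sample": 1, "firstSentenceQuality": "bad"},
]

picks = best_picks(rows)
for (model, case), row in picks.items():
    print(model, case, "-> sample", row["sample"])
```

In this example the two `good` samples for `gemma3:1b` tie, so sample 2 wins over sample 3.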
1) Harbor return with optimistic turn
Scene: Fog drifted between the pilings and softened the warning lights to bruised orange smears. Mara counted the empty slips along the pier where the cutter should have been tied. At the end she found only black water, a loop of severed rope, and her own breath clouding in the cold.
gemma3:1b
The salt spray tasted of iron on her lips as she scanned the water.
qwen3:1.7b
She traced the rope's path to the dock's edge, where the ledger's leather scent clung to the wet sand.
2) Skybridge dead-drop with cautious relief
Scene: The skybridge hummed with mag-rail vibration, and the glass beneath her palm held a faint chill that smelled of ozone and disinfectant. Ad panels along the ceiling cycled through breakfast spots and rent-by-the-hour pods, washing the crowd in pale blues and pinks. Mara kept to the edge, matching the pace of commuters with lanyards and coffee bulbs, while her thumb worried the cracked corner of her old ID card. A maintenance drone hovered near the access hatch, its camera iris clicking as it tracked faces; she angled her chin down and let a couple arguing about surge fares drift between them. “If you want to file a complaint, take it up with the algorithm,” one of them snapped, and Mara used the raised voices as cover to slip behind a structural rib. The dead-drop panel was still there, a narrow rectangle of brushed steel with a tiny service notch, and the security seal she expected to find was missing.
gemma3:1b
The metal panel yielded with a sharp groan, and Mara slid beneath it, the air immediately colder despite the mag-rail's residual warmth.
qwen3:1.7b
She reached for the panel, her fingers brushing the cold steel, then pulled back, the scent of ozone clinging to her skin.
3) Clocktower confession with hopeful risk
Scene: The clocktower stairs narrowed to a ribbed iron ladder, slick with condensation that left her gloves smelling of pennies. Above, the gears ticked and clacked like teeth, each turn sending a tremor through the stone. Celia paused beneath the dial face where moonlight seeped through the numerals, and she felt the draft tug at the papers inside her coat. Someone had been here recently; fresh boot grit freckled the landing, and a lantern-wick stink still hung in the air.
gemma3:1b
The air felt thick and cool against my skin as I ascended the narrow staircase, each step a muffled echo against the stone.
qwen3:1.7b
The gears hissed like a fevered beast, and the scent of oil mixed with the tang of rust clung to the air.
4) Conservatory retrieval with alarmed setback
Scene: The conservatory’s glass panes bead with condensation, and the air tastes like wet soil and coppery fertilizer. Jun moves between tall rows of citrus trees under grow lights that buzz with a faint electric heat against his cheeks. A maintenance cart sits abandoned beside the central fountain, its battery panel open and a coil of insulated wire spilling onto the tile. He reaches the propagation station where the gene-safe drawer should be sealed with a numbered tie, but the tie hangs in two clean halves. Inside, foam slots cradle empty vials, all except one with a white label stamped: KESTREL-03.
gemma3:1b
Jun runs his fingers along the cool, smooth surface of the drawer's wooden frame, the scent of stagnant water and something floral intensifying as he pulls at the tarnished metal tie.
qwen3:1.7b
The air sharpens with the tang of ozone, and the metal of the drawer feels cool against his palm as it slides open, revealing a vial cradled in the crook of his fingers.
Takeaways
- `qwen3:1.7b` leads this run on `good + fair` (37.5% vs 29.2%).
- On first-sentence-only scoring, `gemma3:1b` leads on `good + fair` (66.7% vs 62.5%) and on `good` alone (29.2% vs 16.7%).
- Full-output quality still has high `bad` rates for both models, so end-to-end continuation quality is not yet where we want it.
- One-sentence violations are high for both models, which is a clear prompt-following issue to address in the next run.