This log is aimed at one practical question: can prompt framing alone push a small model toward more divergent poem outputs without changing model size?

Run summary

  • Run timestamp: March 6, 2026 (2026-03-06T19:46:40.872Z)
  • Model: gemma3:1b
  • Prompt setups: 3
  • Seed words: 10
  • Samples per case: 3
  • Total generations: 90
  • Scoring: 5 binary creativity/quality questions per poem (0-5 true-count score)
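
As a rough illustration of the scoring scheme in the summary above, the sketch below turns five binary rubric answers into a 0-5 true-count score and checks the run size. The function and variable names are hypothetical, and the rubric labels are borrowed from the hit-rate tables later in this log. How the true/false answers were produced (by hand or by a judge model) is not stated here, so the sketch simply takes them as input.

```python
# Hypothetical names throughout; only the counts and rubric labels come from this log.
RUBRIC = [
    "Unexpected interpretation",
    "Every line adds meaning",
    "Strong in few lines",
    "Transforms the word",
    "Makes reader see word differently",
]

def true_count(answers: dict[str, bool]) -> int:
    """Return how many rubric questions were answered true (0-5)."""
    return sum(bool(answers[q]) for q in RUBRIC)

# Run size: every prompt setup is paired with every seed word, 3 samples each.
PROMPT_SETUPS, SEED_WORDS, SAMPLES_PER_CASE = 3, 10, 3
total_generations = PROMPT_SETUPS * SEED_WORDS * SAMPLES_PER_CASE

# Example: a poem that passes every question except the first one.
example = {q: True for q in RUBRIC}
example["Unexpected interpretation"] = False
example_score = true_count(example)
```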

Data

Words tested

flower lantern tide rust mercy love echo thorn glass ember

Quality distributions by prompt

Best-of-3 Per Case (Highest True Count Selected)

Each prompt+word case contributes one sample: the one with the highest score out of three generations.

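Count of best-of cases at each score (10 case groups per prompt):

| Score | 01-base | 02-base-with-bio | 03-journey-then-base |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 2 | 0 |
| 3 | 3 | 2 | 3 |
| 4 | 3 | 1 | 0 |
| 5 | 4 | 5 | 7 |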
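
The best-of selection described in this section can be sketched as a single pass over the scored samples. This is a minimal sketch under assumptions: the record keys (`prompt`, `word`, `score`) are hypothetical, ties keep the first sample seen, and the toy scores are made up for illustration rather than taken from the run.

```python
def best_of_per_case(samples):
    """Keep the highest-scoring sample for each (prompt, word) case.

    `samples` is an iterable of dicts with hypothetical keys
    "prompt", "word", and "score" (the 0-5 true count).
    Ties keep the first sample seen, which is an assumption.
    """
    best = {}
    for s in samples:
        case = (s["prompt"], s["word"])
        if case not in best or s["score"] > best[case]["score"]:
            best[case] = s
    return best

# Toy scores, made up for illustration only (not from this run).
toy = [
    {"prompt": "01-base", "word": "tide", "score": 2},
    {"prompt": "01-base", "word": "tide", "score": 4},
    {"prompt": "01-base", "word": "tide", "score": 3},
    {"prompt": "02-base-with-bio", "word": "tide", "score": 0},
    {"prompt": "02-base-with-bio", "word": "tide", "score": 2},
    {"prompt": "02-base-with-bio", "word": "tide", "score": 2},
]
selected = best_of_per_case(toy)
```

With 3 prompts and 10 words, the same function would return 30 selected samples, one per case group.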

All Samples (No Best-Of Selection)

Every generated sample is counted. This view shows baseline spread before selecting the strongest option per case.

Count of samples at each score (30 samples per prompt):

| Score | 01-base | 02-base-with-bio | 03-journey-then-base |
|---|---|---|---|
| 0 | 2 | 5 | 3 |
| 1 | 0 | 6 | 6 |
| 2 | 7 | 4 | 5 |
| 3 | 11 | 8 | 8 |
| 4 | 5 | 1 | 0 |
| 5 | 5 | 6 | 8 |

Overall Score Distribution

| True Count Score (0-5) | All Samples (N=90) | Best-of-3 per Case (N=30) |
|---|---|---|
| 0 | 10 / 90 (11.1%) | 0 / 30 (0.0%) |
| 1 | 12 / 90 (13.3%) | 0 / 30 (0.0%) |
| 2 | 16 / 90 (17.8%) | 2 / 30 (6.7%) |
| 3 | 27 / 90 (30.0%) | 8 / 30 (26.7%) |
| 4 | 6 / 90 (6.7%) | 4 / 30 (13.3%) |
| 5 | 19 / 90 (21.1%) | 16 / 30 (53.3%) |

Each score is the number of rubric questions answered true (max 5). Best-of picks the highest-scoring sample in each prompt+word case.
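
To compare the two columns with a single number, each score can be weighted by its count. A small sketch using the counts from the overall distribution table (the helper name is hypothetical):

```python
def mean_from_distribution(counts):
    """Mean score from a {score: count} histogram."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total

# Counts copied from the overall score distribution table.
all_samples = {0: 10, 1: 12, 2: 16, 3: 27, 4: 6, 5: 19}
best_of_3 = {0: 0, 1: 0, 2: 2, 3: 8, 4: 4, 5: 16}

# Sanity checks: the columns should add up to N=90 and N=30.
assert sum(all_samples.values()) == 90
assert sum(best_of_3.values()) == 30

all_mean = mean_from_distribution(all_samples)   # 244 / 90
best_mean = mean_from_distribution(best_of_3)    # 124 / 30
```

This works out to roughly 2.71 across all samples and roughly 4.13 after selection, which lines up with the per-prompt averages reported in the next table.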

Prompt-Level Performance

| Prompt | All-Sample Avg True Count | Best-of Avg True Count | Selection Delta | Best-of 5/5 Cases |
|---|---|---|---|---|
| 01-base | 3.07 | 4.10 | +1.03 | 4 / 10 (40.0%) |
| 02-base-with-bio | 2.40 | 3.90 | +1.50 | 5 / 10 (50.0%) |
| 03-journey-then-base | 2.67 | 4.40 | +1.73 | 7 / 10 (70.0%) |

Best-of is computed within each case group of 3 samples.
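
The selection delta column is the best-of average minus the all-sample average. A short sketch that recomputes it from the table values and ranks the prompts by uplift (variable names are hypothetical):

```python
# Averages copied from the prompt-level table: (all-sample avg, best-of avg).
averages = {
    "01-base": (3.07, 4.10),
    "02-base-with-bio": (2.40, 3.90),
    "03-journey-then-base": (2.67, 4.40),
}

# Selection delta = best-of average minus all-sample average.
deltas = {p: round(best - raw, 2) for p, (raw, best) in averages.items()}

# Rank prompts by how much they gain from sampling three times and keeping the best.
ranked = sorted(deltas, key=deltas.get, reverse=True)
```

A large delta means the prompt depends on selection to look good; a small delta means the raw samples are already close to the best-of result.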

Rubric Hit Rates (All Samples vs Best-of)

| Rubric Question | True in All Samples | True in Best-of |
|---|---|---|
| Unexpected interpretation | 25 / 90 (27.8%) | 18 / 30 (60.0%) |
| Every line adds meaning | 62 / 90 (68.9%) | 28 / 30 (93.3%) |
| Strong in few lines | 62 / 90 (68.9%) | 28 / 30 (93.3%) |
| Transforms the word | 65 / 90 (72.2%) | 28 / 30 (93.3%) |
| Makes reader see word differently | 30 / 90 (33.3%) | 22 / 30 (73.3%) |

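Best-of selection raises the hit rate on all five dimensions by at least 21 percentage points. The largest gains are in reframing (“makes reader see word differently,” 33.3% to 73.3%) and interpretive novelty (“unexpected interpretation,” 27.8% to 60.0%).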

Prompt-Level Rubric Hit Rates (All Samples)

| Prompt | Unexpected | Every Line Adds Meaning | Strong in Few Lines | Transforms Word | See Word Differently |
|---|---|---|---|---|---|
| 01-base | 20.0% | 90.0% | 90.0% | 70.0% | 36.7% |
| 02-base-with-bio | 26.7% | 56.7% | 56.7% | 73.3% | 26.7% |
| 03-journey-then-base | 36.7% | 60.0% | 60.0% | 73.3% | 36.7% |

Percentages are from all 30 samples per prompt.
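
Because each prompt has exactly 30 samples, the rounded percentages can be turned back into whole-sample counts and checked against the pooled hit rates. A sketch for the “unexpected interpretation” column (variable names are hypothetical):

```python
SAMPLES_PER_PROMPT = 30

# "Unexpected interpretation" hit rates per prompt, copied from the table (percent).
unexpected_pct = {
    "01-base": 20.0,
    "02-base-with-bio": 26.7,
    "03-journey-then-base": 36.7,
}

# Recover whole-sample counts from the rounded percentages.
unexpected_counts = {
    p: round(pct / 100 * SAMPLES_PER_PROMPT) for p, pct in unexpected_pct.items()
}

# Pool across prompts to compare with the 25 / 90 figure in the rubric table.
pooled = sum(unexpected_counts.values())
pooled_rate = pooled / (SAMPLES_PER_PROMPT * len(unexpected_pct))
```

The same conversion applies to the other four columns, since every percentage in the table is a multiple of 1/30.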

Prompt-by-prompt read

01-base

The base prompt is the most stable on line-level discipline. It has the highest all-sample mean (3.07) and the strongest “every line adds meaning” rate (90.0%) in raw sampling. The tradeoff is novelty: it stays grounded, but its unexpected-interpretation rate is the lowest of the three setups (20.0%, versus 36.7% for the narrative-heavy journey setup).

In vocabulary, this setup repeatedly returns to a tight motif cluster: chipped, rain, dust, motes, fingers, window. That consistency helps coherence, but can narrow divergence if reused for many drafting passes.

02-base-with-bio

The compact author-profile framing is the most volatile setup. Raw average is the lowest (2.40), but best-of climbs to 3.90 with 50.0% of cases reaching 5/5. The pattern points to strong hits in the sample set alongside more misses: 6 of 30 raw samples score 5/5, but 11 of 30 score 0 or 1, the most of any setup.

This prompt leans hard into the freight-yard urban texture: forgotten, rust, slicked, cold, boots. It nudges transformation behavior up slightly (73.3% for “transforms word,” versus 70.0% for the base prompt) but gives up a lot of line-economy consistency (56.7% for “every line adds meaning,” versus 90.0%).

03-journey-then-base

The long journey preamble produces the highest ceiling. It has the best best-of mean (4.40) and the most perfect best-of outcomes (7 / 10 cases at 5/5). It is also strongest on divergent interpretation after selection (80.0% on “unexpected interpretation” in best-of).

This setup costs more prompt budget and is less stable in raw mode (2.67 mean), but it creates the largest uplift when sampling and selecting (+1.73). If the workflow supports best-of selection, this is currently the most effective framing in this run.

Word-level behavior

Word Performance Across Prompt Setups

| Word | All-Sample Avg (N=9) | Best Score by Prompt | Best-of Mean |
|---|---|---|---|
| thorn | 4.00 | 5, 5, 5 | 5.00 |
| flower | 3.11 | 5, 5, 5 | 5.00 |
| ember | 3.11 | 5, 5, 5 | 5.00 |
| rust | 2.78 | 5, 5, 5 | 5.00 |
| glass | 3.11 | 4, 5, 5 | 4.67 |
| lantern | 2.78 | 3, 3, 5 | 3.67 |
| love | 2.33 | 4, 4, 3 | 3.67 |
| echo | 2.11 | 3, 2, 5 | 3.33 |
| tide | 2.22 | 4, 2, 3 | 3.00 |
| mercy | 1.56 | 3, 3, 3 | 3.00 |

Best score by prompt shows the strongest sample each word reached in prompts 01, 02, and 03.

thorn is the cleanest word in this run: highest raw mean (4.00) and perfect best-of across all three prompts. mercy, echo, and tide are the hardest words in raw sampling (1.56, 2.11, and 2.22), and although selection lifts all three, they still have the lowest best-of means (3.00, 3.33, and 3.00). rust is notable because it looks mid-pack in raw average (2.78) and then jumps to perfect best-of in every prompt.

Vocabulary fingerprints by prompt

Prompt Vocabulary and Diversity Signals

| Prompt | Recurring Best-of Vocabulary | Best-of Type/Token Ratio | Avg Tokens per Best Poem |
|---|---|---|---|
| 01-base | chipped, rain, dust, motes, fingers, window | 0.401 | 47.6 |
| 02-base-with-bio | forgotten, rust, rain, slicked, cold, boots | 0.454 | 39.4 |
| 03-journey-then-base | dust, motes, rain, slow, single, forgotten | 0.464 | 42.9 |

Type/token ratio is lexical diversity on the best-of subset (higher means more varied vocabulary).
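
For readers who want to reproduce a type/token ratio on their own outputs, here is a minimal sketch. It makes two assumptions that this log does not confirm: tokens are lowercase word matches from a simple regex, and the ratio is pooled over all poems in the subset rather than averaged per poem. The function names and the toy lines are hypothetical and are not outputs from this run.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; the tokenizer used in the run is not documented."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(poems: list[str]) -> float:
    """Unique tokens divided by total tokens, pooled over all given poems."""
    tokens = [tok for poem in poems for tok in tokenize(poem)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Made-up lines for illustration only; not outputs from this run.
toy_poems = [
    "rain on the window, rain on the glass",
    "dust motes drift in the slow light",
]
ttr = type_token_ratio(toy_poems)  # 11 unique tokens out of 15
```

Under the pooled assumption, vocabulary that recurs across poems pulls the ratio down, so a prompt that keeps returning to the same motif words would score lower than one that varies its imagery.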

The third prompt has the highest lexical diversity on selected outputs (0.464 type/token), which matches the qualitative read: broader language variation and stronger conceptual pivots, at higher prompt-token cost.

Notes

For divergent poem drafting, the current best operating mode is:

  • use 03-journey-then-base when selection is available,
  • keep 01-base as the stability fallback,
  • continue tuning 02-base-with-bio because its wins are strong but inconsistent.