Poem Experiment Log: Prompt Framing for Divergent Thinking

This log addresses one practical question: can prompt framing alone push a small model toward more divergent poem outputs, with model size held fixed?

Run summary

Run timestamp: March 6, 2026 (2026-03-06T19:46:40.872Z)
Model: gemma3:1b
Prompt setups: 3
Seed words: 10
Samples per case: 3
Total generations: 90
Scoring: 5 binary creativity/quality questions per poem (0-5 true-count score)
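The true-count score can be sketched as a simple tally over the five rubric questions named in the rubric tables of this log. This is a minimal sketch: the key names and the dictionary layout are hypothetical, since the log does not say how rubric answers are stored.

```python
from typing import Mapping

# Hypothetical keys, one per rubric question named in this log.
RUBRIC_KEYS = (
    "unexpected_interpretation",
    "every_line_adds_meaning",
    "strong_in_few_lines",
    "transforms_word",
    "see_word_differently",
)

def true_count(answers: Mapping[str, bool]) -> int:
    """Count how many of the five rubric questions were answered true (0-5)."""
    # A missing answer is treated as false; that choice is an assumption.
    return sum(1 for key in RUBRIC_KEYS if answers.get(key, False))

# Made-up example poem, not taken from the run data.
example = {
    "unexpected_interpretation": False,
    "every_line_adds_meaning": True,
    "strong_in_few_lines": True,
    "transforms_word": True,
    "see_word_differently": False,
}
print(true_count(example))
```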

Data

poem-experiment-2026-03-06T19-46-40-872Z.json

Words tested

flower lantern tide rust mercy love echo thorn glass ember
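The run design is a full grid of prompt setups, seed words, and repeated samples. A minimal sketch of enumerating it, using the setup names that appear in the tables of this log; the tuple layout is illustrative and not taken from the experiment code.

```python
from itertools import product

# Setup names and seed words as listed in this log.
PROMPTS = ["01-base", "02-base-with-bio", "03-journey-then-base"]
WORDS = ["flower", "lantern", "tide", "rust", "mercy",
         "love", "echo", "thorn", "glass", "ember"]
SAMPLES_PER_CASE = 3

# One (prompt, word) pair per case; one (prompt, word, sample_index) per generation.
# The tuple layout is an assumption made for illustration.
cases = list(product(PROMPTS, WORDS))
generations = list(product(PROMPTS, WORDS, range(SAMPLES_PER_CASE)))

print(len(cases), len(generations))
```

The two counts correspond to the 30 prompt+word cases and the 90 total generations reported in the run summary.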

Quality distributions by prompt

Best-of-3 Per Case (Highest True Count Selected)

Each prompt+word case contributes one sample: the one with the highest score out of three generations.

Figure: quality distribution per prompt, best-of-3 per case (axis: count). 01-base: 10 case groups; 02-base-with-bio: 10 case groups; 03-journey-then-base: 10 case groups.

All Samples (No Best-Of Selection)

Every generated sample is counted. This view shows baseline spread before selecting the strongest option per case.

Figure: quality distribution per prompt, all samples (axis: count). 01-base: 30 samples; 02-base-with-bio: 30 samples; 03-journey-then-base: 30 samples.

Overall Score Distribution

True Count Score (0-5)	All Samples (N=90)	Best-of-3 per Case (N=30)
0	10 / 90 (11.1%)	0 / 30 (0.0%)
1	12 / 90 (13.3%)	0 / 30 (0.0%)
2	16 / 90 (17.8%)	2 / 30 (6.7%)
3	27 / 90 (30.0%)	8 / 30 (26.7%)
4	6 / 90 (6.7%)	4 / 30 (13.3%)
5	19 / 90 (21.1%)	16 / 30 (53.3%)

Each score is the number of rubric questions answered true (max 5). Best-of picks the highest-scoring sample in each prompt+word case.
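Best-of selection and the resulting score distribution can be sketched as follows. The record layout (prompt, word, score) is an assumption, and the sample records are made up for illustration; they are not data from this run.

```python
from collections import Counter

def best_of_per_case(samples):
    """Return {(prompt, word): highest score} from (prompt, word, score) records."""
    best = {}
    for prompt, word, score in samples:
        key = (prompt, word)
        # Keep the larger of the stored score and the new one.
        best[key] = max(score, best.get(key, score))
    return best

# Made-up records for illustration only.
samples = [
    ("01-base", "tide", 2), ("01-base", "tide", 4), ("01-base", "tide", 3),
    ("01-base", "rust", 5), ("01-base", "rust", 1), ("01-base", "rust", 0),
]

best = best_of_per_case(samples)
distribution = Counter(best.values())  # score -> number of cases at that score
print(best)
print(dict(distribution))
```

Ties do not matter for the score tables, since only the top score per case is kept. A workflow that also keeps the winning poem text would need a tie-breaking rule, which this log does not specify.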

Prompt-Level Performance

Prompt	All-Sample Avg True Count	Best-of Avg True Count	Selection Delta	Best-of 5/5 Cases
01-base	3.07	4.10	+1.03	4 / 10 (40.0%)
02-base-with-bio	2.40	3.90	+1.50	5 / 10 (50.0%)
03-journey-then-base	2.67	4.40	+1.73	7 / 10 (70.0%)

Best-of is computed within each case group of 3 samples.
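The Selection Delta column is the best-of average minus the all-sample average. A short check using the figures from the table above; the function name is illustrative.

```python
# (all-sample avg, best-of avg) copied from the prompt-level table.
averages = {
    "01-base": (3.07, 4.10),
    "02-base-with-bio": (2.40, 3.90),
    "03-journey-then-base": (2.67, 4.40),
}

def selection_delta(all_avg: float, best_avg: float) -> float:
    """Uplift from picking the best of three samples, rounded to two decimals."""
    return round(best_avg - all_avg, 2)

deltas = {name: selection_delta(*pair) for name, pair in averages.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}")
```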

Rubric Hit Rates (All Samples vs Best-of)

Rubric Question	True in All Samples	True in Best-of
Unexpected interpretation	25 / 90 (27.8%)	18 / 30 (60.0%)
Every line adds meaning	62 / 90 (68.9%)	28 / 30 (93.3%)
Strong in few lines	62 / 90 (68.9%)	28 / 30 (93.3%)
Transforms the word	65 / 90 (72.2%)	28 / 30 (93.3%)
Makes reader see word differently	30 / 90 (33.3%)	22 / 30 (73.3%)

Best-of selection raises the hit rate on all five dimensions. The largest gains are on interpretive novelty (27.8% to 60.0%) and reframing (33.3% to 73.3%).

Prompt-Level Rubric Hit Rates (All Samples)

Prompt	Unexpected	Every Line Adds Meaning	Strong in Few Lines	Transforms Word	See Word Differently
01-base	20.0%	90.0%	90.0%	70.0%	36.7%
02-base-with-bio	26.7%	56.7%	56.7%	73.3%	26.7%
03-journey-then-base	36.7%	60.0%	60.0%	73.3%	36.7%

Percentages are from all 30 samples per prompt.
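Because each poem's score is the number of rubric questions answered true, a prompt's average true count equals the sum of its five hit rates expressed as fractions. A quick consistency check on the percentages above, sketched with the table values typed in by hand:

```python
# Hit rates (%) copied from the table above, in column order.
hit_rates = {
    "01-base": [20.0, 90.0, 90.0, 70.0, 36.7],
    "02-base-with-bio": [26.7, 56.7, 56.7, 73.3, 26.7],
    "03-journey-then-base": [36.7, 60.0, 60.0, 73.3, 36.7],
}

# Sum of the five rates, converted from percent to a 0-5 average score.
implied_avg = {name: round(sum(rates) / 100, 2) for name, rates in hit_rates.items()}
for name, avg in implied_avg.items():
    print(f"{name}: {avg:.2f}")
```

The implied averages agree with the all-sample averages in the prompt-level performance table (3.07, 2.40, 2.67).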

Prompt-by-prompt read

01-base

The base prompt is the most stable on line-level discipline. It has the highest all-sample mean (3.07) and the strongest “every line adds meaning” rate (90.0%) in raw sampling. The tradeoff is novelty: it stays grounded, and its “unexpected interpretation” rate (20.0%) is the lowest of the three setups, well below 03-journey-then-base (36.7%).

In vocabulary, this setup repeatedly returns to a tight motif cluster: chipped, rain, dust, motes, fingers, window. That consistency helps coherence, but can narrow divergence if reused for many drafting passes.

02-base-with-bio

The compact author-profile framing is a volatile setup. Its raw average is the lowest (2.40), but best-of climbs to 3.90, with 50.0% of cases reaching 5/5. The sample set contains strong hits, but also more misses.

This prompt leans hard into the freight-yard urban texture: forgotten, rust, slicked, cold, boots. It edges past the base prompt on “transforms word” (73.3% vs 70.0%) but drops sharply on line economy (56.7% vs 90.0% for “every line adds meaning”).

03-journey-then-base

The long journey preamble produces the highest ceiling. It has the highest best-of mean (4.40) and the most 5/5 best-of cases (7 of 10). It is also the strongest on divergent interpretation after selection (80.0% on “unexpected interpretation” in best-of).

This setup costs more prompt budget and is less stable in raw mode (2.67 mean), but it creates the largest uplift when sampling and selecting (+1.73). If the workflow supports best-of selection, this is currently the most effective framing in this run.

Word-level behavior

Word Performance Across Prompt Setups

Word	All-Sample Avg (N=9)	Best Score by Prompt (01, 02, 03)	Best-of Mean
thorn	4.00	5, 5, 5	5.00
flower	3.11	5, 5, 5	5.00
ember	3.11	5, 5, 5	5.00
rust	2.78	5, 5, 5	5.00
glass	3.11	4, 5, 5	4.67
lantern	2.78	3, 3, 5	3.67
love	2.33	4, 4, 3	3.67
echo	2.11	3, 2, 5	3.33
tide	2.22	4, 2, 3	3.00
mercy	1.56	3, 3, 3	3.00

Best score by prompt shows the strongest sample each word reached in prompts 01, 02, and 03.
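The Best-of Mean column is the mean of the three per-prompt best scores, and the column-wise means of the same scores give the prompt-level best-of averages. A short cross-check with the table values typed in by hand:

```python
# Best score per prompt (01, 02, 03), copied from the word table above.
best_by_prompt = {
    "thorn": (5, 5, 5), "flower": (5, 5, 5), "ember": (5, 5, 5),
    "rust": (5, 5, 5), "glass": (4, 5, 5), "lantern": (3, 3, 5),
    "love": (4, 4, 3), "echo": (3, 2, 5), "tide": (4, 2, 3),
    "mercy": (3, 3, 3),
}

# Row means: best-of mean per word.
word_means = {w: round(sum(s) / len(s), 2) for w, s in best_by_prompt.items()}

# Column means: best-of average per prompt, in order 01, 02, 03.
prompt_means = [round(sum(col) / len(col), 2) for col in zip(*best_by_prompt.values())]

print(word_means["glass"], word_means["echo"])
print(prompt_means)
```

The column means match the best-of averages reported per prompt (4.10, 3.90, 4.40).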

The word thorn is the cleanest in this run: highest raw mean (4.00) and a perfect best-of score under all three prompts. The words mercy, echo, and tide are the hardest in raw sampling (1.56, 2.11, and 2.22), but each has a best-of mean of at least 3.00. The word rust is notable because it looks mid-pack in raw average (2.78) and then jumps to a perfect best-of under every prompt.

Vocabulary fingerprints by prompt

Prompt Vocabulary and Diversity Signals

Prompt	Recurring Best-of Vocabulary	Best-of Type/Token Ratio	Avg Tokens per Best Poem
01-base	chipped, rain, dust, motes, fingers, window	0.401	47.6
02-base-with-bio	forgotten, rust, rain, slicked, cold, boots	0.454	39.4
03-journey-then-base	dust, motes, rain, slow, single, forgotten	0.464	42.9

Type/token ratio is lexical diversity on the best-of subset (higher means more varied vocabulary).
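A type/token ratio can be sketched as unique tokens divided by total tokens. The tokenizer below (lowercased runs of letters and apostrophes) is an assumption; the log does not state how tokens were split or whether the ratio was computed per poem or over pooled text.

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens; 0.0 for empty input."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Made-up line for illustration, not a poem from this run.
line = "rain on rust, rain on glass"
print(round(type_token_ratio(line), 3))
```

Type/token ratio tends to fall as texts get longer, so comparisons across prompts are cleanest when the poems are of similar length.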

The third prompt has the highest lexical diversity on selected outputs (0.464 type/token), narrowly ahead of 02-base-with-bio (0.454) and well ahead of 01-base (0.401). This is consistent with the qualitative read: broader language variation and stronger conceptual pivots, at higher prompt-token cost.

Notes

For divergent poem drafting, the current best operating mode is:

use 03-journey-then-base when selection is available,
keep 01-base as the stability fallback,
continue tuning 02-base-with-bio because its wins are strong but inconsistent.