Seventh Log - Writing Experiments
Poem Experiment Log: Prompt Framing for Divergent Thinking
March 11, 2026
I ran a poem experiment to answer one practical question: can prompt framing alone push a small model toward more divergent, image-driven poems, without changing model size? This post includes the full JSON dataset, score distributions, and prompt-by-prompt analysis.
Run summary
- Run timestamp: March 6, 2026 (2026-03-06T19:46:40.872Z)
- Model: gemma3:1b
- Prompt setups: 3
- Seed words: 10
- Samples per case: 3
- Total generations: 90
- Scoring: 5 binary creativity/quality questions per poem (0-5 true-count score)
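The run design can be sketched as a small grid of cases. This is a minimal illustration, not the script used for the run: the rubric keys are hypothetical field names paraphrased from the rubric tables, and the word list is taken from the word-level table further down.

```python
from itertools import product

PROMPTS = ["01-base", "02-base-with-bio", "03-journey-then-base"]
WORDS = ["thorn", "flower", "ember", "rust", "glass",
         "lantern", "love", "echo", "tide", "mercy"]
SAMPLES_PER_CASE = 3

# Hypothetical keys for the five binary rubric questions.
RUBRIC = [
    "unexpected_interpretation",
    "every_line_adds_meaning",
    "strong_in_few_lines",
    "transforms_the_word",
    "see_word_differently",
]

def true_count(answers):
    """Score one poem as the number of rubric questions answered true (0-5)."""
    return sum(bool(answers[q]) for q in RUBRIC)

cases = list(product(PROMPTS, WORDS))  # one case per prompt+word pair
total_generations = len(cases) * SAMPLES_PER_CASE

# Made-up rubric answers for one poem, for illustration only.
example_answers = {q: False for q in RUBRIC}
example_answers["transforms_the_word"] = True
example_answers["strong_in_few_lines"] = True

print(len(cases), total_generations, true_count(example_answers))  # 30 90 2
```

3 prompts × 10 words gives 30 cases, and 3 samples per case gives the 90 generations reported above.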
Data
Words tested
thorn, flower, ember, rust, glass, lantern, love, echo, tide, mercy
Quality distributions by prompt
Best-of-3 Per Case (Highest True Count Selected)
Each prompt+word case contributes one sample: the one with the highest score out of three generations.
All Samples (No Best-Of Selection)
Every generated sample is counted. This view shows baseline spread before selecting the strongest option per case.
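The two views differ only in a selection step. A minimal sketch of best-of-3 selection follows, assuming each sample is a record with `prompt`, `word`, and `score` fields. That record layout is hypothetical, and the tie-break (keep the earliest sample) is an assumption, since the post does not state how ties are resolved.

```python
def best_of_per_case(samples):
    """Keep the highest-scoring sample for each (prompt, word) case.

    Ties keep the earliest sample; this tie-break rule is an assumption.
    """
    best = {}
    for sample in samples:
        case = (sample["prompt"], sample["word"])
        if case not in best or sample["score"] > best[case]["score"]:
            best[case] = sample
    return best

# Toy records for illustration only, not values from the run.
toy_samples = [
    {"prompt": "01-base", "word": "tide", "sample": 1, "score": 2},
    {"prompt": "01-base", "word": "tide", "sample": 2, "score": 4},
    {"prompt": "01-base", "word": "tide", "sample": 3, "score": 4},
    {"prompt": "03-journey-then-base", "word": "echo", "sample": 1, "score": 5},
]

selected = best_of_per_case(toy_samples)
for case, sample in selected.items():
    print(case, "-> sample", sample["sample"], "score", sample["score"])
```

Applied to the full run, this reduces 90 samples to 30 selected poems, one per case.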
Overall Score Distribution
| True Count Score (0-5) | All Samples (N=90) | Best-of-3 per Case (N=30) |
|---|---|---|
| 0 | 10 / 90 (11.1%) | 0 / 30 (0.0%) |
| 1 | 12 / 90 (13.3%) | 0 / 30 (0.0%) |
| 2 | 16 / 90 (17.8%) | 2 / 30 (6.7%) |
| 3 | 27 / 90 (30.0%) | 8 / 30 (26.7%) |
| 4 | 6 / 90 (6.7%) | 4 / 30 (13.3%) |
| 5 | 19 / 90 (21.1%) | 16 / 30 (53.3%) |
Each score is the number of rubric questions answered true (max 5). Best-of picks the highest-scoring sample in each prompt+word case.
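As a cross-check, the shares and mean scores can be recomputed from the counts in the table. The sketch below uses only the table values; the function name is illustrative.

```python
# Counts per true-count score, copied from the table above.
all_counts = {0: 10, 1: 12, 2: 16, 3: 27, 4: 6, 5: 19}
best_counts = {0: 0, 1: 0, 2: 2, 3: 8, 4: 4, 5: 16}

def summarize(counts):
    """Return (N, mean score, percentage share per score)."""
    n = sum(counts.values())
    mean = sum(score * c for score, c in counts.items()) / n
    shares = {score: round(100 * c / n, 1) for score, c in counts.items()}
    return n, round(mean, 2), shares

n_all, mean_all, shares_all = summarize(all_counts)
n_best, mean_best, shares_best = summarize(best_counts)

print(n_all, mean_all)    # 90 2.71
print(n_best, mean_best)  # 30 4.13
```

Across all prompts, the mean true count rises from 2.71 on raw samples to 4.13 after best-of selection.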
Prompt-Level Performance
| Prompt | All-Sample Avg True Count | Best-of Avg True Count | Selection Delta | Best-of 5/5 Cases |
|---|---|---|---|---|
| 01-base | 3.07 | 4.10 | +1.03 | 4 / 10 (40.0%) |
| 02-base-with-bio | 2.40 | 3.90 | +1.50 | 5 / 10 (50.0%) |
| 03-journey-then-base | 2.67 | 4.40 | +1.73 | 7 / 10 (70.0%) |
Best-of is computed within each case group of 3 samples.
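The selection delta is the best-of average minus the all-sample average. A short sketch using the table values (variable names are illustrative):

```python
# (all-sample avg, best-of avg) per prompt, copied from the table above.
prompt_means = {
    "01-base": (3.07, 4.10),
    "02-base-with-bio": (2.40, 3.90),
    "03-journey-then-base": (2.67, 4.40),
}

deltas = {
    prompt: round(best - raw, 2)
    for prompt, (raw, best) in prompt_means.items()
}
largest_uplift = max(deltas, key=deltas.get)

for prompt, delta in deltas.items():
    print(f"{prompt}: +{delta:.2f}")
print("largest uplift:", largest_uplift)
```

A larger delta means a setup gains more from sampling several poems and keeping the best one.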
Rubric Hit Rates (All Samples vs Best-of)
| Rubric Question | True in All Samples | True in Best-of |
|---|---|---|
| Unexpected interpretation | 25 / 90 (27.8%) | 18 / 30 (60.0%) |
| Every line adds meaning | 62 / 90 (68.9%) | 28 / 30 (93.3%) |
| Strong in few lines | 62 / 90 (68.9%) | 28 / 30 (93.3%) |
| Transforms the word | 65 / 90 (72.2%) | 28 / 30 (93.3%) |
| Makes reader see word differently | 30 / 90 (33.3%) | 22 / 30 (73.3%) |
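Best-of selection raises the hit rate on all five dimensions. The largest gains are on reframing (“Makes reader see word differently”, +40.0 percentage points) and interpretive novelty (“Unexpected interpretation”, +32.2 points).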
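The size of each gain can be read off the table as a percentage-point difference. A minimal sketch, using the counts above:

```python
# (true in all samples, true in best-of) per rubric question, from the table.
rubric_counts = {
    "Unexpected interpretation": (25, 18),
    "Every line adds meaning": (62, 28),
    "Strong in few lines": (62, 28),
    "Transforms the word": (65, 28),
    "Makes reader see word differently": (30, 22),
}
N_ALL, N_BEST = 90, 30

# Gain in percentage points: best-of hit rate minus all-sample hit rate.
gains = {
    question: round(100 * (best / N_BEST - raw / N_ALL), 1)
    for question, (raw, best) in rubric_counts.items()
}
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)

for question, gain in ranked:
    print(f"{question}: +{gain} points")
```

The two questions with the lowest raw hit rates show the largest gains under selection.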
Prompt-Level Rubric Hit Rates (All Samples)
| Prompt | Unexpected | Every Line Adds Meaning | Strong in Few Lines | Transforms Word | See Word Differently |
|---|---|---|---|---|---|
| 01-base | 20.0% | 90.0% | 90.0% | 70.0% | 36.7% |
| 02-base-with-bio | 26.7% | 56.7% | 56.7% | 73.3% | 26.7% |
| 03-journey-then-base | 36.7% | 60.0% | 60.0% | 73.3% | 36.7% |
Percentages are from all 30 samples per prompt.
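Because each score counts true answers, a prompt's all-sample average should equal the sum of its five rubric hit rates. The check below uses the rounded percentages from the table, so agreement is only expected to about two decimals.

```python
# Rubric hit rates (%) per prompt, copied from the table above.
hit_rates = {
    "01-base": [20.0, 90.0, 90.0, 70.0, 36.7],
    "02-base-with-bio": [26.7, 56.7, 56.7, 73.3, 26.7],
    "03-journey-then-base": [36.7, 60.0, 60.0, 73.3, 36.7],
}
# All-sample averages from the Prompt-Level Performance table.
reported_avg = {
    "01-base": 3.07,
    "02-base-with-bio": 2.40,
    "03-journey-then-base": 2.67,
}

# Expected true count per poem = sum of per-question probabilities.
implied_avg = {prompt: sum(rates) / 100 for prompt, rates in hit_rates.items()}

for prompt in hit_rates:
    print(f"{prompt}: implied {implied_avg[prompt]:.2f}, reported {reported_avg[prompt]:.2f}")
```

The implied and reported averages agree for all three prompts, so the two tables are consistent with each other.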
Prompt-by-prompt read
01-base
The base prompt is the most stable on line-level discipline. It has the highest all-sample mean (3.07) and the strongest “every line adds meaning” rate (90.0%) in raw sampling. The tradeoff is novelty: it stays grounded, and its unexpected-interpretation rate is the lowest of the three setups (20.0%, versus 36.7% for the narrative-heavy 03-journey-then-base).
In vocabulary, this setup repeatedly returns to a tight motif cluster: chipped, rain, dust, motes, fingers, window. That consistency helps coherence, but can narrow divergence if reused for many drafting passes.
02-base-with-bio
The compact author-profile framing is the most volatile setup. Its raw average is the lowest (2.40), but best-of climbs to 3.90 (still the lowest best-of mean of the three), with 50.0% of cases reaching 5/5. The sample set contains strong hits, but also more misses than the other setups.
This prompt leans hard into the freight-yard urban texture: forgotten, rust, slicked, cold, boots. It edges out the base prompt on transformation (73.3% versus 70.0% for “transforms word”, a one-sample difference) but loses line-economy consistency (56.7% versus 90.0% for “every line adds meaning”).
03-journey-then-base
The long journey preamble produces the highest ceiling. It has the best best-of mean (4.40) and the most perfect best-of outcomes (7 / 10 cases at 5/5). It is also strongest on divergent interpretation after selection (80.0% on “unexpected interpretation” in best-of).
This setup costs more prompt budget and is less stable in raw mode (2.67 mean), but it creates the largest uplift when sampling and selecting (+1.73). If the workflow supports best-of selection, this is the most effective framing in this run.
Word-level behavior
Word Performance Across Prompt Setups
| Word | All-Sample Avg (N=9) | Best Score by Prompt | Best-of Mean |
|---|---|---|---|
| thorn | 4.00 | 5, 5, 5 | 5.00 |
| flower | 3.11 | 5, 5, 5 | 5.00 |
| ember | 3.11 | 5, 5, 5 | 5.00 |
| rust | 2.78 | 5, 5, 5 | 5.00 |
| glass | 3.11 | 4, 5, 5 | 4.67 |
| lantern | 2.78 | 3, 3, 5 | 3.67 |
| love | 2.33 | 4, 4, 3 | 3.67 |
| echo | 2.11 | 3, 2, 5 | 3.33 |
| tide | 2.22 | 4, 2, 3 | 3.00 |
| mercy | 1.56 | 3, 3, 3 | 3.00 |
Best score by prompt shows the strongest sample each word reached in prompts 01, 02, and 03.
thorn is the cleanest word in this run: highest raw mean and perfect best-of across all three prompts. mercy, tide, and echo are the hardest words in raw sampling, but all improve with selection. rust is notable because it looks mid-pack in raw average and then jumps to perfect best-of in every prompt.
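The word table also reproduces the prompt-level best-of figures: averaging each prompt's column gives the best-of means, and counting the 5s gives the 5/5 case counts. A short sketch using the table values:

```python
PROMPTS = ["01-base", "02-base-with-bio", "03-journey-then-base"]

# Best score per word in prompts 01, 02, 03, copied from the table above.
best_by_prompt = {
    "thorn": (5, 5, 5), "flower": (5, 5, 5), "ember": (5, 5, 5),
    "rust": (5, 5, 5), "glass": (4, 5, 5), "lantern": (3, 3, 5),
    "love": (4, 4, 3), "echo": (3, 2, 5), "tide": (4, 2, 3),
    "mercy": (3, 3, 3),
}

# Row means: best-of mean per word.
word_means = {w: round(sum(s) / len(s), 2) for w, s in best_by_prompt.items()}

# Column means and perfect-score counts: one column per prompt.
columns = list(zip(*best_by_prompt.values()))
prompt_best_means = {p: sum(col) / len(col) for p, col in zip(PROMPTS, columns)}
perfect_cases = {p: col.count(5) for p, col in zip(PROMPTS, columns)}

print(prompt_best_means)
print(perfect_cases)
```

The column results match the Prompt-Level Performance table: best-of means of 4.10, 3.90, and 4.40, with 4, 5, and 7 perfect cases.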
Vocabulary fingerprints by prompt
Prompt Vocabulary and Diversity Signals
| Prompt | Recurring Best-of Vocabulary | Best-of Type/Token Ratio | Avg Tokens per Best Poem |
|---|---|---|---|
| 01-base | chipped, rain, dust, motes, fingers, window | 0.401 | 47.6 |
| 02-base-with-bio | forgotten, rust, rain, slicked, cold, boots | 0.454 | 39.4 |
| 03-journey-then-base | dust, motes, rain, slow, single, forgotten | 0.464 | 42.9 |
Type/token ratio is lexical diversity on the best-of subset (higher means more varied vocabulary).
The third prompt has the highest lexical diversity on selected outputs (0.464 type/token), which matches the qualitative read: broader language variation and stronger conceptual pivots, at higher prompt-token cost.
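For reference, a type/token ratio can be computed as unique tokens divided by total tokens. The sketch below is one plausible version, not necessarily the one used for the table: the lowercase regex tokenizer and the choice to pool tokens across poems (rather than average per-poem ratios) are both assumptions.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer; the exact tokenization is an assumption."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(poems):
    """Unique tokens / total tokens, pooled over all poems (an assumption)."""
    tokens = [tok for poem in poems for tok in tokenize(poem)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def avg_tokens_per_poem(poems):
    """Mean token count per poem."""
    return sum(len(tokenize(p)) for p in poems) / len(poems) if poems else 0.0

# Made-up line for illustration only, not a poem from the run.
toy_poems = ["Rain on the window, rain on the dust."]
print(type_token_ratio(toy_poems))     # 0.625 (5 unique / 8 tokens)
print(avg_tokens_per_poem(toy_poems))  # 8.0
```

Pooled ratios fall as the token count grows, so comparisons across prompts are cleanest when the poems are of similar length.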
Notes
For divergent poem drafting, the current best operating mode is:
- use 03-journey-then-base when selection is available,
- keep 01-base as the stability fallback,
- continue tuning 02-base-with-bio because its wins are strong but inconsistent.