We needed an AI that could write graded Chinese fiction — stories constrained to a specific HSK vocabulary level that still read like real literature. We assumed the hard part would be vocabulary control. We were wrong about almost everything.
The model with the tightest vocabulary compliance scored 10.5/30 on writing quality. The one that broke level most often scored 26.8 — including a perfect 30/30 that the reviewer said "reads like real Chinese children's literature." And the 4.5% vocabulary error rate we kept trying to fix? It turns out that's close to what second language acquisition research says is optimal for learning.
This article covers the full dataset. For how these findings shaped our production pipeline, see How We Built AI-Generated Graded Chinese Stories.
The Test
Nine LLMs (eight Chinese models, plus Gemini 2.5 Flash) generated multi-chapter stories at HSK 1, 3, 5, and 7 using identical outlines: same plot, same characters, same chapter structure. The only variable was the model.
We measured four things:
- Writing quality: 30-point rubric (characters, plot, pacing, naturalness, engagement, vocabulary — 5 points each), scored by Claude Sonnet
- Vocabulary compliance: percentage of above-level words, measured by jieba segmentation against the HSK 3.0 word list (a sketch of this check follows the list)
- Length control: actual vs. target character count
- Generation speed: wall-clock time per story
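The compliance metric is straightforward to reproduce. Here is a minimal sketch, assuming the HSK 3.0 list is stored as a tab-separated file of word and level; the file name and format are ours for illustration, not part of any official tooling:

```python
# Minimal sketch of the compliance check: segment with jieba, then count
# words outside the allowed HSK set. File name/format are illustrative.
import jieba

def load_allowed_words(path: str, max_level: int) -> set[str]:
    """Collect every word at or below max_level from lines of 'word<TAB>level'."""
    allowed = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, level = line.rstrip("\n").split("\t")
            if int(level) <= max_level:
                allowed.add(word)
    return allowed

def above_level_rate(story: str, allowed: set[str]) -> float:
    """Share of segmented Chinese words not in the allowed set
    (covers both above-level and unknown words)."""
    words = [w for w in jieba.cut(story)
             if any("\u4e00" <= ch <= "\u9fff" for ch in w)]
    if not words:
        return 0.0
    return sum(w not in allowed for w in words) / len(words)

allowed = load_allowed_words("hsk30_words.tsv", max_level=3)
print(f"{above_level_rate('他去了公园，看见一只小猫。', allowed):.1%}")
```

One caveat: jieba's segmentation does not always match the HSK list's word boundaries, so some "unknown" words in the metric are segmentation artifacts rather than genuinely above-level vocabulary.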
Test stories ranged from a child losing a toy at the park (HSK 1, 2,000 characters) to a retired calligrapher finding a letter in an antique inkstone that reveals his teacher's secret (HSK 7, 8,000 characters).
Three Models That Couldn't Write Fiction
Before any quality comparison could happen, three models failed to produce usable output.
Kimi K2.5, a reasoning model, leaked thinking tokens into story content: in 35% of chapters, English planning text ("The user wants me to continue the story...") appeared mid-paragraph. Its clean chapters scored ~17/30 versus K2's 26.8, and it only supports temperature=1, so there is no sampling knob to work around the leakage. Reasoning models are built to show their work, which is exactly what you don't want in fiction.
GLM-4.7-Flash dumped English analysis blocks into the narrative, then degenerated into token repetition loops: one story repeated "地方" ("place") for 10,000+ characters.
Step-3.5-Flash produced 13,412 characters of Chinese chain-of-thought reasoning ("好的,我现在需要处理这个写作任务...", roughly "Okay, I now need to handle this writing task...") where the first chapter should have been.
All three share a pattern: models optimized for reasoning struggle with sustained creative output. They plan instead of writing.
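This failure mode is also easy to catch mechanically. A minimal sketch of a language-purity check based on the Latin-to-CJK character ratio; the 5% threshold is illustrative, not a tuned production value:

```python
# Flag chapters whose Latin-script share suggests leaked English
# reasoning. The 5% threshold is illustrative, not tuned.
import re

LATIN = re.compile(r"[A-Za-z]")
CJK = re.compile(r"[\u4e00-\u9fff]")

def has_english_leakage(chapter: str, max_latin_ratio: float = 0.05) -> bool:
    latin = len(LATIN.findall(chapter))
    cjk = len(CJK.findall(chapter))
    if cjk == 0:
        return True  # a "Chinese" chapter with no Chinese is its own failure
    return latin / (latin + cjk) > max_latin_ratio
```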
The Scoreboard
Writing quality (out of 30):
| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 | Average |
|---|---|---|---|---|---|
| Kimi K2 | 30 | 21 | 29 | 27 | 26.8 |
| DeepSeek V3.2 | 24 | 28 | 25 | 27 | 26.0 |
| Doubao Seed 2.0 Pro | 22 | 25 | 24 | 23 | 23.5 |
| Doubao Seed 2.0 Mini | 15 | 17 | 20 | 19 | 17.8 |
| Gemini 2.5 Flash | 18 | 17 | 12 | 6 | 13.3 |
| Qwen 3.5 Plus | 7 | 14 | 12 | 9 | 10.5 |
Kimi K2 and DeepSeek V3.2 are the clear top tier. Gemini and Qwen collapse at higher levels — Gemini's HSK 7 scored 6/30, Qwen's HSK 1 scored 7/30.
Vocabulary compliance (above-level + unknown word rate; lower is better):
| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 | Average |
|---|---|---|---|---|---|
| Qwen 3.5 Plus | 1.8% | 1.2% | 0.6% | 0.2% | 0.9% |
| DeepSeek V3.2 | 6.6% | 3.5% | 2.0% | 0.6% | 3.2% |
| Kimi K2 | 10.1% | 3.4% | 3.5% | 1.2% | 4.5% |
Vocabulary compliance and writing quality are inversely correlated. Qwen achieves 0.9% errors by writing below level — simple, repetitive, flat. Kimi writes naturally and reaches for the right word even when it's above level.
Generation speed:
| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 |
|---|---|---|---|---|
| Gemini 2.5 Flash | 16s | 66s | 67s | 162s |
| Kimi K2 | 41s | 88s | 197s | 313s |
| DeepSeek V3.2 | 113s | 305s | 516s | 486s |
| Doubao Seed 2.0 Pro | 225s | 278s | 366s | 793s |
| Qwen 3.5 Plus | 900s | 728s | 987s | 1006s |
Kimi K2 generates a full HSK 7 story in ~5 minutes. Qwen takes 17 minutes for lower quality. Among the usable models, speed loosely tracks quality: the top-tier writers are also quick, though Gemini is the fastest of all and its quality collapses at higher levels.
Why 4.5% "Errors" Are Close to Optimal
Those vocabulary error rates look like failures. They're not.
Second language acquisition research consistently finds that 2–3% unknown words is the optimal learning zone. At 98% known words, readers comprehend well enough to acquire new vocabulary through context. At 95%, comprehension is rough but manageable. Below 90%, it collapses entirely (Hu & Nation 2000, Schmitt et al. 2011).
Each unknown word needs 8–12 encounters in varied contexts for acquisition (Waring & Takaki 2003). Narrative text gets a comprehension bonus over expository text — readers tolerate more unknowns in stories because plot context aids guessing.
Kimi K2's 4.5% error rate drops to 1.2–3.5% at HSK 5+, right in the sweet spot. Learners encounter enough unknown words to acquire them naturally while maintaining the 95%+ comprehension needed to enjoy the story.
Qwen's 0.9% error rate — the "best" compliance — means learners encounter almost nothing new. Perfect vocabulary control produces a ceiling, not a floor. We were trying to fix something that was already helping learners.
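To make the thresholds concrete, here is a small sketch that maps an unknown-word rate onto the bands above; the band edges come from the cited research, the labels are our reading of them:

```python
# Map an unknown-word rate onto the comprehension bands described above
# (2-3% unknown optimal; 95% known manageable; below 90% known, collapse).
def learning_zone(unknown_rate: float) -> str:
    if unknown_rate < 0.02:
        return "ceiling: comprehensible, but almost nothing new to acquire"
    if unknown_rate <= 0.03:
        return "optimal: new words are acquirable from context"
    if unknown_rate <= 0.05:
        return "manageable: comprehension rough but workable"
    if unknown_rate <= 0.10:
        return "strained: comprehension degrading"
    return "collapse: too many unknowns to follow the story"

print(learning_zone(0.009))  # Qwen's average: ceiling
print(learning_zone(0.025))  # mid-band: optimal
print(learning_zone(0.045))  # Kimi K2's overall average: manageable
```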
The Word List Trap
Our initial benchmark injected the full HSK word list into every prompt. At HSK 7, that's roughly 50,000 characters (~25,000 tokens) of vocabulary before the model starts writing. We assumed this would help models stay within level.
It did the opposite.
| Model | HSK Level | With List | Without List | Delta |
|---|---|---|---|---|
| Kimi K2 | HSK 3 | 21/30 | 27/30 | +6 |
| Doubao Pro | HSK 7 | 23/30 | 30/30 | +7 |
| Qwen 3.5 Plus | HSK 7 | 9/30 | 26/30 | +17 |
Qwen jumped from 9/30 to 26/30. Doubao Pro hit a perfect score — the reviewer called it "the most emotionally powerful story" in the entire benchmark. Showing a model words it shouldn't use activates those words in its attention while degrading creative output.
The vocab list also hurt compliance. In a separate experiment, adding the word list to pedagogical prompts nearly doubled above-level violations at HSK 1 (from 35 to 66 unique above-level words).
One exception: DeepSeek V3.2's HSK 3 story dropped from 28/30 to 23/30 without the list. For that model at lower levels, the constraint provided useful focus. Every other model improved.
8 Prompt Strategies, One Crossover
We tested eight prompt variants across all nine HSK levels:
- Baseline — full vocab list in prompt
- No-vocab — "write using HSK N vocabulary," no word list
- Pedagogical — level-specific structural rules (HSK 1: SVO only, explicit subjects; HSK 3: complex sentences, pacing)
- Pedagogical + vocab — structural rules plus word list injection
- Craft-detailed — 40% dialogue ratio, sensory details, pacing beats
- Author-identity — "You are a Chinese author" framing
- Mandarin — system and user prompts entirely in Chinese
- Mandarin-pedagogical — Chinese prompts with structural rules
Eliminated early:
- Author-identity: worst compliance, no quality gain
- Craft-detailed: highest quality ceiling, but a 9.6% error rate at HSK 4 and an 11% catastrophic failure rate
- Pedagogical + vocab: adding the word list to pedagogical rules hurt compliance at every level
The clear finding: the optimal strategy changes at HSK 4.
| HSK | Best Strategy | Error Rate |
|---|---|---|
| 1 | Pedagogical | 3.1% |
| 2 | Pedagogical | 6.2% |
| 3 | Pedagogical | 6.6% |
| 4 | No-vocab | 7.3% |
| 5 | No-vocab | 3.6% |
| 6 | No-vocab | 2.8% |
| 7 | No-vocab | 1.3% |
| 8 | No-vocab | 1.0% |
| 9 | No-vocab | 1.4% |
Below HSK 4, pedagogical rules reduce above-level words by 2–5x. Rules like "use only SVO sentence structure" and "state subjects explicitly" genuinely constrain vocabulary. Above HSK 4, the vocabulary pool (1,978+ words) is rich enough that models naturally stay within level without constraints.
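In production, this crossover reduces to a one-branch dispatch. A sketch, with the rule text abridged (the HSK 2 rule is interpolated for illustration; the source only spells out HSK 1 and HSK 3):

```python
# Prompt-strategy dispatch implied by the table above. Rule text is
# abridged and illustrative. No vocabulary list is injected at any
# level (see "The Word List Trap").
PEDAGOGICAL_RULES = {
    1: "Use only simple SVO sentences and state every subject explicitly.",
    2: "Keep sentences short, one clause each, subjects explicit.",  # interpolated
    3: "Complex sentences are allowed; keep the pacing steady.",
}

def build_story_prompt(hsk_level: int, outline: str) -> str:
    if hsk_level <= 3:
        # Below the crossover, structural rules cut above-level words 2-5x.
        return (
            f"Write a story at HSK {hsk_level}. {PEDAGOGICAL_RULES[hsk_level]}"
            f"\n\nOutline:\n{outline}"
        )
    # At HSK 4+, the bare level instruction outperforms injected constraints.
    return f"Write a story using HSK {hsk_level} vocabulary.\n\nOutline:\n{outline}"
```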
One more finding: HSK 7, 8, and 9 are indistinguishable by prompts. All three share the same vocabulary pool (10,896 words). The scoring model confirmed it can't distinguish literary levels between them. Only story concept complexity and target length differentiate these levels.
Every Approach Scores 22/30
After establishing Kimi K2 as our production model, we tried everything to push quality higher. Eight architectural approaches, all tested on the same HSK 5 story concept:
| Approach | Score |
|---|---|
| Baseline (1-chapter context) | 22/30 |
| Full text context (all prior chapters) | 22/30 |
| AI reviewer feedback injected into generation | 22/30 |
| One-shot full-story generation | 22/30 |
| Structured state tracking (SCORE framework) | 22/30 |
| Combined (full context + enhanced outline) | 21/30 |
| Enhanced outline (Harmon Story Circle + timeline + character knowledge) | 17/30 |
Seven of eight approaches scored 21–22. Enhanced outlines — with Harmon Story Circle structure, timeline tracking, character knowledge states, and setup/payoff pairs — hurt quality by 5 points. More constraints gave the model more opportunities to violate them.
We also tested 9 parameter configurations: temperature from 0.6 to 1.0, frequency penalties from 0 to 0.5, presence penalties, top-p variations, and the official recommended settings from Kimi, Qwen, and Doubao. Every configuration scored 22–23/30 with essentially the same per-criterion breakdown (4/4/3/4/4/4). Pacing and plot coherence are always the weak points. Parameters don't matter.
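The sweep itself is trivial to reproduce. A sketch assuming an OpenAI-compatible client pointed at Moonshot's endpoint; the model name and grid values are illustrative, and the real sweep also varied presence penalty and top-p:

```python
# Illustrative parameter sweep against an OpenAI-compatible endpoint.
# Grid values mirror the ranges described above but are not exhaustive.
from itertools import product
from openai import OpenAI  # any OpenAI-compatible client works

client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="MOONSHOT_API_KEY")
PROMPT = "Write a story using HSK 5 vocabulary.\n\nOutline:\n..."  # illustrative

for temperature, frequency_penalty in product([0.6, 0.8, 1.0], [0.0, 0.25, 0.5]):
    resp = client.chat.completions.create(
        model="kimi-k2",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
        frequency_penalty=frequency_penalty,
    )
    story = resp.choices[0].message.content
    # score(story) against the 30-point rubric; every cell landed at 22-23/30
```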
Finally, we tested a two-pass rewrite approach: Claude Sonnet reviews stories and identifies issues, then Kimi K2 rewrites flagged chapters.
| HSK | Original | After Rewrite |
|---|---|---|
| 2 | 22/30 | 21/30 |
| 5 | 22/30 | 22/30 |
| 7 | 19/30 | 19/30 |
| 9 | 23/30 | 22/30 |
Rewrites fix specific issues but introduce new contradictions, because the model rewrites individual chapters without full story context. In this experiment, the reviewer worked better as a quality gate (accept or regenerate from scratch) than as an editor. A later controlled test found that the gate did not change best-of-two selection outcomes, so we removed it from production.
22/30 is Kimi K2's quality ceiling for this class of story. The bottleneck is the model's creative writing capability, not the generation architecture, context management, or sampling parameters. Better stories will come from better models, not better prompts.
What We Ship
About 15% of initial generations fail catastrophically — token repetition loops, quality collapse, hallucinatory word salad. All re-runs succeed cleanly. These failures are random (bad generation luck, possible API instability), not systematic. Handle them with retry, not prompt engineering.
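Because the failures are random, the fix is a plain retry loop around a cheap catastrophic-failure check. A sketch; the detector thresholds are illustrative and simplified relative to the per-chapter gates listed below:

```python
# Retry wrapper with a crude catastrophic-failure detector. Thresholds
# are illustrative; production gates also check language purity.
def looks_catastrophic(story: str, min_length: int = 500) -> bool:
    if len(story) < min_length:          # truncation or empty output
        return True
    # Token-repetition loop: a short window repeated 20+ times in a row.
    for size in (2, 3, 4):
        span = size * 20
        for i in range(len(story) - span + 1):
            if story[i : i + span] == story[i : i + size] * 20:
                return True
    return False

def generate_with_retry(generate, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        story = generate()               # e.g. lambda wrapping the API call
        if not looks_catastrophic(story):
            return story                 # ~85% pass on the first attempt
    raise RuntimeError("all attempts failed catastrophically")
```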
Our production decisions:
- Model: Kimi K2 via Moonshot direct API (faster and cheaper than OpenRouter)
- Prompts: Pedagogical for HSK 1–3, no-vocab for HSK 4–9
- No vocabulary list injected at any level
- Per-chapter quality gates catch repetition loops, truncation, and language-purity violations
- Programmatic vocabulary and error-rate checks drive release validation; optional best-of-two experiments compare candidates by error rate rather than the AI reviewer score, which we removed
- A ~15% retry rate is built into the pipeline as expected behavior, not a bug
The entire benchmark ran for three weeks and tested more approaches than we expected to need. Most of the findings were counterintuitive: the word list hurt, parameters didn't matter, more structure made stories worse, and rewrites introduced new contradictions. The simplest approach (let the model write freely, check the result, throw it away if it's bad) consistently outperformed every sophisticated alternative we tested.
Read stories generated by this pipeline: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9
Related guides: