Chinese LLM Benchmark: 7 Models Tested Writing HSK Stories

We benchmarked Chinese LLMs writing graded HSK stories two ways — a writing-quality round and a vocabulary-control round. Here's what won, what broke, and why the production model later changed.

Anthony · March 16, 2026 · 14 min read

We needed an AI that could write graded Chinese fiction — stories constrained to a specific HSK vocabulary level that still read like real literature. We assumed the hard part would be vocabulary control. We were wrong about almost everything.

The model with the tightest vocabulary compliance scored 10.5/30 on writing quality. The one that broke level most often scored 26.8 — including a perfect 30/30 that the reviewer said "reads like real Chinese children's literature." And the 4.5% vocabulary error rate we kept trying to fix? It turns out that's close to what second language acquisition research says is optimal for learning.

This article covers the full dataset. For how these findings shaped our production pipeline, see How We Built AI-Generated Graded Chinese Stories.

The Test

Seven Chinese LLMs generated multi-chapter stories at HSK 1, 3, 5, and 7 using identical outlines — same plot, same characters, same chapter structure. The only variable was the model.

We measured four things:

  • Writing quality: 30-point rubric (characters, plot, pacing, naturalness, engagement, vocabulary — 5 points each), scored by Claude Sonnet
  • Vocabulary compliance: percentage of above-level words, measured by jieba segmentation against the HSK 3.0 word list
  • Length control: actual vs. target character count
  • Generation speed: wall-clock time per story
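
As a rough illustration of the vocabulary-compliance and length-control measurements, the sketch below assumes the story has already been segmented into words (for example with `jieba.lcut(text)`) and that the level's word list is available as a set. The function names, the tiny stand-in word list, and the rule of skipping punctuation are illustrative assumptions, not the benchmark's actual code.

```python
import re

# Matches tokens made up entirely of CJK ideographs; punctuation, digits and
# Latin text are skipped so they do not count as above-level words.
CJK_WORD = re.compile(r"^[\u4e00-\u9fff]+$")


def above_level_rate(tokens, allowed_words):
    """Share of Chinese word tokens that are not in the level's word list."""
    words = [t for t in tokens if CJK_WORD.match(t)]
    if not words:
        return 0.0
    above = sum(1 for w in words if w not in allowed_words)
    return above / len(words)


def length_deviation(text, target_chars):
    """Signed relative deviation of the CJK character count from the target."""
    actual = len(re.findall(r"[\u4e00-\u9fff]", text))
    return (actual - target_chars) / target_chars


# Toy input: tokens as a segmenter might return them (hypothetical output).
tokens = ["我", "喜欢", "公园", "，", "我", "的", "玩具", "不见", "了", "。"]
allowed = {"我", "喜欢", "的", "不见", "了", "公园"}  # stand-in word list

rate = above_level_rate(tokens, allowed)  # 1 of 8 words is above level
deviation = length_deviation("我喜欢公园，我的玩具不见了。", target_chars=10)
```

Segmentation choices matter here: a segmenter that splits a compound differently will change which tokens count as above level, so rates are only comparable when every story goes through the same segmenter and word list.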

Test stories ranged from a child losing a toy at the park (HSK 1, 2,000 characters) to a retired calligrapher finding a letter in an antique inkstone that reveals his teacher's secret (HSK 7, 8,000 characters).

Three Models That Couldn't Write Fiction

Before comparing quality, three models failed to produce usable output.

Kimi K2.5 (reasoning model) leaked thinking tokens into story content. In 35% of chapters, English planning text appeared mid-paragraph — "The user wants me to continue the story..." Its clean chapters scored ~17/30 versus K2's 26.8. It only supports temperature=1. Reasoning models are built to show their work — exactly what you don't want in fiction.

GLM-4.7-Flash dumped English analysis blocks into the narrative, then degenerated into token repetition loops — one story repeated "地方, 地方, 地方..." for 10,000+ characters.

Step-3.5-Flash produced 13,412 characters of Chinese chain-of-thought reasoning ("好的,我现在需要处理这个写作任务...") where the first chapter should have been.

All three share a pattern: models optimized for reasoning struggle with sustained creative output. They plan instead of writing.
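
Both failure modes described above can be screened for with simple pattern checks. The following is a minimal sketch under assumed thresholds (three consecutive Latin-script words, five back-to-back repeats of a short phrase); the names and cut-offs are hypothetical and would need tuning against real output.

```python
import re

# A Latin word of 3+ letters followed by at least two more Latin words,
# e.g. "The user wants". A lone brand name such as "iPhone" does not match.
ENGLISH_RUN = re.compile(r"[A-Za-z]{3,}(?:\s+[A-Za-z]+){2,}")

# A phrase of 2-8 characters repeated back to back at least five times,
# optionally separated by commas, enumeration marks, full stops or spaces.
REPETITION = re.compile(r"(.{2,8}?)(?:[,，、。\s]*\1){4,}", re.DOTALL)


def has_english_leak(chapter: str) -> bool:
    """True if the chapter contains a run of English planning-style text."""
    return ENGLISH_RUN.search(chapter) is not None


def has_repetition_loop(chapter: str) -> bool:
    """True if a short phrase repeats five or more times in a row."""
    return REPETITION.search(chapter) is not None


leaked = "他走进房间。The user wants me to continue the story. 他坐下了。"
looping = "他去了很多地方, 地方, 地方, 地方, 地方, 地方"
clean = "小明在公园里玩。他用 iPhone 拍了一张照片。"
```

These checks catch the English-leak and repetition cases; Chinese chain-of-thought of the Step-3.5-Flash kind would need a different signal, since it is valid Chinese text.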

The Scoreboard

Writing quality (out of 30):

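| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 | Average |
|---|---|---|---|---|---|
| Kimi K2 | 30 | 21 | 29 | 27 | 26.8 |
| DeepSeek V3.2 | 24 | 28 | 25 | 27 | 26.0 |
| Doubao Seed 2.0 Pro | 22 | 25 | 24 | 23 | 23.5 |
| Doubao Seed 2.0 Mini | 15 | 17 | 20 | 19 | 17.8 |
| Gemini 2.5 Flash | 18 | 17 | 12 | 6 | 13.3 |
| Qwen 3.5 Plus | 7 | 14 | 12 | 9 | 10.5 |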

Kimi K2 and DeepSeek V3.2 are the clear top tier. Gemini collapses at higher levels (6/30 at HSK 7), while Qwen is weak at every level and bottoms out at 7/30 on HSK 1.

Vocabulary compliance (above-level + unknown word rate):

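| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 | Average |
|---|---|---|---|---|---|
| Qwen 3.5 Plus | 1.8% | 1.2% | 0.6% | 0.2% | 0.9% |
| DeepSeek V3.2 | 6.6% | 3.5% | 2.0% | 0.6% | 3.2% |
| Kimi K2 | 10.1% | 3.4% | 3.5% | 1.2% | 4.5% |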

Vocabulary compliance and writing quality are inversely correlated. Qwen achieves 0.9% errors by writing below level — simple, repetitive, flat. Kimi writes naturally and reaches for the right word even when it's above level.
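
To put a number on that relationship, the averages from the two tables can be fed into a Pearson correlation. This is a back-of-the-envelope check, not part of the original analysis, and with only three models that have both measurements it indicates direction rather than statistical strength.

```python
from math import sqrt

# Averages from the two tables above: (above-level rate in %, quality out of 30).
models = {
    "Qwen 3.5 Plus": (0.9, 10.5),
    "DeepSeek V3.2": (3.2, 26.0),
    "Kimi K2": (4.5, 26.8),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


error_rates = [v[0] for v in models.values()]
quality = [v[1] for v in models.values()]

# Error rate and quality rise together (r is roughly 0.95), which is the
# same thing as compliance and quality moving in opposite directions.
r = pearson(error_rates, quality)
```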

Generation speed:

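| Model | HSK 1 | HSK 3 | HSK 5 | HSK 7 |
|---|---|---|---|---|
| Gemini 2.5 Flash | 16s | 66s | 67s | 162s |
| Kimi K2 | 41s | 88s | 197s | 313s |
| DeepSeek V3.2 | 113s | 305s | 516s | 486s |
| Doubao Seed 2.0 Pro | 225s | 278s | 366s | 793s |
| Qwen 3.5 Plus | 900s | 728s | 987s | 1006s |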

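Kimi K2 generates a full HSK 7 story in ~5 minutes. Qwen takes about 17 minutes for lower quality. Speed does not trade off against quality here: the top scorer is the second-fastest model and the slowest model scores lowest. Gemini is the exception, the fastest of all but one of the weakest writers.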

Why 4.5% "Errors" Are Close to Optimal

Those vocabulary error rates look like failures. They're not.

Second language acquisition research consistently finds that 2–3% unknown words is the optimal learning zone. At 98% known words, readers comprehend well enough to acquire new vocabulary through context. At 95%, comprehension is rough but manageable. Below 90%, it collapses entirely (Hu & Nation 2000, Schmitt et al. 2011).

Each unknown word needs 8–12 encounters in varied contexts for acquisition (Waring & Takaki 2003). Narrative text gets a comprehension bonus over expository text — readers tolerate more unknowns in stories because plot context aids guessing.

Kimi K2's 4.5% error rate drops to 1.2–3.5% at HSK 5+, right in the sweet spot. Learners encounter enough unknown words to acquire them naturally while maintaining the 95%+ comprehension needed to enjoy the story.

Qwen's 0.9% error rate — the "best" compliance — means learners encounter almost nothing new. Perfect vocabulary control produces a ceiling, not a floor. We were trying to fix something that was already helping learners.
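
The thresholds cited above can be written down as a small lookup. This is a sketch of the reasoning, not a validated model of comprehension; the label for the 90–95% band is an assumption, since the text only describes the 98%, 95%, and below-90% points.

```python
def coverage_zone(unknown_pct: float) -> str:
    """Map an unknown-word rate (in percent) to the zones described above."""
    known_pct = 100.0 - unknown_pct
    if known_pct >= 98.0:
        return "acquisition zone"      # new words can be picked up from context
    if known_pct >= 95.0:
        return "rough but manageable"
    if known_pct >= 90.0:
        return "strained"              # assumed label for the unnamed band
    return "comprehension collapses"


# Average above-level rates from the compliance table.
zones = {
    "Qwen 3.5 Plus": coverage_zone(0.9),
    "DeepSeek V3.2": coverage_zone(3.2),
    "Kimi K2": coverage_zone(4.5),
    "Kimi K2 at HSK 7": coverage_zone(1.2),
}
```

Note that above-level words are only a proxy for words a given learner does not know, so the mapping is indicative at best.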

The Word List Trap

Our initial benchmark injected the full HSK word list into every prompt. At HSK 7, that's roughly 50,000 characters (~25,000 tokens) of vocabulary before the model starts writing. We assumed this would help models stay within level.

It did the opposite.

| Model | HSK Level | With List | Without List | Delta |
|---|---|---|---|---|
| Kimi K2 | HSK 3 | 21/30 | 27/30 | +6 |
| Doubao Pro | HSK 7 | 23/30 | 30/30 | +7 |
| Qwen 3.5 Plus | HSK 7 | 9/30 | 26/30 | +17 |

Qwen jumped from 9/30 to 26/30. Doubao Pro hit a perfect score — the reviewer called it "the most emotionally powerful story" in the entire benchmark. Showing a model words it shouldn't use activates those words in its attention while degrading creative output.

The vocab list also hurt compliance. In a separate experiment, adding the word list to pedagogical prompts nearly doubled above-level violations at HSK 1 (from 35 to 66 unique above-level words).

One exception: DeepSeek V3.2's HSK 3 story dropped from 28/30 to 23/30 without the list. For that model at lower levels, the constraint provided useful focus. Every other model improved.

8 Prompt Strategies, One Crossover

We tested eight prompt variants across all nine HSK levels:

  1. Baseline — full vocab list in prompt
  2. No-vocab — "write using HSK N vocabulary," no word list
  3. Pedagogical — level-specific structural rules (HSK 1: SVO only, explicit subjects; HSK 3: complex sentences, pacing)
  4. Pedagogical + vocab — structural rules plus word list injection
  5. Craft-detailed — 40% dialogue ratio, sensory details, pacing beats
  6. Author-identity — "You are a Chinese author" framing
  7. Mandarin — system and user prompts entirely in Chinese
  8. Mandarin-pedagogical — Chinese prompts with structural rules

Three were eliminated early: author-identity (worst compliance, no quality gain), craft-detailed (highest quality ceiling, but a 9.6% error rate at HSK 4 and an 11% catastrophic failure rate), and pedagogical + vocab (adding the word list to pedagogical rules hurt compliance at every level).

The clear finding: the optimal strategy changes at HSK 4.

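| HSK | Best Strategy | Error Rate |
|---|---|---|
| 1 | Pedagogical | 3.1% |
| 2 | Pedagogical | 6.2% |
| 3 | Pedagogical | 6.6% |
| 4 | No-vocab | 7.3% |
| 5 | No-vocab | 3.6% |
| 6 | No-vocab | 2.8% |
| 7 | No-vocab | 1.3% |
| 8 | No-vocab | 1.0% |
| 9 | No-vocab | 1.4% |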

Below HSK 4, pedagogical rules reduce above-level words by 2–5x. Rules like "use only SVO sentence structure" and "state subjects explicitly" genuinely constrain vocabulary. From HSK 4 up, the vocabulary pool (1,978+ words) is rich enough that models naturally stay within level without constraints.
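
In code, the crossover reduces to a one-line rule. The sketch below encodes the best-strategy table and checks the rule against it; the function and variable names are illustrative.

```python
# Best strategy and its error rate (%) per HSK level, copied from the table.
BEST_STRATEGY = {
    1: ("pedagogical", 3.1),
    2: ("pedagogical", 6.2),
    3: ("pedagogical", 6.6),
    4: ("no-vocab", 7.3),
    5: ("no-vocab", 3.6),
    6: ("no-vocab", 2.8),
    7: ("no-vocab", 1.3),
    8: ("no-vocab", 1.0),
    9: ("no-vocab", 1.4),
}


def strategy_for(hsk_level: int) -> str:
    """Pick the prompt strategy: structural rules below HSK 4, none from HSK 4 up."""
    if not 1 <= hsk_level <= 9:
        raise ValueError(f"HSK level must be 1-9, got {hsk_level}")
    return "pedagogical" if hsk_level <= 3 else "no-vocab"


# The rule reproduces every row of the table.
rule_matches_table = all(
    strategy_for(level) == name for level, (name, _) in BEST_STRATEGY.items()
)
```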

One more finding: HSK 7, 8, and 9 are indistinguishable by prompts. All three share the same vocabulary pool (10,896 words). The scoring model confirmed it can't distinguish literary levels between them. Only story concept complexity and target length differentiate these levels.

Every Approach Scores 22/30

After the quality round pointed to Kimi K2, we tried everything to push quality higher. Eight architectural approaches, all tested on the same HSK 5 story concept:

| Approach | Score |
|---|---|
| Baseline (1-chapter context) | 22/30 |
| Full text context (all prior chapters) | 22/30 |
| AI reviewer feedback injected into generation | 22/30 |
| One-shot full-story generation | 22/30 |
| Structured state tracking (SCORE framework) | 22/30 |
| Combined (full context + enhanced outline) | 21/30 |
| Enhanced outline (Harmon Story Circle + timeline + character knowledge) | 17/30 |

Every approach except one scored 21–22. The outlier was enhanced outlines: adding Harmon Story Circle structure, timeline tracking, character knowledge states, and setup/payoff pairs hurt quality by 5 points. More constraints gave the model more opportunities to violate them.

We also tested 9 parameter configurations: temperature from 0.6 to 1.0, frequency penalties from 0 to 0.5, presence penalties, top-p variations, and the official recommended settings from Kimi, Qwen, and Doubao. Every configuration scored 22–23/30 with the same per-criterion breakdown: 4/4/3/4/4/4. Pacing and plot coherence are always the weak points. Within the ranges we tested, sampling parameters made no measurable difference.

Finally, we tested a two-pass rewrite approach: Claude Sonnet reviews stories and identifies issues, then Kimi K2 rewrites flagged chapters.

| HSK | Original | After Rewrite |
|---|---|---|
| 2 | 22/30 | 21/30 |
| 5 | 22/30 | 22/30 |
| 7 | 19/30 | 19/30 |
| 9 | 23/30 | 22/30 |

Rewrites fix specific issues but introduce new contradictions — the model rewrites individual chapters without full story context. In this experiment, the reviewer worked better as a quality gate (accept or regenerate from scratch) than as an editor. A later controlled test found that gate did not change best-of-two selection outcomes, so we removed it from production.

22/30 is Kimi K2's quality ceiling for this class of story. The bottleneck is the model's creative writing capability, not the generation architecture, context management, or sampling parameters. On the quality axis, better stories come from better models, not better prompts. On the vocabulary-control axis, the opposite turned out to be true — as the second round showed.

The Second Round: Controlling Vocabulary at Scale

The quality benchmark answered "which model writes the best story." It didn't answer the question that actually decides whether a story is usable as a graded reader: what fraction of its vocabulary sits above the learner's level. So we ran a second, larger benchmark with nine models: five carried over from the first round (Kimi K2, DeepSeek V3.2, Qwen 3.5 Plus, Doubao Seed 2.0 Pro, and StepFun's Step-3.5-Flash) plus four new ones (two MiniMax versions, a newer Qwen, and GLM-5). They were scored not on a 30-point rubric but on type error rate: the percentage of a story's unique words that are above the target HSK level. (This is why you'll see us cite both "7 models" and "9 models": two different rounds with two different scoring methods.)

Three findings reorganized everything.

Concept difficulty dominates model choice. On a hard concept — a cooking scene at HSK 1, where 锅, 炒, 厨房, 盐, and 油 are all above level — all nine models landed between 30% and 68% type error. The gap between the best and worst model was smaller than the gap between an easy and a hard concept for the same model. Choosing the right story idea matters more than choosing the right model.

Most "findings" are noise. We measured run-to-run variance directly: same model, same prompt, same concept, five runs. Standard deviation was about 4.7 points, with a 12-point range. Any prompt tweak showing less than ~7 points of improvement is indistinguishable from luck — and several earlier prompt "wins" duly evaporated when we re-ran them on a different provider.
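
The text does not say how the ~7-point cut-off was derived. One reading that lands close to it: when two single runs are compared, the standard deviation of their difference is sqrt(2) times the per-run standard deviation. The sketch below works through that arithmetic; treating this as the threshold is an assumption, and it is a coarse screen rather than a significance test.

```python
from math import sqrt

RUN_SD = 4.7  # reported run-to-run standard deviation, in percentage points


def noise_floor(run_sd: float) -> float:
    """SD of the difference between two independent single runs."""
    return sqrt(2) * run_sd


def looks_real(before: float, after: float, run_sd: float = RUN_SD) -> bool:
    """True if a drop in type error exceeds the single-run noise floor."""
    return (before - after) > noise_floor(run_sd)


floor = noise_floor(RUN_SD)  # about 6.6 points, close to the ~7 quoted above

big_change = looks_real(32.0, 5.5)     # a 26.5-point drop clears the floor
small_change = looks_real(32.0, 28.0)  # a 4-point drop does not
```

Averaging several runs per condition shrinks the floor by the square root of the run count, which is the cheaper route to detecting smaller effects.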

One prompt change beat every model change. The real breakthrough wasn't a model. It was an instruction: "the reader is at HSK 3, but write the prose using only HSK 1–2 vocabulary — like a graded reader, not literary fiction." That cut type error by 25–27 points at HSK 3 (from ~32% down to 4–7% on controlled concepts). It works because it attacks the model's default assumption that prose sophistication should rise to match the reader's level. This one instruction does more for vocabulary control than the entire model-selection process did.
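
As a sketch of how such an instruction might be templated, the helper below takes the reader level and the vocabulary band as separate inputs, so it makes no claim about which band suits levels other than the HSK 3 case quoted above. The function name and validation rule are hypothetical, and the wording is adapted from the quoted instruction.

```python
def prose_cap_instruction(reader_level: int, cap_low: int, cap_high: int) -> str:
    """Build a prose-cap instruction; the cap band must sit below the reader level."""
    if not 1 <= cap_low <= cap_high < reader_level <= 9:
        raise ValueError("cap band must be within HSK 1-9 and below the reader level")
    band = f"HSK {cap_low}" if cap_low == cap_high else f"HSK {cap_low}–{cap_high}"
    return (
        f"The reader is at HSK {reader_level}, but write the prose using only "
        f"{band} vocabulary, like a graded reader, not literary fiction."
    )


# The HSK 3 case described in the text.
hsk3_instruction = prose_cap_instruction(reader_level=3, cap_low=1, cap_high=2)
```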

The honest caveat: those deltas are measured on clean, controlled concepts. Real production themes push the numbers higher — see the live per-level figures below.

What We Ship

About 15% of initial generations fail catastrophically — token repetition loops, quality collapse, hallucinatory word salad. All re-runs succeed cleanly. These failures are random (bad generation luck, possible API instability), not systematic. Handle them with retry, not prompt engineering.

Our production decisions:

  • Prompts: pedagogical and prose-cap rules for HSK 1–3, looser "write at HSK N" prompts for HSK 4+
  • No vocabulary list injected at any level — it hurts both quality and compliance
  • Per-chapter quality gates catch repetition loops, truncation, and language purity
  • Programmatic vocabulary and error-rate checks drive release validation; the AI reviewer/editor pass is gone — it changed best-of-two outcomes 0 of 5 times
  • ~15% retry rate is built into the pipeline as expected, not a bug
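
The gate-and-retry loop can be sketched as follows. The two gates shown (a truncation check and a language-purity check), the 90% threshold, and the three-attempt limit are illustrative assumptions rather than the production settings, and `generate` stands in for whatever call produces a chapter.

```python
def ends_cleanly(chapter: str) -> bool:
    """Truncation gate: the chapter should end on sentence-final punctuation."""
    return chapter.rstrip().endswith(("。", "！", "？", "”"))


def mostly_chinese(chapter: str, min_share: float = 0.9) -> bool:
    """Language-purity gate: share of CJK ideographs among all letters."""
    letters = [c for c in chapter if c.isalpha()]
    if not letters:
        return False
    cjk = sum(1 for c in letters if "\u4e00" <= c <= "\u9fff")
    return cjk / len(letters) >= min_share


GATES = {"truncation": ends_cleanly, "language purity": mostly_chinese}


def generate_with_retry(generate, gates, max_attempts=3):
    """Call generate() until every gate passes; return (chapter, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        chapter = generate()
        failed = [name for name, passes in gates.items() if not passes(chapter)]
        if not failed:
            return chapter, attempt
    raise RuntimeError(f"all {max_attempts} attempts failed; last failures: {failed}")


# Stand-in generator: the first draft is cut off mid-sentence, the second is fine.
drafts = iter(["小明走进公园，他看见", "小明走进公园。他找到了玩具。"])
chapter, attempts = generate_with_retry(lambda: next(drafts), GATES)
```

Regenerating from scratch, rather than patching the failed draft, matches the finding that rewrites tend to introduce new contradictions.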

On the model itself: it changed, and not by choice. The quality round pointed to Kimi K2, and we ran it in production for months. Then, on 2026-05-20, Moonshot deprecated that exact model version with no notice — generation started failing mid-run with 404 Not Found. We benchmarked the leading replacements (and compatibility-checked the remaining drop-in options), and none matched the original on vocabulary control: the old Kimi K2 hit 7.4% type error at HSK 3, while the closest replacement sat around 20–25%. We now route DeepSeek V4 Pro for HSK 1–3 and 5–9, and Kimi K2.6 for HSK 4, where a paired retest showed it produces better-balanced chapter lengths. Most of our existing library was generated on the original Kimi K2 before it was retired; new stories use the current routing.
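
The routing described above amounts to a small lookup. In this sketch the model strings are placeholder labels, not real API model identifiers, and the settings come from the replacement-screening notes in the appendix (thinking mode disabled for DeepSeek V4 Pro, temperature 0.6 for Kimi K2.6); anything else is left unspecified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str                       # placeholder label, not a real API model id
    temperature: Optional[float]     # None means "not specified here"
    disable_thinking: Optional[bool]  # None means "not specified here"


DEEPSEEK_V4_PRO = Route("deepseek-v4-pro", temperature=None, disable_thinking=True)
KIMI_K2_6 = Route("kimi-k2.6", temperature=0.6, disable_thinking=None)


def route_for(hsk_level: int) -> Route:
    """HSK 4 goes to Kimi K2.6; HSK 1-3 and 5-9 go to DeepSeek V4 Pro."""
    if not 1 <= hsk_level <= 9:
        raise ValueError(f"HSK level must be 1-9, got {hsk_level}")
    return KIMI_K2_6 if hsk_level == 4 else DEEPSEEK_V4_PRO


routes = {level: route_for(level).model for level in range(1, 10)}
```

Keeping the routing in one declarative place makes a forced model swap, like the one described above, a one-line change.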

That is the part of building on frontier models nobody warns you about: your best-performing component can disappear between one model generation and the next. The published config is the only source of truth for what is actually running today.

What this looks like in production

Controlled-concept numbers are the ceiling. Here is the live picture across the published library — token coverage (share of running words a learner already knows) and type error rate (share of unique words above level), measured the same way for every story:

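| Level | Stories | Token coverage | Type error |
|---|---|---|---|
| HSK 1 | 26 | 77.3% | 44.8% |
| HSK 2 | 23 | 81.6% | 36.5% |
| HSK 3 | 14 | 82.1% | 35.5% |
| HSK 4 | 33 | 91.7% | 18.9% |
| HSK 5 | 7 | 86.4% | 26.7% |
| HSK 6 | 8 | 88.9% | 21.8% |
| HSK 7 | 21 | 99.1% | 2.6% |
| HSK 8 | 7 | 98.1% | 3.8% |
| HSK 9 | 8 | 98.5% | 3.3% |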

HSK 7–9 are genuine independent-reading material (98%+ coverage). HSK 1–3 can't be — a 300–988 word pool can't carry a coherent story without reaching above level — so those stories are graded aspirationally and lean on full pinyin annotation, exactly as Mandarin Companion and Chinese Breeze do at beginner levels. The full per-story breakdown and the reasoning behind it are in How We Built AI-Generated Graded Chinese Stories.
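
The two metrics can diverge sharply on the same story, which is why HSK 1 shows 77.3% token coverage alongside 44.8% type error. A minimal sketch of both definitions, assuming the story is already segmented into word tokens and using a toy known-word set:

```python
def token_coverage(tokens, known_words):
    """Share of running words (tokens) that the learner already knows."""
    return sum(1 for t in tokens if t in known_words) / len(tokens)


def type_error_rate(tokens, known_words):
    """Share of unique words (types) that are above the learner's level."""
    types = set(tokens)
    return sum(1 for t in types if t not in known_words) / len(types)


# Toy story: known words repeat often, above-level words appear once each.
tokens = ["我", "我", "我", "吃", "吃", "饭", "锅", "盐"]
known = {"我", "吃", "饭"}

coverage = token_coverage(tokens, known)     # 6 of 8 tokens are known
type_error = type_error_rate(tokens, known)  # 2 of 5 unique words are above level
```

Because common words repeat, a handful of one-off above-level words barely dents token coverage while weighing heavily on type error.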

The two benchmarks together ran for weeks and tested more approaches than we expected to need. Most of the findings were counterintuitive: the word list hurt, parameters didn't matter, more structure made stories worse, rewrites made things worse, and the model we crowned got deprecated out from under us. The simplest approach — let the model write freely under a prose cap, check the result, throw it away if it's bad — consistently outperformed every sophisticated alternative we tested.

Appendix: Every Model We Tried

For completeness, here is every model we touched. The headline "7 models" is the original writing-quality benchmark (the six scored below plus Kimi K2.5); the tables here go wider, adding the other reasoning models that failed before scoring, the larger vocabulary-control round, and the replacements we screened after Kimi K2 was deprecated. Several models appear more than once, scored differently each time.

Round 1 — writing quality (30-point rubric, averaged across HSK 1/3/5/7):

| Model | Quality | Verdict |
|---|---|---|
| Kimi K2 (original) | 26.8/30 | Winner on quality; later deprecated |
| DeepSeek V3.2 | 26.0/30 | Strong all-rounder |
| Doubao Seed 2.0 Pro | 23.5/30 | Mid-tier |
| Doubao Seed 2.0 Mini | 17.8/30 | Weak |
| Gemini 2.5 Flash | 13.3/30 | Collapsed at higher levels |
| Qwen 3.5 Plus | 10.5/30 | Tightest vocab control, flattest prose |
| Kimi K2.5 | n/a | Failed: leaked thinking tokens into prose |
| GLM-4.7-Flash | n/a | Failed: degenerated into repetition loops |
| Step-3.5-Flash | n/a | Failed: dumped chain-of-thought into the story |

Round 2 — vocabulary control (type error rate on a hard HSK 1 concept; these numbers are a deliberately punishing screening test, not production rates):

| Model | Type error | Verdict |
|---|---|---|
| Qwen 3.5 Plus | 30.9% | Best compliance, repetitive prose |
| GLM-5 | 33.0% | Ignored the prompt's concept |
| DeepSeek V3.2 | 36.3% | Best compliance/quality balance |
| Kimi K2 | 50.0% | Best storytelling, loosest vocab |
| Qwen 3.6 Plus | 52.6% | Strict downgrade from 3.5 Plus |
| MiniMax M2.5 | 55.3% | Non-Chinese character names |
| Step-3.5-Flash | 55.7% | English leaks mid-story |
| MiniMax M2.7 | 58.8% | Poor compliance |
| Doubao Seed 2.0 Pro | 67.6% | Worst of the round |

After the deprecation — replacement screening (not part of the original benchmark):

| Model | Outcome | Notes |
|---|---|---|
| DeepSeek V4 Pro | Adopted (HSK 1–3, 5–9) | Reliable once thinking mode is disabled |
| Kimi K2.6 | Adopted (HSK 4) | Best chapter-length balance; needs temp 0.6 |
| DeepSeek V4 Flash | Not adopted | Produced the single best screening story, but average numbers |
| Qwen 3.6-27b | Eliminated | Characters referred to themselves by name in dialogue |
| kimi-latest | Rejected | Fast but ~36% type error |
| moonshot-v1-32k | Compatibility-checked only | Confirmed it runs as a drop-in, but not benchmarked for quality |

Read stories generated by this pipeline: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9

Frequently Asked Questions

What is the best AI model for writing Chinese?

It depends on what you measure. In our writing-quality round, Kimi K2 scored highest (26.8/30); Qwen 3.5 Plus had the tightest vocabulary control (0.9% above-level words) but the worst prose (10.5/30). The two are inversely correlated. The biggest single lever turned out to be not the model but one prompt instruction: a prose cap that tells the model to write below the reader's level. It cut above-level word rates by 25+ points across models.

Can AI write graded Chinese stories?

Yes, with trade-offs. Models that strictly follow vocabulary constraints write stilted, textbook-like prose; models given freedom write naturally but use too many above-level words. Our production pipeline generates with a creative model, then validates vocabulary compliance programmatically. We tested an AI reviewer/editor pass and removed it — it changed selection outcomes 0 out of 5 times.

How accurate is AI-generated Chinese text?

It varies sharply by level. At HSK 7–9 our published stories reach 98–99% token coverage, which is genuine independent-reading quality. At HSK 1–3 the 300–988 word vocabulary pool makes strict compliance impossible, so coverage sits around 77–82% and the stories are graded "aspirationally," relying on full pinyin annotation, the same approach Mandarin Companion and Chinese Breeze take at beginner levels.