What is the best AI model for writing Chinese?

In our benchmark of 7 Chinese LLMs, Kimi K2 scored highest for overall writing quality (26.8/30), while Qwen 3.5 Plus had the tightest vocabulary control. The best stories came from models that broke level constraints slightly: a 4.5% above-level word rate is close to what language acquisition research considers optimal for learning.

Can AI write graded Chinese stories?

Yes, but with trade-offs. AI models that strictly follow vocabulary constraints write stilted, textbook-like prose. Models given more freedom write naturally but include too many above-level words. Our production pipeline uses a two-pass approach: generate with a creative model, then audit vocabulary compliance programmatically.

How accurate is AI-generated Chinese text?

Vocabulary compliance ranged from 91% to 99% across 7 models and 8 prompt strategies. The biggest issue is not vocabulary errors but naturalness — the model with 99% compliance scored only 10.5/30 on writing quality. Production content requires human review for tone, cultural accuracy, and narrative coherence.

Chinese LLM Benchmark: 7 Models Tested Writing HSK Stories

We needed an AI that could write graded Chinese fiction — stories constrained to a specific HSK vocabulary level that still read like real literature. We assumed the hard part would be vocabulary control. We were wrong about almost everything.

The model with the tightest vocabulary compliance scored 10.5/30 on writing quality. The one that broke level most often scored 26.8 — including a perfect 30/30 that the reviewer said "reads like real Chinese children's literature." And the 4.5% vocabulary error rate we kept trying to fix? It turns out that's close to what second language acquisition research says is optimal for learning.

This article covers the full dataset. For how these findings shaped our production pipeline, see How We Built AI-Generated Graded Chinese Stories.

The Test

Seven Chinese LLMs generated multi-chapter stories at HSK 1, 3, 5, and 7 using identical outlines — same plot, same characters, same chapter structure. The only variable was the model.

We measured four things:

Writing quality: 30-point rubric (characters, plot, pacing, naturalness, engagement, vocabulary — 5 points each), scored by Claude Sonnet
Vocabulary compliance: percentage of above-level words, measured by jieba segmentation against the HSK 3.0 word list
Length control: actual vs. target character count
Generation speed: wall-clock time per story
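As a rough illustration of the vocabulary-compliance metric, the sketch below computes the share of word tokens that fall outside an allowed word set. It assumes the text has already been segmented (for example with `jieba.lcut`). The function name, the punctuation filter, and the tiny word set are hypothetical stand-ins, not the benchmark's actual code or the real HSK 3.0 list.

```python
import unicodedata


def above_level_rate(tokens, allowed_words):
    """Return (rate, offenders): the share of word tokens not in allowed_words."""
    # Keep only tokens containing at least one letter-like character,
    # so punctuation and whitespace are not counted as words.
    words = [
        t for t in tokens
        if any(unicodedata.category(c).startswith("L") for c in t)
    ]
    if not words:
        return 0.0, []
    offenders = [w for w in words if w not in allowed_words]
    return len(offenders) / len(words), sorted(set(offenders))


# Hypothetical miniature word list. A real run would load the HSK 3.0 list
# for the target level and segment the story with something like jieba.lcut(text).
allowed = {"我", "在", "公园", "玩", "的", "玩具", "不见", "了"}
tokens = ["我", "在", "公园", "玩", "，", "我", "的", "玩具", "消失", "了", "。"]

rate, offenders = above_level_rate(tokens, allowed)
print(f"{rate:.1%} above level: {offenders}")
```

Counting tokens rather than unique words means a single repeated above-level word raises the rate each time it appears; whether the benchmark counts tokens or types is not stated, so treat that choice as an assumption too.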

Test stories ranged from a child losing a toy at the park (HSK 1, 2,000 characters) to a retired calligrapher finding a letter in an antique inkstone that reveals his teacher's secret (HSK 7, 8,000 characters).

Three Models That Couldn't Write Fiction

Before comparing quality, three models failed to produce usable output.

Kimi K2.5 (reasoning model) leaked thinking tokens into story content. In 35% of chapters, English planning text appeared mid-paragraph — "The user wants me to continue the story..." Its clean chapters scored ~17/30 versus K2's 26.8. It only supports temperature=1. Reasoning models are built to show their work — exactly what you don't want in fiction.

GLM-4.7-Flash dumped English analysis blocks into the narrative, then degenerated into token repetition loops — one story repeated "地方, 地方, 地方..." for 10,000+ characters.

Step-3.5-Flash produced 13,412 characters of Chinese chain-of-thought reasoning ("好的，我现在需要处理这个写作任务...") where the first chapter should have been.

All three share a pattern: models optimized for reasoning struggle with sustained creative output. They plan instead of writing.
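Two of these failure modes are easy to catch mechanically. The following is a minimal sketch, not the pipeline's actual checks: `latin_ratio` flags English text leaking into a Chinese chapter, and `has_repetition_loop` flags a short unit repeated many times in a row. Both function names and both thresholds are assumptions chosen for illustration.

```python
import re


def latin_ratio(text):
    """Fraction of non-whitespace characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isascii() and c.isalpha() for c in chars) / len(chars)


def has_repetition_loop(text, max_unit=8, min_repeats=10):
    """True if a unit of 1..max_unit characters repeats min_repeats+ times in a row."""
    pattern = re.compile(
        r"(.{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1), re.DOTALL
    )
    return pattern.search(text) is not None


clean = "小明在公园里找他的玩具。他找了很久，终于在树下找到了。"
leaked = "小明走进公园。The user wants me to continue the story..."
looping = "他去了一个" + "地方, " * 50

for name, sample in [("clean", clean), ("leaked", leaked), ("looping", looping)]:
    flagged = latin_ratio(sample) > 0.05 or has_repetition_loop(sample)
    print(name, "flagged" if flagged else "ok")
```

A Latin-letter check only catches English leakage. Chinese chain-of-thought like the Step-3.5-Flash output would pass it and needs a different signal, such as matching planning phrases.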

The Scoreboard

Writing quality (out of 30):

Model	HSK 1	HSK 3	HSK 5	HSK 7	Average
Kimi K2	30	21	29	27	26.8
DeepSeek V3.2	24	28	25	27	26.0
Doubao Seed 2.0 Pro	22	25	24	23	23.5
Doubao Seed 2.0 Mini	15	17	20	19	17.8
Gemini 2.5 Flash	18	17	12	6	13.3
Qwen 3.5 Plus	7	14	12	9	10.5

Kimi K2 and DeepSeek V3.2 are the clear top tier. Gemini degrades as the level rises, from 18/30 at HSK 1 to 6/30 at HSK 7. Qwen is weak throughout, with a low of 7/30 at HSK 1 and no score above 14/30.

Vocabulary compliance (above-level + unknown word rate):

Model	HSK 1	HSK 3	HSK 5	HSK 7	Average
Qwen 3.5 Plus	1.8%	1.2%	0.6%	0.2%	0.9%
DeepSeek V3.2	6.6%	3.5%	2.0%	0.6%	3.2%
Kimi K2	10.1%	3.4%	3.5%	1.2%	4.5%

Across these three models, vocabulary compliance and writing quality are inversely correlated. Qwen achieves 0.9% errors by writing below level: simple, repetitive, flat. Kimi writes naturally and reaches for the right word even when it's above level.
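To make the relationship concrete, here is a small sketch that computes the Pearson correlation between average above-level rate and average writing score for the three models that appear in both tables. The figures are copied from the tables; the helper function is ours, and three data points are far too few for anything beyond a descriptive check.

```python
from math import sqrt

# (average above-level rate in %, average writing score out of 30)
models = {
    "Qwen 3.5 Plus": (0.9, 10.5),
    "DeepSeek V3.2": (3.2, 26.0),
    "Kimi K2": (4.5, 26.8),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


error_rates = [rate for rate, _ in models.values()]
scores = [score for _, score in models.values()]
r = pearson(error_rates, scores)
print(f"r(error rate, quality) = {r:.2f}")
```

A strongly positive r between error rate and score is the same thing as an inverse relationship between compliance and quality.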

Generation speed:

Model	HSK 1	HSK 3	HSK 5	HSK 7
Gemini 2.5 Flash	16s	66s	67s	162s
Kimi K2	41s	88s	197s	313s
DeepSeek V3.2	113s	305s	516s	486s
Doubao Seed 2.0 Pro	225s	278s	366s	793s
Qwen 3.5 Plus	900s	728s	987s	1006s

Kimi K2 generates a full HSK 7 story in ~5 minutes. Qwen takes about 17 minutes for lower quality. Speed does not have to cost quality: Kimi K2, the top scorer, is second fastest at every level, behind only Gemini 2.5 Flash, which scored far lower.

Why 4.5% "Errors" Are Close to Optimal

Those vocabulary error rates look like failures. They're not.

Second language acquisition research consistently finds that 2–3% unknown words is the optimal learning zone. At 98% known words, readers comprehend well enough to acquire new vocabulary through context. At 95%, comprehension is rough but manageable. Below 90%, it collapses entirely (Hu & Nation 2000, Schmitt et al. 2011).

Each unknown word needs 8–12 encounters in varied contexts for acquisition (Waring & Takaki 2003). Narrative text gets a comprehension bonus over expository text — readers tolerate more unknowns in stories because plot context aids guessing.

Kimi K2's 4.5% error rate drops to 1.2–3.5% at HSK 5+, right in the sweet spot. Learners encounter enough unknown words to acquire them naturally while maintaining the 95%+ comprehension needed to enjoy the story.
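The coverage thresholds cited above can be turned into a simple lookup. This is a sketch under stated assumptions: the cut points follow the 98% / 95% / 90% known-word figures in the text, while the function name and band labels are hypothetical. It also treats "above the HSK list" as "unknown to the reader", which will not hold for every learner.

```python
def coverage_band(unknown_pct):
    """Map an unknown-word percentage to an illustrative comprehension band."""
    if unknown_pct <= 2.0:
        return "comfortable"  # 98% or more of words known
    if unknown_pct <= 5.0:
        return "manageable"   # 95-98% known
    if unknown_pct <= 10.0:
        return "strained"     # 90-95% known
    return "collapsed"        # under 90% known


# Kimi K2's per-level above-level rates from the compliance table.
kimi_k2 = {1: 10.1, 3: 3.4, 5: 3.5, 7: 1.2}
for level, pct in kimi_k2.items():
    print(f"HSK {level}: {pct}% unknown -> {coverage_band(pct)}")
```

On these cut points the HSK 3 to 7 figures stay at or above 95% coverage, while the HSK 1 figure (10.1%) does not, which fits the text's focus on HSK 5+.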

Qwen's 0.9% error rate — the "best" compliance — means learners encounter almost nothing new. Perfect vocabulary control produces a ceiling, not a floor. We were trying to fix something that was already helping learners.

The Word List Trap

Our initial benchmark injected the full HSK word list into every prompt. At HSK 7, that's roughly 50,000 characters (~25,000 tokens) of vocabulary before the model starts writing. We assumed this would help models stay within level.

It did the opposite.

Model	HSK Level	With List	Without List	Delta
Kimi K2	HSK 3	21/30	27/30	+6
Doubao Pro	HSK 7	23/30	30/30	+7
Qwen 3.5+	HSK 7	9/30	26/30	+17

Qwen jumped from 9/30 to 26/30. Doubao Pro hit a perfect score — the reviewer called it "the most emotionally powerful story" in the entire benchmark. Showing a model words it shouldn't use activates those words in its attention while degrading creative output.

The vocab list also hurt compliance. In a separate experiment, adding the word list to pedagogical prompts nearly doubled above-level violations at HSK 1 (from 35 to 66 unique above-level words).

One exception: DeepSeek V3.2's HSK 3 story dropped from 28/30 to 23/30 without the list. For that model at lower levels, the constraint provided useful focus. Every other model improved.

8 Prompt Strategies, One Crossover

We tested eight prompt variants across all nine HSK levels:

Baseline — full vocab list in prompt
No-vocab — "write using HSK N vocabulary," no word list
Pedagogical — level-specific structural rules (HSK 1: SVO only, explicit subjects; HSK 3: complex sentences, pacing)
Pedagogical + vocab — structural rules plus word list injection
Craft-detailed — 40% dialogue ratio, sensory details, pacing beats
Author-identity — "You are a Chinese author" framing
Mandarin — system and user prompts entirely in Chinese
Mandarin-pedagogical — Chinese prompts with structural rules

Eliminated early: Author-identity (worst compliance, no quality gain). Craft-detailed (highest quality ceiling but 9.6% error rate at HSK 4 and 11% catastrophic failure rate). Pedagogical + vocab (adding the word list to pedagogical rules hurt compliance at every level).

The clear finding: the optimal strategy changes at HSK 4.

HSK	Best Strategy	Error Rate
1	Pedagogical	3.1%
2	Pedagogical	6.2%
3	Pedagogical	6.6%
4	No-vocab	7.3%
5	No-vocab	3.6%
6	No-vocab	2.8%
7	No-vocab	1.3%
8	No-vocab	1.0%
9	No-vocab	1.4%

Below HSK 4, pedagogical rules reduce above-level words by 2–5x. Rules like "use only SVO sentence structure" and "state subjects explicitly" genuinely constrain vocabulary. From HSK 4 upward, the vocabulary pool (1,978+ words) is rich enough that models naturally stay within level without constraints.
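The crossover can be expressed as a small prompt-selection routine. The sketch below is illustrative only: the rule strings paraphrase the examples given in the strategy list, the HSK 2 entry is a placeholder because its rules are not described, and `choose_strategy` and `build_prompt` are hypothetical names rather than the production code.

```python
# Hypothetical rule text. Only the HSK 1 and HSK 3 themes come from the article.
PEDAGOGICAL_RULES = {
    1: "Use only SVO sentence structure. State subjects explicitly.",
    2: "(level-specific structural rules for HSK 2 go here)",
    3: "Complex sentences are allowed. Keep the pacing steady.",
}


def choose_strategy(hsk_level):
    """Pedagogical prompts for HSK 1-3, no-vocab prompts for HSK 4-9."""
    if not 1 <= hsk_level <= 9:
        raise ValueError(f"HSK level must be 1-9, got {hsk_level}")
    return "pedagogical" if hsk_level <= 3 else "no-vocab"


def build_prompt(hsk_level, outline):
    """Assemble a prompt; no word list is injected under either strategy."""
    lines = [f"Write a story using HSK {hsk_level} vocabulary."]
    if choose_strategy(hsk_level) == "pedagogical":
        lines.append(PEDAGOGICAL_RULES[hsk_level])
    lines.append(f"Outline: {outline}")
    return "\n".join(lines)


print(build_prompt(1, "A child loses a toy at the park."))
print(build_prompt(7, "A calligrapher finds a letter in an inkstone."))
```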

One more finding: prompts alone cannot separate HSK 7, 8, and 9. All three share the same vocabulary pool (10,896 words), and the scoring model confirmed it can't distinguish literary levels between them. Only story concept complexity and target length differentiate these levels.

Every Approach Scores 22/30

After establishing Kimi K2 as our production model, we tried everything to push quality higher. Eight architectural approaches, all tested on the same HSK 5 story concept:

Approach	Score
Baseline (1-chapter context)	22/30
Full text context (all prior chapters)	22/30
AI reviewer feedback injected into generation	22/30
One-shot full-story generation	22/30
Structured state tracking (SCORE framework)	22/30
Combined (full context + enhanced outline)	21/30
Enhanced outline (Harmon Story Circle + timeline + character knowledge)	17/30

Seven of eight approaches scored 21–22. Enhanced outlines — with Harmon Story Circle structure, timeline tracking, character knowledge states, and setup/payoff pairs — hurt quality by 5 points. More constraints gave the model more opportunities to violate them.

We also tested 9 parameter configurations: temperature from 0.6 to 1.0, frequency penalties from 0 to 0.5, presence penalties, top-p variations, and the official recommended settings from Kimi, Qwen, and Doubao. Every configuration scored 22–23/30 with the same per-criterion breakdown: 4/4/3/4/4/4. Pacing and plot coherence are always the weak points. Within the ranges we tested, sampling parameters made no measurable difference.

Finally, we tested a two-pass rewrite approach: Claude Sonnet reviews stories and identifies issues, then Kimi K2 rewrites flagged chapters.

HSK	Original	After Rewrite
2	22/30	21/30
5	22/30	22/30
7	19/30	19/30
9	23/30	22/30

Rewrites fix specific issues but introduce new contradictions — the model rewrites individual chapters without full story context. In this experiment, the reviewer worked better as a quality gate (accept or regenerate from scratch) than as an editor. A later controlled test found that gate did not change best-of-two selection outcomes, so we removed it from production.

22/30 is Kimi K2's quality ceiling for this class of story. The bottleneck is the model's creative writing capability, not the generation architecture, context management, or sampling parameters. Better stories will come from better models, not better prompts.

What We Ship

About 15% of initial generations fail catastrophically — token repetition loops, quality collapse, hallucinatory word salad. All re-runs succeed cleanly. These failures are random (bad generation luck, possible API instability), not systematic. Handle them with retry, not prompt engineering.
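A retry wrapper for this is short. The sketch below assumes a `generate` callable that returns story text and an `is_acceptable` check; both are placeholders, as are the attempt limit and the toy check in the demo. It shows the control flow only, not the real API client.

```python
class GenerationFailed(Exception):
    """Raised when every attempt is rejected by the quality check."""


def generate_with_retry(generate, is_acceptable, max_attempts=3):
    """Call generate() until is_acceptable(text) passes; return (text, attempts)."""
    for attempt in range(1, max_attempts + 1):
        text = generate()
        if is_acceptable(text):
            return text, attempt
    raise GenerationFailed(f"no acceptable output in {max_attempts} attempts")


# Stand-in generator: the first call returns a repetition loop,
# the second returns a clean chapter.
canned = iter(["地方, " * 20, "小明在公园里找到了他的玩具。"])
text, attempts = generate_with_retry(
    generate=lambda: next(canned),
    is_acceptable=lambda t: t.count("地方") < 5,  # toy check, not a real gate
)
print(attempts, text)

# If failures were independent at ~15% each (an assumption),
# this is the chance that three attempts in a row all fail.
print(f"{0.15 ** 3:.2%}")
```

Regenerating from scratch rather than patching the failed output matches the observation that these failures are random rather than tied to the prompt.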

Our production decisions:

Model: Kimi K2 via Moonshot direct API (faster and cheaper than OpenRouter)
Prompts: Pedagogical for HSK 1–3, no-vocab for HSK 4–9
No vocabulary list injected at any level
Per-chapter quality gates catch repetition loops, truncation, and language purity
Programmatic vocabulary and error-rate checks drive release validation; optional best-of-two experiments compare candidates by error rate, not the removed AI reviewer score
A ~15% retry rate is built into the pipeline as expected behavior, not treated as a bug
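As an illustration of how gates and error-rate selection might fit together, the sketch below filters candidates through pass/fail gates and then picks the one with the lowest error rate. Everything here is a simplified assumption: `not_truncated` is a toy gate, `char_error_rate` is a character-level stand-in for the word-level checker, and `pick_best` is a hypothetical name.

```python
SENTENCE_END = "。！？”"
ALLOWED_CHARS = set("小明去公园玩。")  # toy stand-in for an HSK word list


def not_truncated(text):
    """Toy truncation gate: the text must end with sentence-final punctuation."""
    return text.rstrip().endswith(tuple(SENTENCE_END))


def char_error_rate(text):
    """Share of characters outside the allowed set (character-level for brevity)."""
    return sum(c not in ALLOWED_CHARS for c in text) / len(text)


def pick_best(candidates, error_rate, gates):
    """Among candidates passing every gate, return the lowest-error one, else None."""
    passing = [c for c in candidates if all(gate(c) for gate in gates)]
    return min(passing, key=error_rate) if passing else None


candidates = ["小明前往公园游玩。", "小明去公园玩。", "小明去公园"]
best = pick_best(candidates, char_error_rate, gates=[not_truncated])
print(best)
```

Returning `None` when nothing passes leaves the caller free to regenerate, which keeps selection separate from retry handling.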

The entire benchmark ran for three weeks and tested more approaches than we expected to need. Most of the findings were counterintuitive: the word list hurt, parameters didn't matter, more structure made stories worse, rewrites made things worse. The simplest approach — let the model write freely, check the result, throw it away if it's bad — consistently outperformed every sophisticated alternative we tested.

Read stories generated by this pipeline: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9
