How We Built AI-Generated Graded Chinese Stories

We benchmarked 7 Chinese LLMs across 9 HSK levels to build graded readers. Here's what surprised us — including why giving AI a word list makes stories worse.

Anthony · March 14, 2026 · 6 min read

Every story on HSKStory is generated by AI, reviewed for quality, and validated against the official HSK 3.0 vocabulary standard. We didn't pick a model and hope for the best — we ran a structured benchmark across 7 Chinese-capable LLMs, 9 HSK levels, and 8 prompt strategies to find what actually works.

Here's what we learned, including several findings that contradicted our assumptions.

The Challenge

Writing a graded reader is a constrained optimization problem. The text must:

  1. Stay within a vocabulary level — use only words a learner at that HSK level should know
  2. Read naturally — sound like real Chinese, not a textbook exercise
  3. Tell a compelling story — with characters, conflict, and pacing that keeps readers engaged

These goals conflict. The strictest vocabulary control produces the dullest prose. The most natural writing uses whatever words fit the story. Finding the right balance is the core challenge.

The Benchmark: 7 Models, 9 Levels

We tested 7 Chinese LLMs head-to-head. Each model generated a complete story at every HSK level using the same outline. We measured writing quality (30-point rubric covering characters, plot, pacing, naturalness, engagement, and vocabulary), vocabulary compliance (percentage of words above the target HSK level), and structural reliability.
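In code terms, the benchmark is a simple grid: every model crossed with every level, one story per cell, all scored the same way. A minimal sketch of that shape (the generation, scoring, and compliance functions are passed in as callables, since the real ones are model API wrappers and human rubric review, not shown here):

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable

@dataclass
class BenchmarkResult:
    model: str
    hsk_level: int
    quality: int        # 30-point rubric: characters, plot, pacing,
                        # naturalness, engagement, vocabulary (each /5)
    error_rate: float   # share of words above the target HSK level

def run_benchmark(
    models: list[str],
    levels: range,
    outline: str,
    generate: Callable[[str, int, str], str],  # (model, level, outline) -> story
    score: Callable[[str], int],               # the 30-point rubric review
    error_rate: Callable[[str, int], float],   # vocabulary compliance check
) -> list[BenchmarkResult]:
    """One story per (model, level) cell, all from the same outline."""
    results = []
    for model, level in product(models, levels):
        story = generate(model, level, outline)
        results.append(
            BenchmarkResult(model, level, score(story), error_rate(story, level))
        )
    return results
```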

Three models failed before scoring:

  • Kimi K2.5 (reasoning model): Thinking tokens leaked into story content in 35% of chapters — English planning text like "The user wants me to continue..." appeared mid-paragraph
  • GLM-4.7-Flash: Degenerated into token repetition loops — one story repeated the same word for 10,000+ characters
  • Step-3.5-Flash: Dumped Chinese chain-of-thought into the story content

These failures highlight a reality of AI-generated content: not every model can reliably produce structured creative output, even if it excels at conversation.
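The encouraging part is that these failure modes are mechanically detectable before a story ever reaches review. A minimal sketch of the first two checks, operating on per-chapter strings (the phrase list and thresholds are illustrative, not our production values; catching Step-3.5's Chinese-language chain-of-thought needs structural cues beyond this sketch):

```python
import re

# English planning phrases have no business appearing inside Chinese prose.
LEAK_PATTERNS = re.compile(r"The user wants|Let me|I should|I will now")

def has_thinking_leak(chapter: str) -> bool:
    """Flag chapters where reasoning/planning text leaked into story content."""
    return bool(LEAK_PATTERNS.search(chapter))

def has_repetition_loop(chapter: str, span_len: int = 10, repeats: int = 10) -> bool:
    """Flag degenerate output: the same short span repeated back-to-back."""
    block = span_len * repeats
    for i in range(len(chapter) - block + 1):
        span = chapter[i : i + span_len]
        if chapter[i : i + block] == span * repeats:
            return True
    return False
```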

The Inverse Correlation: Quality vs. Compliance

The clearest pattern in our data: the models that controlled vocabulary best wrote the worst stories.

| Model | Avg. Quality (/30) | Avg. Vocab Error Rate |
| --- | --- | --- |
| Kimi K2 | 26.8 | 4.5% |
| DeepSeek V3.2 | 26.0 | 3.2% |
| Doubao Seed 2.0 Pro | 23.5 | n/a |
| Qwen 3.5 Plus | 10.5 | 0.9% |

Qwen achieved the best vocabulary control by far (0.9% error rate) — but its stories scored 10.5/30. It stayed within level by writing below level. Simple, repetitive, flat.

Kimi K2 wrote the most engaging stories (26.8/30, including a perfect 30/30 at HSK 1 that "reads like real Chinese children's literature") but used 4.5% above-level words.

Here's the key insight: that 4.5% error rate is actually close to the pedagogically optimal zone. Second language acquisition research shows that 2–3% unknown words is the sweet spot for learning — enough new vocabulary to acquire through context while maintaining comprehension (Hu & Nation 2000, Schmitt et al. 2011).

We were trying to "fix" something that was already working.
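Measuring that error rate is straightforward: segment the story into words and count how many fall outside the cumulative word list for the target level. A sketch using the jieba segmenter (the per-level file layout is an assumption; the HSK 3.0 lists ship in various formats):

```python
import re
import jieba  # third-party Chinese word segmenter: pip install jieba

HAN = re.compile(r"[\u4e00-\u9fff]")  # keep only tokens containing Han characters

def load_hsk_vocab(max_level: int) -> set[str]:
    """Union of the HSK 3.0 word lists for levels 1..max_level.
    Assumes one word per line in per-level files (illustrative layout)."""
    vocab: set[str] = set()
    for level in range(1, max_level + 1):
        with open(f"hsk30/level_{level}.txt", encoding="utf-8") as f:
            vocab.update(line.strip() for line in f if line.strip())
    return vocab

def vocab_error_rate(story: str, hsk_level: int) -> float:
    """Share of segmented words that fall above the target HSK level."""
    allowed = load_hsk_vocab(hsk_level)
    words = [w for w in jieba.lcut(story) if HAN.search(w)]
    if not words:
        return 0.0
    return sum(1 for w in words if w not in allowed) / len(words)
```

By this measure, a story landing at 2–3% sits in the sweet spot rather than out of spec.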

The Surprising Finding: Vocab Lists Make Stories Worse

Our original benchmark injected the full HSK word list into every prompt — at HSK 7, that's roughly 25,000 tokens of vocabulary before the model starts writing.

We assumed this would help models stay within level. It did the opposite.

| Condition | HSK 1 Error Rate | HSK 7 Quality (/30) |
| --- | --- | --- |
| With vocab list | 3.3% | 9 |
| Without vocab list | 2.1% | 26 |

Qwen scored 9/30 at HSK 7 with the vocab list. Without it: 26/30. Doubao Pro jumped from 23/30 to a perfect 30/30 — the reviewer called it "the most emotionally powerful story" in the entire benchmark.

Why? The vocab list likely activates above-level vocabulary in the model's attention. Showing a model words it shouldn't use seems to make it more likely to use them — while simultaneously degrading the creative quality of its output.

One exception: DeepSeek V3.2's HSK 3 story dropped from 28/30 to 23/30 without the list. For that model at lower levels, the constraint provided useful focus.
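For reference, the two benchmark conditions differ only in whether the word list is appended to an otherwise identical prompt. A sketch (the prompt wording is illustrative; load_hsk_vocab is the helper from the compliance check above):

```python
def build_prompt(hsk_level: int, outline: str, inject_vocab_list: bool) -> str:
    """Identical prompts except for the appended word-list condition."""
    prompt = (
        f"Write a story chapter in Chinese using HSK {hsk_level} vocabulary.\n"
        f"Outline:\n{outline}\n"
    )
    if inject_vocab_list:
        # At HSK 7 this appends roughly 25,000 tokens of vocabulary before
        # the model writes a word -- and, per the results above, hurts both
        # compliance and quality.
        allowed = sorted(load_hsk_vocab(hsk_level))
        prompt += "Use only these words:\n" + "、".join(allowed) + "\n"
    return prompt
```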

The HSK 4 Crossover Point

We tested 8 prompt strategies to find the best approach per level. The most important finding: the optimal strategy changes at HSK 4.

Below HSK 4 — pedagogical prompts with level-specific structural rules work best. Rules like "use only SVO sentence structure" and "require explicit subjects" reduce above-level words by 2–5x compared to unconstrained generation.

At HSK 4 and above — simple prompts ("write using HSK 5 vocabulary") outperform constrained ones. The vocabulary pool is large enough (1,978+ words) that models naturally stay within level.

| HSK Level | Best Strategy | Error Rate |
| --- | --- | --- |
| HSK 1 | Pedagogical rules | 3.1% |
| HSK 2 | Pedagogical rules | 4.0% |
| HSK 3 | Pedagogical rules | 6.6% |
| HSK 4+ | Unconstrained | 1.3–3.6% |

This crossover makes intuitive sense. At HSK 1 (300 words), the model needs guardrails. At HSK 5 (3,557 words), the pool is rich enough for natural storytelling without constraints.
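In the pipeline, the crossover reduces to a single branch on the level. A sketch (the rule text is abbreviated; the real pedagogical prompts are longer and level-specific):

```python
# Abbreviated; the production pedagogical prompts are longer and level-specific.
PEDAGOGICAL_RULES = (
    "Use only SVO sentence structure. "
    "Every sentence must have an explicit subject. "
    "Avoid idioms and above-level grammar."
)

def prompt_strategy(hsk_level: int) -> str:
    """HSK 4 is the crossover: guardrails below it, a plain instruction above."""
    if hsk_level < 4:
        # Small vocabulary pools need structural guardrails.
        return f"Write using only HSK {hsk_level} vocabulary. {PEDAGOGICAL_RULES}"
    # From HSK 4 up, the pool is large enough for natural prose unconstrained.
    return f"Write using HSK {hsk_level} vocabulary."
```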

The Quality Ceiling

We tested every approach we could think of to push quality above 22/30 for a challenging HSK 5 story:

  • Sliding context window (1 chapter vs. full history)
  • AI reviewer feedback injected into rewrites
  • One-shot full-story generation (single API call)
  • State tracking with character knowledge graphs
  • Enhanced outlines with story circle frameworks
  • Temperature and parameter tuning

Every approach scored 21–22/30. The consistent weaknesses: pacing (3/5) and plot complexity (3–4/5). Characters and naturalness always scored well (4–5/5).

The most counterintuitive result: enhanced outlines with Harmon Story Circle structure, timeline tracking, and character knowledge states hurt quality by 5 points. More constraints gave the model more opportunities to violate them.

We also tested a two-pass approach: an AI reviewer catches issues, then the model rewrites the flagged chapters. Rewrites fixed the specific problems but introduced new contradictions, because the model rewrites individual chapters without the full story in context. Scores didn't improve.

Our conclusion: the reviewer was more useful as a quality gate (accept or regenerate) than as an editor, but a later controlled test showed the gate did not change which best-of-two story was selected. We removed it from production and rely on deterministic quality checks instead.

The Production Pipeline

Everything we learned shaped the production pipeline:

  1. Level-appropriate prompting — pedagogical rules for HSK 1–3, unconstrained for HSK 4–9
  2. Per-chapter quality gates — repetition detection, minimum length, language purity checks
  3. Vocabulary validation — every story checked against the HSK 3.0 standard
  4. Deterministic validation — vocabulary metrics decide whether a story is ready for release; optional best-of-two experiments compare candidates by error rate, not reviewer score
  5. Automatic retry — roughly 15% of initial generations don't pass quality gates, triggering fresh regeneration (sketched below)
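Steps 2 and 5 combine into a simple gate-and-retry loop. A minimal sketch, reusing the failure checks from earlier and taking the generator as a callable (the length threshold and retry budget are illustrative):

```python
from typing import Callable

MAX_ATTEMPTS = 3
MIN_CHAPTER_CHARS = 400  # illustrative minimum-length threshold

def passes_gates(chapter: str) -> bool:
    """Deterministic per-chapter checks; any failure means regenerate."""
    return (
        len(chapter) >= MIN_CHAPTER_CHARS
        and not has_repetition_loop(chapter)  # repetition detection, from above
        and not has_thinking_leak(chapter)    # doubles as a language-purity check
    )

def generate_with_retry(generate: Callable[[], str]) -> str:
    """Roughly 15% of first attempts fail a gate; regenerate from scratch."""
    for _ in range(MAX_ATTEMPTS):
        chapter = generate()
        if passes_gates(chapter):
            return chapter
    raise RuntimeError("chapter failed quality gates after retries")
```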

The result: 105 published stories across HSK 1–9, with vocabulary error rates averaging 3–5% at upper levels and 13–14% at HSK 1–2 (where the 300-word vocabulary pool makes strict compliance impossible without sacrificing readability).

Those error rates aren't a compromise — they're a feature. A story with 3% unknown words is a story where you're learning while reading. Every word you don't know is a chance to acquire it through context, the way native speakers learn vocabulary naturally.


Explore stories by level: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9

Frequently Asked Questions

What makes AI-graded stories different from manually graded ones?

AI grading automatically analyzes vocabulary coverage, sentence complexity, and grammar patterns against HSK word lists, enabling faster content production and consistent difficulty calibration. Manual grading relies on human judgment: it is slower, but it can catch nuances AI misses.

Are AI-generated Chinese stories accurate?

Quality depends on the pipeline. HSKStory uses AI for initial generation followed by human review for naturalness, cultural accuracy, and pedagogical value. Pure AI output without review can produce grammatically correct but unnatural or culturally tone-deaf content.

How does AI determine Chinese reading difficulty?

AI grading systems typically analyze the percentage of words from each HSK level, average sentence length, grammar complexity, and use of idiomatic expressions. A story targeting HSK 3 should use 95%+ vocabulary from HSK 1–3 levels.