Chinese has no spaces between words. That's the first problem. The second is that the same character can have completely different pronunciations depending on context. The third is that even after you get the pronunciation right, displaying it above the text breaks your layout in ways that CSS was never designed to handle.
We built a pipeline that adds accurate pinyin to 100 graded Chinese stories — 434,000 character segments across 9 HSK levels. It took three major rewrites, one reverted GPU batch job, and more edge cases than we thought existed in a natural language. Here's what we learned.
Why This Is Hard
Adding pinyin to Chinese text requires solving three problems in sequence:
- Word segmentation — where do words start and end? 大学生活 could be 大学生 + 活 (college student + live) or 大学 + 生活 (college + life). Only context tells you.
- Pronunciation disambiguation — which pronunciation does this character get? 得 alone can be de (grammatical particle), dé (to obtain), or děi (must). There are thousands of these polyphonic characters.
- Display — how do you render pinyin above characters without breaking line spacing, word boundaries, or character alignment?
Most tutorials stop at step 1 and call it done. We had to solve all three for 100 stories where every wrong syllable is a learner reading the wrong sound.
The LLM Approach (And Why We Abandoned It)
Our first attempt used DeepSeek's API to segment and annotate text. Send a paragraph, get back word boundaries with pinyin. It worked — sometimes.
The problems were fundamental, not fixable:
- Non-deterministic. The same input produced different segmentations on different runs. A story re-processed after an edit might segment differently from the original.
- Fragile. The API would merge curly quotes with adjacent characters, truncate long paragraphs, or return malformed JSON. We wrote five separate retry and repair scripts just to handle DeepSeek's inconsistencies.
- Expensive and slow. Every re-segmentation was an API call with latency and cost. Iterating on quality meant burning money.
We needed a deterministic pipeline. Same input, same output, every time, running locally with no API dependency.
The 8-Stage Pipeline
We replaced the LLM with a pipeline built on spacy-pkuseg (neural word segmenter) and pypinyin (pronunciation dictionary). Eight stages, each fixing a specific class of error that the segmenter or an earlier stage would otherwise introduce.
Stage 1: Split on whitespace
pkuseg silently strips all whitespace, including ideographic spaces used as paragraph indentation in Chinese text. We pre-split on whitespace runs and preserve them as separate tokens. This sounds trivial — but skipping it caused 128 paragraphs across 43 stories to produce null entries in our pinyin data, because the reconstructed text no longer matched the original.
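The pre-split itself is small — a minimal sketch using Python's `re` module (the helper name is ours, not the pipeline's):

```python
import re

def pre_split_whitespace(text):
    """Split on runs of whitespace -- including the ideographic space
    (U+3000) used for paragraph indentation -- and keep the runs as
    standalone tokens so the original text reconstructs exactly."""
    # A capturing group makes re.split keep the separators.
    return [p for p in re.split(r'(\s+)', text) if p]

# Indentation survives: joining the tokens reproduces the input.
text = "\u3000\u3000今天天气很好"
assert "".join(pre_split_whitespace(text)) == text
```

Python's `\s` matches U+3000 in str patterns, so ideographic indentation is caught by the same rule as ordinary spaces.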
Stage 2: Pre-split punctuation
pkuseg's neural model merges curly quotes, em dashes, and angle brackets with adjacent characters. "你好" becomes a single token instead of three. We isolate these punctuation marks before feeding text to the segmenter. Fullwidth punctuation (,。!?) is handled correctly by pkuseg and doesn't need pre-splitting — a distinction we learned the hard way by over-splitting and breaking sentence-final particles.
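The isolation step can be sketched the same way — the character set below is illustrative, not the pipeline's exact list, and note that fullwidth sentence punctuation is deliberately excluded:

```python
import re

# Marks pkuseg's neural model tends to fuse with adjacent characters.
# Fullwidth punctuation (,。!?) is handled fine and is left alone.
MERGE_PRONE = '“”‘’—《》〈〉'

def isolate_punct(text):
    """Split merge-prone punctuation into standalone tokens before
    the text reaches the segmenter."""
    pattern = '([' + re.escape(MERGE_PRONE) + '])'
    return [t for t in re.split(pattern, text) if t]

# “你好” becomes three tokens, not one.
assert isolate_punct('“你好”') == ['“', '你好', '”']
```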
Stage 3: Neural segmentation
spacy-pkuseg segments the cleaned text into words. This is the core NLP step — a trained neural model that understands Chinese word boundaries. We run it with no custom dictionary, which is the counterintuitive part. More on that below.
Stage 4: Split merged suffixes
pkuseg sometimes fuses role suffixes with the following word. 理发师剪 (barber cuts) becomes 理发 + 师剪 — the 师 (master/professional) suffix gets stuck to the next verb. We detect tokens that start with a role suffix character (员/师/者/家/长/生), verify the token isn't a real word, and split it so the merge stage can recombine correctly.
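The detection logic reduces to a small check — a sketch with a toy dictionary (the real pipeline's word lists are the HSK and jieba dictionaries described below):

```python
ROLE_SUFFIXES = set('员师者家长生')

def split_merged_suffix(token, dictionary):
    """If a token starts with a role-suffix character but is not itself
    a known word, split the suffix off so the merge stage can reattach
    it to the preceding word (e.g. 师剪 -> 师 + 剪)."""
    if len(token) >= 2 and token[0] in ROLE_SUFFIXES and token not in dictionary:
        return [token[0], token[1:]]
    return [token]

# 理发师剪 mis-segmented as 理发 + 师剪: the fused suffix is split.
assert split_merged_suffix('师剪', {'理发', '老师'}) == ['师', '剪']
# 生活 starts with 生 but is a real word, so it is left alone.
assert split_merged_suffix('生活', {'生活'}) == ['生活']
```

The second assertion is why the "verify the token isn't a real word" check matters: 生 is a role suffix, but 生活 must not be split.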
Stage 5: Smart merge
Adjacent tokens are merged when they form a known compound word. Two dictionaries provide lookup: the HSK vocabulary (10,896 words) and jieba's general dictionary (498,000 words). The merge has a special exception for role suffixes — without it, 服务 + 员 would never merge into 服务员 because both parts are individually valid HSK words, and the algorithm would have no reason to combine them.
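One plausible reading of the merge rule, as a sketch (the guard against over-merging and the exact condition in the real pipeline may differ):

```python
ROLE_SUFFIXES = set('员师者家长生')

def smart_merge(tokens, compounds, valid_words):
    """Merge adjacent tokens whose concatenation is a known compound.
    The base rule skips the merge when both parts are already valid
    standalone words (to avoid over-merging); the role-suffix
    exception forces 服务 + 员 -> 服务员 anyway."""
    out = []
    for tok in tokens:
        if out:
            cand = out[-1] + tok
            if cand in compounds:
                both_valid = out[-1] in valid_words and tok in valid_words
                suffix_case = len(tok) == 1 and tok in ROLE_SUFFIXES
                if not both_valid or suffix_case:
                    out[-1] = cand
                    continue
        out.append(tok)
    return out

# The role-suffix exception in action: both halves are valid HSK
# words, but 员 forces the merge.
assert smart_merge(['服务', '员'], {'服务员'}, {'服务', '员'}) == ['服务员']
```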
Stage 6: Orphan word fix
After merging, some compounds exist in jieba's frequency dictionary but not in CC-CEDICT — they have no English translation. When a reader taps these words, they see pinyin but an empty definition card. We call these "orphan words." The fix: any multi-character token not found in HSK vocabulary or CC-CEDICT gets split back into individual characters (or jieba sub-words if all parts have definitions). This eliminated ~8,000 orphan word types while keeping character names and real dictionary compounds intact.
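The core of the fix fits in a few lines — a sketch that omits the jieba sub-word fallback and uses toy dictionaries:

```python
def fix_orphans(tokens, hsk, cedict):
    """Split multi-character tokens with no definition in either
    dictionary back into single characters, so every tappable word
    has a definition card instead of an empty one."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok not in hsk and tok not in cedict:
            out.extend(tok)  # fall back to per-character entries
        else:
            out.append(tok)
    return out

# A token with a frequency entry but no CC-CEDICT definition is
# broken back into characters; defined words pass through intact.
assert fix_orphans(['理发', '师剪'], {'理发'}, set()) == ['理发', '师', '剪']
```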
Stage 7: Batch pinyin annotation
All words are joined back into a full sentence, and pypinyin annotates the entire sentence at once. Sentence-level annotation is critical because some pronunciation rules depend on surrounding context — tone sandhi for 一 and 不 changes based on the following syllable's tone. We supplement pypinyin's built-in dictionary with CC-CEDICT (105,000 phrase entries), which fixed pronunciation for 9,248 unique story words across 89,851 occurrences.
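Tone sandhi for 一 and 不 is why sentence-level context matters. A simplified sketch of the rule — pypinyin applies this internally, and the real rule has exceptions (ordinals, neutral tones) that this omits:

```python
def yi_bu_sandhi(syllables):
    """Simplified 一/不 tone sandhi over (hanzi, pinyin, tone) tuples:
      一 (yī) -> yí before a 4th tone, yì before tones 1-3
      不 (bù) -> bú before a 4th tone"""
    out = []
    for i, (hz, py, tone) in enumerate(syllables):
        nxt = syllables[i + 1][2] if i + 1 < len(syllables) else None
        if hz == '一' and nxt is not None:
            py, tone = ('yí', 2) if nxt == 4 else ('yì', 4)
        elif hz == '不' and nxt == 4:
            py, tone = 'bú', 2
        out.append((hz, py, tone))
    return out

# 一个 -> yí gè, 不对 -> bú duì
assert yi_bu_sandhi([('一', 'yī', 1), ('个', 'gè', 4)])[0][1] == 'yí'
assert yi_bu_sandhi([('不', 'bù', 4), ('对', 'duì', 4)])[0][1] == 'bú'
```

Annotating word-by-word would lose the following syllable's tone, which is exactly the context these rules need.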
Stage 8: Context heuristics and overrides
A final pass handles cases that no dictionary can solve. The most interesting: 地 after a closing curly quote. In Chinese, onomatopoeia is often quoted and followed by 地 as an adverbial marker — “哗”地冲出来 means "rushed out with a splash." The 地 here is always de, never dì (ground). Neither pypinyin nor the neural pronunciation model recognizes this pattern. A simple positional check — 地 immediately after ” or ’ — fixed 109 cases across 36 stories.
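The positional check itself is tiny — a sketch over parallel character and pinyin lists (the real pipeline's data structures may differ):

```python
CLOSING_QUOTES = '”’'

def fix_de_after_quote(chars, pinyin):
    """Override: 地 immediately after a closing curly quote marks
    quoted onomatopoeia as adverbial, so it reads de, not dì."""
    fixed = list(pinyin)
    for i, ch in enumerate(chars):
        if ch == '地' and i > 0 and chars[i - 1] in CLOSING_QUOTES:
            fixed[i] = 'de'
    return fixed

# “哗”地冲出来: the 地 after ” becomes de.
chars = list('“哗”地冲出来')
py = ['“', 'huā', '”', 'dì', 'chōng', 'chū', 'lái']
assert fix_de_after_quote(chars, py)[3] == 'de'
```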
The Dictionary Paradox
The most counterintuitive discovery in the entire project.
We initially loaded all 10,896 HSK words as a custom dictionary for pkuseg. This seemed obviously correct — tell the segmenter about the words that matter to our learners. The segmentation got worse.
The custom dictionary made pkuseg greedily prefer HSK words, breaking three classes of compounds:
| Input | Expected | With HSK Dict | Bug |
|---|---|---|---|
| 大学生活 | 大学 + 生活 | 大学生 + 活 | Greedy match on HSK word 大学生 |
| 开开心心 | 开开心心 | 开 + 开心 + 心 | Greedy match on HSK word 开心 |
| 服务员笑 | 服务员 + 笑 | 服务 + 员笑 | HSK word 服务 matched, suffix orphaned |
pkuseg's neural model already knows common words. The custom dictionary didn't teach it anything new — it just overrode the model's contextual judgment with greedy string matching.
The fix: remove the dictionary entirely. Let the neural model segment freely, then recombine tokens into known words in the smart merge pass. This achieved the same vocabulary coverage without the greedy matching bugs.
The lesson: more data made the model worse. The neural network had learned subtle disambiguation rules from its training corpus. Injecting a dictionary replaced those learned rules with a crude longest-match heuristic.
The Polyphonic Problem
pypinyin alone achieves roughly 87% accuracy on polyphonic characters. For a language learning app, that's not good enough. The character 得 appears hundreds of times across our stories, and getting it wrong means a learner practices the wrong pronunciation.
g2pW: A BERT Model for Chinese Pronunciation
g2pW is a BERT-based model that predicts character pronunciation from sentence context. We ran it on an RTX 4090 (rented on Vast.ai) across all 100 stories.
Version 1: 40,000 corrections, 3,000 regressions. Reverted.
The BERT model made 40,513 changes across 434,235 character segments. But it systematically broke multi-character words. 头发 tóu fa (hair) became tóu fā — the model stripped the neutral tone from the second syllable. 眼睛 yǎn jing became yǎn jīng. It also predicted rare classical readings: 和 became hàn (a classical variant) instead of the standard hé.
We identified 3,304 regressions and reverted the entire batch the same day.
Version 2: The Hybrid Fix
The insight: g2pW is excellent at single-character disambiguation (得: de or dé?) but unreliable for multi-character words where dictionary lookups are already correct.
The fix was a priority system:
1. Manual overrides — known errors in both g2pW and CC-CEDICT (e.g., 大夫 = dài fu, not dà fū)
2. CC-CEDICT phrase dictionary — correct pronunciation for 105,000 multi-character words
3. g2pW — BERT predictions for single characters and unknown phrases
4. pypinyin defaults — kept if nothing else changes them
This meant running g2pW first, then re-applying the phrase dictionaries for multi-character words. The neural model handles what it's good at (context-dependent single characters), and the dictionaries handle what they're good at (known phrase pronunciations).
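The resolution order can be sketched as a cascade of lookups — helper names and the stub predictions are ours, for illustration:

```python
def resolve_pinyin(word, manual, cedict, g2pw_pred, pypinyin_default):
    """Priority cascade (sketch): manual overrides beat the CC-CEDICT
    phrase dictionary, which beats g2pW's prediction, which beats
    pypinyin's default."""
    if word in manual:
        return manual[word]
    if len(word) > 1 and word in cedict:
        return cedict[word]  # dictionaries win for known phrases
    if g2pw_pred is not None:
        return g2pw_pred     # neural model for single chars / unknowns
    return pypinyin_default

# 大夫: the manual override beats every other source.
assert resolve_pinyin('大夫', {'大夫': 'dài fu'}, {'大夫': 'dà fū'},
                      'dà fū', 'dà fū') == 'dài fu'
# 头发: the phrase dictionary overrides the g2pW regression tóu fā.
assert resolve_pinyin('头发', {}, {'头发': 'tóu fa'}, 'tóu fā',
                      'tóu fā') == 'tóu fa'
```

This is why v2 re-applies the phrase dictionaries after g2pW: the neural prediction only survives where no dictionary has a say.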
Result: 16,222 corrections. Zero regressions. Validated against 13 known failure patterns.
| Category | Corrections | Example |
|---|---|---|
| 得 → de (particle) | 1,883 | 跑得快 pǎo de kuài |
| 地 → de (adverbial) | 360 | 慢慢地走 mànmàn de zǒu |
| 一 tone sandhi | 6,373 | 一个 yí gè (not yī gè) |
| 不 tone sandhi | 665 | 不对 bú duì (not bù duì) |
The CSS Problem Nobody Warns You About
After solving segmentation and pronunciation, we had one more problem: displaying pinyin above characters using HTML <ruby> tags.
Bug 1: Invisible character spacing
Each character gets its own <ruby> tag for per-character pinyin annotation. The browser sizes each ruby box to whichever is wider: the base character or the pinyin above it. Pinyin like chuāng is physically wider than the character 窗. This forces the ruby box wider, creating visible gaps between characters within the same word.
The first instinct — hide pinyin with opacity: 0 — doesn't work. opacity: 0 keeps the element in layout. The invisible pinyin still widens the ruby box.
The fix required two CSS tricks:
- Pinyin hidden: `display: none` on `<rt>` (removes it from layout entirely, zero spacing impact)
- Pinyin visible: `width: 0; overflow: visible` on `<rt>` (the annotation renders visually via overflow but contributes zero width to the ruby box)
Bug 2: Punctuation double-spacing
Chinese fullwidth punctuation (!?,。) has built-in half-width blank space in the Noto Serif SC glyph design. Adjacent to quotes, this creates a visible double-space. We fixed this with OpenType font-feature-settings: "chws" 1 — Contextual Half-Width Spacing, which collapses the blank space only when two punctuation marks are adjacent. The more aggressive halt feature (half-width for all punctuation) collapsed spacing between sentences, making text unreadable. The proper CSS solution, text-spacing-trim, isn't supported in Chrome without a flag yet.
The Numbers
After three rewrites, a reverted GPU batch, and 98 test cases:
- 100 stories across HSK 1–9
- 434,235 character segments annotated
- 105,000 phrase entries in the pronunciation dictionary
- 16,222 polyphonic corrections via hybrid BERT + dictionary
- Zero API cost, millisecond latency per segmentation call
- 70 seconds to regenerate all 100 stories on a laptop CPU
The pipeline is deterministic. Same input, same output, every time. No retry scripts, no parsing hacks, no API keys.
Read stories with pinyin at every level: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9