How We Built It

How to Add Pinyin to Any Chinese Text

An 8-stage pipeline for accurate pinyin annotation. Character disambiguation, segmenter tuning, and what we learned processing 100 Chinese stories.

Anthony·March 16, 2026·8 min read

Chinese has no spaces between words. That's the first problem. The second is that the same character can have completely different pronunciations depending on context. The third is that even after you get the pronunciation right, displaying it above the text breaks your layout in ways that CSS was never designed to handle.

We built a pipeline that adds accurate pinyin to 100 graded Chinese stories — 434,000 character segments across 9 HSK levels. It took three major rewrites, one reverted GPU batch job, and more edge cases than we thought existed in a natural language. Here's what we learned.

Why This Is Hard

Adding pinyin to Chinese text requires solving three problems in sequence:

  1. Word segmentation — where do words start and end? 大学生活 could be 大学生 + 活 (college student + live) or 大学 + 生活 (college + life). Only context tells you.
  2. Pronunciation disambiguation — which pronunciation does this character get? 得 alone can be de (grammatical particle), dé (to obtain), or děi (must). There are thousands of these polyphonic characters.
  3. Display — how do you render pinyin above characters without breaking line spacing, word boundaries, or character alignment?

Most tutorials stop at step 1 and call it done. We had to solve all three for 100 stories where every wrong syllable is a learner reading the wrong sound.

The LLM Approach (And Why We Abandoned It)

Our first attempt used DeepSeek's API to segment and annotate text. Send a paragraph, get back word boundaries with pinyin. It worked — sometimes.

The problems were fundamental, not fixable:

  • Non-deterministic. The same input produced different segmentations on different runs. A story re-processed after an edit might segment differently from the original.
  • Fragile. The API would merge curly quotes with adjacent characters, truncate long paragraphs, or return malformed JSON. We wrote five separate retry and repair scripts just to handle DeepSeek's inconsistencies.
  • Expensive and slow. Every re-segmentation was an API call with latency and cost. Iterating on quality meant burning money.

We needed a deterministic pipeline. Same input, same output, every time, running locally with no API dependency.

The 8-Stage Pipeline

We replaced the LLM with a pipeline built on spacy-pkuseg (neural word segmenter) and pypinyin (pronunciation dictionary). Eight stages, each solving a specific class of error introduced by the stages before it.

Stage 1: Split on whitespace

pkuseg silently strips all whitespace, including ideographic spaces used as paragraph indentation in Chinese text. We pre-split on whitespace runs and preserve them as separate tokens. This sounds trivial — but skipping it caused 128 paragraphs across 43 stories to produce null entries in our pinyin data, because the reconstructed text no longer matched the original.
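A minimal sketch of the pre-split, assuming a small helper (split_preserving_whitespace is our name here, not the pipeline's):

```python
import re

def split_preserving_whitespace(text: str) -> list[str]:
    """Split into alternating text and whitespace tokens.

    The capturing group keeps whitespace runs (including ideographic
    spaces, U+3000) as their own tokens, so the annotated output can be
    reassembled to match the original text exactly.
    """
    return [tok for tok in re.split(r"(\s+)", text) if tok]

split_preserving_whitespace("\u3000\u3000他说：“你好。”")
# ['\u3000\u3000', '他说：“你好。”']; only the non-whitespace chunks go to pkuseg
```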

Stage 2: Pre-split punctuation

pkuseg's neural model merges curly quotes, em dashes, and angle brackets with adjacent characters. "你好" becomes a single token instead of three. We isolate these punctuation marks before feeding text to the segmenter. Fullwidth punctuation (,。!?) is handled correctly by pkuseg and doesn't need pre-splitting — a distinction we learned the hard way by over-splitting and breaking sentence-final particles.
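A sketch of this pre-split; the exact set of characters isolated here is illustrative, not the pipeline's full list:

```python
import re

# Marks pkuseg tends to merge with neighbours. Fullwidth ，。！？ are
# deliberately absent: pkuseg already handles them correctly.
PRESPLIT_CHARS = "“”‘’—…《》〈〉"
PRESPLIT_RE = re.compile("([" + re.escape(PRESPLIT_CHARS) + "])")

def isolate_punctuation(chunk: str) -> list[str]:
    # Split around each troublesome mark so it reaches the segmenter alone.
    return [piece for piece in PRESPLIT_RE.split(chunk) if piece]

isolate_punctuation("他说：“你好。”")
# ['他说：', '“', '你好。', '”']
```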

Stage 3: Neural segmentation

spacy-pkuseg segments the cleaned text into words. This is the core NLP step — a trained neural model that understands Chinese word boundaries. We run it with no custom dictionary, which is the counterintuitive part. More on that below.
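The call itself is small. A sketch, assuming spacy-pkuseg keeps the original pkuseg package's pkuseg().cut() interface:

```python
import spacy_pkuseg as pkuseg

# No custom user dictionary: the neural model segments freely, and known
# compounds are recombined later by the smart-merge stage.
seg = pkuseg.pkuseg()
tokens = seg.cut("大学生活很有意思")
# e.g. ['大学', '生活', '很', '有意思'], depending on the model version
```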

Stage 4: Split merged suffixes

pkuseg sometimes fuses role suffixes with the following word. 理发师剪 (barber cuts) becomes 理发 + 师剪 — the 师 (master/professional) suffix gets stuck to the next verb. We detect tokens that start with a role suffix character (员/师/者/家/长/生), verify the token isn't a real word, and split it so the merge stage can recombine correctly.
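A sketch of the suffix split; is_word stands in for the dictionary lookup described in the next stage:

```python
ROLE_SUFFIXES = set("员师者家长生")

def split_merged_suffix(token: str, is_word) -> list[str]:
    # If a token starts with a role suffix but is not itself a real word,
    # peel the suffix off so the merge stage can reattach it to the
    # preceding token: 理发 + 师剪 → 理发 + 师 + 剪 → 理发师 + 剪.
    if len(token) >= 2 and token[0] in ROLE_SUFFIXES and not is_word(token):
        return [token[0], token[1:]]
    return [token]

split_merged_suffix("师剪", is_word=lambda w: False)
# ['师', '剪']
```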

Stage 5: Smart merge

Adjacent tokens are merged when they form a known compound word. Two dictionaries provide lookup: the HSK vocabulary (10,896 words) and jieba's general dictionary (498,000 words). The merge has a special exception for role suffixes — without it, 服务 + 员 would never merge into 服务员 because both parts are individually valid HSK words, and the algorithm would have no reason to combine them.
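A simplified sketch of the merge pass. known_words stands in for the union of the HSK list and jieba's dictionary, and the "merge only when one part isn't a standalone word" base rule is our simplification of the real logic:

```python
ROLE_SUFFIXES = set("员师者家长生")

def smart_merge(tokens: list[str], known_words: set[str]) -> list[str]:
    out: list[str] = []
    for tok in tokens:
        if out:
            candidate = out[-1] + tok
            if candidate in known_words:
                # Role-suffix exception: 服务 + 员 merges even though both
                # halves are valid words on their own.
                suffix_join = tok in ROLE_SUFFIXES
                partial = out[-1] not in known_words or tok not in known_words
                if suffix_join or partial:
                    out[-1] = candidate
                    continue
        out.append(tok)
    return out

words = {"服务", "员", "服务员", "笑", "理发", "师", "理发师", "剪"}
smart_merge(["服务", "员", "笑"], words)   # ['服务员', '笑']
smart_merge(["理发", "师", "剪"], words)   # ['理发师', '剪']
```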

Stage 6: Ghost word fix

After merging, some compounds exist in jieba's frequency dictionary but not in CC-CEDICT — they have no English translation. When a reader taps these words, they see pinyin but an empty definition card. We call these "orphan words." The fix: any multi-character token not found in HSK vocabulary or CC-CEDICT gets split back into individual characters (or jieba sub-words if all parts have definitions). This eliminated ~8,000 orphan word types while keeping character names and real dictionary compounds intact.
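A sketch of the orphan-word split; has_definition stands in for the combined HSK + CC-CEDICT lookup:

```python
import jieba

def fix_orphan_word(token: str, has_definition) -> list[str]:
    # Keep tokens that a reader can actually look up.
    if len(token) < 2 or has_definition(token):
        return [token]
    # Prefer jieba sub-words if every part has a definition...
    parts = jieba.lcut(token)
    if len(parts) > 1 and all(has_definition(p) for p in parts):
        return parts
    # ...otherwise fall back to single characters.
    return list(token)
```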

Stage 7: Batch pinyin annotation

All words are joined back into a full sentence, and pypinyin annotates the entire sentence at once. Sentence-level annotation is critical because some pronunciation rules depend on surrounding context — tone sandhi for 一 and 不 changes based on the following syllable's tone. We supplement pypinyin's built-in dictionary with CC-CEDICT (105,000 phrase entries), which fixed pronunciation for 9,248 unique story words across 89,851 occurrences.
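The pypinyin side looks roughly like this; the load_phrases_dict entry shown is only an illustration of the format, and the real pipeline loads the CC-CEDICT entries the same way:

```python
from pypinyin import pinyin, Style, load_phrases_dict

# Supplement pypinyin's built-in dictionary with extra phrase readings
# (one pinyin list per character of the phrase).
load_phrases_dict({"大夫": [["dài"], ["fu"]]})

# Annotate the whole sentence in one call so phrase-level lookups see the
# surrounding words rather than isolated characters.
syllables = pinyin("大夫慢慢地走了出来", style=Style.TONE)
# One inner list per character: [['dài'], ['fu'], ['màn'], ['màn'], ...]
```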

Stage 8: Context heuristics and overrides

A final pass handles cases that no dictionary can solve. The most interesting: 地 after a closing curly quote. In Chinese, onomatopoeia is often quoted and followed by 地 as an adverbial marker — “哗”地冲出来 means "rushed out with a splash." The 地 here is always de, never dì (ground). Neither pypinyin nor the neural pronunciation model recognizes this pattern. A simple positional check — 地 immediately after ” or ’ — fixed 109 cases across 36 stories.
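The override itself is only a few lines. A sketch, assuming per-character lists of base characters and their current readings:

```python
CLOSING_QUOTES = {"”", "’"}

def fix_de_after_quote(chars: list[str], readings: list[str]) -> list[str]:
    # 地 immediately after a closing curly quote is the adverbial particle
    # de, never dì: “哗”地冲出来.
    fixed = list(readings)
    for i, ch in enumerate(chars):
        if ch == "地" and i > 0 and chars[i - 1] in CLOSING_QUOTES:
            fixed[i] = "de"
    return fixed

chars = list("“哗”地冲出来")
fix_de_after_quote(chars, ["“", "huā", "”", "dì", "chōng", "chū", "lái"])
# ['“', 'huā', '”', 'de', 'chōng', 'chū', 'lái']
```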

The Dictionary Paradox

The most counterintuitive discovery in the entire project.

We initially loaded all 10,896 HSK words as a custom dictionary for pkuseg. This seemed obviously correct — tell the segmenter about the words that matter to our learners. The segmentation got worse.

The custom dictionary made pkuseg greedily prefer HSK words, breaking three classes of compounds:

Input | Expected | With HSK Dict | Bug
大学生活 | 大学 + 生活 | 大学生 + 活 | Greedy match on HSK word 大学生
开开心心地 | 开开心心 + 地 | 开 + 开心 + 心地 | Greedy match on HSK word 开心
服务员笑 | 服务员 + 笑 | 服务 + 员笑 | HSK word 服务 matched, suffix orphaned

pkuseg's neural model already knows common words. The custom dictionary didn't teach it anything new — it just overrode the model's contextual judgment with greedy string matching.

The fix: remove the dictionary entirely. Let the neural model segment freely, then recombine tokens into known words in the smart merge pass. This achieved the same vocabulary coverage without the greedy matching bugs.

The lesson: more data made the model worse. The neural network had learned subtle disambiguation rules from its training corpus. Injecting a dictionary replaced those learned rules with a crude longest-match heuristic.

The Polyphonic Problem

pypinyin alone achieves roughly 87% accuracy on polyphonic characters. For a language learning app, that's not good enough. A single polyphone can appear hundreds of times across our stories, and getting it wrong means a learner practices the wrong pronunciation.

g2pW: A BERT Model for Chinese Pronunciation

g2pW is a BERT-based model that predicts character pronunciation from sentence context. We ran it on an RTX 4090 (rented on Vast.ai) across all 100 stories.

Version 1: 40,000 corrections, 3,000 regressions. Reverted.

The BERT model made 40,513 changes across 434,235 character segments. But it systematically broke multi-character words. 头发 tóu fa (hair) became tóu fā — the model stripped the neutral tone from the second syllable. 眼睛 yǎn jing became yǎn jīng. It also predicted rare classical readings: 和 became hàn (a classical variant) instead of the standard hé.

We identified 3,304 regressions and reverted the entire batch the same day.

Version 2: The Hybrid Fix

The insight: g2pW is excellent at single-character disambiguation (is 得 de, dé, or děi here?) but unreliable for multi-character words where dictionary lookups are already correct.

The fix was a priority system:

  1. Manual overrides — known errors in both g2pW and CC-CEDICT (e.g., 大夫 = dài fu, not dà fū)
  2. CC-CEDICT phrase dictionary — correct pronunciation for 105,000 multi-character words
  3. g2pW — BERT predictions for single characters and unknown phrases
  4. pypinyin defaults — kept if nothing else changes them

This meant running g2pW first, then re-applying the phrase dictionaries for multi-character words. The neural model handles what it's good at (context-dependent single characters), and the dictionaries handle what they're good at (known phrase pronunciations).
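In code, the priority order reduces to a first-match-wins lookup per word. A simplified sketch, with manual, cedict, and g2pw_predict standing in for the real data sources:

```python
def resolve_pinyin(word: str, manual: dict, cedict: dict,
                   g2pw_predict, pypinyin_default: list[str]) -> list[str]:
    """Return one pinyin syllable per character of `word`."""
    if word in manual:                      # 1. manual overrides
        return manual[word]
    if len(word) > 1 and word in cedict:    # 2. CC-CEDICT phrase dictionary
        return cedict[word]
    predicted = g2pw_predict(word)          # 3. g2pW, mainly single characters
    if predicted is not None:
        return predicted
    return pypinyin_default                 # 4. keep pypinyin's reading
```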

Result: 16,222 corrections. Zero regressions. Validated against 13 known failure patterns.

Category | Corrections | Example
得 → de (particle) | 1,883 | 跑得快 pǎo de kuài
地 → de (adverbial) | 360 | 慢慢地走 mànmàn de zǒu
一 tone sandhi | 6,373 | 一个 yí gè (not yī gè)
不 tone sandhi | 665 | 不对 bú duì (not bù duì)

The CSS Problem Nobody Warns You About

After solving segmentation and pronunciation, we had one more problem: displaying pinyin above characters using HTML <ruby> tags.

Bug 1: Invisible character spacing

Each character gets its own <ruby> tag for per-character pinyin annotation. The browser sizes each ruby box to whichever is wider: the base character or the pinyin above it. Pinyin like chuāng is physically wider than the character 窗 it annotates. This forces the ruby box wider, creating visible gaps between characters within the same word.

The first instinct — hide pinyin with opacity: 0 — doesn't work. opacity: 0 keeps the element in layout. The invisible pinyin still widens the ruby box.

The fix required two CSS tricks:

  • Pinyin hidden: display: none on <rt> (removes from layout entirely, zero spacing impact)
  • Pinyin visible: width: 0; overflow: visible on <rt> (annotation renders visually via overflow, but contributes zero width to the ruby box)

Bug 2: Punctuation double-spacing

Chinese fullwidth punctuation (!?,。) has built-in half-width blank space in the Noto Serif SC glyph design. Adjacent to quotes, this creates a visible double-space. We fixed this with OpenType font-feature-settings: "chws" 1 — Contextual Half-Width Spacing, which collapses the blank space only when two punctuation marks are adjacent. The more aggressive halt feature (half-width for all punctuation) collapsed spacing between sentences, making text unreadable. The proper CSS solution, text-spacing-trim, isn't supported in Chrome without a flag yet.

The Numbers

After three rewrites, a reverted GPU batch, and 98 test cases:

  • 100 stories across HSK 1–9
  • 434,235 character segments annotated
  • 105,000 phrase entries in the pronunciation dictionary
  • 16,222 polyphonic corrections via hybrid BERT + dictionary
  • Zero API cost, millisecond latency per segmentation call
  • 70 seconds to regenerate all 100 stories on a laptop CPU

The pipeline is deterministic. Same input, same output, every time. No retry scripts, no parsing hacks, no API keys.


Read stories with pinyin at every level: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9


Frequently Asked Questions

How do you add pinyin above Chinese characters?

You need a pinyin annotation pipeline that segments Chinese text into words, looks up each word in a pronunciation dictionary, and handles polyphones (characters with multiple readings depending on context). Libraries like pypinyin and jieba handle this in Python.

What are polyphones and why do they matter for pinyin?

Polyphones are Chinese characters that have different pronunciations depending on the word they appear in. For example, 了 can be 'le' or 'liǎo'. A good pinyin pipeline must use word-level context, not character-level lookup, to select the correct reading.

Which tools convert Chinese text to pinyin?

pypinyin is the most popular Python library for pinyin conversion. For word segmentation, jieba or pkuseg handle most cases well. For production accuracy with polyphones, combining a segmenter with a supplementary phrase pronunciation dictionary (such as CC-CEDICT) produces the best results.