The best Chinese TTS voice we tested isn't available through any cloud API. It's only in the open-source model weights. So we rented GPUs, fought dependency conflicts, and built a pipeline to narrate 100 Chinese stories — 438 chapters of graded reader content across HSK 1-9.
Total GPU cost: under $10. Here's exactly how to do it.
Why Self-Host at All
Alibaba's Qwen3-TTS offers 49+ voices through their DashScope cloud API. We tested eight of them: Vincent, Moon, Ethan, Serena, Elias, Cherry, Maia, Neil. Elias won — warm, clear, natural pacing for language learning content.
Then we tried Dylan.
Dylan is only available in the open-source model release (Qwen3-TTS-12Hz-1.7B-CustomVoice). It's not in the API catalog. And it was noticeably better than everything else — more natural pauses, better emotional range in dialogue, clearer pronunciation of individual characters.
For a language learning app where learners practice pronunciation by listening, "noticeably better" matters. So we needed to run the model ourselves.
The Setup: Vast.ai + PyTorch NGC
What you need:
- Any NVIDIA GPU with 8GB+ VRAM — the 1.7B model uses ~3.5GB VRAM
- 50GB disk space (model weights + audio output)
- A high single-thread CPU clock (this matters more than GPU choice)
Qwen3-TTS is CPU-bound during inference. The GPU loads the model, but text-to-speech synthesis bottlenecks on single-threaded CPU work. This means your choice of CPU matters far more than whether you rent an RTX 3090 or RTX 4090.
CPU performance benchmarks (same model, same chapters):
| CPU | Clock | Time/chunk | $/hr |
|---|---|---|---|
| Unknown (desktop) | ~5+ GHz | 47s | — |
| Ryzen 9 9950X (Zen 5) | 4.3 GHz | 49s | $0.30 |
| Ryzen 9 7900X (Zen 4) | 4.7 GHz | 52s | $0.28 |
| Ryzen 7 5700X (Zen 3) | 3.4 GHz | 83s | $0.25 |
| EPYC 7D12 (server) | 2.2 GHz | 120s | $0.20 |
| EPYC 7C13 (server) | 2.0 GHz | 289s | $0.18 |
EPYC server CPUs are 3-6x slower despite having 64+ cores. Desktop Zen 4/5 CPUs are the sweet spot.
Finding an instance:
On Vast.ai, filter for 8GB+ VRAM (RTX 3090 at $0.12-0.15/hr or RTX 4090 at $0.25-0.35/hr both work), 50GB+ disk. Sort by CPU clock speed, not GPU model. Pick hosts with >95% reliability score. A $0.13/hr RTX 3090 with a 5 GHz desktop CPU will outperform a $0.35/hr RTX 4090 with an EPYC server CPU.
Known failure mode: Some Vast.ai hosts have broken networking. If SSH times out after provisioning, don't debug — destroy the instance and try another host. You'll waste 2 minutes, not 2 hours.
The Dependency Conflict
This is the step that will waste your afternoon if you don't know about it.
```shell
pip install qwen-tts pydub soundfile
```
This installs the Qwen TTS Python package. It also silently replaces the PyTorch that came with your NGC template. The NGC template ships a carefully matched stack: specific CUDA version, specific torch version, specific flash-attn build. qwen-tts pulls in its own torch version, breaking flash-attn compatibility.
The fix:
```shell
pip install torchvision --force-reinstall --no-deps
apt-get update -qq && apt-get install -y -qq ffmpeg
```
The --no-deps flag is critical. Without it, pip tries to resolve the entire dependency tree again and makes the conflict worse. With it, you repair just the broken piece without touching anything else.
Our script detects the available attention implementation at runtime:
```python
if device == "cuda":
    try:
        import flash_attn
        attn = "flash_attention_2"
    except ImportError:
        attn = "sdpa"  # PyTorch native fallback
```
Flash attention is faster, but SDPA works if your environment is slightly broken. The model produces identical output either way.
The Hidden API Limit
Even if you're using the cloud API, this matters: Qwen3-TTS accepts a maximum of 600 display-width units per request. The documentation says "600 characters." It means display-width.
CJK characters count as 2 units. ASCII characters count as 1. So the real limit is roughly 300 Chinese characters per synthesis call.
We confirmed this by binary search:
| Input | Display-width | Result |
|---|---|---|
| 300 CJK chars | 600 | Success |
| 302 CJK chars | 604 | InvalidParameter: Range of input length should be [0, 600] |
Our safe limit: 280 characters per chunk. The 20-character margin adds only a handful of extra API calls and eliminates edge-case failures when a chunk mixes CJK and ASCII.
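Since the limit is display-width rather than character count, it's worth pre-checking chunks before sending them. A minimal approximation in Python (we count Unicode East Asian Width classes `W` and `F` as 2 units; the API's exact rule may differ at the margins):

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate display-width: wide/fullwidth characters count as 2 units."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

assert display_width("hello") == 5   # ASCII: 1 unit each
assert display_width("你好") == 4     # CJK: 2 units each
assert display_width("HSK三级") == 7  # mixed: 3 + 2*2
```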
Chunking Long Chapters
Most story chapters are 150-500 characters. Anything over 280 needs splitting. The naive approach — split at character 280 — cuts mid-sentence and produces audible seams in the audio where the model loses context.
Our chunking algorithm uses a three-level hierarchy:
- Split on paragraph breaks (`\n\n`) — these are natural pauses in narration
- Split on line breaks (`\n`) — often dialogue boundaries
- Split on sentence-ending punctuation (`。!?`) — last resort, but at least preserves sentence integrity
At each level, a greedy packing function combines sub-units back into chunks under 280 characters. The result: every chunk is a complete thought that the TTS model can narrate with proper intonation.
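The hierarchy plus greedy repacking can be sketched as a short recursive splitter. This is a simplified illustration, not our exact script: `chunk_text` is our name for it, and the sketch drops the separator characters themselves when repacking (acceptable here, since pauses are reinserted during audio stitching):

```python
import re

MAX_CHARS = 280
# Split hierarchy: paragraph breaks, then line breaks, then sentence punctuation.
SPLITTERS = ["\n\n", "\n", None]  # None = sentence-level regex split

def _split(text, level):
    sep = SPLITTERS[level]
    if sep is None:
        # Keep the 。!? terminator attached to its sentence.
        return [s for s in re.findall(r"[^。!?]*[。!?]?", text) if s]
    return [p for p in text.split(sep) if p]

def chunk_text(text, level=0):
    if len(text) <= MAX_CHARS or level >= len(SPLITTERS):
        return [text]  # small enough, or an oversized sentence we can't split
    pieces = []
    for part in _split(text, level):
        pieces.extend(chunk_text(part, level + 1))
    # Greedy repacking: merge adjacent pieces while staying under the limit.
    chunks, cur = [], ""
    for p in pieces:
        if len(cur) + len(p) <= MAX_CHARS:
            cur += p
        else:
            if cur:
                chunks.append(cur)
            cur = p
    if cur:
        chunks.append(cur)
    return chunks
```

Note the `level >= len(SPLITTERS)` escape hatch: a single sentence over 280 characters passes through unsplit, which is exactly the oversized-sentence failure mode covered later.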
Audio stitching: Each chunk becomes a separate synthesis call. We concatenate the results with a 300ms silent gap between chunks — enough for a natural paragraph pause without sounding choppy. Final output: one MP3 per chapter.
One gotcha: The API returns WAV data, not MP3, regardless of what the documentation suggests. Decode with pydub.AudioSegment.from_file(buffer, format="wav"), not format="mp3".
The Generation Script
The self-hosted script is 280 lines of Python. It loads the model directly using the qwen_tts package, strips markdown formatting from story text, chunks it, synthesizes each chunk, and concatenates the audio.
Key parameters:
```python
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
The instruct prompt (in Chinese): "Read aloud like an audiobook, at a slightly slower pace, with stable and warm narration, emotion in dialogue, slight pauses between paragraphs, every character pronounced clearly."
This prompt was tuned for audiobook-style narration. Different prompts produce drastically different results — a prompt tuned for one voice may sound robotic on another. Test before committing to a full batch run.
Performance and Cost
Measured on an RTX 4090 with flash attention:
| Condition | Time per chapter | Total for 438 chapters |
|---|---|---|
| Dedicated GPU | ~136 seconds | ~16.5 hours |
| Shared GPU | ~257 seconds | ~31.3 hours |
At $0.13/hr on Vast.ai (RTX 3090 with desktop CPU):
| Condition | Total cost |
|---|---|
| Dedicated | ~$2.15 |
| Shared | ~$4.10 |
At $0.30/hr (RTX 4090):
| Condition | Total cost |
|---|---|
| Dedicated | ~$5.00 |
| Shared | ~$9.40 |
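The totals follow directly from the per-chapter times, if you want to re-run the arithmetic for your own chapter count and hourly rate:

```python
def batch_cost(seconds_per_chapter, chapters, dollars_per_hour):
    """Wall-clock hours and rental cost for a sequential batch run."""
    hours = seconds_per_chapter * chapters / 3600
    return hours, hours * dollars_per_hour

hours, cost = batch_cost(136, 438, 0.13)  # dedicated RTX 3090 at $0.13/hr
print(f"{hours:.1f} h, ${cost:.2f}")      # 16.5 h, $2.15
```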
We ran both TTS generation and pinyin correction on the same rented GPU, amortizing the rental across two compute-intensive tasks.
For comparison, the DashScope cloud API charges per character. At roughly $0.01 per 1,000 characters, our 438 chapters (~150,000 characters total) would cost about $1.50. The cloud API is cheaper — but it doesn't have Dylan.
Production Workflow
We process one HSK level at a time, spot-checking audio quality before moving to the next.
1. Bundle the story text:
Package all story markdown files and the generation script into a tarball. Always rebuild the bundle before generation — story text changes between sessions (typo fixes, surname corrections), and stale text produces audio that doesn't match the published version.
2. Upload and generate:
SCP the bundle to the GPU server, install dependencies, run with nohup. The nohup is not optional — Vast.ai SSH connections drop after idle periods, and a dropped connection kills the process. Log output to a file and monitor with grep.
3. Download and sync:
Pull the generated MP3 files back to local, then sync to Cloudflare R2 (our audio CDN). Re-import on the production server with --force to pick up the new voice directory.
4. Verify in production:
Open a story, switch to the Dylan voice, play a few chapters. Check that audio matches the current story text and that chunk boundaries aren't audible.
Failure Modes
Things that will break and how to handle them:
SSH drops mid-generation. Use nohup and --skip-existing. The script checks for existing audio files and skips chapters that already have output. After reconnecting, just restart the script — it picks up where it left off.
Single oversized sentence. If any sentence exceeds 280 characters without a 。!? breakpoint, it passes to the API as a single chunk and may be rejected. This is rare in natural Chinese text but happens in run-on dialogue. Fix: edit the story to add punctuation.
Stale model cache. If you run the script twice with different model parameters, HuggingFace may serve cached weights. Clear ~/.cache/huggingface/ if you're switching between the 0.6B and 1.7B model variants.
Voice-specific prompt tuning. The same instruct prompt produces different results on different voices. A prompt that makes Dylan sound warm and natural might make another voice sound flat. Always test a few chapters with a new voice before committing to a full batch.
What We'd Do Differently
Use a larger SSD. 30GB is tight when generating hundreds of MP3 files. 50GB gives comfortable headroom and avoids the anxiety of watching disk space during a 16-hour batch job.
Process multiple levels in parallel. Our current workflow is sequential (one HSK level at a time). Two instances at $0.30/hr each would halve wall-clock time for $6 more. Whether that's worth it depends on how fast you need the audio.
Build a health check. The script logs progress but doesn't alert on failures. A simple webhook ping after each level completes would save checking logs every few hours.
Listen to stories with audio narration at every level: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9
Related guides: