How We Built It

Running Chinese Text-to-Speech on a $0.13/hr GPU

The best Chinese TTS voice isn't an API call. We self-hosted Qwen3-TTS on rented GPUs to narrate 100+ stories — setup, chunking, and failure modes.

Anthony · March 16, 2026 · 7 min read

The best Chinese TTS voice we tested isn't available through any cloud API. It's only in the open-source model weights. So we rented GPUs, fought dependency conflicts, and built a pipeline to narrate 100 Chinese stories — 438 chapters of graded reader content across HSK 1-9.

Total GPU cost: under $10. Here's exactly how to do it.

Why Self-Host at All

Alibaba's Qwen3-TTS offers 49+ voices through their DashScope cloud API. We tested eight of them: Vincent, Moon, Ethan, Serena, Elias, Cherry, Maia, Neil. Elias won — warm, clear, natural pacing for language learning content.

Then we tried Dylan.

Dylan is only available in the open-source model release (Qwen3-TTS-12Hz-1.7B-CustomVoice). It's not in the API catalog. And it was noticeably better than everything else — more natural pauses, better emotional range in dialogue, clearer pronunciation of individual characters.

For a language learning app where learners practice pronunciation by listening, "noticeably better" matters. So we needed to run the model ourselves.

The Setup: Vast.ai + PyTorch NGC

What you need:

  • Any NVIDIA GPU with 8GB+ VRAM — the 1.7B model uses ~3.5GB VRAM
  • 50GB disk space (model weights + audio output)
  • A high single-thread CPU clock (this matters more than GPU choice)

Qwen3-TTS is CPU-bound during inference. The model weights live on the GPU, but text-to-speech synthesis bottlenecks on single-threaded CPU work. This means the host's CPU matters far more than whether you rent an RTX 3090 or an RTX 4090.

CPU performance benchmarks (same model, same chapters):

CPU                      Clock     Time/chunk   $/hr
Unknown (desktop)        ~5+ GHz   47s          n/a
Ryzen 9 9950X (Zen 5)    4.3 GHz   49s          $0.30
Ryzen 9 7900X (Zen 4)    4.7 GHz   52s          $0.28
Ryzen 7 5700X (Zen 3)    3.4 GHz   83s          $0.25
EPYC 7D12 (server)       2.2 GHz   120s         $0.20
EPYC 7C13 (server)       2.0 GHz   289s         $0.18

EPYC server CPUs are 3-6x slower despite having 64+ cores. Desktop Zen 4/5 CPUs are the sweet spot.

Finding an instance:

On Vast.ai, filter for 8GB+ VRAM (RTX 3090 at $0.12-0.15/hr or RTX 4090 at $0.25-0.35/hr both work), 50GB+ disk. Sort by CPU clock speed, not GPU model. Pick hosts with >95% reliability score. A $0.13/hr RTX 3090 with a 5 GHz desktop CPU will outperform a $0.35/hr RTX 4090 with an EPYC server CPU.

Known failure mode: Some Vast.ai hosts have broken networking. If SSH times out after provisioning, don't debug — destroy the instance and try another host. You'll waste 2 minutes, not 2 hours.

The Dependency Conflict

This is the step that will waste your afternoon if you don't know about it.

pip install qwen-tts pydub soundfile

This installs the Qwen TTS Python package. It also silently replaces the PyTorch that came with your NGC template. The NGC template ships a carefully matched stack: specific CUDA version, specific torch version, specific flash-attn build. qwen-tts pulls in its own torch version, breaking flash-attn compatibility.

The fix:

pip install torchvision --force-reinstall --no-deps
apt-get update -qq && apt-get install -y -qq ffmpeg

The --no-deps flag is critical. Without it, pip tries to resolve the entire dependency tree again and makes the conflict worse. With it, you repair just the broken piece without touching anything else.
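A quick sanity check after the repair, before committing to a long batch run (a minimal sketch; prints whether the flash-attn build survived):

import torch

print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn OK:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing or broken; the script will fall back to sdpa")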

Our script detects the available attention implementation at runtime:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    try:
        import flash_attn  # present only if the matched NGC build survived
        attn = "flash_attention_2"
    except ImportError:
        attn = "sdpa"  # PyTorch native fallback
else:
    attn = "sdpa"

Flash attention is faster, but SDPA works if your environment is slightly broken. The model produces identical output either way.

The Hidden API Limit

Even if you're using the cloud API, this matters: Qwen3-TTS accepts a maximum of 600 display-width units per request. The documentation says "600 characters." It means display-width.

CJK characters count as 2 units. ASCII characters count as 1. So the real limit is roughly 300 Chinese characters per synthesis call.

We confirmed this by binary search:

Input           Display-width   Result
300 CJK chars   600             Success
302 CJK chars   604             InvalidParameter: Range of input length should be [0, 600]

Our safe limit: 280 characters per chunk. The 20-character margin costs negligible extra API calls but eliminates edge-case failures when a chunk contains mixed CJK and ASCII.
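If you want to verify a chunk fits before sending it, display width can be computed from Unicode East Asian Width classes. This counting rule is our inference from the observed behavior, not documented API semantics:

import unicodedata

def display_width(text: str) -> int:
    """Wide/fullwidth characters (CJK, full-width punctuation) count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

print(display_width("你好"))        # 4: two CJK characters
print(display_width("hello 你好"))  # 10: six ASCII-width units + four CJK units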

Chunking Long Chapters

Most story chapters are 150-500 characters. Anything over 280 needs splitting. The naive approach — split at character 280 — cuts mid-sentence and produces audible seams in the audio where the model loses context.

Our chunking algorithm uses a three-level hierarchy:

  1. Split on paragraph breaks (\n\n) — these are natural pauses in narration
  2. Split on line breaks (\n) — often dialogue boundaries
  3. Split on sentence-ending punctuation (。!?) — last resort, but at least preserves sentence integrity

At each level, a greedy packing function combines sub-units back into chunks under 280 characters. The result: every chunk is a complete thought that the TTS model can narrate with proper intonation.
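Here is a minimal sketch of that hierarchy. The helper names and exact regexes are illustrative, not lifted from the production script; the lookbehind splits keep each separator attached to the preceding unit, so no text is lost:

import re

MAX_CHARS = 280  # safe margin under the 600 display-width cap

# One split pattern per level; lookbehinds keep separators in place.
# Sentence level includes full-width and ASCII punctuation variants.
LEVELS = [r"(?<=\n\n)", r"(?<=\n)", r"(?<=[。!?!?])"]

def pack(units, limit=MAX_CHARS):
    """Greedily combine consecutive units into chunks under the limit."""
    chunks, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > limit:
            chunks.append(current)
            current = unit
        else:
            current += unit
    if current:
        chunks.append(current)
    return chunks

def chunk_text(text, limit=MAX_CHARS):
    """Split on paragraphs, then lines, then sentences; pack greedily."""
    if len(text) <= limit:
        return [text]
    for pattern in LEVELS:
        units = [u for u in re.split(pattern, text) if u]
        if len(units) > 1:
            out = []
            for piece in pack(units, limit):
                # A piece is only oversized if a single unit was too long;
                # recurse so the next level gets a chance to split it.
                out.extend(chunk_text(piece, limit) if len(piece) > limit else [piece])
            return out
    return [text]  # one oversized sentence with no breakpoints (see Failure Modes)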

Audio stitching: Each chunk becomes a separate synthesis call. We concatenate the results with a 300ms silent gap between chunks — enough for a natural paragraph pause without sounding choppy. Final output: one MP3 per chapter.

One gotcha: The API returns WAV files, not MP3, regardless of what the documentation suggests. Decode with pydub.AudioSegment.from_file(buffer, format="wav"), not format="mp3".
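A sketch of the stitching step with pydub (function names are ours; assumes each synthesis call returned raw WAV bytes):

from io import BytesIO
from pydub import AudioSegment

PAUSE = AudioSegment.silent(duration=300)  # 300 ms paragraph gap

def stitch_chapter(wav_chunks: list[bytes], out_path: str) -> None:
    """Concatenate per-chunk WAV bytes into one MP3 per chapter."""
    combined = AudioSegment.empty()
    for i, raw in enumerate(wav_chunks):
        # Decode as WAV; the responses are WAV regardless of the docs.
        segment = AudioSegment.from_file(BytesIO(raw), format="wav")
        if i > 0:
            combined += PAUSE
        combined += segment
    combined.export(out_path, format="mp3")  # MP3 export requires ffmpeg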

The Generation Script

The self-hosted script is 280 lines of Python. It loads the model directly using the qwen_tts package, strips markdown formatting from story text, chunks it, synthesizes each chunk, and concatenates the audio.

Key parameters:

import torch
from transformers import AutoModel  # import assumed; adjust if qwen_tts exports its own loader

model = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    torch_dtype=torch.bfloat16,               # halves memory vs float32
    attn_implementation="flash_attention_2",  # or "sdpa" via the fallback above
    device_map="auto",                        # place weights on the GPU
)

The instruct prompt (in Chinese): "Read aloud like an audiobook, at a slightly slower pace, with stable and warm narration, emotion in dialogue, slight pauses between paragraphs, every character pronounced clearly."

This prompt was tuned for audiobook-style narration. Different prompts produce drastically different results — a prompt tuned for one voice may sound robotic on another. Test before committing to a full batch run.
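Putting the pieces together, a per-chapter loop might look like the sketch below. The generate_speech() call and its keyword arguments are placeholder names for illustration, not the confirmed qwen_tts API; substitute the package's real synthesis call:

INSTRUCT_PROMPT = "..."  # the Chinese audiobook-style prompt quoted above

def synthesize_chapter(model, chapter_text: str) -> list[bytes]:
    """Chunk a chapter, synthesize each piece, return raw WAV bytes."""
    wav_chunks = []
    for text_chunk in chunk_text(chapter_text):  # chunker from earlier
        wav = model.generate_speech(  # placeholder: not the real method name
            text=text_chunk,
            voice="Dylan",
            instruct=INSTRUCT_PROMPT,
        )
        wav_chunks.append(wav)
    return wav_chunks  # feed these to stitch_chapter() above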

Performance and Cost

Measured on an RTX 4090 with flash attention:

Condition       Time per chapter   Total for 438 chapters
Dedicated GPU   ~136 seconds       ~16.5 hours
Shared GPU      ~257 seconds       ~31.3 hours

At $0.13/hr on Vast.ai (RTX 3090 with desktop CPU):

Condition   Total cost
Dedicated   ~$2.15
Shared      ~$4.10

At $0.30/hr (RTX 4090):

Condition   Total cost
Dedicated   ~$5.00
Shared      ~$9.40

We ran both TTS generation and pinyin correction on the same rented GPU, amortizing the rental across two compute-intensive tasks.

For comparison, the DashScope cloud API charges per character. At roughly $0.01 per 1,000 characters, our 438 chapters (~150,000 characters total) would cost about $1.50. The cloud API is cheaper — but it doesn't have Dylan.

Production Workflow

We process one HSK level at a time, spot-checking audio quality before moving to the next.

1. Bundle the story text:

Package all story markdown files and the generation script into a tarball. Always rebuild the bundle before generation — story text changes between sessions (typo fixes, surname corrections), and stale text produces audio that doesn't match the published version.

2. Upload and generate:

SCP the bundle to the GPU server, install dependencies, run with nohup. The nohup is not optional — Vast.ai SSH connections drop after idle periods, and a dropped connection kills the process. Log output to a file and monitor with grep.

3. Download and sync:

Pull the generated MP3 files back to local, then sync to Cloudflare R2 (our audio CDN). Re-import on the production server with --force to pick up the new voice directory.

4. Verify in production:

Open a story, switch to the Dylan voice, play a few chapters. Check that audio matches the current story text and that chunk boundaries aren't audible.

Failure Modes

Things that will break and how to handle them:

SSH drops mid-generation. Use nohup and --skip-existing. The script checks for existing audio files and skips chapters that already have output. After reconnecting, just restart the script — it picks up where it left off.
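A minimal sketch of that resume check, assuming one MP3 per chapter (paths are illustrative):

from pathlib import Path

OUTPUT_DIR = Path("audio_out")  # illustrative output layout

def pending_chapters(chapter_ids, skip_existing=True):
    """Yield only the chapters that still need audio generated."""
    for chapter_id in chapter_ids:
        out_path = OUTPUT_DIR / f"{chapter_id}.mp3"
        if skip_existing and out_path.exists():
            continue  # finished before the connection dropped
        yield chapter_id, out_path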

Single oversized sentence. If any sentence exceeds 280 characters without a 。!? breakpoint, it passes to the API as a single chunk and may be rejected. This is rare in natural Chinese text but happens in run-on dialogue. Fix: edit the story to add punctuation.

Stale model cache. If you run the script twice with different model parameters, HuggingFace may serve cached weights. Clear ~/.cache/huggingface/ if you're switching between the 0.6B and 1.7B model variants.

Voice-specific prompt tuning. The same instruct prompt produces different results on different voices. A prompt that makes Dylan sound warm and natural might make another voice sound flat. Always test a few chapters with a new voice before committing to a full batch.

What We'd Do Differently

Use a larger SSD. 30GB is tight when generating hundreds of MP3 files. 50GB gives comfortable headroom and avoids the anxiety of watching disk space during a 16-hour batch job.

Process multiple levels in parallel. Our current workflow is sequential (one HSK level at a time). Two instances at $0.30/hr each would halve wall-clock time for $6 more. Whether that's worth it depends on how fast you need the audio.

Build a health check. The script logs progress but doesn't alert on failures. A simple webhook ping after each level completes would save checking logs every few hours.


Listen to stories with audio narration at every level: HSK 1 · HSK 2 · HSK 3 · HSK 4 · HSK 5 · HSK 6 · HSK 7 · HSK 8 · HSK 9


Frequently Asked Questions

Can you self-host Chinese text-to-speech?

Yes. Open-source models like Qwen3-TTS, CosyVoice, and ChatTTS provide Chinese speech synthesis that you can run on your own hardware. Self-hosting gives you full control over latency, cost, and data privacy compared to cloud APIs.

Which TTS models produce the most natural Chinese?

In our testing, the open-source Qwen3-TTS release (the Dylan voice in particular) produced the most natural-sounding Mandarin; CosyVoice and ChatTTS are also strong open-source choices. Cloud options like Azure Neural TTS and Google Cloud TTS perform well too. Quality depends on the specific voice and whether the model handles tones and sentence intonation correctly.

What hardware do you need for self-hosted Chinese TTS?

Most modern Chinese TTS models run on a GPU with at least 4 GB VRAM for real-time synthesis. CPU-only inference is possible but slower. For batch processing (pre-generating audio for a content library), even modest hardware works if speed is not critical.