Qwen3-TTS marks a significant step forward in local speech synthesis: it is an open-source model that runs directly on the user's own hardware, offering an alternative to hosted services such as ElevenLabs and OpenAI.

Key Features

  • Speed: End-to-end streaming latency of approximately 97 ms.
  • Natural Voice Control: Accepts natural-language instructions that modulate the tone and emotion of the voice.
  • Voice Cloning: Clones a voice from a reference clip of just 3 seconds.
  • OpenAI Compatibility: Works natively with the OpenAI Python client, requiring only a change to the base URL.
  • Multilingual: Supports 10+ languages, including Italian, English, Japanese, and German.
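
The OpenAI-compatibility point can be illustrated with a minimal sketch. It assumes a Qwen3-TTS server is already running locally and exposes an OpenAI-style speech endpoint; the base URL, port, model identifier, and voice name are placeholders rather than documented values, and the audio format written to disk depends on what the server returns.

```python
from pathlib import Path

# Placeholder values: adjust to whatever the local server actually exposes.
BASE_URL = "http://localhost:8000/v1"  # assumed address of the local server
MODEL = "qwen3-tts"                    # hypothetical model identifier
VOICE = "default"                      # hypothetical voice name


def build_speech_request(text, voice=VOICE, model=MODEL):
    """Return the keyword arguments for an OpenAI-style speech request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"model": model, "voice": voice, "input": text}


def synthesize(text, out_path="speech.mp3", base_url=BASE_URL):
    """Send the request to the local server and save the returned audio."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    # The client requires an API key; a local server may ignore its value.
    client = OpenAI(base_url=base_url, api_key="not-needed")
    request = build_speech_request(text)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(Path(out_path))
    return Path(out_path)


if __name__ == "__main__":
    print(synthesize("Hello from a local text-to-speech server."))
```

Compared with calling the hosted API, the only client-side difference in this sketch is the `base_url` argument; the request fields themselves keep the usual OpenAI shape.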

Technical Details

Qwen3-TTS uses a new dual-track hybrid architecture together with Qwen3-TTS-Tokenizer-12Hz for acoustic compression. Two model sizes are available: 0.6B parameters (fast and light) and 1.7B parameters (high fidelity). It supports FlashAttention 2 to reduce memory usage.
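
To put these numbers in perspective, here is a back-of-the-envelope sketch. It covers the weights only (activations, caches, and the tokenizer's own weights come on top), and it takes the "12Hz" in the tokenizer's name at face value as the frame rate, which is an assumption; the function names are illustrative and not part of any Qwen3-TTS API.

```python
import math

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def codec_frames(seconds, frame_rate_hz=12.0):
    """Number of tokenizer frames covering `seconds` of audio, rounded up.

    The 12 Hz default is an assumption read off the tokenizer's name.
    """
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    for name, params in [("0.6B", 600_000_000), ("1.7B", 1_700_000_000)]:
        print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bfloat16")
    print(f"3 s reference clip: ~{codec_frames(3)} frames")
```

In 16-bit precision this works out to roughly 1.2 GB of weights for the 0.6B model and 3.4 GB for the 1.7B model. Under the frame-rate assumption, a 3-second reference clip corresponds to only a few dozen tokenizer frames.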

The low latency makes real-time voice conversation more practical, opening up new possibilities for integration into local LLM agents.
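
Anyone wiring this into an agent may want to check the latency on their own hardware. The sketch below measures the time until the first audio chunk arrives from any iterable of byte chunks, such as the body of a streaming HTTP response. The `fake_stream` generator is a stand-in that only sleeps and yields zero bytes; it is not a real Qwen3-TTS interface.

```python
import time


def time_to_first_chunk(chunks):
    """Consume an iterable of audio chunks.

    Returns (seconds until the first chunk arrived, total bytes received).
    The latency is None if the iterable yields nothing.
    """
    start = time.perf_counter()
    first_latency = None
    total_bytes = 0
    for chunk in chunks:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_latency, total_bytes


def fake_stream(n_chunks=3, delay_s=0.05, chunk_size=1024):
    """Stand-in for a real streaming response: sleeps, then yields zero bytes."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield bytes(chunk_size)


if __name__ == "__main__":
    latency, size = time_to_first_chunk(fake_stream())
    print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```

Note that this measures when the first chunk reaches the client, not when sound leaves the speakers; playback buffering adds to the figure a listener perceives.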