Voice embedding in Qwen3
The Qwen3 Text-to-Speech (TTS) model introduces a voice embedding feature that opens new possibilities for voice cloning and manipulation. The system encodes a voice as a 1024-dimensional vector (2048-dimensional for the 1.7-billion-parameter model), and the voice can then be recreated from that vector alone.
Voice manipulation through mathematics
Because each voice is represented as a vector, voices can be modified with ordinary mathematical operations. You can combine different voices, alter gender or pitch, and even build an emotional space. The same representation also enables semantic voice search.
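To make the idea concrete, here is a minimal sketch of what this kind of vector arithmetic might look like. It uses only the Python standard library and tiny toy vectors in place of real 1024-dimensional embeddings. All function names (blend, attribute_direction, shift, nearest_voice) are hypothetical and not part of any Qwen3 API, and the assumption that attributes such as pitch or gender behave as roughly linear directions is an illustration, not something the source confirms. Voice search is sketched here as a cosine-similarity lookup, which is one common way to search an embedding space.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def blend(a: Sequence[float], b: Sequence[float], weight: float = 0.5) -> Vector:
    """Linearly interpolate two voices: weight=0 gives a, weight=1 gives b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty group of embeddings."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def attribute_direction(
    group_a: Sequence[Sequence[float]], group_b: Sequence[Sequence[float]]
) -> Vector:
    """Direction from the mean of group_a to the mean of group_b.

    Assumption: an attribute (e.g. low vs. high pitch) is roughly linear
    in the embedding space, so a mean difference approximates it.
    """
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    return [y - x for x, y in zip(mean_a, mean_b)]


def shift(voice: Sequence[float], direction: Sequence[float], strength: float) -> Vector:
    """Move a voice along an attribute direction by the given strength."""
    return [x + strength * d for x, d in zip(voice, direction)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def nearest_voice(query: Sequence[float], library: Dict[str, Sequence[float]]) -> str:
    """Return the name of the library voice most similar to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


if __name__ == "__main__":
    # Toy 4-dimensional stand-ins for real voice embeddings.
    alice = [0.9, 0.1, 0.0, 0.2]
    bob = [0.1, 0.8, 0.3, 0.0]
    mixed = blend(alice, bob, weight=0.3)  # mostly alice, some bob
    print(nearest_voice(mixed, {"alice": alice, "bob": bob}))
```

In a real pipeline the resulting vector would be passed to the TTS model in place of an embedding extracted from audio. Whether blended or shifted vectors need to be rescaled before synthesis depends on how the model was trained, so that step is left out here.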
Implementation and resources
The voice embedding model is a small encoder with only a few million parameters. It is available as a standalone release, with ONNX models optimized for web and front-end inference. Inference from a voice embedding is supported in specific forks of vLLM.