Qwen3-ASR: Open-Source Speech Recognition

The Qwen3-ASR model family, developed by Qwen, provides automatic speech recognition (ASR) and language identification for 52 languages and dialects. The models come in two sizes, 1.7B and 0.6B parameters, are built on the Qwen3-Omni foundation model, and are trained on a large speech dataset.

Key Features

  • All-in-one: Support for language identification and speech recognition in 30 languages and 22 Chinese dialects, as well as various English accents.
  • Performance and Speed: The Qwen3-ASR-1.7B model offers high recognition quality even in complex acoustic environments. The 0.6B version prioritizes efficiency, reaching a throughput of 2000 times real time (2000 seconds of speech transcribed per second) at a concurrency of 128.
  • Forced Alignment: Qwen3-ForcedAligner-0.6B predicts timestamps for arbitrary units in audio clips of up to 5 minutes, across 11 languages.
  • Comprehensive Inference Toolkit: In addition to the weights and architecture of the models, an inference framework is provided that supports vLLM-based batch inference, asynchronous serving, streaming, and timestamp prediction.
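
The 5-minute limit on forced alignment means that longer recordings have to be split before timestamps can be predicted. The sketch below shows one simple way to do that in plain Python. The function name `split_for_alignment`, the 16 kHz default sample rate, and the fixed-window strategy are assumptions made for illustration and are not part of the Qwen3-ASR toolkit; a production pipeline would more likely split on silence than at fixed offsets.

```python
from typing import Iterator, Sequence, Tuple

MAX_ALIGN_SECONDS = 5 * 60  # limit stated for Qwen3-ForcedAligner-0.6B


def split_for_alignment(
    samples: Sequence[float],
    sample_rate: int = 16_000,  # assumed default, not taken from the source
    max_seconds: int = MAX_ALIGN_SECONDS,
) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (offset_seconds, chunk) pairs, each chunk at most max_seconds long."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    window = sample_rate * max_seconds  # samples per chunk
    for start in range(0, len(samples), window):
        yield start / sample_rate, samples[start:start + window]


if __name__ == "__main__":
    # Toy input: 12 minutes of silence at a low 100 Hz rate to keep it small.
    rate = 100
    audio = [0.0] * (12 * 60 * rate)
    for offset, chunk in split_for_alignment(audio, sample_rate=rate):
        print(f"offset={offset:.0f}s length={len(chunk) / rate:.0f}s")
```

Each chunk's offset can be added back to the timestamps predicted for that chunk to recover positions in the original recording.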

For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.
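
One such trade-off is how much compute time a given audio backlog needs. As a rough, back-of-the-envelope sketch, the function below converts a throughput multiple into wall-clock time. It assumes the 2000 figure quoted for the 0.6B model is a real-time multiple (seconds of audio transcribed per second of wall-clock time) sustained at a concurrency of 128. `estimate_wall_clock_seconds` is a hypothetical helper, and actual throughput will depend on hardware, audio length, and batching.

```python
def estimate_wall_clock_seconds(audio_hours: float, realtime_multiple: float) -> float:
    """Wall-clock seconds needed to transcribe audio_hours of speech."""
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_hours * 3600.0 / realtime_multiple


if __name__ == "__main__":
    # Assumed multiple of 2000 for the 0.6B model; measure on your own hardware.
    for hours in (1, 100, 10_000):
        seconds = estimate_wall_clock_seconds(hours, realtime_multiple=2000.0)
        print(f"{hours:>6} h of audio -> {seconds:,.1f} s")
```

Replacing the assumed multiple with a figure measured on the target hardware turns this into a first-order capacity estimate.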