Local audio gets serious: audio.cpp delivers music generation and stem separation

When local LLM inference comes up, ggml is usually the first name mentioned, as the engine behind projects like llama.cpp. The team behind audio.cpp is pushing that horizon well beyond text: with the latest release, the framework, written entirely in native C++ on top of ggml, takes on music generation, sound effects, and source separation, reaching 75% of its stated roadmap.

The rollout includes models such as ACE-Step 1.5, HeartMuLa, and Stable Audio 3 in Small and Medium versions for music and effects. The most intriguing detail is not the variety but the length: HeartMuLa, previously limited to short clips, now generates around ten minutes of audio in a single run. For anyone working on soundtracks or soundscapes, a tool like this that runs locally, with no API calls and no data leaving the premises, changes the game.

A quick benchmark: generating 600 seconds of music with ACE-Step Turbo took 60.16 seconds on audio.cpp, a Real-Time Factor (RTF) of 0.100, or roughly 10× faster than real time. The same test in Python clocked 88.52 seconds, an RTF of 0.148. Multiplied over long sessions or server workloads, that margin becomes substantial. But the developers aren't sugar-coating things: HTDemucs, used for stem separation, is still slower than the Python path, and warm runs of Stable Audio show mixed results. “I’m not trying to hide that,” the maintainer writes in the Reddit post, explaining that the priority is to get models into the shared framework first and tighten backend-specific performance afterwards.
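
For readers who want to check the arithmetic, the figures above can be reproduced with a few lines of Python. This is only a sketch: it assumes RTF is defined as wall-clock time divided by the duration of the generated audio (which matches the numbers quoted), and the function and variable names are illustrative, not part of audio.cpp.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: RTF = processing time / duration of audio produced.

    Values below 1.0 mean generation runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Figures quoted in the article: 600 s of music from ACE-Step Turbo.
AUDIO_SECONDS = 600.0
runs = {"audio.cpp": 60.16, "Python": 88.52}  # wall-clock seconds per run

for name, wall in runs.items():
    rtf = real_time_factor(wall, AUDIO_SECONDS)
    # 1 / RTF is how many seconds of audio are produced per second of compute.
    print(f"{name}: RTF {rtf:.3f}, {1 / rtf:.2f}x real time")

# Relative speed of the two paths on this single test.
speedup = runs["Python"] / runs["audio.cpp"]
print(f"audio.cpp vs Python: {speedup:.2f}x faster")
```

On these numbers the C++ path comes out about 1.47 times faster than Python, a ratio drawn from a single benchmark that says nothing about HTDemucs or Stable Audio.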

One detail will please anyone thinking about servers and long-running deployments: a new “mem_saver” mode. It doesn't reduce the absolute peak VRAM during inference, but it lowers resident memory after the task completes, without significantly affecting speed. That’s an infrastructure-conscious touch, not a demo feature, and it signals the project’s maturity.

For AI-RADAR readers, the release carries a significance beyond audio. It shows that the ggml ecosystem is no longer confined to spoken language or chatbots. Having a single native C++ path for text-to-speech, transcription, separation, and now music means you can build self-hosted multimedia pipelines without resorting to Python containers or cloud services. The usual trade-off remains: not everything is faster than the Python reference yet, but the consolidation phase is almost complete. Anyone evaluating on-premise deployment knows that Total Cost of Ownership also hinges on the ability to keep data in-house and scale on hardware you own, and audio.cpp seems to be moving squarely in that direction.
