An engineer has implemented a Speculative Reasoning Execution (SRE) module for Microsoft's AutoGen that drastically reduces latency in workflows using the Chain-of-Thought (CoT) technique.
Implementation Details
The traditional approach follows a sequential loop (Think → Wait → Execute Tool → Wait → Speak), which is unsuitable for real-time interactions. SRE, inspired by speculative decoding, intercepts the text stream generated by the LLM via regex to predict tool calls. When a high-confidence tool-call pattern is detected, the tool is executed asynchronously in a background thread, in parallel with the LLM's ongoing reasoning-text generation.
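The core idea can be sketched in a few lines of Python. This is a minimal illustration, not AutoGen's actual API: the tool registry, the `TOOL:` call syntax, and the regex are hypothetical stand-ins for whatever pattern the real module matches.

```python
import re
import concurrent.futures

# Hypothetical tool registry (illustrative; not AutoGen's real interface).
TOOLS = {"search": lambda query: f"results for {query}"}

# Hypothetical high-confidence tool-call pattern, e.g.  TOOL: search("quantum")
TOOL_CALL = re.compile(r'TOOL:\s*(\w+)\("([^"]*)"\)')

def speculative_execute(token_stream):
    """Scan the LLM's token stream; as soon as a tool call is detected,
    launch the tool in a background thread while the stream keeps flowing."""
    buffer = ""
    future = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for token in token_stream:
            buffer += token
            if future is None:
                match = TOOL_CALL.search(buffer)
                if match:
                    name, arg = match.group(1), match.group(2)
                    # Fire the tool asynchronously: reasoning and tool
                    # execution now proceed in parallel.
                    future = pool.submit(TOOLS[name], arg)
        # By the time the stream ends, the result is typically ready.
        result = future.result() if future else None
    return buffer, result

# Simulated token stream from the LLM:
tokens = ['Let me ', 'look that up. ', 'TOOL: search', '("quantum computing")',
          ' ...continued reasoning']
text, result = speculative_execute(tokens)
```

Because the tool runs while the remaining reasoning tokens are still being generated, the wait between "tool call emitted" and "tool result available" largely disappears, which is where the latency win comes from.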
Benchmarks
Tests performed on an NVIDIA A100 showed a reduction in Time-to-Action from 13.4 seconds (sequential) to 1.6 seconds (with SRE), corresponding to an improvement of roughly 88%.
Other Implementations
A distributed training system for Whisper on Ray, called SpeechLab, was also created, achieving 94% scaling efficiency on 4 A100 GPUs. SpeechLab ingests audio as a stream, avoiding out-of-memory (OOM) errors on large datasets.