# dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning
## Introduction
Diffusion language models have been proposed as a way to surpass autoregressive models in text generation. In practice, however, most open-source diffusion models decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. A further problem is that distillation-based accelerators (dParallel, d3LLM) fine-tune masked diffusion language models (MDLMs) on trajectories generated by a base model, which can degrade task performance during fine-tuning and caps the student at the quality of the base model's samples.
Researchers have therefore developed dUltra, a new learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. The framework introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions.
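To make this concrete, below is a minimal PyTorch sketch of such a planner head. The class name, the single linear projection, and the tensor shapes are illustrative assumptions rather than details from the paper; the only property taken from the text is that the head produces independent per-token Bernoulli unmasking probabilities over the still-masked positions.

```python
import torch
import torch.nn as nn

class UnmaskingPlanner(nn.Module):
    """Illustrative planner head (hypothetical name/architecture): maps
    per-token hidden states to independent Bernoulli unmasking probabilities."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, masked: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        # masked: (batch, seq_len) bool, True where the token is still [MASK]
        logits = self.proj(hidden_states).squeeze(-1)  # (batch, seq_len)
        probs = torch.sigmoid(logits)
        # Only positions that are still masked are candidates this step.
        probs = probs * masked.float()
        # Sample which tokens to reveal: one independent Bernoulli per position.
        unmask = torch.bernoulli(probs).bool()
        return unmask, probs
```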
The researchers jointly optimize the base diffusion LLM and the unmasking order planner using reward signals that combine a verifiable reward, a distillation reward, and a penalty on the number of unmasking steps.
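As a rough sketch of how these signals might combine, the snippet below adds a verifiable reward and a weighted distillation reward, subtracts a step penalty, and then computes group-relative advantages. The weights `lam` and `beta` and the linear combination itself are assumptions, not values from the paper; only the group normalization follows the standard GRPO formulation.

```python
import torch

def composite_reward(verifiable: float, distill: float, num_steps: int,
                     lam: float = 1.0, beta: float = 0.01) -> float:
    """Combine the three signals described above. `lam` and `beta` are
    hypothetical weights; fewer unmasking steps (faster decoding) is
    encouraged via the `-beta * num_steps` term."""
    return verifiable + lam * distill - beta * num_steps

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Standard GRPO advantage: normalize each rollout's reward against the
    mean and std of its group of G sampled completions."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
```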
The results show that dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines, a step towards "diffusion supremacy" over autoregressive models.
## Conclusion
dUltra represents an important step towards achieving "diffusion supremacy" over autoregressive models. The framework uses reinforcement learning to optimize parallel decoding, improving both the efficiency and the accuracy of diffusion language models.
## Implications
dUltra has significant implications for artificial intelligence and natural language applications. With parallel decoding optimized by reinforcement learning, diffusion models can reach performance levels beyond those of autoregressive models.
## Future research directions
dUltra is only the beginning of research in this direction. Future work will continue to refine and optimize the learning framework to reach even higher performance levels.