## Introduction

The new training framework, called OpenMMReasoner, was developed by researchers from MiroMind AI and Chinese universities.

## Technical Details

OpenMMReasoner uses a two-stage process. The first stage refines a base model with a curated dataset through supervised fine-tuning (SFT). The second stage, guided by reinforcement learning (RL), helps the model reason more effectively on tasks that involve both text and visual data.

## Practical Implications

The experiments show that models trained with OpenMMReasoner surpass other leading vision-language models, often while training on smaller, higher-quality datasets. The framework and all of its resources, including a 7-billion-parameter model, are fully open-sourced, providing a solid base for building applications that require transparency and robustness.

## Conclusion and Future Prospects

According to Kaichen Zhang, one of the authors of the research paper that outlines the new method, OpenMMReasoner offers significant advantages for companies looking beyond closed systems. "A smaller and open-source model has practical benefits: companies can deploy it locally, reduce latency, reduce token costs associated with long thinking chains, maintain full control over their data, and [it is] fine-tunable for downstream-specific tasks," Zhang told VentureBeat.

## Multimodal Reasoning Transparency Challenge

Recent progress in reinforcement learning with verifiable rewards (RLVR) has significantly improved the capabilities of large language models (LLMs). RLVR trains LLMs to generate chain-of-thought tokens (representing human reasoning processes) before generating the final response, which improves the model's ability to solve complex tasks such as mathematics and programming. However, the data recipes and training pipelines behind leading multimodal reasoning models are often opaque, which makes their results hard to reproduce and build on.

## OpenMMReasoner Recipe

OpenMMReasoner addresses this transparency challenge with a complete and scalable training framework built on top of open-source language models. The researchers found that careful curation of high-quality, diverse data was central to the recipe's gains.

## Data Sourcing

The first stage of the recipe is a supervised fine-tuning pipeline built in three steps. It begins with data sourcing, where the team collected approximately 103,000 question-answer pairs from publicly available datasets covering general Q&A and reasoning tasks.

## The Distillation Step

Next, they added a distillation step, using a powerful teacher model (Qwen3-VL-235B-Instruct) to generate new, high-quality reasoning traces for the selected questions; this distilled dataset is then used to train the smaller model (a sketch of the verify-and-keep loop appears at the end of this section).

## Domain Mixing

To increase response diversity, the team generated multiple verified reasoning traces for each question, expanding the dataset to 583,000 samples. Finally, they implemented a domain-mixing step, adding data from domains such as science, mathematics, and puzzles to further generalize the model's reasoning ability, resulting in a final SFT dataset of 874,000 examples.

## Reinforcement Learning Step

The second stage is a reinforcement learning framework that uses a smaller dataset of 74,000 samples from domains such as science, mathematics, and puzzles. The model is trained with a combined reward function that accounts for both final-answer correctness and output-formatting consistency. To improve efficiency, the process includes an overthinking penalty that discourages the model from generating excessively long reasoning chains, a common problem in RL-trained models that leads to higher inference costs and slower responses.
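To make the reward design concrete, here is a minimal Python sketch of a combined reward of this kind. The answer/think tag format, the weights, and the length threshold are illustrative assumptions rather than values taken from the paper.

```python
import re

# Hypothetical weights and length threshold; the paper does not publish exact values here.
CORRECTNESS_WEIGHT = 1.0
FORMAT_WEIGHT = 0.2
LENGTH_PENALTY_WEIGHT = 0.1
MAX_REASONING_TOKENS = 2048  # assumed cap beyond which "overthinking" is penalized


def combined_reward(response: str, reference_answer: str) -> float:
    """Score one rollout on answer correctness, format consistency, and length."""
    # Format check: the response should wrap its reasoning and final answer in tags.
    has_think = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    format_score = 1.0 if (has_think and has_answer) else 0.0

    # Correctness check: compare the extracted final answer with the verified reference.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    correctness = 1.0 if predicted == reference_answer.strip() else 0.0

    # Overthinking penalty: discourage excessively long reasoning chains.
    n_tokens = len(response.split())  # crude whitespace token count, for the sketch only
    overflow = max(0, n_tokens - MAX_REASONING_TOKENS)
    length_penalty = LENGTH_PENALTY_WEIGHT * (overflow / MAX_REASONING_TOKENS)

    return CORRECTNESS_WEIGHT * correctness + FORMAT_WEIGHT * format_score - length_penalty
```

In a real trainer the penalty would be computed on the model's actual token count rather than a whitespace split, but the trade-off it encodes is the same: correct, well-formatted answers score highest, and the reward decays once the reasoning chain grows past its budget.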
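Stepping back to the SFT stage, the distill-and-verify loop can be sketched as follows, assuming the teacher model is served behind an OpenAI-compatible endpoint (for example via vLLM). The endpoint, the answer-tag convention, the sampling temperature, and the text-only prompt (image inputs are omitted for brevity) are all assumptions made for illustration.

```python
import re

from openai import OpenAI

# Assumed local endpoint serving the teacher model; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def extract_answer(trace: str) -> str:
    """Pull the final answer out of a generated trace; the tag format is assumed."""
    match = re.search(r"<answer>(.*?)</answer>", trace, re.DOTALL)
    return match.group(1).strip() if match else ""


def distill_traces(question: str, reference: str, n_samples: int = 4) -> list[str]:
    """Sample several reasoning traces per question and keep only the verified ones."""
    kept = []
    for _ in range(n_samples):
        completion = client.chat.completions.create(
            model="Qwen3-VL-235B-Instruct",
            messages=[{"role": "user", "content": question}],
            temperature=0.8,  # sampling temperature is an assumption
        )
        trace = completion.choices[0].message.content
        # Keep the trace only if its final answer matches the verified ground truth.
        if extract_answer(trace) == reference.strip():
            kept.append(trace)
    return kept
```

Sampling several traces per question and keeping only the verified ones is what lets the dataset grow from roughly 103,000 source questions to the 583,000 samples reported above while preserving answer quality.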

## OpenMMReasoner as a Tool for Companies

According to Zhang, the reinforcement learning step fundamentally changes how trustworthy the model's outputs are. "Modeling approaches typically jump directly to the response, exploring only a small portion of the reasoning space," said Zhang. "In contrast, an approach focused on reason first forces the model to explicitly examine intermediate steps... [to] arrive at responses with greater internal consistency."

## Model Performance

Researchers used OpenMMReasoner to generate data for fine-tuning the Qwen2.5-VL-7B-Instruct open-source vision-language model. The result is a highly capable large multimodal model (LMM) that consistently surpasses leading open-source vision-language models, including Open Vision Reasoner (OVR), on a wide range of multimodal reasoning benchmarks. The first stage of fine-tuning alone produces a strong base model that outperforms other SFT approaches in both efficiency and quality, despite using smaller datasets.

## Emergent Language Reasoning

One of the project's goals was to explore the gradual emergence of language reasoning behaviors during training, indicating a transfer of reasoning ability from multimodal to purely linguistic tasks.
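For teams that want to try the released checkpoint locally, an inference script along the following lines should work. The repository id shown is the base Qwen2.5-VL model and stands in for the actual OpenMMReasoner release (check the project's model page for the real id); the image, prompt, and generation settings are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder id: substitute the released OpenMMReasoner 7B checkpoint.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a multimodal chat prompt: one image plus a reasoning question.
image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What trend does this chart show? Reason step by step."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (the reasoning trace and final answer).
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

Running the 7B model this way keeps data on-premises and avoids per-token API costs, which is the deployment pattern Zhang highlights for companies moving away from closed systems.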

## Future Prospects for the Project

According to Zhang, OpenMMReasoner offers significant potential for creating more robust and adaptable language models suitable for diverse applications.