Microsoft has released Phi-4-Reasoning-Vision-15B, a compact multimodal model designed for reasoning and vision understanding.

Architecture and Operation

Phi-4-Reasoning-Vision-15B combines the Phi-4-Reasoning language model with the SigLIP-2 vision encoder in a mid-fusion architecture. The vision encoder converts images into visual tokens, which are projected into the language model's embedding space. This design leverages the strengths of both pre-trained components while keeping training and inference costs low.
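The projection step can be pictured as a linear map from the vision encoder's token dimension to the language model's embedding dimension. The sketch below is purely illustrative: the dimensions, the weight matrix, and the plain-Python matrix multiply are assumptions for clarity, not the model's actual configuration.

```python
# Illustrative sketch of mid-fusion: visual tokens from a vision encoder
# are linearly projected into the language model's embedding space.
# All dimensions here are toy values, not the real model's.

def project_visual_tokens(visual_tokens, weight):
    """Project each visual token (length d_vis) to the LM embedding
    dimension (length d_lm) via a d_lm x d_vis weight matrix."""
    projected = []
    for tok in visual_tokens:
        projected.append([sum(w * x for w, x in zip(row, tok)) for row in weight])
    return projected

# Toy example: 2 visual tokens of dim 3 projected to an LM dim of 4.
visual_tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4 x 3 projection
lm_tokens = project_visual_tokens(visual_tokens, weight)
# Each projected token now has the language model's embedding width
# and can be interleaved with text-token embeddings.
```

In a real implementation this projection is a learned layer (often a small MLP) trained jointly with the rest of the model; here it is shown as a fixed matrix only to make the shape transformation concrete.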

The model employs a dynamic-resolution vision encoder with a budget of up to 3,600 visual tokens per image, enabling the high-resolution understanding needed for tasks such as GUI element localization and detailed document analysis. Bidirectional (intra-image) attention among visual tokens improves spatial reasoning while avoiding overfitting risks.
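To make the 3,600-token budget concrete, the sketch below estimates how many visual tokens an image would produce under a patch-based encoder and downscales it when the count exceeds the budget. The patch size of 14 and the downscaling rule are assumptions for illustration; the article does not specify the model's actual tokenization scheme.

```python
import math

def visual_token_count(width, height, patch=14, budget=3600):
    """Estimate visual tokens for an image tiled into patch x patch
    squares, downscaling if the count would exceed the budget.
    Patch size and the scaling rule are illustrative assumptions."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens <= budget:
        return tokens
    # Shrink both sides proportionally so the token count fits the budget.
    scale = math.sqrt(budget / tokens)
    w, h = int(width * scale), int(height * scale)
    return math.ceil(w / patch) * math.ceil(h / patch)

print(visual_token_count(224, 224))    # small image: well under budget
print(visual_token_count(1920, 1080))  # full-HD screenshot: capped at the budget
```

The point of a large budget is visible here: a full-HD screenshot fits near the cap with only mild downscaling, which is why fine-grained tasks like GUI element localization benefit from it.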

Training and Data

Phi-4-Reasoning-Vision-15B is trained via Supervised Fine-Tuning (SFT) on a mix of reasoning and non-reasoning data. The model operates as a single system that can invoke chain-of-thought reasoning (emitted in <think>...</think> blocks) for tasks such as mathematical and scientific problem solving, or fall back to direct inference (marked with <nothink>) for perception-focused tasks such as captioning, object detection, and localization.
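A consumer of the model's output therefore needs to handle both response shapes. The helper below separates the chain-of-thought from the final answer based on the tags described above; the parsing logic and exact tag layout are illustrative assumptions, not an official API.

```python
import re

def extract_answer(response):
    """Split a model response into (chain_of_thought, answer).
    A response with a <think>...</think> block returns both parts;
    a <nothink> response returns (None, answer). The format handled
    here is an assumption based on the tags the article describes."""
    match = re.search(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, response.replace("<nothink>", "").strip()

# Reasoning-mode response: chain of thought precedes the answer.
cot, answer = extract_answer("<think>17 * 23 = 391</think>The answer is 391.")

# Perception-mode response: direct inference, no reasoning trace.
no_cot, caption = extract_answer("<nothink>A cat sitting on a mat.")
```

Keeping both modes in one model means the caller decides per task whether to pay the latency cost of a reasoning trace, rather than routing between two separate models.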

The training data primarily consists of filtered and improved open-source vision-language datasets, supplemented by domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with a moderate training compute budget (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models trained with far larger data and compute resources.
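As a back-of-the-envelope check of the training budget cited above, the figure converts to accelerator-hours as follows:

```python
def gpu_hours(n_gpus, days):
    """Total accelerator-hours for a training run."""
    return n_gpus * days * 24

# The 240 B200 GPUs for 4 days cited above:
total = gpu_hours(240, 4)
print(total)  # 23040 GPU-hours
```

Roughly 23,000 GPU-hours is small by frontier-model standards, which is the article's point about the data-centric approach paying off.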

For teams evaluating on-premise deployments, there are trade-offs to weigh. AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these aspects.