Microsoft has announced Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal model, available through Microsoft Foundry, Hugging Face, and GitHub.

Key Features

Phi-4-reasoning-vision-15B is designed for a wide range of vision-language tasks, including image captioning, visual question answering, reading documents and receipts, homework assistance, and reasoning about changes across sequences of images. The model excels at mathematical and scientific reasoning, as well as at understanding user-interface elements on desktop and mobile screens.

A notable aspect is its value relative to other open-weight models: it offers a favorable trade-off between accuracy and compute cost. Phi-4 delivers competitive performance against slower models that consume roughly ten times more compute time and tokens, and better accuracy than equally fast models, particularly in mathematical and scientific reasoning.

Focus on smaller and faster vision-language models

Many vision-language models (VLMs) tend to grow in terms of the number of parameters and tokens consumed and generated, increasing training and inference costs and limiting their usability for deployment, especially in resource-constrained or interactive settings. Phi-4-reasoning-vision-15B stands as an alternative, focusing on efficiency through careful model design and data curation. The model was trained with much less compute than similar-sized open-weight VLMs, using only 200 billion tokens of multimodal data.

Lessons from training a multimodal model

Training a multimodal reasoning model requires precise choices about model architecture, dataset quality and composition, and the interaction between reasoning and perception tasks. Phi-4's architecture is based on a mid-fusion approach, which uses a pre-trained vision encoder to convert images into visual tokens projected into the embedding space of a pre-trained LLM.
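The mid-fusion idea can be sketched in a few lines of NumPy: visual tokens from a vision encoder are mapped by a learned linear projection into the LLM's embedding space, then concatenated with the text embeddings into a single input sequence. All dimensions below are hypothetical placeholders, not the model's published sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not Phi-4's actual sizes.
VISION_DIM = 1024   # width of the vision encoder's output tokens
LLM_DIM = 5120      # embedding width of the language model
NUM_PATCHES = 64    # visual tokens produced for one image
NUM_TEXT = 16       # text tokens in the prompt

# 1. A pre-trained vision encoder turns the image into a sequence of visual tokens.
visual_tokens = rng.normal(size=(NUM_PATCHES, VISION_DIM))

# 2. A learned linear projection maps them into the LLM's embedding space.
W_proj = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02
projected = visual_tokens @ W_proj          # shape (NUM_PATCHES, LLM_DIM)

# 3. The projected visual tokens are concatenated with the text embeddings
#    and the combined sequence is fed to the language model.
text_embeddings = rng.normal(size=(NUM_TEXT, LLM_DIM))
llm_input = np.concatenate([projected, text_embeddings], axis=0)

print(llm_input.shape)  # (80, 5120)
```

The key design point is that only the projection (and typically the LLM) needs to learn the mapping: the vision encoder's representation is reused as-is, which is part of what keeps training costs down.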

Data quality is another crucial aspect. The final dataset consists primarily of filtered and improved open-source data, internal domain-specific data, and targeted acquired data. Microsoft paid particular attention to data balancing, varying the ratios among mathematics, science, and computer-use data. The team found that increasing the share of mathematical data improved not only the mathematical and scientific benchmarks but also those related to computer use.
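Data balancing of this kind is usually implemented as weighted sampling over source domains. The sketch below shows the general technique with purely illustrative weights; Microsoft's actual mixture ratios are not published in this article.

```python
import random

# Illustrative mixture weights -- NOT the ratios Microsoft actually used.
mixture = {
    "math": 0.5,          # up-weighting math reportedly also helped other domains
    "science": 0.3,
    "computer_use": 0.2,
}

def sample_domains(n, weights, seed=0):
    """Draw the source domain for n training examples according to mixture weights."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = list(weights.values())
    return rng.choices(domains, weights=probs, k=n)

draws = sample_domains(10_000, mixture)
shares = {d: draws.count(d) / len(draws) for d in mixture}
print(shares)  # empirical shares roughly track the mixture weights
```

Sweeping these weights and re-measuring benchmark scores is how a cross-domain effect like "more math helps computer use" would typically be detected.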

Applications

Phi-4-reasoning-vision-15B can be used in various contexts, including describing images, answering questions, interpreting sequences of images, and recognizing objects and text. It performs particularly well on tasks that combine visual input with structured reasoning, such as solving mathematical problems presented visually and supporting reasoning in educational or scientific contexts.
