Microsoft has released Phi-4-Reasoning-Vision-15B, a compact multimodal model designed for reasoning and vision understanding.

Architecture and Operation

Phi-4-Reasoning-Vision-15B combines the Phi-4-Reasoning language model with the SigLIP-2 vision encoder in a mid-fusion architecture. The vision encoder converts images into visual tokens, which are projected into the language model's embedding space. This design leverages the strengths of both pre-trained components while keeping training and inference costs low.
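The projection step can be pictured as a linear map from the vision encoder's token dimension to the language model's embedding dimension. The sketch below is purely illustrative: the dimensions, the weight matrix, and the plain-Python matrix multiply are assumptions for clarity, not the model's actual configuration.

```python
# Illustrative sketch of mid-fusion: visual tokens from a vision encoder
# are linearly projected into the language model's embedding space.
# All dimensions here are toy values, not the real model's.

def project_visual_tokens(visual_tokens, weight):
    """Project each visual token (length d_vis) to the LM embedding
    dimension (length d_lm) via a d_lm x d_vis weight matrix."""
    projected = []
    for tok in visual_tokens:
        projected.append([sum(w * x for w, x in zip(row, tok)) for row in weight])
    return projected

# Toy example: 2 visual tokens of dim 3 projected to an LM dim of 4.
visual_tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4 x 3 projection
lm_tokens = project_visual_tokens(visual_tokens, weight)
# Each projected token now has the language model's embedding width
# and can be interleaved with text-token embeddings.
```

In a real implementation this projection is a learned layer (often a small MLP) trained jointly with the rest of the model; here it is shown as a fixed matrix only to make the shape transformation concrete.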

The model employs a dynamic-resolution vision encoder with a budget of up to 3,600 visual tokens per image, enabling the high-resolution understanding needed for tasks such as GUI element localization and detailed document analysis. Bidirectional (intra-image) attention among visual tokens improves spatial reasoning while avoiding overfitting risks.
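To make the 3,600-token budget concrete, the sketch below estimates how many visual tokens an image would produce under a patch-based encoder and downscales it when the count exceeds the budget. The patch size of 14 and the downscaling rule are assumptions for illustration; the article does not specify the model's actual tokenization scheme.

```python
import math

def visual_token_count(width, height, patch=14, budget=3600):
    """Estimate visual tokens for an image tiled into patch x patch
    squares, downscaling if the count would exceed the budget.
    Patch size and the scaling rule are illustrative assumptions."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens <= budget:
        return tokens
    # Shrink both sides proportionally so the token count fits the budget.
    scale = math.sqrt(budget / tokens)
    w, h = int(width * scale), int(height * scale)
    return math.ceil(w / patch) * math.ceil(h / patch)

print(visual_token_count(224, 224))    # small image: well under budget
print(visual_token_count(1920, 1080))  # full-HD screenshot: capped at the budget
```

The point of a large budget is visible here: a full-HD screenshot fits near the cap with only mild downscaling, which is why fine-grained tasks like GUI element localization benefit from it.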

Training and Data

Phi-4-Reasoning-Vision-15B is trained via Supervised Fine-Tuning (SFT) on a mix of reasoning and non-reasoning data. The model operates as a single system that can invoke chain-of-thought reasoning (emitted in <think>...</think> blocks) for tasks such as mathematical and scientific problem solving, or fall back to direct inference (marked with <nothink>) for perception-focused tasks such as captioning, object detection, and localization.
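A consumer of the model's output therefore needs to handle both response shapes. The helper below separates the chain-of-thought from the final answer based on the tags described above; the parsing logic and exact tag layout are illustrative assumptions, not an official API.

```python
import re

def extract_answer(response):
    """Split a model response into (chain_of_thought, answer).
    A response with a <think>...</think> block returns both parts;
    a <nothink> response returns (None, answer). The format handled
    here is an assumption based on the tags the article describes."""
    match = re.search(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, response.replace("<nothink>", "").strip()

# Reasoning-mode response: chain of thought precedes the answer.
cot, answer = extract_answer("<think>17 * 23 = 391</think>The answer is 391.")

# Perception-mode response: direct inference, no reasoning trace.
no_cot, caption = extract_answer("<nothink>A cat sitting on a mat.")
```

Keeping both modes in one model means the caller decides per task whether to pay the latency cost of a reasoning trace, rather than routing between two separate models.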

The training data primarily consists of filtered and improved open-source vision-language datasets, supplemented by domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with a moderate training compute budget (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models trained with far larger data and compute resources.
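As a back-of-the-envelope check of the training budget cited above, the figure converts to accelerator-hours as follows:

```python
def gpu_hours(n_gpus, days):
    """Total accelerator-hours for a training run."""
    return n_gpus * days * 24

# The 240 B200 GPUs for 4 days cited above:
total = gpu_hours(240, 4)
print(total)  # 23040 GPU-hours
```

Roughly 23,000 GPU-hours is small by frontier-model standards, which is the article's point about the data-centric approach paying off.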

For teams evaluating on-premise deployments, there are trade-offs to weigh. AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these aspects.