Multimodal LLMs extend the standard text-in/text-out paradigm to accept images, audio, and video alongside text. A vision encoder (e.g., CLIP, SigLIP) converts images into a sequence of embeddings ("visual tokens") that the LLM processes alongside ordinary text tokens.
## Architecture
The most common architecture is a simple stack: Image Encoder → Projection Layer → LLM. The projection layer (often a single linear layer or a small MLP) maps vision-encoder embeddings into the LLM's token embedding space; the LLM then processes the resulting mixed sequence of visual and text tokens. Audio-capable models substitute a Whisper-style speech encoder for the vision encoder. A minimal sketch of the projection step follows.
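To make the data flow concrete, here is a minimal PyTorch sketch of the projection step. The dimensions (`vision_dim=1024`, `llm_dim=4096`, 576 patches) and the two-layer MLP are illustrative assumptions in the style of LLaVA-1.5's projector, not the exact configuration of any model in the table below.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM's embedding space.

    A two-layer MLP with GELU, as used by LLaVA-1.5-style models. Dimensions
    are illustrative: an encoder emitting 1024-d patch embeddings, projected
    to a 4096-d LLM hidden size.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)

# A 336x336 image split into 14x14 patches yields 24*24 = 576 patch embeddings.
image_tokens = VisionProjector()(torch.randn(1, 576, 1024))

# Prepend the visual tokens to the embedded text tokens; the LLM then attends
# over the mixed sequence exactly as it would over text alone.
text_embeds = torch.randn(1, 32, 4096)  # stand-in for embedded text tokens
mixed_sequence = torch.cat([image_tokens, text_embeds], dim=1)
print(mixed_sequence.shape)  # torch.Size([1, 608, 4096])
```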
## Notable Open Multimodal Models (2025–2026)
| Model | Modalities | Base LLM | VRAM |
|---|---|---|---|
| LLaVA-1.6 | Image | Vicuna / Mistral | ~6 GB (Q4) |
| Phi-4 Multimodal | Image, Audio | Phi-4-mini 3.8B | ~8 GB |
| Gemma 3 12B | Image | Gemma 3 | ~8 GB (Q4) |
| Qwen2.5-VL 72B | Image, Video | Qwen2.5 72B | ~42 GB (Q4) |
| Llama 4 Scout | Image (native) | Llama 4 | ~60 GB (Q4) |
| InternVL2.5 8B | Image, Video | InternLM | ~5 GB (Q4) |
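The VRAM column is roughly predictable from parameter count: Q4 GGUF-style quantization averages about 4.5 bits per weight once quantization scales and the layers kept at higher precision are included, and the vision encoder plus KV cache add a few GB on top. A back-of-the-envelope estimator (the 4.5 bits/weight figure is an assumption about typical Q4 formats):

```python
def q4_weight_memory_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Weight memory for a quantized model. Q4 GGUF formats average ~4.5
    bits/weight in practice; the vision encoder, activations, and KV cache
    add a few GB on top of this. Illustrative rule of thumb only."""
    return params_billions * bits_per_weight / 8

print(f"{q4_weight_memory_gb(72):.1f} GB")  # ~40.5 GB, in line with the ~42 GB above
```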
## Why It Matters for On-Premise Deployment
Document analysis, medical imaging, manufacturing quality control, and satellite imagery analysis all benefit from vision-language models run locally. Cloud vision APIs (GPT-4o, Gemini) send your images to external servers, which is a non-starter for medical or classified documents. Models like LLaVA, Phi-4 Multimodal, and Gemma 3 run on consumer hardware via Ollama (`ollama run llava`); a minimal API call is sketched below.
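For example, once the model has been pulled, the official `ollama` Python client can send an image to it entirely on-machine; the file path and prompt here are placeholders:

```python
import ollama  # pip install ollama; assumes the Ollama server is running locally

# Ask a local multimodal model to describe an image. The image never leaves
# the machine: Ollama serves the model at localhost:11434.
response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "Describe the defects visible in this part.",
            "images": ["./inspection_photo.jpg"],  # placeholder path
        }
    ],
)
print(response["message"]["content"])
```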