Multimodal LLM

Architecture

Models that process and reason across multiple input modalities — text, images, audio, and video — in a single unified architecture.

Multimodal LLMs extend the standard text-in/text-out paradigm to accept images, audio, and video as input. A vision encoder (e.g., CLIP, SigLIP) converts each image into a sequence of embedding tokens that the LLM processes alongside the text tokens.
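
As a rough illustration of the encoder step, the sketch below pulls patch embeddings out of a CLIP vision encoder with Hugging Face transformers. The checkpoint name and the image path are placeholder choices, not anything prescribed by a specific model above.

```python
from PIL import Image
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

# Placeholder checkpoint: CLIP ViT-L/14, a common vision backbone for VLMs.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("invoice.png")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# One embedding per image patch (plus a CLS token):
# (batch, 257, 1024) for this checkpoint.
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)
```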

Architecture

The most common architecture is a simple stack: Image Encoder → Projection Layer → LLM. The projection layer (often a single linear layer or a small MLP) maps vision-encoder embeddings into the LLM's token embedding space, and the LLM then processes a mixed sequence of visual and text tokens. Audio-capable models swap in a Whisper-style speech encoder instead.
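
A minimal sketch of that stack, assuming a CLIP-L-sized vision encoder (1024-dim outputs) and a 7B-class LLM (4096-dim embeddings). The two-layer MLP mirrors a LLaVA-1.5-style projector, and the dummy tensors stand in for real encoder and embedding-layer outputs.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder embeddings into the LLM's token embedding space.
    Dimensions are illustrative (CLIP-L -> 7B-class LLM)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):
        return self.net(patch_embeddings)

# Dummy stand-ins: 256 image patches from the vision encoder and
# 32 text tokens already embedded by the LLM's embedding table.
patch_embeddings = torch.randn(1, 256, 1024)
text_embeddings = torch.randn(1, 32, 4096)

projector = VisionProjector()
visual_tokens = projector(patch_embeddings)          # (1, 256, 4096)

# The LLM consumes one mixed sequence of visual and text token embeddings.
mixed_sequence = torch.cat([visual_tokens, text_embeddings], dim=1)
print(mixed_sequence.shape)                          # torch.Size([1, 288, 4096])
```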

Notable Open Multimodal Models (2025–2026)

| Model | Modalities | Base LLM | VRAM |
| --- | --- | --- | --- |
| LLaVA-1.6 | Image | Llama 3 / Mistral | ~6 GB (Q4) |
| Phi-4 Vision | Image, Video | Phi-4 3.8B | ~8 GB |
| Gemma 3 12B | Image | Gemma 3 | ~8 GB (Q4) |
| Qwen2.5-VL 72B | Image, Video | Qwen2.5 72B | ~42 GB (Q4) |
| Llama 4 Scout | Image (native) | Llama 4 | ~60 GB (Q4) |
| InternVL2.5 8B | Image, Video | InternLM | ~5 GB (Q4) |

Why It Matters for On-Premise

Document analysis, medical imaging, manufacturing quality control, and satellite imagery analysis all benefit from vision-language models that run locally. Cloud vision APIs (GPT-4o, Gemini) send your images to external servers, which is a non-starter for medical or classified documents. Models like LLaVA, Phi-4 Vision, and Gemma 3 run on consumer hardware via Ollama (ollama run llava), as in the sketch below.
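
For example, a locally pulled LLaVA model can be queried through Ollama's HTTP API. This is a minimal sketch that assumes the default server at localhost:11434, a model pulled with ollama pull llava, and a placeholder image path.

```python
import base64
import json
import urllib.request

# Ask a locally served LLaVA model to describe an image.
# "scan.png" is a placeholder path; the image never leaves the machine.
with open("scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava",
    "prompt": "Describe this document. List any dates and totals you see.",
    "images": [image_b64],   # Ollama accepts base64-encoded images here
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```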