MODEL_CARDS :: ON_PREMISE_READY :: MAY 2026

LLM Model Cards 2026

Curated reference cards for the top open-weight LLMs deployable on enterprise hardware in May 2026. Each card lists VRAM requirements, license terms, use-case fit, and hardware tier recommendations, verified against real on-premise deployments.

▰ Tier 1 — RTX 5090 / A6000 / H100

▰ Tier 2 — RTX 4090 / 3090 / A5000

▰ Tier 3 — RTX 4080 / A4000 / Mac M3 Pro

▰ Tier 4 — RTX 4070 / M3 / 32GB RAM only
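
The tier legend above can be read as a simple lookup. The sketch below transcribes the four tiers and adds one assumption that the page does not state explicitly: hardware in tier N can host any model labeled tier N or a higher-numbered tier. The names `TIER_HARDWARE`, `tier_of`, and `can_host` are illustrative only.

```python
# Tier legend transcribed from the list above (lower number = larger hardware).
TIER_HARDWARE = {
    1: ["RTX 5090", "A6000", "H100"],
    2: ["RTX 4090", "RTX 3090", "A5000"],
    3: ["RTX 4080", "A4000", "Mac M3 Pro"],
    4: ["RTX 4070", "M3", "32GB RAM only"],
}


def tier_of(hardware: str) -> int:
    """Return the tier number that lists this hardware name."""
    for tier, names in TIER_HARDWARE.items():
        if hardware in names:
            return tier
    raise KeyError(f"{hardware!r} is not in the tier legend")


def can_host(hardware: str, model_tier: int) -> bool:
    """Assumption: tier-N hardware hosts models labeled tier N or higher-numbered."""
    return tier_of(hardware) <= model_tier


if __name__ == "__main__":
    print(can_host("RTX 4090", 2))  # tier 2 hardware, tier 2 model
    print(can_host("RTX 4080", 2))  # tier 3 hardware, tier 2 model
```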

> GENERAL_PURPOSE_MODELS

> LLAMA_3.3_70B :: META

TIER 1 Llama 3.3 Community License

PARAMETERS

70B

VRAM Q4_K_M

~40 GB

VRAM Q8

~75 GB

CONTEXT

128K

TOK/S (Q4, 4090)

~8-12*

LANGUAGES

8 (strongest in English)

STRENGTHS

Best open-weight general-purpose model as of Q1 2026
Strong instruction following and reasoning
Extensive community tooling and quantizations
Llama 3.3 Community License: commercial use permitted, subject to an acceptable use policy and a 700M monthly-active-user threshold

LIMITATIONS

*Needs dual GPU (2×RTX 4090) for full speed
Single 4090: requires heavy CPU offload
Not natively multimodal

RECOMMENDED FOR: Enterprise chat assistants, document Q&A, RAG backend, agentic orchestrator role

HARDWARE FIT: 2×RTX 4090 (native speed) · 1×RTX 5090 32GB (Q4_K_M with partial CPU offload) · 2×A6000 · H100/A100
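
A rough way to see why a single 24 GB card needs CPU offload while two cards do not is to count how many layers fit in VRAM. This is a minimal sketch, assuming all layers are the same size, ignoring embeddings, and holding back a fixed reserve for KV cache and runtime buffers. The 80-layer count is the Llama 3 70B architecture; `gpu_layers_that_fit` and the reserve values are hypothetical.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# ~40 GB at Q4_K_M (from the card above), 80 transformer layers.
single_4090 = gpu_layers_that_fit(40.0, 80, vram_gb=24.0)
dual_4090 = gpu_layers_that_fit(40.0, 80, vram_gb=48.0, reserve_gb=4.0)

if __name__ == "__main__":
    print(f"1x RTX 4090: {single_4090}/80 layers on GPU, rest on CPU")
    print(f"2x RTX 4090: {dual_4090}/80 layers on GPU")
```

Under these assumptions roughly half the layers stay on the CPU with one card, which is consistent with the "heavy CPU offload" limitation listed above.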

> QWEN3.6_27B :: ALIBABA

TIER 2 Apache 2.0

PARAMETERS

27B

VRAM Q4_K_M

~16 GB

VRAM Q8

~29 GB

CONTEXT

128K

TOK/S (Q4, 4090)

~30-40

LANGUAGES

29

STRENGTHS

Thinking mode (extended chain-of-thought)
Fits single RTX 4090 at Q4_K_M
Strong at reasoning, coding, and math
29-language multilingual (Italian included)

LIMITATIONS

Thinking mode can over-generate verbose CoT
Not multimodal (vision)
Requires attention to temperature settings

RECOMMENDED FOR: Single-GPU inference server, agentic worker node, multilingual European deployments

HARDWARE FIT: RTX 4090 (native Q4) · RTX 3090 (Q4_K_S) · A5000 24GB

> REASONING_AND_COT_MODELS

> QWEN3.6_35B_A3.7B_MOE :: ALIBABA

TIER 2 Apache 2.0 MoE

TOTAL / ACTIVE

35B / 3.7B

VRAM Q4_K_M

~22 GB

COMPUTE / TOKEN

~4B dense-equivalent

CONTEXT

128K

TOK/S (Q4, 4090)

~35-45

ARCHITECTURE

MoE 128E

RECOMMENDED FOR: Enterprise reasoning tasks, extended chain-of-thought, agentic planning, single 24GB GPU deployments requiring high quality

HARDWARE FIT: RTX 4090 24GB (fits with Q4_K_M) · RTX 5090 32GB (headroom for longer ctx)
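
The split between total and active parameters is what makes this card unusual: memory scales with the 35B total, while per-token compute scales with the 3.7B active. The sketch below works through that arithmetic, assuming an average of about 4.5 bits per weight for a Q4-class quantization. That figure and the name `moe_footprint` are assumptions, not published numbers for this model.

```python
def moe_footprint(total_b: float, active_b: float,
                  bits_per_weight: float = 4.5) -> dict:
    """Weights held in memory vs. weights read per token, in GB."""
    bytes_per_weight = bits_per_weight / 8
    return {
        "weights_gb": total_b * bytes_per_weight,        # must all be resident
        "read_per_token_gb": active_b * bytes_per_weight,  # touched per token
        "active_fraction": active_b / total_b,
    }


fp = moe_footprint(total_b=35.0, active_b=3.7)

if __name__ == "__main__":
    print(f"resident weights: ~{fp['weights_gb']:.1f} GB")
    print(f"read per token:   ~{fp['read_per_token_gb']:.1f} GB")
    print(f"active fraction:  {fp['active_fraction']:.1%}")
```

Adding a couple of GB for context and KV cache to the resident weights lands close to the ~22 GB listed on the card, while the small per-token read is why throughput resembles that of a much smaller dense model.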

> DEEPSEEK_R1_32B :: DEEPSEEK

TIER 2 MIT

PARAMETERS

32B

VRAM Q4_K_M

~20 GB

VRAM Q8

~34 GB

CONTEXT

128K

TOK/S (Q4, 4090)

~20-28

REASONING

CoT Native

RECOMMENDED FOR: Complex analytical tasks, step-by-step problem decomposition, math/logic verification, regulated industry reasoning

HARDWARE FIT: RTX 4090 (fits Q4_K_M with room for context) · Dual RTX 3090 (Q8, split across both GPUs)

> CODING_AND_STEM

> PHI_4 :: MICROSOFT

TIER 3/4 MIT

PARAMETERS

14B

VRAM Q4_K_M

~8 GB

VRAM Q8

~15 GB

CONTEXT

16K

TOK/S (Q4, 4070)

~50-65

SPECIALTY

STEM/Code

RECOMMENDED FOR: Code generation, math problem solving, STEM Q&A, low-resource deployments (edge servers, laptops)

HARDWARE FIT: RTX 4070 12GB (Q4) · RTX 3080 10GB (Q4_K_S) · Apple M2 Pro 16GB

> MULTILINGUAL_MODELS

> MISTRAL_SMALL_3.1_24B :: MISTRAL

TIER 2/3 Apache 2.0

PARAMETERS

24B

VRAM Q4_K_M

~14 GB

VRAM Q8

~26 GB

CONTEXT

128K

TOK/S (Q4, 4080)

~35-50

LANGUAGES

🇪🇺 EU Focus

RECOMMENDED FOR: European enterprise deployments, multilingual customer support, Italian/French/German/Spanish interfaces, GDPR-aware workloads

HARDWARE FIT: RTX 4080 16GB (Q4) · RTX 4090 24GB (Q4 with margin for ctx; Q8 at ~26 GB needs partial CPU offload) · A5000 24GB (same limits as RTX 4090)

> GEMMA_3_27B :: GOOGLE

TIER 2 Gemma Terms of Use Multimodal

PARAMETERS

27B

VRAM Q4_K_M

~16 GB

VRAM Q8

~29 GB

CONTEXT

128K

TOK/S (Q4, 4090)

~30-38

MODALITIES

Text + Vision

RECOMMENDED FOR: Document analysis with images, multimodal Q&A, scientific literature processing, on-premise vision tasks

HARDWARE FIT: RTX 4090 (Q4, with vision) · RTX 4080 16GB (text Q4 only) · A5000 24GB (Q4; Q8 at ~29 GB needs partial CPU offload)

> QUICK_COMPARISON

MODEL	SIZE	Q4 VRAM	LICENSE	TIER	REASONING	MULTILINGUAL	VISION
Llama 3.3 70B	70B	~40 GB	Llama 3.3 Community	1	★★★★☆	★★★☆☆	✗
Qwen3.6 27B	27B	~16 GB	Apache 2.0	2	★★★★★	★★★★☆	✗
Qwen3.6-35B-A3.7B	35B MoE	~22 GB	Apache 2.0	2	★★★★★	★★★★☆	✗
Mistral Small 3.1 24B	24B	~14 GB	Apache 2.0	2/3	★★★☆☆	★★★★★	✓
Phi-4 14B	14B	~8 GB	MIT	3/4	★★★★☆	★★☆☆☆	✗
Gemma 3 27B	27B	~16 GB	Gemma Terms of Use	2	★★★☆☆	★★★★☆	✓
DeepSeek-R1 32B	32B	~20 GB	MIT	2	★★★★★	★★☆☆☆	✗
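
The table can also be queried programmatically. The sketch below copies the Q4 VRAM and reasoning columns from the rows above and shortlists the models that fit a given VRAM budget, ordered by reasoning rating. `MODELS` and `shortlist` are illustrative names, and the star ratings are simply the table's own values converted to integers.

```python
# (name, Q4 VRAM in GB, reasoning stars) copied from the comparison table.
MODELS = [
    ("Llama 3.3 70B", 40, 4),
    ("Qwen3.6 27B", 16, 5),
    ("Qwen3.6-35B-A3.7B", 22, 5),
    ("Mistral Small 3.1 24B", 14, 3),
    ("Phi-4 14B", 8, 4),
    ("Gemma 3 27B", 16, 3),
    ("DeepSeek-R1 32B", 20, 5),
]


def shortlist(vram_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Models whose Q4 footprint plus headroom fits, best reasoning first."""
    fitting = [m for m in MODELS if m[1] + headroom_gb <= vram_gb]
    # Sort by reasoning rating (descending), then by VRAM (ascending).
    fitting.sort(key=lambda m: (-m[2], m[1]))
    return [name for name, _, _ in fitting]


if __name__ == "__main__":
    for budget in (12, 16, 24):
        print(f"{budget} GB: {shortlist(budget)}")
```

Raising `headroom_gb` is a simple way to leave room for longer contexts than the table's estimates assume.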

⚠ VRAM estimates based on Q4_K_M GGUF quantization + ~2GB overhead for context and KV cache at 4K context. Actual usage varies by context length, batch size, and runtime. Benchmark your specific use case. Data current as of May 2026 — the open-weight LLM landscape changes rapidly.
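
To illustrate how strongly context length moves these numbers, the sketch below computes KV cache size from the standard formula: 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The example uses the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) with an fp16 cache. The function name is hypothetical, and runtimes that quantize the cache will use less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for one sequence (keys + values, all layers)."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 2**30


# Llama 3 70B shape, fp16 cache.
at_4k = kv_cache_gib(80, 8, 128, context_tokens=4096)
at_128k = kv_cache_gib(80, 8, 128, context_tokens=131072)

if __name__ == "__main__":
    print(f"4K context:   {at_4k:.2f} GiB")
    print(f"128K context: {at_128k:.2f} GiB")
```

At 4K tokens the cache fits within the ~2GB overhead assumed above. At the full 128K window it is about as large as the Q4 weights themselves, so long-context deployments need their own measurement.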
