MODEL_CARDS :: ON_PREMISE_READY :: MAY 2026
LLM Model Cards 2026
Curated reference cards for the top open-weight LLMs deployable on enterprise hardware in May 2026. VRAM requirements, license terms, use case fit, and hardware tier recommendations — verified against real on-premise deployments.
▰ Tier 1 — RTX 5090 / A6000 / H100
▰ Tier 2 — RTX 4090 / 3090 / A5000
▰ Tier 3 — RTX 4080 / A4000 / Mac M3 Pro
▰ Tier 4 — RTX 4070 / M3 / CPU-only with 32GB RAM
> GENERAL_PURPOSE_MODELS
> LLAMA_3.3_70B :: META
TIER 1 · Llama 3.3 Community License
PARAMETERS: 70B
VRAM Q4_K_M: ~40 GB
VRAM Q8: ~75 GB
CONTEXT: 128K
TOK/S (Q4, 4090): ~8-12*
LANGUAGES: 8 (EN++)
STRENGTHS
- Best open-weight general-purpose model as of Q1 2026
- Strong instruction following and reasoning
- Extensive community tooling and quantizations
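- Llama 3.3 Community License permits commercial use (not Apache 2.0: Meta's acceptable use policy and a 700M monthly-active-user threshold apply)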
LIMITATIONS
- *Needs dual GPU (2×RTX 4090) for full speed
- Single 4090: requires heavy CPU offload
- Not natively multimodal
RECOMMENDED FOR: Enterprise chat assistants, document Q&A, RAG backend, agentic orchestrator role
HARDWARE FIT: 2×RTX 4090 (native speed) · 1×RTX 5090 32GB (Q4_K_M with partial CPU offload) · 2×A6000 · H100/A100
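As a rough sanity check on a hardware fit line like the one above, the card's VRAM figure can be compared with the combined memory of the GPUs. The sketch below is only an illustration: the function name `offload_gb` is hypothetical, the per-GPU capacities (24 GB for an RTX 4090, 32 GB for an RTX 5090) are the figures quoted elsewhere on this page, and it ignores the extra per-GPU overhead a real runtime adds when splitting layers.

```python
def offload_gb(model_vram_gb: float, gpu_vram_gb: float, gpu_count: int = 1) -> float:
    """Return how many GB would not fit in GPU memory (0.0 if everything fits)."""
    total_gpu_gb = gpu_vram_gb * gpu_count
    return max(0.0, model_vram_gb - total_gpu_gb)


LLAMA_70B_Q4_GB = 40.0  # "~40 GB" Q4_K_M figure from this card

# (label, VRAM per GPU in GB, number of GPUs); capacities are assumptions
CONFIGS = [
    ("2x RTX 4090", 24.0, 2),
    ("1x RTX 4090", 24.0, 1),
    ("1x RTX 5090", 32.0, 1),
]

for label, vram, count in CONFIGS:
    spill = offload_gb(LLAMA_70B_Q4_GB, vram, count)
    if spill == 0:
        status = "fits in VRAM"
    else:
        status = f"~{spill:.0f} GB would need CPU offload"
    print(f"{label}: {status}")
```

Any nonzero result means part of the model runs from system RAM, which is where the throughput penalty noted under LIMITATIONS comes from.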
> QWEN3.6_27B :: ALIBABA
TIER 2 · Apache 2.0
PARAMETERS: 27B
VRAM Q4_K_M: ~16 GB
VRAM Q8: ~29 GB
CONTEXT: 128K
TOK/S (Q4, 4090): ~30-40
LANGUAGES: 29
STRENGTHS
- Thinking mode (extended chain-of-thought)
- Fits single RTX 4090 at Q4_K_M
- Strong at reasoning, coding, and math
- 29-language multilingual (Italian included)
LIMITATIONS
- Thinking mode can over-generate verbose CoT
- Not multimodal (vision)
- Requires attention to temperature settings
RECOMMENDED FOR: Single-GPU inference server, agentic worker node, multilingual European deployments
HARDWARE FIT: RTX 4090 (native Q4) · RTX 3090 (Q4_K_S) · A5000 24GB
> REASONING_AND_COT_MODELS
> QWEN3.6_35B_A3.7B_MOE :: ALIBABA
TIER 2 · Apache 2.0 · MoE
TOTAL / ACTIVE: 35B / 3.7B
VRAM Q4_K_M: ~22 GB
COMPUTE / TOKEN: ~4B dense
CONTEXT: 128K
TOK/S (Q4, 4090): ~35-45
ARCHITECTURE: MoE 128E
RECOMMENDED FOR: Enterprise reasoning tasks, extended chain-of-thought, agentic planning, single 24GB GPU deployments requiring high quality
HARDWARE FIT: RTX 4090 24GB (fits with Q4_K_M) · RTX 5090 32GB (headroom for longer ctx)
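The TOTAL / ACTIVE split above is why a MoE card lists both a memory figure and a compute figure: every expert has to be resident in memory, but only the active parameters are used for each token. A back-of-the-envelope sketch using the card's own numbers follows; the variable names are illustrative, and real throughput depends on the runtime and memory bandwidth, not just this ratio.

```python
total_params_b = 35.0   # total parameters in billions (from the card)
active_params_b = 3.7   # parameters used per token in billions (from the card)

# Memory footprint scales with the total; per-token compute scales with the active share.
active_fraction = active_params_b / total_params_b

print(f"Weights held in memory: {total_params_b:.0f}B parameters")
print(f"Weights used per token: {active_params_b}B ({active_fraction:.1%} of total)")
```

In other words, the model has to be sized for VRAM like a 35B model, while its per-token compute is closer to that of a ~4B dense model.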
> DEEPSEEK_R1_32B :: DEEPSEEK
TIER 2 · MIT
PARAMETERS: 32B
VRAM Q4_K_M: ~20 GB
VRAM Q8: ~34 GB
CONTEXT: 128K
TOK/S (Q4, 4090): ~20-28
REASONING: CoT Native
RECOMMENDED FOR: Complex analytical tasks, step-by-step problem decomposition, math/logic verification, regulated industry reasoning
HARDWARE FIT: RTX 4090 (fits Q4_K_M with room for context) · Dual RTX 3090 (Q8 split across both GPUs)
> CODING_AND_STEM
> PHI_4 :: MICROSOFT
TIER 3/4 · MIT
PARAMETERS: 14B
VRAM Q4_K_M: ~8 GB
VRAM Q8: ~15 GB
CONTEXT: 16K
TOK/S (Q4, 4070): ~50-65
SPECIALTY: STEM/Code
RECOMMENDED FOR: Code generation, math problem solving, STEM Q&A, low-resource deployments (edge servers, laptops)
HARDWARE FIT: RTX 4070 12GB (Q4) · RTX 3080 10GB (Q4_K_S) · Apple M2 Pro 16GB
> MULTILINGUAL_MODELS
> MISTRAL_SMALL_3.1_24B :: MISTRAL
TIER 2/3 · Apache 2.0
PARAMETERS: 24B
VRAM Q4_K_M: ~14 GB
VRAM Q8: ~26 GB
CONTEXT: 128K
TOK/S (Q4, 4080): ~35-50
LANGUAGES: 🇪🇺 EU Focus
RECOMMENDED FOR: European enterprise deployments, multilingual customer support, Italian/French/German/Spanish interfaces, GDPR-aware workloads
HARDWARE FIT: RTX 4080 16GB (Q4) · RTX 4090 (Q8) · A5000 24GB (Q8, margin for ctx)
> GEMMA_3_27B :: GOOGLE
TIER 2 · Gemma Terms of Use · Multimodal
PARAMETERS: 27B
VRAM Q4_K_M: ~16 GB
VRAM Q8: ~29 GB
CONTEXT: 128K
TOK/S (Q4, 4090): ~30-38
MODALITIES: Text + Vision
RECOMMENDED FOR: Document analysis with images, multimodal Q&A, scientific literature processing, on-premise vision tasks
HARDWARE FIT: RTX 4090 (Q4, with vision) · RTX 4080 16GB (text Q4 only) · A5000 (Q8 text)
> QUICK_COMPARISON
| MODEL | SIZE | Q4 VRAM | LICENSE | TIER | REASONING | MULTILINGUAL | VISION |
|---|---|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | ~40 GB | Llama 3.3 Community License | 1 | ★★★★☆ | ★★★☆☆ | ✗ |
| Qwen3.6 27B | 27B | ~16 GB | Apache 2.0 | 2 | ★★★★★ | ★★★★☆ | ✗ |
| Qwen3.6-35B-A3.7B | 35B MoE | ~22 GB | Apache 2.0 | 2 | ★★★★★ | ★★★★☆ | ✗ |
| Mistral Small 3.1 24B | 24B | ~14 GB | Apache 2.0 | 2/3 | ★★★☆☆ | ★★★★★ | ✓ |
| Phi-4 14B | 14B | ~8 GB | MIT | 3/4 | ★★★★☆ | ★★☆☆☆ | ✗ |
| Gemma 3 27B | 27B | ~16 GB | Gemma Terms of Use | 2 | ★★★☆☆ | ★★★★☆ | ✓ |
| DeepSeek-R1 32B | 32B | ~20 GB | MIT | 2 | ★★★★★ | ★★☆☆☆ | ✗ |
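One way to use the comparison table is to shortlist models by the VRAM available on a given GPU. The sketch below is a minimal illustration, not a sizing tool: the `Card` type, the `shortlist` function, and the `headroom_gb` parameter are hypothetical names, and the only data it uses is the Q4 VRAM column above.

```python
from typing import NamedTuple


class Card(NamedTuple):
    name: str
    q4_vram_gb: float  # "Q4 VRAM" column from the table


CARDS = [
    Card("Llama 3.3 70B", 40),
    Card("Qwen3.6 27B", 16),
    Card("Qwen3.6-35B-A3.7B", 22),
    Card("Mistral Small 3.1 24B", 14),
    Card("Phi-4 14B", 8),
    Card("Gemma 3 27B", 16),
    Card("DeepSeek-R1 32B", 20),
]


def shortlist(gpu_vram_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Names of models whose Q4 figure plus extra headroom fits in gpu_vram_gb."""
    return [c.name for c in CARDS if c.q4_vram_gb + headroom_gb <= gpu_vram_gb]


print(shortlist(24))                  # single 24 GB GPU
print(shortlist(16, headroom_gb=2))   # 16 GB GPU, keeping 2 GB free for longer context
```

The `headroom_gb` argument is there because the table figures leave little margin for longer contexts; how much to reserve is a judgment call for each deployment.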
⚠ VRAM estimates based on Q4_K_M GGUF quantization + ~2GB overhead for context and KV cache at 4K context. Actual usage varies by context length, batch size, and runtime. Benchmark your specific use case. Data current as of May 2026 — the open-weight LLM landscape changes rapidly.
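To illustrate why the note above ties the estimates to 4K context, here is a sketch of the usual KV-cache size calculation (keys and values, per layer, per KV head, per token). The architecture values used for Llama 3.3 70B (80 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions not stated on this page, and the function name `kv_cache_gib` is hypothetical; runtimes that quantize the cache will use less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem
    return total_bytes / 1024**3


# Assumed Llama 3.3 70B shape: 80 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(80, 8, 128, ctx):.2f} GiB")
```

Under these assumptions the cache grows from about 1.25 GiB at 4K tokens to about 40 GiB at the full 128K window, so a model that fits comfortably at short context may not fit at long context on the same hardware.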