The core principle of running Mixture-of-Experts (MoE) models on CPU/RAM is that the CPU never has to read the full set of weights from memory for any single token. Only a fraction of the parameters are "active" per token, and because the arithmetic on that fraction is light, memory throughput, not compute, becomes the primary bottleneck.

Models and Bandwidth

Let's look at two popular models: GLM-4.7-Flash (3B active params) and GPT OSS 120B (5.1B active params). At Q4_K_M quantization, their active memory footprints are:

  • GLM-4.7-Flash: ~1.7 GB
  • GPT OSS 120B: ~2.55 GB

Now, let's compare theoretical dual-channel DDR5 bandwidth with what's realistically achievable. The theoretical peaks:

  • DDR5-4800: 76.8 GB/s
  • DDR5-6000: 96.0 GB/s
  • DDR5-6400: 102.4 GB/s

In practice, the bandwidth you can sustain during LLM inference is closer to 35 GB/s. So:

  • GLM-4.7-Flash: 35 GB/s ÷ 1.7 GB ≈ 20.5 tokens/sec
  • GPT OSS 120B: 35 GB/s ÷ 2.55 GB ≈ 13.7 tokens/sec
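
As a sanity check, here is the same estimate as a tiny shell snippet; the 35 GB/s figure and the two active-weight sizes are the assumptions from above, not measurements of your machine:

  # Rough decode-speed ceiling: effective bandwidth / active weights read per token.
  # The bandwidth and model sizes are this article's assumptions, not measured values.
  bw=35                            # assumed effective memory bandwidth, GB/s
  for active_gb in 1.7 2.55; do    # GLM-4.7-Flash, GPT OSS 120B at Q4_K_M
    echo "scale=1; $bw / $active_gb" | bc    # prints 20.5, then 13.7 tokens/sec
  done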

Hardware Optimization (Intel Core i7-14700F Example)

To hit these numbers, your CPU and BIOS settings must be dialed in:

  1. XMP/EXPO: Enable your XMP (Intel) or EXPO (AMD) profile in BIOS; you can verify the resulting bandwidth with the sketch after this list.
  2. Power Limits: You need the CPU to hold its maximum boost clocks to keep the memory controller saturated. Raise your power limits (PL1/PL2) to 219W, the 14700F's rated maximum turbo power.
  3. Thermal Management: To prevent throttling at 219W, you need high-end cooling. I recommend undervolting to keep temps manageable without losing performance.
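
Rather than trusting the 35 GB/s rule of thumb, it's worth measuring what your machine actually sustains once XMP and power limits are set. A minimal sketch, assuming sysbench is installed (the block size keeps the test out of cache; raise --threads to probe multi-core throughput):

  # Measure sustained read bandwidth from RAM; the reported MiB/sec is your real-world ceiling.
  sysbench memory --memory-block-size=1M --memory-total-size=32G \
                  --memory-oper=read --threads=4 run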

Software and Compilation

It's worth compiling llama.cpp from source with flags optimized for your specific architecture (Raptor Lake) rather than relying on a generic prebuilt binary.
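
A minimal build sketch, assuming a recent llama.cpp checkout that uses the GGML_*-prefixed CMake options (older releases used LLAMA_*-prefixed ones): GGML_NATIVE compiles with -march=native so the host's vector extensions are picked up, and GGML_CUDA is only needed if you plan to offload layers to a GPU.

  # Build llama.cpp tuned for the host CPU (recent checkout assumed).
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON   # drop GGML_CUDA for a CPU-only build
  cmake --build build --config Release -j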

Running the Server

The key is to pin the process to your Performance Cores (P-cores) and avoid the Efficiency Cores (E-cores), which can slow down the memory-heavy threads.
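
One way to do this on Linux, assuming the usual Raptor Lake enumeration where the 8 hyperthreaded P-cores show up as logical CPUs 0-15 (check with lscpu --all --extended), is taskset plus one worker thread per physical P-core; the model path and context size below are placeholders:

  # Pin llama-server to the P-cores, one worker thread per physical P-core.
  # The CPU numbering is an assumption for a 14700F-class chip; verify with lscpu first.
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 --host 127.0.0.1 --port 8080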

Even if the model doesn't fit entirely in VRAM, llama.cpp can still offload specific layers to the GPU, which gives a nice boost to overall generation speed.
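
A sketch of that hybrid setup, assuming the CUDA-enabled build from above: -ngl sets how many layers go to the GPU, and on MoE models a common alternative is to offload everything except the expert tensors, which stay in RAM (the --override-tensor pattern below is a commonly used one and may need adjusting to your model's tensor names).

  # Offload as many whole layers as fit in VRAM (tune -ngl to your GPU's memory).
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 -ngl 20

  # MoE-specific alternative on recent builds: all layers on the GPU, expert tensors in RAM.
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 -ngl 99 --override-tensor "exps=CPU"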