The core principle of running Mixture-of-Experts (MoE) models on CPU/RAM is that the CPU never has to read the full set of weights from memory for any single token. Only a fraction of the parameters are "active" per token, and because the arithmetic on that fraction is light, memory throughput, not compute, becomes the primary bottleneck.

Models and Bandwidth

Let's look at two popular models: GLM-4.7-Flash (3B active params) and GPT OSS 120B (5.1B active params). At Q4_K_M quantization, their active memory footprints are:

  • GLM-4.7-Flash: ~1.7 GB
  • GPT OSS 120B: ~2.55 GB

Now, let's compare theoretical dual-channel DDR5 bandwidth with what's realistically achievable. The theoretical peaks:

  • DDR5-4800: 76.8 GB/s
  • DDR5-6000: 96.0 GB/s
  • DDR5-6400: 102.4 GB/s

In practice, the bandwidth you can sustain during LLM inference is closer to 35 GB/s. So:

  • GLM-4.7-Flash: 35 GB/s ÷ 1.7 GB ≈ 20.5 tokens/sec
  • GPT OSS 120B: 35 GB/s ÷ 2.55 GB ≈ 13.7 tokens/sec
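
As a sanity check, here is the same estimate as a tiny shell snippet; the 35 GB/s figure and the two active-weight sizes are the assumptions from above, not measurements of your machine:

  # Rough decode-speed ceiling: effective bandwidth / active weights read per token.
  # The bandwidth and model sizes are this article's assumptions, not measured values.
  bw=35                            # assumed effective memory bandwidth, GB/s
  for active_gb in 1.7 2.55; do    # GLM-4.7-Flash, GPT OSS 120B at Q4_K_M
    echo "scale=1; $bw / $active_gb" | bc    # prints 20.5, then 13.7 tokens/sec
  done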

Hardware Optimization (Intel Core i7-14700F Example)

To hit these numbers, your CPU and BIOS settings must be dialed in:

  1. XMP/EXPO: Enable your XMP (Intel) or EXPO (AMD) profile in BIOS; you can verify the resulting bandwidth with the sketch after this list.
  2. Power Limits: You need the CPU to hold its maximum boost clocks to keep the memory controller saturated. Raise your power limits (PL1/PL2) to 219W, the 14700F's rated maximum turbo power.
  3. Thermal Management: To prevent throttling at 219W, you need high-end cooling. I recommend undervolting to keep temps manageable without losing performance.
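
Rather than trusting the 35 GB/s rule of thumb, it's worth measuring what your machine actually sustains once XMP and power limits are set. A minimal sketch, assuming sysbench is installed (the block size keeps the test out of cache; raise --threads to probe multi-core throughput):

  # Measure sustained read bandwidth from RAM; the reported MiB/sec is your real-world ceiling.
  sysbench memory --memory-block-size=1M --memory-total-size=32G \
                  --memory-oper=read --threads=4 run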

Software and Compilation

It's worth compiling llama.cpp from source with flags optimized for your specific architecture (Raptor Lake) rather than relying on a generic prebuilt binary.
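
A minimal build sketch, assuming a recent llama.cpp checkout that uses the GGML_*-prefixed CMake options (older releases used LLAMA_*-prefixed ones): GGML_NATIVE compiles with -march=native so the host's vector extensions are picked up, and GGML_CUDA is only needed if you plan to offload layers to a GPU.

  # Build llama.cpp tuned for the host CPU (recent checkout assumed).
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON   # drop GGML_CUDA for a CPU-only build
  cmake --build build --config Release -j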

Running the Server

The key is to pin the process to your Performance Cores (P-cores) and avoid the Efficiency Cores (E-cores), which can slow down the memory-heavy threads.
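
One way to do this on Linux, assuming the usual Raptor Lake enumeration where the 8 hyperthreaded P-cores show up as logical CPUs 0-15 (check with lscpu --all --extended), is taskset plus one worker thread per physical P-core; the model path and context size below are placeholders:

  # Pin llama-server to the P-cores, one worker thread per physical P-core.
  # The CPU numbering is an assumption for a 14700F-class chip; verify with lscpu first.
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 --host 127.0.0.1 --port 8080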

Even if the model doesn't fit entirely in VRAM, llama.cpp can still offload specific layers to the GPU, which gives a nice boost to overall generation speed.
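
A sketch of that hybrid setup, assuming the CUDA-enabled build from above: -ngl sets how many layers go to the GPU, and on MoE models a common alternative is to offload everything except the expert tensors, which stay in RAM (the --override-tensor pattern below is a commonly used one and may need adjusting to your model's tensor names).

  # Offload as many whole layers as fit in VRAM (tune -ngl to your GPU's memory).
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 -ngl 20

  # MoE-specific alternative on recent builds: all layers on the GPU, expert tensors in RAM.
  taskset -c 0-15 ./build/bin/llama-server \
      -m ./models/model-Q4_K_M.gguf \
      -t 8 -c 8192 -ngl 99 --override-tensor "exps=CPU"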