ONNX (Open Neural Network Exchange) is an open format for ML models that lets a model trained in PyTorch, TensorFlow, or JAX be exported and run on a different inference backend, most commonly ONNX Runtime, which applies hardware-specific graph optimisations.
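A minimal sketch of that workflow: export a small PyTorch module with `torch.onnx.export`, then execute it through an ONNX Runtime `InferenceSession`. The layer sizes and the `model.onnx` path are illustrative.

```python
import torch
import onnxruntime as ort

# Toy PyTorch model standing in for a real network
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
model.eval()
dummy = torch.randn(1, 16)

# Trace and export the graph to ONNX
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run the exported graph with ONNX Runtime; graph optimisations are applied
# when the session is created
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)  # (1, 8)
```

The same session can be created with a different execution provider (CUDA, TensorRT, DirectML) without re-exporting the model.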
## ONNX Runtime Features
- Automatic kernel fusion and graph optimisation
- Execution providers: `CPUExecutionProvider`, `CUDAExecutionProvider`, `TensorrtExecutionProvider`, `CoreMLExecutionProvider`, `DmlExecutionProvider` (DirectML)
- Built-in INT8 quantization via `onnxruntime.quantization` (see the sketch after this list)
- Hugging Face Optimum library simplifies export: `optimum-cli export onnx --model bert-base-uncased bert-base-uncased-onnx/`
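For the quantization bullet above, a minimal sketch of post-training dynamic INT8 quantization with `onnxruntime.quantization`; the input and output paths are placeholders.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization: weights are stored as INT8,
# activations are quantized on the fly at inference time
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,
)
```

Dynamic quantization needs no calibration dataset, which makes it the usual starting point for CPU deployments; static quantization trades that convenience for better accuracy/latency on some graphs.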
## ONNX vs GGUF for LLM Inference
|  | ONNX Runtime | GGUF (llama.cpp) |
|---|---|---|
| Best for | Encoder models (BERT, embeddings), smaller LLMs, Windows ARM | Generative LLMs (7B–70B) |
| Windows DirectML | Excellent | Limited |
| Apple CoreML | Good | Good (Metal) |
| LLM support | Phi-4, Llama via Olive | All major families |
## Why It Matters for On-Premise
ONNX Runtime is the recommended backend for running embedding models (all-MiniLM, BGE) in production — it applies CPU optimisations that make embedding generation 2–3× faster than naive PyTorch. For Windows-based on-premise deployments, DirectML via ONNX Runtime enables GPU acceleration without requiring CUDA, making it useful for AMD GPU environments.
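A sketch of that setup using Hugging Face Optimum on top of ONNX Runtime. The model ID is the standard all-MiniLM checkpoint; the `DmlExecutionProvider` string only applies on a Windows machine with a DirectML-capable GPU, and dropping the `provider` argument falls back to the default CPU provider.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    export=True,                      # convert to ONNX on the fly if no ONNX weights exist
    provider="CPUExecutionProvider",  # or "DmlExecutionProvider" for DirectML on Windows
)

inputs = tokenizer(["on-premise retrieval test"], padding=True, return_tensors="pt")
token_embeddings = model(**inputs).last_hidden_state

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([1, 384])
```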