ONNX (Open Neural Network Exchange) is an open format for ML models that lets a model trained in PyTorch, TensorFlow, or JAX be exported and run on a different inference backend, most commonly ONNX Runtime, which applies hardware-specific graph optimisations.
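A minimal sketch of that workflow: export a small PyTorch module with `torch.onnx.export`, then execute it through an ONNX Runtime `InferenceSession`. The layer sizes and the `model.onnx` path are illustrative.

```python
import torch
import onnxruntime as ort

# Toy PyTorch model standing in for a real network
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
model.eval()
dummy = torch.randn(1, 16)

# Trace and export the graph to ONNX
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run the exported graph with ONNX Runtime; graph optimisations are applied
# when the session is created
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)  # (1, 8)
```

The same session can be created with a different execution provider (CUDA, TensorRT, DirectML) without re-exporting the model.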
## ONNX Runtime Features
- Automatic kernel fusion and graph optimisation
- Execution providers: `CPUExecutionProvider`, `CUDAExecutionProvider`, `TensorrtExecutionProvider`, `CoreMLExecutionProvider`, `DmlExecutionProvider` (DirectML)
- Built-in INT8 quantization via `onnxruntime.quantization` (see the sketch after this list)
- Hugging Face Optimum library simplifies export: `optimum-cli export onnx --model bert-base-uncased bert-base-uncased-onnx/`
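For the quantization bullet above, a minimal sketch of post-training dynamic INT8 quantization with `onnxruntime.quantization`; the input and output paths are placeholders.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization: weights are stored as INT8,
# activations are quantized on the fly at inference time
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,
)
```

Dynamic quantization needs no calibration dataset, which makes it the usual starting point for CPU deployments; static quantization trades that convenience for better accuracy/latency on some graphs.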
## ONNX vs GGUF for LLM Inference
|  | ONNX Runtime | GGUF (llama.cpp) |
|---|---|---|
| Best for | Encoder models (BERT, embeddings), smaller LLMs, Windows ARM | Generative LLMs (7B–70B) |
| Windows DirectML | Excellent | Limited |
| Apple CoreML | Good | Good (Metal) |
| LLM support | Phi-4, Llama via Olive | All major families |
## Why It Matters for On-Premise
ONNX Runtime is the recommended backend for running embedding models (all-MiniLM, BGE) in production — it applies CPU optimisations that make embedding generation 2–3× faster than naive PyTorch. For Windows-based on-premise deployments, DirectML via ONNX Runtime enables GPU acceleration without requiring CUDA, making it useful for AMD GPU environments.
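A sketch of that setup using Hugging Face Optimum on top of ONNX Runtime. The model ID is the standard all-MiniLM checkpoint; the `DmlExecutionProvider` string only applies on a Windows machine with a DirectML-capable GPU, and dropping the `provider` argument falls back to the default CPU provider.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    export=True,                      # convert to ONNX on the fly if no ONNX weights exist
    provider="CPUExecutionProvider",  # or "DmlExecutionProvider" for DirectML on Windows
)

inputs = tokenizer(["on-premise retrieval test"], padding=True, return_tensors="pt")
token_embeddings = model(**inputs).last_hidden_state

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([1, 384])
```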