ONNX / ONNX Runtime

Format

Open Neural Network Exchange — a portable model format that enables cross-framework interoperability, often used to accelerate inference via ONNX Runtime.

ONNX (Open Neural Network Exchange) is an open format for ML models: a model trained in PyTorch, TensorFlow, or JAX can be exported to ONNX (natively from PyTorch, via converter tools for the other frameworks) and run on a different inference backend, most commonly ONNX Runtime, which applies hardware-specific graph optimisations.
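A minimal sketch of that export-and-run flow, using a toy PyTorch module and an illustrative file name rather than a real model:

```python
import torch
import onnxruntime as ort

# Toy stand-in for a real trained model (hypothetical, for illustration).
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
model.eval()

# Export to ONNX; the dummy input traces the graph and fixes the input rank.
dummy = torch.randn(1, 16)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)  # (1, 8)
```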

ONNX Runtime Features

  • Automatic kernel fusion and graph optimisation
  • Execution providers, tried in listed order with fallback: CPUExecutionProvider, CUDAExecutionProvider, TensorrtExecutionProvider, CoreMLExecutionProvider, DmlExecutionProvider
  • Built-in INT8 quantization via onnxruntime.quantization (see the sketch after this list)
  • The Hugging Face Optimum library simplifies export: optimum-cli export onnx --model bert-base-uncased <output_dir>
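A short sketch of the quantization and provider-selection points above, assuming a model.onnx file like the one exported earlier (file names are illustrative):

```python
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic INT8 quantization: weights are converted offline, activations
# are quantized on the fly at inference time.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

# Providers are tried in the listed order; ONNX Runtime falls back to the
# next entry when one is unavailable on the host.
session = ort.InferenceSession(
    "model.int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # the providers actually in effect
```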

ONNX vs GGUF for LLM Inference

                 | ONNX Runtime                                                 | GGUF (llama.cpp)
Best for         | Encoder models (BERT, embeddings), smaller LLMs, Windows ARM | Generative LLMs (7B–70B)
Windows DirectML | Excellent                                                    | Limited
Apple CoreML     | Good                                                         | Good (Metal)
LLM support      | Phi-4, Llama via Olive                                       | All major families

Why It Matters for On-Premise

ONNX Runtime is the recommended backend for running embedding models (all-MiniLM, BGE) in production: its CPU graph optimisations typically make embedding generation 2–3× faster than an unoptimised eager-mode PyTorch baseline (see the sketch below). For Windows-based on-premise deployments, DirectML via ONNX Runtime enables GPU acceleration without requiring CUDA, which makes it useful in AMD GPU environments.
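For example, the Optimum wrapper around ONNX Runtime can export and serve a sentence-embedding model in a few lines. This is one common pattern, not the only one; the mean-pooling step here is simplified and a production pipeline would mask padding tokens:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly. On
# Windows, passing provider="DmlExecutionProvider" (with the
# onnxruntime-directml package installed) selects the DirectML backend.
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

inputs = tokenizer(["on-premise embedding test"], return_tensors="pt")
outputs = model(**inputs)

# Mean-pool token embeddings into a single sentence vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, 384) for all-MiniLM-L6-v2
```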