A "From Scratch" Transformer LLM in C++17: Quadtrix.cpp Redefines Control

In the rapidly evolving landscape of Large Language Models (LLMs), reliance on complex frameworks and external libraries has become the norm. A recent project, Quadtrix.cpp, challenges this convention with a complete Transformer LLM implemented entirely in C++17, with no external dependencies beyond the standard library and POSIX sockets. Built from scratch, the project includes its own tensor library, all forward-pass operations, and a full analytical backward pass with explicitly derived gradients for every operator.
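To make the "from scratch" claim concrete, the sketch below shows what a minimal, dependency-free tensor type with a hand-derived backward pass for a single operator can look like. It is an illustration under assumptions, not the actual Quadtrix.cpp API: the Tensor struct and the matmul/matmul_backward names are hypothetical.

```cpp
// Hypothetical sketch (not the actual Quadtrix.cpp code): a minimal tensor
// built on the standard library only, with an analytic gradient for matmul.
#include <cassert>
#include <cstddef>
#include <vector>

struct Tensor {
    std::vector<float> data;   // row-major values
    std::vector<float> grad;   // dL/d(this tensor), same shape as data
    std::size_t rows = 0, cols = 0;

    Tensor(std::size_t r, std::size_t c)
        : data(r * c, 0.0f), grad(r * c, 0.0f), rows(r), cols(c) {}
};

// Forward: C = A * B (naive O(n^3) matrix multiplication).
Tensor matmul(const Tensor& A, const Tensor& B) {
    assert(A.cols == B.rows);
    Tensor C(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = 0; k < A.cols; ++k)
            for (std::size_t j = 0; j < B.cols; ++j)
                C.data[i * B.cols + j] += A.data[i * A.cols + k] * B.data[k * B.cols + j];
    return C;
}

// Backward: given dL/dC in C.grad, accumulate the analytic gradients
// dL/dA = dC * B^T and dL/dB = A^T * dC into A.grad and B.grad.
void matmul_backward(Tensor& A, Tensor& B, const Tensor& C) {
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j) {
            const float g = C.grad[i * B.cols + j];
            for (std::size_t k = 0; k < A.cols; ++k) {
                A.grad[i * A.cols + k] += g * B.data[k * B.cols + j];
                B.grad[k * B.cols + j] += g * A.data[i * A.cols + k];
            }
        }
}
```

The same pattern, a forward function paired with an explicitly derived gradient function, repeats for every operator in a framework-free implementation.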

Quadtrix.cpp's radical approach grants granular control over the entire development and deployment pipeline. In an era where data sovereignty and transparency of the technology stack are growing priorities for businesses, such a "bare metal" implementation can serve as a reference point. The choice to forgo frameworks like PyTorch or LibTorch, as well as BLAS libraries, underscores a commitment to maximum autonomy and a deep understanding of every model component.

Technical Details and Performance: The Value of CPU Optimization

Quadtrix.cpp's architecture is based on a decoder-only Transformer with 4 layers, 4 heads, and an embedding dimension of 200. This model, with 826,985 parameters (approximately 0.83 million), was trained on a corpus of 31.4 million characters of children's stories, using a context window of 128 characters. The training achieved a validation loss of 1.6371 nats in 76.2 minutes, operating on a single CPU core.
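Collected in one place, the reported hyperparameters might look like the following configuration struct. The struct and its field names are illustrative assumptions; only the values come from the figures stated above.

```cpp
// Hypothetical configuration struct; field names are illustrative,
// the values are the hyperparameters reported for Quadtrix.cpp.
struct ModelConfig {
    int n_layers = 4;    // decoder-only Transformer blocks
    int n_heads  = 4;    // attention heads per block
    int d_model  = 200;  // embedding dimension (200 / 4 = 50 dims per head)
    int ctx_len  = 128;  // context window, in characters
    // Total parameter count reported: 826,985 (~0.83M); the character-level
    // vocabulary size that contributes to it is not stated here.
};
```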

A notable aspect is the OpenMP parallelization across all CPU cores of critical operations such as matrix multiplication (matmul), batch matrix multiplication (bmm), softmax, and LayerNorm, which yielded a speedup of approximately 5-7x on an 8-core machine. The complexity of the gradient derivations, particularly for LayerNorm and attention, was a focal point of development, requiring careful management of intermediate variables and dropout masks. Although the model's output is still "gibberish" given its size and limited training time, its value lies in demonstrating the feasibility of a completely self-contained implementation.
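As a hedged sketch of what this kind of OpenMP parallelization can look like, the following row-wise softmax splits independent rows across CPU threads with a single pragma. Function and variable names are illustrative, not the project's actual code, and the file must be compiled with OpenMP enabled (for example with -fopenmp) for the pragma to take effect.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place softmax over each row of a row-major (rows x cols) matrix.
// Rows are independent, so the outer loop parallelizes trivially.
void softmax_rows(std::vector<float>& m, std::size_t rows, std::size_t cols) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(rows); ++i) {
        float* row = m.data() + static_cast<std::size_t>(i) * cols;
        // Subtract the row maximum before exponentiating, for numerical stability.
        float mx = row[0];
        for (std::size_t j = 1; j < cols; ++j) mx = std::max(mx, row[j]);
        float sum = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) {
            row[j] = std::exp(row[j] - mx);
            sum += row[j];
        }
        for (std::size_t j = 0; j < cols; ++j) row[j] /= sum;
    }
}
```

The matmul, bmm, and LayerNorm kernels lend themselves to the same pattern, since their outermost loops also iterate over independent rows or batch elements.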

Implications for On-Premise Deployments and Data Sovereignty

For organizations considering LLM deployment in on-premise, hybrid, or air-gapped environments, Quadtrix.cpp offers significant insights. The absence of external dependencies drastically reduces the attack surface and simplifies compliance and security management. Removing third-party libraries means full control over the code that actually runs, a crucial factor for sectors with stringent regulatory requirements or for the protection of sensitive data.

Furthermore, the ability to train a model on a CPU, albeit more slowly than on GPUs, opens up scenarios where dedicated AI acceleration hardware is not immediately available or not economically viable for a given workload. This approach can improve long-term Total Cost of Ownership (TCO) by eliminating licensing costs and dependencies on cloud providers. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between control, performance, and cost.

Future Prospects and Balancing Control with Productivity

The Quadtrix.cpp project highlights a fundamental trade-off in AI development: the balance between complete control over the implementation and the productivity offered by established frameworks. A "from scratch" implementation demands a significant investment of time and expertise to manage low-level details, but it guarantees a deep understanding of the system and enables aggressive optimizations that are out of reach behind higher-level abstractions.

A GPU port of the project using LibTorch, with the same architecture and hyperparameters, showed a speedup of approximately 75x on an RTX 3080. This comparison underscores the crucial role of specialized hardware for large-scale LLM training and inference, but also the value of an implementation that lets developers choose their level of abstraction. Quadtrix.cpp does not aim to replace existing frameworks; it demonstrates the feasibility and benefits of an approach that maximizes control and transparency, qualities increasingly sought after in the deployment of critical AI solutions.
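As a rough sketch of what a LibTorch-based GPU port can look like, the snippet below builds a small C++ module and moves both the module and its input to CUDA when available. It is an assumption-laden illustration, not the project's actual port: the block shown is a stand-in feed-forward layer rather than the full 4-layer, 4-head model, and its hidden size and activation are arbitrary choices.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    // Pick the GPU when CUDA is available, otherwise fall back to CPU.
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);

    // A single feed-forward block with d_model = 200, as a stand-in for the model.
    torch::nn::Sequential block(
        torch::nn::Linear(200, 800),
        torch::nn::GELU(),
        torch::nn::Linear(800, 200));
    block->to(device);

    // A batch of 32 sequences of length 128 with embedding dimension 200.
    auto x = torch::randn({32, 128, 200}, device);
    auto y = block->forward(x);           // executes on the GPU if one was found
    std::cout << y.sizes() << std::endl;  // [32, 128, 200]
    return 0;
}
```

Linked against a LibTorch distribution (for example via CMake's find_package(Torch)), the same code runs unchanged on CPU when no GPU is present, which mirrors the choice of abstraction level discussed above.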