Distilling Your Own LLM for Theorem Proving: When On-Premise Beats the Cloud

When a niche becomes a cost sinkhole, the temptation to bring models in-house grows strong. A researcher working on automated theorem proving with LLMs described a telling situation: they have funding for hardware but not for cloud credits, and the domain is so specific — the Rocq proof assistant — that smaller models simply don’t grasp it. Their idea: distill a larger model into a compact yet competent version, to run on-premise.

The cloud trap for niche domains

Agentic theorem proving is an extreme LLM use case: it demands formal reasoning, knowledge of esoteric syntax, and a level of accuracy that general-purpose models often lack. Those working with proof assistants like Rocq (formerly Coq) face a paradox: cloud models are powerful but charge for every inference, while small models, which would cost less to run, fail on the language’s subtleties.

The researcher, forced onto cloud models, sees costs spiral. Their question is a classic TCO (total cost of ownership) calculation: why pay per token when you can buy hardware? And if the hardware is already funded, bringing inference in-house becomes almost a necessity.
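The pay-per-token versus buy-hardware question can be sketched as a break-even calculation. The function names and every figure below are hypothetical placeholders, not numbers from the researcher's post; real prices, power costs, and token volumes would need to be substituted.

```python
def cloud_cost_per_month(tokens_per_month, price_per_million_tokens):
    """Monthly API bill for a given token volume (price is per one million tokens)."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_months(hardware_cost, onprem_monthly_opex, cloud_monthly):
    """Months until the hardware purchase is repaid by the avoided cloud spend.

    Returns None when running on-premise costs at least as much per month
    as the cloud bill, in which case the hardware never pays for itself.
    """
    monthly_saving = cloud_monthly - onprem_monthly_opex
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical inputs: 500M tokens per month at 3.00 per million tokens,
# an 8000 workstation, and 150 per month for power and maintenance.
cloud = cloud_cost_per_month(500_000_000, 3.0)
months = break_even_months(8000, 150, cloud)
print(f"cloud bill: {cloud:.0f}/month, break-even after {months:.1f} months")
```

When the hardware is paid for by a separate grant, as in this case, `hardware_cost` is effectively zero from the researcher's point of view and only the running costs remain to compare.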

What distillation is and why it might work

Distillation (or knowledge distillation) is the process in which a large model, the teacher, trains a smaller one, the student. The student learns to match not only the teacher’s answers but also its probability distributions over predictions. It’s a common technique for creating compact LLMs that retain much of the original’s expertise while reducing latency and VRAM consumption.
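One common way to express "transferring the probability distributions" is a loss that blends a temperature-softened KL divergence between teacher and student with the usual cross-entropy on the correct token. The sketch below uses plain Python on a single prediction to show the arithmetic. The `temperature` and `alpha` values are illustrative assumptions, and a real training run would compute this over batches in a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher) and hard-target loss (match the label)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The temperature**2 factor keeps the soft term's gradient scale comparable
    # to the hard term when the temperature changes.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[target_index])
    return alpha * soft_loss + (1 - alpha) * hard_loss


teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, [2.0, 1.0, 0.1], target_index=0))  # student agrees
print(distillation_loss(teacher, [0.1, 1.0, 2.0], target_index=0))  # student disagrees
```

The soft term vanishes when the student reproduces the teacher's distribution exactly, which is why the second call yields the larger loss.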

For theorem proving, this would mean taking a model already strong on formal language (like those trained on Lean) and distilling it on Rocq examples. The advantage is twofold: the reduced size allows execution on consumer GPUs or workstations with limited VRAM, and on-premise deployment cuts recurring operational costs.

The on-premise stack: VRAM and TCO concerns

Self-hosted doesn’t mean free. The weights of a 7–13 billion parameter model take roughly 14–26 GB in FP16 (two bytes per parameter) and roughly 7–13 GB after quantization to INT8, before counting the KV cache and activations. In practice that means GPUs with at least 16–24 GB of VRAM for inference, and training or fine-tuning pushes requirements higher. A suitable on-premise environment involves upfront CapEx, but over time the TCO can be lower than a cloud API, especially with continuous usage and sensitive or large datasets.
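The VRAM figures follow from simple arithmetic on parameter count and bytes per parameter, sketched below. The `overhead_fraction` default of 20% for KV cache and activations is a rough assumption for illustration only; actual overhead depends on context length, batch size, and the inference runtime.

```python
# Storage cost of one parameter at each precision, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Memory for the weights alone: 1e9 parameters at 1 byte each is 1 GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def estimated_vram_gb(params_billions, precision, overhead_fraction=0.2):
    """Weights plus an assumed fractional overhead for KV cache and activations."""
    return weight_memory_gb(params_billions, precision) * (1 + overhead_fraction)


for size in (7, 13):
    for precision in ("fp16", "int8", "int4"):
        weights = weight_memory_gb(size, precision)
        total = estimated_vram_gb(size, precision)
        print(f"{size}B {precision}: weights {weights:.1f} GB, estimated total {total:.1f} GB")
```

Under these assumptions a 13B model in FP16 does not fit on a single 24 GB card, while its INT8 version does.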

AI-RADAR has repeatedly noted that the cloud vs. on-premise decision for LLMs isn’t only economic: it depends on request volume, tolerable latency, and the need for data control. In this researcher’s case, the already available hardware funding tips the balance toward on-premise.

The data bottleneck: Rocq is a model desert

The original post’s edit adds another challenge: very few models exist for Rocq, whereas Lean already has a fine-tuned DeepSeek model. Post-training that model on Rocq examples might be faster than starting from scratch. However, this requires datasets of instruction-response pairs in Rocq’s formal language, and creating them takes uncommon domain expertise.

This opens a front beyond pure technology: dataset construction is often the real bottleneck in specialized fine-tuning projects. Models for theorem proving need annotated data, possibly generated by the large model itself, in a loop reminiscent of self-distillation.
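A minimal sketch of such a generation loop follows, assuming two hypothetical callables that the post does not specify: `generate_proof`, which would query the large model, and `check_proof`, which would accept or reject a candidate (for instance by compiling it with the Rocq toolchain). Filtering by a checker is an assumption added here for illustration; the stubs at the bottom only stand in for the real components.

```python
import json


def build_dataset(statements, generate_proof, check_proof, attempts_per_statement=4):
    """Collect instruction-response pairs, keeping only candidates the checker accepts."""
    records = []
    for statement in statements:
        for _ in range(attempts_per_statement):
            proof = generate_proof(statement)
            if check_proof(statement, proof):
                records.append({"instruction": statement, "response": proof})
                break  # one accepted proof per statement is enough for this sketch
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stub components for demonstration only: a fixed "teacher" answer and a toy checker.
def stub_generate(statement):
    return "Proof. exact I. Qed."


def stub_check(statement, proof):
    return statement.endswith(": True.")


examples = build_dataset(
    ["Lemma trivial : True.", "Lemma impossible : False."],
    stub_generate,
    stub_check,
)
print(to_jsonl(examples))
```

Statements for which no candidate passes the check are simply dropped, so the size of the resulting dataset depends on how often the teacher succeeds.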

Beyond the niche: what this story tells us

The researcher’s case signals how central open-source ecosystems and self-hosting are becoming for practical LLM adoption. It’s not just about cost: it’s the ability to shape a model to exact tasks, free from generic API constraints. On-premise distillation, paired with fine-tuning techniques like QLoRA, is putting specialized models within reach of teams that have hardware but no cloud budget.

For those evaluating on-premise deployment, AI-RADAR provides analytical frameworks that help map the trade-offs between hardware, software, and required skills. The Rocq theorem proving story shows that model sovereignty starts from very concrete needs, and often the answer lies in a stack you build yourself.