35 Billion Parameter LLM on GTX 1060 6GB: An On-Premise Case Study

Running 35 Billion Parameter LLMs on Older Hardware: The GTX 1060 6GB Case

Discussion of Large Language Models (LLMs) tends to focus on cloud infrastructure and data-center GPUs such as NVIDIA's A100 and H100. A recent experiment, however, shows that a large model can run at usable speeds on much older hardware. A user shared their experience running a 35-billion-parameter LLM, the qwen3.6-35B-a3b-MTP-GGUF UD Q4_K_XL, on a Dell T5810 workstation equipped with an NVIDIA GTX 1060 GPU with 6GB of VRAM.

This case study is particularly relevant for organizations evaluating on-premise deployment strategies. Running LLM workloads on existing or inexpensive hardware can reduce the Total Cost of Ownership (TCO) and keeps data on infrastructure the organization controls, both important considerations for technical decision-makers seeking alternatives to cloud-based solutions.

Technical Details and Deployment Configuration

The hardware configuration used for this test includes a Dell T5810 workstation, a system dating back approximately ten years. The core of this machine is an Intel Xeon E5-2698v3 CPU, featuring 16 cores and 32 threads, complemented by 32GB of DDR3 memory. The key component for AI acceleration is an NVIDIA GTX 1060 graphics card with 6GB of VRAM, a mid-range consumer GPU released in 2016.

For model execution, the user ran LM Studio on Windows. The chosen model, unsloth qwen3.6-35B-a3b-MTP-GGUF UD Q4_K_XL, is a 4-bit quantized GGUF build of the 35-billion-parameter Qwen 3.6, which makes it practical on resource-constrained hardware. The "a3b" in the name follows the Qwen naming convention for a Mixture of Experts (MoE) model in which roughly 3 billion parameters are active per token, so only a fraction of the weights is read for each generated token. The settings were a context length of 131072 tokens, GPU offload set to 41 layers, and the MoE expert weights for the same number of layers kept on the CPU. KV cache quantization was set to Q4_0, and the CPU thread pool was set to 16, matching the CPU's 16 physical cores.
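The need for CPU offloading follows from simple arithmetic on the weight size. The sketch below is a rough estimate only: the 4.5 bits-per-weight figure is an assumption (a typical average for 4-bit GGUF quantizations, not a measured value for this file), and the estimate ignores the KV cache and runtime buffers.

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


TOTAL_PARAMS = 35e9     # 35B, taken from the model name
BITS_PER_WEIGHT = 4.5   # assumption: rough average for a 4-bit quantization
VRAM_GIB = 6.0          # GTX 1060 6GB
RAM_GIB = 32.0          # system memory of the workstation

weights_gib = quantized_size_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)

print(f"Approximate weight size: {weights_gib:.1f} GiB")
print(f"Fits entirely in VRAM:   {weights_gib <= VRAM_GIB}")
print(f"Fits in system RAM:      {weights_gib <= RAM_GIB}")
```

Under these assumptions the weights are roughly three times larger than the available VRAM but fit in system memory, which is consistent with keeping the bulky expert weights on the CPU and the remaining layers on the GPU.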

Performance and Implications for Local Inference

Despite the hardware limitations, the recorded performance was notable for a local deployment. During the prefill phase, which involves the initial processing of an extended input, the system achieved a speed of approximately 130-150 tokens per second for a 16,000-token context. In the decode phase, concerning the sequential generation of response tokens, the speed settled around 16 tokens per second for a 4,000-token context.
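To put these rates in terms of waiting time, the short calculation below converts them to seconds. The throughput figures come from the report above; the 500-token reply length is a hypothetical value chosen only for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time needed to process `tokens` at a constant throughput."""
    return tokens / tokens_per_second


PROMPT_TOKENS = 16_000      # prefill context from the report
PREFILL_TPS_LOW = 130.0     # reported lower bound, tokens/s
PREFILL_TPS_HIGH = 150.0    # reported upper bound, tokens/s
DECODE_TPS = 16.0           # reported decode speed, tokens/s
REPLY_TOKENS = 500          # hypothetical reply length

prefill_worst = seconds_for(PROMPT_TOKENS, PREFILL_TPS_LOW)
prefill_best = seconds_for(PROMPT_TOKENS, PREFILL_TPS_HIGH)
decode_time = seconds_for(REPLY_TOKENS, DECODE_TPS)

print(f"Prefill of {PROMPT_TOKENS} tokens: {prefill_best:.0f}-{prefill_worst:.0f} s")
print(f"Decode of {REPLY_TOKENS} tokens: {decode_time:.1f} s")
```

A full 16,000-token prompt therefore takes on the order of two minutes before the first output token appears, while short chat turns are dominated by the decode rate.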

These figures indicate sufficient responsiveness for interactive applications like chatbots; the user described the model as "very usable for chat". The ability to run LLMs of this size on relatively old consumer hardware is relevant for scenarios where data sovereignty is paramount or where cloud operational costs are prohibitive. It suggests that, with the right optimizations (such as quantization and splitting the model between CPU and GPU), existing infrastructure can remain useful for AI workloads.
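As an illustration of what quantization does, the sketch below implements a simplified symmetric 4-bit block scheme: each block of 32 values shares one scale and stores small integers. It is not the exact layout of the GGUF Q4_0 or Q4_K formats; the function names and the choice of scale are assumptions made for clarity.

```python
BLOCK_SIZE = 32  # values sharing one scale


def quantize_block(values):
    """Map floats to 4-bit integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quants]


def bits_per_weight(block_size=BLOCK_SIZE, quant_bits=4, scale_bits=16):
    """Storage cost per value, including the per-block 16-bit scale."""
    return (block_size * quant_bits + scale_bits) / block_size


weights = [0.1 * i - 1.6 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"Bits per weight: {bits_per_weight()}")
print(f"Max round-trip error: {max_error:.4f} (scale = {scale:.4f})")
```

With 4-bit values and one 16-bit scale per 32-value block, storage comes to 4.5 bits per weight instead of 16, at the cost of a rounding error of at most half a scale step per value.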

Outlook for On-Premise Deployments

The experiment highlights a point relevant to CTOs and infrastructure architects: on-premise deployments are flexible in the hardware they can use. While high-end GPUs offer far better performance, the ability to run sizeable LLMs on more accessible hardware can lower the barrier to entry for AI adoption in controlled environments. This approach is particularly advantageous for sectors with stringent compliance requirements or for applications operating in air-gapped environments.

For those evaluating on-premise deployments, clear trade-offs exist between initial investment (CapEx), operational costs (OpEx), and desired performance. This example shows that software optimization and quantization techniques can extract significant value from existing hardware, offering a viable path to LLM deployment without large investments in new infrastructure. AI-RADAR provides analytical frameworks at /llm-onpremise to evaluate these trade-offs and support informed decisions.