Optimizing Large Language Models for On-Premise Deployment
The landscape of Large Language Models (LLMs) continues to evolve rapidly, presenting new challenges and opportunities for companies looking to integrate these technologies into their infrastructures. While cloud solutions offer scalability and ease of use, on-premise deployment is increasingly a priority for those who require total data control, sovereignty, and long-term cost optimization. In this context, efficient inference execution becomes a critical factor.
A recent community project highlights a practical approach to these needs: a Docker container that runs the Qwen3.6 27B model, a sizeable LLM, on local hardware. The solution stands out for its use of advanced optimization techniques, demonstrating that high performance is achievable outside of more expensive cloud environments.
Technical Details: vLLM, Quantization, and Hardware
The core of this implementation lies in the combination of several technological elements. The chosen model, Qwen3.6 27B, is an LLM with 27 billion parameters, demanding careful management of memory and compute. To optimize inference, the project leverages vLLM, a serving framework known for improving throughput and reducing latency, especially when handling many requests in parallel.
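As a rough illustration of what this looks like in practice, the sketch below drives a batch of prompts through vLLM's offline Python API and measures aggregate throughput. This is a minimal sketch, not the project's actual setup: the model path is a placeholder, and the numbers it prints depend entirely on your hardware.

```python
# Minimal sketch of batched inference with vLLM's offline Python API.
# The model path below is a placeholder; point it at whatever checkpoint
# you actually serve.
import time

from vllm import LLM, SamplingParams

prompts = [
    "Summarize the advantages of on-premise LLM deployment.",
    "Explain INT4 quantization in one paragraph.",
    "List three use cases for speculative decoding.",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these requests internally (continuous batching), which is
# what drives its throughput advantage under parallel load.
llm = LLM(model="path/to/local-checkpoint")  # placeholder

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"~{generated / elapsed:.0f} tokens/s aggregate across the batch")
```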
A fundamental aspect of making a model of this size runnable on consumer-grade hardware is quantization. In this case, Lorbus AutoRound INT4 quantization was applied. This technique reduces the precision of model weights from floating-point formats (such as FP16) to 4-bit integers, drastically lowering VRAM requirements and improving computational efficiency. Although quantization can involve a slight compromise in accuracy, modern techniques like Lorbus AutoRound are designed to minimize this impact. Completing the optimization picture, MTP (multi-token prediction) speculative decoding further accelerates token generation. The entire stack was tested on a hardware configuration of two NVIDIA RTX 3090 GPUs, reaching 118 tokens per second, a remarkable result for a local setup.
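To see why INT4 matters here, a back-of-the-envelope calculation helps: 27 billion parameters at FP16 need roughly 54 GB for the weights alone, which exceeds the combined 48 GB of two RTX 3090s, while INT4 brings that down to roughly 13.5 GB. The sketch below works through the arithmetic and shows a hedged vLLM load of such a checkpoint; the model path is a placeholder, and the speculative-decoding flags are deliberately omitted because they vary across vLLM versions.

```python
# Back-of-the-envelope VRAM math for a 27B-parameter model (weights only;
# KV cache and activation overhead come on top of these figures).
from vllm import LLM

params = 27e9
fp16_gb = params * 2 / 1e9    # ~54 GB: exceeds 2x 24 GB even before KV cache
int4_gb = params * 0.5 / 1e9  # ~13.5 GB: leaves ample room for the KV cache
print(f"FP16 weights: {fp16_gb:.1f} GB, INT4 weights: {int4_gb:.1f} GB")

# Loading the quantized checkpoint across both GPUs. vLLM typically picks
# up the quantization scheme from the checkpoint's own configuration, so
# tensor parallelism is the only knob set explicitly here. The path is a
# placeholder, not a published model ID.
llm = LLM(
    model="path/to/qwen-27b-autoround-int4",  # placeholder
    tensor_parallel_size=2,                   # shard across the two RTX 3090s
)
```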
Implications for On-Premise Deployment and TCO
This demonstration has significant implications for CTOs, DevOps leads, and infrastructure architects evaluating deployment strategies for LLMs. The ability to achieve 118 tokens per second on two RTX 3090s (GPUs with 24 GB of VRAM each, relatively accessible on the secondary market compared to enterprise cards) highlights a viable path for on-premise inference. This approach can translate into a lower Total Cost of Ownership (TCO) in the long run, avoiding the recurring and often unpredictable operational costs of cloud solutions.
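The TCO argument is easy to make concrete. The break-even sketch below compares an assumed on-premise build against an assumed monthly cloud bill; every number in it is an illustrative assumption, not a quote, so substitute your own hardware prices, energy rates, and cloud costs.

```python
# Illustrative, deliberately simplified break-even calculation.
# Every figure below is an assumption for the sake of the example.
gpu_cost = 2 * 800           # two used RTX 3090s, assumed ~$800 each
host_cost = 1_200            # CPU, RAM, PSU, chassis (assumed)
capex = gpu_cost + host_cost

power_kw = 0.8               # assumed steady draw under load
energy_price = 0.25          # assumed $/kWh
hours_per_month = 300        # assumed duty cycle
opex_monthly = power_kw * energy_price * hours_per_month

cloud_monthly = 600          # assumed cost of a comparable cloud GPU instance

# Months until the on-premise build pays for itself versus renting,
# ignoring maintenance, depreciation, and staff time.
break_even = capex / (cloud_monthly - opex_monthly)
print(f"CapEx ${capex}, on-prem OpEx ${opex_monthly:.0f}/mo, "
      f"break-even after ~{break_even:.1f} months")
```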
Deployment via Docker containers also simplifies management and portability, allowing companies to maintain full control over their data, a crucial aspect for regulatory compliance and data sovereignty, especially in regulated sectors. For those evaluating on-premise deployment, there are trade-offs to consider, such as infrastructure management and scalability. AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects, providing tools to compare CapEx and OpEx, cooling requirements, and integration with existing stacks.
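Because vLLM exposes an OpenAI-compatible HTTP endpoint, applications can talk to the container with the standard openai client. In the hedged sketch below, the host, port, and served model name are assumptions; match them to however the container is actually launched on your network.

```python
# Querying a containerized vLLM server through its OpenAI-compatible API.
# Host, port, and model name are assumptions; adjust to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed port mapping for the container
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="qwen-27b-int4",  # placeholder served-model name
    messages=[{"role": "user", "content": "Confirm that this request stayed on-premise."}],
)
print(response.choices[0].message.content)
```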
Future Prospects and the Role of the Community
The success of projects like this underscores the importance of the open-source community in pushing the boundaries of LLM efficiency and accessibility. Innovation in quantization techniques, serving frameworks, and deployment methodologies continues to make larger and more complex models usable on a wider range of hardware.
For businesses, this means having a growing array of options to implement generative AI in a controlled and cost-effective manner. The choice between cloud and on-premise is no longer a rigid dichotomy but a matter of balancing agility, cost, security, and control. Solutions like the one presented demonstrate that high-performance LLM inference is increasingly within reach for well-optimized local infrastructures.