Topic / Trend Rising

On-Premise LLM Deployment and Optimization

Efforts to run large language models locally are being accelerated by quantization, speculative decoding, and community tooling, widening adoption outside the cloud.
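As a rough illustration of why quantization matters for consumer hardware, the sketch below estimates the memory needed for model weights alone at different precisions. The 27B parameter count is borrowed from the Qwen 27B items under Related Coverage; the function name is hypothetical, and the estimate deliberately ignores KV cache, activations, and per-block scale overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Assumption: every parameter is stored at the same bit width and there is
    no overhead for quantization scales, KV cache, or activations.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative only: a 27B-parameter model at three common precisions.
for label, bits in [("FP16", 16), ("FP8/INT8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(27e9, bits):.1f} GB")
```

Under these assumptions, only the 4-bit figure falls below the 24 GB of a single RTX 3090, and whatever headroom remains still has to hold the context.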

Detected: 2026-07-05 · Updated: 2026-07-05

Related Coverage

2026-07-05 LocalLLaMA

RTX 3090 and LLMs: Running Qwen 27B with 200K Tokens Locally Is a Reality

The AI maker community celebrates the power of the NVIDIA RTX 3090: a user shares their experience running the Qwen 27B model with a 200,000-token context window, using the ‘club 3090’ configuration from GitHub. The consumer GPU with 24 GB of VRAM pr...

#Hardware #LLM On-Premise #DevOps

2026-07-03 LocalLLaMA

Longcat 2: INT8 and FP8 quantization now available for on-prem deployment

Meituan has published Longcat 2 model weights in INT8 and FP8 quantized formats. For teams self-hosting LLMs, having ready-to-use compressed versions lowers hardware requirements and inference costs while keeping a good balance between accuracy and V...

#Hardware #LLM On-Premise #DevOps

2026-07-02 LocalLLaMA

vLLM's silent fix doubles context window on a single consumer GPU

A Reddit appreciation post reveals a technical leap: vLLM's latest releases fix memory allocation bugs, allowing Qwen2.5 7B to run with 240,000 tokens on a single RTX 5090, up from 120,000. A reminder that well-maintained open source can break down b...

#Hardware #LLM On-Premise #DevOps

2026-07-02 LocalLLaMA

Two RTX 3090s in a Thermaltake Core P3: when DIY meets local LLM inference

A user managed to fit two RTX 3090 GPUs inside an open-frame Thermaltake Core P3 case by 3D-printing a bracket to tilt the radiator. Beyond the striking visuals, the build can locally run models like Qwen 27B. For those evaluating on-premise deployme...

#Hardware #LLM On-Premise #Fine-Tuning

2026-07-01 LocalLLaMA

Ascent GX10 or DGX Spark: Betting on Local LLM Inference

A Reddit user considers buying four Ascent GX10 units to run open-source models with a 128k context window. Tests with GLM5.2 show around 15 tok/s output, usable with quantization, and a 1000W power draw. A choice that reignites the debate over on-pre...

#Hardware #LLM On-Premise #DevOps

2026-06-28 LocalLLaMA

Ornith-1.0-35B GGUF: Native MTP Graft Boosts Local Decoding by 35%

An experimental update for Ornith-1.0-35B introduces native MTP speculative decoding, achieving 233.8 tok/s on a single GPU with llama.cpp – a 35% boost – while preserving byte-identical next-token distribution to the target model. Comprehensive benc...

#Hardware #LLM On-Premise
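
The claim that speculative decoding leaves the target model's output unchanged can be illustrated with a toy sketch of the greedy case. This is not llama.cpp's implementation or Ornith's MTP heads: `target` and `draft` are hypothetical stand-in functions, and the verification loop calls the target once per token, whereas a real engine scores all drafted positions in a single batched forward pass, which is where the speedup comes from.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: ask the target model for one token at a time."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    seq, out = list(prompt), []
    while len(out) < n:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position in order. Only the
        #    target's own token is ever emitted, so a wrong guess costs
        #    speed, never correctness.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)
            out.append(expected)
            if expected != tok or len(out) >= n:
                break  # first mismatch: discard the rest of the draft
    return out[:n]


# Toy stand-ins (assumptions for illustration, not real models).
def target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft(ctx: List[int]) -> int:
    # Agrees with the target except on every third context length.
    return 0 if len(ctx) % 3 == 0 else target(ctx)


prompt = [1, 2, 3]
assert speculative_decode(target, draft, prompt, 20) == greedy_decode(target, prompt, 20)
```

With sampling instead of greedy decoding, implementations typically use a rejection-sampling acceptance rule to preserve the target distribution; that variant is not shown here.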