Topic / Trend Rising

On-Premise AI and Local LLM Deployment

There is a strong push to run large language models on local hardware, from consumer GPUs to enterprise servers, driven by data sovereignty, cost control, and performance. Community tools such as llama.cpp, along with quantization techniques and hybrid NPU/GPU setups, are making self-hosting increasingly viable.

Detected: 2026-06-25 · Updated: 2026-06-25

Related Coverage

2026-06-24 • LocalLLaMA

AMD Strix Halo NPU Now Works with ROCm: Hybrid GPU+NPU for Local LLMs

The AMD Ryzen AI Max+ 395 (Strix Halo) finally gets its NPU running for LLM inference thanks to tools like Lemonade, enabling a hybrid NPU+iGPU mode. This leverages the NPU’s speed for prompt processing while the GPU handles token generation in paral...

#Hardware #LLM On-Premise #DevOps

2026-06-22 • LocalLLaMA

A 35B MoE LLM on a Single RTX 3090: Speed and Quality Within Consumer Reach

With APEX I-Quality and the turbo8 codec, Qwen3.6-35B-A3B hits 137 t/s and 128k context on a single RTX 3090. Tests show the spiritbuun fork matches ik_llama, and the new turbo8/turbo4 cache boosts coherence and throughput. A signal for those evaluat...

#Hardware #LLM On-Premise #DevOps

2026-06-22 • LocalLLaMA

Ling-2.6: From immense to lightning-fast, on-premise finds its way

The new technical report unveils models of up to 1 trillion parameters, but it is the Ling-mini lineage that will excite those betting on limited hardware: 160 tokens per second on 8 GB of VRAM makes a concrete case for self-hosting.

#Hardware #LLM On-Premise #DevOps

2026-06-22 • LocalLLaMA

llama.cpp Boosts Local Inference with Multi-Token Prediction Support for Step Flash Models

A new pull request in the llama.cpp repository extends multi-token prediction (MTP) support, a technique that cuts inference times. Focused on Step3.5/3.7 Flash models, the update marks another step toward efficiency in self-hosted AI, reducing compu...

#Hardware #LLM On-Premise #DevOps

2026-06-21 • LocalLLaMA

The llama.cpp Optimization Guide We Needed: A Year of Experiments Distilled

After 12 months of testing local inference, a developer has published a comprehensive guide to llama.cpp optimization: VRAM fitting, KV cache, MoE models, CPU tuning, and the most common out-of-memory traps. A practical reference for those committed ...

#Hardware #LLM On-Premise #DevOps

2026-06-20 • LocalLLaMA

GLM 5.2 local speeds: 7.8 tokens/sec with six RTX 3090s and 90K context

A Reddit user shared initial local inference metrics for GLM 5.2: running on six RTX 3090s with UD-IQ2_M quantization and a 90K context window, the model generates 7.8 tokens per second. The numbers fuel the debate on what it takes to run large LLMs ...

#Hardware #LLM On-Premise #DevOps
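To put the reported 7.8 tokens per second in perspective, here is a back-of-the-envelope sketch of how long a response would take to generate at that rate. Only the 7.8 t/s figure comes from the post; the response lengths are hypothetical, and the estimate ignores prompt processing time, so treat it as a lower bound rather than a benchmark.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate generation time, assuming a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPORTED_TPS = 7.8  # figure reported in the post for six RTX 3090s

# Hypothetical response lengths, chosen only for illustration.
for num_tokens in (500, 2000, 8000):
    seconds = generation_seconds(num_tokens, REPORTED_TPS)
    print(f"{num_tokens} tokens: about {seconds / 60:.1f} minutes")
```

Because the estimate is linear in the token count, any setting that halves the number of generated tokens also roughly halves the wait.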

2026-06-20 • LocalLLaMA

GLM 5.2: 'max effort' default is a self-hosting killer. Here's the high-level alternative

Moving to GLM 5.2 doubled reasoning tokens and made the model unusable on an old Xeon server (12-hour wait). A technical report shows the 'high level' setting uses less than half the tokens while retaining 98% of coding performance. For self-hosted u...

#Hardware #LLM On-Premise #DevOps

2026-06-20 • LocalLLaMA

Giving local AI agents web access without paid APIs: SearXNG + Scrapling

A self-hosted setup using SearXNG for search and Scrapling with Trafilatura for page extraction gives local agents web access without external API keys, though with clear trade-offs in speed and search quality.

#LLM On-Premise #DevOps
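As a rough illustration of the search half of such a setup, the sketch below queries a SearXNG instance for JSON results and picks the top URLs to hand to a page extractor. It assumes an instance at http://localhost:8080 with JSON output enabled in its settings; the helper names are hypothetical, and the extraction step with Scrapling and Trafilatura is not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a SearXNG instance running locally on this port.
SEARXNG_URL = "http://localhost:8080"


def build_search_url(query: str, base_url: str = SEARXNG_URL) -> str:
    """Build a SearXNG search URL that requests JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"


def top_urls(payload: dict, limit: int = 3) -> list[str]:
    """Return the URLs of the first `limit` results in a JSON response."""
    results = payload.get("results", [])[:limit]
    return [item["url"] for item in results if "url" in item]


def search(query: str, limit: int = 3) -> list[str]:
    """Query the instance and return candidate page URLs for extraction."""
    with urlopen(build_search_url(query), timeout=10) as response:
        return top_urls(json.load(response), limit)
```

An agent would call `search("some query")` and pass each returned URL to the extraction step. Keeping URL construction and response parsing in separate functions makes both easy to check without a running instance.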

2026-06-18 • LocalLLaMA

North Mini Code Goes 4-bit: Now Runs Locally on Mac and via Ollama

The North Mini Code team has released a 4-bit quantized version on Hugging Face that requires around 20 GB of memory. The model now runs on local hardware via Ollama and llama.cpp-based runtimes, and is also available through the OpenRouter API – a move that boost...

#Hardware #LLM On-Premise #DevOps

2026-06-18 • Tom's Hardware

Local AI Challenges the Cloud: Two Mini PCs Process Millions of Tokens and Cut Costs

A new setup shows that large language model (LLM) inference can be moved off the cloud and onto two mini PCs. It processes millions of tokens daily, generating significant savings...

#Hardware #LLM On-Premise #DevOps
