Discovering Local Models

A user shared their experience running large language models (LLMs) locally, noting that just one month of experimentation taught them more about how these models work than two years of using cloud-based models.

The experience began with the Qwen2.5 model, where the user immediately ran into context-overflow issues. Resolving them required tuning parameters such as context size, temperature, top-K, and top-P. Switching to Qwen3 (MLX) then highlighted the speed advantage of the Mixture of Experts (MoE) architecture.
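The interplay of those sampling parameters can be sketched with a minimal NumPy implementation (an illustration only; the function name and default values here are hypothetical and not LM Studio's actual API):

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=None):
    """Apply temperature, top-K, then top-P (nucleus) filtering, and sample."""
    rng = rng or np.random.default_rng()
    # Temperature: lower values sharpen the distribution, higher values flatten it
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    # Top-K: keep only the K highest-scoring tokens
    if top_k < len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the surviving tokens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-P: keep the smallest set of tokens whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())
```

Larger context sizes avoid overflow but, as discussed below, cost memory; the sampling knobs only shape which token is drawn at each step.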

Challenges and Technical Insights

The user then deepened their understanding of the linear growth of the KV cache and the need to periodically release the model from memory. Another interesting discovery was that model states are reproducible: replaying the same prompt against a "fresh" instance of the model recreates the same state.
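The linear growth of the KV cache follows directly from the fact that every generated token appends one key and one value vector per layer, per KV head. A back-of-the-envelope estimate (the model dimensions below are hypothetical, chosen to resemble a 7B-class model with grouped-query attention):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    """Estimate KV-cache size: one key and one value vector per token,
    per layer, per KV head (fp16 -> 2 bytes per value)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 cache
for tokens in (2_048, 8_192, 32_768):
    gib = kv_cache_bytes(tokens, 32, 8, 128) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
# -> 0.25 GiB, 1.00 GiB, 4.00 GiB: cost grows linearly with context length
```

This is why releasing the model (or resetting the conversation) reclaims memory: it discards the accumulated cache, and replaying the prompt simply rebuilds it deterministically.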

Currently, the user is experimenting with Qwen3.5 and observes that memory usage does not seem to increase, despite having disabled auto-reset in LM Studio. They are considering building a shared solution for other users, but are concerned about the potential memory consumption of the KV cache.

The user would like LM Studio to include a resource monitor showing token throughput, KV-cache usage, and which experts are activated. Their knowledge covers only the basic transformer architecture, without the MoE optimizations; they are interested in LoRA fine-tuning but unsure whether they have the time for it.
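The core idea of LoRA fine-tuning is small enough to sketch: the pretrained weight matrix W stays frozen, and only a low-rank update (alpha/r) * B @ A is trained. A minimal NumPy illustration (dimensions and initialization scale are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

The appeal for local setups is the parameter count: here A and B hold r * (d_in + d_out) = 1,024 values against 4,096 in W, and the ratio gets far more favorable at real model sizes.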
