Optimizing Prompt Processing Speed for On-Premise LLMs: The Role of Micro-Batching
A recent analysis using `llama.cpp` showed that increasing the physical micro-batch size (`ubatch`) can dramatically improve prompt prefill speed for partially offloaded large language models on consumer GPUs such as the RTX 3090. This approach, while l...
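
For context, the sketch below shows where the micro-batch setting lives in llama.cpp's C API: `n_batch` caps how many tokens are submitted per `llama_decode` call, while `n_ubatch` caps how many are actually pushed through the compute graph at once. This is a minimal illustration, not the configuration from the analysis above; it assumes a recent llama.cpp build that exposes `n_ubatch`, and the model path, offloaded layer count, and batch sizes are placeholder values.

```cpp
// Minimal sketch: raising the physical micro-batch (n_ubatch) for faster prefill.
// All concrete numbers here are illustrative assumptions, not measured settings.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 40;  // partial offload: only some layers fit in 24 GB of VRAM

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 8192;
    cparams.n_batch  = 2048;  // logical batch: max tokens per llama_decode call
    cparams.n_ubatch = 2048;  // physical micro-batch: tokens processed per compute pass
                              // (default 512; larger values trade VRAM for prefill speed)

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... tokenize the prompt and feed it to llama_decode() in n_batch-sized chunks ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The same knobs are exposed on the command-line tools as `--batch-size` (`-b`) and `--ubatch-size` (`-ub`), so the effect on prefill throughput can be explored without writing any code.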