## Ghost Engine: Efficient LLM Inference

An engineer has developed Ghost Engine, an inference engine that aims to optimize the local execution of large language models (LLMs) such as Llama-3-8B. The key idea is to generate model weights in real time rather than loading them from memory, which reduces VRAM requirements.

## How it works

Ghost Engine uses a "Predator-Prey" architecture:

* "Predators" are high-precision outliers (about 1% of the weights).
* "Prey" are ternary instructions {-1, 0, 1} that reconstruct the rest of the weights.

## Results

Tests on Llama-3-8B show:

* Compression: ~3.0 bits per weight (bpw), 5.33x smaller than FP16.
* Fidelity: 0.915 cosine similarity on layer 20 (SwiGLU).
* Output quality: 0.912 similarity on actual inference outputs.
* Correct handling of the SwiGLU architecture.

The code is open source (AGPLv3) and available on GitHub. The project is in a preview phase and is seeking collaborators to optimize the decompression kernels for Metal/CUDA in order to reach production speeds.
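
To make the split between high-precision outliers and ternary codes more concrete, here is a minimal sketch in plain Python. It is not Ghost Engine's actual format, which the article does not describe. It assumes a generic scheme: the largest-magnitude 1% of weights are kept as-is, and every other weight is replaced by a ternary code times one shared scale. The function names, the mean-magnitude scale, and the 0.5 threshold are all hypothetical choices made for illustration. The last two helpers only reproduce the arithmetic behind the reported figures (16 / 3.0 bits and roughly 8 billion weights at 3.0 bpw).

```python
import math
import random


def compress(weights, outlier_frac=0.01):
    """Split weights into full-precision outliers and ternary codes.

    Assumption: a single shared scale (mean magnitude of the non-outlier
    weights) and a 0.5 * scale zeroing threshold. Both are illustrative.
    """
    n = len(weights)
    k = max(1, round(n * outlier_frac))
    # The k largest-magnitude weights are stored unchanged ("predators").
    by_magnitude = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    outliers = {i: weights[i] for i in by_magnitude[:k]}
    rest = [abs(w) for i, w in enumerate(weights) if i not in outliers]
    scale = sum(rest) / len(rest) if rest else 0.0
    threshold = 0.5 * scale
    # Every other weight becomes -1, 0, or +1 ("prey").
    ternary = []
    for i, w in enumerate(weights):
        if i in outliers or abs(w) < threshold:
            ternary.append(0)
        else:
            ternary.append(1 if w > 0 else -1)
    return outliers, ternary, scale


def reconstruct(outliers, ternary, scale):
    """Rebuild approximate weights from ternary codes, then patch in outliers."""
    rebuilt = [t * scale for t in ternary]
    for i, w in outliers.items():
        rebuilt[i] = w
    return rebuilt


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compression_ratio(bits_per_weight, baseline_bits=16):
    """Size reduction relative to a baseline such as FP16."""
    return baseline_bits / bits_per_weight


def approx_size_gb(n_weights, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_weights * bits_per_weight / 8 / 1e9


# Synthetic Gaussian weights stand in for a real layer (assumption).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
outliers, ternary, scale = compress(weights)
rebuilt = reconstruct(outliers, ternary, scale)

print("outliers kept:", len(outliers))
print("cosine similarity:", round(cosine_similarity(weights, rebuilt), 3))
print("ratio vs FP16 at 3.0 bpw:", round(compression_ratio(3.0), 2))
print("approx size of 8e9 weights at 3.0 bpw (GB):", approx_size_gb(8e9, 3.0))
```

The similarity printed here comes from synthetic data and says nothing about the fidelity figures reported for Llama-3-8B. It only shows how such a metric can be computed by comparing original and reconstructed weights.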

Ghost Engine: Run Llama-3-8B in 3GB VRAM by Generating Weights
