# Ghost Engine: Run Llama-3-8B in 3GB VRAM by Generating Weights
## Ghost Engine: Efficient LLM Inference
An engineer has developed Ghost Engine, an inference engine that aims to make local execution of large language models (LLMs) such as Llama-3-8B more efficient. The key idea is to regenerate model weights on the fly during inference, rather than keeping them fully resident in memory, thereby reducing VRAM requirements.
## How it works
Ghost Engine uses a "Predator-Prey" architecture: "predators" are high-precision outlier weights (about 1% of the total), while "prey" are ternary instructions {-1, 0, +1} from which the remaining weights are reconstructed.
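The post does not document Ghost Engine's actual on-disk format, so here is a minimal NumPy sketch of how such a predator-prey split could work: the largest ~1% of weights are kept at high precision ("predators"), the rest are ternarized with a BitNet-style absmean scale ("prey"), and the dense matrix is regenerated at inference time. The function names and the absmean scaling choice are assumptions for illustration, not the project's code.

```python
import numpy as np

def compress(w, outlier_frac=0.01):
    """Hypothetical predator-prey split (illustrative only).
    Returns ternary 'prey' codes plus sparse high-precision outliers."""
    flat = w.ravel().astype(np.float32)
    k = max(1, int(outlier_frac * flat.size))
    predator_idx = np.argsort(np.abs(flat))[-k:]          # top ~1% by magnitude
    predator_val = flat[predator_idx].astype(np.float16)  # kept at high precision
    rest = flat.copy()
    rest[predator_idx] = 0.0
    scale = float(np.mean(np.abs(rest))) or 1.0           # absmean scale (assumed)
    prey = np.clip(np.round(rest / scale), -1, 1).astype(np.int8)  # {-1, 0, +1}
    return prey, scale, predator_idx, predator_val

def reconstruct(prey, scale, predator_idx, predator_val, shape):
    """Regenerate the dense weight matrix on the fly at inference time."""
    w = prey.astype(np.float32) * scale   # decode the ternary instructions
    w[predator_idx] = predator_val        # restore the high-precision outliers
    return w.reshape(shape)

# Round-trip sanity check on a random matrix.
w = np.random.randn(1024, 1024).astype(np.float32)
w_hat = reconstruct(*compress(w), w.shape)
cos = float(w.ravel() @ w_hat.ravel()
            / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
print(f"cosine similarity: {cos:.3f}")
```

The design trade-off this illustrates: ternary codes compress well but lose the heavy tail of the weight distribution, so storing the few largest-magnitude weights exactly recovers most of the fidelity at a small storage cost.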
## Results
Tests on Llama-3-8B show:
* Compression: ~3.0 bits per weight (bpw), 5.33x smaller than FP16 (a back-of-envelope check follows this list).
* Fidelity: 0.915 cosine similarity on Layer 20 (SwiGLU).
* Output Quality: 0.912 similarity on actual inference outputs.
* Correct handling of the SwiGLU architecture.
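The headline figures are internally consistent; a quick arithmetic check, assuming Llama-3-8B's roughly 8.03B parameters:

```python
# Back-of-envelope check of the headline figures (parameter count assumed).
params = 8.03e9                    # Llama-3-8B parameter count, approx.
fp16_gb = params * 16 / 8 / 1e9    # ~16.1 GB at 16 bits per weight
ghost_gb = params * 3.0 / 8 / 1e9  # ~3.0 GB at ~3.0 bits per weight
print(f"{fp16_gb:.1f} GB -> {ghost_gb:.1f} GB, {fp16_gb / ghost_gb:.2f}x")
# 16.1 GB -> 3.0 GB, 5.33x
```

At ~3.0 bpw the 8B model fits in about 3 GB, matching the VRAM figure in the title, and 16 / 3.0 gives the claimed 5.33x ratio over FP16.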
The code is open source (AGPLv3) and available on GitHub. The project is in a preview phase, and the author is seeking collaborators to optimize the decompression kernels for Metal/CUDA in order to reach production speeds.