Local LLMs with Persistent Memory

A new approach allows large language models (LLMs) running locally to retain information learned during user interactions. Unlike traditional systems that rely on retrieval-augmented generation (RAG) or external databases, this method writes knowledge directly into the model's weights.
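The core idea behind weight-level knowledge injection can be illustrated with a minimal sketch. MEMIT itself edits specific MLP projection layers using covariance statistics across many prompts; the toy version below shows only the underlying primitive, a rank-one key-value edit that makes a linear layer map a "key" vector (encoding a fact's subject) to a new "value" vector. All names here are hypothetical, not the project's actual code.

```python
import numpy as np

def inject_fact(W, k, v_target):
    """Rank-one edit: make the layer map key k to v_target.

    W:        (d_out, d_in) weight matrix of an MLP projection
    k:        (d_in,) key vector encoding the fact's subject
    v_target: (d_out,) value vector encoding the new association
    """
    v_current = W @ k
    # Rank-one update so that (W + dW) @ k == v_target exactly
    dW = np.outer(v_target - v_current, k) / (k @ k)
    return W + dW

# Demo with random stand-ins for a layer, a key, and a target value
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
k = rng.normal(size=16)
v = rng.normal(size=8)

W_edited = inject_fact(W, k, v)
```

After the edit, `W_edited @ k` reproduces `v`, which is what allows recall to be immediate: no gradient descent or fine-tuning pass is involved, only a closed-form weight update.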

The system is divided into two main phases: "wake" and "sleep".

  • Wake: During the wake phase, the user interacts normally with the model. The system extracts relevant information from the conversation and injects it into the neural network weights via the MEMIT (Mass-Editing Memory in Transformers) technique. This allows immediate recall of information without the need for additional training.
  • Sleep: When the user activates "sleep" mode (via the /sleep command), the system audits the stored information, refreshing degraded memories and pruning redundant ones. This process applies null-space constraints so that correcting one memory does not damage the others.
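The null-space constraint mentioned in the sleep phase can be sketched as follows: when re-injecting a degraded fact, the update direction is projected onto the null space of the keys belonging to memories that must be preserved, so their outputs are provably unchanged. This is an illustrative toy under the same rank-one model as above; function and variable names are assumptions, not the project's API.

```python
import numpy as np

def nullspace_edit(W, k, v_target, protected_keys):
    """Edit W so the layer maps k -> v_target without disturbing protected keys.

    The rank-one update direction is projected onto the null space of the
    protected keys, so W @ k_p stays unchanged for every protected key k_p.
    """
    K = np.stack(protected_keys, axis=1)               # (d_in, m)
    # Projector onto the orthogonal complement of span(protected_keys)
    P = np.eye(W.shape[1]) - K @ np.linalg.pinv(K)
    k_free = P @ k                                     # part of k outside the protected span
    dW = np.outer(v_target - W @ k, k_free) / (k_free @ k)
    return W + dW

# Demo: consolidate one degraded memory while protecting two others
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
protected = [rng.normal(size=16), rng.normal(size=16)]
k_new = rng.normal(size=16)
v_new = rng.normal(size=8)

W_fixed = nullspace_edit(W, k_new, v_new, protected)
```

Because the update lies in the protected keys' null space, `W_fixed @ k_p == W @ k_p` for every protected key, while the edited key now maps to its refreshed value.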

Implementation and Results

The system has been successfully tested on various hardware configurations and models, including:

  • MacBook Air M3 (8GB) with Llama-3.2-3B-4bit: holds roughly 15 facts, with "sleep" passes taking about 5 minutes.
  • 2x H100 80GB with Llama-3.1-8B: achieves 100% recall of 30 stored facts after the "sleep" phase.
  • 2x H100 80GB with Llama-3.1-70B: achieves 100% recall of 60 facts, with no significant impact on the model's perplexity (PPL).

An interesting finding is that an initial approach based on LoRA (Low-Rank Adaptation) fails completely on large models such as Llama-3.1-70B: RLHF (Reinforcement Learning from Human Feedback) alignment creates a behavioral prior that overrides the knowledge injected through LoRA adapters. The MEMIT-based implementation proved both simpler and more robust.

The system is inspired by the CLS (Complementary Learning Systems) theory from neuroscience, which models the wake phase as rapid hippocampal encoding and the sleep phase as slower consolidation.

The source code is available on GitHub and requires a Mac with an Apple Silicon chip and macOS 14+.