A recent study shows that a small language model, Llama 3 8B, can match a much larger one, Llama 3 70B, on question-answering tasks that require reasoning over multiple steps (multi-hop QA).

Experiment Details

Researchers used Graph RAG (KET-RAG) and LightRAG to evaluate the models' capabilities. They found that information retrieval is no longer the main obstacle: the answer is already present in the retrieved context 77% to 91% of the time. The real bottleneck is reasoning: 73% to 84% of wrong answers stem from the model failing to connect the correct pieces of information.
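The error-attribution logic implied by these figures can be sketched as follows: an example counts as a retrieval hit if the gold answer appears in the retrieved context, and a wrong answer despite a hit is attributed to reasoning rather than retrieval. This is a minimal illustration, not the study's actual evaluation code; the record format and helper name are assumptions.

```python
def attribute_errors(records):
    """records: list of dicts with 'context', 'gold', 'predicted' strings."""
    hits = wrong = reasoning_failures = 0
    for r in records:
        # Retrieval hit: gold answer string is present in the context.
        hit = r["gold"].lower() in r["context"].lower()
        hits += hit
        if r["predicted"].strip().lower() != r["gold"].strip().lower():
            wrong += 1
            # Answer was retrievable but the model failed to use it.
            reasoning_failures += hit
    return {
        "answer_in_context_rate": hits / len(records),
        "reasoning_failure_share": reasoning_failures / wrong if wrong else 0.0,
    }

# Toy examples (invented for illustration).
sample = [
    {"context": "Nolan was born in London.", "gold": "London", "predicted": "Paris"},
    {"context": "Inception premiered in 2010.", "gold": "2010", "predicted": "2010"},
    {"context": "No relevant passage retrieved.", "gold": "Tokyo", "predicted": "Osaka"},
]
stats = attribute_errors(sample)
# answer_in_context_rate = 2/3; of the 2 wrong answers, 1 had the gold
# answer in context, so reasoning_failure_share = 0.5
```

Under this scheme, a high answer-in-context rate combined with a high reasoning-failure share is exactly the pattern the study reports.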

Techniques Used

To improve the performance of the smaller model, two techniques were implemented during inference:

  • Structured chain of thought: Decomposing questions into graph query patterns before answering.
  • Compression of the retrieved context: Reducing the context by approximately 60% through graph traversal, without further LLM calls.
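The two techniques above can be sketched together: a multi-hop question is decomposed into graph query patterns, and the retrieved triples are then pruned to those reachable from the pattern's seed entities by plain graph traversal, with no extra LLM calls. Everything here (the toy triples, entity names, and function signatures) is an illustrative assumption, not the study's implementation; in a real system the decomposition step would itself be done by the LLM.

```python
from collections import deque

# Toy knowledge graph as (subject, relation, object) triples, standing in
# for what a Graph RAG pipeline might retrieve.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("Inception", "released_in", "2010"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
]

def decompose(question):
    """Structured chain of thought: turn a multi-hop question into an
    ordered list of graph query patterns ('?x' marks an unknown).
    Hard-coded here; an LLM would produce this in practice."""
    # Pattern for "Where was the director of Inception born?"
    return [("Inception", "directed_by", "?d"),
            ("?d", "born_in", "?x")]

def compress(triples, patterns, max_hops=2):
    """Context compression: keep only triples reachable from the patterns'
    seed entities within max_hops, via BFS over the triple set."""
    seeds = {s for s, _, _ in patterns if not s.startswith("?")}
    keep, seen = [], set(seeds)
    frontier = deque((e, 0) for e in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for s, r, o in triples:
            if s == entity and (s, r, o) not in keep:
                keep.append((s, r, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append((o, depth + 1))
    return keep

patterns = decompose("Where was the director of Inception born?")
context = compress(TRIPLES, patterns)
# Only triples on the Inception -> Nolan -> London neighborhood survive;
# the unrelated Interstellar triple is dropped.
```

The key property is that the pruning is purely structural, which is why the study's 60% context reduction comes at no additional inference cost.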

Results

Using these techniques, Llama 3 8B matched or exceeded Llama 3 70B on three common benchmarks: HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each), at approximately 12x lower cost (on Groq).

For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these aspects.