A recent study shows that a small language model, Llama 3 8B, can match a much larger one, Llama 3 70B, on question-answering tasks that require reasoning over multiple steps (multi-hop QA).

Experiment Details

Researchers used Graph RAG (KET-RAG) and LightRAG to evaluate the models' capabilities. They found that information retrieval is no longer the main obstacle: the answer is already present in the retrieved context 77% to 91% of the time. The real bottleneck is reasoning: 73% to 84% of wrong answers stem from the model failing to connect the correct pieces of information.
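The error-attribution logic implied by these figures can be sketched as follows: an example counts as a retrieval hit if the gold answer appears in the retrieved context, and a wrong answer despite a hit is attributed to reasoning rather than retrieval. This is a minimal illustration, not the study's actual evaluation code; the record format and helper name are assumptions.

```python
def attribute_errors(records):
    """records: list of dicts with 'context', 'gold', 'predicted' strings."""
    hits = wrong = reasoning_failures = 0
    for r in records:
        # Retrieval hit: gold answer string is present in the context.
        hit = r["gold"].lower() in r["context"].lower()
        hits += hit
        if r["predicted"].strip().lower() != r["gold"].strip().lower():
            wrong += 1
            # Answer was retrievable but the model failed to use it.
            reasoning_failures += hit
    return {
        "answer_in_context_rate": hits / len(records),
        "reasoning_failure_share": reasoning_failures / wrong if wrong else 0.0,
    }

# Toy examples (invented for illustration).
sample = [
    {"context": "Nolan was born in London.", "gold": "London", "predicted": "Paris"},
    {"context": "Inception premiered in 2010.", "gold": "2010", "predicted": "2010"},
    {"context": "No relevant passage retrieved.", "gold": "Tokyo", "predicted": "Osaka"},
]
stats = attribute_errors(sample)
# answer_in_context_rate = 2/3; of the 2 wrong answers, 1 had the gold
# answer in context, so reasoning_failure_share = 0.5
```

Under this scheme, a high answer-in-context rate combined with a high reasoning-failure share is exactly the pattern the study reports.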

Techniques Used

To improve the performance of the smaller model, two techniques were implemented during inference:

  • Structured chain of thought: Decomposing questions into graph query patterns before answering.
  • Compression of the retrieved context: Reducing the context by approximately 60% through graph traversal, without further LLM calls.
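The two techniques above can be sketched together: a multi-hop question is decomposed into graph query patterns, and the retrieved triples are then pruned to those reachable from the pattern's seed entities by plain graph traversal, with no extra LLM calls. Everything here (the toy triples, entity names, and function signatures) is an illustrative assumption, not the study's implementation; in a real system the decomposition step would itself be done by the LLM.

```python
from collections import deque

# Toy knowledge graph as (subject, relation, object) triples, standing in
# for what a Graph RAG pipeline might retrieve.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("Inception", "released_in", "2010"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
]

def decompose(question):
    """Structured chain of thought: turn a multi-hop question into an
    ordered list of graph query patterns ('?x' marks an unknown).
    Hard-coded here; an LLM would produce this in practice."""
    # Pattern for "Where was the director of Inception born?"
    return [("Inception", "directed_by", "?d"),
            ("?d", "born_in", "?x")]

def compress(triples, patterns, max_hops=2):
    """Context compression: keep only triples reachable from the patterns'
    seed entities within max_hops, via BFS over the triple set."""
    seeds = {s for s, _, _ in patterns if not s.startswith("?")}
    keep, seen = [], set(seeds)
    frontier = deque((e, 0) for e in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for s, r, o in triples:
            if s == entity and (s, r, o) not in keep:
                keep.append((s, r, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append((o, depth + 1))
    return keep

patterns = decompose("Where was the director of Inception born?")
context = compress(TRIPLES, patterns)
# Only triples on the Inception -> Nolan -> London neighborhood survive;
# the unrelated Interstellar triple is dropped.
```

The key property is that the pruning is purely structural, which is why the study's 60% context reduction comes at no additional inference cost.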

Results

Using these techniques, Llama 3 8B matched or exceeded Llama 3 70B on three common benchmarks: HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each), at approximately 12x lower cost (on Groq).

For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these aspects.