# Introduction

The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. However, these aggregated metrics can obscure specific areas where a model is weak ("model gaps") as well as imbalanced coverage within the benchmarks themselves ("benchmark gaps").

# Problem

Without an evaluation approach grounded in the model's internal representations, it is difficult to compare models and to pinpoint their strengths and weaknesses. Benchmarks, while useful, do not always align with the areas where a given model most needs evaluation.

# Proposal

We present a method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. The approach uses SAE concept activations to compute saliency-weighted performance scores across benchmark data, grounding evaluation in the model's internal representations and enabling comparison across benchmarks.
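
To make the scoring step concrete, here is a minimal sketch, assuming SAE concept activations are available as a non-negative matrix over benchmark examples and that activation magnitude is used as saliency. The function name `saliency_weighted_scores` and the toy data are illustrative, not part of the described method.

```python
import numpy as np

def saliency_weighted_scores(activations: np.ndarray,
                             correct: np.ndarray,
                             eps: float = 1e-8):
    """Per-concept, saliency-weighted performance scores.

    activations : (n_examples, n_concepts) non-negative SAE concept
                  activations for each benchmark example.
    correct     : (n_examples,) per-example performance, e.g. 1.0 if the
                  model answered correctly, 0.0 otherwise.

    Returns per-concept scores (each example weighted by how strongly the
    concept fires on it) and per-concept coverage (total saliency mass,
    i.e. how much benchmark data touches the concept at all).
    """
    saliency = np.clip(activations, 0.0, None)      # activation strength as saliency
    concept_coverage = saliency.sum(axis=0)          # benchmark coverage per concept
    weighted_correct = saliency.T @ correct          # saliency-weighted successes
    concept_score = weighted_correct / (concept_coverage + eps)
    return concept_score, concept_coverage


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.gamma(shape=1.0, scale=1.0, size=(500, 32))  # toy SAE activations
    acts *= rng.random(acts.shape) < 0.1                    # sparsify
    correct = (rng.random(500) < 0.7).astype(float)         # toy correctness labels

    score, coverage = saliency_weighted_scores(acts, correct)
    print("candidate model gaps:", np.argsort(score)[:5])       # lowest weighted accuracy
    print("candidate benchmark gaps:", np.argsort(coverage)[:5])  # least-covered concepts
```

Under this framing, concepts with low saliency-weighted accuracy point to model gaps, while concepts with little total saliency mass across the benchmark point to benchmark gaps; both readings follow directly from grounding the scores in the SAE's concept activations.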