Introduction

Neuron explanation tools have become essential for understanding the inner workings of deep neural networks. Despite their empirical success, however, they lack the theoretical foundations needed to guarantee trustworthy and reliable explanations. This work presents a first theoretical analysis of two fundamental challenges: faithfulness and stability.

Faithfulness

Faithfulness refers to how well an identified neuron-concept association represents the concept the neuron actually encodes, and it is essential for accurate and reliable explanations. In this work, we analyze faithfulness by viewing neuron identification as the inverse process of machine learning: rather than fitting a model to reproduce given labels, identification searches over candidate concepts for the one whose labels best reproduce a given neuron's activations.
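As a minimal illustrative sketch of this inverse view (the notation is informal and the symbols a, c, s, and τ are introduced only for this example, not as a formal definition), let a(x) denote the neuron's activation on input x, let c(x) ∈ {0, 1} be a candidate concept label, and let s be a similarity metric such as accuracy or IoU. Identification then selects

\[
\hat{c} \;=\; \arg\max_{c \in \mathcal{C}} \; \mathbb{E}_{x \sim \mathcal{D}}\!\left[ s\big(\mathbf{1}[a(x) > \tau],\, c(x)\big) \right],
\]

where τ is an activation threshold, \(\mathcal{D}\) the data distribution, and \(\mathcal{C}\) the set of candidate concepts. Faithfulness asks how well the selected \(\hat{c}\) captures the concept the neuron truly encodes.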

Stability

Stability refers to the consistency of neuron identifications across different datasets, and it is crucial for replicable and reliable explanations. We quantify the stability of identifications through a proposed bootstrap ensemble procedure and the BE (Bootstrap Explanation) method.
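The following is a minimal sketch of a bootstrap estimate of stability, assuming a generic explain(dataset) routine that returns the identified concept for a fixed neuron; the function names and the agreement-based stability score are illustrative, not the exact procedure used here.

import random
from collections import Counter

def bootstrap_stability(dataset, explain, n_boot=100, seed=0):
    """Estimate how often resampled probing sets yield the same concept.

    dataset : list of probing examples (format assumed for illustration)
    explain : callable mapping a probing dataset to an identified concept
    """
    rng = random.Random(seed)
    concepts = []
    for _ in range(n_boot):
        # Draw a bootstrap sample: resample the probing set with replacement.
        sample = [dataset[rng.randrange(len(dataset))] for _ in range(len(dataset))]
        concepts.append(explain(sample))
    # Stability score: relative frequency of the most frequent identification.
    concept, count = Counter(concepts).most_common(1)[0]
    return concept, count / n_boot

A stability score close to 1 indicates that the identification is insensitive to which probing examples happened to be sampled, while a low score signals that the explanation may not replicate on a new dataset.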

Generalization bounds for similarity metrics

We derive generalization bounds for widely used similarity metrics, including accuracy, AUROC, and IoU. These bounds allow us to guarantee the faithfulness of identifications beyond the finite probing set on which they are computed.
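For intuition, a standard Hoeffding-style bound (shown only as an illustrative special case, not the bounds derived in this work) controls the gap between the empirical similarity \(\hat{s}_n(c)\), computed on n i.i.d. probing examples, and its population value \(s(c)\), whenever the metric is an average of terms bounded in [0, 1]:

\[
\Pr\!\left( \big| \hat{s}_n(c) - s(c) \big| \ge \varepsilon \right) \;\le\; 2\exp\!\left(-2 n \varepsilon^{2}\right).
\]

Accuracy fits this template directly; AUROC and IoU do not decompose as simple per-example averages, which is part of what motivates metric-specific bounds.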

Proposed method

Building on these results, we propose a method that combines the theoretical analysis with a practical implementation to produce neuron explanations that are both trustworthy and stable.

Experiments

Our experiments on both synthetic and real data validate the theoretical results and demonstrate the practicality of our method. This work represents an important step toward trustworthy neuron interpretation.