Interpretability and the nature of latent representations in deep neural networks
Interpreting the inner workings of deep neural networks is crucial for establishing trust and ensuring model fairness and safety. Mechanistic interpretability aims to understand how models store representations by breaking neural networks down into interpretable units. In contrast, concept-based interpretability is concerned with high-level, human-interpretable concepts, such as textures, shapes, and parts of objects, that are present as directions in a model's latent space, i.e. the lower-dimensional, abstract representation of data within the hidden, or "latent", layers of the model.
In this manuscript, we study interpretability and the nature of representations in neural networks, focusing primarily on computer vision. We explore some limitations of mechanistic and concept-based interpretability methods that make interpreting individual neurons and directions challenging. To address these challenges, we develop methods for locating distinct human-interpretable concepts and study how such concepts relate to each other in activation space. Furthermore, we take a step towards understanding model representations in a way that goes beyond simply locating concepts for mechanistic analyses, and we consider how investigating the entanglement of concepts can aid our understanding and help improve model robustness.
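As an illustrative sketch of the "concepts as directions" view described above, and not of the specific methods developed in this thesis, the snippet below fits a linear probe on hidden-layer activations to recover a direction for a hypothetical "striped" concept. The activation arrays are randomly generated stand-ins for activations that would normally be collected from a trained vision model.

```python
# Minimal sketch (illustrative only): recovering a concept direction in a
# model's latent space with a linear probe. The activation arrays below are
# hypothetical stand-ins for hidden-layer activations of images with and
# without the concept of interest.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed activations (n_samples x n_hidden_units) for images that do and
# do not contain the concept, e.g. a "striped" texture.
acts_with_concept = rng.normal(loc=1.0, size=(200, 512))
acts_without_concept = rng.normal(loc=0.0, size=(200, 512))

X = np.vstack([acts_with_concept, acts_without_concept])
y = np.concatenate([np.ones(200), np.zeros(200)])

# A linear classifier separating the two sets; its (normalised) weight
# vector serves as the concept direction in activation space.
probe = LogisticRegression(max_iter=1000).fit(X, y)
concept_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Projecting a new activation onto this direction scores how strongly the
# concept is expressed in that representation.
new_activation = rng.normal(size=512)
concept_score = float(new_activation @ concept_direction)
print(f"Concept score: {concept_score:.3f}")
```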
Faculty
- Faculty of Science and Engineering
Degree
- Doctoral
First supervisor
- Nikola S. Nikolov
Second supervisor
- David J.P. O'Sullivan
Department or School
- Computer Science & Information Systems