AutoPsych: Automated Psychophysics for Interpretability and Diversity Benchmarking
Poster Presentation 43.316: Monday, May 18, 2026, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Features, parts
Habon Issa1, Sunny Liu1, David Klindt1; 1Cold Spring Harbor Laboratory
Interpreting neural representations is a fundamental challenge for neuroscience. This is complicated by the fact that single neurons respond to multiple independent stimuli (i.e., they exhibit mixed-selective or superposed representations). In the AI mechanistic interpretability field, population coding methods such as sparse autoencoders (SAEs) have successfully lifted features out of superposition, revealing the hidden feature representations of large neural networks. However, there are no rigorous quantitative metrics to measure the success of SAE architectures against existing alternatives. Human psychophysics experiments are considered the gold-standard method for benchmarking such methods, but their low scalability presents a major bottleneck to feature evaluation. While recent work automates this using large language models (LLMs), these methods are limited to textually describable concepts and can be difficult to generalize. We introduce AutoPsych, a framework for automated psychophysics experiments that replaces human subjects with perceptual similarity models (e.g., DreamSim and LPIPS for vision models) pretrained to mimic human psychophysical judgments. Using this framework, we propose two novel, quantitative benchmarks. The first, an Odd-One-Out (O3) task, measures the interpretability of individual neurons or SAE latents by assessing the perceptual consistency of their maximally activating inputs. The second, cross-O3 (xO3), extends this to measure the diversity of a set of units, providing a proxy for how many distinct concepts they cover. Compared to existing methods, AutoPsych requires no LLM-generated textual descriptions and is readily adaptable to diverse data types including images, text, and biological data. This provides a scalable and generalizable solution for rigorously evaluating progress in neural network interpretability.
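As a rough illustration of the O3 idea (a sketch under stated assumptions, not the authors' implementation), the snippet below scores a single unit: on each trial, two of the unit's maximally activating inputs are paired with a random distractor, and a pretrained perceptual distance model, treated here as a generic `distance(x, y)` callable in the spirit of DreamSim or LPIPS, must identify the distractor as the odd one out. The function name `o3_interpretability` and all parameters are hypothetical.

```python
import random

def o3_interpretability(top_images, distractor_pool, distance, n_trials=100, seed=0):
    """Hypothetical Odd-One-Out (O3)-style consistency score for one unit.

    top_images      : maximally activating inputs for the unit
    distractor_pool : inputs drawn from the rest of the dataset
    distance        : callable (x, y) -> perceptual distance (assumed interface,
                      e.g. a DreamSim- or LPIPS-like model)
    Returns the fraction of trials on which the distractor is judged the
    perceptually most dissimilar item of the triplet.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        a, b = rng.sample(top_images, 2)       # two top-activating inputs
        c = rng.choice(distractor_pool)        # one distractor
        triplet = [a, b, c]
        # Sum each item's distances to the other two; the odd one out is the
        # item with the largest total distance.
        totals = [sum(distance(x, y) for y in triplet if y is not x) for x in triplet]
        if totals.index(max(totals)) == 2:     # index 2 is the distractor
            correct += 1
    return correct / n_trials
```

A high score indicates that the unit's maximally activating inputs are perceptually consistent with one another relative to random inputs; the xO3 diversity benchmark would instead compare top inputs drawn from different units.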
Acknowledgements: This work was supported by US National Institutes of Health grant S10OD028632-01.