Hand-Engineered Image-Computable Models Can Still Outperform DNNs in V1 Similarity
Talk Presentation 54.12: Tuesday, May 19, 2026, 2:45 – 4:30 pm, Talk Room 1
Session: Theory
Tanish Mendki1, Sudhanshu Srivastava2, Ansh Soni3; 1University of California, Santa Barbara, 2University of California, San Diego, 3University of Pennsylvania
Task-optimized deep neural network models (DNNs) are widely used as models of inferotemporal visual cortex (IT), with early work showing a large jump over previous hand-engineered models [Yamins and DiCarlo, 2014]. However, recent work suggests that over time the performance–alignment relationship has not only plateaued but reversed, with higher-performing models becoming worse models of IT [Linsley et al., 2023]. Here we test whether this reversal extends to earlier cortical regions. We used a 515-image subset of the Natural Scenes Dataset (NSD) and extracted V1 voxel responses from eight subjects, constructing 515 × N matrices of neural data. For each model, we compared latent representations for the same 515 images across all layers and identified the best-fitting “V1 layer” based on subject-averaged alignment scores. We evaluated a broad set of models, including state-of-the-art IT similarity models [Schrimpf et al., 2020], high-performing task-optimized CNNs and ViTs, and models that directly attempt to model V1 [Dapello et al., 2020]. Alongside DNNs, we also tested traditional, hand-crafted models such as HMAX [Riesenhuber and Poggio, 1999]. Three complementary similarity measures were used: (1) representational similarity analysis, (2) pairwise matching between model units and voxels, and (3) linear predictivity via ridge regression. Surprisingly, we find that the most modern models are equal to or worse than HMAX as models of V1, even as they are better models of IT. Furthermore, HMAX is the best model under representational similarity scores, which are sensitive to representational geometry, and is indistinguishable from the best under strict pairwise matching, which is invariant only to permutation. These findings highlight that progress in task performance has not translated into better mechanistic models of V1, and that classical image-computable models should not be treated as obsolete benchmarks.
Re-evaluating hand-engineered approaches, rather than defaulting to DNNs, may be crucial for improving biologically grounded alignment.
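To illustrate two of the similarity measures the abstract names (representational similarity analysis and ridge-regression linear predictivity), the following is a minimal sketch, not the authors' actual analysis pipeline. The array shapes mirror the 515-image design; the feature and voxel matrices, the `alpha` penalty, and the train/test split are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_units, n_voxels = 515, 100, 80

# Hypothetical stand-ins: one model layer's activations and one
# subject's V1 voxel responses to the same 515 images.
model_feats = rng.standard_normal((n_images, n_units))
voxels = rng.standard_normal((n_images, n_voxels))

def rdm(X):
    # Representational dissimilarity matrix: 1 - Pearson r between image rows.
    return 1.0 - np.corrcoef(X)

def rsa_score(X, Y):
    # RSA: correlate the upper triangles of the two RDMs.
    iu = np.triu_indices(X.shape[0], k=1)
    return np.corrcoef(rdm(X)[iu], rdm(Y)[iu])[0, 1]

def ridge_predictivity(X, Y, alpha=1.0, train_frac=0.8):
    # Closed-form ridge regression from model features to voxels;
    # score is the mean held-out correlation across voxels.
    n_train = int(train_frac * X.shape[0])
    Xtr, Xte = X[:n_train], X[n_train:]
    Ytr, Yte = Y[:n_train], Y[n_train:]
    W = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(X.shape[1]), Xtr.T @ Ytr)
    pred = Xte @ W
    r = [np.corrcoef(pred[:, v], Yte[:, v])[0, 1] for v in range(Y.shape[1])]
    return float(np.mean(r))

print("RSA alignment:", rsa_score(model_feats, voxels))
print("Ridge predictivity:", ridge_predictivity(model_feats, voxels))
```

The contrast the abstract draws falls out of these definitions: the RSA score depends only on pairwise distances between images, so it rewards matching representational geometry, whereas ridge predictivity allows an arbitrary linear remapping of model units before comparison.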