Quantifying the Quality of Shape and Texture Representations in Deep Neural Network Models

Poster Presentation 63.405: Wednesday, May 22, 2024, 8:30 am – 12:30 pm, Pavilion
Session: Object Recognition: Models

Fenil R. Doshi1,2, Talia Konkle1,2, George A. Alvarez1,2; 1Harvard University, 2Kempner Institute for the Study of Natural and Artificial Intelligence

Deep Neural Networks (DNNs) have emerged as leading models of high-level visual processing, but key disparities remain between DNNs and human vision, most notably the models' substantially greater reliance on texture over shape in object recognition. This bias has been quantified with a shape-bias score (Geirhos et al., 2019), in which models are presented with images containing conflicting shape and texture cues, and the number of correct shape decisions is divided by the total number of correct decisions (shape correct + texture correct). This shape-bias metric varies substantially across a broad range of vision models, with more recent models showing more human-like shape bias. However, the metric lends itself to misleading interpretations because it does not take a model's absolute performance into account; for example, a model making just a single correct shape decision, and no correct texture decisions, would have a 100% shape bias, which may incorrectly suggest that the model has a strong shape representation. To address this limitation, we propose a revised metric, the accuracy-corrected shape-bias, defined as the square root of the product of the original shape-bias score and the shape-dependent accuracy (the total proportion of correct decisions guided by shape cues). We show that a randomly initialized AlexNet model (5.75% shape-dependent accuracy) has a high original shape-bias (0.466) but a low accuracy-corrected shape-bias (0.164), better capturing the fact that such models have impoverished shape representations. Moreover, across over 100 trained models, we find that increases in shape-bias are driven by shape enhancement together with equal or greater texture suppression, and that none of the models examined have "strong" shape representations (none exceed 52.5% shape-dependent accuracy). Overall, we find that the gulf between human and DNN shape representations remains much larger than bias scores alone suggest, and that there has been little improvement in shape quality beyond early AlexNet models.
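For clarity, the two metrics can be computed as in the minimal sketch below. This is an illustrative reconstruction from the definitions given above, not the authors' code; the function and variable names are assumptions, and the counts are taken per cue-conflict image set.

    import math

    def shape_bias(n_shape_correct, n_texture_correct):
        # Original shape-bias (Geirhos et al., 2019): correct shape decisions
        # divided by all correct decisions (shape correct + texture correct).
        return n_shape_correct / (n_shape_correct + n_texture_correct)

    def shape_dependent_accuracy(n_shape_correct, n_trials):
        # Proportion of all cue-conflict trials answered correctly via shape cues.
        return n_shape_correct / n_trials

    def accuracy_corrected_shape_bias(n_shape_correct, n_texture_correct, n_trials):
        # Square root of the product of shape-bias and shape-dependent accuracy,
        # i.e., their geometric mean.
        sb = shape_bias(n_shape_correct, n_texture_correct)
        sda = shape_dependent_accuracy(n_shape_correct, n_trials)
        return math.sqrt(sb * sda)

    # Worked example using the abstract's randomly initialized AlexNet numbers:
    # shape-bias = 0.466, shape-dependent accuracy = 0.0575
    print(round(math.sqrt(0.466 * 0.0575), 3))  # 0.164

The geometric mean keeps the revised score on the same 0-1 scale as the original shape-bias while penalizing models whose high bias rests on very few correct shape decisions.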

Acknowledgements: NSF PAC COMP-COG 1946308 to GAA, NSF CAREER BCS-1942438 to TK, Kempner Institute Graduate Fellowship to FRD