Efficient Inverse Graphics with Differentiable Generative Models Explains Trial-level Face Discriminations and Robustness of Face Perception to Unusual Viewing Angles

Poster Presentation 63.417: Wednesday, May 22, 2024, 8:30 am – 12:30 pm, Pavilion
Session: Face and Body Perception: Models

Hakan Yilmaz1, Matthew Muellner1, Joshua B. Tenenbaum2, Katharina Dobs3, Ilker Yildirim1; 1Yale University, 2Massachusetts Institute of Technology, 3Justus-Liebig University Giessen

At a glance, we not only recognize the category or identity of objects, but also perceive their rich three-dimensional (3D) structure. Critically, this richness of perception is not brittle: our percepts may degrade under unusual viewing conditions, but they do so gracefully, remaining far above chance even when the best computer vision systems fail. What renders human perception so distinct, with efficiently inferred, rich representations that are nevertheless robust? Here, we present a new computational architecture of visual perception, Efficient Differentiable Inverse Graphics (EDIG), that integrates discriminative and generative computations to achieve fast and robust inferences of rich 3D scenes. In a bottom-up pass, EDIG uses a discriminatively trained deep neural network (DNN) to initialize a percept by mapping an observed real-world image to its underlying 3D scene. Crucially, EDIG can further refine this initial estimate via iterative, optimization-based inference over a differentiable graphics-based generative model. In a case study of face perception, we train EDIG on a dataset of upright face images to learn to map these images to 3D scenes in a weakly supervised fashion. We also train an architecture-matched DNN with a standard supervised classification objective on the same training dataset. We test EDIG, EDIG's bottom-up component alone, and this supervised alternative on a behavioral dataset of 2AFC identity-matching tasks, with upright and inverted face conditions, consisting of 1560 unique trials per condition. We show that although EDIG and the bottom-up-only alternatives match average human accuracy on upright faces, only EDIG achieves human-level accuracy on inverted faces. Moreover, EDIG explains significantly more variance in trial-level human accuracy than the alternatives. EDIG and humans also match qualitatively: both require extended processing to match inverted faces relative to upright faces. These results suggest that human face perception integrates discriminative and generative computations, and they provide a blueprint for building humanlike perception systems.
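For concreteness, the two-stage inference described above can be sketched in code. The following is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation: the `encoder`, the differentiable `render` function, the pixel-space reconstruction loss, the step count, and the cosine-similarity 2AFC decision rule are all illustrative assumptions about how such a system could be wired together.

```python
import torch
import torch.nn.functional as F

def edig_infer(image, encoder, render, n_steps=50, lr=1e-2):
    """Hypothetical EDIG-style inference: bottom-up initialization,
    then top-down refinement through a differentiable renderer."""
    # Bottom-up pass: a discriminatively trained DNN maps the observed
    # image to an initial estimate of the latent 3D scene parameters.
    with torch.no_grad():
        scene = encoder(image)  # e.g., shape/texture/pose/lighting codes

    # Refinement: iterative, optimization-based inference over a
    # graphics-based generative model (analysis by synthesis). More
    # steps may be needed for harder inputs, e.g., inverted faces,
    # mirroring the "extended processing" the abstract reports.
    scene = scene.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([scene], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        recon = render(scene)            # differentiable rendering
        loss = F.mse_loss(recon, image)  # gradients flow through renderer
        loss.backward()
        opt.step()
    return scene.detach()

def match_2afc(target, cand_a, cand_b, encoder, render):
    """Hypothetical 2AFC identity-matching rule: choose the candidate
    whose inferred scene code is closer to the target's."""
    z_t = edig_infer(target, encoder, render)
    z_a = edig_infer(cand_a, encoder, render)
    z_b = edig_infer(cand_b, encoder, render)
    sim_a = F.cosine_similarity(z_t.flatten(), z_a.flatten(), dim=0)
    sim_b = F.cosine_similarity(z_t.flatten(), z_b.flatten(), dim=0)
    return "A" if sim_a > sim_b else "B"
```

Under these assumptions, the bottom-up-only alternatives correspond to skipping the refinement loop (using `encoder` outputs directly), which is the contrast the behavioral comparison evaluates.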