Comparing brain, human, and machine perceptual similarity of visual images
Poster Presentation 56.438: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Pavilion
Session: Perceptual Organization: Neural mechanisms, models
Daniel Chong1 (djc473@cornell.edu), Raja Marjieh2, Ilia Sucholutsky3, Nori Jacoby1, Amy Kuceyeski1; 1Cornell University, 2Princeton University, 3New York University
Understanding how humans judge the similarity of visual images is of interest in computer vision, neuroscience, and cognitive science. While some research has shed light on how image-evoked brain activity representations relate to subjective human similarity judgments, a full picture has yet to emerge. In this study, we use a voxel-level brain encoding model to quantify the shared variance between human similarity judgments and the similarity of activity in the brain's visual regions. We analyzed 100 images from the BOLD5000 dataset (4,950 unique image pairs). For each pair, we obtained three measures of similarity: human ratings (0-6 scale, averaged across 3-6 raters), AI models (573 vision models, 64 language models), and predicted brain activity patterns derived from a voxel-wise encoding model. The encoding model allowed us to predict voxel-level fMRI responses and compute pairwise brain similarity within 7 early visual regions (V1d, V1v, V2d, V2v, V3d, V3v, hV4) and 12 late visual regions (FFA1, FFA2, OFA, EBA, FBA2, PPA, RSC, OPA, OWFA, VWFA1, VWFA2, mfswords). We correlated each brain region's similarity structure with human judgments, vision models, and language models. Late visual regions were significantly more correlated with human judgments than early visual regions (mean r=0.36 vs. 0.28; t(17)=4.04, p<.001), with the mid-fusiform sulcus word area, parahippocampal place area, and ventral word form area 2 showing the strongest correlations (r=0.406, 0.401, 0.391; p<.001). Correlations with vision AI models did not differ significantly between early and late regions (mean r=0.179 vs. 0.184; t(17)=0.505, p=0.620), while language AI models were more strongly correlated with late semantic regions than with early visual regions (mean r=0.27 vs. 0.19; t(17)=5.61, p<.001). These results suggest that human similarity perception is grounded in semantic representations that are also found in higher-level visual regions, and that language models mimic this semantic structure better than vision models do.
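The comparison described above follows a representational-similarity-style analysis: pairwise similarity is computed from predicted voxel patterns within a region, then correlated with human ratings over the same image pairs. The sketch below is a minimal illustration of that pipeline, not the authors' code; the random placeholder data, the use of Pearson pattern correlations for brain similarity, and the choice of Spearman rank correlation for the second-order comparison are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from itertools import combinations

rng = np.random.default_rng(0)

# Hypothetical inputs: encoding-model-predicted voxel responses for 100
# images in one ROI (rows = images, columns = voxels), and mean human
# similarity ratings (0-6 scale) for each of the 4,950 unique image pairs.
n_images, n_voxels = 100, 500
predicted_responses = rng.standard_normal((n_images, n_voxels))
pairs = list(combinations(range(n_images), 2))  # 4,950 unique pairs
human_ratings = rng.uniform(0, 6, size=len(pairs))

# Brain similarity for each pair: Pearson correlation between the two
# images' predicted voxel activity patterns within the ROI.
centered = predicted_responses - predicted_responses.mean(axis=1, keepdims=True)
normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
pattern_corr = normed @ normed.T  # image-by-image pattern similarity
brain_similarity = np.array([pattern_corr[i, j] for i, j in pairs])

# Second-order (RSA-style) correlation: how well this ROI's pairwise
# similarity structure tracks the human similarity judgments.
r, p = spearmanr(brain_similarity, human_ratings)
print(f"ROI-human similarity correlation: r={r:.3f}, p={p:.3g}")
```

Repeating this per ROI (and substituting vision- or language-model embedding similarities for the human ratings) would yield one correlation per region per reference, which could then be compared between the early and late region groups with a two-sample t-test, as in the results reported above.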