Lightness Illusions Through AI Eyes: Assessing ConvNet and ViT Concordance with Human Perception

Poster Presentation 53.308: Tuesday, May 21, 2024, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Color, Light and Materials: Lightness, brightness

Jaykishan Patel1,2, Alban Flachot1,2, Javier Vazquez-Corral3,4, Konstantinos George Derpanis1,5, Richard Murray1,2; 1York University, 2Centre for Vision Research, 3Universitat Autònoma de Barcelona (UAB), 4Computer Vision Center, UAB, 5Lassonde School of Engineering

Inferring surface reflectance from luminance images has proven to be a challenge for models of human vision, as many combinations of illumination, reflectance, and 3D shape can create the same luminance image, and traditional models struggle with this deep ambiguity. Recently, convolutional neural networks (CNNs) and vision transformers (ViTs) have proven successful at inferring surface colour in computer vision, and these architectures have the potential to serve as foundational models of lightness and colour perception if they process image information similarly to humans. We trained CNN and ViT backbones, including ResNet18, VGG19, DPT, and custom designs, to infer surface reflectance from luminance images, using a custom dataset of paired luminance and reflectance images generated in Blender. We then used these models to infer surface reflectance from several well-known images that generate strong lightness illusions, including the argyle, Koffka-Adelson, snake, simultaneous contrast, White's, and checkerboard assimilation illusions, as well as their control images. These illusions are often thought to result from the visual system's attempt to infer surface reflectance from ambiguous images using the statistics of natural images, and we hypothesized that networks trained on simple scenes rendered with shading and shadows would be susceptible to similar illusions. We found that all networks did in fact predict illusions in most test images, and predicted stronger illusions in the test images than in the control conditions. The exceptions were the argyle and assimilation illusions, which the models typically failed to predict. Model saliency analysis showed that the networks' outputs depended strongly on pixel information in the shadowed regions of the images. These results support the hypothesis that some lightness phenomena arise from the visual system's use of natural scene statistics to infer reflectance from ambiguous images, and they show the potential of CNNs and other deep learning architectures as starting points for models of human lightness and colour perception.
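The abstract does not give implementation details, but the training setup it describes (regressing a reflectance image from a luminance image, with rendered ground truth) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the toy encoder-decoder, the MSE loss, and the random tensors standing in for the Blender-rendered luminance/reflectance pairs. The actual study used ResNet18, VGG19, DPT, and custom backbones, whose details are not specified here.

```python
# Minimal sketch: image-to-image regression from luminance to reflectance.
# The architecture and loss are illustrative stand-ins, not the authors' models.
import torch
import torch.nn as nn

class ReflectanceNet(nn.Module):
    """Toy encoder-decoder mapping a 1-channel luminance image to a
    1-channel reflectance prediction of the same spatial size."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ReflectanceNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Random placeholder batch standing in for paired luminance/reflectance
# images rendered in Blender.
luminance = torch.rand(8, 1, 128, 128)
reflectance = torch.rand(8, 1, 128, 128)

for step in range(100):
    optimizer.zero_grad()
    pred = model(luminance)
    loss = loss_fn(pred, reflectance)
    loss.backward()
    optimizer.step()
```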
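The saliency analysis is likewise not specified in the abstract; a plain input-gradient map is one common choice and is sketched below, reusing `model` from the sketch above. The probe image, patch coordinates, and gradient-magnitude measure are illustrative assumptions, not the authors' method. In practice the probe would be one of the illusion or control images, and the target patch the region whose perceived lightness the illusion manipulates.

```python
# Sketch of a gradient-based saliency map (one possible method; the
# abstract says only "model saliency analysis"). Assumes `model` from
# the previous sketch.
import torch

# Random probe image standing in for an illusion or control image.
probe = torch.rand(1, 1, 128, 128, requires_grad=True)
pred = model(probe)

# Average predicted reflectance over a hypothetical 8x8 target patch.
target = pred[0, 0, 60:68, 60:68].mean()
target.backward()

# Gradient magnitude per input pixel: large values mark pixels (e.g.,
# in shadowed regions) that strongly influence the prediction.
saliency = probe.grad.abs().squeeze(0).squeeze(0)
```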