Materials, Objects and Perception

Talk Session: Saturday, May 20, 2023, 10:45 am – 12:30 pm, Talk Room 1
Moderator: Bei Xiao, American University

Talk 1, 10:45 am, 22.11

Behaviourally relevant image structure linked with visual sampling and perception of materials

Alexandra C. Schmid1, Matthias Nau1, Chris I. Baker1; 1National Institutes of Health

The perceived material of an object is inherently tied to its environmental affordances, whether we are avoiding a wet, slippery floor or handling a fragile, crystal vase. Moreover, the way we interact with an object also affects how we visually sample it, as our gaze is guided to behaviorally relevant features. We hypothesized that human gaze behavior during object viewing should therefore be guided by the object’s perceived material and, if so, visual sampling of the object should reflect regularities in image structure that perceptually define the material. To test this, we characterised the relationship between human gaze behaviour, image structure, and material perception by combining eye tracking and deep-learning-based gaze predictions for 924 rendered photorealistic object stimuli. These stimuli were complex glossy objects rendered in natural illumination fields with varying reflectance properties, leading to a wide range of material appearances such as plastic, clay, ceramics, fabric, etc. Using DeepGaze IIE to predict fixation patterns on these images, we found that these patterns do indeed differ between stimuli, independent of object shape. This suggests that surface properties affect how we visually sample objects. Further, these differences in gaze patterns correlated with differences in perceived material. Finally, we found that variations in contrast, clarity, size, and colour of specular reflections of the objects predicted differences in both perceived material and gaze behaviour, providing a direct link between image cues, viewing behaviour, and affordance-related information. In a series of follow-up analyses, we then tested these model-based predictions in eye tracking data. 
Collectively, our results support the notion that both object perception and viewing behaviour are shaped by the affordances of the things we look at, and that both are determined by regularities in image structure caused by the complex yet characteristic ways that light is scattered by materials with different surface properties.

Acknowledgements: This research was supported by a Walter Benjamin Fellowship awarded to A.C.S. from the German Research Foundation (DFG)

Talk 2, 11:00 am, 22.12

Shared Representation of Different Material Categories: Transfer Learning of Crystals From Soaps

Chenxi Liao1, Masataka Sawayama2, Bei Xiao1; 1American University, 2The University of Tokyo

We encounter various materials every day (e.g., plastics, soaps, and stones), and often need to estimate attributes of novel materials in new environments. How do humans judge material properties across different categories of materials, which possess common but also distinctive visual characteristics? We hypothesize that the features humans use to perceive materials might be overlapping, although different materials have their own peculiar visual characteristics pertinent to their physical properties. We previously demonstrated that the unsupervised image synthesis model, StyleGAN, can generate perceptually convincing images of translucent objects (e.g., soaps). Here, using transfer learning, we test whether the model pretrained on a soap dataset provides critical information to learn a different material, crystals. Specifically, we transfer the pre-trained StyleGAN from a large soap dataset to a relatively small crystal dataset, via full model fine-tuning. With little training time, we obtain a new generator that synthesizes realistic crystals. The layer-wise latent spaces of both material domains show similar scale-specific semantic representations: early layers represent coarse spatial-scale features (e.g., the object’s contour), middle layers represent mid-level spatial-scale features (e.g., material), and later layers represent fine spatial-scale features (e.g., color). Notably, by swapping the weights between the soap and crystal generators, we find that the multiscale generative processes of the materials (spanning from 4-by-4 to 1024-by-1024 resolution) mostly differ in the coarse spatial-scale convolution layers. Convolution weights at the 32-by-32 resolution generative stage determine the critical difference in the geometry of the two materials. Convolution weights at 64-by-64 resolution decode the common characteristics (e.g., translucency) between the materials.
Moreover, without additional training, we could create new material appearances that have visual features from both training categories. Together, we show that there are overlapping latent image features among distinctive material categories, and that learning features from one material benefits learning a new material.
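
The weight-swapping analysis in this abstract can be pictured with a toy sketch. It assumes each generator is represented as a plain mapping from output resolution to that stage's weights; the names `swap_layers`, `soap`, and `crystal`, and the placeholder weight strings, are hypothetical and are not StyleGAN's actual API or the authors' code.

```python
# Toy illustration of swapping generator stages by resolution.
# The "weights" here are placeholder strings, not real tensors.

RESOLUTIONS = [4, 8, 16, 32, 64, 128, 256, 512, 1024]


def swap_layers(base, donor, resolutions):
    """Return a copy of `base` whose stages at `resolutions` come from `donor`."""
    missing = [res for res in resolutions if res not in donor]
    if missing:
        raise KeyError(f"donor has no stage at resolutions {missing}")
    return {
        res: (donor[res] if res in resolutions else weights)
        for res, weights in base.items()
    }


# Hypothetical stand-ins for the two fine-tuned generators.
soap = {res: f"soap_weights_{res}" for res in RESOLUTIONS}
crystal = {res: f"crystal_weights_{res}" for res in RESOLUTIONS}

# Take the coarse stages (up to 32-by-32) from the crystal generator
# and keep the finer stages from the soap generator.
hybrid = swap_layers(soap, crystal, {4, 8, 16, 32})
```

Because the function builds a new mapping, the two source generators are left unchanged, which makes it easy to try different sets of swapped resolutions.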

Talk 3, 11:15 am, 22.13

Material perception diagnosticity of visual product interaction.

Aaron Kaltenmaier1,2, Maarten Wijntjes1; 1Technical University Delft, 2University College London

Estimating material properties is a key task when judging online apparel. To match expectations with reality, we designed and investigated various visual touch screen interactions that aimed to communicate the multisensorial properties of ‘softness’, ‘thinness’, ‘shininess’ and ‘elasticity’. In the first (lab) experiment, we measured how perceptually distinguishable the fabrics are from each other using the Thurstonian-based metric NDL (Number of Distinguishable Levels). We found that observers agreed well with each other and that the NDL varied between fabric properties: about 4 for softness and approximately 2 for the other three adjectives. Next, we designed a touch screen interaction, inspired by ‘ShoogleIt’, that let participants slide the fabric over a cylindrical obstacle. Thurstonian scaling experiments showed that softness estimations (‘expectations’) were in line with what we found in the lab (‘reality’) but that for the other three adjectives the diagnosticity was less strong. Next, we designed two other interactions that were intended to primarily communicate thinness (by draping the fabric) and elasticity (by stretching the fabric). While thinness diagnosticity increased, elasticity was still difficult to infer. The latter result could be due to the inability to infer exerted force from visual information. The results indicate clearly that there are differences in how successfully various properties can be visually communicated. Softness and thinness can effectively be communicated, but shininess and elasticity cannot.

Talk 4, 11:30 am, 22.14

Measuring Object Recognition Ability: Reliability, Validity, and the Aggregate z-score Approach.

Conor J. R. Smithson1, Jason K. Chow1, Ting-Yun Chang1, Isabel Gauthier1; 1Vanderbilt University

Measurement of domain-general object recognition ability (o) requires the minimisation of domain-specific influences on scores. For this purpose, it is useful to combine multiple tasks which differ in task demands and stimuli. One approach is to model o as a latent variable explaining performance on such a battery of tasks; however, time and sample requirements limit usage of this approach. Alternatively, an aggregate measure of o can be obtained by averaging z-scores from each task. Using data from Sunday et al. (2022), we demonstrate that aggregate scores from just two object recognition tasks with differing stimuli and task demands provide a good approximation (r = .79) of factor scores calculated from a larger confirmatory factor model in which six tasks and three object categories were used. Indeed, some task combinations produced correlations of up to r = .87 with factor scores. We then revise these measures to reduce testing time, and additionally develop an odd-one-out task, using a unique object category on each trial. Greater diversity of task demands and objects should provide more accurate measurement of domain-general ability. To test the reliability and validity of our measures, 163 participants completed our three object recognition tasks on two occasions, spaced one month apart. Providing the first evidence that o is stable over time, our 15-minute aggregate o measure demonstrated good test-retest reliability (r = .77) at this interval, and hierarchical regression showed that the stability of o could not be completely accounted for by intelligence, perceptual speed, and early visual processing. Using structural equation modelling, we show that our measures all load significantly onto the same latent variable, and also demonstrate that as a latent variable, o is highly stable (r = .94) over a month. Our measures are freely available to use, and can be downloaded at

Acknowledgements: This work was supported by the David K. Wilson Chair Research Fund (Vanderbilt University)
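
The aggregate z-score approach named in this talk's abstract can be sketched in a few lines: standardize each task across participants, then average each participant's z-scores. The task names and scores below are hypothetical, and the use of the population standard deviation is an assumption; the abstract does not specify either.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one task's scores across participants (mean 0, SD 1)."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; the source does not say
    if sd == 0:
        raise ValueError("task scores have zero variance")
    return [(v - mu) / sd for v in values]


def aggregate_o(task_scores):
    """Average each participant's z-scores across tasks.

    `task_scores` maps a task name to a list of raw scores,
    one per participant, in the same participant order.
    """
    standardized = [z_scores(scores) for scores in task_scores.values()]
    n = len(standardized[0])
    if any(len(z) != n for z in standardized):
        raise ValueError("every task needs one score per participant")
    return [mean(per_participant) for per_participant in zip(*standardized)]


# Hypothetical data: two tasks on different scales, four participants.
scores = {
    "matching_accuracy": [0.60, 0.70, 0.80, 0.90],
    "odd_one_out_correct": [12, 18, 15, 21],
}
aggregate = aggregate_o(scores)
```

Standardizing first puts tasks with different scales (proportion correct versus a raw count here) on a common footing, so no single task dominates the average.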

Talk 5, 11:45 am, 22.15

The Beholder’s Share: Cross-subject Variability in Responses to Abstract Art

Celia Durkin1,4, Benjamin Peters1,4, Christopher Baldassano1, Eric Kandel2,3,4, Daphna Shohamy1,2,3,4; 1Columbia University Psychology Department, 2Howard Hughes Medical Institute, 3Kavli Institute for Brain Science, 4Zuckerman Mind, Brain Behavior Institute

Subjective experience of art emerges from an interaction between external input, which is shared across individuals, and internal associations, which vary across individuals and give art its personal meaning. In art theory, the Beholder’s Share refers to the contribution a viewer makes to the meaning of a painting by drawing on a set of unique prior experiences. A key tenet of the Beholder’s Share is that a viewer brings more personal meaning to abstract art than to representational art. Here, we interrogate this theory. We reason that more personal meaning brought to a painting should manifest in variability across subjects in neural responses to the same painting. To test this, we scanned participants with fMRI while they viewed abstract or representational paintings. To determine whether subjects respond more subjectively to abstract vs. representational paintings, we measured cross-subject variability in patterns of BOLD activity. We found that abstract paintings elicited more variable patterns of BOLD activity, specifically in regions of the Default Mode Network, but not in low-level visual regions. This pattern is consistent with the idea that abstract paintings evoke more subjective high-level responses despite common visual input. Next, we leveraged neural networks to model how differences in individuals' prior visual experiences could drive the variability in high-level responses to abstract art. We simulated individual differences in visual experience using instances of the same neural network (ResNet50) trained on different visual data sets and compared across-network variability in activations for abstract and representational paintings. We found that representations varied across networks more for abstract paintings than for representational paintings. Complementing the fMRI results, this pattern was found specifically in higher layers of the network. 
Overall, these studies provide insight into a possible neural instantiation of the Beholder’s Share and how it may emerge from individual differences in prior experience.
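
One simple way to quantify cross-subject variability in response patterns is sketched below, assuming variability is defined as one minus the mean pairwise Pearson correlation between subjects' patterns for the same painting. That definition, the function names, and the toy numbers are illustrative assumptions; the abstract does not state the measure the authors used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length response patterns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def cross_subject_variability(patterns):
    """1 - mean pairwise correlation across subjects (assumed measure)."""
    pairs = list(combinations(patterns, 2))
    return 1 - sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: 3 subjects x 4 response units per painting.
representational = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 1.8, 3.2, 3.9],
]
abstract = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [2.0, 4.0, 1.0, 3.0],
]

var_representational = cross_subject_variability(representational)
var_abstract = cross_subject_variability(abstract)
```

The same function could be applied to activations from differently trained network instances in place of subjects, since it only needs one pattern per "viewer".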

Talk 6, 12:00 pm, 22.16

A solution to the ill-posed problem of common factors in vision

Dario Gordillo1, Aline Cretenoud1, Simona Garobbio1, Michael H. Herzog1; 1Laboratory of Psychophysics, Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

Studies investigating individual differences in vision tend to deliver mixed results. Some studies argue for a common factor underlying visual abilities, i.e., a participant performing better in one visual task, compared to another participant, is also assumed to perform better in another visual task. Other studies propose that visual abilities are better explained by several uncorrelated factors, i.e., the performance in one visual task does not necessarily predict performance in another visual task. In the above studies, the data are analyzed with principal component analysis (PCA) or factor analysis (FA). Conclusions are often made based on measures such as the proportion of variance explained by the first component/factor of a PCA/FA. Here, using computer simulations, we demonstrate that we cannot draw conclusions about common factors based on such measures. Further, we show that the numbers of participants and variables strongly influence the results of PCA and FA. Finally, we propose a new tool that tests for common factors. We applied our tool to data from 13 previous studies investigating common factors in vision.
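
The dependence on sample size can be illustrated with a small simulation. This is a minimal sketch, not the authors' simulations or their proposed tool: it draws variables that are uncorrelated by construction (so there is no common factor) and reports the share of variance captured by the first principal component. The function names, sample sizes, and number of runs are assumptions chosen for illustration.

```python
import numpy as np


def first_component_share(data):
    """Share of variance explained by the first PC of standardized data."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues.sum()


def simulate_null_share(n_participants, n_variables, n_runs=200, seed=0):
    """Mean first-component share when variables are truly uncorrelated."""
    rng = np.random.default_rng(seed)
    shares = [
        first_component_share(
            rng.standard_normal((n_participants, n_variables))
        )
        for _ in range(n_runs)
    ]
    return float(np.mean(shares))


# Same number of variables (10), very different numbers of participants.
small_sample = simulate_null_share(n_participants=20, n_variables=10)
large_sample = simulate_null_share(n_participants=2000, n_variables=10)
```

With 10 uncorrelated variables, each component would explain 10% of the variance in the population. In the small sample, sampling noise alone pushes the first component well above that level, so a sizeable first-component share is not by itself evidence for a common factor.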

Acknowledgements: This work was funded by the National Centre of Competence in Research (NCCR) Synapsy financed by the Swiss National Science Foundation under grant 51NF40-185897.

Talk 7, 12:15 pm, 22.17

Toward a computational neuroscience of visual cortex without deep learning

Atlas Kazemian1, Eric Elmoznino1, Michael Bonner1; 1Johns Hopkins University

The performance of convolutional neural networks (CNNs) as representational models of visual cortex is thought to be associated with their optimization on ethologically relevant tasks. Here, we show that this view is incorrect and that there are other architectural and statistical factors that primarily account for their performance. We show this by developing a novel statistically inspired neural network that yields accurate predictions of cortical image representation without the need for optimization on supervised or self-supervised tasks. Our architecture is characterized by a core module of convolutions and max pooling, which can be stacked in a deep hierarchy. An important characteristic of our model is the use of thousands of random filters to sample the high-dimensional space of natural image statistics. These filters can be mapped to cortical responses through a simple linear-regression procedure, which we validate using held-out test data. This statistical-mapping procedure provides an unbiased approach for exploring the tuning properties of higher-level visual neurons without restricting the space of possible filters to those learned on a specific pre-training dataset and task. Remarkably, we find that the model competes with standard supervised CNNs at predicting image-evoked responses in visual cortex in both monkey electrophysiology and human fMRI data but without the need for pre-training, making it orders of magnitude more data-efficient than standard CNNs trained on massive image datasets. Together, our findings reveal a surprisingly parsimonious prescription for the design of high-performance neural network models of cortical representation, and they suggest the intriguing possibility that the computational architecture of the visual cortex could emerge from the replication and elaboration of a core canonical module.
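
The general recipe described in this abstract (untrained random filters, max pooling, and a linear mapping checked on held-out data) can be sketched as follows. This is a single-stage toy under stated assumptions, not the authors' architecture: the filter count, filter size, pooling size, ridge penalty, and all function names are hypothetical, and both the "images" and the "responses" are synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def random_filter_features(images, n_filters=16, size=5, pool=4, seed=0):
    """Apply untrained random filters to images, then max-pool."""
    rng = np.random.default_rng(seed)
    filters = rng.standard_normal((n_filters, size, size))
    # All size x size patches: (n_images, out_h, out_w, size, size).
    patches = sliding_window_view(images, (size, size), axis=(1, 2))
    # Cross-correlation of every patch with every filter.
    maps = np.einsum("nhwij,fij->nfhw", patches, filters)
    n, f, h, w = maps.shape
    h, w = h - h % pool, w - w % pool  # trim so pooling tiles evenly
    tiles = maps[:, :, :h, :w].reshape(n, f, h // pool, pool, w // pool, pool)
    return tiles.max(axis=(3, 5)).reshape(n, -1)


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an intercept (alpha is assumed)."""
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ Yc)
    return weights, mean_x, mean_y


def predict(X, weights, mean_x, mean_y):
    """Map features to predicted responses with the fitted weights."""
    return (X - mean_x) @ weights + mean_y


# Synthetic stand-ins: noise "images" and 10 simulated response channels.
rng = np.random.default_rng(1)
images = rng.standard_normal((400, 16, 16))
features = random_filter_features(images)  # shape (400, 144)
# Responses are simulated as a noisy linear readout of the same features,
# so the held-out score only confirms that the pipeline runs end to end.
readout = rng.standard_normal((features.shape[1], 10))
responses = features @ readout + rng.standard_normal((400, 10))

train, test = slice(0, 300), slice(300, 400)
model = fit_ridge(features[train], responses[train])
predicted = predict(features[test], *model)
scores = [
    np.corrcoef(predicted[:, i], responses[test][:, i])[0, 1]
    for i in range(responses.shape[1])
]
held_out_r = float(np.mean(scores))
```

With real data, `responses` would be replaced by measured responses to the same images, and the held-out correlation would become the quantity of interest. Stacking the convolution and pooling stage several times would approximate the deep hierarchy mentioned in the abstract.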