The development of categorical object representations: bridging visual neuroscience and deep learning


Symposium: Friday, May 19, 2023, 2:30 – 4:30 pm, Talk Room 2

Organizers: Marieke Mur¹; ¹Western University
Presenters: Heather L Kosakowski, Michael J Arcaro, Katharina Dobs, Talia Konkle, Marieke Mur

It is well established that object representations in high-level primate visual cortex emphasize categories of ecological relevance such as faces and animals. Yet how these categorical object representations emerge over the course of development is less well understood. The aim of this symposium is to integrate the latest work on visual object learning in the fields of developmental and computational neuroscience. Speakers will address the following key questions: When do category-selective responses emerge during development? What can we learn from the structure of visual experience alone? What constraints may additionally shape the development of categorical object representations in visual cortex? Two of our speakers are pioneers in developmental visual neuroscience, and the other two are at the forefront of developing deep learning models of object vision. Talks will highlight recent experimental work in human and nonhuman primates on developmental trajectories of object category formation in visual cortex as well as recent computational work on the learning objectives that drive object category formation. Heather Kosakowski will discuss her work on awake fMRI in human infants, which shows that category-selective regions for faces, places, and bodies are present in infants as young as 2-9 months of age. She will argue that the early emergence of category-selective regions should be understood within the broader context of cognitive development, which relies on parallel development of brain regions outside of ventral temporal cortex. Michael Arcaro will subsequently discuss his work on the development of the human and nonhuman primate visual systems. His findings suggest that categorical object representations arise from a proto-architecture present at birth that is modified by visual experience to become selectively responsive to frequently encountered elements of the environment. We will then shift from an empirical to a computational focus. 
Katharina Dobs will discuss her work on modeling human face perception with deep artificial neural networks. Her results show that networks need both visual experience with faces and training for face identification to show behavioral signatures of human face perception. Talia Konkle will subsequently discuss her work on building an integrated empirical-computational framework for understanding how we learn to recognize objects in the world. Her work shows that many artificial neural networks have a similar capacity for brain predictivity, including fully self-supervised visual systems with no specialized architectures, but that no networks yet capture all the signatures of the data. The fifth and final talk will conclude with a synthesis of the presented work to stimulate discussion among the speakers and audience. Speakers will be allotted 17 minutes for their talks followed by 5 minutes of question time. We will end with a 10-minute general discussion period. Bringing these speakers together will yield fruitful discussions on the current challenges and future directions for bridging developmental and computational approaches to investigating visual object learning.


Parallel development of cortical regions that support higher-level vision and cognition

Heather L Kosakowski¹; ¹Harvard University

After birth, infants’ brains must parse large amounts of sensory input into meaningful signals. Traditional bottom-up, serial models of cortical development suggest that the statistical regularities in sensory input guide development of high-level visual categories. My work shows that infants have face-selective responses in the fusiform face area, scene-selective responses in the parahippocampal place area, and body-selective responses in the extrastriate body area. Thus, under a bottom-up, serial account, 2 to 9 months of visual experience must be sufficient to develop category-selective regions. However, behavioral evidence that infants discriminate complex visual features within days of birth and use abstract knowledge to guide where they look poses a substantial problem for the bottom-up view. For example, shortly after birth infants discriminate familiar faces from visually similar unfamiliar faces and choose which faces to spend more time looking at. Consistent with these results, my recent work indicates that 2- to 4-month-old infants have face-selective responses in superior temporal sulcus and medial prefrontal cortex, regions that support social-emotional cognition in adults. Taken together, these findings are better explained by a parallel model of cortical development than by traditional bottom-up, serial models.

Topographic constraints on visual development

Michael J Arcaro¹; ¹University of Pennsylvania

Primates are remarkably good at recognizing faces and objects in their visual environment, even after just a brief glimpse. How do we develop the neural circuitry that supports such robust perception? The anatomical consistency and apparent modularity of face and object processing regions indicate that intrinsic constraints play an important role in the formation of these brain regions. Yet, the neonate visual system is limited and develops throughout childhood. Here, I will discuss work on the development of the human and nonhuman primate visual systems. These studies demonstrate that regions specialized in the processing of faces are the result of experience acting on an intrinsic but malleable neural architecture. Subcortical and cortical topographic connectivity play a fundamental role, providing an early scaffolding that guides experience-driven modifications. Within the visual system, this connectivity reflects an extensive topographic organization of visual space and shape features that are present at birth. During postnatal development, this proto-architecture is modified by daily experience to become selectively responsive to frequently encountered elements of the environment. Anatomical localization is governed by correspondences between maps of low-level feature selectivity and where in visual space these features are typically viewed. Thus, rather than constituting rigidly pre-specified modules, face and object processing regions instead reflect an architecture that builds on topographic scaffolds to learn and adapt to the regularities of our visual environment.

Using deep neural networks to test possible origins of human face perception

Katharina Dobs¹; ¹Justus-Liebig University Giessen

Human face recognition is highly accurate and exhibits a number of distinctive and well-documented behavioral and neural “signatures”, such as the face-inversion effect, the other-race effect, and neural specialization for faces. How does the remarkable human ability of face recognition arise in development? Is experience with faces required, and if so, what kind of experience? We cannot straightforwardly manipulate visual experience during development in humans, but we can ask what is possible in machines. Here, I will present our work testing whether convolutional neural networks (CNNs) optimized on different tasks with varying visual experience capture key aspects of human face perception. We find that only face-trained – not object-trained or untrained – CNNs achieved human-level performance on face recognition and exhibited behavioral signatures of human face perception. Moreover, these signatures emerged only in CNNs trained for face identification, not in CNNs that were matched in the amount of face experience but trained on a face detection task. Critically, similar to human visual cortex, CNNs trained on both face and object recognition spontaneously segregated themselves into distinct subsystems for each. These results indicate that humanlike face perception abilities and neural characteristics emerge in machines and could in principle arise in humans (through development or evolution or both) after extensive training on real-world face recognition without face-specific predispositions, but that experience with objects alone is not sufficient. I will conclude by discussing how this computational approach offers novel ways to illuminate how and why visual recognition works the way it does.

Leveraging deep neural networks for learnability arguments

Talia Konkle¹, Colin Connell¹, Jacob Prince¹, George Alvarez¹; ¹Harvard University

Deep neural network models are powerful visual representation learners – transforming natural image input into usefully formatted latent spaces. As such, these models give us new inferential purchase on arguments about what is learnable from the experienced visual input, given the inductive biases of different architectural connections and the pressures of different task objectives. I will present our current efforts to collect the models of the machine learning community for opportunistic controlled-rearing experiments, comparing hundreds of models to human brain responses to thousands of images using billions of regressions. Surprisingly, we find that many models have a similar capacity for brain predictivity – including fully self-supervised visual systems with no specialized architectures, which learn only from the structure in the visual input. These results provide computational plausibility for an origin story in which domain-general, experience-dependent learning mechanisms guide visual representation, without requiring specialized architectures or domain-specialized category learning mechanisms. At the same time, no models capture all the signatures of the data, inviting testable speculation about what is missing – specified in terms of architectural inductive biases, functional objectives, and distributions of visual experience. Overall, this empirical-computational enterprise brings exciting new leverage on the origins of our ability to recognize objects in the world.
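
To make the idea of regression-based brain predictivity concrete, here is a minimal sketch in Python. It is not the authors' pipeline. The function names (ridge_fit, brain_predictivity), the ridge penalty, the number of folds, and the synthetic "model features" and "voxel responses" are all assumptions chosen for illustration. The sketch fits a cross-validated ridge regression from one model's image features to measured responses, then scores the model by the mean correlation between predicted and held-out responses.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights mapping features X (n, d) to responses Y (n, v)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def brain_predictivity(features, responses, alpha=1.0, n_folds=5, seed=0):
    """Mean cross-validated Pearson r between predicted and measured responses.

    features:  (n_images, n_units) activations from one model layer (hypothetical)
    responses: (n_images, n_voxels) brain responses to the same images (hypothetical)
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    preds = np.zeros_like(responses, dtype=float)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Center with training-set statistics only, to avoid leaking test data.
        x_mean = features[train].mean(axis=0)
        y_mean = responses[train].mean(axis=0)
        W = ridge_fit(features[train] - x_mean, responses[train] - y_mean, alpha)
        preds[test_idx] = (features[test_idx] - x_mean) @ W + y_mean
    # Pearson correlation per voxel, computed over all held-out predictions.
    pc = preds - preds.mean(axis=0)
    rc = responses - responses.mean(axis=0)
    r = (pc * rc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (rc ** 2).sum(axis=0))
    return float(r.mean())


# Synthetic demonstration: one "model" whose features drive the responses,
# and one whose features are unrelated to them.
rng = np.random.default_rng(42)
n_images, n_units, n_voxels = 200, 20, 10
matched_features = rng.normal(size=(n_images, n_units))
unrelated_features = rng.normal(size=(n_images, n_units))
true_weights = rng.normal(size=(n_units, n_voxels))
responses = matched_features @ true_weights + 0.5 * rng.normal(size=(n_images, n_voxels))

score_matched = brain_predictivity(matched_features, responses)
score_unrelated = brain_predictivity(unrelated_features, responses)
print(f"matched model:   r = {score_matched:.3f}")
print(f"unrelated model: r = {score_unrelated:.3f}")
```

In a comparison of the kind described in the abstract, a score like this would be computed separately for each model (and typically each layer and brain region), which is how the number of regressions grows so quickly. Real analyses would also tune the penalty and account for the noise ceiling of the brain data, both of which this sketch omits.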

Bridging visual developmental neuroscience and deep learning: challenges and future directions

Marieke Mur¹; ¹Western University

I will synthesize the work presented in this symposium and provide an outlook for the steps ahead in bridging visual developmental neuroscience and deep learning. I will first paint a picture of the emerging understanding of how categorical object representations in visual cortex arise over the course of development. The answer to this question can be considered to lie on a continuum, with one extreme suggesting that we are born with category-selective cortical modules, and the other extreme suggesting that categorical object representations in visual cortex arise from the structure of visual experience alone. Emerging evidence from both experimental and computational work suggests that the answer lies in between: categorical object representations may arise from an interplay between visual experience and constraints imposed by behavioral pressures as well as inductive biases built into our visual system. This interplay may yield the categorical object representations we see in adults, which emphasize natural categories of ecological relevance such as faces and animals. Deep learning provides a powerful computational framework for putting this hypothesis to the test. For example, unsupervised learning objectives may provide an upper bound on what can be learnt from the structure of visual experience alone. Furthermore, within the deep learning framework, we can selectively turn on constraints during the learning process and examine effects on the learnt object representations. I will end by highlighting challenges and opportunities in realizing the full potential of deep learning as a modeling framework for the development of categorical object representations.
