Gaze dynamics reveal the protracted development of perceptual segregation of multiple talking faces to solve the multisensory cocktail party problem

Poster Presentation 16.307: Friday, May 15, 2026, 3:45 – 6:00 pm, Banyan Breezeway
Session: Multisensory Processing: Motor

Katia Steinfeld1 (katia.steinfeld@unil.ch), Micah Murray1,2, David Lewkowicz3; 1The Lausanne University Hospital and University of Lausanne, 2The Sense Innovation and Research Center, 3Child Study Center, Yale School of Medicine

Often, we are confronted with several people speaking all at once. Successfully identifying a relevant talker requires us to bind and integrate the audible and visible speech streams of that talker (target) and segregate them from those of competing talkers (distractors). How these multisensory abilities emerge during early development is poorly understood. We used metrics of gaze dynamics derived from information theory to quantify children’s and adults’ ability to extract the perceptual cues necessary to solve the multisensory cocktail party problem (MCPP). While gaze-tracked, children aged 3–7 years (N=172) and adults (N=37) viewed four talking faces and heard an auditory utterance that was either temporally synchronized with one face or desynchronized from all faces. Participants’ task was to identify the “talking” (i.e., audiovisually synchronized) face. Dynamic gaze behavior was quantified using dwell time, stationary gaze entropy, and transition entropy. All children, including 3-year-olds, showed greater dwell time on the synchronized target than on distractors (F(1,172)=18.76, p<.001), indicating an early sensitivity to audiovisual temporal coherence. However, only from 5 years of age did children reliably concentrate gaze on the synchronized target (F(4,172)=12.11, p<.001). Beginning at 6 years, synchronized cues restricted exploration to a subset of distractors (F(4,172)=3.50, p=.009) and produced a more predictable sequence of transitions (F(1,167)=12.34, p<.001). Adults showed the strongest dwell-time preference for the target (F(1,37)=1448.30, p<.001) as well as the greatest reductions in stationary (F(1,37)=434.79, p<.001) and transition entropy (F(1,35)=167.81, p<.001) when cues were synchronized. Together, these findings reveal a qualitative shift in dynamic gaze control from 5 to 6 years, marked by greater concentration of gaze on the target and more structured transitions among distractors.
We interpret this developmental shift as reflecting (a) improved multisensory integration, increasing the perceived salience of the target, and (b) the emergence of task-dependent weighting of salience cues, supporting efficient segregation of multisensory scenes.
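The two entropy metrics named above are standard information-theoretic summaries of a gaze sequence over areas of interest (AOIs, here the four faces). A minimal sketch of how they are commonly computed is below; the exact formulation used in this study is not specified in the abstract, so this assumes the widely used definitions (stationary entropy as the Shannon entropy of the gaze distribution over AOIs, transition entropy as the conditional entropy of the next AOI given the current one), and the example sequences are hypothetical.

```python
import numpy as np

def stationary_gaze_entropy(aoi_seq, n_aois):
    """Shannon entropy (bits) of the gaze distribution over AOIs.
    Lower values indicate gaze concentrated on fewer faces."""
    counts = np.bincount(aoi_seq, minlength=n_aois).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def transition_entropy(aoi_seq, n_aois):
    """Conditional entropy (bits) of first-order gaze transitions:
    H = -sum_i p(i) * sum_j p(j|i) * log2 p(j|i).
    Lower values indicate a more predictable scan path."""
    counts = np.zeros((n_aois, n_aois))
    for a, b in zip(aoi_seq[:-1], aoi_seq[1:]):
        counts[a, b] += 1.0
    row_totals = counts.sum(axis=1)
    p_i = row_totals / row_totals.sum()
    h = 0.0
    for i in range(n_aois):
        if row_totals[i] == 0:
            continue
        p_j = counts[i] / row_totals[i]
        p_j = p_j[p_j > 0]
        h -= p_i[i] * (p_j * np.log2(p_j)).sum()
    return float(h)

# Hypothetical sequences: gaze cycling evenly over four faces is
# maximally entropic; gaze locked on one target has zero entropy.
cycling_seq = np.array([0, 1, 2, 3] * 25)
focused_seq = np.zeros(100, dtype=int)
print(stationary_gaze_entropy(cycling_seq, 4))  # 2.0 bits
print(stationary_gaze_entropy(focused_seq, 4))  # 0.0 bits
```

Under these definitions, the developmental pattern reported above corresponds to decreasing stationary entropy (gaze concentrating on the target) and decreasing transition entropy (more structured transitions) with age when the auditory stream is synchronized.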