IT Duration Coding Reflects Stimulus History, and Video ANNs Qualitatively Approximate This Computation
Talk Presentation 24.27: Saturday, May 16, 2026, 2:30 – 4:15 pm, Talk Room 2
Session: Temporal Processing
Dominique Chuaqui1, Matteo Dunnhofer1,2, Kohitij Kar1; 1York University, 2University of Udine
The ventral stream is frequently modeled as a static object-recognition system, yet accumulating evidence points to meaningful history-dependent computation. Recent work shows that the inferior temporal (IT) cortex, long considered a shape-selective endpoint of the ventral stream, responds differently to static images depending on the preceding stimulus history. Dynamic tasks provide a powerful tool for studying these temporal influences. Here, we therefore used duration estimation as a tractable test case to ask: Does IT compute how long an object persists by integrating temporal history, or would a purely frame-based system, such as a feedforward ANN, yield the same outcome? We further asked whether modern video ANNs provide a better mechanistic hypothesis for IT’s dynamics. To diagnose history dependence, we manipulated temporal coherence without altering the underlying images: objects were shown either in their natural temporal sequence (coherent) or as temporally scrambled versions containing the exact same frames (incoherent). If IT duration estimates rely only on instantaneous object-bearing frames, coherence should not matter; if IT integrates stimulus history, coherent and incoherent videos should yield different duration predictions. We recorded activity from 119 reliable IT neurons (split-half reliability > 0.5) while macaques viewed 200 videos. Linear models trained on IT activity predicted duration significantly more accurately for coherent than for incoherent stimuli (r = 0.68 vs. 0.60, p < 0.001), demonstrating robust history dependence in IT dynamics. We then evaluated two ANN hypothesis classes. Feedforward image models showed no coherence effect (Δ = 0.0067, p = 0.334) and performed below IT for both stimulus types. In contrast, video ANNs, which explicitly integrate temporal context, exhibited a significant coherence advantage (Δ = 0.0677, p = 0.0073; statistically indistinguishable from IT’s: Δ = −0.0087, p = 0.687), though their absolute correlations remained lower than IT’s. Together, these findings show that IT’s duration coding is shaped by stimulus history and that temporally integrated video ANNs partially capture this dependence, highlighting key constraints for future ANNs.
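The core analysis above (a linear readout of duration from IT population activity, evaluated separately on coherent and scrambled videos) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the response model, the ridge readout, and every parameter value (`history_gain`, `noise`, `lam`) are assumptions made for demonstration, and the printed correlations do not reproduce the reported results.

```python
import numpy as np

N_VIDEOS, N_NEURONS = 200, 119                       # matches the abstract's counts

def simulate_population(durations, history_gain, rng, noise=3.0):
    """Synthetic IT-like responses (hypothetical model): each neuron carries a
    duration signal whose strength scales with how much temporal history the
    condition preserves, plus independent Gaussian noise."""
    weights = rng.normal(size=N_NEURONS)
    return (np.outer(durations, weights) * history_gain
            + rng.normal(scale=noise, size=(len(durations), N_NEURONS)))

def decode_r(X, y, lam=10.0, train_frac=0.5):
    """Ridge readout of duration from population responses; returns the
    Pearson r between predicted and true durations on the held-out half."""
    n = int(len(y) * train_frac)
    Xtr, ytr = X[:n], y[:n] - y[:n].mean()           # center targets (no bias term)
    coef = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
    return np.corrcoef(X[n:] @ coef, y[n:])[0, 1]

rng = np.random.default_rng(0)
durations = rng.uniform(0.5, 2.0, N_VIDEOS)          # hypothetical durations (s)

# Coherent videos preserve temporal history (full duration signal); scrambled
# videos contain the same frames but disrupt integration (attenuated signal).
r_coh = decode_r(simulate_population(durations, 1.0, rng), durations)
r_scr = decode_r(simulate_population(durations, 0.5, rng), durations)
print(f"coherent r = {r_coh:.2f}, scrambled r = {r_scr:.2f}")
```

Under this toy generative model, the coherence manipulation leaves the decoder and the frame content fixed and changes only the history-dependent signal strength, so a coherent-versus-scrambled gap in decoding accuracy is the signature of temporal integration.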
Acknowledgements: KK is supported by the Canada Research Chair Program (CRC-2021-00326), SFARI (967073), the Brain Canada Foundation (2023-0259), and NSERC (RGPIN-2024-06223).