Do CNNs Trained on Self-Motion Videos Develop Sensitivity to 1st- and 3rd-order Motion?

Poster Presentation 53.421: Tuesday, May 21, 2024, 8:30 am – 12:30 pm, Pavilion
Session: Motion: Detection

Zhenyu Zhu1, Thomas Serre1, William Warren1; 1Brown University

At least two classes of motion information play a role in locomotor control: 1st-order motion energy, such as moving high-contrast texture, and 3rd-order feature-tracking, such as moving object boundaries (Lu and Sperling 1995). Previous work showed that human heading responses when following a virtual crowd are dominated by 3rd-order motion and only weakly influenced by 1st-order motion, as revealed when surface texture moves in the Same or Opposite direction as object boundaries (the phi illusion) (Zhu and Warren VSS2023). In this project, we test whether units selective for both 1st- and 3rd-order motion emerge in a state-of-the-art Convolutional Neural Network (CNN) model of motion responses in the primate dorsal stream. DorsalNet (Mineault et al. 2021) is a 5-layer CNN trained to estimate self-motion parameters in simulated drone videos. We tested the model’s heading estimates on each of the three virtual crowd displays used in Zhu and Warren’s (VSS2023) human experiments. In the CONTROL display, DorsalNet layers, like humans, show no difference between the Same and Opposite conditions, while responses increase significantly with the number of moving objects in both conditions (Model and Human: p<0.01). In the TEXTURE DISPLACEMENT display, DorsalNet, like humans, shows a significant difference when texture motion is coherent (Same > Opposite; Model and Human: p<0.01), but not when texture motion is incoherent because the displacements are too small or too large. Critically, in the BLURRED BOUNDARIES display, blurring object boundaries reduces the response to 3rd-order motion, which increases the difference between the Same and Opposite conditions in humans (p<0.01) but not in the model. These results demonstrate that DorsalNet has developed a 1st-order motion energy mechanism, which can capture some human heading responses, but not those due to 3rd-order feature-tracking.
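The Same/Opposite manipulation can be illustrated with a toy one-dimensional sketch. It is not the stimuli, DorsalNet, or the analysis from the study. Every name and parameter in it (`make_frames`, `correlator_velocity`, `boundary_velocity`, the Barker-code texture, the speeds) is an assumption made for illustration. A frame-to-frame luminance correlator stands in for a 1st-order motion energy mechanism, and tracking the object's left edge stands in for 3rd-order feature-tracking.

```python
# Toy 1-D illustration (hypothetical; not the stimuli or model from the study).
# A textured "object" moves rightward while its surface texture drifts in the
# Same or the Opposite direction relative to the object boundary.

BARKER13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]  # low-autocorrelation texture
WIDTH, OBJ_START, OBJ_SPEED, N_FRAMES = 40, 10, 1, 6     # assumed toy parameters


def make_frames(texture_speed):
    """Object window moves at OBJ_SPEED; its texture drifts at texture_speed (px/frame)."""
    n = len(BARKER13)
    frames = []
    for t in range(N_FRAMES):
        left = OBJ_START + t * OBJ_SPEED
        frame = [0] * WIDTH  # uniform background carries no contrast
        for x in range(left, left + n):
            frame[x] = BARKER13[(x - t * texture_speed) % n]
        frames.append(frame)
    return frames


def correlator_velocity(frames, max_shift=3):
    """1st-order stand-in: the shift that maximises luminance correlation across frames."""
    def score(shift):
        total = 0
        for a, b in zip(frames, frames[1:]):
            total += sum(a[x] * b[x + shift]
                         for x in range(WIDTH) if 0 <= x + shift < WIDTH)
        return total
    return max(range(-max_shift, max_shift + 1), key=score)


def boundary_velocity(frames):
    """Feature-tracking stand-in: mean per-frame displacement of the object's left edge."""
    edges = [next(x for x, v in enumerate(f) if v != 0) for f in frames]
    return (edges[-1] - edges[0]) / (len(edges) - 1)


for label, tex_speed in [("Same", +1), ("Opposite", -1)]:
    frames = make_frames(tex_speed)
    print(label, correlator_velocity(frames), boundary_velocity(frames))
```

In this toy setting the correlator's estimate follows the texture and reverses sign in the Opposite condition, while the boundary estimate is unchanged. That is the dissociation the Same/Opposite displays are designed to expose.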

Acknowledgements: Funding: NIH R01EY029745, NIH 1S10OD025181, NIH T32MH115895