Primed for Action: How Humans Decode Actions from Almost Nothing

Poster Presentation 53.450: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception, recognition

Filip Rybansky1, Sadegh Rahmani2, Andrew Gilbert2, Frank Guerin2, Anya Hurlbert1, Quoc Vuong1; 1Newcastle University, 2University of Surrey

Why do humans find actions easy to recognise compared to state-of-the-art machines? While machines struggle with naturalistic action videos, people can recognise them from minimal spatial and spatiotemporal information. We previously identified Minimal Recognisable Configurations (MIRCs) in Epic-Kitchens-100 videos that were correctly (Easy) or incorrectly (Hard) categorised by our computer-vision network. High-level (e.g., active object, active hand) and low-level (e.g., orientation, motion) critical features for action recognition were determined from these MIRCs. Here we used eye-tracking to converge on critical features and to establish whether people's eye movements can explain why machines found some actions hard. We further tested whether contextual top-down information affects people's performance and eye movements, as machines predominantly use bottom-up information. Participants (N = 36) were briefly presented with a static prime image (400 ms), then viewed the original or MIRC video and identified the action while eye movements were recorded. Primes were the middle frame of the original video. To progressively reduce contextual information, high-level features were removed from the prime or the prime was phase-scrambled. Normalised Scanpath Saliency (NSS) was used to compare which features best predicted where participants looked. Notably, active objects predicted fixations significantly better than active hands (NSS = 1.60 vs 0.75) and, among low-level features, orientation predicted better than motion (NSS = 1.16 vs 0.53). Predictive ability was higher for all features in Hard than in Easy videos (NSS = 0.69 vs 0.51). For priming, unedited primes produced higher accuracy and shorter First Fixation Durations than phase-scrambled primes when viewing MIRCs (ACC = 0.78 vs 0.57; FFD = 330 vs 299 ms). Our results suggest that egocentric action recognition is dominated by the active object, and by spatial over motion features.
Overall, participants were more likely to fixate the critical features in Hard videos, but could recognise Easy videos even when fixating regions with lower feature activation. Human strategies such as increased attention to critical features could help improve machine action recognition in difficult conditions.
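The NSS metric used above has a standard definition: z-score the saliency (feature activation) map, then average the z-scored values at fixated locations, so NSS > 0 means fixations land on above-average feature activation. A minimal sketch, assuming NumPy arrays of matching shape (the array names and shapes here are illustrative, not from the study):

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixation_mask: np.ndarray) -> float:
    """Normalised Scanpath Saliency: mean z-scored saliency at fixated pixels.

    saliency_map  : 2D array of feature activation values.
    fixation_mask : 2D binary array, 1 where a fixation landed.
    """
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return float(z[fixation_mask.astype(bool)].mean())

# Illustrative example: fixations on the high-activation region give NSS > 0.
saliency = np.zeros((4, 4))
saliency[0, 0] = 1.0                 # single high-activation pixel
fixations = np.zeros((4, 4))
fixations[0, 0] = 1                  # fixation on that pixel
score = nss(saliency, fixations)     # positive: fixation matches the feature
```

A fixation on a low-activation region would instead yield a negative score, which is how feature maps can be ranked by how well they predict gaze.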

Acknowledgements: Leverhulme Trust