Primed for Action: How Humans Decode Actions from Almost Nothing

Poster Presentation 53.450: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception, recognition

Filip Rybansky1, Sadegh Rahmani2, Andrew Gilbert2, Frank Guerin2, Anya Hurlbert1, Quoc Vuong1; 1Newcastle University, 2University of Surrey

Why do humans find actions easy to recognise compared to state-of-the-art machines? While machines struggle with naturalistic action videos, people can recognise them from minimal spatial and spatiotemporal information. We previously identified Minimal Recognisable Configurations (MIRCs) in Epic-Kitchens-100 videos that were correctly (Easy) or incorrectly (Hard) categorised by our computer-vision network. High-level (e.g., active object, active hand) and low-level (e.g., orientation, motion) critical features for action recognition were determined from these MIRCs. Here we used eye-tracking to converge on critical features and to establish whether people's eye movements can explain why machines found some actions hard. We further tested whether contextual top-down information affects people's performance and eye movements, as machines predominantly use bottom-up information. Participants (N = 36) were briefly presented with a static prime image (400 ms), then viewed the original or MIRC video and identified the action while eye movements were recorded. Primes were the middle frame of the original video. To progressively reduce contextual information, high-level features were removed from the prime or the prime was phase-scrambled. Normalised Scanpath Saliency (NSS) was used to compare which features best predicted where participants looked. Notably, active objects predicted fixations significantly better than active hands (NSS = 1.60 vs 0.75) and, among low-level features, orientation predicted better than motion (NSS = 1.16 vs 0.53). Predictive ability was higher for all features in Hard than in Easy videos (NSS = 0.69 vs 0.51). For priming, unedited primes produced higher accuracy and shorter First Fixation Durations than phase-scrambled primes when viewing MIRCs (ACC = 0.78 vs 0.57; FFD = 330 vs 299 ms). Our results suggest that egocentric action recognition is dominated by the active object, and by spatial over motion features.
Overall, participants were more likely to fixate the critical features in Hard videos, but could recognise Easy videos even when fixating regions with lower feature activation. Human strategies such as increased attention to critical features could help improve machine action recognition in difficult conditions.
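The NSS metric used above has a standard definition: z-score the saliency (feature activation) map, then average the z-scored values at fixated locations, so NSS > 0 means fixations land on above-average feature activation. A minimal sketch, assuming NumPy arrays of matching shape (the array names and shapes here are illustrative, not from the study):

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixation_mask: np.ndarray) -> float:
    """Normalised Scanpath Saliency: mean z-scored saliency at fixated pixels.

    saliency_map  : 2D array of feature activation values.
    fixation_mask : 2D binary array, 1 where a fixation landed.
    """
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return float(z[fixation_mask.astype(bool)].mean())

# Illustrative example: fixations on the high-activation region give NSS > 0.
saliency = np.zeros((4, 4))
saliency[0, 0] = 1.0                 # single high-activation pixel
fixations = np.zeros((4, 4))
fixations[0, 0] = 1                  # fixation on that pixel
score = nss(saliency, fixations)     # positive: fixation matches the feature
```

A fixation on a low-activation region would instead yield a negative score, which is how feature maps can be ranked by how well they predict gaze.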

Acknowledgements: Leverhulme Trust