AI-Powered Feature Extraction from Naturalistic Egocentric Recording and Eye-Tracking

Poster Presentation 43.416: Monday, May 18, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Eye Movements: Natural, complex tasks

Yanbin Xu1, Iba Baig2,1, Kevin Li1, Seiji Cattelain3, Hayato Ono4, Sho Tsuji3,4, Ming Bo Cai1,4; 1University of Miami, 2Northeastern University, 3École normale supérieure, 4The University of Tokyo

Children are excellent “self-directed” learners: they learn tasks efficiently by actively sampling information from a complex environment. Modeling the dynamics of children’s naturalistic information-sampling behaviors, such as eye gaze and hand movements, is critical for elucidating how internal states and endogenous rewards guide active learning. Head-mounted eye tracking combined with third-person camera recordings allows developmental psychologists to simultaneously capture children’s naturalistic interactions with their environment and others, along with their fixation targets. Despite these advances in data collection, extracting interpretable features from large volumes of video and audio data still relies heavily on time-consuming manual annotation. We therefore developed a set of AI-based analysis tools that automate the extraction of psychologically interpretable features from multi-modal recordings: (1) a pipeline that temporally aligns multi-device video recordings with frame-level accuracy using audio synchronization; (2) a graphical user interface for semi-automatic segmentation of objects in egocentric videos from head-mounted eye-trackers; and (3) a pipeline that automatically extracts time courses of participants’ body postures and hand actions from third-person videos using a large vision-language model. The tools are modular and can be applied to diverse data-acquisition setups. We validated the tools on a newly collected dataset of child–parent interactions in a naturalistic memory task, comprising egocentric video with eye tracking from both the child’s and the parent’s perspectives and two third-person video recordings from different angles. The results show that pose inference is near-perfect for adults and approximately 80% accurate for children, while hand-action inference is less accurate, potentially due to occlusion. In conclusion, our tools support moment-by-moment annotation of attended objects, body poses, and actions, enabling analysis of the dynamics of decision-making underlying eye-gaze shifts and body movements during children’s natural activities, including interactions with caregivers.
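
To make pipeline (1) concrete, the sketch below illustrates audio-based temporal alignment by cross-correlating the audio tracks of two devices, a standard approach when recordings share a soundscape. The function name, file paths, and the choice of SciPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def estimate_offset_seconds(ref_wav: str, other_wav: str) -> float:
    """Estimate the start-time offset between two recordings by
    cross-correlating their audio waveforms (same sample rate assumed)."""
    rate_a, a = wavfile.read(ref_wav)
    rate_b, b = wavfile.read(other_wav)
    assert rate_a == rate_b, "resample first if sample rates differ"
    # Collapse to mono and normalize so amplitude differences don't dominate.
    a = a.mean(axis=1) if a.ndim > 1 else a.astype(float)
    b = b.mean(axis=1) if b.ndim > 1 else b.astype(float)
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    # Full cross-correlation; the peak index gives the best-aligning lag.
    xcorr = correlate(a, b, mode="full", method="fft")
    lag = np.argmax(xcorr) - (len(b) - 1)
    return lag / rate_a  # positive => `other_wav` started later than `ref_wav`

# offset = estimate_offset_seconds("cam_ref.wav", "cam_other.wav")
# Shift the second video by round(offset * video_fps) frames to align it.
```

A sub-frame offset can then be rounded to the nearest video frame, which is consistent with the frame-level accuracy the abstract reports.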
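Pipeline (2) pairs a human-in-the-loop GUI with a promptable segmentation model. The abstract does not name the model, so the sketch below assumes Meta's Segment Anything (SAM) as a stand-in: a single click, for instance at the tracked gaze point, prompts a mask proposal for the attended object in an egocentric frame. The checkpoint path and model variant are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint and model variant; substitute whatever is installed.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def segment_at_gaze(frame_rgb: np.ndarray, gaze_xy: tuple[int, int]) -> np.ndarray:
    """Return a boolean mask for the object under the gaze point."""
    predictor.set_image(frame_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([gaze_xy]),
        point_labels=np.array([1]),  # 1 = foreground click
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]  # keep the highest-scoring proposal
```

In a semi-automatic workflow, the annotator accepts, refines, or rejects each proposed mask, which is far faster than drawing object boundaries by hand.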
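Pipeline (3) queries a large vision-language model with sampled third-person frames and a constrained label set to build a label time course. The abstract specifies neither the model nor the prompt, so `query_vlm` below is a hypothetical stand-in for any image-plus-text interface, and the label vocabularies are illustrative.

```python
import json

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper: plug in any vision-language model client here."""
    raise NotImplementedError

POSE_LABELS = ["sitting", "standing", "crouching", "walking"]   # illustrative
ACTION_LABELS = ["reaching", "grasping", "pointing", "idle"]    # illustrative

PROMPT = (
    "You see one frame from a third-person video of a child-parent "
    "interaction. For the child, answer with JSON: "
    f'{{"pose": one of {POSE_LABELS}, "hand_action": one of {ACTION_LABELS}}}.'
)

def label_frames(frame_paths):
    """Query the VLM once per sampled frame to build a label time course."""
    timecourse = []
    for path in frame_paths:
        reply = query_vlm(path, PROMPT)
        try:
            timecourse.append(json.loads(reply))
        except json.JSONDecodeError:
            timecourse.append({"pose": None, "hand_action": None})
    return timecourse
```

Constraining the output to a fixed vocabulary keeps the labels machine-readable and directly comparable against manual annotations, which is how per-category accuracies such as those reported above can be computed.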