Relative depth discrimination in natural images of paired human body joints

Poster Presentation 23.341: Saturday, May 18, 2024, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Face and Body Perception: Bodies

Jiaqi Liu1, Daniel Kersten1; 1University of Minnesota Twin Cities

Humans can perceive three-dimensional depth from a two-dimensional image. An illustrative example is the ability to recognize body pose and infer the three-dimensional spatial arrangement of joints from an image of a human body. While past studies have indicated that the internal representation of the human figure can constrain depth discrimination for static stimuli, how local and structural information across body parts is integrated to infer depth remains unclear. Here we investigated the human ability to identify the relative depth between pairs of body parts given limited spatial context from natural images. In the experiment, 20 observers viewed a series of body-part pairs, each recognizable above chance and displayed through a circular aperture, and were asked to identify which part was closer to them. We manipulated structural information by varying the spatial relationship between the parts (retained, i.e., original positions, vs. disrupted, i.e., side by side) and the type of body-part pair (same side vs. cross side). Each condition comprised 100 trials, with images sourced from the Leeds Sports Dataset. Human depth judgments were evaluated against ground truth established by the Unite the People dataset. We found that retained spatial relations significantly enhanced the discrimination of relative depth between body parts compared with disrupted spatial relations. Furthermore, depth discrimination accuracy was higher for elbow-elbow pairs than for elbow-wrist pairs. Additionally, an analysis of how the Euclidean distance between parts influences depth discrimination revealed that, in contrast to elbow-elbow pairs, a closer distance between the wrist and elbow yielded higher accuracy, suggesting a potential grouping mechanism between adjacent parts.
Our study underscores that humans efficiently employ both structural knowledge and low- and mid-level grouping cues to infer depth from limited spatial context.

Acknowledgements: This research was funded in part by NIH grant 5R01EY029700, "Towards a Compositional Generative Model of Human Vision."