Unsupervised 3D human pose estimation via conditional multi-view ancestral sampling

May 25, 2026

Ryohei Goto

Takuya Fujihashi

Shunsuke Saruwatari

Fumio Okura



Abstract

We propose a method for estimating 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-to-3D lifting of human pose. To this end, we propose conditional multi-view ancestral sampling (cMAS), which optimizes the 3D pose so that its multi-view projections follow the manifold in the 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and human anatomical constraints. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance than state-of-the-art supervised and unsupervised 3D pose estimation methods, including on extreme human poses for which 3D supervision is unavailable.
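
The two conditioning terms mentioned in the abstract (agreement with the given 2D pose, and anatomical plausibility) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the diffusion prior and the ancestral sampling loop are omitted entirely. The function names (`project_views`, `bone_lengths`, `conditioning_loss`), the orthographic camera, the rotation about the vertical axis, and the left/right bone-length symmetry penalty are all assumptions made purely for illustration.

```python
import numpy as np

def project_views(pose3d, azimuths):
    """Project a (J, 3) pose into one 2D pose per azimuth angle (radians).

    Assumption: orthographic cameras placed around the vertical (y) axis.
    Returns an array of shape (V, J, 2).
    """
    views = []
    for a in azimuths:
        c, s = np.cos(a), np.sin(a)
        rot_y = np.array([[c, 0.0, s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0, c]])
        rotated = pose3d @ rot_y.T      # rotate every joint about the y axis
        views.append(rotated[:, :2])    # drop depth (orthographic projection)
    return np.stack(views)

def bone_lengths(pose3d, bones):
    """Length of each bone, given (parent, child) joint index pairs."""
    parents = np.array([p for p, _ in bones])
    children = np.array([c for _, c in bones])
    return np.linalg.norm(pose3d[children] - pose3d[parents], axis=1)

def conditioning_loss(pose3d, observed2d, bones, symmetric_pairs, weight=1.0):
    """Hypothetical conditioning objective: reprojection error plus symmetry.

    - reprojection: the azimuth-0 view should match the observed 2D pose
    - symmetry: paired left/right bones should have equal length
    """
    front = project_views(pose3d, [0.0])[0]
    reprojection = np.mean(np.sum((front - observed2d) ** 2, axis=1))
    lengths = bone_lengths(pose3d, bones)
    left = np.array([l for l, _ in symmetric_pairs])
    right = np.array([r for _, r in symmetric_pairs])
    symmetry = np.mean((lengths[left] - lengths[right]) ** 2)
    return reprojection + weight * symmetry

# Toy skeleton (an assumption): joint 0 is the root, joints 1 and 2 are the hands.
pose = np.array([[0.0, 0.0, 0.0],
                 [-1.0, 0.0, 0.5],
                 [1.0, 0.0, 0.5]])
bones = [(0, 1), (0, 2)]       # root -> left hand, root -> right hand
symmetric_pairs = [(0, 1)]     # bone 0 and bone 1 form a left/right pair
views = project_views(pose, [0.0, np.pi / 2])
loss = conditioning_loss(pose, pose[:, :2], bones, symmetric_pairs)
```

In a full pipeline, a term like this would be combined with a prior on the 2D projections from every view. Here it only shows how a single 3D pose yields several 2D poses, and how the observed view and the skeleton constrain it. Note that changing only the depth of a joint leaves the front-view reprojection term untouched, so in this sketch depth is constrained solely by the symmetry term.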

Type

Conference paper

Publication

In IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

Last updated on May 25, 2026

FG 2026 FG Computer Vision Computer Graphics

Authors

Fumio Okura

Associate Professor
