MetricHMSR:
Metric Human Mesh and Scene Recovery from Monocular Images

1Tsinghua University, 2Beijing Normal University, 3Beihang University
* Equal contribution.    † Corresponding author.
MetricHMSR teaser

MetricHMSR jointly reconstructs human pose, metric shape, global position, and scene geometry from a single monocular image.

Abstract

We introduce MetricHMSR, a framework for recovering metric human meshes and 3D scenes from a single monocular image. Existing methods struggle with metric scale due to monocular ambiguity and weak-perspective assumptions, and their fully coupled feature representations make it difficult to disentangle local pose from global translation. To address this, the human-recovery component of our framework, MetricHMR, incorporates a bounding camera ray map that provides explicit metric cues for human reconstruction, together with a Human Mixture-of-Experts (HumanMoE) that dynamically routes image features to specialized experts for disentangled perception of local human pose and global metric position. Leveraging the recovered metric human as a geometric anchor, MetricHMSR further refines monocular metric depth estimation to achieve more accurate 3D alignment between humans and scenes.

Overview

MetricHMSR pipeline

Overall pipeline of MetricHMSR.

MetricHMSR consists of two components: MetricHMR for metric human mesh recovery, and a human-guided metric depth refinement module for scene reconstruction.

Given a cropped human image and the corresponding bounding ray map, the model predicts SMPL pose, shape, and global translation. The recovered metric human is then used as a geometric anchor to refine monocular metric depth and produce more physically consistent human-scene reconstruction.

Key Ideas

Bounding Ray Map

A pixel-aligned camera representation that encodes camera intrinsics, human image position, and the effects of cropping and scaling, providing explicit metric cues for reconstruction.
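The idea above can be sketched as follows: each pixel of the resized crop is mapped back to its full-image coordinates and back-projected through the intrinsics, so the resulting ray map jointly encodes the camera, the crop position, and the scaling. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `(x0, y0, x1, y1)` box convention, and the output resolution are assumptions.

```python
import numpy as np

def bounding_ray_map(K, box, out_size=16):
    """Hypothetical sketch of a pixel-aligned ray map for a human crop.

    K        : 3x3 intrinsics of the full image (assumed known).
    box      : (x0, y0, x1, y1) human bounding box in full-image pixels.
    out_size : side length of the resized crop; each output pixel receives
               the ray of the full-image location it was sampled from, so
               cropping and scaling are encoded implicitly.
    Returns an (out_size, out_size, 3) map of unit ray directions.
    """
    x0, y0, x1, y1 = box
    # Sample positions of the resized crop, in full-image coordinates.
    u = np.linspace(x0, x1, out_size)
    v = np.linspace(y0, y1, out_size)
    uu, vv = np.meshgrid(u, v)
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                       # back-project through K
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # unit directions
    return rays
```

A crop centered on the principal point yields the optical axis at its center pixel, which is what gives the network an explicit, metric-consistent notion of where the person sits in the camera frustum.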

HumanMoE

A Patch MoE and Global MoE jointly capture patch-level and image-level information, enabling feature-level disentanglement between local pose and global metric position.

Human-Guided Depth Refinement

The reconstructed metric human serves as a geometric anchor to refine monocular metric depth, improving human-scene alignment and metric consistency.
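One simple way to realize such anchoring, shown here only as an illustration, is a closed-form least-squares scale that aligns the monocular depth map to the depth rendered from the recovered human mesh over the human's pixels. The paper's refinement may be richer than a single global scale; the function name and mask convention below are assumptions.

```python
import numpy as np

def refine_depth_scale(depth, human_depth, mask):
    """Hypothetical sketch: rescale monocular depth using the metric human.

    depth       : (H, W) monocular metric depth prediction.
    human_depth : (H, W) depth rendered from the recovered metric human mesh.
    mask        : (H, W) boolean map of pixels covered by the human.
    Returns the scale s minimizing || s * depth - human_depth ||^2 over the
    masked pixels, plus the rescaled depth map.
    """
    d = depth[mask]
    h = human_depth[mask]
    s = (d @ h) / (d @ d)   # closed-form least-squares scale
    return s, s * depth
```

Because the human mesh carries absolute metric scale, correcting the scene depth against it pulls the whole reconstruction onto a consistent metric footing, which is what improves human-scene alignment.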

Results

Metric human-scene reconstruction

Metric human-scene reconstruction. Qualitative examples of recovered human meshes aligned with reconstructed scenes.

Temporal consistency of MetricHMR. Video results showing stable metric shape and global position over time.

Trajectory consistency

Trajectory consistency. Frame-wise reconstructions yield coherent global motion trajectories across time.

Depth refinement results

Depth refinement results. Human-guided refinement improves metric scene depth and human-scene alignment.

Video

BibTeX

@InProceedings{Song_2026_CVPR,
    author    = {Song, Chentao and Zhang, He and Yuan, Haolei and Lin, Haozhe and Tao, Jianhua and Zhang, Hongwen and Yu, Tao},
    title     = {MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
}