We introduce MetricHMSR, a framework for recovering metric human meshes and 3D scenes from a single monocular image. Existing methods struggle with metric scale due to monocular depth ambiguity and weak-perspective camera assumptions, and their fully coupled feature representations make it difficult to disentangle local pose from global translation. To address these issues, our human recovery component, MetricHMR, incorporates a bounding camera ray map that provides explicit metric cues, together with a Human Mixture-of-Experts (HumanMoE) that dynamically routes image features to specialized experts for disentangled perception of local human pose and global metric position. Leveraging the recovered metric human as a geometric anchor, MetricHMSR then refines monocular metric depth estimation to achieve more accurate 3D alignment between humans and scenes.
Overall pipeline of MetricHMSR.
MetricHMSR consists of two components: MetricHMR for metric human mesh recovery, and a human-guided metric depth refinement module for scene reconstruction.
Given a cropped human image and the corresponding bounding ray map, the model predicts SMPL pose, shape, and global translation. The recovered metric human is then used as a geometric anchor to refine monocular metric depth and produce more physically consistent human-scene reconstruction.
Bounding ray map. A pixel-aligned camera representation that encodes camera intrinsics, human image position, and the effects of cropping and scaling, providing explicit metric cues for reconstruction.
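The page does not spell out how the ray map is built; as one plausible sketch, each pixel of the resized crop can store the unit camera ray of the full-image pixel it was sampled from, so intrinsics, human position, and crop/scale are all baked into a pixel-aligned map. The function name and the pinhole-intrinsics assumption below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def bounding_ray_map(K, crop_box, out_size):
    """Illustrative pixel-aligned ray map for a resized image crop.

    K:        3x3 full-image pinhole intrinsics.
    crop_box: (x0, y0, w, h) of the human crop in full-image pixels.
    out_size: side length S of the square network input.

    Returns an (S, S, 3) array: each output pixel stores the unit
    camera-frame ray of the full-image location it was sampled from,
    so cropping and rescaling are encoded in the representation.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x0, y0, w, h = crop_box
    S = out_size
    # Map output pixel centres back to full-image coordinates.
    u = x0 + (np.arange(S) + 0.5) * (w / S)
    v = y0 + (np.arange(S) + 0.5) * (h / S)
    uu, vv = np.meshgrid(u, v)                # (S, S) pixel grids
    rays = np.stack([(uu - cx) / fx,          # normalized camera coords
                     (vv - cy) / fy,
                     np.ones_like(uu)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```

Because the rays are expressed in the original camera's frame, the map changes consistently when the crop moves or the focal length changes, which is what gives the network an explicit metric cue.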
Human Mixture-of-Experts (HumanMoE). A Patch MoE and a Global MoE jointly capture patch-level and image-level information, enabling feature-level disentanglement between local pose and global metric position.
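The routing mechanism is not detailed on this page; the sketch below shows a generic top-k mixture-of-experts layer of the kind such a design could build on. The gating scheme, linear experts, and function names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=1):
    """Minimal top-k mixture-of-experts over feature tokens.

    tokens:    (N, D) features -- image patches for a Patch MoE, or a
               single pooled token (N=1) for a Global MoE.
    gate_w:    (D, E) gating weights producing one logit per expert.
    expert_ws: list of E (D, D) expert matrices (linear experts here
               for brevity; real experts would be small MLPs).
    """
    gates = softmax(tokens @ gate_w)            # (N, E) routing weights
    out = np.zeros_like(tokens)
    for n in range(tokens.shape[0]):
        top = np.argsort(gates[n])[-top_k:]     # indices of chosen experts
        w = gates[n, top] / gates[n, top].sum() # renormalize selected gates
        for wi, e in zip(w, top):
            out[n] += wi * (tokens[n] @ expert_ws[e])
    return out
```

The point of such routing is that different experts can specialize, e.g. some on pose-relevant local evidence and others on position-relevant global evidence, giving the feature-level disentanglement described above.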
Human-guided depth refinement. The reconstructed metric human serves as a geometric anchor to refine monocular metric depth, improving human-scene alignment and metric consistency.
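One simple way to use a metric human as a depth anchor, shown below purely as an illustration, is to fit a scale and shift between the estimated depth and the depth rendered from the recovered mesh over human pixels, then apply that correction to the whole map. The actual refinement module is likely learned and more sophisticated than this least-squares sketch.

```python
import numpy as np

def refine_depth(pred_depth, human_depth, human_mask):
    """Illustrative rescaling of monocular depth via a metric human anchor.

    pred_depth:  (H, W) depth map from a monocular estimator.
    human_depth: (H, W) depth rendered from the recovered metric mesh
                 (only values under human_mask are used).
    human_mask:  (H, W) boolean mask of visible human pixels.

    Fits depth* = a * pred + b on human pixels by least squares and
    applies the correction globally, pulling the scene depth into the
    human's metric frame.
    """
    p = pred_depth[human_mask]
    t = human_depth[human_mask]
    A = np.stack([p, np.ones_like(p)], axis=1)   # [pred, 1] design matrix
    (a, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    return a * pred_depth + b
```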
Metric human-scene reconstruction. Qualitative examples of recovered human meshes aligned with reconstructed scenes.
Temporal consistency of MetricHMR. Video results showing stable metric shape and global position over time.
Trajectory consistency. Frame-wise reconstructions yield coherent global motion trajectories across time.
Depth refinement results. Human-guided refinement improves metric scene depth and human-scene alignment.
@InProceedings{Song_2026_CVPR,
    author    = {Song, Chentao and Zhang, He and Yuan, Haolei and Lin, Haozhe and Tao, Jianhua and Zhang, Hongwen and Yu, Tao},
    title     = {MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
}