1 Shenzhen International Graduate School, Tsinghua University
2 Tencent AI Lab
3 The Hong Kong University of Science and Technology
4 Stanford University
5 The Chinese University of Hong Kong
Given a single reference image, MagicMan can generate dense, high-quality, and consistent human novel view images and normal maps.
Existing works on single-image human reconstruction suffer from weak generalizability due to insufficient training data, or from 3D inconsistencies due to a lack of comprehensive multi-view knowledge.
In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image.
At its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability,
with the parametric SMPL-X model as the 3D body prior to promote 3D awareness.
To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved human reconstruction,
we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views.
Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues.
Last but not least, to address ill-shaped results arising from inaccurate SMPL-X estimation that conflicts with the reference image,
we propose a novel iterative refinement strategy that progressively improves SMPL-X accuracy along with the quality and consistency of the generated multi-view images.
Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent human reconstruction tasks.
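The iterative refinement strategy alternates between two steps: generate novel views conditioned on the current SMPL-X estimate, then re-fit SMPL-X against those views so the geometry guidance better matches the reference. A minimal sketch of that loop follows; `generate_views` and `fit_smplx` are hypothetical callables standing in for the diffusion model and the pose optimizer, not the authors' code.

```python
def iterative_refinement(ref_img, smplx, generate_views, fit_smplx, n_iters=3):
    """Alternate between multi-view generation and SMPL-X re-fitting.

    ref_img:        the single reference image
    smplx:          the initial (possibly inaccurate) SMPL-X estimate
    generate_views: callable (ref_img, smplx) -> novel-view images,
                    i.e. the geometry-conditioned diffusion model
    fit_smplx:      callable (views, smplx) -> refined SMPL-X parameters
    """
    views = None
    for _ in range(n_iters):
        # Generate views using geometry guidance rendered from the
        # current SMPL-X estimate.
        views = generate_views(ref_img, smplx)
        # Re-fit SMPL-X to the generated views, reducing conflicts
        # with the reference image.
        smplx = fit_smplx(views, smplx)
    return views, smplx
```

In practice each step sharpens the other: better-fitting SMPL-X guidance yields more consistent views, and more consistent views yield a better fit.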
Given a single human image, our proposed MagicMan utilizes a pre-trained 2D diffusion model with a 3D human body prior to generate novel view images. First, the reference image is fed into the denoising UNet via a reference UNet, with the viewpoint condition incorporated through camera embeddings. The rendered normal and segmentation maps of the posed SMPL-X mesh corresponding to the reference image are also provided as geometry guidance to facilitate 3D awareness and consistency. To obtain dense and consistent novel view images, we modify the attention module into a more efficient hybrid 1D-3D attention (a) that establishes comprehensive connections across views, and propose a geometry-aware dual branch (b) that generates normal images complementary to the RGB images via geometry cues. Finally, a novel iterative refinement strategy (c), applied only during inference, gradually updates the inaccurate initial SMPL-X estimate together with the synthesized novel view images, substantially reducing ill-shaped results caused by unreliable SMPL-X estimation.
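The hybrid 1D-3D attention above trades off cost and coverage: 1D attention lets tokens at the same spatial location attend across views cheaply, while 3D attention lets every token attend to all tokens of all views. A minimal PyTorch sketch of the two modes is below, assuming multi-view feature maps of shape (B, V, H, W, C); the class and its layout are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class HybridMultiViewAttention(nn.Module):
    """Illustrative cross-view attention in two modes:
    1D: tokens at the same (h, w) attend across the V views only (cheap);
    3D: every token attends to all V*H*W tokens of all views (thorough)."""

    def __init__(self, dim, heads=8, use_3d=False):
        super().__init__()
        self.use_3d = use_3d
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, H, W, C) multi-view feature maps
        B, V, H, W, C = x.shape
        if self.use_3d:
            # 3D attention: flatten views and spatial dims into one sequence.
            tokens = x.reshape(B, V * H * W, C)
            out, _ = self.attn(tokens, tokens, tokens)
            return out.reshape(B, V, H, W, C)
        # 1D attention: group by spatial location, attend over views.
        tokens = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, V, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, H, W, V, C).permute(0, 3, 1, 2, 4)
```

Interleaving the cheap 1D layers with occasional 3D layers is one way such a hybrid can keep dense multi-view generation tractable while still propagating global information.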
Novel view results on in-the-wild data
Novel view results on THuman2.1
Novel view results on CustomHumans
Reconstruction results on in-the-wild data
Reconstruction results on THuman2.1
Reconstruction results on CustomHumans
@misc{he2024magicman,
title={MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement},
author={Xu He and Xiaoyu Li and Di Kang and Jiangnan Ye and Chaopeng Zhang and Liyang Chen and Xiangjun Gao and Han Zhang and Zhiyong Wu and Haolin Zhuang},
year={2024},
eprint={2408.14211},
archivePrefix={arXiv},
primaryClass={cs.CV}
}