MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

Xu He 1,   Xiaoyu Li 2,   Di Kang 2,   Jiangnan Ye 1,   Chaopeng Zhang 2,  
Liyang Chen 1,   Xiangjun Gao 3,   Han Zhang 4,   Zhiyong Wu 1,5,✉,   Haolin Zhuang 1

1 Shenzhen International Graduate School, Tsinghua University,   2 Tencent AI Lab,   3 The Hong Kong University of Science and Technology,   4 Stanford University,   5 The Chinese University of Hong Kong

Given a single reference image, MagicMan can generate dense, high-quality, and consistent human novel view images and normal maps.

Abstract

Existing works on single-image human reconstruction suffer from weak generalizability caused by insufficient training data, or from 3D inconsistencies stemming from a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. At its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped results arising from inaccurate SMPL-X estimates that conflict with the reference image, we propose a novel iterative refinement strategy, which progressively improves SMPL-X accuracy along with the quality and consistency of the generated multi-view images. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent human reconstruction tasks.
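The trade-off behind the hybrid multi-view attention can be illustrated with a minimal NumPy sketch. This is illustrative only, not the model's actual implementation: the real model uses multi-head attention with learned projections inside a UNet, whereas the code below uses plain single-head scaled dot-product attention to contrast the cheap 1D cross-view pattern (attending across views at each corresponding spatial location) with full 3D attention over all views and spatial tokens jointly.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes: (..., tokens, dim).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def view_attention_1d(x):
    # x: (views, tokens, dim). Attend across views at each spatial location:
    # sequence length is only the number of views, so this is cheap,
    # but it only links corresponding positions across views.
    xt = np.swapaxes(x, 0, 1)                      # (tokens, views, dim)
    return np.swapaxes(attention(xt, xt, xt), 0, 1)

def view_attention_3d(x):
    # Full attention over all views and all spatial tokens jointly:
    # thorough information exchange, but cost grows quadratically
    # in (views * tokens).
    v, n, d = x.shape
    flat = x.reshape(v * n, d)
    return attention(flat, flat, flat).reshape(v, n, d)
```

Interleaving the two patterns, rather than using full 3D attention everywhere, keeps dense multi-view generation tractable while still propagating information between all views.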

Method Overview

Given a single human image, our proposed MagicMan utilizes a pre-trained 2D diffusion model with a 3D human body prior to generate novel view images. First, the reference image is fed into the denoising UNet via a reference UNet, with the viewpoint condition incorporated through camera embeddings. The rendered normal and segmentation maps of the posed SMPL-X mesh corresponding to the reference image are also provided as geometry guidance to promote 3D awareness and consistency. To obtain dense and consistent novel view images, we replace the attention module with a more efficient hybrid 1D-3D attention (a) that establishes comprehensive connections across multiple views, and propose a geometry-aware dual branch (b) that generates normal maps alongside RGB images, providing complementary geometry cues. Last but not least, a novel iterative refinement strategy (c), applied only during inference, gradually updates the initially inaccurate SMPL-X pose estimate together with the synthesized novel view images, substantially reducing the ill-shaped results arising from unreliable SMPL-X estimates.
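The iterative refinement strategy (c) alternates between generating novel views conditioned on the current SMPL-X estimate and re-fitting SMPL-X to those generated views. The loop structure can be sketched as below; `generate_views` and `fit_pose` are hypothetical placeholders standing in for the paper's diffusion sampler and SMPL-X optimization step, not actual APIs, so the sketch only captures the alternation itself.

```python
def iterative_refinement(ref_image, init_pose, generate_views, fit_pose, n_iters=3):
    """Alternate multi-view generation and SMPL-X re-fitting.

    generate_views(ref_image, pose) -> novel views/normals conditioned on pose
    fit_pose(views, pose)           -> pose re-optimized against those views
    Both callables are placeholders for the actual model components.
    """
    pose = init_pose
    views = None
    for _ in range(n_iters):
        # Denoise novel views using the current SMPL-X geometry guidance.
        views = generate_views(ref_image, pose)
        # Re-fit SMPL-X so its rendering agrees with the generated views.
        pose = fit_pose(views, pose)
    return views, pose
```

Because each pass improves both the geometry guidance and the generated views, errors from the initial SMPL-X estimate are progressively reduced rather than baked into the final reconstruction.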

Comparisons of Novel View Synthesis

Novel view results on in-the-wild data

Novel view results on THuman2.1

Novel view results on CustomHumans

Comparisons of 3D Human Reconstruction

Reconstruction results on in-the-wild data

Reconstruction results on THuman2.1

Reconstruction results on CustomHumans

BibTeX


@misc{he2024magicman,
    title={MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement},
    author={Xu He and Xiaoyu Li and Di Kang and Jiangnan Ye and Chaopeng Zhang and Liyang Chen and Xiangjun Gao and Han Zhang and Zhiyong Wu and Haolin Zhuang},
    year={2024},
    eprint={2408.14211},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}