How Pose, Depth, and Surface-Normal Impact HyperHuman’s Image Quality

24 Nov 2024

Authors:

(1) Xian Liu, Snap Inc., CUHK with Work done during an internship at Snap Inc.;

(2) Jian Ren, Snap Inc. with Corresponding author: [email protected];

(3) Aliaksandr Siarohin, Snap Inc.;

(4) Ivan Skorokhodov, Snap Inc.;

(5) Yanyu Li, Snap Inc.;

(6) Dahua Lin, CUHK;

(7) Xihui Liu, HKU;

(8) Ziwei Liu, NTU;

(9) Sergey Tulyakov, Snap Inc.

Table of Links

Abstract and 1 Introduction

3.2 Latent Structural Diffusion Model

3.3 Structure-Guided Refiner

4 Human Verse Dataset

5 Experiments

5.1 Main Results

5.2 Ablation Study

6 Discussion and References

A Appendix and A.1 Additional Quantitative Results

A.2 More Implementation Details and A.3 More Ablation Study Results

A.4 More User Study Details

A.5 Impact of Random Seed and Model Robustness and A.6 Boarder Impact and Ethical Consideration

A.7 More Comparison Results and A.8 Additional Qualitative Results

A.9 Licenses

A.2 MORE IMPLEMENTATION DETAILS

We report implementation details like training hyper-parameters, and model architecture in Tab. 6.

Table 6: Training Hyper-parameters and Network Architecture in HyperHuman.

A.3 MORE ABLATION STUDY RESULTS

We implement additional ablation study experiments on the second stage Structure-Guided Refiner. Note that due to the training resource limit and the resolution discrepancy between MS-COCO real images (512 × 512) and high-quality renderings (1024 × 1024), we conduct several toy ablation experiments in the lightweight 512 × 512 variant of our model: 1) w/o random dropout, where the all the input conditions are not dropout or masked out during the conditional training stage. 2) Only Text, where not any structural prediction is input to the model and we only use the text prompt as condition. 3) Condition on p, where we only use human pose skeleton p as input condition to the refiner network. 4) Condition on d that uses depth map d as input condition. 5) Condition on n that uses surface-normal n as input condition. And their combinations of 6) Condition on p, d; 7) Condition on p, n; 8) Condition on d, n, to verify the impact of each condition and the necessity of using such multi-level hierarchical structural guidance for fine-grained generation. The results are reported in Tab. 7. We can see that the random dropout conditioning scheme is crucial for more robust training with better image quality, especially in the two-stage generation pipeline. Besides, the structural map/guidance contains geometry and spatial relationship information, which are beneficial to image generation of higher quality. Another interesting phenomenon is that only conditioned on surface-normal n is better than conditioned on both the pose skeleton p and depth map d, which aligns with our intuition that surface-normal conveys rich structural information that mostly cover coarse-level skeleton and depth map, except for the keypoint location and foregroundbackground relationship. Overall, we can conclude from ablation results that: 1) Each condition (i.e., pose skeleton, depth map, and surface-normal) is important for higher-quality and more aligned generation, which validates the necessity of our first-stage Latent Structural Diffusion Model to jointly capture them. 2) The random dropout scheme for robust conditioning can essentially bridge the train-test error accumulation in two-stage pipeline, leading to better image results.

This paper is available on arxiv under CC BY 4.0 DEED license.

← Previous

In-Depth Analysis of Human Image Generation Models

Up Next →

HyperHuman Tops Image Generation Models in User Study