Evaluation of hybrid deep learning and optimization method for 3D human pose and shape reconstruction in simulated depth images

ABSTRACT
In this paper, we address the problem of capturing both the shape and the pose of a human character using a single depth sensor. Some previous works proposed to fit a parametric generic human template into the depth image, while others developed deep learning (DL) approaches to find the correspondence between depth pixels and vertices of the template. We designed a hybrid approach, combining the advantages of both methods, and conducted extensive experiments on the SURREAL [1] and DFAUST [2] datasets and a subset of AMASS [3]. Results show that this hybrid approach improves pose and shape estimation compared to using DL or model fitting separately. We also evaluated the ability of the DL-based dense correspondence method to segment the background in addition to the body parts. Finally, we evaluated four different methods to perform the model fitting based on a dense correspondence, where the number of available 3D points differs from the number of corresponding template vertices. These two results enabled us to better understand how to combine DL and model fitting, and the potential limits of this approach when dealing with real depth images. Future work could explore the potential of taking temporal information into account, which has proven to increase the accuracy of pose and shape reconstruction based on a single depth or RGB image.

[...] between the image domain and the SMPL parameter space [8]. However, this correspondence problem is complex due to high variation in human poses, shapes, and camera viewpoints [9, 9, 8, 10].

We make the assumption that adapting these RGB-based approaches to depth images is promising. Indeed, using depth instead of RGB images helps to resolve ambiguity from 2D to 3D. With the dissemination of low-cost depth sensors in the con- [...]

In [20], we proposed to combine the advantages of DL-based dense correspondence estimation with a parametric model fitting for the fine tuning of the shape and the pose.
We assumed that it would enhance the accuracy of the pose and shape reconstruction, compared to using them separately. Hence, the two main hypotheses we validated in this paper are:

H1 Depth image background segmentation could be performed jointly with the dense correspondence. Hence, similar to [9, 10], as a first step, we establish dense correspondences via mapping 3D vertices to the color domain. We use a Double-Unet network [21] to obtain this color embedding for each depth pixel. A first U-Net aims at segmenting the depth images into 15 classes (body parts and background), which should help a second U-Net to regress the color embedding for each pixel.

H2 Using dense correspondence as an input of the model fitting algorithm should improve the performance of pose and shape reconstruction. Most of the previous works based on this model fitting used joint position estimation as an input of the optimization, which makes the approach very sensitive to noise and inaccuracies. We assume that using thousands of pixel-to-vertex correspondences instead of 15 joint positions would increase the accuracy of the reconstruction.

For dense correspondence estimation, we trained a neural network to map depth pixels to a low-dimensional canonical template geometry representation (geometry embedding). This representation entails normalized spatial coordinates of the T-pose human SMPL template vertices, in addition to body part segmentation labels. Based on the success of previous works [8, 10], we regress this representation in an image-to-image manner. One of the key ideas is to associate a specific color encoding with the background, to jointly perform body part and background segmentation. This pixel-to-vertex correspondence is next used to optimize the shape and pose parameters of SMPL, inspired by previous works on hands [22,23]. However, the number of available 3D points differs from the number of template vertices in the SMPL model. Hence, we propose in this paper to test several strategies to select the best correspondence between the 3D points and the template vertices.

We compared our method to state-of-the-art methods that handle monocular RGB or depth inputs, on standard human-shape-in-motion datasets, following the experimental setting of [18] and using synthetic (SURREAL), pseudo-real (DanseDB), and real (DFAUST) data. We also provided an in-depth ablative analysis of the various components involved in our method. These first tests are applied to segmented images where the background is suppressed and there are no occlusions with the environment. We then evaluated the ability of this hybrid approach to deal with more complex depth images that include a background. More specifically, we evaluated whether the dense correspondence network based on the geometry embedding can actually segment the background in a specific color.
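The geometry embedding described above (normalized T-pose vertex coordinates used as colors, plus a reserved code for the background) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-axis min-max normalization, the choice of black as the background code, and the `-1` convention for background pixels in the index map are all assumptions made for illustration.

```python
import numpy as np

# Assumed (hypothetical) color reserved for background pixels.
BACKGROUND_COLOR = np.array([0.0, 0.0, 0.0])

def canonical_colors(template_vertices):
    """Map T-pose template vertices (V, 3) to normalized coordinates in [0, 1]^3."""
    v = np.asarray(template_vertices, dtype=float)
    v_min = v.min(axis=0)
    extent = v.max(axis=0) - v_min
    extent[extent == 0] = 1.0  # avoid division by zero on a degenerate axis
    return (v - v_min) / extent

def render_embedding(vertex_index_map, colors):
    """Build an (H, W, 3) regression target from an (H, W) map of vertex indices.

    A value of -1 in the index map marks a background pixel (assumed convention).
    """
    idx = np.asarray(vertex_index_map)
    target = np.tile(BACKGROUND_COLOR, idx.shape + (1,))
    body = idx >= 0
    target[body] = colors[idx[body]]
    return target
```

In such a setup, the image returned by `render_embedding` would serve as the per-pixel regression target of the second network, so that background and body pixels are encoded in the same output. A real implementation would need a background code that cannot collide with any vertex color.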

In the following, we first review previous works most related to our approach in section 2. This two-step approach is presented in section 3. We then present a comparison to previous works in section 4, and perform an extensive evaluation of the approach with non-segmented depth images. Finally, we explore various strategies to take into account the dense correspondence in model fitting (section 5), before concluding.

Fig. 1. Overview of the proposed framework. Our method can predict 3D human shape and pose from an input depth image. A double U-Net network is applied to predict body part segmentation and to regress normalized canonical vertex coordinates. These outputs are used to compute dense correspondence between the input depth pixels and the template geometry via the nearest neighbor in a low-dimensional embedding. We then fit the SMPL model to the input depth by minimizing the distances between vertices and their corresponding depth pixels. The final output is shown on the right-hand side from two viewpoints with the overlaid input depth point cloud.

2. Related works

Human 3D shape reconstruction and pose estimation have generated vast literature. We refer the reader to [24, 25, [...]

[...] shape and pose from a single color image has become possible through several diverse approaches [25]. A family of works leveraged 2D joint information in predicting 3D human pose [33] and shape [34]. Bogo [...] employed adversarial learning by using a generator to predict parameters of SMPL, and a discriminator to distinguish the real mesh instances and the predicted ones. Other deep learning-based methods [37,38,39,40] inferred the 3D body shape or mesh directly from color images using convolutional networks.
Graph CNN method [37] first attached the extracted features [...] applied PointNet++ [46] to learn a representation of each point cloud, and further enforce local smoothness to compute dense correspondences across full or partial human shapes. They used a depth image as input and learned a descriptor for each pixel. Tan et al. [10] learned an embedding from RGB images that follows the geodesic properties of an underlying 3D surface, which enabled the inference of human correspondences.

Recent works proposed an alternative approach to dense correspondence, using an encoder to recover an implicit function of the human body surface based on sparse 3D points, and then fitting an SMPL model [47,48,49]. These methods are able to jointly represent body pose, shape, and clothing geometry, and obtained impressive results, even for fine details on the surface mesh.

In this paper, we explore the limits of coupling dense correspondence and model fitting [20] to handle background segmentation together with image-to-vertex correspondence. We also aim at demonstrating that combining this type of approach with SMPL model fitting should enhance the accuracy of the pose and shape reconstruction. However, this raises the question of finding a good manner to take into account this complex point-to-vertex correspondence in the model fitting method.

3. Our 3D human shape reconstruction from depth images

[...] using the segmentation and correspondence maps. In the second step, a parametric shape model is fitted to the depth image using the resulting correspondence maps (see section 3.3).
where W is a linear blend skinning function with vertex-joint [...]

[...] to-image architecture) as illustrated in Fig. 1 to predict body part segmentation, and to regress normalized mesh colors. These two outputs are concatenated to generate the pixel em-

In this section, we introduce the model-fitting stage of our method. Given an input depth image and the pixel-to-vertex correspondences obtained from the previous stage, we fit the SMPL model to the depth image to recover the human shape and pose parameters of the adapted template mesh. To this end, we minimize the following objective function: [...] where E_D is the data term. The data term minimizes an L2 distance between the 3D point p_i of pixel i (obtained using the intrinsic matrix and the pixel's depth value) and the corresponding vertex v_c(i). This distance is summed over all pixels that belong to the body region Ω ⊂ Γ in the segmentation map: [...] where |Ω| is the total number of pixels in Ω.
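As an illustration of the data term described above, the following sketch back-projects depth pixels with a pinhole camera model and averages the L2 distance between each body pixel's 3D point and its matched vertex. It is only one plausible reading of the description: the exact form of the original equation (for instance, whether the distance is squared) is not recoverable from the text, and the function names, the pinhole model without distortion, and the array layout are assumptions.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to (H, W, 3) camera-space points (pinhole model)."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def data_term(points, body_mask, correspondence, vertices):
    """Mean L2 distance between body pixels' 3D points and their matched vertices.

    points:         (H, W, 3) back-projected 3D points p_i
    body_mask:      (H, W) boolean mask of the body region (Omega)
    correspondence: (H, W) integer map giving the vertex index c(i) for each pixel
    vertices:       (V, 3) posed template vertices
    """
    p = points[body_mask]                    # (|Omega|, 3)
    v = vertices[correspondence[body_mask]]  # (|Omega|, 3)
    return np.linalg.norm(p - v, axis=1).mean()
```

In an actual fitting loop, `vertices` would be produced by the SMPL model from the current shape and pose parameters, and this value would be minimized together with the other terms of the objective.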

For fair comparison to [19,18], the quality of our reconstruction method was assessed using the Mean Average Vertex Error in millimeters (mm), averaged over all testing frames: [...]

[...]mization. To achieve this goal, we tested various strategies. For this specific study, we changed the test conditions, with a smaller dataset compared to the previous section, which en- [...]

[...] a color embedding associated with a forearm could be linked to an arm template vertex, if this is the closest one. The second strategy S_2 is similar to S_1, except that each point is associated with the closest template vertex that has the same body part label. This way, a 3D point with a forearm color embedding cannot be associated with an arm template vertex, even if that vertex is the closest one. [...]

[...]curacy of the segmentation and shape reconstruction. Fig. 3 shows some examples of adding background to 125 test images. In this test, we used the strategy S_4 presented above to deal with the dense correspondence, as it provided us with the best results. We applied the same test conditions as those used to test strategies S_1 to S_4: 10,000 depth images selected with 4 Hz [...]

[...] have shown that the strategy used to associate a template vertex and 3D points has a significant influence on the final human pose and shape reconstruction, after optimization. The best accuracy was obtained when selecting the average of the available points associated with the same template vertex, before model fitting.

As in many previous works, we tested our approach with simulated depth images to accurately control the test conditions and the corresponding ground truth. However, dealing with real depth images provided by depth sensors raises many difficult constraints, such as segmenting the character from the background, denoising the images, dealing with occlusions and clothes, etc.
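The association strategies discussed above can be sketched as follows. The first function covers an unconstrained nearest-vertex search (in the spirit of S_1) and its label-constrained variant (in the spirit of S_2). The second averages all 3D points matched to the same template vertex, as in the best-performing configuration reported above. This is a hedged illustration rather than the authors' code: it assumes that "closest" means closest in embedding space, and all function and parameter names are hypothetical.

```python
import numpy as np

def match_points(point_emb, vertex_emb, point_labels=None, vertex_labels=None):
    """Return, for each point, the index of the nearest template vertex in embedding space.

    Without labels the search is unconstrained; with labels, candidates are
    restricted to vertices sharing the point's body part label.
    """
    p = np.asarray(point_emb, dtype=float)
    v = np.asarray(vertex_emb, dtype=float)
    # Squared distances via broadcasting: (N, 1, D) - (1, V, D) -> (N, V)
    d2 = ((p[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
    if point_labels is not None and vertex_labels is not None:
        same_part = np.asarray(point_labels)[:, None] == np.asarray(vertex_labels)[None, :]
        # Vertices from other body parts are excluded; a point whose label
        # matches no vertex falls back to index 0 and should be filtered upstream.
        d2 = np.where(same_part, d2, np.inf)
    return d2.argmin(axis=1)

def average_points_per_vertex(points, matches):
    """Average all 3D points matched to the same vertex.

    Returns the sorted unique vertex indices and one mean 3D point per index.
    """
    pts = np.asarray(points, dtype=float)
    ids, inverse = np.unique(np.asarray(matches).ravel(), return_inverse=True)
    sums = np.zeros((len(ids), pts.shape[1]))
    np.add.at(sums, inverse, pts)  # accumulate points per matched vertex
    counts = np.bincount(inverse, minlength=len(ids))
    return ids, sums / counts[:, None]
```

The dense distance matrix keeps the sketch short but scales with the number of points times the number of vertices, so a spatial index would be preferable for full-resolution depth images.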
In this paper, we focused on the segmentation problem, but further evaluation is needed to see the behavior of the approach when dealing with real depth images. Preliminary results on real RGBD images (see Figure 5) tend to show that the segmentation step is very sensitive to noise, with several background pixels that were incorrectly labeled as body parts. The resulting SMPL model leads to incorrect surface shape re-