CN111339870A

CN111339870A - A Human Shape and Pose Estimation Method for Object Occlusion Scenarios

Info

Publication number: CN111339870A
Application number: CN202010099358.XA
Authority: CN
Inventors: 王雁刚; 黄步真; 张天舒; 彭聪
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2020-06-26
Anticipated expiration: 2040-02-18
Also published as: CN111339870B

Abstract

The invention discloses a human body shape and attitude estimation method for object occlusion scenes. The calculated weak perspective projection parameters are converted into camera coordinates to obtain a UV image containing human body shape information without occlusion; The image is occluded by random object pictures, and the human body mask under occlusion is obtained; the obtained virtual occlusion data is used to train the UV map repair network of the encoding-decoding structure; the color image of the human body occluded by the real object is input, and the mask image is used as the ground truth to construct Saliency detection network of encoding-decoding structure; use latent space features obtained from encoding to supervise training of human encoding network; input color image of occluded human body to obtain a complete UV image; use the corresponding relationship between UV image and vertices of human 3D model to recover the occlusion situation 3D model of the human body under. The invention converts the estimation of the shape of the occluded human body into the image repairing problem of the two-dimensional UV map, thereby realizing the real-time and dynamic reconstruction of the human body in the occlusion scene.

Description

A Human Shape and Pose Estimation Method for Object Occlusion Scenarios

技术领域technical field

本发明属于计算机视觉及三维视觉领域，具体涉及一种针对物体遮挡场景下的人体形状和姿态估计方法。The invention belongs to the field of computer vision and three-dimensional vision, and in particular relates to a method for estimating human body shape and posture under the scene of object occlusion.

背景技术Background technique

从单张图像中估计三维人体的形状和姿态是近年来三维视觉领域的一个研究热点。它在人体运动捕捉、虚拟试衣和人体动画等虚拟现实技术的应用方面有着重要的作用。近年来，深度学习技术简化了从单张图像恢复人体整体形状的求解方式，特别是在SMPL模型被提出并得到广泛应用以后，单目图像三维人体形状和姿态估计经历了多个阶段的蓬勃发展，包括(1)通过匹配二维视觉特征来优化求解SMPL参数；使(2)用卷积神经网络(CNN)直接回归SMPL参数；(3)利用二维UV贴图表示SMPL表面三维点，进而将三维人体形状估计转化成基于CNN的图像翻译问题。深度神经网络凭借其准确性和运行效率成为三维人体形状估计的主流方法，它们能够在特定场景下获得较好的重建结果。然而，现有方法大多都没有考虑人和物体之间的遮挡这一常见的现象。如果不明确考虑遮挡，这类方法就不能直接转移到处理遮挡场景下的人体估计。这导致它们对带遮挡甚至轻微物体遮挡的场景十分敏感，难以满足现实需求。Estimating the shape and pose of a 3D human body from a single image is a research hotspot in the field of 3D vision in recent years. It plays an important role in the application of virtual reality technologies such as human motion capture, virtual fitting and human animation. In recent years, deep learning technology has simplified the solution method of recovering the overall shape of the human body from a single image, especially after the SMPL model was proposed and widely used, the 3D human body shape and pose estimation from monocular images has undergone multiple stages of vigorous development , including (1) optimizing and solving the SMPL parameters by matching two-dimensional visual features; (2) directly regressing the SMPL parameters with a convolutional neural network (CNN); (3) using the two-dimensional UV map to represent the three-dimensional points of the SMPL surface, and then the 3D human shape estimation into a CNN-based image translation problem. Deep neural networks have become the mainstream method for 3D human shape estimation due to their accuracy and operational efficiency, and they can obtain better reconstruction results in specific scenarios. However, most of the existing methods do not consider the common phenomenon of occlusion between people and objects. Without explicit consideration of occlusion, such methods cannot be directly transferred to deal with human estimation in occluded scenarios. This makes them very sensitive to scenes with occlusion or even slight object occlusion, and it is difficult to meet real-world needs.

一直以来，遮挡场景下的人体三维形状和姿态估计始终是领域内的难点，其主要原因有：(1)对象遮挡会在网络训练中引入严重的歧义，并且导致可直接利用的图像特征大大减少，从而影响完整的三维人体形状估计效果。(2)由于遮挡物体的普遍性和随机性，网络难以准确分割图像中人体和遮挡对象所在的像素，导致重建结果受到干扰。The three-dimensional shape and pose estimation of the human body in occlusion scenes has always been a difficult problem in the field. The main reasons are: (1) Object occlusion will introduce serious ambiguity in network training, and lead to a significant reduction in the image features that can be directly used. , thereby affecting the complete 3D human shape estimation effect. (2) Due to the ubiquity and randomness of occluded objects, it is difficult for the network to accurately segment the pixels where the human body and the occluded objects are located in the image, resulting in disturbed reconstruction results.

发明内容SUMMARY OF THE INVENTION

发明目的：针对遮挡场景下的人体形状和姿态估计问题，本发明提出一种针对物体遮挡场景的人体形状和姿态估计方法，将遮挡人体形状估计转化为二维UV贴图的图像修复问题，进而实现遮挡场景下人体的实时、动态重建。Purpose of the invention: Aiming at the problem of human body shape and attitude estimation in occlusion scenes, the present invention proposes a human body shape and attitude estimation method for object occlusion scenes, which converts the occluded human body shape estimation into the image restoration problem of two-dimensional UV maps, and then realizes the Real-time, dynamic reconstruction of the human body in occlusion scenes.

技术方案：本发明所述的一种针对物体遮挡场景的人体形状和姿态估计方法，包括以下步骤：Technical solution: A method for estimating human body shape and posture for object occlusion scenes described in the present invention includes the following steps:

(1)在数据准备阶段，利用三维人体数据集人体三维关节点与二维关节点之间的对应关系计算弱透视投影参数；(1) In the data preparation stage, the weak perspective projection parameters are calculated by using the correspondence between the three-dimensional joint points of the human body and the two-dimensional joint points of the three-dimensional human data set;

(2)根据计算得到的弱透视投影参数，通过三维旋转、平移将人体三维模型转换到相机坐标下；(2) According to the calculated weak perspective projection parameters, the three-dimensional model of the human body is converted into the camera coordinates through three-dimensional rotation and translation;

(3)将相机坐标下的人体三维模型顶点x,y,z坐标值归一化到-0.5至0.5范围内后存入UV贴图的R,G,B三个通道中，获得不带遮挡情况下包含人体形状信息的UV贴图；(3) Normalize the x, y, and z coordinate values of the vertices of the 3D model of the human body under the camera coordinates to the range of -0.5 to 0.5 and store them in the R, G, and B channels of the UV map to obtain the situation without occlusion UV map containing the shape information of the human body;

(4)对人体二维图像加入随机物体图片遮挡，并获取遮挡情况下的人体掩膜；(4) Add random object picture occlusion to the two-dimensional image of the human body, and obtain the human body mask under the occlusion condition;

(5)重复步骤(3)，使用弱透视投影后落在掩膜区域之外的三维点为视觉遮挡下的三维点，其x,y,z坐标固定设为-0.5，获得对应遮挡下的UV贴图；(5) Repeat step (3), the 3D point that falls outside the mask area after using weak perspective projection is the 3D point under visual occlusion, and its x, y, and z coordinates are fixed to -0.5, and the corresponding 3D point under occlusion is obtained. UV map;

(6)在训练阶段，基于步骤(1)至步骤(5)获得的虚拟遮挡数据训练编码-解码结构的UV贴图修复网络；所述修复网络以与完整人体UV图之间的L1损失，相邻像素之间的拉普拉斯平滑项以及UV连接处一致性作为约束；(6) In the training phase, the UV map repair network of the encoding-decoding structure is trained based on the virtual occlusion data obtained in steps (1) to (5); the repair network is based on the L1 loss between the complete human body UV map, The Laplacian smoothing term between adjacent pixels and the consistency of UV connections are used as constraints;

(7)利用真实物体遮挡人体彩色图像作为输入，以掩膜图像作为真值构建编码-解码结构的显著性检测网络；(7) The saliency detection network of encoding-decoding structure is constructed by using the real object occluded human color image as input, and the mask image as the ground truth;

(8)将遮挡人体彩色图片与显著性图连接后送入人体编码网络，同时将相应遮挡下的UV贴图使用步骤(6)训练好的修复网络进行编码，使用编码得到的隐空间特征监督人体编码网络训练；(8) Connect the color image of the occluded human body with the saliency map and send it to the human body encoding network. At the same time, encode the UV map under the corresponding occlusion using the repair network trained in step (6), and use the encoded latent space features to supervise the human body. coding network training;

(9)在测试阶段，输入遮挡人体彩色图像，经由显著性检测网络，人体编码网络，将人体编码网络编码得到的隐空间特征值使用修复网络的解码器解码，得到完整的UV图像；(9) In the test phase, input the color image of the occluded human body, and through the saliency detection network and the human body encoding network, decode the latent space eigenvalues encoded by the human body encoding network using the decoder of the repair network to obtain a complete UV image;

(10)使用UV贴图与人体三维模型的顶点对应关系恢复出遮挡情况下的人体三维模型。(10) Using the UV map and the corresponding relationship between the vertices of the three-dimensional model of the human body to restore the three-dimensional model of the human body under occlusion.

进一步地，步骤(6)所述的UV贴图修复网络使用ResNet作为编码器，以堆叠的反卷积层作为解码器。Further, the UV map inpainting network described in step (6) uses ResNet as the encoder and the stacked deconvolution layers as the decoder.

进一步地，所述步骤(6)通过以下公式实现：Further, described step (6) is realized by following formula:

L＝L₁+λL_tv+μL_p L=L ₁ +λL _tv + μL _p

其中，λ，μ为权重，L_tv为拉普拉斯平滑项，L_p为UV连接处一致性约束：Among them, λ and μ are the weights, L _tv is the Laplace smoothing term, and L _p is the consistency constraint at the UV connection:

其中，V_b是对应多个UV像素的模型顶点点集，P(v)是模型顶点v对应的UV像素值。Among them, V _b is the model vertex point set corresponding to multiple UV pixels, and P(v) is the UV pixel value corresponding to the model vertex v.

进一步地，步骤(8)所述的人体编码网络使用VGG-19结构。Further, the human coding network described in step (8) uses the VGG-19 structure.

进一步地，步骤(9)所述的彩色图像为从单目彩色相机获取的经过预处理的人体遮挡图像。Further, the color image in step (9) is a preprocessed human occlusion image obtained from a monocular color camera.

有益效果：与现有技术相比，本发明的有益效果为：1、使用大量虚拟遮挡数据训练图像修复网络，使得整体框架对各类遮挡均具有较好鲁棒性；2、使用显著性检测，减小遮挡和背景等无效图像特征对重建的干扰，增强对图像中人体与遮挡边缘的鲁棒性，避免了分割不准确的问题；3、使用隐空间一致性的方法，将人体三维形状估计转化为图像修复问题，降低了求解复杂度；4、提出一种UV连接处一致性约束，提高了以UV贴图进行人体重建方法中重建结果的平滑度。Beneficial effects: Compared with the prior art, the beneficial effects of the present invention are: 1. Use a large amount of virtual occlusion data to train the image repair network, so that the overall framework has better robustness to various occlusions; 2. Use saliency detection , reduce the interference of invalid image features such as occlusion and background on the reconstruction, enhance the robustness to the human body and occlusion edges in the image, and avoid the problem of inaccurate segmentation; 3. Use the method of latent space consistency to convert the three-dimensional shape of the human body. The estimation is transformed into an image inpainting problem, which reduces the complexity of the solution; 4. A UV connection consistency constraint is proposed to improve the smoothness of the reconstruction results in the method of human reconstruction with UV maps.

附图说明Description of drawings

图1为本发明的流程图；Fig. 1 is the flow chart of the present invention;

图2为人体信息UV图生成示意图；Fig. 2 is a schematic diagram of human body information UV map generation;

图3为人体形状信息UV图；Fig. 3 is the UV map of human body shape information;

图4为人体三维模型示意图；4 is a schematic diagram of a three-dimensional model of a human body;

图5为显著性检测网络结构图；Fig. 5 is a network structure diagram of saliency detection;

图6为本发明的重建结果示意图。FIG. 6 is a schematic diagram of a reconstruction result of the present invention.

具体实施方式Detailed ways

下面结合附图对本发明作进一步详细描述。如图1所示，本发明所述的一种针对物体遮挡场景的人体形状和姿态估计方法实现过程如下：The present invention will be described in further detail below with reference to the accompanying drawings. As shown in FIG. 1 , the implementation process of a method for estimating human body shape and posture for object occlusion scenes according to the present invention is as follows:

如图2所示，人体信息UV图的生成方式为：在数据准备阶段，首先利用三维人体数据集中人体三维模型关节点与二维关节点之间的投影关系计算弱透视投影参数并通过三维平移、旋转等操作将人体模型转化到相机坐标下，将相机坐标下的人体三维模型顶点x,y,z坐标归一化到[-0.5，0.5]并存入UV贴图的R,G,B三个通道，由此获得如图3所示的不带遮挡情况下包含人体形状信息的UV贴图。为了获取遮挡人体UV图，对人体二维图像加入随机物体图片遮挡，并获取遮挡情况下的人体掩膜。将人体三维模型通过投影参数向人体掩膜进行弱透视投影。落在掩膜区域之外的三维点为视觉遮挡下的三维点，其x,y,z坐标固定设为-0.5，掩膜区域内的仍然存入顶点三维坐标，从而获得如图4中所示的对应遮挡下的UV贴图。由于此步骤中遮挡UV图与完整UV图均与彩色图像的背景无关，因此可以使用虚拟遮挡产生大量遮挡UV数据，增强网络的鲁棒性。As shown in Figure 2, the generation method of the human body information UV map is as follows: in the data preparation stage, first use the projection relationship between the joint points of the 3D model of the human body and the 2D joint points in the 3D human data set to calculate the weak perspective projection parameters, and then use the 3D translation to calculate the weak perspective projection parameters. , rotation and other operations to convert the human body model to the camera coordinates, normalize the x, y, and z coordinates of the vertices of the human body 3D model under the camera coordinates to [-0.5, 0.5] and store them in the R, G, B three of the UV map channel, thereby obtaining the UV map containing the human body shape information without occlusion as shown in Figure 3. In order to obtain the UV map of the occluded human body, a random object image is added to the 2D image of the human body, and the human body mask under the occlusion condition is obtained. Weak perspective projection is performed on the human body 3D model to the human body mask through the projection parameters. The three-dimensional points that fall outside the mask area are the three-dimensional points under the visual occlusion, and their x, y, and z coordinates are fixed to -0.5, and the three-dimensional coordinates of the vertices are still stored in the mask area, so as to obtain as shown in Figure 4. The UV map under the corresponding occlusion shown. Since both the occlusion UV map and the complete UV map in this step have nothing to do with the background of the color image, virtual occlusion can be used to generate a large amount of occlusion UV data to enhance the robustness of the network.

使用获取得到的大量遮挡UV图和完整UV图，训练以ResNet-50为编码器，堆叠反卷积层为解码器的图像修复网络。该网络能够遮挡UV图编码为高维人体特征，并从高维特征中解码出完整的人体形状UV图。网络以与完整人体UV图之间的L1 loss，相邻像素之间的拉普拉斯平滑项以及UV连接处一致性作为约束。Using the obtained large number of occluded UV maps and full UV maps, an image inpainting network with ResNet-50 as the encoder and stacked deconvolutional layers as the decoder is trained. The network is able to encode occlusion UV maps into high-dimensional human features, and decode the complete human shape UV maps from high-dimensional features. The network is constrained by the L1 loss with the full body UV map, the Laplacian smoothing term between adjacent pixels, and the consistency of UV connections.

其具体公式为：Its specific formula is:

L＝L₁+λL_tv+μL_p L=L ₁ +λL _tv + μL _p

其中，V_b是对应多个UV像素的模型顶点点集，P(v)是模型顶点v对应的UV像素值。该约束能够使如图3所示的UV图的各个部分平滑连接。Among them, V _b is the model vertex point set corresponding to multiple UV pixels, and P(v) is the UV pixel value corresponding to the model vertex v. This constraint enables the parts of the UV map as shown in Figure 3 to be smoothly connected.

以真实物体遮挡人体彩色图像作为输入，掩膜图像作为真值构建编码-解码结构的显著性检测网络，经过如图5所示的显著性图检测网络后，得到该遮挡图像的人体显著性图。将遮挡人体彩色图片与显著性图连接后送入人体编码网络，同时将相应遮挡下的UV贴图使用训练好的修复网络进行编码，使用编码得到的隐空间特征监督人体编码网络训练。此处输入以VGG-19为基本结构的人体编码网络。使用与该彩色图像对应的遮挡UV图，经过图像修复网络的编码器得到的高维特征作为人体编码网络的监督。同时如图5，以不同缩放比例的人体掩膜作为显著性网络的监督，对两个网络进行端到端训练。The saliency detection network of the encoding-decoding structure is constructed with the real object occluding the color image of the human body as the input and the mask image as the ground truth. After the saliency map detection network shown in Figure 5, the human saliency map of the occluded image is obtained. . The color image of the occluded human body is connected with the saliency map and sent to the human encoding network. At the same time, the UV map under the corresponding occlusion is encoded by the trained repair network, and the encoded latent space features are used to supervise the training of the human encoding network. Here, input the human coding network with VGG-19 as the basic structure. Using the occlusion UV map corresponding to the color image, the high-dimensional features obtained by the encoder of the image inpainting network are used as the supervision of the human encoding network. At the same time, as shown in Figure 5, the two networks are trained end-to-end with different scaled human masks as the supervision of the saliency network.

完成网络训练之后，直接从单目彩色相机获取人体遮挡图像并进行裁剪、缩放等预处理。将预处理后的彩色图像输入网络，直接经过显著性检测网络，人体编码网络后得到高维人体特征。将人体编码网络编码得到的隐空间特征值使用修复网络的解码器解码得到高维特征，然后使用图像修复网络的解码器解码得到完整UV图像。经过UV贴图与人体三维模型之间的对应关系，可以直接从人体形状UV图中恢复出对应形状的人体三维模型。图6中展示了遮挡人体彩色图像经过该方法的重建结果。After the network training is completed, the human occlusion images are directly obtained from the monocular color camera and preprocessed such as cropping and scaling. The preprocessed color images are input into the network, directly passed through the saliency detection network, and the human body encoding network to obtain high-dimensional human body features. The latent space feature value encoded by the human body encoding network is decoded by the decoder of the inpainting network to obtain high-dimensional features, and then the decoder of the image inpainting network is used to decode the complete UV image. Through the correspondence between the UV map and the 3D model of the human body, the 3D model of the human body with the corresponding shape can be directly recovered from the UV map of the human body shape. Figure 6 shows the reconstruction results of the occluded human color image through this method.

Claims

1. a human body shape and attitude estimation method for object occlusion scene, is characterized in that, comprises the following steps:

(1) In the data preparation stage, the weak perspective projection parameters are calculated by using the correspondence between the three-dimensional joint points of the human body and the two-dimensional joint points of the three-dimensional human data set;

(2) According to the calculated weak perspective projection parameters, the three-dimensional model of the human body is converted into the camera coordinates through three-dimensional rotation and translation;

(3) Normalize the x, y, and z coordinate values of the vertices of the 3D model of the human body under the camera coordinates to the range of -0.5 to 0.5 and store them in the R, G, and B channels of the UV map to obtain the situation without occlusion UV map containing the shape information of the human body;

(4) Add random object picture occlusion to the two-dimensional image of the human body, and obtain the human body mask under the occlusion condition;

(5) Repeat step (3), the 3D point that falls outside the mask area after using weak perspective projection is the 3D point under visual occlusion, and its x, y, and z coordinates are fixed to -0.5, and the corresponding 3D point under occlusion is obtained. UV map;

(6) In the training phase, the UV map repair network of the encoding-decoding structure is trained based on the virtual occlusion data obtained in steps (1) to (5); the repair network is based on the L1 loss between the complete human body UV map, The Laplacian smoothing term between adjacent pixels and the consistency of UV connections are used as constraints;

(7) The saliency detection network of encoding-decoding structure is constructed by using the real object occluded human color image as input, and the mask image as the ground truth;

(8) Connect the color image of the occluded human body with the saliency map and send it to the human body encoding network. At the same time, encode the UV map under the corresponding occlusion using the repair network trained in step (6), and use the encoded latent space features to supervise the human body. coding network training;

(9) In the test phase, input the color image of the occluded human body, and through the saliency detection network and the human body encoding network, decode the latent space eigenvalues encoded by the human body encoding network using the decoder of the repair network to obtain a complete UV image;

(10) Using the UV map and the corresponding relationship between the vertices of the three-dimensional model of the human body to restore the three-dimensional model of the human body under occlusion.

2. a specific kind of human body shape and attitude estimation method for object occlusion scene according to claim 1, is characterized in that, the UV map repairing network described in step (6) uses ResNet as encoder, with stacked reverse Convolutional layers act as decoders.

3. a specific kind of human body shape and attitude estimation method for object occlusion scene according to claim 1, is characterized in that, described step (6) is realized by following formula:

L=L ₁ +λL _tv + μL _p

Among them, λ and μ are the weights, L _tv is the Laplace smoothing term, and L _p is the consistency constraint at the UV connection:

Among them, V _b is the model vertex point set corresponding to multiple UV pixels, and P(v) is the UV pixel value corresponding to the model vertex v.

4 . A specific method for estimating human body shape and posture for an object occlusion scene according to claim 1 , wherein the human coding network described in step (8) uses a VGG-19 structure. 5 .

5. a specific kind of human body shape and attitude estimation method for object occlusion scene according to claim 1, is characterized in that, the color image described in step (9) is the preprocessed obtained from monocular color camera Human occlusion image.