CN111311729A - Natural scene three-dimensional human body posture reconstruction method based on bidirectional projection network - Google Patents
- Publication number: CN111311729A
- Application number: CN202010056119.6A
- Authority: CN (China)
- Prior art keywords: pose, network, dimensional, projection, attitude
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T19/006—Manipulating 3D models or images for computer graphics; Mixed reality
- G06T3/067—Reshaping or unfolding 3D tree structures onto 2D planes
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Description
Technical Field
The present invention relates to the technical field of computer vision, and in particular to a method for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network.
Background Art
In virtual reality and somatosensory human-computer interaction, it is usually necessary to capture human motion accurately and reconstruct a moving three-dimensional human skeleton. Existing methods typically rely on hardware peripherals such as professional motion-capture (MOCAP) systems or depth cameras (e.g., Kinect) to reconstruct 3D human poses. However, such professional equipment is usually expensive and imposes strict requirements on the capture environment, which hinders the wide adoption of 3D pose reconstruction technology. Estimating 3D human pose from a monocular image is a difficult task in computer vision, and reconstructing a 3D pose from 2D joint points is a particularly thorny ill-posed problem. Most existing methods rely on paired labeled data for supervised training of the network, and perform poorly when labeled data and explicit correspondences are lacking. The present invention therefore uses deep learning to improve the 3D human pose reconstruction process so that it no longer depends on professional hardware peripherals: an ordinary mobile phone or camera suffices to reconstruct 3D human poses in natural scenes.
Existing deep learning methods usually require paired, labeled human pose data to train the network. Without 3D labels and explicit correspondences, the model is hard to train and generalizes poorly, making it difficult to produce reasonable 3D reconstructions of the complex and varied human poses that occur in natural environments. It is therefore of great significance to design a deep learning scheme that reconstructs 3D human poses accurately in natural scenes without relying on labeled data during training; such a scheme can replace professional motion-capture equipment at very low cost and accomplish 3D pose reconstruction in natural scenes.
Summary of the Invention
The present invention overcomes the problem that the 3D human pose reconstruction process of the prior art still needs improvement, and provides a method for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network, which can perform 3D reconstruction of human motion in natural scenes using a monocular camera.
The technical solution of the present invention is a method for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network, comprising the following steps:
Step 1: Use a camera to collect video or image data of human motion in natural scenes;
Step 2: Feed the collected video or image data into a 2D pose detector to obtain the 2D human joint coordinates of the corresponding poses;
Step 3: Design bidirectional projection networks with two structures, depending on whether 3D pose labels are available during training;
Step 4: Train the designed network with a deep adversarial learning strategy, minimizing the network loss function; after iteration, a trained 3D pose generator is obtained;
Step 5: Feed the output of the 2D pose detector from Step 2 into the 3D pose generator trained in Step 4; the output is the 3D pose data of the person in the video or image.
Preferably, in Step 1, an ordinary monocular optical camera or a mobile phone camera is used to collect motion data of people in natural scenes, in the form of pictures or videos.
Preferably, in Step 2, the 2D pose detector is a 2D pose detection method such as OpenPose, Stacked Hourglass, or HRNet. When the collected data is a picture, the picture is input directly to obtain the 2D joint detection result; when the collected data is a video, frames are input one by one to obtain a sequence of 2D joint detections.
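The picture/video dispatch in Step 2 can be sketched as follows. `detect_2d` is a hypothetical wrapper around any of the named detectors (OpenPose, Stacked Hourglass, HRNet); it is stubbed out here so the dispatch logic is runnable without a real model.

```python
def detect_2d(frame):
    # Stub: a real detector returns one (x, y) coordinate per joint.
    # We fake 16 joints purely for illustration.
    return [(0.0, 0.0)] * 16

def joints_from_input(data, is_video):
    """Return 2D joints for one picture, or a per-frame sequence for a video."""
    if not is_video:
        return detect_2d(data)           # picture -> single pose
    return [detect_2d(f) for f in data]  # video -> frame-by-frame sequence
```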
Preferably, in Step 3, one of two bidirectional projection network structures, A or B, is selected according to whether the user has 3D pose label data. When 3D pose data are available, the bidirectional projection network works in Mode A; the network then consists of two opposed dual branches, and its modules include a 3D pose generator, a 3D pose discriminator, a 2D pose projection layer, and a 2D pose discriminator. When no 3D pose data are available, the bidirectional projection network works in Mode B; the network then consists of two projection branches in different directions, and its modules include a 3D pose generator, a 2D pose projection layer, and a 2D pose discriminator.
Preferably, in Step 3 the 3D pose generator takes 2D joint coordinates as input and outputs 3D joint coordinates. Internally it contains two deep residual networks and one pose feature extraction layer; each deep residual network is a stack of four residual blocks with 1024 neurons per layer, and the pose feature extraction layer encodes and compresses the pose topology. The 2D pose discriminator and the 3D pose discriminator share the same architecture, containing a 2D/3D pose feature extraction layer, a deep residual network, and a fully connected layer; each discriminator takes a pose vector of the corresponding dimension as input and outputs a single scalar discriminant value. The 2D pose projection layer contains two branches, forward projection and rotation transformation, which project the pose onto different viewing angles according to their function; this module takes 3D pose data as input and outputs the projected 2D pose data.
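A minimal numpy sketch of one residual block of the kind described above (fully connected layers of width 1024 with a skip connection). The activation and random initialization are assumptions for illustration; the patent only specifies four stacked blocks with 1024 units per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
WIDTH = 1024

def linear(dim_in, dim_out):
    # Random weights stand in for trained parameters.
    return rng.normal(0, 0.02, (dim_in, dim_out)), np.zeros(dim_out)

def residual_block(x, params):
    (w1, b1), (w2, b2) = params
    h = np.maximum(x @ w1 + b1, 0.0)   # Linear + ReLU
    h = np.maximum(h @ w2 + b2, 0.0)
    return x + h                        # skip connection

block = [linear(WIDTH, WIDTH), linear(WIDTH, WIDTH)]
x = rng.normal(size=(1, WIDTH))
y = residual_block(x, block)
print(y.shape)  # (1, 1024)
```

Four such blocks stacked back to back would form one of the generator's deep residual networks.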
Preferably, Step 4 comprises the following sub-steps:
Step 4.1: When 3D pose data are available for network training, select the Mode A network architecture for training;
Step 4.1.1: Take the 2D pose as input. It first passes through the residual network in the 3D pose generator, which outputs an initial depth estimate, yielding an initial estimate of the 3D pose. The initial estimate is then passed to the pose feature extraction layer, which extracts pose-prior topological features and outputs a feature vector; this feature vector is passed into the deep residual network again to output the final depth estimate and generate the final 3D reconstructed pose;
Step 4.1.2: The generated 3D reconstructed pose passes along one path through the 2D pose projection layer to obtain a forward projection, whose pose error against the input 2D pose is computed; along the other path it is fed into the 3D pose discriminator to compute a distribution error;
Step 4.1.3: Take the 3D pose as input. It first passes through the 2D pose projection layer to obtain a forward projection. This forward projection is fed along one path into the 3D pose generator to obtain a 3D reconstruction, whose pose error against the input 3D pose is computed; along the other path it is fed into the 2D pose discriminator to compute a distribution error;
Step 4.2: When no 3D pose data are available for network training, select the Mode B network architecture for training;
Step 4.2.1: Take the 2D pose as input. It first passes through the residual network in the 3D pose generator, which outputs an initial depth estimate, yielding an initial estimate of the 3D pose. The initial estimate is then passed to the pose feature extraction layer, which extracts pose-prior topological features and outputs a feature vector; this feature vector is passed into the deep residual network again to output the final depth estimate and generate the final 3D reconstructed pose;
Step 4.2.2: Pass the 3D reconstructed pose into the 2D pose projection layer to obtain a forward projection and a rotated projection. The pose error between the forward projection and the input 2D pose is computed, while the rotated projection is passed through the 2D pose discriminator to compute a 2D distribution error;
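The two projection branches of Step 4.2.2 can be sketched as below, assuming an orthographic camera: the forward projection keeps (x, y) and drops depth, and the rotated projection first rotates the skeleton about the vertical (y) axis to simulate another viewpoint. The choice of rotation axis and of an orthographic model are assumptions for illustration, not the patent's exact projection layer.

```python
import numpy as np

def forward_projection(pose3d):
    return pose3d[:, :2]                       # (J, 3) -> (J, 2)

def rotated_projection(pose3d, theta):
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return (pose3d @ rot_y.T)[:, :2]           # rotate, then project

pose = np.array([[1.0, 2.0, 0.5],
                 [0.3, 1.0, -0.2]])
print(forward_projection(pose))        # x, y columns unchanged
print(rotated_projection(pose, 0.0))   # theta=0 reproduces the forward view
```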
Step 4.3: Compute the loss functions for Modes A and B, each consisting of a pose loss and a distribution loss;
Step 4.3.1: In Mode A, the overall loss function of the network is defined as:
lossA = LGAN(G3d, D3d) + LGAN(G2d, D2d) + Ldual(G2d, G3d),
where LGAN denotes the loss function of a generative adversarial network with a gradient penalty term, reflecting the distribution error; for the 3D branch it is computed as
LGAN(G3d, D3d) = E[D3d(X3d)] - E[D3d(G3d(X2d))] + λ E[(||∇D3d(A3d)||2 - 1)^2],
and LGAN(G2d, D2d) is computed analogously with D2d, X3d projected to 2D, and A2d;
Ldual denotes the bidirectional loss of the dual network, reflecting the pose error, computed as:
Ldual(G2d, G3d) = ||G2d(G3d(X2d)) - X2d||1 + ||G3d(G2d(X3d)) - X3d||1
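The bidirectional loss above can be sketched in numpy as follows. `g3d` and `g2d` are toy placeholders for the 3D pose generator and the 2D projection (lift by appending zero depth / project by dropping depth), so only the L1 cycle computation itself should be read as faithful.

```python
import numpy as np

def g3d(x2d):
    # Toy stand-in for the generator: append zero depth.
    return np.concatenate([x2d, np.zeros((x2d.shape[0], 1))], axis=1)

def g2d(x3d):
    # Toy stand-in for the projection layer: drop depth.
    return x3d[:, :2]

def l_dual(x2d, x3d):
    term_2d = np.abs(g2d(g3d(x2d)) - x2d).sum()  # 2D -> 3D -> 2D cycle
    term_3d = np.abs(g3d(g2d(x3d)) - x3d).sum()  # 3D -> 2D -> 3D cycle
    return term_2d + term_3d

x2d = np.array([[1.0, 2.0]])
x3d = np.array([[1.0, 2.0, 0.0]])
print(l_dual(x2d, x3d))  # 0.0 for these ideal placeholders
```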
where λ is a neural-network hyperparameter, G3d denotes the 3D pose generator, G2d the 2D pose projection layer, D3d and D2d the 3D and 2D pose discriminators respectively, X2d and X3d the real 2D and 3D poses, A3d a random 3D pose on the line connecting sample points of the reconstructed 3D pose distribution and the real 3D pose distribution, and A2d a random 2D pose on the line connecting sample points of the projected 2D pose distribution and the real 2D pose distribution;
Step 4.3.2: In Mode B, the overall loss function of the network is defined as:
lossB = LGAN(GR2dG3d, D2d) + Lpose(GK2dG3d),
where LGAN denotes the loss function of a generative adversarial network with a gradient penalty term, reflecting the distribution error, computed as
LGAN(GR2dG3d, D2d) = E[D2d(X2d)] - E[D2d(GR2d(G3d(X2d)))] + λ E[(||∇D2d(A2d)||2 - 1)^2];
Lpose is the reconstruction loss, reflecting the pose error, computed as:
Lpose(GK2dG3d) = ||GK2dG3d(X2d) - X2d||1
where λ is a neural-network hyperparameter, G3d denotes the 3D pose generator, GR2d the rotation-projection transform of the 2D pose projection layer, GK2d the forward-projection transform of the 2D pose projection layer, D2d the 2D pose discriminator, X2d the real 2D pose data, and A2d a random 2D pose on the line connecting sample points of the projected 2D pose distribution and the real 2D pose distribution;
Step 4.4: Use a neural-network optimizer to adjust the network parameters so as to minimize the error function; the loss function converges after 20-40 epochs of iteration, yielding the trained 3D pose generator.
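The adversarial training schedule of Step 4.4 can be sketched as an alternating loop that stops once the loss stops improving, within the 20-40 epoch budget described above. The step functions below are hypothetical placeholders for the real discriminator/generator updates.

```python
import random

random.seed(0)

def discriminator_step():
    return random.random()            # stand-in for the real D update

def generator_step():
    return random.random()            # stand-in for the real G update

def train(max_epochs=40, min_epochs=20, tol=1e-3):
    prev = float("inf")
    for epoch in range(1, max_epochs + 1):
        discriminator_step()
        loss = generator_step()
        if epoch >= min_epochs and abs(prev - loss) < tol:
            break                      # loss has converged
        prev = loss
    return epoch

print(train())  # stops somewhere between 20 and 40 epochs
```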
Step 5 comprises the following sub-steps:
Step 5.1: Pass the video or image data collected by an ordinary camera into the 2D pose detector to first obtain 2D joint point data;
Step 5.2: Normalize the output of the 2D pose detector so that it can be used directly as input to the 3D pose generator; normalization comprises the following sub-steps:
Step 5.2.1: Reconstruct the central neck coordinate from the detected left and right shoulder joint coordinates:
where (xT, yT) denotes the central neck coordinate, (xls, yls) the left shoulder coordinate, and (xrs, yrs) the right shoulder coordinate;
Step 5.2.2: Reconstruct the central spine coordinate from the detected left and right shoulder and hip joints:
where (xS, yS) denotes the central spine coordinate, (xls, yls) the left shoulder coordinate, (xrs, yrs) the right shoulder coordinate, (xlh, ylh) the left hip coordinate, and (xrh, yrh) the right hip coordinate;
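The normalization of Steps 5.2.1-5.2.2 can be sketched as below. The exact formulas are not reproduced in this text, so midpoints are assumed here: the central neck as the midpoint of the two shoulders, and the central spine as the mean of shoulders and hips. Treat both as illustrative assumptions rather than the patent's exact definitions.

```python
import numpy as np

def neck(left_shoulder, right_shoulder):
    # Assumed: neck = midpoint of the two shoulder joints.
    return (np.asarray(left_shoulder) + np.asarray(right_shoulder)) / 2.0

def spine(left_shoulder, right_shoulder, left_hip, right_hip):
    # Assumed: spine = mean of shoulder and hip joints.
    pts = np.asarray([left_shoulder, right_shoulder, left_hip, right_hip])
    return pts.mean(axis=0)

print(neck((2.0, 4.0), (4.0, 4.0)))                           # [3. 4.]
print(spine((2.0, 4.0), (4.0, 4.0), (2.0, 0.0), (4.0, 0.0)))  # [3. 2.]
```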
Step 5.3: Pass the normalized 2D pose data into the 3D pose generator; the output is the reconstructed 3D pose. When the input is image data, the output is a 3D human pose skeleton; when the input is video data, the output is a 3D human skeleton motion.
Compared with the prior art, the method of the present invention for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network has the following advantages: (1) By training a deep neural network in a data-driven way, low-cost 3D reconstruction of human poses can be performed directly by the neural network, without any expensive hardware: an ordinary camera or mobile phone suffices to collect the data, and the vision-based method reconstructs the 3D pose of a moving human body, replacing professional hardware peripherals. It is inexpensive and easy to use, can support VR and AR technology in the 5G era, enables portable somatosensory interaction devices, and facilitates large-scale adoption of 3D motion reconstruction technology.
(2) A distinctive neural network training scheme is adopted that fully exploits the physiological structure of human pose data to add new constraints to the network. The training process therefore does not depend on specific data labels or 3D datasets, enabling label-free deep learning training; moreover, the trained model generalizes well and can handle complex 3D human pose estimation tasks in natural scenes.
(3) Based on a study of two key properties of human poses, the present invention designs a bidirectional projection network. It adds the pose prior knowledge contained in the dataset to the network's training process as a new constraint, which reduces the model's dependence on real 3D data during training, allows the network to be trained without label data, and achieves accurate 3D human pose reconstruction in natural scenes.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the Mode A network structure of the bidirectional projection network of the present invention;
FIG. 2 is a schematic diagram of the Mode B network structure of the bidirectional projection network of the present invention;
FIG. 3 is a schematic diagram of the internal structure of the component modules of the bidirectional projection network of the present invention;
FIG. 4 is the overall flowchart of the present invention;
FIG. 5 shows 3D human pose reconstruction results of the present invention in natural scenes.
Detailed Description
The method of the present invention for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network is further described below with reference to the accompanying drawings and specific embodiments; this embodiment presents the detailed development of the invention.
1. Introduction to Related Technologies
Reconstructing human poses and motions in three-dimensional space is one of the main goals of computer vision, and scholars were already studying this problem in the last century [1]. To remove the dependence on professional equipment, most early methods were based on feature engineering, reconstructing 3D poses through physiological modeling of human skeletal joints [2,3], or were search-based, performing nearest-neighbor lookup in a database dictionary of 3D skeletons to output the 3D pose corresponding to a 2D pose [4,5]. With the development of deep learning, researchers have tried to output the 3D human pose directly from RGB images through end-to-end models [6,7,8,9], but the complex image backgrounds of natural scenes usually interfere with end-to-end 3D pose reconstruction. In recent years, inferring 3D human pose from a monocular vision system has attracted great attention; the technology can be widely applied to animated films, virtual reality, action recognition, and human-computer interaction. Because recovering a 3D pose from 2D observations is inherently ill-posed, it is a very challenging computer vision task. In natural scenes, factors such as illumination, viewing angle, and complex backgrounds make it very difficult to infer the 3D human pose directly from an image, so some previous works split the problem into two parts: first estimate the 2D pose from the image with an advanced 2D keypoint detector, then reconstruct the 3D human pose from the obtained 2D pose. Among these, [10] first proposed a simple baseline algorithm that treats 3D pose reconstruction as a regression task from 2D joint points to 3D coordinates and achieves high-quality 3D reconstruction with a neural network. [11] further represented the pose as a distance matrix, turning the problem into a 2D-to-3D distance-matrix regression. [12] treated the human pose as a special kind of topological graph data and designed a Semantic Graph Convolutional Network (SemGCN) to perform regression on graph-structured data. However, these methods that train networks with 3D label data have two serious limitations: (1) because 3D pose data place high demands on experimental conditions, typically requiring expensive multi-view motion-capture equipment indoors to capture 3D human motion, it is usually hard to obtain large amounts of 3D human pose data for training in real scenes; (2) the strict correspondences imposed by training on labeled data lead to overfitting on a single dataset. This overfitting means, on the one hand, that the model cannot generalize to other special viewpoints or to 2D poses it has never seen, and on the other hand, that the network can only generate the 3D poses present in the training set and cannot produce reasonable reconstructions of complex poses in natural scenes. Both limitations stem from the training process's dependence on 3D label data.
In recent years, 2D human joint detection algorithms have become increasingly accurate and can now perform real-time 2D pose estimation in natural scenes. More and more researchers have therefore devoted themselves to reconstructing 3D poses from these easily obtained 2D joint data, in two steps: first obtain the 2D pose from the image with an advanced 2D human joint detector, then lift the 2D pose to 3D. The key to solving an ill-posed problem is to incorporate reasonable prior information as constraints, guided by the characteristics of the problem. In traditional methods such constraints were provided by hand-designed regularization terms and typically yielded a solution to only a single problem. In the deep learning era, letting the network learn prior-information constraints from data automatically can be seen as a new way of solving ill-posed problems: a model trained on a large amount of data can solve a whole class of problems.
Abstracting the important properties of pose data and using them as network constraints is thus the main contribution of the present invention. Based on a study of the physiological structure of pose data, the present invention uses deep learning to design a bidirectional projection network with two working modes, A and B, which allow the network to be trained with 3D labeled data or with unlabeled data, respectively. The trained network can perform complex 3D human pose reconstruction tasks in natural scenes.
2. The Proposed Method
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Embodiment 1: Referring to FIGS. 1-5, the overall flowchart of the method for reconstructing three-dimensional human poses in natural scenes based on a bidirectional projection network is shown in FIG. 4. After pictures or videos are captured with a monocular camera, they first pass through a 2D pose detection network (OpenPose, HRNet, Stacked Hourglass) to obtain the corresponding 2D human pose and the 2D joint detection results. Before the data are fed into the 3D pose generator, the generator must be trained in the mode chosen according to whether 3D label data are available: with 3D pose label data the bidirectional projection network works in Mode A, and without it the network works in Mode B. Users can select the corresponding mode to train the network according to whether they possess 3D human pose data.
When Mode A is selected, the training process of the bidirectional projection network has the structure shown in FIG. 1; the 2D pose data and 3D pose data are fed into the two branches of the bidirectional network, respectively. In the first branch, the input 2D pose first passes through the 3D pose generator to produce a 3D reconstruction, which then passes through the 2D pose projection layer to regenerate a 2D projection. In the second branch, the input 3D pose is first passed to the 2D pose projection layer, and the output is then fed into the 3D pose generator for a second reconstruction. The two branches form a dual pair of operations, and the reconstruction errors of both processes are computed separately; the error comprises a distribution error and a pose error, and the loss function of the entire network is:
lossA = LGAN(G3d, D3d) + LGAN(G2d, D2d) + Ldual(G2d, G3d),
where LGAN denotes the loss function of a generative adversarial network with a gradient penalty term, reflecting the distribution error; for the 3D branch it is computed as
LGAN(G3d, D3d) = E[D3d(X3d)] - E[D3d(G3d(X2d))] + λ E[(||∇D3d(A3d)||2 - 1)^2],
and LGAN(G2d, D2d) is computed analogously with D2d, X3d projected to 2D, and A2d;
Ldual denotes the bidirectional loss of the dual network, reflecting the pose error, computed as:
Ldual(G2d, G3d) = ||G2d(G3d(X2d)) - X2d||1 + ||G3d(G2d(X3d)) - X3d||1
where λ is a neural-network hyperparameter, G3d denotes the 3D pose generator, G2d the 2D pose projection layer, D3d and D2d the 3D and 2D pose discriminators respectively, X2d and X3d the real 2D and 3D poses, A3d a random 3D pose on the line connecting sample points of the reconstructed 3D pose distribution and the real 3D pose distribution, and A2d a random 2D pose on the line connecting sample points of the projected 2D pose distribution and the real 2D pose distribution;
When Mode B is selected, the training process of the bidirectional projection network has the structure shown in FIG. 2. Mode B requires no label data. The input 2D pose first passes through the 3D pose generator to obtain a reconstruction; this 3D pose then undergoes two projection transforms in the 2D pose projection layer. One branch projects the 3D pose onto the frontal viewing angle to obtain the frontal 2D projection, while the other applies a rotation-projection transform to the 3D pose to obtain observations from other viewpoints. The two branches thus realize two different observation processes, and applying a separate constraint to each observation likewise yields a pose error and a distribution error. The loss function of the entire network is:
loss_B = L_GAN(G_R2d G_3d, D_2d) + L_pose(G_K2d G_3d),
where L_GAN denotes the loss function of a generative adversarial network with a gradient penalty term; it reflects the distribution error and is computed as follows:
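In the application document the L_GAN formula itself appears only as an image. Given the symbols defined below (the discriminator D_2d, the composed projection G_R2d G_3d, the hyperparameter λ, and the interpolated sample A_2d), it is presumably the standard WGAN-GP objective; the following is a sketch of that standard form, not a reproduction of the original formula:

```latex
L_{GAN}(G_{R2d}G_{3d}, D_{2d}) =
  \mathbb{E}\left[D_{2d}(X_{2d})\right]
  - \mathbb{E}\left[D_{2d}\big(G_{R2d}G_{3d}(X_{2d})\big)\right]
  + \lambda\,\mathbb{E}\left[\big(\lVert \nabla_{A_{2d}} D_{2d}(A_{2d}) \rVert_2 - 1\big)^2\right]
```

The first two terms estimate the Wasserstein distance between the projected and real 2D pose distributions, and the last term is the gradient penalty evaluated at A_2d, the random interpolate between the two distributions.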
L_pose is the reconstruction loss; it reflects the pose error and is computed as follows:
L_pose(G_K2d G_3d) = ||G_K2d G_3d(X_2d) - X_2d||_1
where λ is a network hyperparameter; G_3d denotes the 3D pose generator; G_R2d denotes the rotation-projection transformation of the 2D pose projection layer; G_K2d denotes the forward-projection transformation of the 2D pose projection layer; D_2d denotes the 2D pose discriminator; X_2d denotes real 2D pose data; and A_2d denotes a random 2D pose sampled on the line segment between samples of the projected and real 2D pose distributions.
During training, modes A and B of the bidirectional projection network share the same set of network modules, shown in Figure 3: a 3D pose generator, a 2D pose discriminator, a 3D pose discriminator, and a 2D pose projection layer.
The 3D pose generator contains two deep residual networks and a pose feature extraction layer. Each deep residual network is a stack of four residual blocks with 1024 neurons per layer. The input 2D pose first passes through a residual network, which outputs an initial depth estimate and thus an initial 3D pose. This initial estimate is then passed to the pose feature extraction layer, where pose-prior topological features are extracted and the 3D pose is encoded into a feature vector containing spatial-angle and depth information. This feature vector is fed into the second deep residual network, which outputs the final depth estimate and produces the final reconstructed 3D pose.
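As a rough illustration of the data flow just described (not the patented implementation), the two-stage generator can be sketched with untrained numpy weights; the joint count, all names, and the placeholder feature-extraction step are assumptions:

```python
import numpy as np

N_JOINTS = 16          # assumed joint count (not specified in the text)
HIDDEN = 1024          # per the description: 1024 neurons per layer
N_BLOCKS = 4           # per the description: four stacked residual blocks

rng = np.random.default_rng(0)

def residual_network(x, in_dim, out_dim):
    """Stack of residual blocks, each Linear -> ReLU -> Linear plus a skip."""
    W_in = rng.standard_normal((in_dim, HIDDEN)) * 0.01
    h = x @ W_in
    for _ in range(N_BLOCKS):
        W1 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.01
        W2 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.01
        h = h + np.maximum(h @ W1, 0.0) @ W2   # residual connection
    W_out = rng.standard_normal((HIDDEN, out_dim)) * 0.01
    return h @ W_out

def generator_3d(pose_2d):
    """pose_2d: flattened (N_JOINTS*2,) 2D joints -> (N_JOINTS*3,) 3D pose."""
    # Stage 1: first residual network outputs an initial per-joint depth.
    depth_init = residual_network(pose_2d, N_JOINTS * 2, N_JOINTS)
    pose_init = np.concatenate([pose_2d, depth_init])
    # Pose feature extraction layer (placeholder: identity encoding here).
    features = pose_init
    # Stage 2: second residual network refines the depth estimate.
    depth_final = residual_network(features, N_JOINTS * 3, N_JOINTS)
    return np.concatenate([pose_2d, depth_final])

pose_3d = generator_3d(rng.standard_normal(N_JOINTS * 2))
print(pose_3d.shape)  # (48,)
```

The sketch only mirrors the shapes and the two-pass structure; in the actual method the weights are learned and the feature extraction layer encodes the pose-prior topology rather than passing the pose through unchanged.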
The 2D and 3D pose discriminators share the same network architecture and differ mainly in their feature extraction layers. A pose of either dimension is first encoded by the corresponding pose feature extraction layer into a feature vector describing the topology of the motion pose, and is then passed through a deep residual network and a fully connected layer, which output the final discriminant value and thereby measure the difference between the two distributions.
The 2D pose projection layer contains two branches, each projecting the pose to a different viewpoint. Observation from the forward viewpoint is performed by a deep residual network built from multiple residual blocks, while observations from other, rotated viewpoints are produced by a pose rotation transformation layer.
The forward projection transformation is:
X_2d = G_2d(X_3d)
The rotation projection transformation is:
X_2d = G_R2d X_3d
where X_2d denotes the 2D pose, X_3d the 3D pose, G_2d the deep-residual-network projection transformation, and G_R2d the rotation transformation.
where the rotation transformation matrix is:
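The rotation matrix itself is reproduced in the application only as an image. A common choice, assumed here purely for illustration, is a rotation about the vertical (y) axis by an angle θ, followed by an orthographic projection that discards the depth coordinate:

```python
import numpy as np

def rotation_projection(pose_3d, theta):
    """pose_3d: (J, 3) joint array -> (J, 2) rotated orthographic view.

    Assumes rotation about the vertical (y) axis; the patented matrix
    is not reproduced in the text and may differ.
    """
    c, s = np.cos(theta), np.sin(theta)
    R_y = np.array([[  c, 0.0,   s],
                    [0.0, 1.0, 0.0],
                    [ -s, 0.0,   c]])
    rotated = pose_3d @ R_y.T      # rotate every joint
    return rotated[:, :2]          # keep (x, y), discard depth z

# Example: a 90-degree rotation turns depth into horizontal displacement.
pose = np.array([[1.0, 2.0, 0.0],
                 [0.0, 1.0, 1.0]])
side_view = rotation_projection(pose, np.pi / 2)
print(side_view)
```

With θ = π/2, a joint's z coordinate maps onto the projected x axis, which is exactly the "other viewpoint" role this branch plays during training.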
Combining the modules above yields the two training modes, A and B, of the bidirectional projection network. The appropriate mode is selected according to the available data, and the loss function is iteratively minimized; after 20-40 epochs of training, a trained 3D pose generator is obtained.
The previously detected 2D poses are then normalized as follows:
1. Reconstruct the central neck coordinate from the detected left and right shoulder joint coordinates:
where (x_T, y_T) is the central neck coordinate, (x_ls, y_ls) the left shoulder coordinate, and (x_rs, y_rs) the right shoulder coordinate;
2. Reconstruct the central spine coordinate from the detected left and right shoulder and hip joints:
where (x_S, y_S) is the central spine coordinate, (x_ls, y_ls) the left shoulder coordinate, (x_rs, y_rs) the right shoulder coordinate, (x_lh, y_lh) the left hip coordinate, and (x_rh, y_rh) the right hip coordinate;
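Both reconstruction formulas appear only as images in the application. A natural reading, assumed here, is that the neck is the midpoint of the two shoulders and the spine is the mean of the two shoulders and two hips:

```python
import numpy as np

def reconstruct_neck(left_shoulder, right_shoulder):
    """(x_T, y_T): assumed midpoint of the left and right shoulder joints."""
    return (np.asarray(left_shoulder) + np.asarray(right_shoulder)) / 2.0

def reconstruct_spine(left_shoulder, right_shoulder, left_hip, right_hip):
    """(x_S, y_S): assumed mean of the shoulder and hip joints."""
    pts = np.array([left_shoulder, right_shoulder, left_hip, right_hip])
    return pts.mean(axis=0)

neck = reconstruct_neck([2.0, 4.0], [4.0, 4.0])
spine = reconstruct_spine([2.0, 4.0], [4.0, 4.0], [2.5, 1.0], [3.5, 1.0])
print(neck, spine)   # neck = [3. 4.], spine = [3.  2.5]
```

These synthesized joints give every detected skeleton the same topology expected by the generator, regardless of which 2D detector produced it.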
The normalized 2D human pose is fed into the trained 3D pose generator, which, from the 2D detection result, outputs a 3D human skeleton consistent with the topology of the human body. Concatenating the skeleton sequence across video frames reconstructs the 3D human pose throughout the video. Because the method of the present invention directly uses the output of a 2D pose detector, the model generalizes well and produces reasonable 3D reconstructions even for unusual poses in natural scenes. The reconstruction results of the method are shown in Figure 5.
3. References. Bracketed numbers in the application document refer to the correspondingly numbered references below.
[1] H.-J. Lee and Z. Chen. Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30(2):148-168, 1985.
[2] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D human pose from 2D image landmarks. In European Conference on Computer Vision (ECCV), pages 573-586. Springer, 2012.
[3] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated second-order label sensitive pooling for 3D human pose estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1661-1668, 2014.
[4] H. Jiang. 3D human pose reconstruction using millions of exemplars. In International Conference on Pattern Recognition (ICPR), pages 1674-1677. IEEE, 2010.
[5] C.-H. Chen and D. Ramanan. 3D human pose estimation = 2D pose estimation + matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5759-5767, 2017.
[6] S. Li and A. B. Chan. 3D human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision (ACCV), pages 332-347. Springer, 2014.
[7] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36(4), 2017.
[8] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3D human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
[9] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1263-1272. IEEE, 2017.
[10] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[11] F. Moreno-Noguer. 3D human pose estimation from a single image via distance matrix regression. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] L. Zhao, X. Peng, Y. Tian, et al. Semantic graph convolutional networks for 3D human pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3425-3435, 2019.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010056119.6A CN111311729B (en) | 2020-01-18 | 2020-01-18 | A 3D Human Pose Reconstruction Method in Natural Scenes Based on Bidirectional Projection Network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111311729A true CN111311729A (en) | 2020-06-19 |
| CN111311729B CN111311729B (en) | 2022-03-11 |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112185104A (en) * | 2020-08-22 | 2021-01-05 | 南京理工大学 | Traffic big data restoration method based on countermeasure autoencoder |
| CN112307940A (en) * | 2020-10-28 | 2021-02-02 | 有半岛(北京)信息科技有限公司 | Model training method, human body posture detection method, device, equipment and medium |
| CN112949462A (en) * | 2021-02-26 | 2021-06-11 | 平安科技(深圳)有限公司 | Three-dimensional human body posture estimation method, device, equipment and storage medium |
| CN113158920A (en) * | 2021-04-26 | 2021-07-23 | 平安科技(深圳)有限公司 | Training method and device for specific motion recognition model and computer equipment |
| CN113170050A (en) * | 2020-06-22 | 2021-07-23 | 深圳市大疆创新科技有限公司 | Image acquisition method, electronic equipment and mobile equipment |
| CN113239892A (en) * | 2021-06-10 | 2021-08-10 | 青岛联合创智科技有限公司 | Monocular human body three-dimensional attitude estimation method based on data enhancement architecture |
| CN113569627A (en) * | 2021-06-11 | 2021-10-29 | 北京旷视科技有限公司 | Human body posture prediction model training method, human body posture prediction method and device |
| CN114581613A (en) * | 2022-04-29 | 2022-06-03 | 杭州倚澜科技有限公司 | Trajectory constraint-based human body model posture and shape optimization method and system |
| WO2022115991A1 (en) * | 2020-12-01 | 2022-06-09 | Intel Corporation | Incremental 2d-to-3d pose lifting for fast and accurate human pose estimation |
| CN114925812A (en) * | 2022-05-16 | 2022-08-19 | 中国科学院上海高等研究院 | Human body joint point detection model training method, human body joint point detection model detection method and storage medium terminal |
| CN115035173A (en) * | 2022-06-08 | 2022-09-09 | 山东大学 | Monocular depth estimation method and system based on interframe correlation |
| CN116205788A (en) * | 2023-04-27 | 2023-06-02 | 粤港澳大湾区数字经济研究院(福田) | Three-dimensional feature map acquisition method, image processing method and related device |
| CN116229574A (en) * | 2023-03-09 | 2023-06-06 | 杭州像衍科技有限公司 | 3D Human Pose Estimation Method Based on Scene Constraints |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2009086088A1 (en) * | 2007-12-21 | 2009-07-09 | Honda Motor Co., Ltd. | Controlled human pose estimation from depth image streams |
| WO2015143134A1 (en) * | 2014-03-19 | 2015-09-24 | Raytheon Company | Bare earth finding and feature extraction for 3d point clouds |
| CN106651770A (en) * | 2016-09-19 | 2017-05-10 | 西安电子科技大学 | Method for reconstructing multispectral super-resolution imaging based on Lapras norm regularization |
| CN106934827A (en) * | 2015-12-31 | 2017-07-07 | 杭州华为数字技术有限公司 | The method for reconstructing and device of three-dimensional scenic |
| CN108460338A (en) * | 2018-02-02 | 2018-08-28 | 北京市商汤科技开发有限公司 | Estimation method of human posture and device, electronic equipment, storage medium, program |
| CN110189253A (en) * | 2019-04-16 | 2019-08-30 | 浙江工业大学 | An Image Super-Resolution Reconstruction Method Based on Improved Generative Adversarial Networks |
| WO2019213450A1 (en) * | 2018-05-02 | 2019-11-07 | Quidient, Llc | A codec for processing scenes of almost unlimited detail |
| CN110427799A (en) * | 2019-06-12 | 2019-11-08 | 中国地质大学(武汉) | Based on the manpower depth image data Enhancement Method for generating confrontation network |
Non-Patent Citations (4)
| Title |
|---|
| CHING-HANG CHEN 等: "3D Human Pose Estimation = 2D Pose Estimation + Matching", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 * |
| JIE LIN 等: "CG Animation Creator: Auto-rendering of Motion Stick Figure Based on Conditional Adversarial Learning", 《CHINESE CONFERENCE ON PATTERN RECOGNITION AND COMPUTER VISION (PRCV)》 * |
| MENGXI JIANG 等: "Reweighted sparse representation with residual compensation for 3D human pose estimation from a single RGB image", 《NEUROCOMPUTING》 * |
| 李翔 等: "基于Kinect的人体三维重建方法", 《计算机系统应用》 * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |