CN102800126A

CN102800126A - Method for recovering real-time three-dimensional body posture based on multimodal fusion

Info

Publication number: CN102800126A
Application number: CN2012102308982A
Authority: CN
Inventors: 肖俊; 刘彬; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2012-07-04
Filing date: 2012-07-04
Publication date: 2012-11-28

Abstract

本发明公开了一种基于多模态融合的实时人体三维姿态恢复的方法。利用深度图解析、颜色识别、人脸检测等多种技术，来实时地获取人体主要关节点在现实世界中的坐标，从而恢复出人体的三维骨架信息。基于每一时刻同步获取得到的场景深度图像及场景彩色图像，利用人脸检测的方法获得人体头部位置信息，利用颜色识别的方法获得佩戴有色标的四肢端点位置信息，再通过四肢端点的位置信息，利用彩色图与深度图的映射关系计算出肘部与膝部的位置信息，最终利用时域信息对获取的骨架进行平滑处理，实时地重建出人体的运动信息。本发明相较于传统的利用近红外设备对人体进行三维姿态恢复的技术，提高了恢复的稳定性，也使人体运动捕获过程更为简便。The invention discloses a method for restoring a real-time three-dimensional posture of a human body based on multimodal fusion. Using depth map analysis, color recognition, face detection and other technologies to obtain the coordinates of the main joint points of the human body in real time in real time, so as to restore the three-dimensional skeleton information of the human body. Based on the scene depth image and scene color image obtained synchronously at each moment, the position information of the human head is obtained by using the method of face detection, and the position information of the end points of the limbs wearing the color code is obtained by the method of color recognition, and then through the position information of the end points of the limbs , use the mapping relationship between the color map and the depth map to calculate the position information of the elbow and knee, and finally use the time domain information to smooth the acquired skeleton, and reconstruct the motion information of the human body in real time. Compared with the traditional technology of using near-infrared equipment to restore the three-dimensional posture of the human body, the present invention improves the stability of the restoration and makes the process of capturing the motion of the human body easier.

Description

Method of Real-time Human 3D Pose Restoration Based on Multimodal Fusion

技术领域 technical field

本发明涉及一种实时人体三维姿态恢复的方法，尤其涉及利用深度图以及色标对人体三维姿态实时地进行恢复的方法。The invention relates to a method for restoring a real-time three-dimensional pose of a human body, in particular to a method for real-time restoring a three-dimensional pose of a human body by using a depth map and a color code.

背景技术 Background technique

人体三维姿态恢复是指通过设备获取现实中人体的运动数据，包括各主要关节点的三维空间坐标信息等；再对这些运动数据进行计算与渲染，从而建立虚拟场景中角色的运动姿态。通过这样的技术，可以将现实中人体的运动与虚拟世界中角色的运动绑定起来，从而驱动虚拟角色的运动。目前，人体的三维姿态恢复技术广泛地应用在电影、动画拍摄以及游戏制作等领域中，该技术相较于传统计算机动画的建模技术，效率更高，可以做到实时性。Human body 3D posture recovery refers to obtaining real human body motion data through equipment, including 3D space coordinate information of major joint points, etc.; and then calculating and rendering these motion data to establish the motion posture of the character in the virtual scene. Through such a technology, the movement of the human body in reality can be bound to the movement of the character in the virtual world, thereby driving the movement of the virtual character. At present, the three-dimensional posture restoration technology of the human body is widely used in the fields of film, animation shooting and game production. Compared with the traditional computer animation modeling technology, this technology is more efficient and can achieve real-time performance.

实现人体三维姿态恢复的技术有很多种，主要分为光学系统和非光学系统。非光学设备一般通过重力加速器或者辅助的机械设备来获取人体的运动数据，使用的并不是很广泛。而目前光学系统中，大多通过近红外设备为主，即由多个红外线摄像头识别出标记点（由反光率较高的材质制成）的位置，再通过定标算法，将标记点坐标转换为三维空间内的坐标。这种技术的优点是恢复出来的姿态比较精确，系统的鲁棒性比较高，而缺点是进行运动捕获的流程较为复杂，成本较高。There are many technologies to realize the three-dimensional posture recovery of the human body, which are mainly divided into optical systems and non-optical systems. Non-optical devices generally obtain human body motion data through gravitational accelerators or auxiliary mechanical devices, which are not widely used. At present, most of the optical systems are based on near-infrared equipment, that is, multiple infrared cameras identify the position of the marking point (made of a material with high reflectivity), and then use a calibration algorithm to convert the coordinates of the marking point into Coordinates in three-dimensional space. The advantage of this technology is that the recovered attitude is more accurate and the system robustness is relatively high, but the disadvantage is that the process of motion capture is more complicated and the cost is higher.

也有人使用多个普通摄像头提供多视角信息，再提取每个视角侧影的特征值后，从数据库中找出相似的姿态。这种技术优点是硬件成本较低，但是需要有特定数据集的支持，而且对于所要捕获的动作也有较大的限制。Some people use multiple ordinary cameras to provide multi-view information, and then extract the eigenvalues of each perspective silhouette to find similar poses from the database. The advantage of this technology is that the hardware cost is low, but it requires the support of a specific data set, and it also has a relatively large limitation on the actions to be captured.

随着微软推出新一代的交互设备Kinect，人体三维姿态恢复的技术又有了新的突破。Kinect设备可以捕获场景的深度图，深度图中每个像素与其在场景中的位置相对应并具有表示从每个参考位置到其场景位置的距离的像素值（换言之，深度图具有图像的形式，其中，像素值指出场景中的物体的形貌信息，而不是亮度或颜色）。Jamie Shotton等人在他们的论文“Real-Time Human PoseRecognition in Parts from Single Depth lmages”中，描述了一种基于机器学习的方法来恢复人体姿态。以色列一家公司Prime Sense也开发出了一种基于启发式方法的技术，通过对深度图进行背景减除、场景重建的方法，恢复出人体三维骨架信息。在以上方法中，只需要通过一台kinect设备，不需要任何的标记点即可实时地恢复出人体的运动姿态，这与传统的光学系统相比，有了很大的提升。同时，这种技术也大大降低了人体三维姿态恢复的成本，使得该技术可以进入家庭娱乐领域。With the introduction of a new generation of interactive device Kinect by Microsoft, the technology of human body three-dimensional posture recovery has made a new breakthrough. The Kinect device can capture a depth map of the scene, each pixel in the depth map corresponds to its position in the scene and has a pixel value representing the distance from each reference position to its scene position (in other words, the depth map has the form of an image, Among them, the pixel value indicates the shape information of the object in the scene, not the brightness or color). In their paper "Real-Time Human PoseRecognition in Parts from Single Depth images", Jamie Shotton et al. describe a machine learning-based approach to recovering human poses. An Israeli company, Prime Sense, has also developed a heuristic-based technology that restores the three-dimensional skeleton information of the human body by performing background subtraction and scene reconstruction on the depth map. In the above method, only one kinect device is needed, and the motion posture of the human body can be recovered in real time without any marking points, which is greatly improved compared with the traditional optical system. At the same time, this technology also greatly reduces the cost of restoring the three-dimensional posture of the human body, allowing this technology to enter the field of home entertainment.

然而，上述方法在稳定性上，与传统光学设备仍然有一定差距，且实现难度较大。However, the above method still has a certain gap with traditional optical equipment in terms of stability, and it is difficult to realize.

发明内容 Contents of the invention

本发明的目的是提供一种基于多模态融合的实时人体三维姿态恢复的方法。The purpose of the present invention is to provide a method for real-time human three-dimensional pose recovery based on multimodal fusion.

基于多模态融合的实时人体三维姿态恢复的方法，它的步骤如下：A method for real-time human three-dimensional posture recovery based on multi-modal fusion, its steps are as follows:

1）以不小于25帧每秒的帧速同步接受包含人体在内的场景深度图序列以及场景彩色图序列，所述场景深度图序列中的每一帧场景深度图由像素矩阵组成，像素矩阵中的每个像素点的值表示该像素点所对应场景中的位置到参考位置的距离，即该像素点的深度值；所述场景彩色图序列中的每一帧图片由像素矩阵组成，像素矩阵中的每个像素点的值表示该像素点所对应场景中的位置所表示的颜色信息，由RGB颜色值表示；1) Simultaneously accept the scene depth map sequence including the human body and the scene color map sequence at a frame rate of not less than 25 frames per second. Each frame of the scene depth map in the scene depth map sequence is composed of a pixel matrix, and the pixel matrix The value of each pixel point in represents the distance from the position in the scene corresponding to the pixel point to the reference position, that is, the depth value of the pixel point; each frame picture in the scene color map sequence is composed of a pixel matrix, and the pixel The value of each pixel point in the matrix represents the color information represented by the position in the scene corresponding to the pixel point, represented by the RGB color value;

2）分割所述场景深度图的背景与前景像素，获得场景深度图中表述人体部位的区域，即前景像素；2) Segmenting the background and foreground pixels of the scene depth map to obtain the area representing the body parts in the scene depth map, that is, the foreground pixels;

3）处理场景深度图中的前景像素，标注出场景深度图中表示人体躯干、头部以及四肢的像素点；3) Process the foreground pixels in the scene depth map, and mark the pixels representing the human torso, head and limbs in the scene depth map;

4）通过人脸检测，识别出场景彩色图中的人脸位置，通过场景彩色图与场景深度图的映射得到人体头部在场景深度图中的投影坐标，并转换为现实世界中的三维坐标，所述投影坐标为三维向量（X，Y，Z），其中（X，Y）具体地指向场景深度图的某个像素点，Z为该像素点的深度值；4) Through face detection, the face position in the scene color map is recognized, and the projection coordinates of the human head in the scene depth map are obtained through the mapping of the scene color map and the scene depth map, and converted into three-dimensional coordinates in the real world , the projection coordinates are three-dimensional vectors (X, Y, Z), where (X, Y) specifically points to a certain pixel point of the scene depth map, and Z is the depth value of the pixel point;

5）根据头部在场景深度图中的投影坐标，计算颈部与肩部在场景深度图中的投影坐标，并转换为现实世界中的三维坐标；5) According to the projection coordinates of the head in the scene depth map, calculate the projection coordinates of the neck and shoulders in the scene depth map, and convert them into three-dimensional coordinates in the real world;

6）通过四肢端点佩戴的带有颜色的标记物，获取手部及脚部在场景深度图中的投影坐标，并转换为现实世界中的三维坐标；6) Obtain the projected coordinates of the hands and feet in the scene depth map through the colored markers worn at the extremities, and convert them into three-dimensional coordinates in the real world;

7）通过手部和肩膀的三维坐标，计算出肘关节在场景深度图中的投影坐标，并转换为现实世界中的三维坐标；7) Calculate the projection coordinates of the elbow joint in the scene depth map through the three-dimensional coordinates of the hands and shoulders, and convert them into three-dimensional coordinates in the real world;

8）通过脚部和臀部的三维坐标，计算出膝关节在场景深度图中的投影坐标，并转换为现实世界中的三维坐标；8) Calculate the projected coordinates of the knee joint in the scene depth map through the three-dimensional coordinates of the feet and hips, and convert them into three-dimensional coordinates in the real world;

9）按照步骤2）至8）处理场景深度图序列及场景彩色图序列中的每一帧，并将每一帧捕获到的人体各部位的三维坐标依据人体结构组成并输出骨架模型，设置每个捕获的人体部位的三维坐标的约束空间和可信度，对骨架模型进行平滑处理，所述约束空间表示了每个捕获的人体部位在相邻两帧内所允许最大的位移范围。9) Process each frame in the scene depth map sequence and scene color map sequence according to steps 2) to 8), and compose the three-dimensional coordinates of each part of the human body captured in each frame according to the human body structure and output the skeleton model, and set each The constraint space and reliability of the three-dimensional coordinates of each captured human body part are smoothed, and the skeleton model is smoothed. The constraint space represents the maximum displacement range allowed by each captured human body part within two adjacent frames.

所述5）中的计算方法：The calculation method in 5) above:

a)在获取头部的三维坐标后，根据预设的颈部参考长度L_neck、颈部的参考深度D_neck以及头部的实际深度D_head，通过下列公式计算得到颈部的实际长度L_Real_neck：a) After obtaining the three-dimensional coordinates of the head, according to the preset neck reference length L _neck , the neck reference depth D _neck and the actual head depth D _head , the actual length of the neck L_Real _neck is calculated by the following formula :

L_Real_neck＝D_head*L_neck/D_neck L_Real _neck ＝D _head *L _neck /D _neck

在头部与躯干连接的线段上，根据颈部的实际长度L_Real_neck获得人体颈部的位置；On the line segment connecting the head and the torso, the position of the human neck is obtained according to the actual length of the neck L_Real _neck ;

b)在获取颈部的三维坐标之后，本方法根据预设的肩部参考宽度W_shoulder、肩部的参考深度D_shoulder以及颈部的实际深度R_neck，通过下列公式计算得到肩部的实际宽度W_Real_shoulder：b) After obtaining the three-dimensional coordinates of the neck, this method calculates the actual width of the shoulder by the following formula according to the preset reference width W _shoulder of the shoulder, the reference depth D _shoulder of the shoulder and the actual depth R _neck of the neck W_Real _shoulder :

W_Real_shoulder＝R_neck*W_shoulder/D_shoulder W_Real _shoulder ＝R _neck *W _shoulder /D _shoulder

在颈部所处的水平线段上，根据肩部实际宽度W_Real_shoulder获得人体左右肩膀的位置；On the horizontal line segment where the neck is located, the position of the left and right shoulders of the human body is obtained according to the actual width of the shoulders W_Real _shoulder ;

c)在计算肩部位置时，需要注意用户有时候会采取侧身的姿态，在这种情况下，需要调整侧身时肩膀投影的宽度；该步骤需要计算出左右肩膀所在位置的深度D_left、D_right以及肩膀的预设长度W_shoulder，那么变化后的肩膀宽度W_Projected应为：c) When calculating the shoulder position, it should be noted that the user sometimes adopts a sideways posture. In this case, the width of the shoulder projection needs to be adjusted; this step needs to calculate the depths D _left and D of the left and right shoulders. _right and the preset length W _shoulder of the shoulder, then the changed shoulder width W _Projected should be:

${W W}_{Projected Projected} = = \sqrt{{W W}_{shoulder shoulder}^{22} - - {(({D D.}_{left left} - - {D D.}_{right right}))}^{22}}$

通过W_Projected，按照步骤b）计算左右肩部的位置；Through W _Projected , follow step b) to calculate the position of the left and right shoulders;

d)通过局部搜索的方法，确保上述a），b），c）步骤得到的肩膀坐标处于前景像素中；以左肩为例，该局部搜索方法在探测左肩膀时，如果估测的左肩部像素点(x,y)处于背景像素中，那么在搜索该估测像素点右侧的像素(x+t,y+t)，其中t为搜索范围阈值，在该范围内的像素点中找出距离该估测像素点最近的前景像素，如果未能找到处于前景像素的点，那么递进式地增加t的值以扩大搜索范围，直至找到最邻近的前景像素为止。d) Through the local search method, ensure that the shoulder coordinates obtained in the above steps a), b), and c) are in the foreground pixel; taking the left shoulder as an example, when the local search method detects the left shoulder, if the estimated left shoulder pixel Point (x, y) is in the background pixel, then search for the pixel (x+t, y+t) on the right side of the estimated pixel point, where t is the threshold of the search range, and find If the foreground pixel closest to the estimated pixel cannot be found, the value of t is increased progressively to expand the search range until the nearest foreground pixel is found.

所述6）中的获取方法：The acquisition method in 6) above:

a)用户需要在手部和脚部佩戴带有颜色的标记物来辅助识别手部和脚部位置，所描述标记物的颜色应区分于用户身体其它部位的颜色；a) The user needs to wear colored markers on the hands and feet to assist in identifying the position of the hands and feet, and the color of the markers described should be distinguished from the colors of other parts of the user's body;

b)将彩色图由RBG颜色空间转换为HSV颜色空间，并提取手部和脚部标记物的HSV颜色特征作为阈值，再对每一帧场景彩色图，使用该阈值对其进行滤波，将不符合该颜色特征的像素点移除，获得颜色阈值图，并通过图像的腐蚀和膨胀操作，去除颜色阈值图中的噪点；b) Convert the color image from the RBG color space to the HSV color space, and extract the HSV color features of the hand and foot markers as a threshold, and then use the threshold to filter each frame of the scene color image, which will not Remove the pixels conforming to the color feature to obtain the color threshold map, and remove the noise in the color threshold map through image erosion and expansion operations;

c)经过以上处理，得到一张二值图像，其中手部和脚部标记物的部位会由相应的斑块(Blob)表述，以该斑块的中心点作为四肢端点的位置，再通过坐标转换，分别获得手部和脚部在现实世界中的三维坐标。c) After the above processing, a binary image is obtained, in which the position of the hand and foot markers will be represented by the corresponding plaque (Blob), and the center point of the plaque is used as the position of the extremity endpoint, and then through the coordinates transformation to obtain the 3D coordinates of the hands and feet in the real world, respectively.

所述7）中的计算方法：The calculation method in the above 7):

a)在计算时，需要在场景深度图中的前景像素中标注属于手臂部位的像素点；先分别通过左右肩膀的位置标注出场景深度图中表示躯干部位的像素点，再将场景深度图中与躯干连接的其余部位分别标注为表示四肢及头部的像素点；当手臂在正前方遮挡住躯干时，需要计算躯干“质心”的深度值，并将躯干上每一像素点的深度值与躯干“质心”的深度值比较，如果深度差大于某一阈值，则标注该像素属于手臂区域，反之，该像素点属于躯干区域；某个区域的“质心”指该区域的平均深度，为此，可以通过计算该区域深度值的直方图，并将具有最高频率的深度值或具有最高频率的两个或多个深度值的平均值设为该区域质心的深度值；a) When calculating, it is necessary to mark the pixels belonging to the arm part in the foreground pixels in the scene depth map; first mark the pixel points representing the torso in the scene depth map through the positions of the left and right shoulders respectively, and then put them in the scene depth map The rest of the parts connected to the torso are marked as pixels representing the limbs and the head; when the arms block the torso directly in front, the depth value of the "centroid" of the torso needs to be calculated, and the depth value of each pixel on the torso is compared with Compare the depth value of the "centroid" of the torso, if the depth difference is greater than a certain threshold, mark the pixel as belonging to the arm area, otherwise, the pixel belongs to the torso area; the "centroid" of a certain area refers to the average depth of the area, for this , by calculating the histogram of the depth values of the region, and setting the depth value with the highest frequency or the average of two or more depth values with the highest frequency as the depth value of the centroid of the region;

b)成功标注出手臂区域的像素点后，以手部为起点，遍历深度图中所有标注为手臂的像素，如果该像素点与手部的距离满足小臂长度的约束条件，则将其标记为潜在的肘部区域；之后再以肩部为起点，再次在手臂上搜索到肩部距离符合大臂长度约束的像素点，将这些点与之前标注出肘部的点取交集即可得到肘部的估测范围，再将这些点的中点标记为人体肘部位置。b) After successfully marking the pixels in the arm area, start from the hand and traverse all the pixels marked as arms in the depth map. If the distance between the pixel and the hand satisfies the constraint condition of the forearm length, mark it is the potential elbow area; then, starting from the shoulder, search again on the arm to find the pixel points whose distance from the shoulder meets the length constraint of the arm, and intersect these points with the points that marked the elbow before to get the elbow The estimated range of the head, and then mark the midpoint of these points as the position of the human elbow.

所述8）中的计算方法：The calculation method in the above 8):

a)在计算时，需要在场景深度图中的前景像素中标注出属于腿部部位的像素点，先分别通过左右肩膀的位置标注出场景深度图中表示躯干部位的像素点，再将场景深度图中与躯干连接的其余部位分别标注为表示四肢及头部的像素点；a) When calculating, it is necessary to mark the pixels belonging to the legs in the foreground pixels in the scene depth map, first mark the pixels representing the torso in the scene depth map through the positions of the left and right shoulders, and then set the scene depth The remaining parts connected to the torso in the figure are marked as pixels representing the limbs and head respectively;

b)成功标注出腿部区域的像素点后，以脚部为起点，遍历深度图中所有标注为腿部的像素点，如果该像素点与脚部的距离满足小腿长度的约束条件，则将其标记为潜在的膝盖点；之后再以臀部为起点，再次在腿部上搜索到臀部距离符合大腿长度约束的像素点，将这些点与之前标注出膝盖的点取交集即可得到膝盖的估测范围，再将这些点的中点标记为人体膝盖位置。b) After the pixels of the leg area are successfully marked, start from the foot and traverse all the pixels marked as legs in the depth map. If the distance between the pixel and the foot satisfies the constraint condition of the leg length, It is marked as a potential knee point; then, starting from the hip, search again on the leg to find the pixel points whose hip distance meets the thigh length constraint, and intersect these points with the points marked with the knee before to get the estimation of the knee. Measure the range, and then mark the midpoint of these points as the human knee position.

所述9）中的处理方法：The processing method in the above 9):

a.针对每一个人体部位定义一个约束长度D以及可信度C，约束长度D可以描述约束范围；约束范围是指一个以该人体部位为球心，D为半径的球体，该球体描述了该人体部位在相邻两帧的时间内，所允许的最大位移范围，不同人体部位的约束空间大小会不一样，手部的约束空间相比肩部会大一些；a. Define a constraint length D and reliability C for each human body part. The constraint length D can describe the constraint range; the constraint range refers to a sphere with the human body part as the center and D as the radius. The sphere describes the The maximum displacement range allowed for human body parts within two adjacent frames, the size of the constrained space of different human body parts will be different, and the constrained space of the hand will be larger than that of the shoulder;

b.可信度表示了该人体部位目前坐标值的准确程度，C的值越高，则该人体部位的位置越准确；初始时每个人体部位的可信度都设置为0，在新的一帧中，如果该人体部位新的位置处于前一帧该人体部位的约束空间内，则该人体部位的可信度增加一点，若该人体部位的可信度已达到最大值，则不需要改变；反之，若该人体部位新的位置在上一帧中该人体部位的约束空间之外，则只需向新的位置移动Length/C的距离，其中Length为该部位原来的位置和新位置所表示线段的长度，随后再将该部位的可信度降低一点。b. Credibility indicates the accuracy of the current coordinates of the body part. The higher the value of C, the more accurate the position of the body part is. Initially, the credibility of each body part is set to 0. In the new In one frame, if the new position of the human body part is in the constrained space of the human body part in the previous frame, the reliability of the human body part is increased a little. If the reliability of the human body part has reached the maximum value, no need Conversely, if the new position of the human body part is outside the constrained space of the human body part in the previous frame, it only needs to move the distance of Length/C to the new position, where Length is the original position and the new position of the part The length of the line segment represented, and then reduce the credibility of the part a bit.

本发明利用深度图解析、颜色识别、人脸检测等多种技术，来实时地获取人体主要关节点在现实世界中的坐标，从而恢复出人体的三维骨架信息，相较于传统的利用近红外设备对人体进行三维姿态恢复的技术，提高了恢复的稳定性，降低了使用成本，也使人体运动捕获过程更为简便，为人体三维姿态恢复技术走进家庭娱乐提供了新的解决方案。The present invention utilizes multiple technologies such as depth map analysis, color recognition, and face detection to obtain real-time coordinates of the main joint points of the human body in the real world, thereby recovering the three-dimensional skeleton information of the human body. Compared with the traditional method of using near-infrared The technology of recovering the three-dimensional posture of the human body by the device improves the stability of the recovery, reduces the cost of use, and makes the process of human motion capture easier, providing a new solution for the technology of restoring the three-dimensional posture of the human body into home entertainment.

附图说明 Description of drawings

下面结合附图和具体实施方式对本发明作进一步的说明。The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

图1为基于多模态融合的实时人体三维姿态恢复的方法流程图；Fig. 1 is the method flowchart of the real-time human three-dimensional pose recovery based on multimodal fusion;

图2为本发明所使用的场景深度图；Fig. 2 is the depth map of the scene used by the present invention;

图3为本发明所恢复的人体三维骨架效果图。Fig. 3 is an effect diagram of the three-dimensional skeleton of the human body restored by the present invention.

具体实施方式 Detailed ways

结合附图1，基于多模态融合的实时人体三维姿态恢复的方法，它的步骤如下：In conjunction with accompanying drawing 1, the method for the real-time human three-dimensional attitude recovery based on multimodal fusion, its steps are as follows:

1)获取场景深度图像及场景彩色图像1) Obtain scene depth image and scene color image

本方法以不小于25帧每秒的帧速获取长度为640、宽度为480像素的场景深度图像和场景彩色图像序列，所述场景深度图序列中的每一帧场景深度图（如图2所示）由像素矩阵组成，像素矩阵中的每个像素点的值表示该像素点所对应场景中的位置到参考位置的距离，即该像素点的深度值；其中场景深度图和场景彩色图在每一时刻都是同步的，场景深度图像和场景彩色图像的每一个像素也是经过对齐的。This method acquires a sequence of scene depth images and scene color images with a length of 640 pixels and a width of 480 pixels at a frame rate of not less than 25 frames per second, and each frame of the scene depth map in the sequence of scene depth maps (as shown in Figure 2 shown) consists of a pixel matrix, and the value of each pixel in the pixel matrix represents the distance from the position in the scene corresponding to the pixel to the reference position, that is, the depth value of the pixel; where the scene depth map and the scene color map are in Every moment is synchronized, and every pixel of the scene depth image and the scene color image are also aligned.

2)对深度图像进行背景剪除2) Background clipping on the depth image

分割所述场景深度图的背景与前景像素，获得场景深度图中表述人体部位的区域，即前景像素。在实现人体姿态跟踪的时候，我们只对前景像素（用户）感兴趣，所以需要去除掉背景像素。在实际实现过程中，有很多不同的背景去除方法。本发明实现的方法是在深度图中先确定一个斑块(Blob)作为对象的身体，然后从斑块中去除具有明显不同深度值的其它斑块。这种方式需要首先确定某个具有最小尺寸的斑块，确定该尺寸涉及到现实世界的坐标系与投影坐标系之间的转换。由深度图可以获得的物体的投影坐标，为了确定物体的实际坐标，需要使用下面的公式将物体的（x，y，深度值）坐标转换为“现实世界”坐标（X_r，Y_r，深度值）：The background and foreground pixels of the scene depth map are segmented to obtain a region representing a human body part in the scene depth map, that is, foreground pixels. When implementing human pose tracking, we are only interested in foreground pixels (users), so we need to remove background pixels. In actual implementation, there are many different background removal methods. The method implemented by the present invention is to first determine a blob in the depth map as the body of the object, and then remove other blobs with obviously different depth values from the blob. This approach needs to first determine a patch with a minimum size, and determining the size involves conversion between the real-world coordinate system and the projected coordinate system. The projected coordinates of the object that can be obtained from the depth map, in order to determine the actual coordinates of the object, the following formula needs to be used to convert the coordinates of the object (x, y, depth value) into "real world" coordinates (X _r , Y _r , depth value):

X_r＝(X-fovx/2)*像素尺寸*深度/参考深度X _r = (X-fovx/2)*pixel size*depth/reference depth

Y_r＝(Y-fovy/2)*像素尺寸*深度/参考深度Y _r = (Y-fovy/2)*pixel size*depth/reference depth

这里，fovx和fovy为x和y方向上的深度图的视野（以像素为单位）。像素尺寸为，在摄像头给定距离（参考深度）处像素所对着的长度。随后，我们就可以利用物体在现实世界中的坐标来计算其欧几里得距离，从而避免由于近大远小所造成的误差。在由现实世界坐标确定出斑块的尺寸后，即可筛选出离摄像头最近的，而且尺寸最大的斑块，并设定其为人体区域。Here, fovx and fovy are the field of view (in pixels) of the depth map in the x and y directions. The pixel size is the length that the pixel is facing at a given distance (reference depth) from the camera. Then, we can use the coordinates of the object in the real world to calculate its Euclidean distance, so as to avoid the error caused by the large distance. After the size of the plaque is determined by the coordinates of the real world, the nearest and largest plaque to the camera can be selected and set as the human body area.

另外一种更为简洁的方法是通过试探性地设置阈值，将深度值大于某一阈值的像素全部设置为背景像素，将尺寸小于某一阈值的斑块，设置为背景像素，这样实现更为简便，但准确性也更低。Another more concise method is to tentatively set the threshold, set all the pixels whose depth value is greater than a certain threshold as background pixels, and set the patches whose size is smaller than a certain threshold as background pixels, so as to achieve more Simpler, but less accurate.

3)在场景深度图中标注人体各区域3) Label each area of the human body in the scene depth map

处理场景深度图中的前景像素，标注出场景深度图中表示人体躯干、头部以及四肢的区域；Process the foreground pixels in the scene depth map, and mark the areas representing the human torso, head and limbs in the scene depth map;

4)通过人脸检测的方法，计算人体头部位置4) Calculate the position of the human head through the method of face detection

在这一步中，本发明使用OpenCV提供的哈尔分类器（Haar Cascade Classifier）进行人脸检测，从场景彩色图中实时地获取用户头部所在的像素点，通过场景彩色图与场景深度图的映射得到人体头部在场景深度图中的投影坐标，并通过步骤2中描述的坐标转换方法将其转换为现实世界中的三维坐标，所述投影坐标为三维向量（X，Y，Z），其中（X，Y）具体地指向场景深度图的某个像素点，Z为该像素点的深度值。In this step, the present invention uses the Haar Cascade Classifier (Haar Cascade Classifier) provided by OpenCV to detect the face, obtains the pixel of the user's head in real time from the scene color map, and uses the scene color map and the scene depth map. The projection coordinates of the human head in the scene depth map are obtained by mapping, and converted into three-dimensional coordinates in the real world through the coordinate transformation method described in step 2, and the projection coordinates are three-dimensional vectors (X, Y, Z), Where (X, Y) specifically points to a certain pixel of the scene depth map, and Z is the depth value of the pixel.

5)根据头部位置，计算肩膀坐标5) Calculate the shoulder coordinates according to the head position

a)在获取头部的三维坐标后，本方法根据预设的颈部参考长度L_neck、颈部的参考深度D_neck以及头部的实际深度D_head，通过下列公式计算得到颈部的实际长度L_Real_neck：a) After obtaining the three-dimensional coordinates of the head, this method calculates the actual length of the neck through the following formula according to the preset neck reference length L _neck , the neck reference depth D _neck and the head actual depth D _head L_Real _neck :

L_Real_neck＝D_head*L_neck/D_neck L_Real _neck ＝D _head *L _neck /D _neck

d)通过局部搜索的方法，确保上述a），b），c）步骤得到的肩膀坐标处于前景像素中；以左肩为例，该局部搜索方法在探测左肩膀时，如果估测的左肩部像素点(x,y)处于背景像素中，那么在搜索该估测像素点右侧的像素(x+t,y+t)，其中t为搜索范围阈值，在该范围内的像素点中找出距离该估测像素点最近的前景像素。如果未能找到处于前景像素的点，那么递进式地增加t的值以扩大搜索范围，直至找到最邻近的前景像素为止。d) Through the local search method, ensure that the shoulder coordinates obtained in the above steps a), b), and c) are in the foreground pixel; taking the left shoulder as an example, when the local search method detects the left shoulder, if the estimated left shoulder pixel Point (x, y) is in the background pixel, then search for the pixel (x+t, y+t) on the right side of the estimated pixel point, where t is the threshold of the search range, and find The closest foreground pixel to the estimated pixel. If the point in the foreground pixel cannot be found, then the value of t is progressively increased to expand the search range until the nearest foreground pixel is found.

6)通过四肢端点佩戴的带有颜色的标记物，获取手部及脚部在场景深度图中的投影坐标，并转换为现实世界中的三维坐标：6) Obtain the projected coordinates of the hands and feet in the scene depth map through the colored markers worn at the endpoints of the limbs, and convert them into three-dimensional coordinates in the real world:

a)在本发明中，用户需要在手部和脚部佩戴带有颜色的标记物来辅助识别手部和脚部位置，所描述标记物的颜色应区分于用户身体其它部位的颜色；a) In the present invention, the user needs to wear colored markers on the hands and feet to assist in identifying the positions of the hands and feet, and the color of the markers described should be distinguished from the colors of other parts of the user's body;

b)将彩色图由RBG颜色空间转换为HSV颜色空间，并提取手部和脚部标记物的HSV颜色特征作为阈值，再对每一帧场景彩色图，使用该阈值对其进行滤波，将不符合该颜色特征的像素点移除，获得颜色阈值图；并通过图像的腐蚀和膨胀操作，去除颜色阈值图中的噪点；b) Convert the color image from the RBG color space to the HSV color space, and extract the HSV color features of the hand and foot markers as a threshold, and then use the threshold to filter each frame of the scene color image, which will not Remove the pixels that meet the color characteristics to obtain the color threshold map; and remove the noise in the color threshold map through image erosion and expansion operations;

c)经过以上处理，得到一张二值图像，其中手部和脚部标记物的部位会由相应的斑块(Blob)表述，我们以该斑块的中心点作为四肢端点的位置，再通过坐标转换，分别获得手部和脚部在现实世界中的三维坐标。c) After the above processing, a binary image is obtained, in which the position of the hand and foot markers will be represented by the corresponding plaque (Blob). We use the center point of the plaque as the position of the end point of the limb, and then pass Coordinate transformation to obtain the 3D coordinates of the hands and feet in the real world, respectively.

7)通过手部和肩膀的三维坐标，计算出肘关节在场景深度图中的投影坐标，并转换为现实世界中的三维坐标：7) Calculate the projected coordinates of the elbow joint in the scene depth map through the three-dimensional coordinates of the hands and shoulders, and convert them into three-dimensional coordinates in the real world:

a)在计算时，需要在场景深度图中的前景像素中标注属于手臂部位的像素点；先分别通过左右肩膀的位置标注出场景深度图中表示躯干部位的像素点，再将场景深度图中与躯干连接的其余部位分别标注为表示四肢及头部的像素点；值得注意的是当手臂在正前方遮挡住躯干的情形，这时如果要判断躯干前某一像素点所表示的是躯干还是手臂区域，需要计算躯干“质心”的深度值，并将该像素点的深度与去躯干“质心”的深度值比较，如果深度差大于某一阈值，则标注该像素属于手臂区域，反之，该像素点属于躯干区域；某个区域的“质心”指该区域的平均深度，为此，可以通过计算该区域深度值的直方图，并将具有最高频率的深度值（或具有最高频率的两个或多个深度值的平均值）设为该区域质心的深度值；a) When calculating, it is necessary to mark the pixels belonging to the arm part in the foreground pixels in the scene depth map; first mark the pixel points representing the torso in the scene depth map through the positions of the left and right shoulders respectively, and then put them in the scene depth map The remaining parts connected to the torso are marked as pixels representing the limbs and the head; it is worth noting that when the arm is directly in front of the torso, if you want to judge whether a pixel in front of the torso represents the torso or For the arm area, it is necessary to calculate the depth value of the "centroid" of the torso, and compare the depth of the pixel with the depth value of the "centroid" of the torso. If the depth difference is greater than a certain threshold, mark the pixel as belonging to the arm area, otherwise, the The pixel points belong to the torso region; the "centroid" of a region refers to the average depth of the region, and this can be done by computing a histogram of the depth values of the region and taking the depth value with the highest frequency (or the two with the highest frequency or the average of multiple depth values) is set as the depth value of the centroid of the area;

8)通过脚部和臀部的三维坐标，计算出膝关节在场景深度图中的投影坐标，并转换为现实世界中的三维坐标：8) Calculate the projected coordinates of the knee joint in the scene depth map through the three-dimensional coordinates of the feet and hips, and convert them into three-dimensional coordinates in the real world:

a)在计算时，需要在场景深度图中的前景像素中标注出属于腿部部位的像素点；先分别通过左右肩膀的位置标注出场景深度图中表示躯干部位的像素点，再将场景深度图中与躯干连接的其余部位分别标注为表示四肢及头部的像素点；a) When calculating, it is necessary to mark the pixels belonging to the legs in the foreground pixels in the scene depth map; first mark the pixels representing the torso in the scene depth map through the positions of the left and right shoulders, and then set the scene depth The remaining parts connected to the torso in the figure are marked as pixels representing the limbs and head respectively;

9)按照步骤2）至8）处理场景深度图序列及场景彩色图序列中的每一帧，并将每一帧捕获到的人体各部位的三维坐标依据人体结构组成并输出如图3所示的人体骨架模型，设置每个捕获的人体部位的三维坐标的约束空间和可信度，对人体骨架模型进行平滑处理，所述约束空间表示了每个捕获的人体部位在相邻两帧内所允许最大的位移范围：9) Follow steps 2) to 8) to process each frame in the scene depth map sequence and scene color map sequence, and compose and output the three-dimensional coordinates of each part of the human body captured in each frame according to the human body structure, as shown in Figure 3 The human skeleton model, setting the constraint space and reliability of the three-dimensional coordinates of each captured human body part, smoothing the human body skeleton model, the constraint space represents the position of each captured human body part in two adjacent frames The maximum displacement range allowed:

a)针对每一个人体部位定义一个约束长度D以及可信度C，约束长度D可以描述约束范围；约束范围是指一个以该人体部位为球心，D为半径的球体，该球体描述了该人体部位在相邻两帧的时间内，所允许的最大位移范围，不同人体部位的约束空间大小会不一样，手部的约束空间相比肩部会大一些；a) Define a constraint length D and reliability C for each human body part. The constraint length D can describe the constraint range; the constraint range refers to a sphere with the human body part as the center and D as the radius. The sphere describes the The maximum displacement range allowed for human body parts within two adjacent frames, the size of the constrained space of different human body parts will be different, and the constrained space of the hand will be larger than that of the shoulder;

b)可信度表示了该人体部位目前坐标值的准确程度，C的值越高，则该人体部位的位置越准确；初始时每个人体部位的可信度都设置为0，在新的一帧中，如果该人体部位新的位置处于前一帧该人体部位的约束空间内，则该人体部位的可信度增加一点，若该人体部位的可信度已达到最大值，则不需要改变；反之，若该人体部位新的位置在上一帧中该人体部位的约束空间之外，则只需向新的位置移动Length/C的距离，其中Length为该部位原来的位置和新位置所表示线段的长度，随后再将该部位的可信度降低一点。b) Credibility indicates the accuracy of the current coordinates of the body part. The higher the value of C, the more accurate the position of the body part is. Initially, the credibility of each body part is set to 0. In the new In one frame, if the new position of the human body part is in the constrained space of the human body part in the previous frame, the reliability of the human body part is increased a little. If the reliability of the human body part has reached the maximum value, no need Conversely, if the new position of the human body part is outside the constrained space of the human body part in the previous frame, it only needs to move the distance of Length/C to the new position, where Length is the original position and the new position of the part The length of the line segment represented, and then reduce the credibility of the part a bit.

Claims

1. method based on the real-time human body three-dimensional pose recovery of multi-modal fusion is characterized in that its step is following:

1) accepts synchronously to comprise human body with the frame speed that is not less than 25 frame per seconds in interior scene depth graphic sequence and scene cromogram sequence; Each frame scene depth figure in the said scene depth graphic sequence is made up of picture element matrix; This pixel of the value representation of each pixel in the picture element matrix the position in the corresponding scene to the distance of reference position, the i.e. depth value of this pixel; Each frame picture in the said scene cromogram sequence is made up of picture element matrix, this pixel of the value representation of each pixel in the picture element matrix the represented colouring information in position in the corresponding scene, represent by the RGB color value;

2) cut apart background and the foreground pixel of said scene depth figure, obtain the zone of statement human body among the scene depth figure, i.e. foreground pixel;

3) foreground pixel among the said scene depth figure of processing marks out the pixel of representing trunk, head and four limbs among the scene depth figure;

4) detect through people's face, identify the people's face position in the scene cromogram, the mapping through scene cromogram and scene depth figure obtains the projection coordinate of human body head in scene depth figure; And converting the three-dimensional coordinate in the real world into, said projection coordinate is tri-vector (X, Y; Z); Wherein (X Y) points to certain pixel of scene depth figure particularly, and Z is the depth value of this pixel;

5), calculate neck and the projection coordinate of shoulder in scene depth figure, and convert the three-dimensional coordinate in the real world into according to the projection coordinate of head in scene depth figure;

6) label of wearing through the four limbs end points that has color obtains the projection coordinate in scene depth figure of hand and foot, and converts the three-dimensional coordinate in the real world into;

7) through the three-dimensional coordinate of hand and shoulder, calculate the projection coordinate of elbow joint in scene depth figure, and convert the three-dimensional coordinate in the real world into;

8) through the three-dimensional coordinate of foot and buttocks, calculate the projection coordinate of knee joint in scene depth figure, and convert the three-dimensional coordinate in the real world into;

9) according to step 2) to 8) handle each frame in scene depth graphic sequence and the scene cromogram sequence; And the three-dimensional coordinate of the partes corporis humani position that each frame-grab is arrived is formed according to organization of human body and the output skeleton pattern; The constraint space and the confidence level of the three-dimensional coordinate of each human body of catching are set; Skeleton pattern is carried out smoothing processing, and said constraint space has represented that each human body of catching allows maximum displacement range in adjacent two frames.

2. the method for the real-time human body three-dimensional pose recovery based on multi-modal fusion according to claim 1 is characterized in that said 5) in computing method:

A) after obtaining the three-dimensional coordinate of head, according to preset neck reference length L _Neck, neck reference depth D _NeckAnd the actual grade D of head _Head, calculate the physical length L_Real of neck through formula _Neck:

L_Real _neck＝D _head*L _neck/D _neck

On head and line segment that trunk is connected, according to the physical length L_Real of neck _NeckObtain the position of neck;

B) after obtaining the three-dimensional coordinate of neck, according to preset shoulder reference width W _Shoulder, shoulder reference depth D _ShoulderAnd the actual grade R of neck _Neck, calculate the developed width W_Real of shoulder through formula _Shoulder:

W_Real _shoulder＝R _neck*W _shoulder/D _shoulder

On the residing horizontal line section of neck, according to shoulder developed width W_Real _ShoulderObtain the position of human body left and right sides shoulder;

C) when calculating shoulder position, under the attitude situation that the user takes to lean to one side, the width of shoulder projection when needing adjustment to lean to one side; This set-up procedure need calculate the depth D of shoulder position, the left and right sides _Left, D _RightAnd the preset length W of shoulder _Shoulder, the shoulder width W after changing so _ProjectedShould be:

W_{Projected} = \sqrt{{W_{shoulder}}^{2} - {(D_{left} - D_{right})}^{2}}

Pass through W _Projected, the position of calculating left and right sides shoulder according to step b);

D) through the method for Local Search, guarantee above-mentioned a), b), c) the shoulder coordinate that obtains of step is in the foreground pixel; The method of described Local Search is when surveying left shoulder, if (x y) is in the background pixel the left shoulder pixel of estimation; So the pixel on this estimation pixel right side of search (x+t, y+t), wherein t is the hunting zone threshold value; Find out the foreground pixel nearest in the pixel in this scope apart from this estimation pixel; If fail to find the point that is in foreground pixel, the value that increases t to enlarge the hunting zone, till finding the most contiguous foreground pixel so ladderingly.

3. the method for the real-time human body three-dimensional pose recovery based on multi-modal fusion according to claim 1 is characterized in that said 6) in acquisition methods:

A) user need wear the label that has color in hand and foot and comes aid identification hand and foot position, and the color of institute's descriptive markup thing should distinguish over the color at other position of user's body;

B) be the hsv color space with cromogram by the RBG color space conversion; And the hsv color characteristic of extracting hand and foot's label to each frame scene cromogram, uses this threshold value that it is carried out filtering as threshold value again; The pixel that does not meet this color characteristic is removed; Obtain color threshold figure, and pass through the corrosion and the expansive working of image, remove the noise among the color threshold figure;

C) through above processing; Obtain a bianry image, wherein the position of hand and foot's label can be by corresponding patch (Blob) statement, with the central point of this patch position as the four limbs end points; Through coordinate conversion, obtain hand and the foot three-dimensional coordinate in real world respectively again.

4. the method for the real-time human body three-dimensional pose recovery based on multi-modal fusion according to claim 1 is characterized in that said 7) in computing method:

A) when calculating, need in the foreground pixel among the scene depth figure, mark the pixel that belongs to the arm position; Mark out the pixel of expression metastomium among the scene depth figure earlier respectively through the position of left and right sides shoulder, again all the other positions that are connected with trunk among the scene depth figure are labeled as the pixel of expression four limbs and head respectively; When arm shelters from trunk in the dead ahead; Need to calculate the depth value of trunk " barycenter "; And with the depth value of the depth value of each pixel on the trunk and trunk " barycenter " relatively,, depth difference belongs to arm regions if, then marking this pixel greater than a certain threshold value; Otherwise this pixel belongs to torso area; " barycenter " in certain zone refers to the mean depth that this is regional; For this reason; Can be through calculating the histogram of this regional depth value, and the mean value that will have the depth value of highest frequency or have two or more depth values of highest frequency is made as the depth value of this zone barycenter;

B) successfully mark out the pixel of arm regions after, be starting point with the hand, all are labeled as the pixel of arm to traversal in the depth map, if the distance of this pixel and hand satisfies the constraint condition of little arm lengths, then it are labeled as potential elbow region; Be starting point afterwards again with the shoulder; On arm, search the shoulder distance once more and meet the pixel that big arm lengths retrains; These points and the point that marks out ancon are before got the estimation scope that common factor can obtain ancon, and the mid point with these points is labeled as human body ancon position again.

5. the method for the real-time human body three-dimensional pose recovery based on multi-modal fusion according to claim 1 is characterized in that said 8) in computing method:

A) when calculating; Need in the foreground pixel among the scene depth figure, mark out the pixel that belongs to the shank position; Mark out the pixel of expression metastomium among the scene depth figure earlier respectively through the position of left and right sides shoulder, again all the other positions that are connected with trunk among the scene depth figure are labeled as the pixel of expression four limbs and head respectively;

B) successfully mark out the pixel of leg area after, be starting point with the foot, all are labeled as the pixel of shank to traversal in the depth map, if the distance of this pixel and foot satisfies the constraint condition of shank length, then it are labeled as potential knee point; Be starting point afterwards again with the buttocks; On shank, search buttocks once more apart from the pixel that meets the thigh length constraint; These points and the point that marks out knee are before got the estimation scope that common factor can obtain knee, and the mid point with these points is labeled as the human knee position again.

6. the method for the real-time human body three-dimensional pose recovery based on multi-modal fusion according to claim 1 is characterized in that said 9) in disposal route:

C) to everyone constraint length D of body region definition and confidence level C, constraint length D can describe restriction range; Restriction range is meant that one is the centre of sphere with this human body; D is the spheroid of radius, and this spheroid has been described this human body in the time of adjacent two frames, the maximum displacement scope that is allowed; The constraint space size at different human body position can be different, and the constraint space of hand is compared shoulder can be big;

D) confidence level has been represented the order of accuarcy of the present coordinate figure of this human body, and the value of C is high more, and then the position of this human body is accurate more; The confidence level of everyone body region all is set to 0 when initial; In a new frame; If the position that this human body is new is in the constraint space of this human body of former frame; Then the confidence level increase of this human body a bit if the confidence level of this human body has reached maximal value, does not then need to change; Otherwise; If outside the constraint space of position this human body in previous frame that this human body is new; Then only need to move the distance of Length/C to new position; Wherein Length is the length of original position, this position and the represented line segment of reposition, the confidence level at this position is reduced a bit subsequently again.