TW202137144A

TW202137144A - Method of estimating three-dimensional human skeleton of human body in an image, three-dimensional human skeleton estimator, and training method of estimator which is obtained by using two-dimensional skeleton and relative joint point depth estimated by deep learning network as inputs

Info

Publication number: TW202137144A
Application number: TW109108622A
Authority: TW
Inventors: 賴文能; 施龍聖
Original assignee: 國立中正大學
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2021-10-01
Also published as: TWI753382B

Abstract

Disclosed is a method for estimating the three-dimensional human skeleton of a human body in an image, together with deep network model training of the estimator used. The estimator consists mainly of a first-stage calculation model and a second-stage calculation model. The first-stage calculation model is built with a deep convolutional neural network as its backbone. The second-stage calculation model is an artificial neural network of fully connected modules with a residual connection architecture. To train the estimator, the training sample images of a first training group are input sequentially to the first-stage calculation model to obtain its optimized parameters. The output of the first-stage calculation model under those optimized parameters is then used as the input of the second-stage calculation model, whose optimized parameters are obtained by training on a second training group; the second training group may be the same as or different from the first training group. To estimate the three-dimensional human skeleton in an image, the image is input to the calculation model formed by cascading the first and second stages, which yields the estimated human skeleton.

Description

A method for estimating a three-dimensional human skeleton of a human body in an image, a three-dimensional human skeleton estimator, and a training method of the estimator

The present invention relates to image processing technology, and in particular to a method for estimating a three-dimensional human skeleton of a human body in an image, a three-dimensional human skeleton estimator, and a training method of the estimator.

In existing technology, the positions of human joint points in three-dimensional space are obtained mainly with expensive marker-based motion capture systems or with depth cameras. However, such technology has high equipment cost, is restricted to particular sites, and is not easily extended to detecting the skeletons of multiple people.

To address the cost problem above, the paper A Simple yet Effective Baseline for 3D Human Pose Estimation (ICCV, 2017) proposed a two-stage method for estimating the 3D human skeleton: in the first stage, a two-dimensional skeleton estimator outputs the two-dimensional skeleton in a two-dimensional image, and in the second stage the two-dimensional skeleton is lifted to a three-dimensional skeleton. This method has been shown to perform very well, and most current research builds its improvements on it. For example, one known work extends the method to deep learning with neural networks. Because learning the mapping from two dimensions to three dimensions is under-constrained, the authors of that work define a grammar of the skeleton as constraints, such as kinematic relations, symmetry relations, and motion coordination relations, and learn the interactions between joints through a bidirectional recurrent neural network connected at the end.

The paper Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation (AAAI, 2018), in turn, describes using a first stage to estimate the two-dimensional skeleton and the depth ordering among the joint points, and a second stage that regresses the three-dimensional human skeleton by neural-network deep learning.

In the prior art described above, the two-stage approach to regressing the three-dimensional human skeleton still leaves room for improvement in generating the skeleton accurately. The present invention uses the two-dimensional skeleton and the relative joint point depths estimated by a deep learning network as inputs to generate true three-dimensional joint point coordinates.

Based on the above, the present invention proposes a method for estimating a three-dimensional human skeleton of a human body in an image, comprising the following steps:

S1) Estimating human bounding boxes: the image is input to a bounding box calculation model trained with a neural network, and the bounding box calculation model outputs one or more human bounding boxes. Each human bounding box is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content within the box.

S2) Data normalization: a normalization method is used to make the pixel color values of the image content within each human bounding box normally distributed.

S3) Estimation of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the depth of each joint point relative to a root joint: a first-stage calculation model 21, trained with a deep convolutional neural network as its backbone, is used. The image within each human bounding box normalized in step S2) is input to the first-stage calculation model 21, and the output is a three-dimensional heat map. From the three-dimensional heat map one obtains the image coordinates of a plurality of joint points on a two-dimensional human skeleton, the image coordinates of a root joint, and the estimated depth of the joint points relative to the root joint, where the root joint corresponds to one joint point on the human skeleton.

S4) Three-dimensional human skeleton estimation: a second-stage calculation model, trained with fully connected modules as its backbone, is used. The joint point image coordinates from step S3) and the estimated depths of the joint points relative to the root joint are input to the second-stage calculation model, and the output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points. The fully connected module is a neural network with a residual connection architecture, and the output three-dimensional joint coordinates are relative to the root joint.

In this way, the present invention uses two-stage training with the two-dimensional skeleton and the relative joint point depths as intermediate results (that is, the output of the first stage and the input of the second stage), and can thereby generate accurate three-dimensional joint coordinates. In estimating the three-dimensional skeleton (and the coordinates of its joint points), the present invention is more accurate than currently known technology.

In addition, the present invention proposes a method for model training of a three-dimensional human skeleton estimator. The three-dimensional human skeleton estimator includes a first-stage calculation model and a second-stage calculation model. The first-stage calculation model is built with a deep convolutional neural network as its backbone, and the second-stage calculation model is a neural network of fully connected modules with a residual connection architecture. The method comprises the following steps:

SS1) Inputting a first training group: the first training group contains a plurality of training samples, each with a human bounding box. Each human bounding box is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content within the box.

SS2) Data normalization: a normalization method is used to make the pixel color values of the image content within each human bounding box normally distributed.

SS3) Parameter training of the estimator of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the depth of each joint point relative to a root joint: the image within each human bounding box normalized in step SS2) is input to the first-stage calculation model, and the output is a three-dimensional heat map. From the three-dimensional heat map one obtains the image coordinates of the joint points on the two-dimensional human skeleton and the estimated depth of the joint points relative to the root joint, where the root joint corresponds to one joint point on the human skeleton. A first loss function is used when training the parameters of the first-stage calculation model.

SS4) Sequentially inputting the training samples of the first training group: while the training samples of the first training group are input in sequence, the first loss function is optimized to obtain the optimized parameters of the first-stage calculation model.

SS5) Parameter training of the three-dimensional human skeleton estimator: using the optimized parameters of the first-stage calculation model obtained in step SS4), the joint point image coordinates and their depth estimates relative to the root joint are predicted for the training samples of a second training group and input to the second-stage calculation model, whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points. These output coordinates are relative to the root joint. A second loss function is used when training the parameters of the second-stage calculation model.

SS6) Obtaining optimized parameters: step SS5) is executed in a loop to input the joint point image coordinates and relative depth estimates corresponding to the training samples of the second training group, together with the ground truth of the three-dimensional human skeleton and joint point coordinates at the network output, after which the second loss function is optimized to obtain the optimized parameters of the second-stage calculation model.

In this way, the present invention can train the three-dimensional human skeleton estimator into an effective model that can be operated to estimate the three-dimensional human skeleton and joint point coordinates in an input image.

The present invention also discloses a three-dimensional human skeleton estimator, comprising:

A first-stage calculation model, built with a deep convolutional neural network as its backbone. Its input is a normalized human bounding box image, where the human bounding box is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content within the box. Its output is a three-dimensional heat map, from which one obtains the image coordinates of a plurality of joint points on a two-dimensional human skeleton and the estimated depth of the joint points relative to a root joint, where the root joint corresponds to one joint point on the human skeleton.

A second-stage calculation model, which is a neural network of fully connected modules with a residual connection architecture. Its input is the joint point image coordinates output by the first-stage calculation model and the estimated depths of the joint points relative to the root joint. Its output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points, which are relative to the root joint.

To describe the technical features of the present invention in detail, the following preferred embodiments are presented with reference to the drawings, in which:

As shown in FIGS. 1 to 5, the first preferred embodiment of the present invention proposes a method for estimating a three-dimensional human skeleton of a human body in an image, which mainly has the following steps:

S1) Estimating human bounding boxes: an image is input to a bounding box calculation model trained with a neural network, and the bounding box calculation model outputs one or more human bounding boxes 11. The number of human bounding boxes 11 depends mainly on the number of human figures in the image, and the output human bounding boxes 11 can be marked on the image, as shown in FIG. 2. Each human bounding box 11 is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content within the human bounding box 11. In this embodiment, the neural network can be CenterNet, which uses a fully convolutional network (FCN) as its backbone, for example ResNet combined with transposed convolution layers, Hourglass, or Deep Layer Aggregation (DLA), but it is not limited to these. Any technique that can locate human bodies in an image can be used to find all the human bounding boxes 11.
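The bounding box described in step S1) can be illustrated with a minimal Python sketch. It assumes the box is given by its center point plus a width and a height, which the text does not state; the function name `crop_bounding_box` and its parameters are hypothetical and only stand in for whatever the bounding box calculation model actually outputs.

```python
import numpy as np

def crop_bounding_box(image, center, width, height):
    """Return the four corner coordinates and the pixels inside a box.

    image  : (H, W, 3) array
    center : (cx, cy) image coordinates of the body reference point
    width, height : box size in pixels (assumed inputs, not from the source)
    """
    img_h, img_w = image.shape[:2]
    cx, cy = center
    # Corner coordinates, clipped to the image borders.
    x0 = int(max(0, round(cx - width / 2)))
    y0 = int(max(0, round(cy - height / 2)))
    x1 = int(min(img_w, round(cx + width / 2)))
    y1 = int(min(img_h, round(cy + height / 2)))
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    # Image content belonging to the box.
    return corners, image[y0:y1, x0:x1]
```

Each crop returned this way would be the input to the normalization of step S2).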

S2) Data normalization: a normalization method is used to make the pixel color values of the image content within each human bounding box 11 normally distributed. In this embodiment, the normalization method is Z-score normalization, as shown in formula (1):

$$p' = \frac{p - \mu}{\sigma} \tag{1}$$

where μ is the mean and σ is the standard deviation, each taken per RGB channel of the image, p is the original image pixel value (a three-channel RGB value), and p' is the normalized pixel value. In this way, the mean and standard deviation of R, G, and B are used to normalize the pixel color values of each channel to the range [0,1].
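As an illustration of formula (1), the sketch below applies Z-score normalization per RGB channel with NumPy. The function name `zscore_normalize` is hypothetical, and computing μ and σ from the crop itself when none are supplied is an assumption; the text does not say whether the statistics come from each box or from a whole dataset.

```python
import numpy as np

def zscore_normalize(crop, mean=None, std=None):
    """Apply p' = (p - mu) / sigma independently to each RGB channel.

    crop : (H, W, 3) array of pixel values
    mean, std : optional per-channel statistics; if omitted they are
                computed from the crop (an assumption of this sketch)
    """
    crop = np.asarray(crop, dtype=np.float64)
    mean = crop.mean(axis=(0, 1)) if mean is None else np.asarray(mean, dtype=np.float64)
    std = crop.std(axis=(0, 1)) if std is None else np.asarray(std, dtype=np.float64)
    # Guard against division by zero for a constant channel.
    std = np.where(std == 0, 1.0, std)
    return (crop - mean) / std
```

With statistics taken from the crop, each output channel has zero mean and unit standard deviation.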

S3) Estimation of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the depth of each joint point relative to a root joint: a first-stage calculation model 21, trained with a deep convolutional neural network as its backbone, is used. The image within each human bounding box 11 normalized in step S2) is input to the first-stage calculation model 21, and the output is several three-dimensional heat maps. From these three-dimensional heat maps one obtains the image coordinates of a plurality of joint points on a two-dimensional human skeleton (each quantized to the range [0,64]) and the estimated depth of those joint points relative to the root joint (quantized to the range [0,64]); the estimation architecture is shown in FIG. 3. The relative depth of a joint point is defined as Z-Zroot, where Zroot is the depth of the root joint (which can be obtained by another method or defined as the zero-depth point) and Z is the depth of the joint point. Because the output of the first-stage calculation model 21 has been quantized and normalized, the relationship between the normalized relative joint depth depth' and the actual relative depth Z-Zroot is given by formula (2):

Formula (2)

where scale is a given scaling factor.

Using formula (2), the relative depth Z-Zroot of each joint point can be computed from the depth' predicted by the first-stage calculation model 21. In addition, the root joint corresponds to one joint point on the human skeleton; in this embodiment the coordinates of the pelvis point are used as the root joint coordinates, but the invention is not limited to this.
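How joint coordinates and a quantized depth might be read off the three-dimensional heat maps of step S3) is sketched below. The text does not say how the heat maps are decoded, so taking the position of the maximum response (argmax) is an assumption of this sketch, as are the function name `decode_heatmaps` and the array layout (joints, depth bins, rows, columns).

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Read (x, y, depth') bin indices from per-joint 3D heat maps.

    heatmaps : (J, D, H, W) array, one 3D heat map per joint
    returns  : (J, 3) integer array with columns x, y, depth'
    """
    num_joints, depth_bins, height, width = heatmaps.shape
    # Index of the strongest response in each flattened 3D heat map.
    flat_idx = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    # Convert the flat index back to (depth, row, column) bins.
    d, y, x = np.unravel_index(flat_idx, (depth_bins, height, width))
    return np.stack([x, y, d], axis=1)
```

The resulting depth' bin would still have to be converted to Z-Zroot through formula (2), which this sketch does not attempt.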

S4) Three-dimensional human skeleton estimation: a second-stage calculation model 31, trained with fully connected modules as its backbone, is used. The joint point image coordinates and the relative depth estimates output by the first-stage calculation model in step S3) are input to the second-stage calculation model 31, and the output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points. The fully connected module is a neural network with a residual connection architecture, shown in FIG. 4, and the output three-dimensional joint coordinates are relative to the root joint. The three-dimensional joint coordinates of the estimated skeleton model are expressed by linear normalization, with values between -1.0 and 1.0:

Formula (3)

Formula (3) relates the normalized three-dimensional joint point coordinates to the given three-dimensional coordinates of the root joint. Using formula (3), the three-dimensional coordinates (X, Y, Z) of each joint point can be recovered.
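To make the idea of a fully connected network with residual connections concrete, the following NumPy sketch runs a forward pass from (x, y, depth') per joint to three values per joint. It is not the architecture of FIG. 4: the class name `ResidualFCNet`, the hidden width, the number of blocks, the omission of biases, the random untrained weights, and the final `tanh` (used here only to keep outputs within the stated -1.0 to 1.0 range) are all assumptions.

```python
import numpy as np

class ResidualFCNet:
    """Toy fully connected network with residual blocks (untrained)."""

    def __init__(self, num_joints=17, hidden=64, num_blocks=2, seed=0):
        rng = np.random.default_rng(seed)
        dim = num_joints * 3
        self.num_joints = num_joints

        def init(rows, cols):
            # Small random weights; a real model would learn these.
            return rng.normal(0.0, 0.01, size=(rows, cols))

        self.w_in = init(dim, hidden)
        self.blocks = [(init(hidden, hidden), init(hidden, hidden))
                       for _ in range(num_blocks)]
        self.w_out = init(hidden, dim)

    def forward(self, joints_2d_depth):
        """joints_2d_depth : (J, 3) array of (x, y, depth') per joint."""
        x = np.asarray(joints_2d_depth, dtype=np.float64).reshape(-1)
        h = np.maximum(x @ self.w_in, 0.0)            # linear + ReLU
        for w1, w2 in self.blocks:
            inner = np.maximum(h @ w1, 0.0)           # first FC layer
            inner = np.maximum(inner @ w2, 0.0)       # second FC layer
            h = h + inner                             # residual connection
        # tanh is an assumption; it bounds the output to [-1, 1].
        return np.tanh(h @ self.w_out).reshape(self.num_joints, 3)
```

The residual connection adds each block's input back to its output, so a block only has to learn a correction to the features it receives.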

As steps S1) to S4) of this first embodiment show, the present invention uses the estimated two-dimensional skeleton and the relative joint depths as input to generate true three-dimensional joint coordinates, and its joint position estimates are more accurate than those of currently known technology.

As shown in FIGS. 5 to 7, the second preferred embodiment of the present invention proposes a method for model training of a three-dimensional human skeleton estimator, mainly to explain how the estimator used in the estimation method of the first embodiment is trained. The three-dimensional human skeleton estimator includes a first-stage calculation model 21 and a second-stage calculation model 31. The first-stage calculation model 21 is the one used in step S3) of the first embodiment, and the second-stage calculation model 31 is the one used in step S4) of the first embodiment. The training method of this second embodiment has the following steps:

SS1) Inputting a first training group: the first training group 19 contains a plurality of training samples 191. Each training sample 191 is an image with a human bounding box 11 (shown in FIG. 2). Each human bounding box 11 is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content within the human bounding box 11. The first training group 19 is shown in FIG. 6. When the first training group 19 is input, the output ground truth corresponding to each training sample 191 is input as well.

SS2) Data normalization: a normalization method is used to make the pixel color values of the image content within each human bounding box 11 normally distributed. Taking formula (1) above as an example, the mean and standard deviation of R, G, and B are used to normalize the pixel color values to the range [0,1].

SS3) Parameter training of the estimator of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the depth of each joint point relative to a root joint: the image within each human bounding box 11 normalized in step SS2) is input to the first-stage calculation model 21, and the output is several three-dimensional heat maps. From these heat maps one obtains the image coordinates of the joint points on the two-dimensional human skeleton and the estimated depth of those joint points relative to a root joint; the estimation network architecture is shown in FIG. 6. The root joint corresponds to one joint point on the human skeleton, for example the pelvis point of the human skeleton model. A first loss function is used when training the network parameters of the first-stage calculation model 21. During network training, the ground truth of the relative depth of each joint point is computed as Z-Zroot, where Zroot is the true depth of the root joint and Z is the true depth of the joint point. When training the first-stage calculation model 21, the ground-truth relative depths must also be quantized and normalized so that the normalized relative joint depth depth' lies in the range [0,64]; the relationship between depth' and the actual relative depth Z-Zroot is given by formula (2) above.

The first loss function is the mean absolute error (MAE) of the joints, where the error concerns the two-dimensional image coordinates and the relative depth of each joint. The first loss function is denoted L_first and is given by formula (4):

Formula (4)

where $\hat{s}^j = (\hat{x}^j, \hat{y}^j, \widehat{depth}^j)$ is the output predicted by the first-stage calculation model 21 and $s^j = (x^j, y^j, depth^j)$ is the corresponding ground truth.
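A mean absolute error over the predicted and ground-truth (x, y, depth) triples can be sketched as follows. Formula (4) is not reproduced in this text, so the averaging convention used here (over all joints and all three components) is an assumption, and the function name `first_stage_loss` is hypothetical.

```python
import numpy as np

def first_stage_loss(pred, target):
    """Mean absolute error between predicted and true joint triples.

    pred, target : (J, 3) arrays with columns x, y, depth
    Averages over every joint and every component (assumed convention).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.abs(pred - target).mean())
```

Because the absolute error grows linearly with the deviation, a single badly placed joint influences this loss less than it would a squared-error loss.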

SS4) Sequentially inputting the training samples of the first training group: while the training samples 191 of the first training group 19 are input in sequence, the first loss function is optimized to obtain the optimized parameters of the first-stage calculation model 21.

SS5) Training of the parameters of the three-dimensional human skeleton estimator: using the optimized parameters of the first-stage calculation model 21 obtained in step SS4), the joint point image coordinates and their depth estimates relative to the root joint are predicted for each training sample 191 in turn and input to the second-stage calculation model 31, whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points; these output coordinates are relative to the root joint. The two-dimensional joint image coordinates and relative depths used as input for training the second-stage calculation model 31 need not come from the training samples 191 of the first training group 19 as shown in FIG. 6; they may also come from the training samples 291 of a separate second training group 29, as shown in FIG. 7, where the second training group 29 may be identical to, contained in, or different from the first training group 19. A second loss function is used when training the parameters of the second-stage calculation model 31. The second loss function L_second consists of the three-dimensional coordinate error of the joint points and the error of a set of bone vectors, where each bone vector represents the vector between the two joint points of an actual or virtual bone, obtained by subtracting the predefined start point of the bone from its end point (the arrow points toward the end point). The second loss function is computed as shown in formula (5):

Formula (5)

where $\hat{S}^j$ is the predicted three-dimensional coordinate of the j-th joint point, $S^j$ is the corresponding ground truth, $\hat{B}^k$ is the predicted three-dimensional bone vector, and $B^k$ is the ground-truth three-dimensional bone vector. dest(k) and start(k) are two functions that give the joint indices of the end point and the start point of the k-th bone. The loss function has two hyperparameters, and K=22 and J=17 are the predefined numbers of bones and joint points.

The function given in formula (6) measures the degree of dissimilarity between the whole set of predicted bone vectors $\hat{B}^k$ and the whole set of ground-truth bone vectors $B^k$.

Formula (6)

The second loss function is based on the joint point positions S^j and the bone vectors B^k. The bone vector error L_bone lets the neural network learn the spatial relationships of the skeleton structure, which strengthens the physical constraints between joint points. The joint point error L_joint lets the neural network learn precise coordinate positions. A bone vector is the bone feature obtained by subtracting the predefined start point of a bone from its end point (the arrow points toward the end point); the definition of the end points and start points of the bones is shown in FIG. 8.
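The combination of a joint term and a bone-vector term can be sketched as below. Formulas (5) and (6) are not reproduced in this text, so the mean Euclidean distance used for both terms and the weights `w_joint` and `w_bone` are assumptions, not the patent's definitions; the `start` and `dest` lists play the role of start(k) and dest(k), and the three-joint chain in the usage note is a toy skeleton rather than the J=17, K=22 layout of FIG. 8.

```python
import numpy as np

def bone_vectors(joints, start, dest):
    """B^k = S^dest(k) - S^start(k) for every bone k."""
    joints = np.asarray(joints, dtype=np.float64)
    return joints[dest] - joints[start]

def second_stage_loss(pred, target, start, dest, w_joint=1.0, w_bone=1.0):
    """Weighted sum of a joint position error and a bone vector error.

    pred, target : (J, 3) arrays of 3D joint coordinates
    start, dest  : lists of joint indices, one entry per bone
    w_joint, w_bone : hypothetical weights standing in for the
                      hyperparameters mentioned in the text
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Joint term: mean Euclidean distance between joints (assumed form).
    l_joint = np.linalg.norm(pred - target, axis=1).mean()
    # Bone term: mean Euclidean distance between bone vectors (assumed form).
    diff = bone_vectors(pred, start, dest) - bone_vectors(target, start, dest)
    l_bone = np.linalg.norm(diff, axis=1).mean()
    return float(w_joint * l_joint + w_bone * l_bone)
```

For a toy chain with `start=[0, 1]` and `dest=[1, 2]`, shifting every predicted joint by the same offset changes only the joint term, since all bone vectors are unchanged. Moving a single joint changes both terms, which is how the bone term penalizes distortions of the skeleton's shape.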

SS6) Obtain optimized parameters: step SS5) is executed repeatedly to input the image coordinates of the plural joint points corresponding to the plural training samples 291 of the second training group 29 and the estimated relative depths with respect to the root joint point, together with the ground-truth three-dimensional human skeleton and joint point coordinates at the network output; the second loss function is then optimized to obtain the optimized parameters of the second-stage calculation model 31.

As the above steps show, the second embodiment trains the two-stage three-dimensional human skeleton estimator of the present invention into an effective, usable model. The trained estimator can then be used, following the method of the first embodiment, to estimate the three-dimensional human skeleton of a human body in an image.

11: Human body bounding box
19: First training group
191: Training sample
21: First-stage calculation model
29: Second training group
291: Training sample
31: Second-stage calculation model

Fig. 1 is a flowchart of the first preferred embodiment of the present invention.
Fig. 2 is a schematic image of the first preferred embodiment of the present invention, showing human body bounding boxes marked on the image.
Fig. 3 is a schematic diagram of the deep learning network architecture of the first preferred embodiment of the present invention, showing the architecture of the first-stage calculation model.
Fig. 4 is a schematic diagram of another deep learning network architecture of the first preferred embodiment of the present invention, showing the architecture of the second-stage calculation model.
Fig. 5 is a flowchart of the second preferred embodiment of the present invention.
Fig. 6 is a schematic diagram of the deep network architecture and learning of the second preferred embodiment of the present invention, showing the training of the first-stage calculation model using the first training group.
Fig. 7 is a further schematic diagram of the deep network architecture and learning of the second preferred embodiment of the present invention, showing the training of the second-stage calculation model using the two-dimensional human skeleton joint point coordinates formed from a second training group and the relative depth of each joint point with respect to the root joint point.
Fig. 8 is a schematic diagram of the joints and bones of the human skeleton in the second preferred embodiment of the present invention.

Claims

A method for estimating the three-dimensional human skeleton of a human body in an image, comprising the following steps: S1) Estimating human body bounding boxes: the image is input to a bounding box calculation model that has been trained as a neural network, and the bounding box calculation model outputs one or more human body bounding boxes, each human body bounding box being a rectangular area that comprises the image coordinates of a center point located at a reference point of the human body, the image coordinates of four vertices, and the image content belonging to the human body bounding box; S2) Data normalization: a normalization method is used so that the pixel color values of the image content within each human body bounding box are normally distributed; S3) Estimating the image coordinates of the plural joint points of the two-dimensional human skeleton and the relative depth of each joint point with respect to a root joint point: using a first-stage calculation model trained with a deep convolutional neural network as its backbone, the image within each human body bounding box normalized in step S2) is input into the first-stage calculation model, the output obtained is a three-dimensional heat map, and from the three-dimensional heat map are obtained the image coordinates of the plural joint points on the two-dimensional human skeleton and the estimated relative depths of the plural joint points with respect to the root joint point, wherein the root joint point corresponds to one joint point on the human skeleton; and S4) Three-dimensional human skeleton estimation: using a second-stage calculation model trained with a fully connected module as its backbone, the image coordinates of the plural joint points obtained in step S3) and the estimated relative depths of the plural joint points with respect to the root joint point are input into the second-stage calculation model, and the output obtained is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its plural joint points, wherein the fully connected module is a neural network and the three-dimensional coordinates of the output joint points are expressed relative to the root joint point.
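To make step S3) more concrete, the sketch below shows one possible way to read a joint's image coordinates and a relative depth out of a three-dimensional heat map. The claim only states that these values are obtained from the heat map; taking the highest-scoring voxel (argmax), the `[depth_bin][row][col]` layout, and the linear mapping of depth bins onto [-1.0, 1.0] are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

HeatMap3D = List[List[List[float]]]  # indexed as [depth_bin][row][col]


def decode_joint(heatmap: HeatMap3D) -> Tuple[int, int, int]:
    """Return (col, row, depth_bin) of the highest-scoring voxel."""
    best = (0, 0, 0)
    best_score = float("-inf")
    for d, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, score in enumerate(row):
                if score > best_score:
                    best_score = score
                    best = (x, y, d)
    return best


def bin_to_relative_depth(depth_bin: int, num_bins: int) -> float:
    """Map a depth bin index linearly onto [-1.0, 1.0] (hypothetical convention)."""
    return 2.0 * depth_bin / (num_bins - 1) - 1.0


# A toy heat map with 3 depth bins of 2 x 2 pixels; the peak is at
# depth_bin=2, row=1, col=0.
toy = [
    [[0.0, 0.1], [0.0, 0.0]],
    [[0.0, 0.0], [0.2, 0.0]],
    [[0.0, 0.0], [0.9, 0.1]],
]
x, y, d = decode_joint(toy)
rel_depth = bin_to_relative_depth(d, num_bins=len(toy))
```

One heat map of this kind would be decoded per joint point, giving the (x, y) image coordinates and relative depth that step S4) takes as input.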

The method for estimating the three-dimensional human skeleton of a human body in an image according to claim 1, wherein the root joint point is the pelvis point of the human skeleton model.

The method for estimating the three-dimensional human skeleton of a human body in an image according to claim 1, wherein: in step S3), the estimated relative depth of each of the plural joint points is defined as Z - Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of the respective joint point; and in step S4), the three-dimensional coordinates of the joint points of the three-dimensional human skeleton model are expressed in linearly normalized form, with values between -1.0 and 1.0.
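The two definitions in this claim can be sketched in a few lines of Python. The relative depth Z - Zroot follows the claim directly; the scale `half_range` (here 1000.0, imagined as millimetres) and the clipping of out-of-range values are assumptions, since the claim states only that the normalization is linear and that values lie between -1.0 and 1.0.

```python
from typing import List, Sequence


def relative_depths(depths: Sequence[float], root_index: int = 0) -> List[float]:
    """Compute Z - Zroot for every joint point."""
    z_root = depths[root_index]
    return [z - z_root for z in depths]


def linear_normalize(values: Sequence[float], half_range: float) -> List[float]:
    """Scale linearly so that +/-half_range maps to +/-1.0, clipping anything beyond."""
    return [max(-1.0, min(1.0, v / half_range)) for v in values]


# Three joint points; index 0 plays the role of the root joint point.
depths_mm = [3000.0, 3250.0, 2500.0]
rel = relative_depths(depths_mm)
norm = linear_normalize(rel, half_range=1000.0)
```

Here `rel` is `[0.0, 250.0, -500.0]` and `norm` is `[0.0, 0.25, -0.5]`: the root joint point always sits at zero, and joint points closer to the camera than the root take negative values.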

The method for estimating the three-dimensional human skeleton of a human body in an image according to claim 1, wherein: in step S2), the normalization method uses the mean and standard deviation of R, G, and B to normalize the pixel color values to the range [0, 1].

A method for model training of a three-dimensional human skeleton estimator, wherein the three-dimensional human skeleton estimator comprises a first-stage calculation model and a second-stage calculation model, the first-stage calculation model is constructed with a deep convolutional neural network as its backbone, and the second-stage calculation model is a neural network of fully connected modules with a residual connection architecture, the method comprising the following steps: SS1) Inputting a first training group: the first training group contains plural training samples, each training sample has a human body bounding box, and each human body bounding box is a rectangular area that comprises the image coordinates of a center point located at a reference point of the human body, the image coordinates of four vertices, and the image content belonging to the human body bounding box; SS2) Data normalization: a normalization method is used so that the pixel color values of the image content within each human body bounding box are normally distributed; SS3) Training the parameters for estimating the image coordinates of the plural joint points of the two-dimensional human skeleton and the relative depth of each joint point with respect to a root joint point: the image within each human body bounding box normalized in step SS2) is input into the first-stage calculation model, the output obtained is plural three-dimensional heat maps, and from these three-dimensional heat maps are obtained the image coordinates of the plural joint points on the two-dimensional human skeleton and the estimated relative depths of the plural joint points with respect to the root joint point, wherein the root joint point corresponds to one joint point on the human skeleton, and a first loss function is used when training the parameters of the first-stage calculation model; SS4) Sequentially inputting the plural training samples of the first training group: the first loss function is optimized while the plural training samples of the first training group are sequentially input, so as to obtain the optimized parameters of the first-stage calculation model; SS5) Training the parameters of the three-dimensional human skeleton estimator: the first-stage calculation model with the optimized parameters obtained in step SS4) is used to predict on the plural training samples of a second training group, and the resulting image coordinates of the plural joint points and estimated relative depths with respect to the root joint point are input into the second-stage calculation model, the output obtained being an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points, wherein the three-dimensional coordinates of the output joint points are expressed relative to the root joint point, and a second loss function is used when training the parameters of the second-stage calculation model; and SS6) Obtaining optimized parameters: step SS5) is executed repeatedly to input the image coordinates of the plural joint points corresponding to the plural training samples of the second training group and the estimated relative depths with respect to the root joint point, together with the ground-truth three-dimensional human skeleton and joint point coordinates at the network output, and the second loss function is then optimized to obtain the optimized parameters of the second-stage calculation model.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein the root joint point is the pelvis point of the human skeleton model.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS3), the estimated relative depth of each of the plural joint points is defined as Z - Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of the respective joint point; and in step SS5), the three-dimensional coordinates of the joint points of each three-dimensional human skeleton are expressed in linearly normalized form, with values between -1.0 and 1.0.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS3), the first loss function is the mean absolute error (MAE) of the joint points, the error being computed over the two-dimensional image coordinates and the relative depth values of the joint points.
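A mean absolute error over joint points can be sketched as follows. The claim states only that the error relates to the two-dimensional image coordinates and relative depth values; averaging all three components with equal weight is an assumption, and `joint_mae` is a hypothetical name.

```python
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]  # (u, v, relative depth)


def joint_mae(pred: Sequence[Joint], true: Sequence[Joint]) -> float:
    """Mean absolute error over every joint point and every component."""
    count = len(pred) * 3
    total = sum(abs(p[i] - t[i]) for p, t in zip(pred, true) for i in range(3))
    return total / count


pred = [(10.0, 20.0, 0.0), (30.0, 40.0, 0.5)]
true = [(12.0, 20.0, 0.0), (30.0, 37.0, 0.25)]
mae = joint_mae(pred, true)  # (2 + 3 + 0.25) / 6
```

Because image coordinates and relative depths generally live on different scales, a practical loss would likely normalize or weight the components separately; that choice is left open here.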

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS5), the second loss function is composed of the three-dimensional coordinate errors of the plural joint points and a set of bone vector errors, where each bone vector represents a line segment vector between two actual or virtual joint points.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS2), the normalization method uses the mean and standard deviation of R, G, and B to normalize the pixel color values to the range [0, 1].

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in steps SS1) and SS5), the ground truth corresponding to each training sample is also input, so that the values of the first loss function and the second loss function can be calculated in steps SS4) and SS6).

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein the second training group may be the same as or different from the first training group.

A three-dimensional human skeleton estimator, comprising: a first-stage calculation model, which is constructed with a deep convolutional neural network as its backbone, whose input is a normalized human body bounding box image, the human body bounding box being a rectangular area that comprises the image coordinates of a center point located at a reference point of the human body, the image coordinates of four vertices, and the image content belonging to the human body bounding box, and whose output is a three-dimensional heat map, from which are obtained the image coordinates of the plural joint points on the two-dimensional human skeleton and the estimated relative depths of the plural joint points with respect to a root joint point, wherein the root joint point corresponds to one joint point on the human skeleton; and a second-stage calculation model, which is a neural network of fully connected modules with a residual connection architecture, whose input is the image coordinates of the plural joint points output by the first-stage calculation model and the estimated relative depths of the plural joint points with respect to the root joint point, and whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its plural joint points, the three-dimensional coordinates of the output joint points being expressed relative to the root joint point.
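The idea of a fully connected module with a residual connection can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions rather than the claimed architecture: the claim does not give layer widths, layer counts, or activation functions, so the two-layer block, the ReLU activation, and the 51-value input (17 joint points times three values each) are illustrative choices, and all names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def residual_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two fully connected layers whose result is added back to the input."""
    hidden = relu(linear(x, w1, b1))
    hidden = relu(linear(hidden, w2, b2))
    return [xi + hi for xi, hi in zip(x, hidden)]


num_joints = 17              # J = 17 in the description
dim = num_joints * 3         # (u, v, relative depth) per joint point
features = [0.01 * i for i in range(dim)]

# With all-zero weights and biases the block reduces to the identity mapping.
zeros_w = [[0.0] * dim for _ in range(dim)]
zeros_b = [0.0] * dim
out = residual_block(features, zeros_w, zeros_b, zeros_w, zeros_b)
```

The identity behaviour with zero weights shows why residual connections are convenient in a refinement stage: the block only has to learn a correction to its input rather than reproduce it from scratch.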

The three-dimensional human skeleton estimator according to claim 13, wherein the root joint point is the pelvis point of the human skeleton model.

The three-dimensional human skeleton estimator according to claim 13, wherein: the estimated relative depth of each of the plural joint points is defined as Z - Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of the respective joint point; and the three-dimensional coordinates of the joint points of each three-dimensional human skeleton model are expressed in linearly normalized form, with values between -1.0 and 1.0.

The three-dimensional human skeleton estimator according to claim 13, wherein: the normalization method uses the mean and standard deviation of R, G, and B to normalize the pixel color values to the range [0, 1].