TWI753382B

TWI753382B - Method for estimating three-dimensional human skeleton for a human body in an image, three-dimensional human skeleton estimator, and training method for the estimator

Info

Publication number: TWI753382B
Application number: TW109108622A
Authority: TW
Inventors: 賴文能; 施龍聖
Original assignee: 國立中正大學
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2022-01-21
Also published as: TW202137144A

Abstract

A method for estimating a three-dimensional human skeleton for a human body in an image, and for training the deep network model of the estimator used. The estimator mainly uses a first-stage calculation model and a second-stage calculation model. The first-stage calculation model is built with a deep convolutional neural network as its backbone; the second-stage calculation model is a neural network of fully connected modules with a residual-connection architecture. To train the estimator, a plurality of training sample images in a first training group are input sequentially to the first-stage calculation model to obtain its optimized parameters. The output of the first-stage calculation model under those optimized parameters is then used as input to the second-stage calculation model, which is trained with a second training group to obtain the optimized parameters of the second-stage calculation model; the second training group may be the same as or different from the first training group. To estimate the three-dimensional human skeleton for an image, the image is input to the first-stage and second-stage calculation models connected in series, which yields the human skeleton estimation result.

Description

Method for estimating three-dimensional human skeleton for a human body in an image, three-dimensional human skeleton estimator, and training method for the estimator

The present invention relates to image processing technology, and in particular to a method for estimating a three-dimensional human skeleton for a human body in an image, a three-dimensional human skeleton estimator, and a training method for the estimator.

In the existing art, the joint positions of a human body in three-dimensional space are mainly estimated with expensive marker-based motion capture systems or with depth cameras. Such techniques, however, have high equipment costs, are restricted to particular venues, and are not easily extended to detecting the skeletons of multiple people.

To address this high cost, the paper A Simple yet Effective Baseline for 3D Human Pose Estimation (ICCV, 2017) proposed a two-stage method for estimating the 3D human skeleton: in the first stage a two-dimensional skeleton estimator outputs the two-dimensional skeleton in a two-dimensional image, and in the second stage the two-dimensional skeleton is lifted to a three-dimensional skeleton. This method has been shown to perform well, and most current research builds on it. For example, one known work extends the method to deep learning with neural networks. Because learning the mapping from two dimensions to three dimensions lacks constraining conditions, the authors of that work define a grammar of the skeleton as constraints, such as kinematic relations, symmetry relations, and motion-coordination relations, and learn the interactions between joints through a bidirectional recurrent neural network connected at the end.

The paper Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation (AAAI, 2018), in turn, describes a first stage that estimates the two-dimensional skeleton and the depth ordering among the joints, followed by a second stage that regresses the three-dimensional human skeleton by neural-network deep learning.

In the prior art described above, the two-stage approach to regressing the three-dimensional human skeleton still leaves room for improvement in generating the skeleton accurately. The present invention takes as input the two-dimensional skeleton and the relative joint depths estimated by a deep learning network, and generates true three-dimensional joint coordinates.

Based on the above, the present invention proposes a method for estimating a three-dimensional human skeleton for a human body in an image, comprising the following steps:

S1) Estimating human body bounding boxes: the image is input to a bounding box calculation model trained as a neural network, and the bounding box calculation model outputs one or more human body bounding boxes. Each human body bounding box is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content inside the bounding box.

S2) Data normalization: a normalization method is applied so that the pixel color values of the image content inside each human body bounding box follow a normal distribution.

S3) Estimating the image coordinates of a plurality of joints of a two-dimensional human skeleton and the relative depth of each joint with respect to a root joint: a first-stage calculation model 21, trained with a deep convolutional neural network as its backbone, is used. The image inside each human body bounding box, normalized in step S2), is input to the first-stage calculation model 21, whose output is a three-dimensional heat map. From the three-dimensional heat map are derived the image coordinates of a plurality of joints on a two-dimensional human skeleton, the image coordinates of a root joint, and the estimated relative depths of the joints with respect to the root joint, where the root joint corresponds to one joint on the human skeleton.

S4) Three-dimensional human skeleton estimation: a second-stage calculation model, trained with fully connected modules as its backbone, is used. The joint image coordinates and the estimated relative depths of the joints with respect to the root joint from step S3) are input to the second-stage calculation model, whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joints. The fully connected module is a neural network with a residual-connection architecture, and the output three-dimensional joint coordinates are relative to the root joint.

The present invention thus uses two-stage training, with the two-dimensional skeleton and the relative joint depths as the intermediate result (that is, the first-stage output and the second-stage input), and can thereby generate accurate three-dimensional joint coordinates. In estimating the three-dimensional skeleton (and its joint coordinates), the present invention is more accurate than currently known techniques.

In addition, the present invention proposes a method for training the model of a three-dimensional human skeleton estimator. The three-dimensional human skeleton estimator comprises a first-stage calculation model and a second-stage calculation model. The first-stage calculation model is built with a deep convolutional neural network as its backbone; the second-stage calculation model is a neural network of fully connected modules with a residual-connection architecture. The method comprises the following steps:

SS1) Inputting a first training group: the first training group comprises a plurality of training samples, each of which has a human body bounding box. Each human body bounding box is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content inside the bounding box.

SS2) Data normalization: a normalization method is applied so that the pixel color values of the image content inside each human body bounding box follow a normal distribution.

SS3) Training the parameters of the estimator of the two-dimensional skeleton joint image coordinates and of the relative depth of each joint with respect to a root joint: the image inside each human body bounding box, normalized in step SS2), is input to the first-stage calculation model, whose output is a three-dimensional heat map. From the three-dimensional heat map are derived the image coordinates of a plurality of joints on the two-dimensional human skeleton and the estimated relative depths of the joints with respect to the root joint, where the root joint corresponds to one joint on the human skeleton. A first loss function is used when training the parameters of the first-stage calculation model.

SS4) Sequentially inputting the training samples of the first training group: while the training samples of the first training group are input sequentially, the first loss function is optimized to obtain the optimized parameters of the first-stage calculation model.

SS5) Training the parameters of the three-dimensional human skeleton estimator: using the optimized parameters of the first-stage calculation model obtained in step SS4), the joint image coordinates and the estimated relative depths with respect to the root joint are predicted for the training samples of a second training group and input to the second-stage calculation model, whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joints. The output three-dimensional joint coordinates are relative to the root joint. A second loss function is used when training the parameters of the second-stage calculation model.

SS6) Obtaining the optimized parameters: step SS5) is executed in a loop to input the joint image coordinates and the estimated relative depths with respect to the root joint that correspond to the training samples of the second training group, together with the ground-truth three-dimensional human skeleton and joint coordinates at the network output; the second loss function is then optimized to obtain the optimized parameters of the second-stage calculation model.

The present invention can thus train the model of the three-dimensional human skeleton estimator and produce an effective model that can be operated to estimate the three-dimensional human skeleton and joint coordinates in an input image.

The present invention also discloses a three-dimensional human skeleton estimator, comprising: a first-stage calculation model, built with a deep convolutional neural network as its backbone, whose input is a normalized human body bounding box image, the human body bounding box being a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content inside the bounding box, and whose output is a three-dimensional heat map from which are derived the image coordinates of a plurality of joints on a two-dimensional human skeleton and the estimated relative depths of the joints with respect to a root joint, where the root joint corresponds to one joint on the human skeleton; and a second-stage calculation model, which is a neural network of fully connected modules with a residual-connection architecture, whose input is the joint image coordinates and the estimated relative depths of the joints with respect to the root joint output by the first-stage calculation model, and whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joints, the output three-dimensional joint coordinates being relative to the root joint.

To explain the technical features of the present invention in detail, the following preferred embodiments are described with reference to the drawings, wherein:

As shown in FIG. 1 to FIG. 5, a first preferred embodiment of the present invention proposes a method for estimating a three-dimensional human skeleton for a human body in an image, which mainly has the following steps:

S1) Estimating human body bounding boxes: an image is input to a bounding box calculation model trained as a neural network, and the model outputs one or more human body bounding boxes 11. The number of bounding boxes 11 depends on the number of human bodies in the image, and the output bounding boxes 11 can be marked on the image, as shown in FIG. 2. Each human body bounding box 11 is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content inside the bounding box 11. In this embodiment the neural network may be CenterNet, which uses a fully convolutional network (FCN) as its backbone, for example ResNet combined with transposed convolution layers, Hourglass, or Deep Layer Aggregation (DLA), but is not limited thereto. Any technique that can locate human bodies in an image can be used to find all of the human body bounding boxes 11.

S2) Data normalization: a normalization method is applied so that the pixel color values of the image content inside each human body bounding box 11 follow a normal distribution. In this embodiment the normalization method is Z-score normalization, as shown in formula (1):

$$p' = \frac{p - \mu}{\sigma} \tag{1}$$

where μ is the mean and σ is the standard deviation, each taken per RGB channel of the image, p is the original pixel value (the three RGB channel values), and p' is the normalized pixel value. The mean and standard deviation of R, G, and B are thus used to normalize the pixel color values of each channel to the range [0,1].
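The Z-score step of S2) can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the patented implementation: the function name `zscore_normalize` is hypothetical, the statistics are assumed to be computed from the pixels passed in (the text does not say whether μ and σ come from the crop or from a whole data set), and a zero standard deviation is replaced by 1.0 purely to avoid division by zero.

```python
from statistics import mean, pstdev


def zscore_normalize(pixels):
    """Apply p' = (p - mu) / sigma independently to each RGB channel.

    pixels: list of (r, g, b) tuples taken from one bounding-box crop.
    Assumption: mu and sigma are computed from these pixels themselves.
    """
    channels = list(zip(*pixels))                   # three sequences: R, G, B
    mus = [mean(c) for c in channels]
    sigmas = [pstdev(c) or 1.0 for c in channels]   # guard against sigma == 0
    return [
        tuple((value - mu) / sigma for value, mu, sigma in zip(p, mus, sigmas))
        for p in pixels
    ]
```

After this step each channel has zero mean and, unless it was constant, unit standard deviation.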

S3) Estimating the image coordinates of a plurality of joints of a two-dimensional human skeleton and the relative depth of each joint with respect to a root joint: a first-stage calculation model 21, trained with a deep convolutional neural network as its backbone, is used. The image inside each human body bounding box 11, normalized in step S2), is input to the first-stage calculation model 21, whose output is several three-dimensional heat maps. From these heat maps are derived the image coordinates of a plurality of joints on a two-dimensional human skeleton (each quantized to the range [0,64]) and the estimated relative depths of the joints with respect to the root joint (quantized to the range [0,64]); the estimation architecture is shown in FIG. 3. The relative depth of a joint is defined as Z-Zroot, where Zroot is the depth of the root joint (obtained by another method or defined as the zero-depth point) and Z is the depth of the joint. Because the output of the first-stage calculation model 21 is quantized and normalized, the relationship between its normalized relative joint depth depth' and the actual relative depth Z-Zroot is given by formula (2):

Formula (2)

where scale is a given scaling factor.

With formula (2), the relative depth Z-Zroot of each joint can be computed from the depth' predicted by the first-stage calculation model 21. The root joint corresponds to one joint on the human skeleton; in this embodiment the coordinates of the pelvis are used as the root joint coordinates, but the invention is not limited thereto.
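One way to read quantized joint coordinates and a quantized depth value out of a three-dimensional heat map is to take the position of its maximum response. The sketch below assumes that decoding rule (the text does not state whether an argmax, a soft-argmax, or another rule is used), assumes one heat map per joint indexed as `heatmap[d][y][x]`, and uses the hypothetical name `decode_heatmap`. Converting the returned depth bin to Z-Zroot would additionally require formula (2), which is not reproduced here.

```python
def decode_heatmap(heatmap):
    """Return (x, y, depth_bin) of the strongest response in one joint's heat map.

    heatmap: nested lists indexed as heatmap[d][y][x] (assumed layout).
    Assumption: the joint location is taken as the arg-max of the heat map.
    """
    best_value = float("-inf")
    best_index = (0, 0, 0)
    for d, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value = value
                    best_index = (x, y, d)
    return best_index
```

Calling the function once per joint yields the (x, y, depth') triplets that the second stage takes as input.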

S4) Three-dimensional human skeleton estimation: a second-stage calculation model 31, trained with fully connected modules as its backbone, is used. The joint image coordinates and the estimated relative joint depths output by the first-stage calculation model in step S3) are input to the second-stage calculation model 31, whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joints. The fully connected module is a neural network with a residual-connection architecture, shown in FIG. 4, and the output three-dimensional joint coordinates are relative to the root joint. The three-dimensional joint coordinates of the estimated skeleton model are expressed by linear normalization, with values between -1.0 and 1.0:

Formula (3)

where the normalized quantity in formula (3) is the normalized three-dimensional joint coordinate, and the root term is the given three-dimensional coordinate of the root joint. Formula (3) can be inverted to recover the three-dimensional coordinates (X, Y, Z) of each joint.
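To make the phrase "fully connected module with a residual connection" concrete, here is a small pure-Python sketch of one such block. It is an illustration under assumptions, not the architecture of FIG. 4: the two-layer depth, the ReLU activation, and the names `linear`, `relu`, and `residual_block` are all choices made for this example, and the weights would in practice come from training.

```python
def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Two fully connected layers whose result is added back to the input.

    Assumption: both layers keep the input width so the skip connection
    can be a plain element-wise sum.
    """
    hidden = relu(linear(x, w1, b1))
    hidden = relu(linear(hidden, w2, b2))
    return [xi + hi for xi, hi in zip(x, hidden)]
```

With the 17 joints mentioned later in the text, the flattened (x, y, depth') input to such a network would hold 51 values.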

As steps S1) to S4) of this first embodiment show, the present invention takes the estimated two-dimensional skeleton and relative joint depths as input and generates true three-dimensional joint coordinates; its joint position estimates are more accurate than those of currently known techniques.

As shown in FIG. 5 to FIG. 7, a second preferred embodiment of the present invention proposes a method for training the model of a three-dimensional human skeleton estimator, mainly to explain how the estimator used in the estimation method of the first embodiment is trained. The three-dimensional human skeleton estimator comprises a first-stage calculation model 21 and a second-stage calculation model 31: the first-stage calculation model 21 is the one used in step S3) of the first embodiment, and the second-stage calculation model 31 is the one used in step S4) of the first embodiment. The training method of this second embodiment has the following steps:

SS1) Inputting a first training group: the first training group 19 comprises a plurality of training samples 191. Each training sample 191 is an image with one human body bounding box 11 (shown in FIG. 2); each human body bounding box 11 is a rectangular region comprising a center point located at the image coordinates of a human body reference point, the image coordinates of four corner points, and the image content inside the bounding box 11. The first training group 19 is shown in FIG. 6. When the first training group 19 is input, the ground-truth output corresponding to each training sample 191 is input as well.

SS2) Data normalization: a normalization method is applied so that the pixel color values of the image content inside each human body bounding box 11 follow a normal distribution. Taking formula (1) above as an example, the mean and standard deviation of R, G, and B are used to normalize the pixel color values to the range [0,1].

SS3) Training the parameters of the estimator of the two-dimensional skeleton joint image coordinates and of the relative depth of each joint with respect to a root joint: the image inside each human body bounding box 11, normalized in step SS2), is input to the first-stage calculation model 21, whose output is several three-dimensional heat maps. From these heat maps are derived the image coordinates of a plurality of joints on the two-dimensional human skeleton and the estimated relative depths of the joints with respect to a root joint; the network architecture for this estimation is shown in FIG. 6. The root joint corresponds to one joint on the human skeleton, for example the pelvis of the human skeleton model. A first loss function is used when training the network parameters of the first-stage calculation model 21. During network training, the ground truth of each joint's relative depth is computed as Z-Zroot, where Zroot is the true depth of the root joint and Z is the true depth of the joint. When the first-stage calculation model 21 is trained, the ground-truth relative depths must also be quantized and normalized so that the normalized relative joint depth depth' lies in the range [0,64]; the relationship between depth' and the actual relative depth Z-Zroot is given by formula (2) above.

The first loss function is the mean absolute error (MAE) over the joints, taken over each joint's two-dimensional image coordinates and relative depth. Denoting the first loss function by L_first, it is given by formula (4):

Formula (4)

where $\hat{s}^j = (\hat{x}^j, \hat{y}^j, \widehat{depth}^j)$ is the prediction output by the first-stage calculation model 21 for the j-th joint, and $s^j = (x^j, y^j, depth^j)$ is the corresponding ground truth.
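A mean absolute error over joint triplets can be sketched as follows. Since formula (4) itself is not reproduced in this text, the exact averaging is an assumption: the hypothetical function `mae_loss` below averages the absolute differences over all joints and all three components (x, y, depth).

```python
def mae_loss(pred, truth):
    """Mean absolute error between predicted and ground-truth joint triplets.

    pred, truth: equal-length lists of (x, y, depth) tuples, one per joint.
    Assumption: the mean is taken over every component of every joint.
    """
    total = sum(
        abs(p - t)
        for pred_joint, true_joint in zip(pred, truth)
        for p, t in zip(pred_joint, true_joint)
    )
    return total / (len(pred) * len(pred[0]))
```

A training loop would minimize this value over the samples of the first training group.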

SS4) Sequentially inputting the training samples of the first training group: while the training samples 191 of the first training group 19 are input sequentially, the first loss function is optimized to obtain the optimized parameters of the first-stage calculation model 21.

SS5) Training the parameters of the three-dimensional human skeleton estimator: using the optimized parameters of the first-stage calculation model 21 obtained in step SS4), the joint image coordinates and the estimated relative depths with respect to the root joint are predicted for each training sample 191 in turn and input to the second-stage calculation model 31, whose output is the estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joints; the output three-dimensional joint coordinates are relative to the root joint. The two-dimensional skeleton joint image coordinates and relative depths used as input for training the second-stage calculation model 31 are not limited to those obtained from the training samples 191 of the first training group 19, as shown in FIG. 6; they may also be obtained from the training samples 291 of a separate second training group 29, as shown in FIG. 7, where the second training group 29 may be identical to, contained in, or different from the first training group 19. A second loss function is used when training the parameters of the second-stage calculation model 31. The second loss function L_second consists of the three-dimensional coordinate error of the joints and the error of a set of bone vectors. Each bone vector represents the vector between two joints connected by an actual or virtual bone, and is the bone feature obtained by subtracting a predefined bone start point from the bone end point (the arrow points toward the end point). The second loss function is computed as shown in formula (5):

Formula (5)

where the terms of formula (5) are the predicted three-dimensional coordinates of the j-th joint, the corresponding ground-truth three-dimensional coordinates, the predicted three-dimensional bone vectors, and the ground-truth three-dimensional bone vectors. dest(k) and start(k) are two functions that give the joint indices of the end point and the start point of the k-th bone. The function has two hyperparameters, and K=22 and J=17 are the predefined numbers of bones and joints, respectively. The function shown in formula (6) computes the degree of dissimilarity between the two whole sets of bone vectors.

Formula (6)

The second loss function above is based on the joint positions S^j and the bone vectors B^k. The bone vector error L_bone lets the neural network learn the spatial relationships of the skeleton structure, which strengthens the physical constraints between joints. The joint error L_joint lets the neural network learn precise coordinate positions. Each bone vector is the bone feature obtained by subtracting a predefined bone start point from the bone end point (the arrow points toward the end point); the definitions of the bone end points and start points are shown in FIG. 8.
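The bone-vector construction and a loss that combines a joint term with a bone term can be sketched as below. Only the definition of a bone vector (end point minus start point) is taken from the text. Everything else is an assumption made for illustration: the absolute-difference error, the weighted-sum combination, the hypothetical weights `w_joint` and `w_bone` (placeholders, not the hyperparameter values of the patent), and the function names. Formulas (5) and (6) may differ from this.

```python
def bone_vectors(joints, bones):
    """Bone k = joints[dest(k)] - joints[start(k)].

    joints: list of (X, Y, Z) tuples; bones: list of (start, dest) index pairs.
    """
    return [
        tuple(joints[dest][i] - joints[start][i] for i in range(3))
        for start, dest in bones
    ]


def mean_l1(a, b):
    """Average per-item sum of absolute component differences (assumed metric)."""
    return sum(abs(x - y) for va, vb in zip(a, b) for x, y in zip(va, vb)) / len(a)


def second_loss(pred, truth, bones, w_joint=1.0, w_bone=1.0):
    """Weighted sum of a joint error and a bone-vector error (assumed form)."""
    l_joint = mean_l1(pred, truth)
    l_bone = mean_l1(bone_vectors(pred, bones), bone_vectors(truth, bones))
    return w_joint * l_joint + w_bone * l_bone
```

Because every bone couples two joints, an error in one joint also shows up in each bone attached to it, which is how a bone term of this kind penalizes implausible limb lengths and directions.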

SS6) Obtaining the optimized parameters: step SS5) is executed in a loop to input the joint image coordinates and the estimated relative depths with respect to the root joint that correspond to the training samples 291 of the second training group 29, together with the ground-truth three-dimensional human skeleton and joint coordinates at the network output; the second loss function is then optimized to obtain the optimized parameters of the second-stage calculation model 31.

As the above steps show, the second embodiment performs model training on the two-stage three-dimensional human skeleton estimator of the present invention and yields an effective model that is ready for operation. The trained estimator can then be used, by the method of the first embodiment described above, to estimate the three-dimensional human skeleton of a human body in an image.

11: human body bounding box; 19: first training group; 191: training sample; 21: first-stage calculation model; 29: second training group; 291: training sample; 31: second-stage calculation model

FIG. 1 is a flow chart of the first preferred embodiment of the present invention.
FIG. 2 is a schematic image of the first preferred embodiment of the present invention, showing the image with the human body bounding box marked on it.
FIG. 3 is a schematic diagram of the deep learning network architecture of the first preferred embodiment of the present invention, showing the architecture of the first-stage calculation model.
FIG. 4 is a schematic diagram of another deep learning network architecture of the first preferred embodiment of the present invention, showing the architecture of the second-stage calculation model.
FIG. 5 is a flow chart of the second preferred embodiment of the present invention.
FIG. 6 is a schematic diagram of the deep network architecture and learning of the second preferred embodiment of the present invention, showing the training of the first-stage calculation model using the first training group.
FIG. 7 is a schematic diagram of a further deep network architecture and learning of the second preferred embodiment of the present invention, showing the training of the second-stage calculation model using the two-dimensional human joint point coordinates, and the depth of each joint point relative to the root joint point, formed from a second training group.
FIG. 8 is a schematic diagram of the joints and bones of the human skeleton in the second preferred embodiment of the present invention.

Claims

A method for estimating a three-dimensional human skeleton of a human body in an image, comprising the following steps:
S1) Human body bounding box estimation: the image is input into a bounding box calculation model trained as a neural network, and the bounding box calculation model outputs one or more human body bounding boxes, each human body bounding box being a rectangular area that includes the image coordinates of a center point located at a human body reference point, the image coordinates of four corner points, and the image content belonging to the human body bounding box;
S2) Data normalization: a normalization method is used so that the pixel color values of the image content in each human body bounding box are normally distributed;
S3) Estimation of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the relative depth of each joint point with respect to one joint point (the root joint point): a first-stage calculation model is trained with a deep convolutional neural network as its backbone; the normalized image in each human body bounding box from step S2) is input into the first-stage calculation model, the output is a three-dimensional heat map, and from the three-dimensional heat map are obtained the image coordinates of the plurality of joint points on the two-dimensional human skeleton and the relative depth estimates of the plurality of joint points with respect to the root joint point, wherein the root joint point corresponds to one joint point on the human skeleton; and
S4) Three-dimensional human skeleton estimation: a second-stage calculation model is trained with a fully connected module as its backbone; the image coordinates of the plurality of joint points from step S3) and the relative depth estimates of the plurality of joint points with respect to the root joint point are input into the second-stage calculation model, and the output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its plurality of joint points, wherein the fully connected module is a neural network with a residual connection structure, and the output three-dimensional joint point coordinates are relative to the root joint point.
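Step S3) reads joint coordinates and a depth value out of a three-dimensional heat map, but the claim does not say how. As a hedged illustration only, the Python sketch below takes the position of the maximum response in a per-joint heat map indexed as `heatmap[d][y][x]`; both the argmax read-out and that axis order are assumptions, and `decode_heatmap` is a hypothetical name.

```python
from typing import List, Tuple


def decode_heatmap(heatmap: List[List[List[float]]]) -> Tuple[int, int, int]:
    """Return (x, y, d) of the strongest response in a heat map indexed [d][y][x]."""
    best = (0, 0, 0)
    best_val = float("-inf")
    for d, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_val:
                    best_val = value
                    best = (x, y, d)
    return best
```

Under these assumptions, (x, y) would serve as the joint's image coordinates and the depth bin d would be converted to a depth relative to the root joint point by some scale that the claim leaves open.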

The method for estimating a three-dimensional human skeleton of a human body in an image according to claim 1, wherein the root joint point is the pelvis point of the human skeleton model.

The method for estimating a three-dimensional human skeleton of a human body in an image according to claim 1, wherein: in step S3), the relative depth estimate of each of the plurality of joint points is defined as Z-Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of that joint point; and in step S4), the three-dimensional coordinates of the joint points of each three-dimensional human skeleton model are expressed in linearly normalized form, with values between -1.0 and 1.0.
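The two definitions in this claim can be sketched in a few lines of Python. The definition Z-Zroot comes from the claim; the scale `max_abs` used for the linear normalization, the clipping, and the function names are assumptions, since the claim does not state how the -1.0 to 1.0 range is reached.

```python
from typing import List, Sequence


def relative_depths(depths: Sequence[float], root_index: int = 0) -> List[float]:
    """Return Z - Zroot for every joint; the root joint itself maps to 0."""
    z_root = depths[root_index]
    return [z - z_root for z in depths]


def linear_normalize(values: Sequence[float], max_abs: float) -> List[float]:
    """Scale values linearly by an assumed bound max_abs and clip to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, v / max_abs)) for v in values]
```

For example, depths of 2.0 (root), 2.5 and 1.5 give relative depths of 0.0, 0.5 and -0.5.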

The method for estimating a three-dimensional human skeleton of a human body in an image according to claim 1, wherein: in step S2), the normalization method normalizes the pixel color values to the [0,1] range using the mean and standard deviation of R, G, and B.
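The claim names both a [0,1] range and the per-channel mean and standard deviation without giving the order of operations. One common reading, shown here purely as an assumption, first scales 8-bit color values to [0,1] and then standardizes each channel; the function name and the statistics used in the test are hypothetical.

```python
from typing import Sequence, Tuple


def normalize_pixel(
    rgb: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
) -> Tuple[float, ...]:
    """Scale 8-bit R, G, B values to [0, 1], then subtract the mean and divide by the std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))
```

Note that after the standardization step the values are centered on zero and are no longer confined to [0,1]; whether the patent intends this or a different procedure cannot be determined from the claim alone.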

A method for model training of a three-dimensional human skeleton estimator, the three-dimensional human skeleton estimator comprising a first-stage calculation model and a second-stage calculation model, the first-stage calculation model being constructed with a deep convolutional neural network as its backbone, and the second-stage calculation model being a neural network of fully connected modules with a residual connection structure, the method comprising the following steps:
SS1) Inputting a first training group: the first training group includes a plurality of training samples, each of which has a human body bounding box, each human body bounding box being a rectangular area that includes the image coordinates of a center point located at a human body reference point, the image coordinates of four vertices, and the image content belonging to the human body bounding box;
SS2) Data normalization: a normalization method is used so that the pixel color values of the image content in each human body bounding box are normally distributed;
SS3) Parameter training of the estimator of the image coordinates of a plurality of joint points of the two-dimensional human skeleton and of the relative depth of each joint point with respect to one joint point (the root joint point): the normalized images in the human body bounding boxes are input into the first-stage calculation model, and the output is a number of three-dimensional heat maps, from which are obtained the image coordinates of the plurality of joint points on the two-dimensional human skeleton and the relative depth estimates of the plurality of joint points with respect to the root joint point, wherein the root joint point corresponds to one joint point on the human skeleton, and a first loss function is used when training the parameters of the first-stage calculation model;
SS4) Sequentially inputting the plurality of training samples of the first training group: while the plurality of training samples of the first training group are input in sequence, the first loss function is optimized to obtain the optimized parameters of the first-stage calculation model;
SS5) Parameter training of the three-dimensional human skeleton estimator: the first-stage calculation model with the optimized parameters obtained in step SS4) is used to predict, for a plurality of training samples of a second training group, the image coordinates of the plurality of joint points and their relative depth estimates with respect to the root joint point, which are input into the second-stage calculation model; the output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its joint points, wherein the output three-dimensional joint point coordinates are relative to the root joint point, and a second loss function is used when training the parameters of the second-stage calculation model; and
SS6) Obtaining the optimized parameters: step SS5) is executed repeatedly to input the image coordinates of the plurality of joint points and the relative depth estimates with respect to the root joint point corresponding to the plurality of training samples of the second training group, together with the ground-truth three-dimensional human skeleton and joint point coordinates at the network output, after which the second loss function is optimized to obtain the optimized parameters of the second-stage calculation model.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein the root joint point is the pelvis point of the human skeleton model.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS3), the relative depth estimate of each of the plurality of joint points is defined as Z-Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of that joint point; and in step SS5), the three-dimensional coordinates of each joint point of the three-dimensional human skeleton are expressed in linearly normalized form, with values between -1.0 and 1.0.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS3), the first loss function is the mean absolute error (MAE) of the joint points, the error relating to the two-dimensional image coordinates and the relative depth values of the joint points.
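A mean absolute error over joint image coordinates and relative depths can be sketched as follows. Treating each joint as a triple (u, v, relative depth) and averaging over every component of every joint are assumptions; the claim does not specify the exact averaging or any per-component weighting.

```python
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]  # (u, v, relative depth), an assumed layout


def joint_mae(pred: Sequence[Joint], true: Sequence[Joint]) -> float:
    """Mean absolute error over all components of all joints."""
    count = len(pred) * len(pred[0])
    total = sum(abs(p - t) for pj, tj in zip(pred, true) for p, t in zip(pj, tj))
    return total / count
```

With two joints whose absolute component errors sum to 3.5, the function returns 3.5 / 6.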

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS5), the second loss function is determined by the three-dimensional coordinate error of the plurality of joint points and the error of a set of bone vectors, where each bone vector represents an actual or virtual line-segment vector between two joint points.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in step SS2), the normalization method normalizes the pixel color values to the [0,1] range using the mean and standard deviation of R, G, and B.

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein: in steps SS1) and SS5), the ground-truth output corresponding to each training sample is also input, so that the values of the first loss function and the second loss function can be calculated in steps SS4) and SS6).

The method for model training of a three-dimensional human skeleton estimator according to claim 5, wherein the second training group may be the same as or different from the first training group.

A three-dimensional human skeleton estimator, comprising:
a first-stage calculation model, constructed with a deep convolutional neural network as its backbone, whose input is a normalized human body bounding box image, the human body bounding box being a rectangular area that includes the image coordinates of a center point located at a human body reference point, the image coordinates of four vertices, and the image content belonging to the human body bounding box, and whose output is a three-dimensional heat map, from which are obtained the image coordinates of a plurality of joint points on a two-dimensional human skeleton and the relative depth estimates of the plurality of joint points with respect to one joint point (the root joint point), wherein the root joint point corresponds to one joint point on the human skeleton; and
a second-stage calculation model, which is a neural network of fully connected modules with a residual connection structure, whose input is the image coordinates of the plurality of joint points output by the first-stage calculation model and the relative depth estimates of the plurality of joint points with respect to the root joint point, and whose output is an estimated three-dimensional human skeleton model and the three-dimensional coordinates of its plurality of joint points, the output three-dimensional joint point coordinates being relative to the root joint point.
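The phrase "fully connected modules with a residual connection structure" can be pictured with a small Python sketch of one such block. The number of layers per block, the ReLU activation, and the absence of batch normalization or dropout are assumptions; the claim states only that the network is fully connected and uses residual connections.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    """Element-wise rectified linear activation (an assumed choice)."""
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2) -> List[float]:
    """Two fully connected layers whose result is added back to the block input."""
    h = relu(linear(x, w1, b1))
    h = relu(linear(h, w2, b2))
    return [xi + hi for xi, hi in zip(x, h)]
```

Because the input is added back to the output, a block with all-zero weights passes its input through unchanged, which is the property that makes stacks of such blocks easier to train.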

The three-dimensional human skeleton estimator according to claim 13, wherein the root joint point is the pelvis point of the human skeleton model.

The three-dimensional human skeleton estimator according to claim 13, wherein: the relative depth estimate of each of the plurality of joint points is defined as Z-Zroot, where Zroot is the depth value of the root joint point and Z is the depth value of that joint point; and the three-dimensional coordinates of the joint points of each three-dimensional human skeleton model are expressed in linearly normalized form, with values between -1.0 and 1.0.

The three-dimensional human skeleton estimator according to claim 13, wherein: the normalization method normalizes the pixel color values to the [0,1] range using the mean and standard deviation of R, G, and B.