
CN112818898B - Model training method and device and electronic equipment - Google Patents


Info

Publication number
CN112818898B
CN112818898B
Authority
CN
China
Prior art keywords
sample
human body
training
inertial motion
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110195069.4A
Other languages
Chinese (zh)
Other versions
CN112818898A (en)
Inventor
罗宇轩
唐堂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202110195069.4A
Publication of CN112818898A
Application granted
Publication of CN112818898B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure disclose a model training method and apparatus and an electronic device. One specific implementation of the method includes: obtaining a set of training samples; selecting a training sample from the set and, based on the selected sample, performing the following training steps: feeding the sample human-body image of the selected training sample into an initial neural network to obtain three-dimensional human pose information; determining a transformation matrix between the three-dimensional human pose information and the sample inertial motion-capture data; using the transformation matrix to bring the pose keypoints and the inertial motion-capture 3D points into the same coordinate system and determining the difference between them; adjusting the network parameters of the initial neural network based on the determined difference; and, if a training-end condition is met, taking the adjusted initial neural network as the trained three-dimensional human pose prediction network. This implementation saves the cost of calibrating between different coordinate systems.

Description

Model Training Method, Apparatus and Electronic Device

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular to a model training method and apparatus and an electronic device.

Background

Human pose estimation is an important task in computer vision and an essential step toward machine understanding of human actions and behaviors. In recent years, deep-learning-based methods for human pose estimation have been proposed in quick succession and have far surpassed traditional approaches. In practice, pose estimation is usually cast as a keypoint prediction problem: the position coordinates of each human-body keypoint are predicted first, and the spatial relationships between keypoints are then determined from prior knowledge, yielding the predicted human skeleton.
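The two-stage formulation above (predict keypoints, then assemble a skeleton from prior knowledge) can be sketched as follows. The joint names and bone topology here are illustrative assumptions, not taken from the patent:

```python
# Toy illustration: assemble a skeleton from predicted 3D keypoints using a
# fixed (prior-knowledge) bone topology. Joint names are hypothetical.
keypoints = {
    "pelvis": (0.0, 0.0, 0.0),
    "spine":  (0.0, 0.3, 0.0),
    "head":   (0.0, 0.6, 0.0),
    "l_knee": (-0.1, -0.4, 0.0),
    "r_knee": (0.1, -0.4, 0.0),
}
# Prior knowledge: which joint connects to which (child -> parent).
parent = {"spine": "pelvis", "head": "spine",
          "l_knee": "pelvis", "r_knee": "pelvis"}

def skeleton_bones(kps, parent):
    """Return the skeleton as a list of (child_xyz, parent_xyz) segments."""
    return [(kps[c], kps[p]) for c, p in parent.items()]

bones = skeleton_bones(keypoints, parent)
print(len(bones))  # one bone per child joint
```

A real system would obtain `keypoints` from a network's output rather than a literal dictionary, but the skeleton-assembly step is the same lookup over a fixed parent list.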

Summary

This Summary is provided to introduce, in simplified form, concepts that are described in detail in the Detailed Description below. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.

Embodiments of the present disclosure provide a model training method and apparatus and an electronic device that save the cost of calibrating between different coordinate systems and allow the three-dimensional human pose prediction network to achieve better accuracy.

In a first aspect, embodiments of the present disclosure provide a model training method. The method includes: obtaining a set of training samples, where each training sample includes a sample human-body image and corresponding sample inertial motion-capture data, the latter being the inertial motion-capture data of the human body shown in the image, collected while the image was taken; and selecting a training sample from the set and, based on the selected sample, performing the following training steps: feeding the sample human-body image into an initial neural network to obtain the three-dimensional human pose information corresponding to the selected sample; determining a transformation matrix between that pose information and the corresponding sample inertial motion-capture data; using the transformation matrix to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system and determining the difference between them; adjusting the network parameters of the initial neural network based on the determined difference; determining whether a preset training-end condition is met; and, if it is met, taking the adjusted initial neural network as the trained three-dimensional human pose prediction network.

In a second aspect, embodiments of the present disclosure provide a model training apparatus. The apparatus includes: a first obtaining unit configured to obtain a set of training samples, where each training sample includes a sample human-body image and corresponding sample inertial motion-capture data, the latter being the inertial motion-capture data of the human body shown in the image, collected while the image was taken; and a training unit configured to select a training sample from the set and, based on the selected sample, perform the following training steps: feeding the sample human-body image into an initial neural network to obtain the three-dimensional human pose information corresponding to the selected sample; determining a transformation matrix between that pose information and the corresponding sample inertial motion-capture data; using the transformation matrix to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system and determining the difference between them; adjusting the network parameters of the initial neural network based on the determined difference; determining whether a preset training-end condition is met; and, if it is met, taking the adjusted initial neural network as the trained three-dimensional human pose prediction network.

In a third aspect, embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the model training method of the first aspect.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, the program implementing the steps of the model training method of the first aspect when executed by a processor.

With the model training method, apparatus and electronic device provided by embodiments of the present disclosure, a set of training samples is obtained; a training sample is then selected from the set and the following training steps are performed: the sample human-body image of the selected training sample is fed into an initial neural network to obtain the corresponding three-dimensional human pose information; a transformation matrix between that pose information and the corresponding sample inertial motion-capture data is determined; the transformation matrix is used to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system so that the difference between them can be determined; the network parameters of the initial neural network are adjusted based on that difference; whether a preset training-end condition is met is determined; and, once the condition is met, the adjusted initial neural network is taken as the trained three-dimensional human pose prediction network. Because the transformation matrix between the network's predicted 3D pose keypoints and the corresponding inertial motion-capture 3D points is determined during training, the two sets of points can be brought into the same coordinate system on the fly. Compared with approaches that require calibrating the transformation between the inertial-capture coordinate system and the camera coordinate system before inertial motion-capture data can serve as a dataset for a 3D human pose estimation algorithm, the method of this embodiment saves that calibration cost and allows the 3D human pose prediction network to achieve better accuracy.

Brief Description of the Drawings

The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

Figure 1 is a diagram of an exemplary system architecture in which embodiments of the present disclosure may be applied;

Figure 2 is a flowchart of one embodiment of a model training method according to the present disclosure;

Figure 3 is a flowchart of one embodiment of predicting three-dimensional human pose information in a model training method according to the present disclosure;

Figure 4 is a schematic structural diagram of one embodiment of a model training apparatus according to the present disclosure;

Figure 5 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. It should be understood that the drawings and embodiments are for illustration only and do not limit the scope of protection of the present disclosure.

It should be understood that the steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. Furthermore, method implementations may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "include" and its variants are open-ended, meaning "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; "another embodiment" means "at least one additional embodiment"; and "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.

It should be noted that terms such as "first" and "second" in this disclosure are used only to distinguish different apparatuses, modules or units, not to limit the order of, or interdependence between, the functions they perform.

It should be noted that the modifiers "a/an" and "a plurality of" in this disclosure are illustrative rather than restrictive; unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustration only and do not limit the scope of those messages or information.

Figure 1 shows an exemplary system architecture 100 to which embodiments of the model training method of the present disclosure may be applied.

As shown in Figure 1, the system architecture 100 may include an inertial motion-capture device 101, networks 1021, 1022 and 1023, a terminal device 103 and a server 104. Network 1021 provides the medium for a communication link between the inertial motion-capture device 101 and the terminal device 103; network 1022, between the inertial motion-capture device 101 and the server 104; and network 1023, between the terminal device 103 and the server 104. Networks 1021, 1022 and 1023 may use various connection types, such as wired or wireless communication links or fiber-optic cables.

The inertial motion-capture device 101 may include, but is not limited to, inertial motion-capture sensors mounted on various nodes of the human body (for example, joints) and inertial motion-capture suits. Once the sensors are attached to the user's joints, or the user puts on a motion-capture suit, the posture and orientation of body parts can be collected, and the inertial motion-capture data can be transmitted to the terminal device 103 over network 1021 or to the server 104 over network 1022.

A user may use the terminal device 103 to interact with the server 104 over network 1023, for example to send or receive messages; the user may capture human-body images with the terminal device 103, and the server 104 may obtain those images from the terminal device 103. Various client applications may be installed on the terminal device 103, such as inertial motion-capture applications, image-capture applications and instant-messaging software.

The terminal device 103 may first obtain a set of training samples, where the sample inertial motion-capture data in the set may come from the inertial motion-capture device 101. It may then select a training sample from the set and, based on the selected sample, perform the following training steps: feed the sample human-body image into an initial neural network to obtain the corresponding three-dimensional human pose information; determine a transformation matrix between that pose information and the corresponding sample inertial motion-capture data; use the transformation matrix to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system and determine the difference between them; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training-end condition is met; and, if it is met, take the adjusted initial neural network as the trained three-dimensional human pose prediction network.

The terminal device 103 may be hardware or software. As hardware, it may be any electronic device with a camera that supports information interaction, including but not limited to smartphones, tablet computers and laptop computers. As software, it may be installed on any of the electronic devices listed above and implemented either as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or a single module. No specific limitation is imposed here.

The server 104 may be a server providing various services. For example, it may first obtain a set of training samples, where the sample human-body images may come from the terminal device 103 and the sample inertial motion-capture data from the inertial motion-capture device 101. It may then select a training sample from the set and, based on the selected sample, perform the following training steps: feed the sample human-body image into an initial neural network to obtain the corresponding three-dimensional human pose information; determine a transformation matrix between that pose information and the corresponding sample inertial motion-capture data; use the transformation matrix to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system and determine the difference between them; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training-end condition is met; and, if it is met, take the adjusted initial neural network as the trained three-dimensional human pose prediction network.

It should be noted that the server 104 may be hardware or software. As hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. As software, it may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or a single module. No specific limitation is imposed here.

It should also be noted that the model training method provided by embodiments of the present disclosure may be executed by the server 104, in which case the model training apparatus is typically provided in the server 104. The method may also be executed by the terminal device 103, in which case the apparatus is typically provided in the terminal device 103.

It should also be noted that, when the model training method provided by embodiments of the present disclosure is executed by the server 104 and the server 104 stores the training sample set locally, the exemplary system architecture 100 may omit the inertial motion-capture device 101, the networks 1021, 1022 and 1023, and the terminal device 103.

Likewise, when the model training method is executed by the terminal device 103 and the terminal device 103 locally stores the inertial motion-capture data, the initial neural network and other required information, the exemplary system architecture 100 may omit the inertial motion-capture device 101, the networks 1021, 1022 and 1023, and the server 104.

It should be understood that the numbers of inertial motion-capture devices, networks, terminal devices and servers in Figure 1 are merely illustrative; there may be any number of each, depending on implementation needs.

With continued reference to Figure 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method includes the following steps:

Step 201: obtain a set of training samples.

In this embodiment, the executing body of the model training method (for example, the terminal device or server shown in Figure 1) may obtain a set of training samples. Each training sample in the set may include a sample human-body image and corresponding sample inertial motion-capture data, the latter being the inertial motion-capture data of the human body shown in the image, collected while the image was taken. Inertial motion capture is a human motion-capture technology that uses wireless attitude sensors to collect the posture and orientation of body parts, reconstructs the body's motion using principles of human kinematics, and transmits the data wirelessly to computer software.
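A minimal sketch of what one element of such a training set might look like, assuming the inertial motion-capture data is reduced to per-joint 3D points; the field names, image size and joint count are illustrative assumptions, not values from the patent:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TrainingSample:
    """One training sample: a human-body image plus the inertial
    motion-capture 3D joint points recorded while the image was taken."""
    image: np.ndarray         # H x W x 3 sample human-body image
    mocap_points: np.ndarray  # K x 3 inertial motion-capture 3D points


# Placeholder data standing in for a real photo and real sensor readings.
sample = TrainingSample(
    image=np.zeros((256, 256, 3), dtype=np.uint8),
    mocap_points=np.zeros((17, 3), dtype=np.float32),
)
print(sample.image.shape, sample.mocap_points.shape)
```

The key property the training steps below rely on is the pairing: each image is stored together with the mocap points captured at the moment the image was shot.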

Here, while a user performs inertial motion capture of the human body with the inertial motion-capture device, a camera may photograph the user at the same time, yielding human-body images that correspond to the inertial motion-capture data.

Step 202: select a training sample from the training sample set and, based on the selected sample, perform the following training steps: feed the sample human-body image of the selected training sample into an initial neural network to obtain the corresponding three-dimensional human pose information; determine a transformation matrix between that pose information and the corresponding sample inertial motion-capture data; use the transformation matrix to bring the pose keypoints indicated by the pose information and the 3D points indicated by the inertial motion-capture data into the same coordinate system and determine the difference between them; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training-end condition is met; and, if it is met, take the adjusted initial neural network as the trained three-dimensional human pose prediction network.
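The training steps above can be sketched end to end on toy data. This is a deliberately simplified stand-in, not the patent's implementation: the "initial neural network" is a single linear layer, the estimated transform is translation-only, and the transform is treated as fixed within each gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 12                      # joints, feature dimension (toy sizes)

# Stand-in "initial neural network": one linear layer W mapping a feature
# vector to K 3D keypoints. The patent's network would be a real CNN.
W = rng.normal(scale=0.1, size=(3 * K, D))

f = rng.normal(size=D)            # stand-in features of one sample image
Q = rng.normal(size=(K, 3))       # sample inertial-mocap 3D points

def loss_and_grad(W):
    P = (W @ f).reshape(K, 3)     # predicted 3D pose keypoints
    t = Q.mean(0) - P.mean(0)     # estimated transform (translation only)
    diff = (P + t) - Q            # difference in the common frame
    loss = (diff ** 2).mean()
    # Treat t as constant for this step (a common simplification), so
    # dL/dP = 2*diff/size and dL/dW follows by the chain rule.
    gP = 2.0 * diff / diff.size
    gW = gP.reshape(-1, 1) @ f.reshape(1, -1)
    return loss, gW

initial, _ = loss_and_grad(W)
for _ in range(300):              # "adjust network parameters" loop
    _, gW = loss_and_grad(W)
    W -= 0.3 * gW                 # gradient-descent update
final, _ = loss_and_grad(W)
print(float(initial), float(final))
```

A real training-end condition would check a loss threshold or an iteration budget over the whole sample set; here the loop simply runs a fixed number of steps and the aligned loss shrinks.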

In this embodiment, the execution subject may select a training sample from the training sample set obtained in step 201 and, based on the selected training sample, perform the following training steps.

In this embodiment, training step 202 may include sub-steps 2021, 2022, 2023, 2024, 2025 and 2026, as follows:

Step 2021: input the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.

In this embodiment, the execution subject may input the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample. The initial neural network may be any neural network capable of deriving three-dimensional human body posture information from a human body image, for example a convolutional neural network or a deep neural network. The three-dimensional human body posture information may include the posture and orientation of various body parts, for example the directions and positions of human body joints.

Step 2022: determine the transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data.

In this embodiment, the execution subject may determine the transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data. The transformation matrix is a concept from linear algebra: a linear transformation can be represented by a matrix. If T is a linear transformation mapping R^n to R^m, and x is a column vector with n elements, then the m×n matrix A is called the transformation matrix of T.

Here, the execution subject may use the least squares method to determine the transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data. The least squares method is a mathematical optimization technique that finds the best functional fit to the data by minimizing the sum of squared errors. With the least squares method, unknown quantities can be obtained simply, such that the sum of the squared errors between the fitted values and the actual data is minimized.
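As a sketch of this step (the function names are illustrative; the patent does not specify an exact formulation), a least-squares affine transformation matrix between two sets of corresponding 3D points can be estimated in homogeneous coordinates:

```python
import numpy as np

def fit_transform(src, dst):
    """Least-squares 3x4 affine transform A with [src | 1] @ A.T ≈ dst.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. pose key
    points predicted by the network and inertial mocap 3D points.
    """
    # Append a column of ones so the transform can include a translation.
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])  # (N, 4)
    a_t, *_ = np.linalg.lstsq(src_h, dst, rcond=None)     # (4, 3)
    return a_t.T                                          # (3, 4)

def apply_transform(a, pts):
    """Map (N, 3) points through the 3x4 transform."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return pts_h @ a.T
```

With enough non-degenerate correspondences this recovers, for example, a rigid rotation plus translation between the camera frame and the inertial-capture frame.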

Step 2023: using the transformation matrix, convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determine the difference between the posture key points and the inertial motion capture three-dimensional points.

In this embodiment, the execution subject may use the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system. Here, the execution subject may establish a coordinate system, transform the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the established coordinate system, and transform the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the established coordinate system.

Afterwards, the execution subject may determine the difference between the posture key points and the inertial motion capture three-dimensional points. Specifically, the execution subject may use a preset loss function to determine this difference; for example, the mean squared error may be used as the loss function, or the L2 norm may be used as the loss function, to determine the difference between the posture key points and the inertial motion capture three-dimensional points.
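Either choice of loss reduces to a simple computation once both point sets are in the same coordinate system. The function below is a minimal sketch with an illustrative name and signature:

```python
import numpy as np

def keypoint_difference(pose_pts, mocap_pts, loss="mse"):
    """Difference between pose key points and inertial mocap 3D points.

    Both arrays are (N, 3) and are assumed to have already been
    converted into the same coordinate system via the transformation
    matrix. Name and signature are illustrative only.
    """
    diff = pose_pts - mocap_pts
    if loss == "mse":
        # mean squared error over all coordinates
        return float(np.mean(diff ** 2))
    if loss == "l2":
        # mean Euclidean (L2) distance per key point
        return float(np.mean(np.linalg.norm(diff, axis=1)))
    raise ValueError(f"unknown loss: {loss}")
```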

Step 2024: adjust the network parameters of the initial neural network based on the determined difference.

In this embodiment, the execution subject may adjust the network parameters of the initial neural network based on the difference determined in step 2023. Various implementations may be used to adjust the network parameters based on the difference between the posture key points and the inertial motion capture three-dimensional points. For example, the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm may be used to adjust the network parameters of the initial neural network.
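A minimal sketch of an SGD-style update is shown below; in a real system the gradients would come from back-propagating the key-point loss through the network, and this toy loss is illustrative only:

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """One stochastic gradient descent update: theta <- theta - lr * grad.

    `params` and `grads` are dicts of numpy arrays. This is a minimal
    sketch, not the patent's actual optimizer.
    """
    return {name: params[name] - lr * grads[name] for name in params}

# Illustrative run: minimise the toy loss (w - 3)^2 by SGD.
params = {"w": np.array(0.0)}
for _ in range(200):
    grads = {"w": 2.0 * (params["w"] - 3.0)}  # analytic gradient of the toy loss
    params = sgd_step(params, grads)
```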

Step 2025: determine whether a preset training end condition is met.

In this embodiment, the execution subject may determine whether a preset training end condition is met. The preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the determined difference is less than a preset difference threshold.
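The three end conditions can be checked together, stopping when any one holds. The threshold values below are illustrative placeholders, not values taken from the patent:

```python
import time

def should_stop(start_time, step, loss,
                max_seconds=3600.0, max_steps=10_000, loss_threshold=1e-3):
    """Preset training end conditions; training stops when any one holds."""
    if time.monotonic() - start_time > max_seconds:
        return True   # training time exceeds the preset duration
    if step > max_steps:
        return True   # number of iterations exceeds the preset number
    if loss < loss_threshold:
        return True   # determined difference is below the preset threshold
    return False
```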

If the training end condition is met, the execution subject may execute step 2026.

Step 2026: if the training end condition is met, determine the adjusted initial neural network as the trained three-dimensional human body posture prediction network.

In this embodiment, if it is determined in step 2025 that the training end condition is met, the execution subject may determine the adjusted initial neural network as the trained three-dimensional human body posture prediction network.

The method provided by the above embodiments of the present disclosure determines, during network training, the transformation matrix between the three-dimensional human posture key points output by the network and the corresponding inertial motion capture three-dimensional points, so that the two sets of points can be converted into the same coordinate system. Compared with approaches that require calibrating the transformation between the inertial capture coordinate system and the camera coordinate system when using inertial motion capture data as a dataset for a three-dimensional human posture estimation algorithm, the method of this embodiment saves the cost of calibration between different coordinate systems and allows the three-dimensional human posture prediction network to achieve better accuracy.

In some optional implementations, if it is determined in step 2025 that the training end condition is not met, the execution subject may take the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and, based on the newly selected training sample, continue to perform the above training steps (sub-steps 2021-2026).

In some optional implementations, the training samples in the training sample set may include sample human body videos and sample inertial motion capture data corresponding to the sample human body videos. The sample inertial motion capture data may be the inertial motion capture data, collected while the sample human body video was being shot, of the human body presented in that video. Here, while a user performs inertial motion capture with an inertial motion capture device, the user may simultaneously be filmed with a camera, thereby obtaining a human body video corresponding to the inertial motion capture data. The execution subject may obtain the three-dimensional human body posture information corresponding to the selected training sample as follows: input the sample human body video of the selected training sample into the initial neural network to obtain the corresponding three-dimensional human body posture information. The initial neural network may be any neural network capable of deriving three-dimensional human body posture information from human body video, for example a convolutional neural network or a deep neural network. The three-dimensional human body posture information may include the posture and orientation of various body parts, such as the directions and positions of human body joints. Compared with obtaining the transformation matrix from the three-dimensional human body posture information corresponding to a single image and the corresponding inertial motion capture data, here the transformation matrix is obtained from the three-dimensional human body posture information corresponding to an entire video and the corresponding inertial motion capture data. Using a video-level transformation matrix exploits a large amount of data to reduce error: the random errors of the samples partially cancel one another, so the average error of the network output is smaller, and the error of the estimated affine transformation is correspondingly smaller.
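The error-cancellation argument can be illustrated numerically. In the toy setup below (illustrative names, sizes and noise levels, not the patent's data), a simple offset between two point sets is estimated by least squares; pooling the correspondences of many video frames gives a smaller estimation error than a single image's worth of key points:

```python
import numpy as np

rng = np.random.default_rng(0)
true_offset = np.array([0.5, -1.0, 2.0])  # hypothetical ground-truth offset

def estimate_offset(n_points):
    """Least-squares estimate of a pure translation from n noisy
    correspondences (which reduces to the mean residual)."""
    src = rng.normal(size=(n_points, 3))
    dst = src + true_offset + rng.normal(scale=0.1, size=(n_points, 3))
    return np.mean(dst - src, axis=0)

trials = 50
# ~17 key points from a single image vs. the same key points pooled
# over 100 video frames; errors averaged over several trials.
err_image = np.mean([np.linalg.norm(estimate_offset(17) - true_offset)
                     for _ in range(trials)])
err_video = np.mean([np.linalg.norm(estimate_offset(17 * 100) - true_offset)
                     for _ in range(trials)])
```

The video-level estimate is consistently more accurate, mirroring the claim that sample errors partially cancel when the transform is fit over an entire video.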

In some optional implementations, the execution subject may convert the two sets of points into the same coordinate system as follows: using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are converted into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located. The coordinate system of the inertial motion capture three-dimensional points may be the human body coordinate system used when collecting the sample inertial motion capture data, for example a coordinate system whose origin is the lower-left corner of the human body region and whose two axes are parallel and perpendicular to the ground.

In some optional implementations, the execution subject may convert the two sets of points into the same coordinate system as follows: using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are converted into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located. The coordinate system of the posture key points may be the camera coordinate system corresponding to the sample human body image of the selected training sample.

With further reference to Fig. 3, Fig. 3 is a flow 300 of one embodiment of predicting three-dimensional human body posture information in the model training method according to this embodiment. The flow 300 of predicting three-dimensional human body posture information includes the following steps:

Step 301: obtain a human body image to be predicted.

In this embodiment, the execution subject of the model training method (for example, the terminal device or server shown in Fig. 1) may obtain the human body image to be predicted directly or indirectly. For example, when the execution subject is a terminal device, it may directly obtain the human body image to be predicted as input by the user; when the execution subject is a server, it may obtain the user-input human body image from a terminal device through a wired or wireless connection. Here, the human body image may include various parts of the human body, for example the head, waist, arms, legs, and so on.

Step 302: input the human body image into the trained three-dimensional human body posture prediction network to obtain the three-dimensional human body posture information of the human body presented in the human body image.

In this embodiment, the execution subject may input the human body image into the trained three-dimensional human body posture prediction network to obtain the three-dimensional human body posture information of the human body presented in the image. The three-dimensional human body posture prediction network is the network trained by the method described in Fig. 2, and may be used to characterize the correspondence between an image and the three-dimensional human body posture information of the human body presented in the image. The three-dimensional human body posture information may include the posture and orientation of various body parts, for example the directions and positions of human body joints.

By inputting a human body image into the three-dimensional human body posture prediction network trained by the method described in Fig. 2, the method provided by the above embodiments of the present disclosure predicts the three-dimensional human body posture information of the human body presented in the image, and in this way can improve the accuracy of the predicted three-dimensional human body posture information.

With further reference to Fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides one embodiment of a model training apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied to various electronic devices.

As shown in Fig. 4, the model training apparatus 400 of this embodiment includes: a first acquisition unit 401 and a training unit 402. The first acquisition unit 401 is used to acquire a training sample set, where the training samples include sample human body images and sample inertial motion capture data corresponding to the sample human body images, the sample inertial motion capture data being the inertial motion capture data, collected while the sample human body image was being shot, of the human body presented in that image. The training unit 402 is used to select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: input the sample human body image of the selected training sample into an initial neural network to obtain the corresponding three-dimensional human body posture information; determine the transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; using the transformation matrix, convert the posture key points indicated by the three-dimensional human body posture information and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determine the difference between the posture key points and the inertial motion capture three-dimensional points; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training end condition is met; and if the training end condition is met, determine the adjusted initial neural network as the trained three-dimensional human body posture prediction network.

In this embodiment, for the specific processing of the first acquisition unit 401 and the training unit 402 of the model training apparatus 400, reference may be made to step 201 and step 202 in the embodiment corresponding to Fig. 2.

In some optional implementations, the model training apparatus 400 may further include a feedback unit (not shown in the figure). The feedback unit may be used, if the training end condition is not met, to take the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and continue to perform the above training steps based on the selected training sample.

In some optional implementations, the model training apparatus 400 may further include a second acquisition unit (not shown in the figure) and an input unit (not shown in the figure). The second acquisition unit may be used to acquire a human body image to be predicted; the input unit may be used to input the human body image into the trained three-dimensional human body posture prediction network to obtain the three-dimensional human body posture information of the human body presented in the image.

In some optional implementations, the training samples may include sample human body videos and sample inertial motion capture data corresponding to the sample human body videos, the sample inertial motion capture data being the inertial motion capture data, collected while the sample human body video was being shot, of the human body presented in that video. The training unit 402 may further be used to obtain the three-dimensional human body posture information corresponding to the selected training sample as follows: input the sample human body video of the selected training sample into the initial neural network to obtain the corresponding three-dimensional human body posture information.

In some optional implementations, the training unit 402 may further be used to convert the two sets of points into the same coordinate system as follows: using the transformation matrix, convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.

In some optional implementations, the training unit 402 may further be used to convert the two sets of points into the same coordinate system as follows: using the transformation matrix, convert the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.

Referring now to Fig. 5, a schematic structural diagram of an electronic device 500 (for example, the server or terminal device in Fig. 1) suitable for implementing embodiments of the present disclosure is shown. Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in Fig. 5, the electronic device 500 may include a processing apparatus 501 (for example, a central processing unit or a graphics processor), which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage apparatus 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing apparatus 501, the ROM 502 and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

Generally, the following apparatuses may be connected to the I/O interface 505: input apparatuses 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer or gyroscope; output apparatuses 507 including, for example, a liquid crystal display (LCD), speaker or vibrator; storage apparatuses 508 including, for example, a magnetic tape or hard disk; and a communication apparatus 509. The communication apparatus 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 5 shows an electronic device 500 with various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or provided; more or fewer apparatuses may alternatively be implemented or provided. Each block shown in Fig. 5 may represent one apparatus, or may represent multiple apparatuses as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication apparatus 509, or installed from the storage apparatus 508, or installed from the ROM 502. When the computer program is executed by the processing apparatus 501, the above functions defined in the method of the embodiments of the present disclosure are performed. It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, they cause the electronic device to: obtain a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being the inertial motion capture data of the human body presented in the sample human body image, collected when the sample human body image was shot; select a training sample from the training sample set, and, based on the selected training sample, execute the following training steps: input the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determine a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; use the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determine the difference between the posture key points and the inertial motion capture three-dimensional points; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training end condition is met; and, if the training end condition is met, determine the adjusted initial neural network as the trained three-dimensional human body posture prediction network.

Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.

According to one or more embodiments of the present disclosure, a model training method is provided, which includes: obtaining a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being the inertial motion capture data of the human body presented in the sample human body image, collected when the sample human body image was shot; selecting a training sample from the training sample set, and, based on the selected training sample, executing the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; using the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting the network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is met; and, if the training end condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
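The iterative procedure above (forward pass, difference computation, parameter adjustment, end-condition check) can be sketched as follows. This is a minimal illustration, not the patented implementation: the "network" is a single linear layer, the "images" are random feature vectors, and the difference is a plain mean squared error; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (not the patented network): an "image" is an 8-dim
# feature vector, the "initial neural network" is a single linear layer that
# predicts 4 keypoints flattened to 12 numbers, and the "sample inertial
# motion capture data" is generated from a hidden ground-truth mapping.
W_true = rng.normal(size=(8, 12))
images = rng.normal(size=(32, 8))        # training sample set
mocap = images @ W_true                  # corresponding mocap targets

W = np.zeros((8, 12))                    # network parameters to adjust
lr, max_steps, tol = 0.5, 5000, 1e-8     # step size and preset end condition

for step in range(max_steps):
    pred = images @ W                    # forward pass: predicted keypoints
    diff = pred - mocap                  # difference to the mocap 3D points
    loss = float(np.mean(diff ** 2))
    if loss < tol:                       # training end condition met
        break
    # Adjust the network parameters based on the determined difference
    # (gradient of the mean squared error with respect to W).
    W -= lr * 2.0 * images.T @ diff / diff.size

final_loss = float(np.mean((images @ W - mocap) ** 2))
```

In the patented method the difference is computed only after both point sets have been brought into a common coordinate system via the transformation matrix; here that alignment step is replaced by a direct comparison to keep the sketch short.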

According to one or more embodiments of the present disclosure, the method further includes: if the training end condition is not met, using the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample.

According to one or more embodiments of the present disclosure, the method further includes: obtaining a human body image to be predicted; and inputting the human body image into the trained three-dimensional human body posture prediction network to obtain the three-dimensional human body posture information of the human body presented in the human body image.

According to one or more embodiments of the present disclosure, a training sample includes a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being the inertial motion capture data of the human body presented in the sample human body video, collected when the sample human body video was shot; and inputting the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample includes: inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.

According to one or more embodiments of the present disclosure, using the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system includes: using the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
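One common way to obtain such a transformation between predicted posture key points and mocap 3D points is a least-squares rigid alignment (the Kabsch algorithm). The sketch below is an assumption about how the transformation matrix could be estimated — the patent does not specify an estimation method — and the joint coordinates are synthetic:

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (Kabsch algorithm); src and dst are (N, 3) arrays of matched points."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Posture key points predicted by the network (network coordinate frame) ...
pose_kpts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# ... and the same joints as inertial mocap 3D points: a synthetic rotation
# about the z-axis plus a translation stands in for the mocap frame.
theta = np.pi / 2
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
mocap_pts = pose_kpts @ R_true.T + np.array([2.0, 0.0, 0.0])

R, t = estimate_rigid_transform(pose_kpts, mocap_pts)
aligned = pose_kpts @ R.T + t        # posture key points in the mocap frame
error = float(np.linalg.norm(aligned - mocap_pts, axis=1).mean())
```

After this conversion, the per-joint distance between the aligned key points and the mocap points is exactly the kind of difference on which the network parameters can be adjusted.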

According to one or more embodiments of the present disclosure, using the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system includes: using the transformation matrix to convert the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
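Converting in this opposite direction reuses the same rigid transform, inverted: for an orthogonal rotation R the inverse is its transpose, so the inverse mapping is p ↦ Rᵀ(p − t). A small sketch with made-up values:

```python
import numpy as np

# Suppose [R | t] maps posture key points into the mocap frame; the reverse
# conversion applies the inverse transform R^T (p - t). Values are synthetic.
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, -1.0, 2.0])

mocap_pts = np.array([[1.0, 2.0, 0.0],   # inertial mocap 3D points
                      [0.0, 1.0, 1.0],
                      [2.0, 0.0, 3.0]])

# Convert mocap points into the coordinate system of the posture key points:
# for row vectors, (R^T (p - t))^T == (p - t) @ R.
pose_frame_pts = (mocap_pts - t) @ R

# A round trip back to the mocap frame recovers the original points.
round_trip = pose_frame_pts @ R.T + t
```

Either direction of conversion yields the same per-joint differences, which is why the two embodiments are interchangeable.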

According to one or more embodiments of the present disclosure, a model training device is provided. The device includes: a first acquisition unit configured to acquire a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being the inertial motion capture data of the human body presented in the sample human body image, collected when the sample human body image was shot; and a training unit configured to select a training sample from the training sample set and, based on the selected training sample, execute the following training steps: input the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determine a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; use the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determine the difference between the posture key points and the inertial motion capture three-dimensional points; adjust the network parameters of the initial neural network based on the determined difference; determine whether a preset training end condition is met; and, if the training end condition is met, determine the adjusted initial neural network as the trained three-dimensional human body posture prediction network.

According to one or more embodiments of the present disclosure, the device further includes: a feedback unit configured to, if the training end condition is not met, use the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and continue to execute the training steps based on the selected training sample.

According to one or more embodiments of the present disclosure, the device further includes: a second acquisition unit configured to acquire a human body image to be predicted; and an input unit configured to input the human body image into the trained three-dimensional human body posture prediction network to obtain the three-dimensional human body posture information of the human body presented in the human body image.

According to one or more embodiments of the present disclosure, a training sample includes a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being the inertial motion capture data of the human body presented in the sample human body video, collected when the sample human body video was shot; and the training unit is further configured to input the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample in the following manner: inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.

According to one or more embodiments of the present disclosure, the training unit is further configured to use the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner: using the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.

According to one or more embodiments of the present disclosure, the training unit is further configured to use the transformation matrix to convert the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner: using the transformation matrix to convert the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.

According to one or more embodiments of the present disclosure, an electronic device is provided, including: one or more processors; and a storage device for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the above model training method.

According to one or more embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored. When the program is executed by a processor, the steps of the above model training method are implemented.

The units involved in the embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including a first acquisition unit and a training unit. In some cases the names of these units do not limit the units themselves; for example, the first acquisition unit may also be described as "a unit that acquires a training sample set."

The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (14)

1. A method of model training, comprising:
acquiring a training sample set, wherein a training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, and the sample inertial motion capture data is inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image is shot;
selecting a training sample from the training sample set, and executing the following training steps based on the selected training sample: inputting a sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system by using the transformation matrix, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is met; and, if the training end condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
2. The method according to claim 1, wherein the method further comprises:
if the training end condition is not met, taking the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample.
3. The method according to claim 1, wherein the method further comprises:
acquiring a human body image to be predicted;
inputting the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body presented in the human body image.
4. The method of claim 1, wherein the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of a human body presented in the sample human body video acquired when the sample human body video was captured; and
wherein inputting the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample comprises:
inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
5. The method according to one of claims 1 to 4, wherein converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises:
converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
6. The method according to one of claims 1 to 4, wherein converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises:
converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
7. A model training device, comprising:
a first acquisition unit, configured to acquire a training sample set, wherein a training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, and the sample inertial motion capture data is inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image is shot;
a training unit, configured to select a training sample from the training sample set and execute the following training steps based on the selected training sample: inputting a sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system by using the transformation matrix, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is met; and, if the training end condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a feedback unit, configured to, if the training end condition is not met, take the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and continue to execute the training steps based on the selected training sample.
9. The apparatus of claim 7, wherein the apparatus further comprises:
a second acquisition unit, configured to acquire a human body image to be predicted; and
an input unit, configured to input the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body presented in the human body image.
10. The apparatus of claim 7, wherein the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of a human body presented in the sample human body video acquired when the sample human body video was captured; and
the training unit is further configured to input the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample by:
inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
11. The apparatus according to one of claims 7 to 10, wherein the training unit is further configured to convert, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system by:
converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
12. The apparatus according to one of claims 7 to 10, wherein the training unit is further configured to convert, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system by:
converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
14. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202110195069.4A 2021-02-20 2021-02-20 Model training method and device and electronic equipment Active CN112818898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195069.4A CN112818898B (en) 2021-02-20 2021-02-20 Model training method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110195069.4A CN112818898B (en) 2021-02-20 2021-02-20 Model training method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112818898A CN112818898A (en) 2021-05-18
CN112818898B true CN112818898B (en) 2024-02-20

Family

ID=75864468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195069.4A Active CN112818898B (en) 2021-02-20 2021-02-20 Model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112818898B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077383B (en) * 2021-06-07 2021-11-02 深圳追一科技有限公司 Model training method and model training device
CN113688907B (en) * 2021-08-25 2023-07-21 北京百度网讯科技有限公司 Model training, video processing method, device, device and storage medium
CN114004255A (en) * 2021-10-25 2022-02-01 中国科学技术大学 Posture detection method, posture detection apparatus, electronic device, and readable storage medium
CN113971630B (en) * 2021-10-27 2026-02-03 上海设序科技有限公司 Method and device for recommending projection gestures of converting three-dimensional structure diagram into three-dimensional structure diagram
CN114722913B (en) * 2022-03-16 2025-06-27 北京奕斯伟计算技术股份有限公司 Attitude detection method, device, electronic device and computer readable storage medium
CN115188071A (en) * 2022-07-05 2022-10-14 北京字跳网络技术有限公司 Gesture recognition method, device and electronic device
CN116645570B (en) * 2023-04-19 2024-12-06 世优(北京)科技股份有限公司 A training method and system for user motion capture model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166594A1 (en) * 2016-03-31 2017-10-05 百度在线网络技术(北京)有限公司 Indoor map construction method, device, and storage medium
CN108664122A (en) * 2018-04-04 2018-10-16 歌尔股份有限公司 Attitude prediction method and apparatus
CN108762495A (en) * 2018-05-18 2018-11-06 深圳大学 Virtual reality driving method and virtual reality system based on arm motion capture
CN109145788A (en) * 2018-08-08 2019-01-04 北京云舶在线科技有限公司 Video-based attitude data capture method and system
CN109211267A (en) * 2018-08-14 2019-01-15 广州虚拟动力网络技术有限公司 Rapid attitude calibration method and system for inertial motion capture
CN110095116A (en) * 2019-04-29 2019-08-06 桂林电子科技大学 Combined visual positioning and inertial navigation localization method based on LIFT
CN210931431U (en) * 2019-08-07 2020-07-07 兰州交通大学 Novel multi-person motion capture device
CN112363617A (en) * 2020-10-28 2021-02-12 海拓信息技术(佛山)有限公司 Method and device for acquiring human body motion data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10488521B2 (en) * 2017-06-13 2019-11-26 TuSimple Sensor calibration and time method for ground truth static scene sparse flow generation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Kinematic Tracking of Rehabilitation Patients With Markerless Pose Estimation Fused with Wearable Inertial Sensors; R. James Cotton; 2020 15th IEEE International; 20210118; 508-514 *
Real-time finger tracking using active motion capture: a neural network approach robust to occlusions; Dario Pavllo et al.; Proceedings of the 11th ACM SIGGRAPH Conference on Motion; 20181130 (No. 06); 1-10 *
Three-dimensional human pose estimation based on monocular vision; Feng Tao; China Master's Theses Full-text Database (Information Science and Technology); 20200215 (No. 2020(02)); I138-1741 *
Real-time multi-person pose estimation in complex scenes; Hua Guoguang; China Master's Theses Full-text Database (Information Science and Technology); 20190915 (No. 2019(09)); I138-1030 *

Also Published As

Publication number Publication date
CN112818898A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112818898B (en) Model training method and device and electronic equipment
CN109800732B (en) Method and device for generating cartoon head portrait generation model
CN111292420A (en) Method and apparatus for building a map
CN110032978A (en) Method and apparatus for handling video
CN116596748A (en) Image stylization processing method, apparatus, device, storage medium, and program product
CN115689898A (en) Attitude estimation method, device, equipment and medium
CN110555798B (en) Image deformation method, device, electronic equipment and computer-readable storage medium
CN112528957A (en) Human motion basic information detection method and system and electronic equipment
CN114879846A (en) Method, device, equipment and medium for determining trigger position
CN112991542B (en) House three-dimensional reconstruction method, device and electronic equipment
CN110717467A (en) Head pose estimation method, device, equipment and storage medium
CN116079697B (en) A monocular visual servoing method, device, equipment and medium based on image
CN112070903A (en) Virtual object display method and device, electronic equipment and computer storage medium
CN119577499A (en) Motion capture method, device, electronic device and storage medium
CN111310595A (en) Method and apparatus for generating information
WO2023151558A1 (en) Method and apparatus for displaying images, and electronic device
CN115880719B (en) Gesture depth information generation method, device, equipment and computer readable medium
CN111209050A (en) Method and device for switching working mode of electronic equipment
CN111586295A (en) Image generation method and device and electronic equipment
CN113034570B (en) Image processing method, device and electronic device
CN111768443A (en) Image processing method and device based on mobile camera
CN118260183A (en) Test method and device of augmented reality equipment, electronic equipment and storage medium
CN112001943B (en) Motion estimation method and device, computer readable medium and electronic equipment
CN112883757B (en) Method for generating tracking attitude result
CN115841151B (en) Model training method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant