CN109977834A

CN109977834A - The method and apparatus divided manpower from depth image and interact object

Info

Publication number: CN109977834A
Application number: CN201910207311.8A
Authority: CN
Inventors: 徐枫; 薄子豪; 雍俊海
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2019-07-05
Anticipated expiration: 2039-03-19
Also published as: CN109977834B

Abstract

The application proposes a kind of method and apparatus divided manpower from depth image and interact object, wherein method includes: to construct the manpower partitioned data set based on depth image using the dividing method based on color image；Using the manpower partitioned data set based on depth image, training obtains parted pattern, and parted pattern is made of encoder, attention TRANSFER MODEL and decoder；Depth image to be processed is split using parted pattern, obtains tag along sort figure corresponding with depth image to be processed, the value of each pixel is the types value of each pixel in tag along sort figure.The parted pattern that this method is obtained using the manpower partitioned data set based on depth image by training, depth image to be processed is split using parted pattern, realize the manpower and object segmentation of pixel scale, environmental robustness is improved, segmentation precision is higher, the manpower that is capable of handling under complex interaction situation and the case where object segmentation.

Description

Method and apparatus for segmenting human hands and interactive objects from depth images

技术领域technical field

本申请涉及计算机视觉技术领域，尤其涉及一种从深度图像中分割人手与交互物体的方法和装置。The present application relates to the technical field of computer vision, and in particular, to a method and apparatus for segmenting human hands and interactive objects from depth images.

背景技术Background technique

人手分割是许多诸如手势识别、人手跟踪、人手重建等研究领域的基本问题。相比于单独的手部运动，对同物体交互状态下的研究在人机交互和虚拟现实领域更较重要。Human hand segmentation is a fundamental problem in many research fields such as gesture recognition, hand tracking, and hand reconstruction. Compared with individual hand movements, the research on the interaction with the object is more important in the fields of human-computer interaction and virtual reality.

近年来通用的基于神经网络的语义分割模型越来越完善，但已有方法模型的环境鲁棒性低、分割精度差、无法处理复杂交互情形下的人手分割。In recent years, general neural network-based semantic segmentation models have become more and more perfect, but the existing method models have low environmental robustness, poor segmentation accuracy, and cannot handle human hand segmentation in complex interactive situations.

发明内容SUMMARY OF THE INVENTION

本申请提出一种从深度图像中分割人手与交互物体的方法和装置，用于解决相关技术中现有的人手分割模型环境鲁棒性低、分割精度差、无法处理复杂交互情形下的人手分割的问题。This application proposes a method and device for segmenting a human hand and an interactive object from a depth image, which is used to solve the problem that the existing human hand segmentation model in the related art has low environmental robustness, poor segmentation accuracy, and cannot handle human hand segmentation in complex interactive situations. The problem.

本申请一方面实施例提出了一种从深度图像中分割人手与交互物体的方法，包括：An embodiment of the present application provides a method for segmenting a human hand and an interactive object from a depth image, including:

利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集；Using the segmentation method based on color image, construct a dataset of human hand segmentation based on depth image;

利用所述基于深度图像的人手分割数据集，训练得到分割模型，所述分割模型由编码器、注意力传递模型和解码器构成；Using the depth image-based human hand segmentation data set, a segmentation model is obtained by training, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;

利用所述分割模型对待处理的深度图像进行分割，获取与所述待处理的深度图像对应的分类标签图，所述分类标签图中每个像素点的值为所述每个像素点的类型值，所述类型值用于表征像素点在所述待处理的深度图像中所属的类型。Use the segmentation model to segment the depth image to be processed, and obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is the type value of each pixel , and the type value is used to represent the type to which the pixel belongs in the depth image to be processed.

本申请实施例的从深度图像中分割人手与交互物体的方法，通过利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集，利用基于深度图像的人手分割数据集，训练分割模型，分割模型由编码器、注意力传递模型和解码器构成，利用分割模型对待处理的深度图像进行分割，获取与待处理的深度图像对应的分类标签图，分类标签图中每个像素点的值为每个像素点的类型值，根据每个像素点的类型值可以确定每个像素点所属的类型，由此，利用基于深度图像的人手分割数据集通过训练得到的分割模型，利用分割模型对待处理的深度图像进行分割，实现了像素级别的人手与物体分割，提高了环境鲁棒性，分割精度较高、能够处理复杂交互情形下的人手与物体分割的情况。In the method for segmenting a human hand and an interactive object from a depth image according to the embodiment of the present application, a segmentation method based on a color image is used to construct a human hand segmentation data set based on a depth image, and a segmentation model is trained by using the human hand segmentation data set based on the depth image, The segmentation model consists of an encoder, an attention transfer model and a decoder. The segmentation model is used to segment the depth image to be processed, and a classification label map corresponding to the depth image to be processed is obtained. The value of each pixel in the classification label map is The type value of each pixel point, according to the type value of each pixel point, the type to which each pixel point belongs can be determined. Therefore, the segmentation model obtained by training the human hand segmentation data set based on the depth image is used, and the segmentation model is used for processing. The depth image is segmented, which realizes the segmentation of human hands and objects at the pixel level, improves the environmental robustness, has high segmentation accuracy, and can handle the segmentation of human hands and objects in complex interactive situations.

本申请另一方面实施例提出了一种从深度图像中分割人手与交互物体的装置，包括：Another embodiment of the present application provides an apparatus for segmenting a human hand and an interactive object from a depth image, including:

构建模块，用于利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集；A building block for building a depth image-based hand segmentation dataset using a color image-based segmentation method;

训练模块，用于利用所述基于深度图像的人手分割数据集，训练得到分割模型，所述分割模型由编码器、注意力传递模型和解码器构成；a training module for using the depth image-based human hand segmentation data set to obtain a segmentation model through training, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;

识别模块，用于利用所述分割模型对待处理的深度图像进行分割，获取与所述待处理的深度图像对应的分类标签图，所述分类标签图中每个像素点的值为所述每个像素点的类型值，所述类型值用于表征像素点在所述待处理的深度图像中所属的类型。The identification module is used to segment the depth image to be processed by using the segmentation model, and obtain a classification label map corresponding to the depth image to be processed, and the value of each pixel in the classification label map is the value of each Type value of the pixel point, where the type value is used to represent the type of the pixel point in the depth image to be processed.

本申请实施例的从深度图像中分割人手与交互物体的装置，通过利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集，利用基于深度图像的人手分割数据集，训练分割模型，分割模型由编码器、注意力传递模型和解码器构成，利用分割模型对待处理的深度图像进行分割，获取与待处理的深度图像对应的分类标签图，分类标签图中每个像素点的值为每个像素点的类型值，根据每个像素点的类型值可以确定每个像素点所属的类型，由此，利用基于深度图像的人手分割数据集通过训练得到的分割模型，利用分割模型对待处理的深度图像进行分割，实现了像素级别的人手与物体分割，提高了环境鲁棒性，分割精度较高、能够处理复杂交互情形下的人手与物体分割的情况。The device for segmenting a human hand and an interactive object from a depth image according to the embodiment of the present application constructs a depth image-based human hand segmentation dataset by using a color image-based segmentation method, and uses the depth image-based human hand segmentation dataset to train a segmentation model, The segmentation model consists of an encoder, an attention transfer model and a decoder. The segmentation model is used to segment the depth image to be processed, and a classification label map corresponding to the depth image to be processed is obtained. The value of each pixel in the classification label map is The type value of each pixel point, according to the type value of each pixel point, the type to which each pixel point belongs can be determined. Therefore, the segmentation model obtained by training the human hand segmentation data set based on the depth image is used, and the segmentation model is used for processing. The depth image is segmented, which realizes the segmentation of human hands and objects at the pixel level, improves the environmental robustness, has high segmentation accuracy, and can handle the segmentation of human hands and objects in complex interactive situations.

本申请附加的方面和优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本申请的实践了解到。Additional aspects and advantages of the present application will be set forth, in part, in the following description, and in part will be apparent from the following description, or learned by practice of the present application.

附图说明Description of drawings

本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解，其中：The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, wherein:

图1为本申请实施例提供的一种从深度图像中分割人手与交互物体的方法的流程示意图；1 is a schematic flowchart of a method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application;

图2为本申请实施例提供的一种分割模型的结构示意图；2 is a schematic structural diagram of a segmentation model provided by an embodiment of the present application;

图3为本申请实施例提供的一种注意力机制模型的结构示意图；FIG. 3 is a schematic structural diagram of an attention mechanism model provided by an embodiment of the present application;

图4为本申请实施例提供的另一种从深度图像中分割人手与交互物体的方法的流程示意图；4 is a schematic flowchart of another method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application;

图5为本申请实施例提供的一种分割模型的训练过程示意图；5 is a schematic diagram of a training process of a segmentation model provided by an embodiment of the present application;

图6为本申请实施例提供的一种使用轮廓误差的效果示意图；FIG. 6 is a schematic diagram of the effect of using contour error according to an embodiment of the present application;

图7为本申请实施例提供的一种从深度图像中分割人手与交互物体的装置的结构示意图。FIG. 7 is a schematic structural diagram of an apparatus for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.

具体实施方式Detailed ways

下面详细描述本申请的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，旨在用于解释本申请，而不能理解为对本申请的限制。The following describes in detail the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary, and are intended to be used to explain the present application, but should not be construed as a limitation to the present application.

下面参考附图描述本申请实施例的从深度图像中分割人手与交互物体的方法和装置。The following describes the method and apparatus for segmenting a human hand and an interactive object from a depth image according to the embodiments of the present application with reference to the accompanying drawings.

图1为本申请实施例提供的一种从深度图像中分割人手与交互物体的方法的流程示意图。FIG. 1 is a schematic flowchart of a method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.

如图1所示，该从深度图像中分割人手与交互物体的方法包括：As shown in Figure 1, the method for segmenting a human hand and an interactive object from a depth image includes:

步骤101，利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集。Step 101 , using a color image-based segmentation method to construct a depth image-based human hand segmentation dataset.

由于深度相机可同时采集彩色和深度图像，因此可利用深度相机采集人手与物体交互的彩色图像与深度图像，从而获得多对彩色图像和深度图像。然后，基于彩色图像对深度图像处理，进而获得于深度图像的人手分割数据集。Since the depth camera can collect color and depth images at the same time, the depth camera can be used to collect color images and depth images of the interaction between human hands and objects, so as to obtain multiple pairs of color images and depth images. Then, the depth image is processed based on the color image, and then the human hand segmentation dataset based on the depth image is obtained.

为了提高分割精度，本实施例中，可在相同亮度和色温的固定光源，对与人手皮肤相差较大的颜色的物体进行采集。例如，在同一亮度和光源的情况下，采集手握蓝色钢笔的图像。In order to improve the segmentation accuracy, in this embodiment, a fixed light source with the same brightness and color temperature may be used to collect objects with colors that are quite different from the skin of a human hand. For example, capture an image of a hand holding a blue pen with the same brightness and light source.

步骤102，利用基于深度图像的人手分割数据集，训练得到分割模型。Step 102 , using the depth image-based human hand segmentation dataset to train to obtain a segmentation model.

在获取基于深度图像的人手分割数据集后，利用该数据集对初始的神经网络模型进行训练，得到满足要求的分割模型。After obtaining the human hand segmentation dataset based on depth images, the initial neural network model is trained with the dataset, and a segmentation model that meets the requirements is obtained.

其中，在训练过程中，可以利用损失函数衡量分割模型的预测性能。Among them, in the training process, the loss function can be used to measure the prediction performance of the segmentation model.

本实施例中，分割模型由编码器、注意力传递模型和解码器构成。其中，编码器使用大型卷积网络，解码器使用反卷积层恢复高层信息到图像像素尺度。In this embodiment, the segmentation model consists of an encoder, an attention transfer model and a decoder. Among them, the encoder uses a large convolutional network, and the decoder uses a deconvolution layer to restore high-level information to the image pixel scale.

图2为本申请实施例提供的一种分割模型的结构示意图。如图2所示，分割模型由编码器、注意力传递模型和解码器构成。本实施例中，在编码器与解码器之间增加注意力机制，通过融合多尺度图像特征构建注意力特征图，用于强化编解码器间的同层连接，可提高两者之间信息传递的精确性和有效性。FIG. 2 is a schematic structural diagram of a segmentation model provided by an embodiment of the present application. As shown in Figure 2, the segmentation model consists of an encoder, an attention transfer model, and a decoder. In this embodiment, an attention mechanism is added between the encoder and the decoder, and an attention feature map is constructed by fusing multi-scale image features, which is used to strengthen the same-layer connection between the encoder and decoder, which can improve the information transfer between the two. accuracy and validity.

图3为本申请实施例提供的一种注意力机制模型的结构示意图。图3中，将第1层，第2层，…，第i-1层特征图进行相乘获取底层注意力(FineAtt)；第1层，第2层，…，第i-1层的每层包括尺度放缩网络(SqueezeNet，简称SN)和双线性下采样层(Bilinear down-sampling，简称DS)，其中，SN可以归一化特征图维度。将第i+1层，第i+2层，…，第n层特征图相乘获取构成高层注意力(CoarseAtt)；其中，第i+1层，第i+2层，…，第n层的每层包括SN和上采样层(up-sampling layer，简称US)。DS和US分别用于缩小特征图尺度和放大特征图尺度。将获取的FineAtt和CoarseAtt注意力图，同第i层的特征图级联后输入解码器。对图3中第1层到第n层的每一层的特征图尺度，均使用该注意力机制进行强化。FIG. 3 is a schematic structural diagram of an attention mechanism model provided by an embodiment of the present application. In Figure 3, the first layer, the second layer, ..., the i-1 layer feature maps are multiplied to obtain the bottom layer attention (FineAtt); the first layer, the second layer, ..., the i-1 layer each The layers include a scale scaling network (SqueezeNet, referred to as SN) and a bilinear down-sampling layer (Bilinear down-sampling, referred to as DS), where SN can normalize the dimension of the feature map. Multiply the feature maps of the i+1th layer, the i+2th layer, ..., the nth layer to obtain the high-level attention (CoarseAtt); among them, the i+1th layer, the i+2th layer, ..., the nth layer Each layer includes SN and an up-sampling layer (US for short). DS and US are used to downscale and upscale feature maps, respectively. The obtained FineAtt and CoarseAtt attention maps are concatenated with the feature map of the i-th layer and input to the decoder. The feature map scale of each layer from the 1st layer to the nth layer in Figure 3 is enhanced by this attention mechanism.

步骤103，利用分割模型对待处理的深度图像进行分割，获取与待处理的深度图像对应的分类标签图。Step 103: Use the segmentation model to segment the depth image to be processed, and obtain a classification label map corresponding to the depth image to be processed.

本实施例中，在对处理的深度图像进行识别之前，可通过深度相机获取待处理的深度图像。In this embodiment, before the processed depth image is identified, the depth image to be processed may be acquired by a depth camera.

在获取分割模型后，将待处理的深度图像输入到训练得到的分割模型中，分割模型输出与待处理的深度图像对应的分类标签图。其中，分类标签图与待处理的深度图像的尺寸相同，且分类标签图中每个像素点的值为每个像素点的类型值。类型值用于表征像素点在待处理的深度图像中所属的类型。另外，像素点坐标值隐含在图像像素排列中，输入的深度图像里每个像素点的值为深度值。After the segmentation model is acquired, the depth image to be processed is input into the segmentation model obtained by training, and the segmentation model outputs a classification label map corresponding to the depth image to be processed. The size of the classification label map is the same as that of the depth image to be processed, and the value of each pixel in the classification label map is the type value of each pixel. The type value is used to characterize the type of the pixel in the depth image to be processed. In addition, the pixel coordinate value is implicit in the image pixel arrangement, and the value of each pixel in the input depth image is the depth value.

其中，像素点在待处理深度图像中所属的类型可包括人手、物体、背景。在具体实现时，可用不同的类型值表示人手、物体、背景这三个类型。例如，用0表示背景、1表示人手、2表示物体。The type to which the pixel points belong in the depth image to be processed may include human hand, object, and background. During specific implementation, three types of human hand, object, and background can be represented by different type values. For example, use 0 for the background, 1 for the human hand, and 2 for the object.

本实施例中，根据每个像素点的类型值及类型值对应的类型，可以得到待处理的深度图像中人手和物体的分割结果，实现人手与交互的物体分割。In this embodiment, according to the type value of each pixel point and the type corresponding to the type value, the segmentation result of the human hand and the object in the depth image to be processed can be obtained, so as to realize the segmentation of the human hand and the interactive object.

如图2所示，将待处理的深度图像输入至深度网络模型中，先经过编码器，再经过注意力传递模型，最后经过解码器，输出待处理的深度图像的分类标签图，根据每个像素点的类型值，得到人手和物体的位置，实现人手与物体分割。As shown in Figure 2, the depth image to be processed is input into the deep network model, first through the encoder, then through the attention transfer model, and finally through the decoder to output the classification label map of the depth image to be processed. The type value of the pixel point, get the position of the human hand and the object, and realize the segmentation of the human hand and the object.

本申请实施例中，根据分割模型输出的待处理的深度图像中每个像素点的类型值及类型值对应的类型，可以确定属于人手的像素点和属于物体的像素点，从而实现了将待处理图像中交互的人手与物体分割开，实现了像素级别的人手与物体分割，分割精度较高，能够对复杂情形下交互的人手与物体进行分割。In the embodiment of the present application, according to the type value of each pixel in the depth image to be processed output by the segmentation model and the corresponding type of the type value, the pixels belonging to the human hand and the pixels belonging to the object can be determined, thereby realizing the The human hand and objects that interact in the processing image are separated, and the pixel-level human hand and object segmentation is realized. The segmentation accuracy is high, and it can segment the interacting human hands and objects in complex situations.

在本申请的一个实施例中，可根据彩色图像构建基于深度图像的人手分割训练数据集。下面结合图4进行详细说明，图4为本申请实施例提供的另一种从深度图像中分割人手与交互物体的方法的流程示意图。In one embodiment of the present application, a training dataset for human hand segmentation based on depth images can be constructed from color images. The following describes in detail with reference to FIG. 4 , which is a schematic flowchart of another method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.

如图4所示，该构建基于深度图像的人手分割数据集方法包括：As shown in Figure 4, the method for constructing a depth image-based hand segmentation dataset includes:

步骤301，获取人手与物体交互情景下，多对彩色图像和深度图像。Step 301: Acquire multiple pairs of color images and depth images in a scenario of interaction between a human hand and an object.

本实施例中，可先人为收集一些与人手皮肤颜色相差比较大的物体。然后，利用深度相机拍摄人手与每个物体交互情景下的图像，从而得到多对彩色图像和深度图像。另外，为了提高数据量，对于同一物体，可采集人手与物体不同交互姿势的图像。In this embodiment, some objects that are quite different in color from the skin color of a human hand can be collected manually. Then, a depth camera is used to capture images of the human hand interacting with each object, thereby obtaining multiple pairs of color images and depth images. In addition, in order to increase the amount of data, for the same object, images of different interaction poses between the human hand and the object can be collected.

在利用深度相机采集图像时，固定光照环境，例如使用相同亮度和色温的固定光源，以保证采集的彩色图像清晰无阴影。When capturing images with a depth camera, fix the lighting environment, such as using a fixed light source with the same brightness and color temperature, to ensure that the captured color images are clear and without shadows.

步骤302，对所有彩色图像进行基于HSV颜色空间的物体分割，获取每张彩色图像中每个像素点的类型值。Step 302 , perform object segmentation based on the HSV color space on all color images, and obtain the type value of each pixel in each color image.

本实施例中，可先通过深度阈值剔除所有彩色图像和深度图像中的背景，保留人手和物体的图像。然后，根据现有的RGB颜色空间到HSV颜色空间的转换公式，将获取的所有彩色图像转换到HSV颜色空间。其中，HSV颜色空间的参数分别是：色调(H)，饱和度(S)，明度(V)。In this embodiment, the backgrounds in all color images and depth images can be removed first through a depth threshold, and images of human hands and objects are retained. Then, according to the existing RGB color space to HSV color space conversion formula, all acquired color images are converted to HSV color space. Among them, the parameters of the HSV color space are: hue (H), saturation (S), and lightness (V).

之后，对每张彩色图像对应的HSV颜色空间进行分割，以得到每张彩色图像中每个像素点的类型值。具体而言，分析多张纯手样本和交互样本像素点在HSV空间中的分布，样本间重合区域即为人手像素点对应的区域，拟合出多条线性约束条件。对所有彩色图像进行分析，位于约束内的像素点标为人手，位于约束外的标为物体。After that, segment the HSV color space corresponding to each color image to obtain the type value of each pixel in each color image. Specifically, the distribution of pixels of multiple pure-hand samples and interactive samples in HSV space is analyzed, and the overlapping area between samples is the area corresponding to the pixels of human hands, and multiple linear constraints are fitted. All color images were analyzed, and the pixels inside the constraints were marked as human hands, and the pixels outside the constraints were marked as objects.

步骤303，针对每对彩色图像和深度图像，将彩色图像中每个像素点，映射到的深度图像中对应像素点，构建基于深度图像的人手分割训练数据集。Step 303: For each pair of color images and depth images, map each pixel in the color image to the corresponding pixel in the depth image to construct a depth image-based human hand segmentation training dataset.

针对每对彩色图像和深度图像，将彩色图像和深度图像进行像素对齐，即对深度和彩色传感器的相机内外参分别进行估计，将深度点云仿射变换至彩色相机空间使用一种自动化标注方法生成基于彩色图像的真实分类标签图像，该真实分类标签图也是彩色图像对应的深度图像的真实分类标签图。其中，真实分类标签图像中每个像素点的类型值可用0表示背景，1表示手，2表示物体。For each pair of color image and depth image, pixel-align the color image and depth image, that is, estimate the camera internal and external parameters of the depth and color sensors respectively, and use an automatic labeling method to affine the depth point cloud to the color camera space. Generates a true classification label image based on a color image, which is also the true classification label map of the depth image corresponding to the color image. Among them, the type value of each pixel in the real classification label image can be 0 to represent the background, 1 to represent the hand, and 2 to represent the object.

本实施例中，所有深度图像及其真实分类标签图构成了基于深度图像的人手分割训练数据集。In this embodiment, all depth images and their true classification label maps constitute a training dataset for human hand segmentation based on depth images.

进一步地，为了提高分割精度，在本申请的一个实施例中，在进行映射之前，可先对深度图像进行预处理，使用形态学和轮廓滤波方法进行去噪，并分析深度图像中的背景，仅保留人手以及与人手交互的物体。Further, in order to improve the segmentation accuracy, in an embodiment of the present application, before performing the mapping, the depth image may be preprocessed, the morphological and contour filtering methods may be used for denoising, and the background in the depth image may be analyzed, Only the human hand and objects that interact with the human hand are kept.

在获取用于训练分割模型的数据集后，在训练模型时，可先将基于深度图像的人手分割训练数据集分为训练数据集和测试数据集，其中，训练数据集中深度图像的数量远大于测试数据集中深度图像的数量，训练数据集用于训练，测试数据集用于对训练完成的模型进行测试。After obtaining the data set for training the segmentation model, when training the model, the training data set for human hand segmentation based on depth images can be divided into training data set and test data set, wherein the number of depth images in the training data set is much larger than The number of depth images in the test dataset, the training dataset is used for training, and the test dataset is used to test the trained model.

然后，利用训练数据集，对初始的分割模型进行训练，并计算第一损失函数。其中，第一损失函数采用softmax交叉熵损失函数，如下公式(1)所示：Then, using the training data set, the initial segmentation model is trained and the first loss function is calculated. Among them, the first loss function adopts the softmax cross entropy loss function, as shown in the following formula (1):

其中，y_i表示真实结果，x_i表示分割模型输出的预测值，下标i表示不同的类型，下标j也表示不同的类型。例如，像素点共有三种类型，首先计算类型值i＝0的损失，该损失为计算类型值i＝1的损失：计算类型值i＝2的损失：那么模型的损失为 Among them, _yi represents the real result, _xi represents the predicted value output by the segmentation model, the subscript i represents different types, and the subscript j also represents different types. For example, there are three types of pixel points, first calculate the loss of type value i=0, the loss is Compute the loss for type value i=1: Compute the loss for type value i=2: Then the loss of the model is

需要说明的是，第一损失函数也可以是其他能够实现分割任务的损失函数。It should be noted that the first loss function may also be other loss functions capable of realizing the segmentation task.

具体而言，将训练数据集中的深度图像输入至初始神经网络模型中，网络模型输出深度图像的预测分类标签图。然后，根据预测分类标签图与深度图像的真实标签图之间的差距，使用梯度下降算法反馈给网络内所有参数，并相应更新网络参数。当下一次输入深度图像时，网络输出的预测分类标签图会更接近真实分类标签图。Specifically, the depth images in the training dataset are input into the initial neural network model, and the network model outputs the predicted classification label map of the depth images. Then, according to the gap between the predicted classification label map and the real label map of the depth image, the gradient descent algorithm is used to feed back all the parameters in the network, and the network parameters are updated accordingly. The next time the depth image is input, the predicted class label map output by the network will be closer to the true class label map.

训练到第一损失函数的值不再下降时，也就是说，利用第一损失函数，该模型的性能达到最优时，使用轮廓误差作为损失函数继续训练。其中，轮廓误差如下公式(2)所示：When the value of the first loss function no longer decreases after training, that is to say, when the performance of the model reaches the optimum by using the first loss function, the training is continued using the contour error as the loss function. Among them, the contour error is shown in the following formula (2):

其中，B是模糊操作，如可使用5×5的σ＝2.121的高斯核进行高斯模糊；S为轮廓提取，如使用索伯算子进行轮廓提取；M_labels为真实分类标签图，M_logits为网络输出，具体为像素点的类型预测值。Among them, B is the fuzzy operation, for example, Gaussian blur with σ=2.121 kernel of 5×5 can be used for Gaussian blurring; S is the contour extraction, for example, the Sober operator is used for contour extraction; M _labels is the real classification label map, and M _logits is The output of the network, specifically the predicted value of the pixel's type.

当轮廓误差的值处于稳定、不再下降时，可以停止训练得到分割模型。然后，利用测试集测试该分割模型，具体地，可将测试集中的深度图像输入至分割模型中进行识别，统计测试集中所有深度图像的交并比(Intersection-over-Union，简称IOU)得分，利用IOU得分来判断该分割模型是否达到要求。When the value of the contour error is stable and no longer decreases, the training can be stopped to obtain the segmentation model. Then, use the test set to test the segmentation model, specifically, the depth images in the test set can be input into the segmentation model for identification, and the Intersection-over-Union (IOU) scores of all the depth images in the test set are counted, Use the IOU score to judge whether the segmentation model meets the requirements.

其中，IOU是指交集与并集的比值，在本实施例中，是指模型预测结果同真实结果的交集与并集的比值，也就是模型预测结果与真实结果的交集，与模型预测结果与真实结果的并集的比值。Among them, IOU refers to the ratio of intersection and union. In this embodiment, it refers to the ratio of the intersection and union of the model prediction result and the real result, that is, the intersection of the model prediction result and the real result, and the model prediction result and The ratio of the union of the true results.

图5为本申请实施例提供的一种分割模型的训练过程示意图。图5中左侧为数据构建过程示意图，右侧为模型训练过程示意图。数据构建时，将利用深度相机获取的彩色图像与深度图像对齐，并使用一种自动化标注方法生成基于彩色图像的真实分类标签图，其同时也是对齐的相应深度图像的真实分类标签。所有深度图像及其真实分类标签图像构成了基于深度图像的人手分割训练数据集。FIG. 5 is a schematic diagram of a training process of a segmentation model provided by an embodiment of the present application. The left side of Figure 5 is a schematic diagram of the data construction process, and the right side is a schematic diagram of the model training process. During data construction, the color images acquired with the depth camera are aligned with the depth images, and an automated labeling method is used to generate a map of true classification labels based on the color images, which are also the true classification labels of the aligned corresponding depth images. All the depth images and their ground-truth classified label images constitute the training dataset for human hand segmentation based on depth images.

模型训练时，利用数据集中的深度图像输入至注意力分割网络中，得到网络模型预测的分类标签图，同真实的分类标签图进行对比并计算损失，一步步迭代更新网络参数During model training, the depth images in the dataset are input into the attention segmentation network, and the classification label map predicted by the network model is obtained, which is compared with the real classification label map and the loss is calculated, and the network parameters are iteratively updated step by step.

图6为本申请实施例提供的一种使用轮廓误差的效果示意图。图6中，左边一列的物体和手为真实标签，中间一列为未使用轮廓误差的网络输出，右边一列表示使用了轮廓误差后的网络输出。FIG. 6 is a schematic diagram of the effect of using a contour error according to an embodiment of the present application. In Figure 6, the objects and hands in the left column are real labels, the middle column is the network output without contour error, and the right column is the network output after contour error is used.

本申请实施例中，在训练分割模型时，通过先使用通用的损失函数，当通用损失函数值处于稳定时，即模型在该损失函数下达到最优，将轮廓误差作为损失函数训练，并且在分割模型中加入注意力机制模型，由此，大大提高了模型的分割精度。In the embodiment of the present application, when training the segmentation model, by first using a general loss function, when the value of the general loss function is stable, that is, the model is optimal under the loss function, the contour error is used as the loss function for training, and in The attention mechanism model is added to the segmentation model, thereby greatly improving the segmentation accuracy of the model.

进一步地，为了增强分割模型的泛化能力，在利用训练数据集训练分割模型之前，可对训练数据集进行数据增广操作，将数据增广操作得到的深度图像加入训练数据集。Further, in order to enhance the generalization ability of the segmentation model, before using the training data set to train the segmentation model, a data augmentation operation may be performed on the training data set, and the depth images obtained by the data augmentation operation are added to the training data set.

其中，数据增广操作包括自由旋转深度图像、加入随机噪声、随机翻转深度图像中的至少一种。The data augmentation operation includes at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.

为了实现上述实施例，本申请实施例还提出一种从深度图像中分割人手与交互物体的装置。图7为本申请实施例提供的一种从深度图像中分割人手与交互物体的装置的结构示意图。In order to realize the above embodiments, the embodiments of the present application further provide an apparatus for segmenting a human hand and an interactive object from a depth image. FIG. 7 is a schematic structural diagram of an apparatus for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.

如图7所示，该从深度图像中分割人手与交互物体的装置包括：构建模块610、训练模块620、识别模块630。As shown in FIG. 7 , the apparatus for segmenting a human hand and an interactive object from a depth image includes: a building module 610 , a training module 620 , and a recognition module 630 .

构建模块610，用于利用基于颜色图像的分割方法，构建基于深度图像的人手分割数据集；A building module 610 is used to construct a depth image-based hand segmentation data set by using a color image-based segmentation method;

训练模块620，用于利用所述基于深度图像的人手分割数据集，训练得到分割模型，所述分割模型由编码器、注意力传递模型和解码器构成；A training module 620, configured to use the depth image-based human hand segmentation data set to obtain a segmentation model by training, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;

识别模块630，用于利用所述分割模型对待处理的深度图像进行分割，获取与所述待处理的深度图像对应的分类标签图，所述分类标签图中每个像素点的值为所述每个像素点的类型值，所述类型值用于表征像素点在待处理的深度图像中所属的类型。The identification module 630 is configured to use the segmentation model to segment the depth image to be processed, and obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is the value of each pixel. Type value of a pixel point, where the type value is used to characterize the type to which the pixel point belongs in the depth image to be processed.

在本申请实施例的一种可能的实现方式中，上述构建模块610具体用于：In a possible implementation manner of the embodiment of the present application, the above-mentioned building module 610 is specifically used for:

采集人手与物体交互情景下，多对彩色图像和深度图像；Collect multiple pairs of color images and depth images in the interaction between human hands and objects;

对所有彩色图像进行基于HSV颜色空间的物体分割，获取每张彩色图像中每个像素点的类型值；Perform object segmentation based on HSV color space for all color images, and obtain the type value of each pixel in each color image;

针对每对彩色图像和深度图像，将彩色图像中每个像素点，映射到的深度图像中对应像素点，构建基于深度图像的人手分割训练数据集。For each pair of color image and depth image, map each pixel in the color image to the corresponding pixel in the depth image to construct a depth image-based human hand segmentation training dataset.

在本申请实施例的一种可能的实现方式中，对所述深度图像进行预处理，包括噪声和背景去除。In a possible implementation manner of the embodiment of the present application, the depth image is preprocessed, including noise and background removal.

在本申请实施例的一种可能的实现方式中，基于深度图像的人手分割数据集包括训练数据集和测试数据集，训练模块620，具体用于：In a possible implementation of the embodiment of the present application, the depth image-based human hand segmentation dataset includes a training dataset and a test dataset, and the training module 620 is specifically used for:

利用训练数据集，对初始神经网络模型进行训练，并计算第一损失函数，其中，第一损失函数采用softmax交叉熵损失函数；Using the training data set, the initial neural network model is trained, and the first loss function is calculated, wherein the first loss function adopts the softmax cross entropy loss function;

当第一损失函数的值不再下降时，使用轮廓误差作为损失函数继续训练。When the value of the first loss function no longer decreases, continue training using the contour error as the loss function.

在本申请实施例的一种可能的实现方式中，该装置还包括：In a possible implementation manner of the embodiment of the present application, the device further includes:

处理模块，用于对所述训练数据集进行数据增广操作，所述数据增广操作包括自由旋转深度图像、加入随机噪声、随机翻转深度图像中的至少一种。The processing module is configured to perform a data augmentation operation on the training data set, wherein the data augmentation operation includes at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.

需要说明的是，上述对从深度图像中分割人手与交互物体的方法实施例的解释说明，也适用于该实施例的从深度图像中分割人手与交互物体的装置，故在此不再赘述。It should be noted that the above description of the method embodiment for segmenting a human hand and an interactive object from a depth image is also applicable to the device for segmenting a human hand and an interactive object from a depth image in this embodiment, so it will not be repeated here.

Claims

1. a method for segmenting human hands and interactive objects from a depth image, comprising:

Using the segmentation method based on color image, construct a dataset of human hand segmentation based on depth image;

Using the depth image-based human hand segmentation data set, a segmentation model is obtained by training, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;

Use the segmentation model to segment the depth image to be processed, and obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is the type value of each pixel , and the type value is used to represent the type to which the pixel belongs in the depth image to be processed.

2. method as claimed in claim 1 is characterized in that, described utilizing the segmentation method based on color image, constructs the hand segmentation data set based on depth image, comprises:

Obtain multiple pairs of color images and depth images in the interaction scenario between human hands and objects;

Perform object segmentation based on HSV color space for all color images, and obtain the type value of each pixel in each color image;

For each pair of color images and depth images, each pixel in the color image is mapped to a corresponding pixel in the depth image, and a training dataset for human hand segmentation based on the depth image is constructed.

3. The method of claim 2, wherein after mapping each pixel in the color image to a corresponding pixel in the depth image, the method further comprises:

The depth image is preprocessed, including noise and background removal.

4. The method of claim 2, wherein the depth image-based human hand segmentation dataset comprises a training dataset and a test dataset, and the depth image-based human hand segmentation dataset is used to train segmentation models, including:

Using the training data set, the initial neural network model is trained, and the first loss function is calculated, wherein the first loss function adopts the softmax cross entropy loss function;

When the value of the first loss function no longer decreases, continue training using the contour error as the loss function.

5. The method according to claim 4, wherein, before the training of the segmentation model using the training data set, the method further comprises:

A data augmentation operation is performed on the training data set, and the data augmentation operation includes at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.

6. A device for segmenting a human hand and an interactive object from a depth image, comprising:

A building block for building a depth image-based hand segmentation dataset using a color image-based segmentation method;

a training module for using the depth image-based human hand segmentation data set to obtain a segmentation model through training, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;

The identification module is used to segment the depth image to be processed by using the segmentation model, and obtain a classification label map corresponding to the depth image to be processed, and the value of each pixel in the classification label map is the value of each Type value of the pixel point, where the type value is used to represent the type of the pixel point in the depth image to be processed.

7. The apparatus of claim 6, wherein the building block is specifically used for:

For each pair of color image and depth image, each pixel in the color image is mapped to the corresponding pixel in the depth image, and a training dataset for human hand segmentation based on the depth image is constructed.

8. The apparatus of claim 7, further comprising:

A preprocessing module for preprocessing the depth image, including noise and background removal.

9. The apparatus of claim 7, wherein the depth image-based hand segmentation data set comprises a training data set and a test data set, and the training module is specifically used for:

10. The apparatus of claim 9, further comprising:

The processing module is configured to perform a data augmentation operation on the training data set, wherein the data augmentation operation includes at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.