
HK40056713B - Face image similarity calculation method, apparatus, and device, and storage medium - Google Patents


Info

Publication number
HK40056713B
HK40056713B (application HK42021045313.0A)
Authority
HK
Hong Kong
Prior art keywords
image
feature
face
attention
features
Application number
HK42021045313.0A
Other languages
Chinese (zh)
Other versions
HK40056713A (en)
Inventor
陈欣
戴磊
刘玉宇
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of HK40056713A
Publication of HK40056713B

Description

Face image similarity calculation method, apparatus, and device, and storage medium

Technical Field

The present invention relates to the field of image processing, and in particular to a face image similarity calculation method, apparatus, and device, and storage medium.

Background

Multi-frame single-face tracking generally first locates the face and then compares it, usually across multiple frames, which requires establishing relationships between frames of video. Prior techniques, ordered from coarse to fine granularity, essentially work as follows: locate a fixed object region with a detection box, then extract features from it and compare; lock onto a smaller local region with keypoints (reducing the data to coordinates) and apply Kalman filtering to the keypoint positions, although the keypoints themselves are inaccurate and the Kalman filter compounds the error; or go as fine as segmentation (which carries a high annotation cost) to compare smaller, finer local regions.

All of these methods progressively narrow the comparison region so that the objects being compared are as free of clutter as possible, thereby improving accuracy. However, the locked-in region and the depth of comparison cannot simultaneously satisfy difficulty, speed, and accuracy requirements, and many problems remain: pose (facial rotation), occlusion (by the environment), lighting (reflections off the face), and resolution (lower resolution means more blur), all of which lead to poor generalization.

Summary of the Invention

The main objective of the present invention is to improve image recognition efficiency by extracting and fusing features from face images and determining the correlation between two images from the correlation between their corresponding features.

A first aspect of the present invention provides a face image similarity calculation method, comprising: acquiring two video frames containing a face, inputting the video frames into a preset face recognition model for recognition, and outputting the region of the face in each frame; extracting a corresponding first face image and second face image from the two video frames according to the regions; inputting the first face image and the second face image into the feature layer of a preset attention detection model to extract image features, obtaining a first image feature of the first face image and a second image feature of the second face image; performing a convolutional attention calculation on the first image feature and the second image feature respectively to obtain a first attention image feature and a second attention image feature; and calculating the feature similarity between the first attention image feature and the second attention image feature, and determining the image similarity between the first face image and the second face image based on the feature similarity.
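The claim leaves the similarity metric itself unspecified (a later embodiment mentions replacing floating-point operations with logical ones). As a minimal sketch of the final step only, cosine similarity between the two flattened attention features is one conventional choice; the function name and the use of NumPy here are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def cosine_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine similarity between two flattened attention feature maps."""
    a, b = feat_a.ravel().astype(float), feat_b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12  # guard against zero vectors
    return float(a @ b / denom)
```

A score near 1 would then indicate that the two face images likely show the same person.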

Optionally, in a first implementation of the first aspect of the present invention, before the acquiring of two video frames containing a face, inputting the frames into the preset face recognition model for recognition, and outputting the region of the face in each frame, the method further includes: obtaining multiple sample images containing faces from different application scenarios, and using the sample images as a training sample image set; inputting the training sample image set into the backbone network of a preset initial face recognition model and extracting face features from each sample image in the set to obtain a feature set, wherein the initial face recognition model includes a backbone network and multiple classification networks; calculating feature-vector loss function values of the feature set, obtaining multiple feature-vector loss function values; calculating the target loss function value of the initial face recognition model from the multiple feature-vector loss function values; and iteratively updating the backbone network according to the target loss function value until the target loss function value converges, obtaining the target face recognition model.

Optionally, in a second implementation of the first aspect of the present invention, the first face image and the second face image contain global image information, and the inputting of the first face image and the second face image into the feature layer of the preset attention detection model to extract image features, obtaining the first image feature of the first face image and the second image feature of the second face image, includes: performing edge extraction on the first face image and the second face image to obtain a first edge image and a second edge image, wherein the first edge image and the second edge image contain edge image information; fusing the global image information and the edge image information to obtain the regions of the first face image and the second face image that include the target object; performing feature extraction on the regions to obtain a first global feature and a first edge feature corresponding to the first face image, and a second global feature and a second edge feature corresponding to the second face image; and fusing the first global feature with the first edge feature to obtain the first image feature of the first face image, and fusing the second global feature with the second edge feature to obtain the second image feature of the second face image.
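The claim does not name a specific edge extractor. As a hedged illustration of the edge-extraction step, a Sobel operator is a standard choice for producing an edge image from a grayscale face crop; the function name and the Sobel choice are assumptions, not the patent's method:

```python
import numpy as np

def sobel_edge_image(img: np.ndarray) -> np.ndarray:
    """Edge image of a grayscale float array (H, W) via Sobel gradient magnitude.

    A hypothetical stand-in for the patent's unspecified edge extraction step.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")   # replicate border pixels
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()  # horizontal gradient
            gy[i, j] = (win * ky).sum()  # vertical gradient
    return np.hypot(gx, gy)             # gradient magnitude as the edge image
```

The resulting edge image would then carry the "edge image information" that the claim fuses with the global image information.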

Optionally, in a third implementation of the first aspect of the present invention, the fusing of the global image information and the edge image information to obtain the regions of the first face image and the second face image that include the target object includes: performing feature extraction, through a preset dual-path feature extraction network, on the global image information and on the edge image information contained in the first edge image and the second edge image; and summing the feature extraction results to obtain the region image features of the first face image and the second face image that include the target object.

Optionally, in a fourth implementation of the first aspect of the present invention, the performing of the convolutional attention calculation on the first image feature and the second image feature respectively to obtain the first attention image feature and the second attention image feature includes: performing a channel attention calculation on the first image feature and the second image feature respectively to obtain the channel attention maps of the image features; performing, based on the attention mechanism, a spatial attention calculation on the enhanced image features obtained by merging the image features with their channel attention maps, to obtain the spatial attention maps of the image features; and merging the spatial attention maps with the enhanced image features to obtain the first attention image feature of the first face image and the second attention image feature of the second face image respectively.
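The flow above follows the familiar pattern of channel attention followed by spatial attention over a convolutional feature map. A minimal NumPy sketch of the spatial-attention half is given below; the learned convolution over the two pooled maps is simplified to an unweighted sum purely for illustration, which is an assumption, not the claimed design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(enhanced: np.ndarray) -> np.ndarray:
    """enhanced: (C, H, W) feature already merged with its channel attention map.

    Returns the feature merged with an (H, W) spatial attention map.
    """
    avg_map = enhanced.mean(axis=0)      # channel-wise average pooling -> (H, W)
    max_map = enhanced.max(axis=0)       # channel-wise max pooling -> (H, W)
    # the learned convolution over the two pooled maps is simplified here
    # to an unweighted sum (illustrative assumption)
    attn = sigmoid(avg_map + max_map)    # spatial attention map with values in (0, 1)
    return enhanced * attn[None, :, :]   # broadcast merge over all channels
```

Merging the spatial map back into the enhanced feature yields the attention image feature that the later similarity step consumes.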

Optionally, in a fifth implementation of the first aspect of the present invention, the performing of the channel attention calculation on the first image feature and the second image feature output by the feature layer to obtain the channel attention maps of the image features includes: performing an average pooling operation and a max pooling operation on the first image feature and the second image feature respectively to obtain average-pooled features and max-pooled features; processing the average-pooled features with a pre-constructed multilayer perceptron to obtain average pooling parameters, and processing the max-pooled features with the same multilayer perceptron to obtain max pooling parameters; and inputting the sum of the average pooling parameters and the max pooling parameters into an activation module to obtain the first channel attention map of the first image feature and the second channel attention map of the second image feature.
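The channel-attention steps just described (average and max pooling, a shared multilayer perceptron, then an activation over the summed parameters) can be sketched as follows. The ReLU hidden layer, the sigmoid activation, and the weight shapes are conventional assumptions; the claim only specifies the pooling, the shared MLP, and the activation module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """feat: (C, H, W); w1: (C//r, C) and w2: (C, C//r) are the shared MLP weights.

    Returns a (C,) channel attention map.
    """
    avg_desc = feat.mean(axis=(1, 2))             # average-pooled channel descriptor
    max_desc = feat.max(axis=(1, 2))              # max-pooled channel descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP with a ReLU hidden layer
    # activation module applied to the sum of the two pooling parameters
    return sigmoid(mlp(avg_desc) + mlp(max_desc))
```

Running this once per image feature yields the first and second channel attention maps of the claim.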

A second aspect of the present invention provides a face image similarity calculation apparatus, comprising: a recognition module configured to acquire two video frames containing a face, input the video frames into a preset face recognition model for recognition, and output the region of the face in each frame; an extraction module configured to extract a corresponding first face image and second face image from the two video frames according to the regions; a first feature extraction module configured to input the first face image and the second face image into the feature layer of a preset attention detection model to extract image features, obtaining a first image feature of the first face image and a second image feature of the second face image; a first calculation module configured to perform a convolutional attention calculation on the first image feature and the second image feature respectively to obtain a first attention image feature and a second attention image feature; and a determination module configured to calculate the feature similarity between the first attention image feature and the second attention image feature and determine the image similarity between the first face image and the second face image based on the feature similarity.

Optionally, in a first implementation of the second aspect of the present invention, the face image similarity calculation apparatus includes: an acquisition module configured to obtain multiple sample images containing faces from different application scenarios and use the sample images as a training sample image set; a second feature extraction module configured to input the training sample image set into the backbone network of a preset initial face recognition model and extract face features from each sample image in the set to obtain a feature set, wherein the initial face recognition model includes a backbone network and multiple classification networks; a second calculation module configured to calculate feature-vector loss function values of the feature set, obtaining multiple feature-vector loss function values; a third calculation module configured to calculate the target loss function value of the initial face recognition model from the multiple feature-vector loss function values; and an update module configured to iteratively update the backbone network according to the target loss function value until the target loss function value converges, obtaining the target face recognition model.

Optionally, in a second implementation of the second aspect of the present invention, the first feature extraction module includes: an edge extraction unit configured to perform edge extraction on the first face image and the second face image to obtain a first edge image and a second edge image, wherein the first edge image and the second edge image contain edge image information; a fusion unit configured to fuse the global image information and the edge image information to obtain the regions of the first face image and the second face image that include the target object; a feature extraction unit configured to perform feature extraction on the regions to obtain a first global feature and a first edge feature corresponding to the first face image, and a second global feature and a second edge feature corresponding to the second face image; and a feature fusion unit configured to fuse the first global feature with the first edge feature to obtain the first image feature of the first face image, and to fuse the second global feature with the second edge feature to obtain the second image feature of the second face image.

Optionally, in a third implementation of the second aspect of the present invention, the fusion unit is specifically configured to: perform feature extraction, through a preset dual-path feature extraction network, on the global image information and on the edge image information contained in the first edge image and the second edge image; and sum the feature extraction results to obtain the region image features of the first face image and the second face image that include the target object.

Optionally, in a fourth implementation of the second aspect of the present invention, the first calculation module includes: a first calculation unit configured to perform a channel attention calculation on the first image feature and the second image feature respectively to obtain the channel attention maps of the image features; a second calculation unit configured to perform, based on the attention mechanism, a spatial attention calculation on the enhanced image features obtained by merging the image features with their channel attention maps, to obtain the spatial attention maps of the image features; and a feature merging unit configured to merge the spatial attention maps with the enhanced image features to obtain the first attention image feature of the first face image and the second attention image feature of the second face image respectively.

Optionally, in a fifth implementation of the second aspect of the present invention, the second calculation unit is specifically configured to: perform an average pooling operation and a max pooling operation on the first image feature and the second image feature respectively to obtain average-pooled features and max-pooled features; process the average-pooled features with a pre-constructed multilayer perceptron to obtain average pooling parameters, and process the max-pooled features with the same multilayer perceptron to obtain max pooling parameters; and input the sum of the average pooling parameters and the max pooling parameters into an activation module to obtain the first channel attention map of the first image feature and the second channel attention map of the second image feature.

A third aspect of the present invention provides a face image similarity calculation device, comprising a memory and at least one processor, the memory storing instructions, and the memory and the at least one processor being interconnected by a line;

The at least one processor invokes the instructions in the memory so that the face image similarity calculation device executes the face image similarity calculation method described above.

A fourth aspect of the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the face image similarity calculation method described above.

In the technical solution provided by the present invention, two video frames are input into a preset face recognition model for recognition, and the first face image and the second face image corresponding to the frames are output; the face images are input into the feature layer of a preset attention detection model for image feature extraction, yielding the image features of each face image; a convolutional attention calculation is performed on the image features to obtain the first attention image feature and the second attention image feature of the face images; and the feature similarity between the first attention image feature and the second attention image feature is calculated and determined as the image similarity between the first face image and the second face image. By extracting and fusing features from the face images and determining the correlation between the two images from the correlation between their corresponding features, the solution improves image recognition efficiency.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of a first embodiment of the face image similarity calculation method of the present invention;

Fig. 2 is a schematic diagram of a second embodiment of the face image similarity calculation method of the present invention;

Fig. 3 is a schematic diagram of a third embodiment of the face image similarity calculation method of the present invention;

Fig. 4 is a schematic diagram of a fourth embodiment of the face image similarity calculation method of the present invention;

Fig. 5 is a schematic diagram of a fifth embodiment of the face image similarity calculation method of the present invention;

Fig. 6 is a schematic diagram of a first embodiment of the face image similarity calculation apparatus of the present invention;

Fig. 7 is a schematic diagram of a second embodiment of the face image similarity calculation apparatus of the present invention;

Fig. 8 is a schematic diagram of an embodiment of the face image similarity calculation device of the present invention.

Detailed Description

Embodiments of the present invention provide a face image similarity calculation method, apparatus, device, and storage medium. In the technical solution of the present invention, two video frames are first input into a preset face recognition model for recognition, and the first face image and the second face image corresponding to the frames are output; the face images are input into the feature layer of a preset attention detection model for image feature extraction, yielding the image features of each face image; a convolutional attention calculation is performed on the image features to obtain the first attention image feature and the second attention image feature of the face images; and the feature similarity between the first attention image feature and the second attention image feature is calculated and determined as the image similarity between the first face image and the second face image. By extracting and fusing features from the face images and determining the correlation between the two images from the correlation between their corresponding features, the solution improves image recognition efficiency.

The terms "first", "second", "third", "fourth", and the like (if present) in the specification, claims, and accompanying drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so labeled are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to the process, method, product, or device.

For ease of understanding, the specific procedure of the embodiments of the present invention is described below. Referring to Fig. 1, a first embodiment of the face image similarity calculation method in the embodiments of the present invention includes:

101. Acquire two video frames containing a face, input the frames into the preset face recognition model for recognition, and output the region of the face in each frame.

In this embodiment, after the face recognition model has been trained, two video frames containing a face are obtained from a preset database, where the frames contain the face information to be recognized. The frames are then input into the face recognition model.

The face recognition model can mark out the faces in a video frame by boxing the nose, eyes, or other facial features, yielding the region of each face in the frame.

102. Extract the corresponding first face image and second face image from the two video frames according to the regions.

In this embodiment, the region of each face is cropped out of its video frame according to the detected region, thereby extracting the face image corresponding to each frame, namely the first face image and the second face image.
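The cropping step amounts to slicing the detected box out of the frame array. A sketch, assuming (as an illustration) that the detector returns pixel coordinates in the order (x1, y1, x2, y2):

```python
import numpy as np

def crop_face(frame: np.ndarray, box: tuple) -> np.ndarray:
    """frame: (H, W, 3) video frame; box: (x1, y1, x2, y2) pixel coordinates
    from the face recognition model (a hypothetical output format)."""
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]  # rows are y, columns are x
```

Applying this to each of the two frames yields the first and second face images.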

103. Input the first face image and the second face image into the feature layer of the preset attention detection model to extract image features, obtaining the first image feature of the first face image and the second image feature of the second face image.

In this embodiment, image feature extraction addresses the fact that a computer does not understand images, only numbers. For the computer to "understand" an image and thereby possess "vision" in a real sense, useful data or information must be extracted from the image to obtain a "non-image" representation or description of it, such as numerical values, vectors, and symbols. This process is feature extraction, and the extracted "non-image" representations or descriptions are the features.

Here, a feature is an (essential) characteristic or property that distinguishes one class of objects from other classes, or a set of such characteristics and properties; features are data that can be extracted through measurement or processing. Every image has features of its own that distinguish it from other classes of images. Some are natural features that can be perceived directly, such as brightness, edges, texture, and color; others can only be obtained through transformation or processing, such as moments, histograms, and principal components. For example, several characteristics of a class of objects are often combined into a feature vector that represents the class: with a single numerical feature the feature vector is one-dimensional, and with a combination of n characteristics it is an n-dimensional feature vector. Such feature vectors often serve as the input to a recognition system. In effect, an n-dimensional feature is a point in n-dimensional space, and the task of recognition and classification is to find a partition of that space.

104. Perform a convolutional attention calculation on the first image feature and the second image feature respectively to obtain the first attention image feature and the second attention image feature.

In this embodiment, the attention detection model includes multiple feature layers connected in sequence. The input of the first feature layer is the input feature, and the input of every subsequent feature layer is the image feature output by the previous feature layer. In an attention image feature, the values of the target elements are greater than those of the corresponding elements in the original image feature, where a target element is an element computed from the pixels of the target object in the image under detection.

105. Calculate the feature similarity between the first attention image feature and the second attention image feature, and determine the image similarity between the first face image and the second face image based on the feature similarity.

In this embodiment, once the first attention image feature has been obtained, AND/OR logical operations can be used in place of floating-point operations to compute the feature similarity between the first attention image feature and the second attention image feature. This feature similarity can then be taken as the image similarity between the first face image and the second face image.
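One way to realize the logical-operation similarity described above is to binarize each attention feature and count agreeing positions (an XNOR match rate). The sign-based thresholding below is an illustrative assumption; the embodiment does not specify the binarization scheme:

```python
import numpy as np

def binary_similarity(feat_a: np.ndarray, feat_b: np.ndarray,
                      threshold: float = 0.0) -> float:
    """Binarize two features at a threshold, then score with bitwise logic.

    The XNOR match rate replaces a floating-point distance; the sign-based
    binarization is an illustrative assumption.
    """
    a = feat_a.ravel() > threshold
    b = feat_b.ravel() > threshold
    agree = ~np.logical_xor(a, b)  # XNOR: positions where both bits match
    return float(agree.mean())     # fraction of matching bits in [0, 1]
```

Because the comparison reduces to bitwise operations, it can be substantially cheaper than a floating-point dot product on the same features.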

具体的,第二人脸图像为待识别图像,第一人脸图像为具有识别标签的目标图像,在将特征相似度确定为第一人脸图像和第二人脸图像的图像相似度之后,在图像相似度大于预设阈值时,将识别标签作为第二人脸图像的识别结果。如此,便可提高图像识别的准确率和识别速度。其中,识别标签可具体为人员身份,或分类信息或其他识别标签。Specifically, the second face image is the image to be identified, and the first face image is the target image with an identification label. After determining the feature similarity as the image similarity between the first and second face images, if the image similarity is greater than a preset threshold, the identification label is used as the identification result of the second face image. This improves the accuracy and speed of image recognition. The identification label can specifically be a person's identity, classification information, or other identification tags.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

请参阅图2,本发明实施例中人脸图像相似度的计算方法的第二个实施例包括:Please refer to Figure 2. A second embodiment of the method for calculating facial image similarity in this invention includes:

201、获取多张不同应用场景下的包含人脸的样本图像,并将样本图像作为训练样本图像集;201. Obtain multiple sample images containing human faces from different application scenarios, and use these sample images as a training sample image set;

本实施例中,一个训练数据集对应一个应用场景,例如:人证识别场景和自然场景。训练数据集可为不同维度下的人脸数据、开源数据和私有数据,例如:自然场景的人脸数据、亚洲人的人脸数据、考勤数据、人证数据和竞赛数据。服务器可从预置的数据库中提取多张不同应用场景下的包含人脸的样本图像,对包含人脸的样本图像进行预处理,得到预处理后的训练数据图像集。In this embodiment, one training dataset corresponds to one application scenario, such as an ID-verification scenario and a natural-scene scenario. The training dataset can be facial data of different dimensions, open-source data, and private data, such as facial data from natural scenes, facial data of Asians, attendance data, ID-verification data, and competition data. The server can extract multiple sample images containing faces under different application scenarios from a preset database, preprocess these sample images, and obtain a preprocessed training data image set.

202、将训练样本图像集输入预置的初始人脸识别模型的主干网络,对训练样本图像集中的样本图像分别进行人脸特征提取,得到特征集,其中,初始人脸识别模型包括主干网络和多个分类网络;202. Input the training sample image set into the backbone network of the preset initial face recognition model, and extract face features from the sample images in the training sample image set to obtain a feature set. The initial face recognition model includes a backbone network and multiple classification networks.

本实施例中,预置的初始人脸识别模型包括主干网络和多个分类网络,主干网络的输出为多个分类网络的输入,通过多个分类网络对主干网络处理后的数据进行分类,从而实现对训练数据集的人脸识别训练。主干网络可为单个卷积神经网络也可为多个卷积神经网络的综合框架,例如:主干网络可为深度残差学习框架ResNet或目标检测网络框架ET-YOLOv3,也可为深度残差学习框架ResNet结合目标检测网络框架ET-YOLOv3的综合框架。In this embodiment, the pre-set initial face recognition model includes a backbone network and multiple classification networks. The output of the backbone network serves as the input to the multiple classification networks. The data processed by the backbone network is classified by the multiple classification networks, thereby enabling face recognition training on the training dataset. The backbone network can be a single convolutional neural network or a combined framework of multiple convolutional neural networks. For example, the backbone network can be the deep residual learning framework ResNet or the object detection network framework ET-YOLOv3, or a combined framework of the deep residual learning framework ResNet and the object detection network framework ET-YOLOv3.

服务器可通过初始人脸识别模型的主干网络,对每个训练数据集进行人脸标框识别、标框区域划分、人脸关键点检测和人脸特征向量提取,得到每个训练数据集对应的特征集(即多个特征集)。主干网络中的卷积网络层采用小卷积核,通过小卷积核保留更多的特征,减少计算量,提高人脸特征提取的效率。The server can use the backbone network of the initial face recognition model to perform face bounding box recognition, bounding box region segmentation, face key point detection, and face feature vector extraction on each training dataset, obtaining a feature set (i.e., multiple feature sets) corresponding to each training dataset. The convolutional network layers in the backbone network use small convolutional kernels, which retain more features, reduce computation, and improve the efficiency of face feature extraction.

203、计算特征集的特征向量损失函数值,得到多个特征向量损失函数值;203. Calculate the feature vector loss function value of the feature set to obtain multiple feature vector loss function values;

本实施例中,计算第一中心向量和第二中心向量,计算每个第一中心向量和第二中心向量之间的距离值,将该距离值作为每个特征集对应的特征向量损失函数值,从而获得多个特征向量损失函数值,其中,第一中心向量为每个特征集对应的中心向量,也可为每个特征集中每个训练数据对应的中心向量,第二中心向量可为所有特征集对应的中心向量,也可为每个特征集中所有训练数据对应的中心向量。In this embodiment, a first center vector and a second center vector are calculated, and the distance between each first center vector and the second center vector is calculated. This distance value is used as the feature vector loss function value corresponding to each feature set, thereby obtaining multiple feature vector loss function values. The first center vector is the center vector corresponding to each feature set, or it can be the center vector corresponding to each training data item in each feature set. The second center vector can be the center vector corresponding to all feature sets, or the center vector corresponding to all training data in each feature set.

服务器可通过获取每个特征集对应的训练数据个数,以及计算所有训练数据对应的第一中心向量的和值,根据训练数据个数计算和值的均值,该均值为每个特征集对应的第二中心向量,服务器也可通过预置的中心向量公式计算第二中心向量。The server can obtain the number of training data points corresponding to each feature set, and calculate the sum of the first center vectors corresponding to all training data points. The mean of the sums is calculated based on the number of training data points. This mean is the second center vector corresponding to each feature set. The server can also calculate the second center vector using a preset center vector formula.
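A minimal sketch of the center-vector computation above, assuming the mean is used as the center vector and Euclidean distance as the per-set loss (both choices are left open by the text):

```python
import numpy as np

def feature_vector_loss(feature_set: np.ndarray, second_center: np.ndarray) -> float:
    """Distance between a feature set's first center vector (its mean)
    and the second center vector (the mean over all training data)."""
    first_center = feature_set.mean(axis=0)
    return float(np.linalg.norm(first_center - second_center))

sets = [np.array([[1.0, 0.0], [3.0, 0.0]]),   # first center (2.0, 0.0)
        np.array([[0.0, 2.0], [0.0, 4.0]])]   # first center (0.0, 3.0)
# second center vector: mean over all training samples -> (1.0, 1.5)
second_center = np.concatenate(sets).mean(axis=0)
losses = [feature_vector_loss(s, second_center) for s in sets]
```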

服务器通过预置的交叉熵损失函数计算每个分类数据集的分类损失函数值,从而得到多个分类损失函数值,该交叉熵损失函数可为多分类交叉熵损失函数,通过多分类交叉熵损失函数,求导更简单,能够使得收敛较快,对应的权重矩阵的更新更快。The server calculates the classification loss function value for each classification dataset using a pre-defined cross-entropy loss function, thus obtaining multiple classification loss function values. This cross-entropy loss function can be a multi-class cross-entropy loss function. Using a multi-class cross-entropy loss function simplifies the differentiation, resulting in faster convergence and quicker updates to the corresponding weight matrix.
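For a single sample, the multi-class cross-entropy loss mentioned above reduces to the negative log of the probability assigned to the true class; a minimal sketch:

```python
import math

def cross_entropy(probs: list, label: int) -> float:
    """Multi-class cross-entropy for one sample: -log of the
    probability the classifier assigns to the true class."""
    return -math.log(probs[label])

probs = [0.7, 0.2, 0.1]          # softmax output over 3 classes
print(cross_entropy(probs, 0))   # low loss: true class is likely
print(cross_entropy(probs, 2))   # high loss: true class is unlikely
```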

204、根据多个特征向量损失函数值,计算初始人脸识别模型的目标损失函数值;204. Based on the loss function values of multiple feature vectors, calculate the target loss function value of the initial face recognition model;

本实施例中,获得多个特征向量损失函数值和多个分类损失函数值后,获取多个训练数据集的数据集个数,根据数据集个数,计算多个特征向量损失函数值的平均特征向量损失函数值,以及多个分类损失函数值的平均分类损失函数值,将平均特征向量损失函数值和平均分类损失函数值的和值,作为人脸识别模型的目标损失函数值,或者将平均特征向量损失函数值和平均分类损失函数值的加权和值,作为人脸识别模型的目标损失函数值。每个分类网络计算得到分类损失函数值时,可根据分类损失函数值对对应的分类网络进行反向更新。In this embodiment, after obtaining multiple feature vector loss function values and multiple classification loss function values, the number of training datasets is determined. Based on the number of datasets, the average feature vector loss function value and the average classification loss function value are calculated. The sum of the average feature vector loss function value and the average classification loss function value are used as the target loss function value of the face recognition model, or the weighted sum of the average feature vector loss function value and the average classification loss function value is used as the target loss function value of the face recognition model. When each classification network calculates its classification loss function value, the corresponding classification network can be updated in reverse based on the classification loss function value.
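The target loss described above (average each loss family over the number of datasets, then combine by plain or weighted summation) can be sketched as:

```python
def target_loss(feature_losses, class_losses, w_feat=1.0, w_cls=1.0):
    """Average the feature-vector and classification loss values over
    the number of training datasets, then return their weighted sum
    (weights of 1.0 give the plain-sum variant)."""
    n = len(feature_losses)  # one loss value per training dataset
    avg_feat = sum(feature_losses) / n
    avg_cls = sum(class_losses) / n
    return w_feat * avg_feat + w_cls * avg_cls

print(target_loss([0.2, 0.4], [1.0, 2.0]))            # 0.3 + 1.5 = 1.8
print(target_loss([0.2, 0.4], [1.0, 2.0], 0.5, 0.5))  # weighted: 0.9
```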

205、根据目标损失函数值对主干网络进行迭代更新,直至目标损失函数值收敛,得到目标人脸识别模型;205. Iteratively update the backbone network based on the target loss function value until the target loss function value converges to obtain the target face recognition model;

本实施例中,根据目标损失函数值和预置的迭代次数,对主干网络的网络结构和/或权重值进行迭代更新,直至目标损失函数值收敛(即人脸识别模型的训练精度符合预设条件),得到更新后的人脸识别模型。其中,可通过对主干网络进行网络层的增加或删减来更新主干网络的网络结构,也可通过增设其他的网络框架来更新主干网络的网络结构,也可通过修改主干网络的卷积核大小和步长等来更新主干网络的网络结构。在对主干网络进行迭代更新时,服务器也可结合优化算法对人脸识别模型进行优化。In this embodiment, the network structure and/or weights of the backbone network are iteratively updated based on the target loss function value and a preset number of iterations until the target loss function value converges (i.e., the training accuracy of the face recognition model meets the preset conditions), resulting in an updated face recognition model. The backbone network structure can be updated by adding or removing network layers, by adding other network frameworks, or by modifying the convolutional kernel size and stride. During the iterative update of the backbone network, the server can also optimize the face recognition model using optimization algorithms.
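The "iterate until the target loss converges" criterion can be illustrated with a toy gradient-descent loop on a stand-in loss; the quadratic loss, learning rate, and tolerance below are purely illustrative, not part of the disclosed model:

```python
def train_until_converged(w: float, lr: float = 0.1,
                          tol: float = 1e-9, max_iters: int = 10000):
    """Toy stand-in for 'update the backbone until the target loss
    converges': gradient descent on loss(w) = (w - 3)^2, stopping
    when the loss change falls below a tolerance."""
    prev = float("inf")
    loss = prev
    for _ in range(max_iters):
        loss = (w - 3.0) ** 2
        if abs(prev - loss) < tol:   # loss has converged
            break
        prev = loss
        w -= lr * 2.0 * (w - 3.0)    # gradient step
    return w, loss

w, loss = train_until_converged(0.0)
```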

206、获取两帧包含人脸的视频图像,并将视频图像输入预置人脸识别模型进行识别,输出视频图像中人脸的区域范围;206. Acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

207、根据区域范围,从两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;207. Based on the region range, extract the corresponding first face image and second face image from two video frames;

208、将第一人脸图像和第二人脸图像输入预置注意力检测模型的特征层对第一人脸图像和第二人脸图像进行图像特征提取,分别得到第一人脸图像的第一图像特征和第二人脸图像的第二图像特征;208. Input the first face image and the second face image into the feature layer of the preset attention detection model to extract image features from the first face image and the second face image, respectively, to obtain the first image feature of the first face image and the second image feature of the second face image;

209、分别对第一图像特征和第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;209. Perform convolutional attention calculations on the first image features and the second image features respectively to obtain the first attention image features and the second attention image features;

210、计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于特征相似度确定第一人脸图像与第二人脸图像之间的图像相似度。210. Calculate the feature similarity between the first attention image features and the second attention image features, and determine the image similarity between the first face image and the second face image based on the feature similarity.

本实施例中步骤206-210与第一实施例中的步骤101-105类似,此处不再赘述。Steps 206-210 in this embodiment are similar to steps 101-105 in the first embodiment, and will not be described again here.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

请参阅图3,本发明实施例中人脸图像相似度的计算方法的第三个实施例包括:Please refer to Figure 3. A third embodiment of the method for calculating facial image similarity in this invention includes:

301、获取两帧包含人脸的视频图像,并将视频图像输入预置人脸识别模型进行识别,输出视频图像中人脸的区域范围;301. Acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

302、根据区域范围,从两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;302. Based on the region range, extract the corresponding first face image and second face image from the two video frames;

303、对第一人脸图像和第二人脸图像进行边缘提取,得到第一边缘图像和第二边缘图像,其中,第一边缘图像和第二边缘图像包含边缘图像信息;303. Perform edge extraction on the first face image and the second face image to obtain a first edge image and a second edge image, wherein the first edge image and the second edge image contain edge image information;

本实施例中,第一人脸图像和第二人脸图像为待特征提取的图像,第一人脸图像和第二人脸图像可以为RGB图像(即,由红绿蓝三原色组成的图像),第一人脸图像和第二人脸图像的格式可以为jpg、jpeg、TIFF、PNG、BMP或PSD等,本公开实施例不作限定。第一人脸图像和第二人脸图像中包括目标对象,目标对象的数量可以为一个,也可以为多个(即,至少两个)。另外,边缘图像可以理解为用于突出表示第一人脸图像和第二人脸图像中目标对象与背景之间边界以及目标对象轮廓的图像。第一人脸图像和第二人脸图像与边缘图像中所包括的目标对象为相同目标对象,而目标对象在第一人脸图像和第二人脸图像和边缘图像中的表现形式不同。In this embodiment, the first face image and the second face image are the images awaiting feature extraction. They can be RGB images (i.e., images composed of the three primary colors red, green, and blue), and their formats can be jpg, jpeg, TIFF, PNG, BMP, or PSD, etc., without limitation in this embodiment. The first and second face images include a target object, and the number of target objects can be one, or multiple (i.e., at least two). Additionally, an edge image can be understood as an image used to highlight the boundary between the target object and the background in the first and second face images, as well as the outline of the target object. The target object included in the first face image, the second face image, and the edge images is the same target object, but its representation differs between the face images and the edge images.

304、通过预置双路特征提取网络对第一人脸图像和第二人脸图像所包含全局图像信息进行特征提取,并对第一边缘图像和第二边缘图像所包含边缘图像信息进行特征提取;304. Use a preset dual-path feature extraction network to extract features from the global image information contained in the first and second face images, and to extract features from the edge image information contained in the first and second edge images;

本实施例中,将第一人脸图像和第二人脸图像中一个小区域的像素进行加权平均后,可以成为边缘图像中的对应像素。第一人脸图像和第二人脸图像的维度可以为H×W×3;其中,H表示的是第一人脸图像和第二人脸图像的高度(如,600),W表示的是第一人脸图像和第二人脸图像的宽度(如,600),3表示的是第一人脸图像和第二人脸图像的三原色通道数。预设卷积核的尺寸可以为3*3,也可以为5*5,也可以为其他尺寸,本公开实施例不作限定。举例来说,若预设卷积核的尺寸为3*3,预设卷积核内每个单元的权重可以如下:In this embodiment, the corresponding pixels in the edge image are obtained by weighted averaging of pixels in a small region of the first and second face images. The dimensions of the first and second face images can be H×W×3; where H represents the height of the first and second face images (e.g., 600), W represents the width of the first and second face images (e.g., 600), and 3 represents the number of the three primary color channels of the first and second face images. The size of the preset convolution kernel can be 3*3, 5*5, or other sizes, which are not limited in this embodiment. For example, if the size of the preset convolution kernel is 3*3, the weight of each unit in the preset convolution kernel can be as follows:

-1 -2 -1
-2 12 -2
-1 -2 -1

具体地,根据预设卷积核对第一人脸图像和第二人脸图像进行梯度计算,以提取第一人脸图像和第二人脸图像对应的边缘图像的方式可以为:Specifically, the method of extracting the edge images corresponding to the first and second face images by performing gradient calculations on the first and second face images according to the preset convolution kernel can be as follows:

将预设卷积核Sx与第一人脸图像和第二人脸图像进行卷积,得到水平梯度Gx;将预设卷积核Sx进行转置,得到转置后的卷积核Sy,并将Sy与第一人脸图像和第二人脸图像进行卷积,得到竖直梯度Gy;通过对Gx和Gy的组合,得到第一人脸图像和第二人脸图像对应的梯度向量(Gx, Gy)、梯度方向θ = arctan(Gy/Gx)以及梯度幅度G = √(Gx² + Gy²);根据梯度向量确定出第一人脸图像和第二人脸图像对应的边缘图像,边缘图像中包括了用于表示灰度变化剧烈程度的图像频率。此外,需要说明的是,梯度幅度变化较快的区域可以为边缘区域,梯度方向θ用于表示梯度变化方向,结合梯度方向θ和梯度幅度能够确定出第一人脸图像和第二人脸图像中目标对象的边缘。The preset convolution kernel Sx is convolved with the first face image and the second face image to obtain the horizontal gradient Gx. The kernel Sx is transposed to obtain the transposed kernel Sy, which is convolved with the images to obtain the vertical gradient Gy. Combining Gx and Gy yields the gradient vector (Gx, Gy) corresponding to the first and second face images, with gradient direction θ = arctan(Gy/Gx) and gradient magnitude G = √(Gx² + Gy²). The edge images corresponding to the first and second face images are determined from the gradient vector; these edge images include image frequencies representing the degree of grayscale change. Furthermore, it should be noted that regions where the gradient magnitude changes rapidly can be considered edge regions. The gradient direction θ represents the direction of gradient change; combining θ with the gradient magnitude allows the edges of the target object in the first and second face images to be determined.
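The convolve / transpose / combine procedure above can be sketched as follows. A standard Sobel-style horizontal kernel is assumed for Sx (the disclosure's example kernel is symmetric, but any gradient-style kernel follows the same pattern), and the convolution is written in the cross-correlation form common in CNN practice:

```python
import numpy as np

SX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)  # assumed Sobel-style Sx
SY = SX.T  # the transposed kernel from the text

def conv2d_valid(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Plain 'valid' 2-D cross-correlation (no padding, no flip)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gradient_maps(img: np.ndarray):
    gx = conv2d_valid(img, SX)            # horizontal gradient
    gy = conv2d_valid(img, SY)            # vertical gradient
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    direction = np.arctan2(gy, gx)        # gradient direction theta
    return magnitude, direction

# vertical step edge: the gradient should be purely horizontal
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
mag, theta = gradient_maps(img)
```

Pixels with large magnitude mark the edge region, matching the description that rapidly changing gradient magnitude indicates edges.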

305、将特征提取结果进行加和,得到第一人脸图像和第二人脸图像中包括目标对象的区域;305. Sum the feature extraction results to obtain the regions containing the target object in the first face image and the second face image;

本实施例中,全局图像信息用于从整体上表征第一人脸图像和第二人脸图像。边缘图像信息用于表征第一人脸图像和第二人脸图像中目标对象的边缘和细节。融合结果可以表示为矩阵,对应目标对象的边缘和细节强化后的第一人脸图像和第二人脸图像。In this embodiment, global image information is used to represent the first face image and the second face image as a whole, while edge image information is used to represent the edges and details of the target object in the first and second face images. The fusion result can be represented as a matrix, corresponding to the first and second face images after the edges and details of the target object have been enhanced.

可以将全局图像信息和边缘图像信息分别对应的参考图像特征进行加和,并对加和结果进行第二预设频次的卷积,实现全局图像信息和边缘图像信息的特征融合,得到第一人脸图像和第二人脸图像中包括目标对象的区域图像特征。The reference image features corresponding to the global image information and the edge image information can be summed, and the summed result can be convolved at a second preset frequency to achieve feature fusion of the global image information and the edge image information, so as to obtain the region image features including the target object in the first face image and the second face image.

306、对区域进行特征提取,得到第一人脸图像对应的第一全局特征、第一边缘特征,以及第二人脸图像对应的第二全局特征、第二边缘特征;306. Extract features from the regions to obtain a first global feature and a first edge feature corresponding to the first face image, and a second global feature and a second edge feature corresponding to the second face image;

本实施例中,全局特征用于在整体上表征目标对象,边缘特征用于突出在边缘和细节上表征目标对象。In this embodiment, global features are used to characterize the target object as a whole, while edge features are used to highlight the target object at its edges and in detail.

307、对第一全局特征和第一边缘特征进行特征融合,得到第一人脸图像的第一图像特征,以及对第二全局特征和第二边缘特征进行特征融合得到第二人脸图像的第二图像特征;307. Perform feature fusion on the first global feature and the first edge feature to obtain the first image feature of the first face image, and perform feature fusion on the second global feature and the second edge feature to obtain the second image feature of the second face image;

本实施例中,图像特征的输出形式可以为矩阵。对上述的全局特征和边缘特征进行特征融合,包括:将全局特征和边缘特征进行连接,得到第一参考特征,第一参考特征的维度为全局特征和边缘特征的维度之和;举例来说,若全局特征的维度为2048维且边缘特征的维度为2048维,那么,第一参考特征的维度为4096维;对第一参考特征进行降维特征转换,得到第二参考特征,作为目标对象对应的图像特征。In this embodiment, the output form of the image features can be a matrix. Feature fusion of the aforementioned global and edge features includes: concatenating the global feature and the edge feature to obtain a first reference feature, whose dimension is the sum of the dimensions of the global and edge features; for example, if the global feature is 2048-dimensional and the edge feature is 2048-dimensional, then the first reference feature is 4096-dimensional; and performing dimensionality-reduction feature transformation on the first reference feature to obtain a second reference feature, which serves as the image feature corresponding to the target object.
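A sketch of the concatenate-then-reduce fusion described above. The random linear projection stands in for the learned dimensionality-reduction transform, and the 512-dimensional output is an assumed choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(global_feat: np.ndarray, edge_feat: np.ndarray,
                  out_dim: int) -> np.ndarray:
    """Concatenate a global feature and an edge feature into a first
    reference feature, then reduce its dimension with a linear
    projection (a stand-in for the learned reduction layer)."""
    first_ref = np.concatenate([global_feat, edge_feat])  # dim = sum of both
    w = rng.standard_normal((out_dim, first_ref.size)) * 0.01
    return w @ first_ref  # second reference feature = image feature

g = np.ones(2048)   # 2048-dim global feature, as in the example above
e = np.ones(2048)   # 2048-dim edge feature -> first_ref is 4096-dim
fused = fuse_features(g, e, out_dim=512)  # assumed output dimension
```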

308、分别对第一图像特征和第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;308. Perform convolutional attention calculations on the first image features and the second image features respectively to obtain the first attention image features and the second attention image features;

309、计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于特征相似度确定第一人脸图像与第二人脸图像之间的图像相似度。309. Calculate the feature similarity between the first attention image feature and the second attention image feature, and determine the image similarity between the first face image and the second face image based on the feature similarity.

本实施例中步骤301-302、308-309与第一实施例中的步骤101-102、104-105类似,此处不再赘述。Steps 301-302 and 308-309 in this embodiment are similar to steps 101-102 and 104-105 in the first embodiment, and will not be described again here.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

请参阅图4,本发明实施例中人脸图像相似度的计算方法的第四个实施例包括:Please refer to Figure 4. The fourth embodiment of the method for calculating facial image similarity in this invention includes:

401、获取两帧包含人脸的视频图像,并将视频图像输入预置人脸识别模型进行识别,输出视频图像中人脸的区域范围;401. Acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

402、根据区域范围,从两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;402. Based on the region range, extract the corresponding first face image and second face image from the two video frames;

403、将第一人脸图像和第二人脸图像输入预置注意力检测模型的特征层对第一人脸图像和第二人脸图像进行图像特征提取,分别得到第一人脸图像的第一图像特征和第二人脸图像的第二图像特征;403. Input the first face image and the second face image into the feature layer of the preset attention detection model to extract image features from the first face image and the second face image, respectively, to obtain the first image feature of the first face image and the second image feature of the second face image;

404、分别对第一图像特征和第二图像特征进行通道注意力的计算,得到图像特征的通道注意力图;404. Calculate channel attention for the first image feature and the second image feature respectively to obtain the channel attention map of the image features;

本实施例中,分别对图像特征进行平均池化运算和最大池化运算,得到平均池化特征和最大池化特征;利用预先构建的多层感知机处理平均池化特征,得到平均池化参数,并利用多层感知机处理最大池化特征,得到最大池化参数;In this embodiment, average pooling and max pooling operations are performed on the image features to obtain average pooling features and max pooling features respectively; the average pooling features are processed using a pre-built multilayer perceptron to obtain average pooling parameters, and the max pooling features are processed using a multilayer perceptron to obtain max pooling parameters.

将平均池化参数与最大池化参数的和输入激活模块,得到图像特征的通道注意力图。The sum of the average pooling parameters and the max pooling parameters is input into the activation module to obtain the channel attention map of the image features.

其中,对图像特征进行平均池化运算,是指,利用一个具有预先设定的尺寸的池化窗口(如可以是2×2的池化窗口)在图像特征包含的每一个特征矩阵上移动,每次移动后池化窗口覆盖的区域均紧挨着移动前池化窗口覆盖的区域(即移动前后的两个区域的某一条边重合,但是两个区域互不重叠),每当池化窗口覆盖一个新的区域,计算池化窗口当前覆盖的元素(以上述2×2的池化窗口,一次可以覆盖4个元素,即两行两列)的算术平均值,将得到的计算结果作为最终的平均池化特征中的一个元素,当图像特征中每一个元素均进行过上述平均值计算后,对这个图像特征的平均池化运算就完成,计算得到的所有平均值按照计算时池化窗口的位置组合,就得到这个图像特征对应的平均池化特征。The average pooling operation on image features involves moving a pooling window of a pre-defined size (e.g., a 2×2 pooling window) across each feature matrix contained in the image feature. After each move, the area covered by the pooling window is adjacent to the area covered by the pooling window before the move (i.e., one edge of the two areas before and after the move coincides, but the two areas do not overlap). Whenever the pooling window covers a new area, the arithmetic mean of the elements currently covered by the pooling window (in the case of the 2×2 pooling window mentioned above, it can cover 4 elements at a time, i.e., two rows and two columns) is calculated. The calculated result is used as an element in the final average pooling feature. After each element in the image feature has undergone the above average calculation, the average pooling operation on this image feature is completed. All the calculated average values are combined according to the position of the pooling window during the calculation to obtain the average pooling feature corresponding to this image feature.

对图像特征进行最大池化运算的过程,和上述平均池化运算的过程基本一致,区别在于,每当池化窗口覆盖一个新区域时,从该区域内的所有元素中筛选出最大的元素,作为本次的计算结果(区别于平均池化运算中将平均值作为计算结果),同样的,当图像特征中每一个元素均经过上述筛选后,对图像特征的最大池化运算过程完成,筛选得到的所有元素按照筛选时池化窗口的位置组合,就得到这个图像特征对应的最大池化特征。The process of performing max pooling on image features is basically the same as the average pooling process described above. The difference is that whenever the pooling window covers a new region, the largest element is selected from all elements in that region and used as the result of this calculation (unlike the average pooling operation which uses the average value as the result). Similarly, after each element in the image feature has been selected in the above way, the max pooling process of the image feature is completed. All the selected elements are combined according to the position of the pooling window during the selection to obtain the max pooling feature corresponding to this image feature.
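The channel-attention computation of step 404 (average pooling and max pooling, a shared multilayer perceptron, summation of the two parameters, then an activation module) matches the CBAM-style formulation and can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Channel attention: pool the feature map spatially with mean and
    max, run both descriptors through a shared 2-layer MLP, sum the
    results, and pass them through a sigmoid activation.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = feat.mean(axis=(1, 2))          # average pooling -> (C,)
    mx = feat.max(axis=(1, 2))            # max pooling -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP with ReLU
    return sigmoid(mlp(avg) + mlp(mx))    # channel attention map (C,)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))     # toy feature: 8 channels, 4x4
w1 = rng.standard_normal((2, 8))          # assumed reduction ratio r = 4
w2 = rng.standard_normal((8, 2))
att = channel_attention(feat, w1, w2)
```

Multiplying `att` channel-wise into `feat` would give the enhanced image feature used by the next step.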

405、基于注意力机制对图像特征和通道注意力图合并得到的增强图像特征进行空间注意力计算,得到图像特征的空间注意力图;405. Based on the attention mechanism, spatial attention is calculated on the enhanced image features obtained by merging image features and channel attention maps to obtain the spatial attention map of image features;

本实施例中,分别对增强图像特征进行平均池化运算和最大池化运算,得到平均池化增强特征和最大池化增强特征;将平均池化增强特征和最大池化增强特征合并得到合并池化特征;利用预设尺寸的卷积核对合并池化特征进行卷积运算,并将卷积运算得到的运算结果输入激活模块,得到图像特征的空间注意力图。In this embodiment, average pooling and max pooling operations are performed on the enhanced image features to obtain average pooling enhanced features and max pooling enhanced features, respectively. The average pooling enhanced features and max pooling enhanced features are then merged to obtain merged pooling features. The merged pooling features are then convolved using a convolution kernel of a preset size, and the result of the convolution operation is input into the activation module to obtain a spatial attention map of the image features.

可以理解的,针对任意一个特征矩阵,其内部的元素中只有那些根据待检测图像中目标物体的像素计算得到的元素(即目标元素)对于检测目标物体是有价值的,而其他的元素则是对检测目标物体这一目的干扰。例如,待检测图像中目标物体位于图像的左下角,相应的,特征矩阵中,根据图像左下角的像素计算得到的,同样位于特征矩阵左下角的元素对于检测目标物体是有价值的,而其他元素,例如位于特征矩阵上方的元素则会在检测目标物体时形成干扰。Understandably, for any given feature matrix, only those elements calculated from the pixels of the target object in the image to be detected (i.e., target elements) are valuable for detecting the target object, while the other elements interfere with that purpose. For example, if the target object is located in the lower-left corner of the image to be detected, then the elements of the feature matrix calculated from the pixels in the lower-left corner of the image, likewise located in the lower-left corner of the feature matrix, are valuable for detecting the target object, while other elements, such as those located in the upper part of the feature matrix, will interfere with the detection.
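The spatial-attention computation of step 405 (channel-wise average and max pooling, merging the pooled maps, convolution with a preset kernel, then an activation module) can be sketched as follows; the 3×3 kernel size is one of the preset choices, assumed here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Spatial attention: pool along the channel axis with mean and
    max, stack the two maps into a merged pooled feature, convolve it
    with a preset kernel, and apply a sigmoid activation.
    feat: (C, H, W); kernel: (2, k, k), zero padding k//2."""
    avg = feat.mean(axis=0)               # (H, W) average-pooled map
    mx = feat.max(axis=0)                 # (H, W) max-pooled map
    stacked = np.stack([avg, mx])         # merged pooled feature (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(stacked, ((0, 0), (p, p), (p, p)))
    h, w = avg.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                   # spatial attention map (H, W)

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
kernel = rng.standard_normal((2, 3, 3))   # assumed 3x3 convolution kernel
att = spatial_attention(feat, kernel)
```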

406、将空间注意力图和增强图像特征合并,分别得到第一人脸图像的第一注意力图像特征和第二人脸图像的第二注意力图像特征;406. Merge the spatial attention map and the enhanced image features to obtain the first attention image features of the first face image and the second attention image features of the second face image, respectively;

本实施例中,计算出图像特征的通道注意力图和空间注意力图,然后将通道注意力图和空间注意力图与图像特征合并,得到注意力图像特征。通过上述注意力计算,增加了卷积神经网络特征提取的有效性,使得目标检测的平均精度明显提升。In this embodiment, channel attention maps and spatial attention maps of image features are calculated, and then these maps are merged with the image features to obtain attention-based image features. This attention calculation increases the effectiveness of convolutional neural network feature extraction, significantly improving the average accuracy of object detection.

可选的,对于第一个特征层,可以设置归一化层,在这种情况下,第一个特征层输出图像特征之后,需要:利用归一化层对第一个特征层输出的图像特征进行批处理归一化运算,得到归一化图像特征;对应的,第一个特征层所连接的注意力层的具体作用是:利用特征层连接的注意力层对归一化图像特征执行卷积注意力计算,得到注意力图像特征。Optionally, a normalization layer can be set for the first feature layer. In this case, after the first feature layer outputs image features, it is necessary to: use the normalization layer to perform batch normalization operations on the image features output by the first feature layer to obtain normalized image features; correspondingly, the specific function of the attention layer connected to the first feature layer is: to use the attention layer connected to the feature layer to perform convolutional attention calculation on the normalized image features to obtain attention image features.

本实施例中,批处理归一化(Batch Norm)是为了解决训练过程中数据分布的改变,提高网络泛化性,加快网络训练的一种算法。在网络训练的过程中,参数不断地在更新,前一层网络参数的更新,就会导致下一层网络输入数据分布的变化,那么该层网络就要去适应新的数据分布,这样大大影响了网络训练的速度。另一方面,卷积神经网路的训练过程就是在学习数据分布,如果数据分布不断发生变化的话,那么会降低网络的泛化能力。批处理归一化的本质就是对数据进行预处理,把数据送入网络之前,先对它进行归一化,这样做可以减少数据分布的变化,使得网络的泛化性和训练速度大大提高。In this embodiment, batch normalization is an algorithm designed to address changes in data distribution during training, thereby improving network generalization and accelerating training. During network training, parameters are constantly updated. Updates to the parameters of previous layers lead to changes in the input data distribution of the next layer, requiring that layer to adapt to the new data distribution, significantly impacting training speed. Furthermore, the training process of convolutional neural networks involves learning the data distribution; if the data distribution constantly changes, it reduces the network's generalization ability. Batch normalization essentially preprocesses the data, normalizing it before feeding it into the network. This reduces changes in data distribution, greatly improving the network's generalization and training speed.
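Batch normalization as described above (standardize each channel with the batch mean and variance, then scale and shift) can be sketched as:

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Batch normalization over a (batch, channels) array: subtract the
    per-channel batch mean, divide by the batch standard deviation,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 10.0],
              [3.0, 30.0]])  # batch of 2 samples, 2 channels
y = batch_norm(x)            # each channel now has ~zero mean, unit std
```

This preprocessing keeps the input distribution of each layer stable, which is the generalization and training-speed benefit described above.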

407、计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于特征相似度确定第一人脸图像与第二人脸图像之间的图像相似度。407. Calculate the feature similarity between the first attention image features and the second attention image features, and determine the image similarity between the first face image and the second face image based on the feature similarity.

本实施例中步骤401-403、407与第一实施例中的步骤101-103、105类似,此处不再赘述。Steps 401-403 and 407 in this embodiment are similar to steps 101-103 and 105 in the first embodiment, and will not be repeated here.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

请参阅图5,本发明实施例中人脸图像相似度的计算方法的第五个实施例包括:Please refer to Figure 5. The fifth embodiment of the method for calculating facial image similarity in this invention includes:

501、获取两帧包含人脸的视频图像,并将视频图像输入预置人脸识别模型进行识别,输出视频图像中人脸的区域范围;501. Acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

502、根据区域范围,从两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;502. Based on the region range, extract the corresponding first face image and second face image from the two video frames;

503、将第一人脸图像和第二人脸图像输入预置注意力检测模型的特征层对第一人脸图像和第二人脸图像进行图像特征提取,分别得到第一人脸图像的第一图像特征和第二人脸图像的第二图像特征;503. Input the first face image and the second face image into the feature layer of the preset attention detection model to extract image features from the first face image and the second face image, respectively, to obtain the first image feature of the first face image and the second image feature of the second face image;

504、分别对第一图像特征和第二图像特征进行平均池化运算和最大池化运算,得到平均池化特征和最大池化特征;504. Perform average pooling and max pooling operations on the first image features and the second image features respectively to obtain average pooling features and max pooling features;

本实施例中,对图像特征进行平均池化运算,是指,利用一个具有预先设定的尺寸的池化窗口(如可以是2×2的池化窗口)在图像特征包含的每一个特征矩阵上移动,每次移动后池化窗口覆盖的区域均紧挨着移动前池化窗口覆盖的区域(即移动前后的两个区域的某一条边重合,但是两个区域互不重叠),每当池化窗口覆盖一个新的区域,计算池化窗口当前覆盖的元素(以上述2×2的池化窗口,一次可以覆盖4个元素,即两行两列)的算术平均值,将得到的计算结果作为最终的平均池化特征中的一个元素,当图像特征中每一个元素均进行过上述平均值计算后,对这个图像特征的平均池化运算就完成,计算得到的所有平均值按照计算时池化窗口的位置组合,就得到这个图像特征对应的平均池化特征。In this embodiment, the average pooling operation on the image features refers to using a pooling window with a pre-set size (such as a 2×2 pooling window) to move across each feature matrix contained in the image features. After each move, the area covered by the pooling window is adjacent to the area covered by the pooling window before the move (i.e., one edge of the two areas before and after the move coincides, but the two areas do not overlap). Whenever the pooling window covers a new area, the arithmetic mean of the elements currently covered by the pooling window (with the above-mentioned 2×2 pooling window, it can cover 4 elements at a time, i.e., two rows and two columns) is calculated. The calculated result is used as one element in the final average pooling feature. After each element in the image feature has undergone the above average calculation, the average pooling operation on this image feature is completed. All the calculated average values are combined according to the position of the pooling window during the calculation to obtain the average pooling feature corresponding to this image feature.

对图像特征进行最大池化运算的过程,和上述平均池化运算的过程基本一致,区别在于,每当池化窗口覆盖一个新区域时,从该区域内的所有元素中筛选出最大的元素,作为本次的计算结果(区别于平均池化运算中将平均值作为计算结果),同样的,当图像特征中每一个元素均经过上述筛选后,对图像特征的最大池化运算过程完成,筛选得到的所有元素按照筛选时池化窗口的位置组合,就得到这个图像特征对应的最大池化特征。The process of performing max pooling on image features is basically the same as the average pooling process described above. The difference is that whenever the pooling window covers a new region, the largest element is selected from all elements in that region and used as the result of this calculation (unlike the average pooling operation which uses the average value as the result). Similarly, after each element in the image feature has been selected in the above way, the max pooling process of the image feature is completed. All the selected elements are combined according to the position of the pooling window during the selection to obtain the max pooling feature corresponding to this image feature.
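As a concrete illustration of the two pooling operations described above, the sketch below implements non-overlapping sliding-window average and max pooling on a single feature matrix in plain Python. The 2×2 window is the example size mentioned above; this is an illustrative sketch, not the patent's actual implementation.

```python
def pool2d(matrix, size=2, mode="avg"):
    """Slide a non-overlapping size x size pooling window over one feature matrix.

    Each window position yields one element of the pooled feature: the
    arithmetic mean for average pooling, the largest element for max pooling,
    exactly as described above.
    """
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        row_out = []
        for j in range(0, cols - size + 1, size):
            window = [matrix[i + di][j + dj] for di in range(size) for dj in range(size)]
            row_out.append(sum(window) / len(window) if mode == "avg" else max(window))
        pooled.append(row_out)
    return pooled

feature = [[1.0, 2.0, 5.0, 6.0],
           [3.0, 4.0, 7.0, 8.0],
           [0.0, 1.0, 1.0, 0.0],
           [1.0, 2.0, 0.0, 3.0]]
print(pool2d(feature, mode="avg"))  # [[2.5, 6.5], [1.0, 1.0]]
print(pool2d(feature, mode="max"))  # [[4.0, 8.0], [2.0, 3.0]]
```

Each 2×2 window covers four elements (two rows, two columns), and the pooled results are arranged according to the window positions, as the text above describes.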

505、利用预先构建的多层感知机处理平均池化特征,得到平均池化参数,并利用多层感知机处理最大池化特征,得到最大池化参数;505. Use a pre-built multilayer perceptron to process the average pooling features to obtain the average pooling parameters, and use a multilayer perceptron to process the max pooling features to obtain the max pooling parameters.

本实施例中,多层感知机是一种前馈人工神经网络模型,其将输入的多个数据集映射到单一的输出的数据集上。在本方案中,多层感知机具体用于将最大池化特征和平均池化特征,分别映射为一个包含C个参数的一维向量,即映射为如下形式的向量:(A_1, A_2, …, A_{C-1}, A_C)。In this embodiment, the multilayer perceptron is a feedforward artificial neural network model that maps multiple input datasets to a single output dataset. Specifically, in this scheme, the multilayer perceptron is used to map the max-pooling features and the average-pooling features each to a one-dimensional vector containing C parameters, i.e., a vector of the form (A_1, A_2, …, A_{C-1}, A_C).

其中,C就是输入至这个注意力层的图像特征的通道数(一个图像特征包含的特征矩阵的数量,称为通道数)。Where C is the number of channels of the image features input to this attention layer (the number of feature matrices contained in an image feature is called the number of channels).

多层感知机输出的这两个一维向量,就是前述计算过程中提及的平均池化参数和最大池化参数。The two one-dimensional vectors output by the multilayer perceptron are the average pooling parameter and the max pooling parameter mentioned in the previous calculation process.

506、将平均池化参数与最大池化参数的和输入激活模块,得到第一图像特征的第一通道注意力图和第二图像特征的第二通道注意力图;506. Input the sum of the average pooling parameter and the max pooling parameter into the activation module to obtain the first channel attention map of the first image feature and the second channel attention map of the second image feature;

本实施例中,利用激活函数对多层感知机输出的两个一维向量进行激活运算(相当于将平均池化参数与最大池化参数的和输入激活模块),就可以得到通道注意力图。其中,通道注意力图也是一个包含C个参数的一维向量。In this embodiment, the channel attention map is obtained by performing activation operations on the two one-dimensional vectors output by the multilayer perceptron using an activation function (equivalent to inputting the sum of the average pooling parameter and the max pooling parameter into the activation module). The channel attention map is also a one-dimensional vector containing C parameters.
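The computation in steps 505 and 506 can be sketched as follows. The shared two-layer perceptron weights, the ReLU hidden activation, and the sigmoid as the activation module are illustrative assumptions: the patent names the components but does not fix these exact choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, w_hidden, w_out):
    # Shared multilayer perceptron: maps a pooled feature to a 1-D vector of
    # C parameters (C = number of channels). A ReLU hidden layer is assumed.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def channel_attention(avg_pooled, max_pooled, w_hidden, w_out):
    """Run both pooled features through the same MLP, add the resulting
    average-pooling and max-pooling parameters elementwise, and apply the
    sigmoid activation to obtain the channel attention map (C parameters)."""
    avg_params = mlp(avg_pooled, w_hidden, w_out)   # average-pooling parameters
    max_params = mlp(max_pooled, w_hidden, w_out)   # max-pooling parameters
    return [sigmoid(a + m) for a, m in zip(avg_params, max_params)]

# Toy example with C = 2 channels and identity-like weights (hypothetical values).
attn = channel_attention([1.0, 2.0], [3.0, 4.0],
                         w_hidden=[[1.0, 0.0], [0.0, 1.0]],
                         w_out=[[1.0, 0.0], [0.0, 1.0]])
print(attn)  # one value in (0, 1) per channel
```

The output is a one-dimensional vector of C parameters, matching the channel attention map described above.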

本实施例中,通道注意力图的作用,在于突出图像特征中有意义的特征矩阵。一个图像中,每一种物体的特征在同一个图像特征的不同特征矩阵上的显著程度是不同的,例如,可能汽车的特征在第一特征矩阵上较为显著,而房屋的特征在第二特征矩阵上较为显著。显然,在物体检测这一应用场景中,那些目标物体的特征较为突出的特征矩阵是有意义的特征矩阵,而其他特征矩阵则是无意义的特征矩阵。In this embodiment, the function of the channel attention map is to highlight meaningful feature matrices within the image features. In an image, the salience of features of each object differs across different feature matrices of the same image feature set. For example, the features of a car might be more prominent on the first feature matrix, while the features of a house might be more prominent on the second. Clearly, in the object detection application scenario, feature matrices containing more prominent features of the target object are meaningful feature matrices, while other feature matrices are meaningless.

507、基于注意力机制对图像特征和通道注意力图合并得到的增强图像特征进行空间注意力计算,得到图像特征的空间注意力图;507. Based on the attention mechanism, spatial attention is calculated on the enhanced image features obtained by merging image features and channel attention maps to obtain the spatial attention map of image features;

508、将空间注意力图和增强图像特征合并,分别得到第一人脸图像的第一注意力图像特征和第二人脸图像的第二注意力图像特征;508. Merge the spatial attention map and the enhanced image features to obtain the first attention image features of the first face image and the second attention image features of the second face image, respectively;
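Steps 507 and 508 can be sketched as follows. To stay self-contained, the learned convolution over the pooled maps is replaced by a fixed equal weighting, and a sigmoid is assumed as the squashing function; both are illustrative simplifications, not the patent's actual layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(channels):
    """channels: a list of C feature matrices (H x W) of the enhanced image
    feature. At every spatial position, pool across the channel axis (average
    and maximum), combine the two pooled maps, and squash with a sigmoid to
    obtain an H x W spatial attention map."""
    height, width = len(channels[0]), len(channels[0][0])
    attn = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            values = [ch[i][j] for ch in channels]
            avg_p = sum(values) / len(values)
            max_p = max(values)
            # Fixed 0.5/0.5 weights stand in for the learned convolution.
            attn[i][j] = sigmoid(0.5 * avg_p + 0.5 * max_p)
    return attn

def merge(channels, attn):
    # Step 508: merge the spatial attention map back into the enhanced
    # feature by elementwise multiplication on every channel.
    return [[[ch[i][j] * attn[i][j] for j in range(len(attn[0]))]
             for i in range(len(attn))] for ch in channels]

enhanced = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [2.0, 0.0]]]
weighted = merge(enhanced, spatial_attention(enhanced))
```

The merged result plays the role of the attention image feature: positions with a high spatial attention value are amplified relative to the rest.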

509、计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于特征相似度确定第一人脸图像与第二人脸图像之间的图像相似度。509. Calculate the feature similarity between the first attention image features and the second attention image features, and determine the image similarity between the first face image and the second face image based on the feature similarity.
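Step 509 leaves the similarity metric open; cosine similarity over the flattened attention image features is one conventional choice and is used here purely as an illustrative assumption.

```python
import math

def cosine_similarity(feat_a, feat_b):
    """Feature similarity between two attention image features flattened to
    1-D vectors: the cosine of the angle between them, in [-1, 1], where a
    value near 1 suggests the two face images show the same person."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return dot / (norm_a * norm_b)

# Hypothetical flattened attention features for two frames of one face.
first_attention_feature = [0.2, 0.8, 0.4, 0.1]
second_attention_feature = [0.25, 0.75, 0.35, 0.05]
image_similarity = cosine_similarity(first_attention_feature, second_attention_feature)
print(round(image_similarity, 4))  # close to 1.0 for near-identical features
```

In practice this score can be thresholded to decide whether the first and second face images belong to the same person across the two frames.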

本实施例中步骤501-503、509与第一实施例中的101-103、105类似,此处不再赘述。Steps 501-503 and 509 in this embodiment are similar to steps 101-103 and 105 in the first embodiment, and will not be described again here.

在本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

上面对本发明实施例中人脸图像相似度的计算方法进行了描述,下面对本发明实施例中人脸图像相似度的计算装置进行描述,请参阅图6,本发明实施例中人脸图像相似度的计算装置的第一个实施例包括:The method for calculating face image similarity in the embodiments of the present invention has been described above. The following describes the apparatus for calculating face image similarity in the embodiments of the present invention. Referring to Figure 6, the first embodiment of the apparatus for calculating face image similarity in the embodiments of the present invention includes:

识别模块601,用于获取两帧包含人脸的视频图像,并将所述视频图像输入预置人脸识别模型进行识别,输出所述视频图像中人脸的区域范围;The recognition module 601 is used to acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

提取模块602,用于根据所述区域范围,从所述两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;Extraction module 602 is used to extract the corresponding first face image and second face image from the two video images according to the region range;

第一特征提取模块603,用于将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征;The first feature extraction module 603 is used to input the first face image and the second face image into the feature layer of a preset attention detection model to extract image features from the first face image and the second face image, respectively obtaining the first image feature of the first face image and the second image feature of the second face image;

第一计算模块604,用于分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;The first calculation module 604 is used to calculate convolutional attention on the first image features and the second image features respectively, to obtain first attention image features and second attention image features;

确定模块605,用于计算所述第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于所述特征相似度确定所述第一人脸图像与所述第二人脸图像之间的图像相似度。The determination module 605 is used to calculate the feature similarity between the first attention image feature and the second attention image feature, and to determine the image similarity between the first face image and the second face image based on the feature similarity.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

请参阅图7,本发明实施例中人脸图像相似度的计算装置的第二个实施例,该人脸图像相似度的计算装置具体包括:Please refer to Figure 7, which illustrates a second embodiment of the face image similarity calculation device according to the present invention. This face image similarity calculation device specifically includes:

识别模块601,用于获取两帧包含人脸的视频图像,并将所述视频图像输入预置人脸识别模型进行识别,输出所述视频图像中人脸的区域范围;The recognition module 601 is used to acquire two video images containing human faces, input the video images into a preset face recognition model for recognition, and output the region range of the human face in the video images;

提取模块602,用于根据所述区域范围,从所述两帧视频图像中提取出对应的第一人脸图像和第二人脸图像;Extraction module 602 is used to extract the corresponding first face image and second face image from the two video images according to the region range;

第一特征提取模块603,用于将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征;The first feature extraction module 603 is used to input the first face image and the second face image into the feature layer of a preset attention detection model to extract image features from the first face image and the second face image, respectively obtaining the first image feature of the first face image and the second image feature of the second face image;

第一计算模块604,用于分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;The first calculation module 604 is used to calculate convolutional attention on the first image features and the second image features respectively to obtain first attention image features and second attention image features;

确定模块605,用于计算所述第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于所述特征相似度确定所述第一人脸图像与所述第二人脸图像之间的图像相似度。The determination module 605 is used to calculate the feature similarity between the first attention image feature and the second attention image feature, and to determine the image similarity between the first face image and the second face image based on the feature similarity.

本实施例中,所述人脸图像相似度的计算装置包括:In this embodiment, the facial image similarity calculation device includes:

获取模块606,用于获取多张不同应用场景下的包含人脸的视频图像,并将所述视频图像作为训练样本图像集;The acquisition module 606 is used to acquire multiple video images containing human faces in different application scenarios, and use the video images as a training sample image set.

第二特征提取模块607,用于将所述训练样本图像集输入预置的初始人脸识别模型的主干网络,对所述训练样本图像集中的视频图像分别进行人脸特征提取,得到特征集,其中,所述初始人脸识别模型包括主干网络和多个分类网络;The second feature extraction module 607 is used to input the training sample image set into the backbone network of the preset initial face recognition model, and to extract face features from the video images in the training sample image set to obtain a feature set. The initial face recognition model includes a backbone network and multiple classification networks.

第二计算模块608,用于计算所述特征集的特征向量损失函数值,得到多个特征向量损失函数值;The second calculation module 608 is used to calculate the feature vector loss function value of the feature set and obtain multiple feature vector loss function values.

第三计算模块609,用于根据所述多个特征向量损失函数值,计算所述初始人脸识别模型的目标损失函数值;The third calculation module 609 is used to calculate the target loss function value of the initial face recognition model based on the multiple feature vector loss function values;

更新模块610,用于根据所述目标损失函数值对所述主干网络进行迭代更新,直至所述目标损失函数值收敛,得到目标人脸识别模型。The update module 610 is used to iteratively update the backbone network according to the target loss function value until the target loss function value converges, thereby obtaining the target face recognition model.

本实施例中,所述第一特征提取模块603包括:In this embodiment, the first feature extraction module 603 includes:

边缘提取单元6031,用于对所述第一人脸图像和所述第二人脸图像进行边缘提取,得到第一边缘图像和第二边缘图像;Edge extraction unit 6031 is used to extract edges from the first face image and the second face image to obtain a first edge image and a second edge image;

融合单元6032,用于将所述第一人脸图像和所述第二人脸图像所包含的全局图像信息和所述第一边缘图像和所述第二边缘图像所包含的边缘图像信息进行融合,得到所述第一人脸图像和所述第二人脸图像中包括目标对象的区域;The fusion unit 6032 is used to fuse the global image information contained in the first face image and the second face image and the edge image information contained in the first edge image and the second edge image to obtain the region containing the target object in the first face image and the second face image.

特征提取单元6033,用于对所述区域进行特征提取,得到所述第一人脸图像对应的第一全局特征、第一边缘特征和所述第二人脸图像对应的第二全局特征、第二边缘特征;Feature extraction unit 6033 is used to extract features from the region to obtain a first global feature and a first edge feature corresponding to the first face image and a second global feature and a second edge feature corresponding to the second face image;

特征融合单元6034,用于分别对所述第一全局特征和所述第一边缘特征以及所述第二全局特征和所述第二边缘特征进行特征融合,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征。The feature fusion unit 6034 is used to perform feature fusion on the first global feature and the first edge feature, as well as the second global feature and the second edge feature, respectively, to obtain the first image feature of the first face image and the second image feature of the second face image.

本实施例中,所述融合单元6032具体用于:In this embodiment, the fusion unit 6032 is specifically used for:

通过预置双路特征提取网络对所述第一人脸图像和所述第二人脸图像所包含的全局图像信息进行特征提取,并对所述第一边缘图像和所述第二边缘图像所包含的边缘图像信息进行特征提取;A preset dual-path feature extraction network extracts features from the global image information contained in the first face image and the second face image, and from the edge image information contained in the first edge image and the second edge image;

将所述特征提取结果进行加和,得到所述第一人脸图像和所述第二人脸图像中包括目标对象的区域图像特征。The feature extraction results are summed to obtain the region image features of the target object in the first face image and the second face image.

本实施例中,所述第一计算模块604包括:In this embodiment, the first computing module 604 includes:

第一计算单元6041,用于分别对所述特征层输出的第一图像特征和所述第二图像特征进行通道注意力的计算,得到所述图像特征的通道注意力图;The first calculation unit 6041 is used to calculate channel attention for the first image features and the second image features output by the feature layer, respectively, to obtain a channel attention map of the image features;

第二计算单元6042,用于基于注意力机制对所述图像特征和所述通道注意力图合并得到的增强图像特征进行空间注意力计算,得到所述图像特征的空间注意力图;The second computing unit 6042 is used to perform spatial attention calculation on the enhanced image features obtained by merging the image features and the channel attention map based on the attention mechanism, so as to obtain the spatial attention map of the image features.

特征合并单元6043,用于将所述空间注意力图和所述增强图像特征合并,分别得到所述第一人脸图像的第一注意力图像特征和所述第二人脸图像的第二注意力图像特征。The feature merging unit 6043 is used to merge the spatial attention map and the enhanced image features to obtain the first attention image features of the first face image and the second attention image features of the second face image, respectively.

本实施例中,所述第二计算单元6042具体用于:In this embodiment, the second computing unit 6042 is specifically used for:

分别对所述第一图像特征和所述第二图像特征进行平均池化运算和最大池化运算,得到平均池化特征和最大池化特征;Average pooling and max pooling operations are performed on the first image features and the second image features respectively to obtain average pooling features and max pooling features;

利用预先构建的多层感知机处理所述平均池化特征,得到平均池化参数,并利用所述多层感知机处理所述最大池化特征,得到最大池化参数;The average pooling feature is processed using a pre-built multilayer perceptron to obtain the average pooling parameter, and the max pooling feature is processed using the multilayer perceptron to obtain the max pooling parameter.

将所述平均池化参数与所述最大池化参数的和输入激活模块,得到所述第一图像特征的第一通道注意力图和第二图像特征的第二通道注意力图。The sum of the average pooling parameter and the max pooling parameter is input into the activation module to obtain the first channel attention map of the first image feature and the second channel attention map of the second image feature.

本发明实施例中,通过将两帧视频图像输入预置人脸识别模型进行识别,输出视频图像对应的第一人脸图像和第二人脸图像;将人脸图像输入预置注意力检测模型的特征层进行图像特征提取,分别得到人脸图像的图像特征;对图像特征执行卷积注意力计算,分别得到人脸图像的第一注意力图像特征和第二注意力图像特征;计算第一注意力图像特征和第二注意力图像特征之间的特征相似度,确定为第一人脸图像和第二人脸图像的图像相似度。本方案通过对人脸图像进行特征提取和融合,根据两图像对应特征之间的相关性确定图像的相关性,提高了图像识别效率。In this embodiment of the invention, two video frames are input into a pre-set face recognition model for identification, outputting a first face image and a second face image corresponding to the video images. The face images are then input into the feature layer of a pre-set attention detection model for image feature extraction, yielding image features for each face image. Convolutional attention calculation is performed on these image features to obtain first attention image features and second attention image features for each face image. The feature similarity between the first and second attention image features is calculated and determined as the image similarity between the first and second face images. This scheme improves image recognition efficiency by extracting and fusing features from face images and determining the correlation between corresponding features of the two images.

上面图6和图7从模块化功能实体的角度对本发明实施例中的人脸图像相似度的计算装置进行详细描述,下面从硬件处理的角度对本发明实施例中人脸图像相似度的计算设备进行详细描述。Figures 6 and 7 above describe in detail the face image similarity calculation device in the embodiments of the present invention from the perspective of modular functional entities. The following describes in detail the face image similarity calculation device in the embodiments of the present invention from the perspective of hardware processing.

图8是本发明实施例提供的一种人脸图像相似度的计算设备的结构示意图,该人脸图像相似度的计算设备800可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)810(例如,一个或一个以上处理器)和存储器820,一个或一个以上存储应用程序833或数据832的存储介质830(例如一个或一个以上海量存储设备)。其中,存储器820和存储介质830可以是短暂存储或持久存储。存储在存储介质830的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对人脸图像相似度的计算设备800中的一系列指令操作。更进一步地,处理器810可以设置为与存储介质830通信,在人脸图像相似度的计算设备800上执行存储介质830中的一系列指令操作,以实现上述各方法实施例提供的人脸图像相似度的计算方法的步骤。Figure 8 is a schematic diagram of a face image similarity calculation device provided in an embodiment of the present invention. This face image similarity calculation device 800 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 810 (e.g., one or more processors) and a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832. The memory 820 and storage media 830 can be temporary or persistent storage. The program stored in the storage media 830 may include one or more modules (not shown in the figure), each module including a series of instruction operations on the face image similarity calculation device 800. Furthermore, the processor 810 may be configured to communicate with the storage media 830 and execute a series of instruction operations in the storage media 830 on the face image similarity calculation device 800 to implement the steps of the face image similarity calculation method provided in the above-described method embodiments.

人脸图像相似度的计算设备800还可以包括一个或一个以上电源840,一个或一个以上有线或无线网络接口850,一个或一个以上输入输出接口860,和/或,一个或一个以上操作系统831,例如Windows Server,Mac OS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解,图8示出的人脸图像相似度的计算设备结构并不构成对本申请提供的人脸图像相似度的计算设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。The face image similarity computing device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will understand that the face image similarity computing device structure shown in FIG8 does not constitute a limitation on the face image similarity computing device provided in this application, and may include more or fewer components than shown, or combine certain components, or have different component arrangements.

本发明还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行上述人脸图像相似度的计算方法的步骤。The present invention also provides a computer-readable storage medium, which can be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when the instructions are executed on a computer, cause the computer to perform the steps of the above-described method for calculating the similarity of facial images.

所述领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

以上所述,以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1.一种人脸图像相似度的计算方法,其特征在于,所述人脸图像相似度的计算方法包括:1. A method for calculating the similarity of facial images, characterized in that the method for calculating the similarity of facial images includes:
从预置数据库中获取两帧包含人脸的视频图像,基于预置人脸识别模型通过所述视频图像中的五官对人脸进行标识,输出为所述视频图像中人脸的区域范围;Two video images containing human faces are obtained from a preset database; based on a preset face recognition model, the faces are identified by the facial features in the video images, and the output is the region range of the face in the video images;
根据所述区域范围,从所述两帧视频图像中提取出对应的第一人脸图像和第二人脸图像,其中,所述第一人脸图像和第二人脸图像包含全局图像信息;Based on the region range, the corresponding first face image and second face image are extracted from the two video images, wherein the first face image and the second face image contain global image information;
将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征;The first face image and the second face image are input into the feature layer of a preset attention detection model to extract image features from the first face image and the second face image, obtaining the first image feature of the first face image and the second image feature of the second face image respectively;
分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;Convolutional attention is calculated on the first image feature and the second image feature respectively to obtain a first attention image feature and a second attention image feature;
计算所述第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于所述特征相似度确定所述第一人脸图像与所述第二人脸图像之间的图像相似度;Calculate the feature similarity between the first attention image feature and the second attention image feature, and determine the image similarity between the first face image and the second face image based on the feature similarity;
其中,所述将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征包括:对所述第一人脸图像和所述第二人脸图像进行边缘提取,得到第一边缘图像和第二边缘图像,其中,所述第一边缘图像和所述第二边缘图像包含边缘图像信息;通过预置双路特征提取网络对所述第一人脸图像和所述第二人脸图像所包含的全局图像信息进行特征提取,并对所述第一边缘图像和所述第二边缘图像所包含的边缘图像信息进行特征提取;将所述特征提取结果进行加和,得到所述第一人脸图像和所述第二人脸图像中包括目标对象的区域;对所述区域进行特征提取,得到所述第一人脸图像对应的第一全局特征、第一边缘特征和所述第二人脸图像对应的第二全局特征、第二边缘特征;对所述第一全局特征和所述第一边缘特征进行特征融合,得到第一人脸图像的第一图像特征,以及对所述第二全局特征和所述第二边缘特征进行特征融合得到所述第二人脸图像的第二图像特征;The step of inputting the first face image and the second face image into the feature layer of a preset attention detection model to extract image features from the first face image and the second face image to obtain the first image feature of the first face image and the second image feature of the second face image includes: performing edge extraction on the first face image and the second face image to obtain a first edge image and a second edge image, wherein the first edge image and the second edge image contain edge image information; performing feature extraction, through a preset dual-path feature extraction network, on the global image information contained in the first face image and the second face image, and on the edge image information contained in the first edge image and the second edge image; summing the feature extraction results to obtain the region containing the target object in the first face image and the second face image; performing feature extraction on the region to obtain the first global feature and the first edge feature corresponding to the first face image, and the second global feature and the second edge feature corresponding to the second face image; performing feature fusion on the first global feature and the first edge feature to obtain the first image feature of the first face image, and performing feature fusion on the second global feature and the second edge feature to obtain the second image feature of the second face image;
所述分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征包括:分别对所述第一图像特征和所述第二图像特征进行通道注意力的计算,得到所述图像特征的通道注意力图;基于注意力机制对所述图像特征和所述通道注意力图合并得到的增强图像特征进行空间注意力计算,得到所述图像特征的空间注意力图;将所述空间注意力图和所述增强图像特征合并,分别得到所述第一人脸图像的第一注意力图像特征和所述第二人脸图像的第二注意力图像特征。The step of calculating convolutional attention on the first image feature and the second image feature respectively to obtain the first attention image feature and the second attention image feature includes: calculating channel attention on the first image feature and the second image feature respectively to obtain the channel attention map of the image feature; calculating spatial attention on the enhanced image feature obtained by merging the image feature and the channel attention map based on the attention mechanism to obtain the spatial attention map of the image feature; and merging the spatial attention map and the enhanced image feature to obtain the first attention image feature of the first face image and the second attention image feature of the second face image respectively.
2.根据权利要求1所述的人脸图像相似度的计算方法,其特征在于,在所述从预置数据库中获取两帧包含人脸的视频图像,基于预置人脸识别模型通过所述视频图像中的五官对人脸进行标识,输出为所述视频图像中人脸的区域范围之前,还包括:2. The method for calculating facial image similarity according to claim 1, characterized in that, before obtaining two video images containing faces from a preset database, identifying faces based on facial features in the video images using a preset facial recognition model, and outputting the region range of the face in the video images, the method further includes:
获取多张不同应用场景下的包含人脸的样本图像,并将所述样本图像作为训练样本图像集;Acquire multiple sample images containing human faces from different application scenarios, and use these sample images as a training sample image set;
将所述训练样本图像集输入预置的初始人脸识别模型的主干网络,对所述训练样本图像集中的样本图像分别进行人脸特征提取,得到特征集,其中,所述初始人脸识别模型包括主干网络和多个分类网络;The training sample image set is input into the backbone network of the preset initial face recognition model, and face features are extracted from the sample images in the training sample image set to obtain a feature set, wherein the initial face recognition model includes a backbone network and multiple classification networks;
计算所述特征集的特征向量损失函数值,得到多个特征向量损失函数值;Calculate the feature vector loss function value of the feature set to obtain multiple feature vector loss function values;
根据所述多个特征向量损失函数值,计算所述初始人脸识别模型的目标损失函数值;Calculate the target loss function value of the initial face recognition model based on the multiple feature vector loss function values;
根据所述目标损失函数值对所述主干网络进行迭代更新,直至所述目标损失函数值收敛,得到目标人脸识别模型。The backbone network is iteratively updated based on the target loss function value until the target loss function value converges, thus obtaining the target face recognition model.
3.根据权利要求1所述的人脸图像相似度的计算方法,其特征在于,所述分别对所述特征层输出的第一图像特征和所述第二图像特征进行通道注意力的计算,得到所述图像特征的通道注意力图包括:3. The method for calculating facial image similarity according to claim 1, characterized in that the step of calculating channel attention for the first image features and the second image features output by the feature layer respectively to obtain the channel attention map of the image features includes:
分别对所述第一图像特征和所述第二图像特征进行平均池化运算和最大池化运算,得到平均池化特征和最大池化特征;Average pooling and max pooling operations are performed on the first image features and the second image features respectively to obtain average pooling features and max pooling features;
利用预先构建的多层感知机处理所述平均池化特征,得到平均池化参数,并利用所述多层感知机处理所述最大池化特征,得到最大池化参数;The average pooling feature is processed using a pre-built multilayer perceptron to obtain the average pooling parameter, and the max pooling feature is processed using the multilayer perceptron to obtain the max pooling parameter;
将所述平均池化参数与所述最大池化参数的和输入激活模块,得到所述第一图像特征的第一通道注意力图和第二图像特征的第二通道注意力图。The sum of the average pooling parameter and the max pooling parameter is input into the activation module to obtain the first channel attention map of the first image feature and the second channel attention map of the second image feature.
4.一种人脸图像相似度的计算装置,其特征在于,所述人脸图像相似度的计算装置包括:4. A device for calculating the similarity of facial images, characterized in that the device comprises:
识别模块,用于从预置数据库中获取两帧包含人脸的视频图像,基于预置人脸识别模型通过所述视频图像中的五官对人脸进行标识,输出为所述视频图像中人脸的区域范围;The recognition module is used to obtain two video images containing human faces from a preset database, identify the human face based on the facial features in the video images using a preset face recognition model, and output the region range of the human face in the video images;
提取模块,用于根据所述区域范围,从所述两帧视频图像中提取出对应的第一人脸图像和第二人脸图像,其中,所述第一人脸图像和第二人脸图像包含全局图像信息;
an extraction module, configured to extract a corresponding first face image and second face image from the two video images according to the region range, where the first face image and the second face image contain global image information;

第一特征提取模块,用于将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征;
a first feature extraction module, configured to input the first face image and the second face image into the feature layer of a preset attention detection model to extract image features, respectively obtaining a first image feature of the first face image and a second image feature of the second face image;

第一计算模块,用于分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征;
a first calculation module, configured to calculate convolutional attention on the first image feature and the second image feature, respectively, to obtain a first attention image feature and a second attention image feature;

确定模块,用于计算所述第一注意力图像特征和第二注意力图像特征之间的特征相似度,并基于所述特征相似度确定所述第一人脸图像与所述第二人脸图像之间的图像相似度;
a determination module, configured to calculate the feature similarity between the first attention image feature and the second attention image feature, and to determine the image similarity between the first face image and the second face image based on the feature similarity;
其中,所述将所述第一人脸图像和所述第二人脸图像输入预置注意力检测模型的特征层对所述第一人脸图像和所述第二人脸图像进行图像特征提取,分别得到所述第一人脸图像的第一图像特征和所述第二人脸图像的第二图像特征包括:对所述第一人脸图像和所述第二人脸图像进行边缘提取,得到第一边缘图像和第二边缘图像,其中,所述第一边缘图像和所述第二边缘图像包含边缘图像信息;通过预置双路特征提取网络对所述第一人脸图像和所述第二人脸图像所包含全局图像信息进行特征提取,并对所述第一边缘图像和所述第二边缘图像所包含边缘图像信息进行特征提取;将所述特征提取结果进行加和,得到所述第一人脸图像和所述第二人脸图像中包括目标对象的区域;对所述区域进行特征提取,得到所述第一人脸图像对应的第一全局特征、第一边缘特征和所述第二人脸图像对应的第二全局特征、第二边缘特征;对所述第一全局特征和所述第一边缘特征进行特征融合,得到第一人脸图像的第一图像特征,以及对所述第二全局特征和所述第二边缘特征进行特征融合,得到所述第二人脸图像的第二图像特征;
The step of inputting the first face image and the second face image into the feature layer of a preset attention detection model to extract image features, respectively obtaining the first image feature of the first face image and the second image feature of the second face image, includes: performing edge extraction on the first face image and the second face image to obtain a first edge image and a second edge image, where the first edge image and the second edge image contain edge image information; performing, through a preset dual-path feature extraction network, feature extraction on the global image information contained in the first face image and the second face image, and feature extraction on the edge image information contained in the first edge image and the second edge image; summing the feature extraction results to obtain the regions containing the target object in the first face image and the second face image; performing feature extraction on the regions to obtain a first global feature and a first edge feature corresponding to the first face image, and a second global feature and a second edge feature corresponding to the second face image; and fusing the first global feature with the first edge feature to obtain the first image feature of the first face image, and fusing the second global feature with the second edge feature to obtain the second image feature of the second face image;
所述分别对所述第一图像特征和所述第二图像特征进行卷积注意力的计算,得到第一注意力图像特征和第二注意力图像特征包括:分别对所述第一图像特征和所述第二图像特征进行通道注意力的计算,得到所述图像特征的通道注意力图;基于注意力机制对所述图像特征和所述通道注意力图合并得到的增强图像特征进行空间注意力计算,得到所述图像特征的空间注意力图;将所述空间注意力图和所述增强图像特征合并,分别得到所述第一人脸图像的第一注意力图像特征和所述第二人脸图像的第二注意力图像特征。
The step of calculating convolutional attention on the first image feature and the second image feature, respectively, to obtain the first attention image feature and the second attention image feature includes: calculating channel attention on the first image feature and the second image feature, respectively, to obtain channel attention maps of the image features; calculating, based on the attention mechanism, spatial attention on the enhanced image features obtained by merging the image features with the channel attention maps, to obtain spatial attention maps of the image features; and merging the spatial attention maps with the enhanced image features to obtain the first attention image feature of the first face image and the second attention image feature of the second face image, respectively.

5.根据权利要求4所述的人脸图像相似度的计算装置,其特征在于,所述人脸图像相似度的计算装置还包括:
5. The facial image similarity calculation device according to claim 4, characterized in that the device further comprises:

获取模块,用于获取多张不同应用场景下的包含人脸的样本图像,并将所述样本图像作为训练样本图像集;
an acquisition module, configured to acquire multiple sample images containing human faces in different application scenarios, and use the sample images as a training sample image set;

第二特征提取模块,用于将所述训练样本图像集输入预置的初始人脸识别模型的主干网络,对所述训练样本图像集中的样本图像分别进行人脸特征提取,得到特征集,其中,所述初始人脸识别模型包括主干网络和多个分类网络;
a second feature extraction module, configured to input the training sample image set into the backbone network of a preset initial face recognition model and extract face features from each sample image in the training sample image set to obtain a feature set, where the initial face recognition model includes a backbone network and multiple classification networks;
第二计算模块,用于计算所述特征集的特征向量损失函数值,得到多个特征向量损失函数值;
a second calculation module, configured to calculate feature-vector loss function values for the feature set to obtain multiple feature-vector loss function values;

第三计算模块,用于根据所述多个特征向量损失函数值,计算所述初始人脸识别模型的目标损失函数值;
a third calculation module, configured to calculate the target loss function value of the initial face recognition model based on the multiple feature-vector loss function values;

更新模块,用于根据所述目标损失函数值对所述主干网络进行迭代更新,直至所述目标损失函数值收敛,得到目标人脸识别模型。
an update module, configured to iteratively update the backbone network based on the target loss function value until the target loss function value converges, thereby obtaining the target face recognition model.

6.一种人脸图像相似度的计算设备,其特征在于,所述人脸图像相似度的计算设备包括:存储器和至少一个处理器,所述存储器中存储有指令,所述存储器和所述至少一个处理器通过线路互连;
6. A facial image similarity calculation device, characterized in that the device comprises: a memory and at least one processor, where the memory stores instructions, and the memory and the at least one processor are interconnected via a circuit;

所述至少一个处理器调用所述存储器中的所述指令,以使得所述人脸图像相似度的计算设备执行如权利要求1-3中任一项所述的人脸图像相似度的计算方法的步骤。
the at least one processor invokes the instructions in the memory to cause the facial image similarity calculation device to perform the steps of the method for calculating facial image similarity according to any one of claims 1-3.

7.一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1-3中任一项所述的人脸图像相似度的计算方法的步骤。
7. A computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the method for calculating facial image similarity according to any one of claims 1-3.
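The claims above describe a CBAM-style convolutional attention pipeline (claims 3-4): channel attention from average- and max-pooled features passed through a shared multilayer perceptron and an activation module, followed by spatial attention on the enhanced features. As a minimal illustrative NumPy sketch only, not the patented implementation: the function names, the ReLU hidden layer, and the parameter-free spatial map (CBAM uses a learned 7x7 convolution there) are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    # The "activation module" of claim 3 is assumed here to be a sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    # feat: (C, H, W). Average-pool and max-pool over spatial dims,
    # run both through a shared two-layer MLP (w1, w2), sum, activate.
    c = feat.shape[0]
    avg = feat.mean(axis=(1, 2))                    # average-pooled features, (C,)
    mx = feat.max(axis=(1, 2))                      # max-pooled features, (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP with ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx)).reshape(c, 1, 1)

def spatial_attention(feat):
    # feat: (C, H, W). Pool over the channel dim and squash to a (1, H, W) map.
    avg = feat.mean(axis=0)
    mx = feat.max(axis=0)
    return sigmoid(avg + mx)[None, :, :]

def conv_attention(feat, w1, w2):
    # Channel attention first ("enhanced image features"), then spatial attention.
    enhanced = feat * channel_attention(feat, w1, w2)
    return enhanced * spatial_attention(enhanced)
```

Applying `conv_attention` to the first and second image features would yield the first and second attention image features that the determination module then compares.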
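The claims leave the feature-similarity metric of the determination module unspecified. Cosine similarity between the flattened attention feature maps is a common choice for face features and is used here purely as an illustrative assumption:

```python
import numpy as np

def feature_similarity(f1, f2):
    # Cosine similarity between two flattened attention feature maps;
    # 1.0 means identical direction, 0.0 means orthogonal features.
    a, b = f1.ravel(), f2.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

The resulting score can be thresholded (or mapped monotonically) to decide whether the two face images depict the same person across frames.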
HK42021045313.0A 2021-12-27 Face image similarity calculation method, apparatus, and device, and storage medium HK40056713B (en)

Publications (2)

Publication Number Publication Date
HK40056713A HK40056713A (en) 2022-04-08
HK40056713B true HK40056713B (en) 2024-09-13


Similar Documents

Publication Publication Date Title
CN113361495B (en) Method, device, equipment and storage medium for calculating similarity of face images
Abbass et al. A survey on online learning for visual tracking
Krig Interest point detector and feature descriptor survey
US20240135139A1 (en) Implementing Traditional Computer Vision Algorithms as Neural Networks
US9141871B2 (en) Systems, methods, and software implementing affine-invariant feature detection implementing iterative searching of an affine space
Paisitkriangkrai et al. Pedestrian detection with spatially pooled features and structured ensemble learning
US20210264144A1 (en) Human pose analysis system and method
US8861873B2 (en) Image clustering a personal clothing model
CN110363047A (en) Method, device, electronic device and storage medium for face recognition
US11138464B2 (en) Image processing device, image processing method, and image processing program
EP3872761A2 (en) Analysing objects in a set of frames
Le et al. Face alignment using active shape model and support vector machine
WO2019102608A1 (en) Image processing device, image processing method, and image processing program
Juneja et al. RETRACTED ARTICLE: Gender and Age Classification Enabled Blockschain Security Mechanism for Assisting Mobile Application
CN110852327A (en) Image processing method, device, electronic device and storage medium
CN108550165A (en) A kind of image matching method based on local invariant feature
Dalara et al. Entity recognition in Indian sculpture using CLAHE and machine learning
Singh et al. Genetic algorithm implementation to optimize the hybridization of feature extraction and metaheuristic classifiers
HK40056713B (en) Face image similarity calculation method, apparatus, and device, and storage medium
Wu et al. Differential tracking with a kernel-based region covariance descriptor
Choi Spatial pyramid face feature representation and weighted dissimilarity matching for improved face recognition
Mamalet et al. Embedded facial image processing with convolutional neural networks
HK40056713A (en) Face image similarity calculation method, apparatus, and device, and storage medium
Lu et al. HGFT: hybrid geometric feature transformer for point cloud registration
Csurka et al. Estimating low-rank region likelihood maps