
CN111428661A - Method for processing face image based on intelligent human-computer interaction - Google Patents


Info

Publication number
CN111428661A
Authority: CN (China)
Prior art keywords: face, predicted, facial, level, image
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202010232764.9A
Other languages: Chinese (zh)
Inventors: 涂山山, 张玙彤, 穆罕默德·瓦卡斯, 张振昊, 石岩, 张永继
Current assignee: Beijing University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing University of Technology
Application filed by Beijing University of Technology
Priority: CN202010232764.9A
Publication: CN111428661A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; localisation; normalisation
    • G06V40/165: Detection; localisation; normalisation using facial parts and geometric relationships
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A face image processing method based on intelligent human-computer interaction, belonging to the field of artificial intelligence. The facial features related to user information are divided into overall facial features, expression features, and visual tracking; combined analysis of the three yields three-dimensional face information. The invention binds specific actions in the user's facial features to behavior information and realizes human-computer interaction through the facial feature information, reducing the selection error of facial recognition points. By analyzing the facial regions of a captured face, the invention locates the basic facial features, performs visual tracking of the eye region, and binds facial expressions to behavior information to realize human-computer interaction, thereby reducing facial recognition point selection error and improving facial recognition accuracy.

Description

A Method of Face Image Processing Based on Intelligent Human-Computer Interaction

Technical Field

The invention belongs to the field of artificial intelligence and relates to the implementation of a face image processing method; specifically, a method that performs facial point detection, expression recognition, and visual tracking on face images and video information collected in real time.

Background

Human-computer interaction is the process of information exchange between people and systems, and it can exist in many different types of systems. The earliest human-computer interaction was realized by manually entering machine-language instructions, with computer languages as the medium of interaction. With the development of graphics processing, the interaction medium gradually shifted to graphical interfaces, which offer a better interactive experience, make the user's interaction more convenient, and feed more precise information back to the user. With the development of ubiquitous computing, deep learning, and related technologies, the forms of human-computer interaction have multiplied and the media used for interaction have diversified; the main methods, including speech recognition, gesture recognition, and tracking, continue to improve, and the amount of information that can be conveyed has greatly increased.

As an important branch of pattern recognition, face recognition has long been a research hotspot. As face analysis technology matures, facial features are being applied in a wide range of settings, and face image processing has come to the fore. The present invention locates facial feature regions and performs facial point detection, expression recognition, and visual tracking on the face; by applying the three methods in combination it overcomes the different limitations of each individual method and realizes an intelligent human-computer interaction process.

Facial point detection and recognition is one of the important applications of deep learning. Facial point recognition refers to locating key points on a detected face, preprocessing the data after locating the facial feature regions, and extracting features with a recognition algorithm to complete face recognition; the process is shown in Figure 1. With the rapid development of deep learning, facial point recognition technology is maturing.

Expression recognition is an important direction for computers to understand human emotion and an important aspect of human-computer interaction. By analyzing expressions, a system can capture user information, make decisions, and directly judge the user's emotional and psychological state. Convolutional neural networks are commonly used for this analysis: their multiple convolutional and pooling layers extract high-level, multi-level features from the whole face or from local regions, giving a good feature analysis of expression images. Experience shows that for image recognition convolutional neural networks outperform other types of neural network, and the best current expression recognition results are achieved with them.

Visual tracking is another important aspect of human-computer interaction. It makes the user's point of focus easy to observe, which helps in analyzing the user's regions of interest and the user's choices and preferences. We use the human eye as an input source for the computer: by tracking the user's gaze we determine the line-of-sight range of the eye and complete the corresponding human-computer interaction.

From the above analysis, each of the three methods addresses only one aspect of face recognition, so applying any of them alone has shortcomings that lead to recognition errors. The present invention therefore combines the three methods into a new face recognition method: it analyzes the facial regions of a captured face to locate the basic facial features, performs visual tracking of the eye region, and binds facial expressions to behavior information to realize human-computer interaction, reducing facial recognition point selection error and improving recognition accuracy.

Summary of the Invention

The purpose of the present invention is to solve the face image processing problem by proposing a method that performs facial point detection, expression recognition, and visual tracking on face images and video information collected in real time.

The method of the present invention is described as follows:

The present invention realizes intelligent human-computer interaction through convolutional-neural-network-based face image processing. The implementation is divided into the following three processes:

(1) Facial point detection method

This method adopts a three-level convolutional neural network structure: the first-level network is built with absolute-value rectification and a parameter-sharing mechanism, and the second and third levels are obtained following the idea of multi-level regression. Because face pose varies widely, detection is unstable and the relative position between facial points and detection points can change over a large range, producing large relative errors; the input region of the first-level network is therefore chosen large, to cover as many predicted positions as possible. The output of the first level provides the selection conditions for the subsequent detection regions, so the second- and third-level regions shrink accordingly. Each detection region is selected as the circular area that contains 75% of all predicted positions obtained by the previous level, centered on the point where the density of the previous level's predicted positions is highest.

Predicted positions are obtained again within the new prediction region, and this process is repeated until the detection region has shrunk to 1% of the first-level detection region; the positions obtained then are the final predictions for each point. This yields multiple networks with different input regions at each level.

The final predicted position x of a facial point can be formally expressed as an n-level cascade, with the following mathematical representation:

$$x = \frac{1}{l_1}\sum_{j=1}^{l_1} x_j^{(1)} + \sum_{i=2}^{n} \frac{1}{l_i}\sum_{j=1}^{l_i} \Delta x_j^{(i)}$$

where x is the final predicted position, $l_i$ is the number of predicted positions at level i of the n-level cascade, and the predictions at level i are denoted $x_1^{(i)}, \dots, x_{l_i}^{(i)}$; that is, $x_1^{(1)}$ is the first predicted position at level 1, and $\Delta x_j^{(i)}$ denotes the change of the j-th predicted position at level i relative to the corresponding j-th predicted position at level i-1.

This method uses a three-level convolutional network design. The first level comprises three deep convolutional networks with different detection regions: the detection region of the F1 network covers the whole face, that of the EN1 network covers only the eye and nose region, and that of the NM1 network covers only the nose and mouth region. The three networks apply the prediction method above to different regions of the same face simultaneously, and their predictions are averaged to reduce variance, giving the first-level predicted positions of all facial feature points and avoiding bias in the face prediction caused by overly prominent local features. Following the regression idea, the corresponding second- and third-level predicted positions are then obtained from the first-level positions for each of the three networks F1, EN1, and NM1. Because the input regions of the second and third levels are strictly limited by the first level's predictions, the second- and third-level predictions can reach very high accuracy, but they are also strictly constrained by that limit.
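For illustration only, the following Python sketch shows how the cascade combines its levels under the formula above: the first-level predictions are averaged, and each later level contributes the mean of its residual corrections. The network callables (`level1_nets`, `refine_levels`) are hypothetical placeholders, since the patent does not prescribe a specific implementation.

```python
import numpy as np

def cascade_predict(face_img, level1_nets, refine_levels):
    """x = mean_j x_j^(1) + sum_{i=2..n} mean_j Delta x_j^(i)."""
    # Level 1: each first-level network (e.g. F1, EN1, NM1) predicts landmark
    # coordinates for its own input region; averaging reduces variance.
    x = np.mean([net(face_img) for net in level1_nets], axis=0)  # (points, 2)

    # Levels 2..n: each refinement network sees a small patch around the
    # current estimate and returns a residual displacement Delta x.
    for nets_at_level in refine_levels:
        deltas = [net(face_img, x) for net in nets_at_level]
        x = x + np.mean(deltas, axis=0)
    return x
```

With three first-level networks and two refinement levels, `refine_levels` would hold two lists of patch-level networks, mirroring the three-level design described above.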

(2) Facial expression recognition method

For facial expression recognition, an end-to-end learning model is proposed that synthesizes face images along the two axes of pose and expression and performs expression recognition with pose held fixed. The model consists of a generator, two discriminators, and a classifier. Images are preprocessed before being fed into the model: a face detection algorithm is applied against a base library containing 68 landmark points. After preprocessing, the facial image is input to the generator G to produce an identity representation; concretely, there is a mapping f(x) that gives every input image a definite and unique identity representation, which is then concatenated with an expression code e and a pose code p to represent variations of the face. By applying a minimax algorithm between the generator G and the discriminator D and adding the corresponding labels at the decoder input, new labels for face images with different poses and expressions can be obtained. Two discriminator structures are used, denoted Datt and Di: Datt identifies and represents the degree of identity entanglement, while Di improves the quality of the generated images. After the face image has been synthesized, the classifier Cexp completes the expression recognition task; specifically, a deep learning algorithm is applied in the classifier to ensure that, in each representation layer, the key classification factors gradually stabilize while the characteristic information of each expression is preserved.
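As a rough sketch of the data flow just described (not the patent's actual architecture; every layer size below is an assumption), the following PyTorch fragment shows an encoder producing the identity code f(x), its concatenation with the expression code e and pose code p, and a decoder synthesizing the resulting face.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, id_dim=128, exp_dim=8, pose_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(            # image -> identity code f(x)
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, id_dim))
        self.decoder = nn.Sequential(            # [f(x); e; p] -> face image
            nn.Linear(id_dim + exp_dim + pose_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, img, e, p):
        identity = self.encoder(img)             # definite, unique identity code
        z = torch.cat([identity, e, p], dim=1)   # concatenate with e and p
        return self.decoder(z), identity

# In the full model, Datt scores identity disentanglement, Di scores image
# quality (both trained against G with a minimax objective), and Cexp
# classifies the expression of the synthesized face.
```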

(3) Visual tracking method

The present invention adopts a detection-based tracking algorithm. By analyzing the gradient vector field of the captured face image, the relationship between a possible center and the directions of the image gradients is described mathematically: a candidate center c is set, a gradient vector is taken at each position, and the displacements are normalized so that they can be compared with the gradient directions. Computing the dot products between the normalized displacements $d_j$ relative to the candidate center and the gradient vectors $g_j$ yields the optimal center $c^{*}$ of the eye region in the face image, i.e. the pupil center position, given by:

$$c^{*} = \arg\max_{c}\;\frac{1}{N}\sum_{j=1}^{N}\left(d_j^{\top} g_j\right)^{2}$$

For a candidate center c, N different gradients are selected, corresponding to the normalized displacements $d_1, \dots, d_N$ and gradient vectors $g_1, \dots, g_N$; that is, the j-th selected gradient has normalized displacement $d_j$ and gradient vector $g_j$. The position variable at which the objective function attains its maximum is the optimal center position $c^{*}$. The displacement $d_j$ corresponding to each selected gradient is obtained as follows:

$$d_j = \frac{x_j - c}{\left\lVert x_j - c \right\rVert_{2}}$$

Scaling the displacements $d_j$ to unit length gives the same weight to different positions in the face image. To improve the method's robustness to linear changes in illumination and contrast, the gradient vectors $g_j$ are also scaled to unit length; the objective function then attains its maximum at the pupil center position. The complexity of the algorithm can be reduced further by considering only gradient vectors with significant magnitude.
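A direct NumPy sketch of this objective follows (after the gradient-based eye-centre method of Timm et al., cited in the non-patent literature below). The brute-force scan over all candidate centres and the magnitude threshold are simplifications of our own; a practical implementation would restrict and accelerate the search.

```python
import numpy as np

def locate_pupil_center(eye_patch, mag_threshold=0.3):
    """Return the (row, col) maximizing (1/N) * sum_j (d_j . g_j)^2."""
    gy, gx = np.gradient(eye_patch.astype(float))
    mag = np.hypot(gx, gy)
    keep = mag > mag_threshold * mag.max()       # only significant gradients
    ys, xs = np.nonzero(keep)
    g = np.stack([gx[ys, xs], gy[ys, xs]], 1) / mag[ys, xs, None]  # unit length

    best_score, best_c = -1.0, (0, 0)
    h, w = eye_patch.shape
    for cy in range(h):
        for cx in range(w):
            d = np.stack([xs - cx, ys - cy], 1).astype(float)
            n = np.linalg.norm(d, axis=1)
            ok = n > 0                           # skip the gradient's own pixel
            d = d[ok] / n[ok, None]              # d_j = (x_j - c)/||x_j - c||
            score = np.mean(np.einsum('ij,ij->i', d, g[ok]) ** 2)
            if score > best_score:
                best_score, best_c = score, (cy, cx)
    return best_c
```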

The advantages of the proposed method as applied to the field of face image processing are as follows:

(1) The present invention realizes human-computer interaction by recognizing facial features.

(2) The facial features related to user information are divided into overall facial features, expression features, and visual tracking; combined analysis of the three yields three-dimensional face information.

(3) The present invention binds specific actions in the user's facial features to behavior information and realizes human-computer interaction through the facial feature information, reducing the selection error of facial recognition points.

Brief Description of the Drawings

Figure 1: schematic of the human-computer interaction process based on face detection;

Figure 2: model block diagram.

Detailed Description

The face image processing method based on intelligent human-computer interaction of the present invention is now described in further detail with reference to the model block diagram of Figure 2 and an implementation example.

The method of the present invention performs facial point detection, expression recognition, and visual tracking on face images and video information collected in real time to realize face image processing. The overall working framework as applied to face recognition is as follows: the face is first captured; facial key points are located with a three-level convolutional neural network; facial expressions are recognized with an end-to-end deep learning model that uses different poses and expressions for face image synthesis and expression recognition; and visual tracking is achieved by locating the eye centers from image gradients. After these three features are obtained, they are combined in a three-layer neural network for training, so that the machine makes a reasonable response to the combined features.
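A hedged sketch of this final fusion step is given below; the text specifies only that the three features are combined in a three-layer neural network, so all dimensions and the action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

landmark_dim, expr_dim, gaze_dim, n_actions = 10, 7, 2, 5

fusion_net = nn.Sequential(                      # three fully connected layers
    nn.Linear(landmark_dim + expr_dim + gaze_dim, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_actions))                    # one score per bound behavior

features = torch.cat([torch.randn(1, landmark_dim),   # facial point features
                      torch.randn(1, expr_dim),       # expression features
                      torch.randn(1, gaze_dim)], 1)   # gaze (pupil) position
response = fusion_net(features).argmax(dim=1)         # machine's response
```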

The present invention adopts the following technical solutions and implementation steps:

(1) Facial point detection method

This method adopts a three-level convolutional neural network structure: the first-level network is built with absolute-value rectification and a parameter-sharing mechanism, and the second and third levels are obtained following the idea of multi-level regression. Because face pose varies widely, detection is unstable and the relative position between facial points and detection points can change over a large range, producing large relative errors; the input region of the first-level network is therefore chosen large, to cover as many predicted positions as possible. The output of the first level provides the selection conditions for the subsequent detection regions, so the second- and third-level regions shrink accordingly. Each detection region is selected as the circular area that contains 75% of all predicted positions obtained by the previous level, centered on the point where the density of the previous level's predicted positions is highest.

Predicted positions are obtained again within the new prediction region, and this process is repeated until the detection region has shrunk to 1% of the first-level detection region; the positions obtained then are the final predictions for each point. This yields multiple networks with different input regions at each level.

The final predicted position x of a facial point can be formally expressed as an n-level cascade, with the following mathematical representation:

$$x = \frac{1}{l_1}\sum_{j=1}^{l_1} x_j^{(1)} + \sum_{i=2}^{n} \frac{1}{l_i}\sum_{j=1}^{l_i} \Delta x_j^{(i)}$$

where x is the final predicted position, $l_i$ is the number of predicted positions at level i of the n-level cascade, and the predictions at level i are denoted $x_1^{(i)}, \dots, x_{l_i}^{(i)}$; that is, $x_1^{(1)}$ is the first predicted position at level 1, and $\Delta x_j^{(i)}$ denotes the change of the j-th predicted position at level i relative to the corresponding j-th predicted position at level i-1.

This method uses a three-level convolutional network design. The first level comprises three deep convolutional networks with different detection regions: the detection region of the F1 network covers the whole face, that of the EN1 network covers only the eye and nose region, and that of the NM1 network covers only the nose and mouth region. The three networks apply the prediction method above to different regions of the same face simultaneously, and their predictions are averaged to reduce variance, giving the first-level predicted positions of all facial feature points and avoiding bias in the face prediction caused by overly prominent local features. Following the regression idea, the corresponding second- and third-level predicted positions are then obtained from the first-level positions for each of the three networks F1, EN1, and NM1. Because the input regions of the second and third levels are strictly limited by the first level's predictions, the second- and third-level predictions can reach very high accuracy, but they are also strictly constrained by that limit.

(2) Facial expression recognition method

For facial expression recognition, an end-to-end learning model is proposed that synthesizes face images along the two axes of pose and expression and performs expression recognition with pose held fixed. The model consists of a generator, two discriminators, and a classifier. Images are preprocessed before being fed into the model: a face detection algorithm is applied against a base library containing 68 landmark points. After preprocessing, the facial image is input to the generator G to produce an identity representation; concretely, there is a mapping f(x) that gives every input image a definite and unique identity representation, which is then concatenated with an expression code e and a pose code p to represent variations of the face. By applying a minimax algorithm between the generator G and the discriminator D and adding the corresponding labels at the decoder input, new labels for face images with different poses and expressions can be obtained. Two discriminator structures are used, denoted Datt and Di: Datt identifies and represents the degree of identity entanglement, while Di improves the quality of the generated images. After the face image has been synthesized, the classifier Cexp completes the expression recognition task; specifically, a deep learning algorithm is applied in the classifier to ensure that, in each representation layer, the key classification factors gradually stabilize while the characteristic information of each expression is preserved.

(3) Visual tracking method

The present invention adopts a detection-based tracking algorithm. By analyzing the gradient vector field of the captured face image, the relationship between a possible center and the directions of the image gradients is described mathematically: a candidate center c is set, a gradient vector is taken at each position, and the displacements are normalized so that they can be compared with the gradient directions. Computing the dot products between the normalized displacements $d_j$ relative to the candidate center and the gradient vectors $g_j$ yields the optimal center $c^{*}$ of the eye region in the face image, i.e. the pupil center position, given by:

$$c^{*} = \arg\max_{c}\;\frac{1}{N}\sum_{j=1}^{N}\left(d_j^{\top} g_j\right)^{2}$$

For a candidate center c, N different gradients are selected, corresponding to the normalized displacements $d_1, \dots, d_N$ and gradient vectors $g_1, \dots, g_N$; that is, the j-th selected gradient has normalized displacement $d_j$ and gradient vector $g_j$. The position variable at which the objective function attains its maximum is the optimal center position $c^{*}$. The displacement $d_j$ corresponding to each selected gradient is obtained as follows:

$$d_j = \frac{x_j - c}{\left\lVert x_j - c \right\rVert_{2}}$$

Scaling the displacements $d_j$ to unit length gives the same weight to different positions in the face image. To improve the method's robustness to linear changes in illumination and contrast, the gradient vectors $g_j$ are also scaled to unit length; the objective function then attains its maximum at the pupil center position. The complexity of the algorithm can be reduced further by considering only gradient vectors with significant magnitude.

Claims (2)

1. A method for processing face images based on intelligent human-computer interaction, characterized in that the specific implementation comprises the following three processes:
(1) a facial point detection method:
adopting a three-level convolutional neural network structure, using absolute-value rectification and a parameter-sharing mechanism to build the first-level convolutional neural network, and using the idea of multi-level regression to obtain the second- and third-level convolutional neural networks;
the input region of the first-level network is chosen large so as to cover as many predicted positions as possible; the second- and third-level detection regions are reduced accordingly, each detection region being selected as the circular area that contains 75% of all predicted positions obtained by the previous level, centered on the point where the density of the previous level's predicted positions is highest;
the predicted positions are obtained again within the new prediction region, and the process is repeated until the detection region has shrunk to 1% of the first-level detection region; the positions obtained then are the predicted positions of each point, giving multiple networks with different input regions at each level;
the final predicted position x of a facial point is represented as an n-level cascade, with the mathematical representation:

$$x = \frac{1}{l_1}\sum_{j=1}^{l_1} x_j^{(1)} + \sum_{i=2}^{n} \frac{1}{l_i}\sum_{j=1}^{l_i} \Delta x_j^{(i)}$$

where x is the predicted position, $l_i$ is the number of predicted positions at level i of the n-level cascade, the predictions at level i are denoted $x_1^{(i)}, \dots, x_{l_i}^{(i)}$ (so $x_1^{(1)}$ is the first predicted position at level 1), and $\Delta x_j^{(i)}$ denotes the change of the j-th predicted position at level i relative to the corresponding j-th predicted position at level i-1;
(2) a facial expression recognition method:
an end-to-end learning model is provided that synthesizes face images along the two axes of pose and expression and performs facial expression recognition with pose held fixed; the structure of the model consists of a generator, two discriminators, and a classifier;
images are preprocessed before being fed into the model, and a face detection algorithm is applied against a base library containing 68 landmark points; after preprocessing, the facial image is input to a generator G to produce an identity representation, each input image having a definite and unique identity representation, which is then concatenated with an expression code e and a pose code p to represent variations of the face; a minimax algorithm is applied between the generator G and the discriminator D, and corresponding labels are added at the decoder input to obtain new labels for face images with different poses and expressions;
two discriminator structures, denoted Datt and Di, are used, where Datt identifies and represents the degree of identity entanglement and Di improves the quality of the generated images; after the face image is synthesized, the classifier Cexp completes the facial expression recognition task of the face image;
(3) a visual tracking method:
a detection-based tracking algorithm is adopted; the gradient vector field of the captured face image is analyzed, the relationship between a possible center and the directions of the image gradients is described mathematically, a candidate center c is set, a gradient vector is provided at each position, and the displacements are normalized so that they can be compared with the gradient directions;
the dot products between the normalized displacements $d_j$ relative to the candidate center and the gradient vectors $g_j$ are computed to obtain the optimal center $c^{*}$ of the eye region in the face image, i.e. the pupil center position, given by:

$$c^{*} = \arg\max_{c}\;\frac{1}{N}\sum_{j=1}^{N}\left(d_j^{\top} g_j\right)^{2}$$

for a candidate center c, N different gradients are selected, corresponding to normalized displacements $d_1, \dots, d_N$ and gradient vectors $g_1, \dots, g_N$; that is, the j-th selected gradient has normalized displacement $d_j$ and gradient vector $g_j$, and the position variable at which the objective function attains its maximum is the optimal center position $c^{*}$; the displacement $d_j$ corresponding to each selected gradient is obtained as:

$$d_j = \frac{x_j - c}{\left\lVert x_j - c \right\rVert_{2}}$$

the displacements $d_j$ are scaled to unit length, giving the same weight to different positions in the face image, and the gradient vectors $g_j$ are likewise scaled to unit length, so that the maximum of the objective function is obtained at the pupil center position.
2. The method of claim 1, wherein:
the first-level convolutional neural network comprises three deep convolutional networks with different detection regions, wherein the detection region of the F1 network covers the whole face, that of the EN1 network covers only the eye and nose region, and that of the NM1 network covers only the nose and mouth region; the three networks simultaneously predict different regions of the same face, the obtained predicted values of the three networks are averaged, and, following the regression idea, the corresponding second- and third-level predicted positions are obtained from the first-level predicted positions for each of the three networks F1, EN1, and NM1.
CN202010232764.9A 2020-03-28 2020-03-28 Method for processing face image based on intelligent human-computer interaction Pending CN111428661A (en)

Priority Applications (1)

Application number: CN202010232764.9A; priority date: 2020-03-28; filing date: 2020-03-28; title: Method for processing face image based on intelligent human-computer interaction

Applications Claiming Priority (1)

Application number: CN202010232764.9A; priority date: 2020-03-28; filing date: 2020-03-28; title: Method for processing face image based on intelligent human-computer interaction

Publications (1)

Publication number: CN111428661A; publication date: 2020-07-17

Family ID: 71549134

Family Applications (1)

Application number: CN202010232764.9A; title: Method for processing face image based on intelligent human-computer interaction; priority date: 2020-03-28; filing date: 2020-03-28

Country Status (1)

CN: CN111428661A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169938A1 (en) * 2013-12-13 2015-06-18 Intel Corporation Efficient facial landmark tracking using online shape regression method
CN105868689A (en) * 2016-02-16 2016-08-17 杭州景联文科技有限公司 Cascaded convolutional neural network based human face occlusion detection method
CN108875624A (en) * 2018-06-13 2018-11-23 华南理工大学 Method for detecting human face based on the multiple dimensioned dense Connection Neural Network of cascade

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fabian Timm et al., "Accurate Eye Centre Localisation by Means of Gradients", https://www.researchgate.net/publication/221415814 *
Feifei Zhang et al., "Joint Pose and Expression Modeling for Facial Expression Recognition", CVPR *
Yi Sun et al., "Deep Convolutional Network Cascade for Facial Point Detection", http://mmlab.ie.cuhk.edu.hk/CNN FACEPOINT.HTM *

Similar Documents

Publication Publication Date Title
CN112418095B (en) A method and system for facial expression recognition combined with attention mechanism
CN106951867B (en) Face identification method, device, system and equipment based on convolutional neural networks
Yang et al. Faceness-net: Face detection through deep facial part responses
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
Várkonyi-Kóczy et al. Human–computer interaction for smart environment applications using fuzzy hand posture and gesture models
CN111967379B (en) Human behavior recognition method based on RGB video and skeleton sequence
CN104573665B (en) A kind of continuous action recognition methods based on improvement viterbi algorithm
Wu et al. Generalized zero-shot emotion recognition from body gestures
CN111444488A (en) Identity authentication method based on dynamic gesture
Li et al. Visual object tracking via multi-stream deep similarity learning networks
Al Farid et al. Single shot detector CNN and deep dilated masks for vision-based hand gesture recognition from video sequences
CN110956141A (en) A Rapid Analysis Method of Human Continuous Action Based on Local Recognition
Xie et al. Towards Hardware-Friendly and Robust Facial Landmark Detection Method
Hanzla et al. Human Pose Estimation and Event Recognition via Feature Extraction and Neuro-Fuzzy Classifier
Ghaleb et al. Multimodal fusion based on information gain for emotion recognition in the wild
Renjith et al. An effective skeleton-based approach for multilingual sign language recognition
CN107346207A (en) A kind of dynamic gesture cutting recognition methods based on HMM
Zhao et al. A local spatial–temporal synchronous network to dynamic gesture recognition
CN119339422A (en) A micro-expression recognition method based on visual Transformer
CN111814604A (en) A Pedestrian Tracking Method Based on Siamese Neural Network
CN112580527A (en) Facial expression recognition method based on convolution long-term and short-term memory network
CN111428661A (en) Method for processing face image based on intelligent human-computer interaction
Zhao et al. Channel self-attention residual network: Learning micro-expression recognition features from augmented motion flow images
Hao A Method for Estimating the Posture of Yoga Asansa Exercises Based on Lightweight OpenPose
Abbattista et al. A biometric-based system for unsupervised anomaly behaviour detection at the pawn shop

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 2020-07-17)