CN112307815A - Image processing method and device, electronic equipment and readable storage medium - Google Patents
- Publication number
- CN112307815A (application number CN201910685032.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- neural network
- network model
- sample
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
- G06T7/33—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/193—Preprocessing; Feature extraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/013—Eye tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Abstract
The embodiments of the present application provide an image processing method and apparatus, an electronic device, and a readable storage medium. The method includes: acquiring a face image of a user; and obtaining the sight focus position of the user from the face image using a neural network model. The method provided by the embodiments of the present application can effectively improve the accuracy of estimating the user's sight focus position.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a readable storage medium.
Background
At present, with the development of science and technology, various electronic devices have become an indispensable part of people's lives. In many application scenarios it is necessary to estimate the user's gaze focus when using an electronic device, i.e., the point on which the user's sight is focused — for example, selecting and starting an application program with the gaze (equivalent to using the gaze as a mouse), or pushing an advertisement according to the gaze position. In such scenarios, the user's sight position on the screen of the electronic device must be estimated accurately in real time. However, the estimation accuracy of existing schemes in practical applications still needs to be improved.
Disclosure of Invention
The application aims to provide an image processing method, an image processing device, electronic equipment and a readable storage medium, so as to improve the accuracy of the estimation of the key point position of the sight of a user. The scheme provided by the embodiment of the application is as follows:
in a first aspect, an embodiment of the present application provides an image processing method based on a neural network model, where the method includes:
acquiring a face image of a user;
and obtaining the sight focus position of the user based on the face image by using a neural network model.
In a second aspect, an embodiment of the present application provides a method for training a neural network model, where the method includes:
acquiring a training sample set, wherein the training sample set comprises sample images;
and training the initial target neural network model based on each sample image until the loss function is converged to obtain the trained target neural network model.
In a third aspect, an embodiment of the present application provides an image processing apparatus, including:
the image acquisition module is used for acquiring a face image of a user;
and the sight focus position determining module is used for obtaining the sight focus position of the user based on the face image by using the neural network model.
In a fourth aspect, an embodiment of the present application provides an apparatus for training a neural network model, where the apparatus includes:
the system comprises a sample acquisition module, a data acquisition module and a data processing module, wherein the sample acquisition module is used for acquiring a training sample set which comprises sample images;
and the model training module is used for training the initial target neural network model based on each sample image until the loss function is converged to obtain the trained target neural network model.
In a fifth aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor; wherein the memory has stored therein a computer program; the processor is configured to invoke the computer program to perform the method provided in the first aspect or the second aspect of the present application.
In a sixth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method provided in the first or second aspect of the present application.
The advantages of the technical solutions provided in the present application will be described in detail in the following embodiments with reference to the accompanying drawings, and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart illustrating a method for training a neural network model according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of a training method in an example of the present application;
FIG. 3 is a flow chart illustrating an image processing method according to an embodiment of the present disclosure;
FIG. 4 shows a schematic view of a screen marker in an example of the present application;
fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element of, and all combinations of, one or more of the associated listed items.
Generally, in scenarios that require estimating sight-line key point positions, one important property of a sight-line estimation scheme is accuracy: the trained estimator must be accurate not only on the training set but also on the actual test set, i.e., the training algorithm must have good generalization performance. Another important property is stability: when the user fixates on the same point, or moves slightly near a certain point, the sight position estimated by the algorithm must be not only accurate but also free of large jitter. However, the generalization performance of existing algorithms is poor. Regarding stability, most existing video-based or multi-frame methods apply post-processing after prediction and therefore suffer from delay, whereas sight-line estimation requires fast and accurate results in practical applications.
To address the problems in the prior art, the present application provides, on the one hand, a method for training a neural network model that tackles the problems at the training stage: taking a single picture as input, it improves the stability of sight-line estimation without loss of real-time performance, and it can be used in combination with video-based methods. On the other hand, the method can correspondingly process the initial sight-line key point positions obtained from the model's prediction, yielding sight-line key point positions with higher accuracy and stability.
The scheme provided by the present application is specifically described below.
An embodiment of the present application provides a training method of a neural network model, as shown in fig. 1, the method may include:
step S101: acquiring a training sample set, wherein the training sample set comprises sample images;
step S102: and training the initial target neural network model based on each sample image until the loss function is converged to obtain the trained target neural network model.
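Steps S101 and S102 above can be sketched, in a non-authoritative way, as a generic train-until-the-loss-converges loop. The `step_fn` callback, the convergence tolerance, and the toy fitting problem used to exercise it are illustrative assumptions, not details from this application:

```python
import numpy as np

def train_until_converged(samples, targets, step_fn, init_params,
                          tol=1e-6, max_steps=10000):
    """Generic sketch of steps S101-S102: iterate a training step on the
    sample images until the change in loss falls below tol, i.e., until
    the loss function has "converged".
    step_fn(params, samples, targets) -> (updated_params, loss)."""
    params, prev_loss = init_params, float("inf")
    for _ in range(max_steps):
        params, loss = step_fn(params, samples, targets)
        if abs(prev_loss - loss) < tol:   # loss converged
            break
        prev_loss = loss
    return params
```

For example, with `step_fn` implementing one gradient-descent step of a scalar least-squares fit, the loop converges to the underlying coefficient.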
In an optional embodiment of the application, in the step S102, training the initial target neural network model based on each sample image specifically includes:
acquiring a first neural network model;
training the first neural network model at least twice based on each sample image to obtain the first neural network model after each training;
predicting each sample image through the neural network model after each training to obtain the prediction result of each sample image corresponding to the neural network model after each training;
and deleting the sample images in the training sample set based on the difference between the prediction result of each time corresponding to each sample image and the real result of the sample to obtain the processed sample images.
In an optional embodiment of the present application, when the first neural network model is trained at least twice based on the sample images, the sample images used in each subsequent training are obtained by deleting, from the sample images used in the previous training, a set number or a set proportion of sample images whose prediction results differ least from their real results.
That is to say, in the scheme provided by this embodiment of the present application, before the model is trained on the sample images, the above optional preprocessing may first be applied to them — i.e., denoising the training sample set to filter out some poor sample images — and the target neural network model is then trained on the remaining good sample images, thereby improving the accuracy of the trained model.
The preprocessing scheme for the sample image is further described below in connection with an example.
In this example, an iterative screening strategy is used, which aims to algorithmically remove the noise in part of the data and improve its quality, thereby obtaining a better trained model. Let the entire dataset (i.e., the training sample set) contain N samples x_i, i = 1, …, N, each with a known ground-truth value gt_i, i.e., its real result. The number of training iterations in this example is M (M ≥ 2). The number of samples deleted each time is Nd = N/M; that is, the number of samples used in the next training iteration is the current number minus Nd. The specific steps are as follows:
1. Initialize the training sample set as the "data set" (containing N samples). Initialize the neural network parameters. Select a loss function. Initialize the learning rate (e.g., 0.01). The neural network structure (i.e., the first neural network model) may be any prior-art network structure, such as AlexNet. For training the neural network we use a ranking (ordering) loss function, but other loss functions may be chosen.
2. And training the neural network by using the training sample set to obtain the model and the model parameters after the training.
3. Use the neural network trained in step 2 to compute a predicted value y_i for each sample x_i in the training sample set, and compute the error between the predicted value and the true value, err_i = distance(gt_i, y_i), where the error metric distance may be the Euclidean distance. Sort all errors in ascending order.
4. Select the first Nd samples (i.e., the Nd samples with the smallest error) and delete them from the training sample set, so that after the t-th execution of step 2 the size of the current training sample set becomes N − Nd·t. Save the current neural network parameter model and adjust the learning rate (any common learning-rate adjustment algorithm may be selected).
5. If the remaining number of samples is not zero, return to step 2 and continue; i.e., repeat the above steps until the remaining number of samples reaches zero or M iterations of training have been performed.
6. Using all M saved neural network parameter models (i.e., the M sets of parameters of the first neural network model), compute the predicted values for all N samples of the entire dataset, and compute the error between each predicted value and its true value; each sample thus obtains M error results. For each model, sort its N error results in ascending order, producing M sequences, where the N values of each sequence represent that model's errors over all samples. If a sample x is ranked in the last r% of the sequence (the set proportion in this example is r%) in all M sequences, x is considered a noise sample and is deleted from the dataset, finally yielding a clean dataset — i.e., the training sample set with N × r% of the N samples deleted.
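The six steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `train_fn` and `predict_fn` are hypothetical placeholders for the actual network training and inference, the error metric is the absolute difference rather than a Euclidean distance over vectors, and learning-rate adjustment is omitted:

```python
import numpy as np

def iterative_screening(X, y, train_fn, predict_fn, M=3, r=0.1):
    """Iterative sample screening (steps 1-6): repeatedly train, drop the
    Nd smallest-error samples, save each model, then flag as noise any
    sample ranked in the worst r% of errors under every saved model."""
    N = len(X)
    Nd = N // M                      # samples deleted per iteration
    idx = np.arange(N)               # indices still in the training set
    models = []
    for _ in range(M):               # steps 2-5
        model = train_fn(X[idx], y[idx])
        models.append(model)         # save the current parameter model
        err = np.abs(predict_fn(model, X[idx]) - y[idx])
        keep = np.argsort(err)[Nd:]  # delete the Nd smallest-error samples
        idx = idx[keep]
        if len(idx) == 0:
            break
    # Step 6: cross-model filtering over the FULL dataset
    ranks = []
    for model in models:
        err_all = np.abs(predict_fn(model, X) - y)
        ranks.append(np.argsort(np.argsort(err_all)))  # error rank per sample
    ranks = np.stack(ranks)
    cutoff = int(N * (1 - r))
    noisy = np.all(ranks >= cutoff, axis=0)  # worst r% under every model
    return np.where(~noisy)[0], models
```

The returned index array selects the "clean" dataset; the saved `models` list corresponds to the M parameter models of step 6.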
In an alternative embodiment of the present application, training the initial target neural network model based on each sample image includes:
inputting each sample image into a teacher network model to obtain an output result of each sample image;
taking each output result as a real result of each corresponding sample image, and training a target neural network model based on each sample image;
the teacher network model is any one randomly selected teacher network model in the teacher queue;
taking each output result as a real result of each corresponding sample image, and after training the target neural network model once based on each sample image, the method further comprises the following steps:
adding the target neural network model after each training to a teacher queue;
the model when the teacher queue is initialized is empty, and the real result of each sample image during initialization is the real result corresponding to the label of the sample image.
In an alternative embodiment of the present application, training the initial target neural network model based on each sample image includes:
initializing one part of model parameters of the target neural network model after each training, taking the other part of model parameters and the initialized part of model parameters as new model parameters of the target neural network model, and carrying out the next training of the target neural network model.
In an optional embodiment of the present application, initializing a part of model parameters of the target neural network model after each training includes:
determining the importance degree of each filter in the target neural network model;
determining a target filter needing parameter initialization according to the importance degree of each filter;
model parameters of each target filter are initialized.
In an optional embodiment of the present application, initializing the model parameters of each target filter includes:
decomposing a filter parameter matrix of a neural network layer where a target filter is located to obtain an orthogonal matrix of the filter parameter matrix;
for the neural network layer where the target filter is located, determining a feature vector corresponding to each target filter in an orthogonal matrix corresponding to the neural network layer according to the position of each target filter in the neural network layer in the corresponding neural network layer;
determining two norms of the feature vectors of all target filters in the same neural network layer according to the feature vectors corresponding to all the target filters in the same neural network layer;
and for each target filter, determining initialized parameters of the target filter according to the feature vector corresponding to the target filter and the corresponding two-norm in the neural network layer to which the target filter belongs.
The above examples of the present application provide a training method that can effectively reduce overfitting. The method is based on the basic framework of knowledge distillation, to which two modules (a cosine-similarity-based pruning module and an aligned orthogonal initialization module) can be added to optimize the training process, thereby improving the accuracy and stability of the model.
The content referred to in the above examples is further explained below with reference to an example.
In this example, let the neural network model be net, with parameters W. The number of iterations is K, and the number of traversals of the training data per iteration is L. The pruning rate is p% (i.e., the proportion of filters whose network parameters are to be re-determined, relative to the model's total number of filters), and the maximum per-layer pruning rate is p_max% (i.e., for any one layer of the model's network structure, the proportion of re-determined filters relative to that layer's total filter count does not exceed p_max%). The algorithmic process of the training method can be represented as:
the initialization teacher queue is empty. Parameters of the net are initialized.
On finishing, output the last network model in the current teacher queue as the training result. The neural network structure may use a prior-art structure such as AlexNet; the loss function may use an existing technique such as edit loss.
The pruning algorithm and the reinitialization algorithm are described below separately.
Let the filter parameters of each layer of the neural network model be WF, with shape (Nout, C, Kw, Kh), where Nout is the number of filters in the layer, C is the number of input channels, and Kw and Kh are the width and height of the layer's filters. Reshape WF into Nout one-dimensional vectors Wf_i, i = 1, …, Nout, each of dimension C × Kw × Kh — i.e., a vector with 1 row and C × Kw × Kh columns — where Nout represents the number of filters in a layer.
The pruning algorithm may specifically include:
1. Normalize each filter vector according to formula (1) (the norm below may be the Euclidean norm; other normalization methods may also be adopted): Ŵf_i = Wf_i / ‖Wf_i‖.
2. Compute Simf (the scores of all filters of all layers) according to formula (2):
Simf = {Simf_k, k = 1, …, Layer_num}, i.e., the set over all layers.
In the above formulas, Layer_num is the number of layers in the model's network structure, and Simf_k denotes the scores of the filters of the k-th layer; Simf_k is thus also a set whose elements are the Simf of each filter in that layer (e.g., the k-th layer above with Nout filters). For the i-th filter of a layer, the correlation between its network parameters and those of the j-th filter of the same layer can be obtained by computing the dot product of Ŵf_i and Ŵf_j; in this way the correlations between the i-th filter and every filter of its layer can be computed, and aggregating the resulting Nout correlations yields the Simf of the i-th filter.
It should be noted that the Simf of each filter represents the importance of the filter, and the larger the Simf is, the lower the importance is.
3. Sort the Simf of all filters in ascending order; the filters ranked in the last p% are the pruned filters W′. However, the proportion of pruned filters in any single layer must not exceed p_max%.
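For a single layer, the pruning steps above can be sketched as follows. The aggregation of the Nout correlations by a sum of absolute dot products is an assumption (the extracted text does not state the aggregation explicitly):

```python
import numpy as np

def select_pruned_filters(WF, p=0.25, p_max=0.5):
    """Cosine-similarity pruning (sketch of steps 1-3) for one layer:
    normalize each flattened filter, score it by its summed absolute
    correlation with every filter of the layer (assumed aggregation),
    and select the highest-scoring (i.e., most redundant, least
    important) filters for re-initialization, capped at p_max per layer.
    WF has shape (Nout, C, Kw, Kh); returns indices of filters to prune."""
    Nout = WF.shape[0]
    flat = WF.reshape(Nout, -1)                              # Nout x (C*Kw*Kh)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                                      # pairwise dot products
    simf = np.abs(sim).sum(axis=1)                           # per-filter score
    n_prune = min(int(round(Nout * p)), int(Nout * p_max))   # per-layer cap
    order = np.argsort(simf)                                 # ascending score
    return np.sort(order[Nout - n_prune:])                   # last p% = pruned
```

Larger Simf means more redundancy with the rest of the layer, hence lower importance, matching the note above.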
The specific steps of the reinitialization algorithm may include:
1. Perform QR decomposition on each layer's WF of W to obtain an orthogonal matrix Worth. Take the values at the positions corresponding to W′ to obtain a matrix Worth′ of the same size as W′ (i.e., Worth′ is computed independently for each layer);
2. From Worth′, compute the parameters Wpara′ aggregated with Batch Normalization (BN) according to a formula (computed independently for each filter), e.g., Wpara′ = Worth′ · BNscale / √BNvar, where BNscale and BNvar are Batch Normalization parameters: BNscale is the BN layer's network coefficient (scale) and BNvar is the variance of the BN layer's network parameters.
It will be appreciated that in practice this step may be omitted if the BN layer is not connected after the filter, i.e. the convolutional layer.
3. Compute the two-norm of each row of Wpara′, i.e., ‖Wpara′_{k,i}‖ (the two-norm of the i-th pruned filter of the k-th layer), and record the maximum and minimum of all two-norms obtained for each layer as max_norm and min_norm respectively.
4. Obtain the re-initialized weight Wr_{k,i} according to a formula (computed for each filter of each layer), e.g., Wr_{k,i} = scale_aligned · Worth′_{k,i} / ‖Worth′_{k,i}‖, where scale_aligned may be sampled from a uniform distribution on (min_norm, max_norm).
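The four re-initialization steps can be sketched for one layer as below. Because the patent's own formulas did not survive extraction, the BN-aggregation expression and the final rescaling formula are assumptions, as flagged in the comments:

```python
import numpy as np

def reinit_filters(WF, prune_idx, bn_scale=None, bn_var=None, seed=0):
    """Aligned orthogonal re-initialization (sketch of steps 1-4):
    QR-decompose the layer's flattened filter matrix, take the orthogonal
    rows at the pruned positions, and rescale each to a norm drawn
    uniformly from the range of BN-aggregated filter norms. The BN
    aggregation and the rescaling formula are assumptions."""
    rng = np.random.default_rng(seed)
    Nout = WF.shape[0]
    flat = WF.reshape(Nout, -1).astype(float)
    # Step 1: QR decomposition -> orthonormal rows; pick pruned positions
    q, _ = np.linalg.qr(flat.T)           # columns of q are orthonormal
    worth = q.T                            # row i corresponds to filter i
    # Step 2: aggregate BN scale/variance into the filters (skipped if no BN,
    # matching the note that this step may be omitted without a BN layer)
    if bn_scale is not None:
        wpara = worth * (bn_scale / np.sqrt(bn_var))[:, None]
    else:
        wpara = worth
    # Step 3: per-layer min/max of the pruned filters' two-norms
    norms = np.linalg.norm(wpara[prune_idx], axis=1)
    lo, hi = norms.min(), norms.max()
    # Step 4: rescale each pruned orthogonal row to a sampled norm
    out = flat.copy()
    for i in prune_idx:
        scale = rng.uniform(lo, hi)        # scale_aligned ~ U(min_norm, max_norm)
        row = worth[i] / (np.linalg.norm(worth[i]) + 1e-12)
        out[i] = scale * row
    return out.reshape(WF.shape)
```

Non-pruned filters keep their original weights; only the rows listed in `prune_idx` are replaced.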
In an alternative embodiment of the present application, training the initial target neural network model based on each sample image may include:
and determining the prediction loss of the target neural network model to each sample image during each training, correcting each sample image according to the prediction loss, and performing the next training of the target neural network model based on each corrected sample image.
In an optional embodiment of the present application, determining a prediction loss of the target neural network model to each sample image during each training, and correcting each sample image according to the prediction loss may specifically include:
determining the prediction loss of the target neural network model to each sample image during each training;
and determining the disturbance of the prediction loss on each sample image, and correcting each sample image based on the disturbance.
In an optional embodiment of the present application, determining a disturbance of a prediction loss on each sample image, and correcting each sample image based on the disturbance includes:
for each sample image, determining the gradient change of the prediction loss of the sample image to each pixel point in the sample image;
determining the disturbance of the prediction loss to each pixel point according to the gradient change corresponding to each pixel point;
and superposing the disturbance corresponding to each pixel point with the original pixel value of the pixel point corresponding to the sample image to obtain the corrected sample.
In an alternative embodiment of the present application, before determining the prediction loss of the target neural network model for each sample image in each training, the method includes:
for each sample image, cutting the sample image to obtain a global image and a local image of the sample image;
determining the prediction loss of the target neural network model to each sample image during each training, correcting each sample image according to the prediction loss, and performing the next training of the target neural network model based on each corrected sample image, wherein the training comprises the following steps:
taking the global image and the local image corresponding to each sample image as new sample images, and determining the prediction loss of the target neural network model to each new sample image during each training;
correcting each new sample image corresponding to the prediction loss of each new sample image;
and training the target neural network model next time based on each modified new sample image.
The above-described alternative embodiments of the present application further provide a training method that can effectively increase the robustness of the model. It is to be understood that this method may be used in combination with the above method for reducing overfitting, may be used alone, or may build on the result obtained by that method — i.e., the result obtained by the overfitting-reduction method may be used as the initial values here.
The training method for enhancing the robustness of the model is described below with reference to an example.
In this example, the target neural network model is exemplified by a model for predicting the positions of the gaze keypoints of the user in the face image. The training sample set, i.e. the data set, is a face image. A flowchart of the training method is shown in fig. 2, and the specific process may include:
1.1 Input a random picture X from the data set and, according to the face key point detection result, crop it into a face picture X_f (global image), a left-eye picture X_l (local image), and a right-eye picture X_r (local image); adjust the three pictures to preset fixed sizes using bilinear interpolation and output them. Assume the preset fixed sizes corresponding to the three pictures are 64×64, 48×64, and 48×64, respectively.
1.2 Determine whether adversarial pictures (the modified images in this example) have already been generated in this training iteration: if not, output the original three pictures; if so, output the latest three adversarial pictures.
1.3 Input the three pictures into the neural network model and compute the network output P_x = f(X_f, X_l, X_r), a vector representing the sight-line key point positions. Then compute and output the Loss through the sorting (ranking) loss function. The calculation formula is:
Loss_i = −Y′_{x,i} · log(P_{x,i}) − (1 − Y′_{x,i}) · log(1 − P_{x,i})
where i denotes the i-th component of the vector P_x or Y′_x, bin_num denotes the total number of components, and Y′_x is the vector representation of the correct output value (i.e., the ground truth) corresponding to the input picture, i.e., the actual sight-line key point positions.
1.4 Compute the gradients of the loss with respect to each of the three pictures to obtain three groups of adversarial perturbations. Taking the face picture as an example, the perturbation may take the form δ_f = α · sgn(∇_{X_f} Loss), where p_bin is the index of the last component of the P_x vector that is greater than a set value (e.g., 0.5); α is a hyperparameter representing the step size, optionally 1.0; ∇ denotes the gradient operator; sgn() is the sign function; and k is a set value — it may, for example, be taken as 4, with the understanding that 2k + 1 cannot exceed the total number of neurons in the model's output layer. The gradient ∇_{X_f} Loss gives the gradient change of the loss at each pixel of the image. Adding the three groups of adversarial perturbations to the corresponding pictures yields three adversarial pictures X_f^adv, X_l^adv, X_r^adv.
1.5 Determine whether the adversarial step count has reached the preset value step: if not, return to step 1.2 with the three adversarial pictures as input and continue through the subsequent steps; if so, proceed to step 1.6. The preset value step can be configured according to actual requirements and may be 3.
1.6 inputting the three adversarial pictures into the neural network model and calculating the adversarial loss Loss_adv. The calculation method is the same as for Loss, with the input pictures replaced by Xf_adv, Xl_adv, Xr_adv.
1.7 computing the weighted sum of the adversarial loss Loss_adv and the original loss Loss according to a preset percentage c, Loss_total = c * Loss + (1 - c) * Loss_adv, taking it as the overall loss to compute the gradient of all parameters of the neural network model, and performing gradient backpropagation. Optionally, the preset percentage c may be 80%.
1.8 judging whether the number of training steps reaches a preset upper limit s: if not, repeating steps 1.1 to 1.7; if yes, outputting the parameters of the neural network model and finishing the training process. The upper limit s may be 200000 steps.
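The training loop of steps 1.1 to 1.8 can be sketched as follows. This is a toy illustration, not the patent's model: a linear layer with sigmoid outputs stands in for the CNN, a plain sign-of-gradient step stands in for the full perturbation formula (which restricts components around p_bin and applies a random mask), and all function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, x):  # P_x = f(x) for the toy linear model
    return sigmoid(W @ x)

def bce(P, Y):  # the Loss of step 1.3, summed over components
    P = np.clip(P, 1e-7, 1 - 1e-7)
    return float(np.sum(-Y * np.log(P) - (1 - Y) * np.log(1 - P)))

def grad_wrt_input(W, x, Y):  # d(loss)/d(input) for this toy model
    return W.T @ (forward(W, x) - Y)

def train_step(W, x, Y, steps=3, alpha=1.0, c=0.8, lr=0.01):
    """One pass of steps 1.1-1.7: build adversarial input, mix losses, update W."""
    x_adv = x.copy()
    for _ in range(steps):  # steps 1.2-1.5: iterated sign-of-gradient perturbation
        x_adv = x_adv + alpha * np.sign(grad_wrt_input(W, x_adv, Y))
    loss = bce(forward(W, x), Y)          # original loss
    loss_adv = bce(forward(W, x_adv), Y)  # step 1.6: adversarial loss
    total = c * loss + (1 - c) * loss_adv # step 1.7: weighted overall loss
    # gradient of the weighted loss w.r.t. W (chain rule for the toy model)
    gW = (c * np.outer(forward(W, x) - Y, x)
          + (1 - c) * np.outer(forward(W, x_adv) - Y, x_adv))
    return W - lr * gW, total

W = rng.normal(size=(4, 16)) * 0.1
x = rng.normal(size=16)
Y = np.array([1.0, 1.0, 0.0, 0.0])
W, total = train_step(W, x, Y)
print(total)
```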
In an experiment, a neural network model for estimating the user's sight line (i.e. the positions of the user's sight-line keypoints) was trained based on the training method provided by the embodiment of the application, and the model trained with this scheme and a model trained with an existing common training method were tested on the GAZE_CN_DB and GAZE_start_DB databases; the experimental results are shown in the following table:
In the table, an error of 241 pixels means that the deviation between the predicted coordinate (predicted sight-line keypoint position) and the actual coordinate (actual sight-line keypoint position) is 241 pixels, and a standard deviation of 63.88 means that the standard deviation calculated from the prediction deviation of each experimental sample is 63.88. As can be seen from the table, compared with the existing common training method, the model trained based on the scheme of the embodiment of the application effectively improves the stability of the model's prediction results.
An embodiment of the present application provides an image processing method based on a neural network model, as shown in fig. 3, the method may mainly include:
step S110: acquiring a face image of a user;
step S120: and obtaining the sight focus position of the user based on the face image by using a neural network model.
By adopting the image processing method provided by the embodiment of the application, the sight focus position of the user, namely the position of the focus point at which the user's eyes are directed, can be determined according to the user's face image.
In an alternative embodiment of the present application, obtaining the gaze focus position of the user based on the face image using the neural network model includes:
acquiring a position adjustment parameter;
obtaining a predicted sight focus position of the user based on the face image by using a neural network model;
and adjusting the predicted sight focus position based on the position adjustment parameter to obtain the adjusted sight focus position.
Alternatively, after the predicted gaze focal position of the user is obtained based on the neural network model, the predicted position may be adjusted based on the position adjustment parameter to obtain the gaze focal position of the user, so as to improve the accuracy of the gaze focal position.
In an alternative embodiment of the present application, the position adjustment parameter may be obtained by:
displaying the calibration object to a user, and acquiring a current face image of the user;
obtaining a predicted sight focus position of the user corresponding to the current face image based on the current face image by using a neural network model;
and determining position adjusting parameters according to the predicted sight focus position of the user corresponding to the current face image and the position of the calibration object.
In this embodiment, the user can be guided to gaze at the calibration object by displaying the calibration object to the user; the user's face image at that moment is acquired, and the position adjustment parameter is determined based on the predicted sight focus position of the face image acquired at that moment and the position of the calibration object.
In practical application, the number of the calibration objects can be configured according to practical needs, and can be one or more. The style of the calibration object is not limited in the embodiments of the present application, and may be a calibration point.
As an example, a schematic diagram of a calibration object is shown in fig. 4, and as shown in the diagram, the calibration object in this example may be 3 specific calibration points shown in the diagram, and the step of determining the position adjustment parameter f (x) based on the 3 calibration points may include:
1. A specific calibration point is displayed on the screen of an electronic device (a mobile phone in this example) each time to guide the user to gaze at that point, and n (n ≥ 1) pictures are collected at that calibration point through the phone's visible-light camera; for the 3 specific calibration points, n pictures are collected for each, i.e. n pictures are acquired while each calibration point is displayed. The actual position of each calibration point on the screen, i.e. the calibration point's own coordinates, is recorded as g1, g2, g3. For the pictures of each calibration point, the user's predicted sight focus position on the screen corresponding to each picture can be predicted through the neural network model, obtaining the predicted sight focus position corresponding to each calibration point, i.e. the predicted coordinates, denoted p1, p2, p3, where p1, p2, p3 correspond to g1, g2, g3 respectively. In practical applications, when n > 1 (for example, n may be 3), p1, p2 and p3 may each be taken as the average of the predicted coordinates of the n pictures of the corresponding calibration point.
2. After obtaining g1, g2, g3, p1, p2 and p3, the position adjustment parameters may be determined based on g1, g2, g3, p1, p2 and p3.
In this example, an expression of an optional position adjustment parameter f (x) is given, which is specifically as follows:
where x in the expression represents a coordinate, namely the predicted sight focus position to be adjusted when the position is adjusted based on this function, and scr is a preconfigured maximum position.
It should be noted that, in practical applications, the coordinates of the sight focus position in different dimensions need to be adjusted separately. For example, if the focus position has two directions (e.g. horizontal direction X and vertical direction Y), the corresponding adjustment value in the horizontal direction is obtained from the predicted horizontal sight focus coordinate through the above function, and the adjusted horizontal sight focus coordinate is obtained from that adjustment value and the predicted horizontal coordinate; similarly, the corresponding adjustment value in the vertical direction is obtained from the predicted vertical sight focus coordinate through the above function, and the adjusted vertical sight focus coordinate is obtained from that adjustment value and the predicted vertical coordinate. Accordingly, p1, p2, p3, g1, g2 and g3 also need to be split into their coordinate values in each direction so that the corresponding calculation can be performed per direction.
Based on the scheme provided by the embodiment of the application, when the focus point of the user's sight line on the electronic device needs to be determined, a face image of the user using the device is collected, the image is input into the neural network model, the predicted sight focus position x is output, the corresponding adjustment amount F(x) is obtained through the adjustment function, F(x) is added to the predicted sight focus position to obtain the adjusted position F(x) + x, and the adjusted position is taken as the user's sight focus position on the electronic device.
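The calibration flow above can be sketched in code. Since the explicit f(x) expression appears only in the patent's figure, the sketch below assumes a simple per-axis linear correction fitted by least squares to the three (p_i, g_i) pairs, applied as F(x) + x; the calibration coordinates and function names are hypothetical:

```python
import numpy as np

def fit_adjustment(p, g):
    """Fit, per axis, a linear correction F(x) = a*x + b so that p_i + F(p_i) ~ g_i.

    p, g : (3, 2) arrays of predicted and actual calibration-point coordinates.
    The linear form of F is an assumption; the patent gives f(x) only in a figure.
    """
    params = []
    for axis in range(p.shape[1]):
        A = np.stack([p[:, axis], np.ones(len(p))], axis=1)
        target = g[:, axis] - p[:, axis]  # residual the correction must supply
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        params.append(coef)
    return params  # [(a, b) for the X axis, (a, b) for the Y axis]

def adjust(x, params):
    """Adjusted position = F(x) + x, applied per axis as described in the text."""
    return np.array([x[i] + params[i][0] * x[i] + params[i][1]
                     for i in range(len(x))])

# Hypothetical predicted (p1..p3) and actual (g1..g3) calibration coordinates
p = np.array([[100.0, 200.0], [400.0, 210.0], [250.0, 600.0]])
g = np.array([[110.0, 190.0], [420.0, 205.0], [260.0, 620.0]])
params = fit_adjustment(p, g)
print(adjust(np.array([100.0, 200.0]), params))
```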
In an optional embodiment of the application, the obtaining, in the step S120, the focal position of the line of sight of the user based on the face image by using the neural network model may include:
obtaining a predicted sight focus position of the user based on the face image by using a neural network model;
determining a predicted loss of the predicted gaze focal position;
determining a confidence level of the predicted gaze focal position based on the predicted loss;
if the confidence coefficient is greater than the set threshold value, determining the predicted sight line focus position as the sight line focus position of the user;
if the confidence coefficient is not greater than the set threshold, the predicted sight line focus position is adjusted to obtain the adjusted sight line focus position, or the sight line focus position corresponding to the previous frame of face image is determined as the sight line focus position of the user.
In an alternative embodiment of the present application, determining a confidence level of the predicted gaze focal position based on the predicted loss comprises:
determining at least two perturbations of the prediction loss to the facial image;
respectively correcting the face image based on each disturbance to obtain at least two corrected images;
obtaining a predicted sight focus position corresponding to each corrected image through a neural network model;
and obtaining the confidence coefficient according to the predicted sight focus position corresponding to each corrected image.
In an optional embodiment of the present application, obtaining a confidence level according to a predicted gaze focus position corresponding to each corrected image includes:
and determining a standard deviation according to the predicted sight focus position corresponding to each corrected image, and taking the reciprocal of the standard deviation as a confidence coefficient.
In an alternative embodiment of the present application, the determining of at least two perturbations of the facial image by the prediction loss comprises at least one of:
determining a predicted loss of an initial gaze focus position corresponding to the facial image relative to at least two directions; determining the disturbance of the initial sight focus position to the face image in each direction based on the corresponding prediction loss in each direction;
at least two perturbations of the initial gaze focal position to the face image are determined based on the at least two perturbation coefficients.
In an alternative embodiment of the present application, obtaining the predicted gaze focus position of the user based on the facial image using a neural network model includes:
cutting the face image to obtain a global image and a local image of the face image;
inputting the global image and the local image into a neural network model to obtain a predicted sight focus position of the user;
determining at least two perturbations of the prediction loss to the facial image, including:
determining at least two kinds of disturbance of prediction loss to each image in a global image and a local image;
correcting the face image based on each kind of disturbance to obtain at least two corrected images, including:
respectively correcting corresponding images based on at least two kinds of disturbance corresponding to each image in the global image and the local image to obtain at least two corrected images corresponding to each image;
obtaining the predicted sight focus position corresponding to each corrected image through a neural network model, wherein the predicted sight focus position comprises the following steps:
inputting each group of corrected images into a neural network model to obtain a predicted sight focus position corresponding to each group of corrected images, wherein each group of corrected images comprises a corrected image corresponding to each image obtained by cutting the face image;
obtaining a confidence level according to the predicted sight line focus position corresponding to each corrected image, comprising:
and obtaining the confidence degree based on the corresponding predicted sight line focus position of each group of modified images.
In an alternative embodiment of the present application, determining a perturbation of the facial image by the prediction loss comprises:
determining the gradient change of the prediction loss to each pixel point in the face image;
determining the disturbance of the prediction loss to each pixel point according to the gradient change corresponding to each pixel point;
correcting the face image based on the disturbance to obtain a corrected image, comprising:
and superposing the disturbance corresponding to each pixel point with the original pixel value of the pixel point corresponding to the face image to obtain the corrected image.
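The two steps above (deriving a per-pixel perturbation from the loss gradient, then superposing it onto the original pixel values) can be sketched as follows; taking the sign of the gradient, consistent with the sgn() formulas elsewhere in the text, is an assumption of this minimal illustration:

```python
import numpy as np

def perturb_and_correct(image, loss_grad, alpha=1.0):
    """Per-pixel perturbation from the loss gradient, superposed onto the image.

    image     : H x W array of pixel values.
    loss_grad : H x W array, d(prediction loss)/d(pixel) at each pixel.
    alpha     : step-size hyperparameter (cf. alpha in the text).
    """
    perturbation = alpha * np.sign(loss_grad)  # perturbation for each pixel
    corrected = image + perturbation           # superposition with original pixels
    return corrected, perturbation

img = np.zeros((2, 2))
grad = np.array([[0.3, -0.7], [0.0, 1.2]])
corrected, pert = perturb_and_correct(img, grad)
print(corrected)  # sign of the gradient at each pixel
```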
The above alternatives provided by the embodiments of the present application are specifically described below with reference to a specific example. In this example, the input is image data captured by a visible light camera of an electronic device (e.g., a cell phone), such as a single frame picture containing a human face. The image processing method in this example mainly includes the following steps:
1. The input of this step is the face image to be processed, namely a single-frame picture X. According to the face keypoint detection result, the picture can be cut into a face picture Xf, a left-eye picture Xl and a right-eye picture Xr, and the three pictures are resized to preset fixed sizes using bilinear interpolation and output. For example, the preset fixed sizes corresponding to the face image Xf, the left-eye image Xl and the right-eye image Xr may be 64p×64p, 48p×64p and 48p×64p respectively, where p denotes a pixel.
It can be understood that the face image Xf is the global image in this example, and the left-eye image Xl and the right-eye image Xr are the local images in this example.
2. Inputting the three pictures into the neural network model to obtain the output Px; in this example the output is the feature vector corresponding to the initial sight focus position of the single-frame picture X. The dimension of Px is bin_num, i.e. the total number of components of the vector, corresponding to the number of neurons in the model output layer; the value of the i-th component of Px may be denoted Pxi.
3. Based on Pxi, a plurality of perturbations are determined. In this example, the face picture Xf is taken as the example. In the notation for the perturbations corresponding to the face picture Xf, f indicates the corresponding face picture; l corresponds to the direction, and in this example l takes two values, e.g. 1 and 2, where 1 corresponds to the leftward perturbation and 2 to the rightward perturbation; g and j correspond to the perturbation coefficients, whose explanation follows later.
For the face picture Xf, a plurality of perturbations can be obtained based on Pxi, and thus a plurality of adversarial pictures (i.e. corrected face pictures) are obtained. That is, each perturbation is superposed (pixel-value superposition) with the face picture Xf to obtain a modified face picture.
In this example, the specific calculation formula is:
Loss_test_l_i = -Log(1 - P_xi)
Loss_test_r_i = -Log(P_xi)
In these expressions, Loss_test_l_i denotes the prediction loss of the i-th component in the leftward direction, and Loss_test_r_i denotes the prediction loss of the i-th component in the rightward direction; Loss_test_l denotes the prediction loss of the whole predicted sight focus position in the leftward direction in this example, and Loss_test_r denotes the prediction loss in the rightward direction. k is a set value, for example 4; it can be understood that 2k + 1 cannot exceed the total number of neurons in the output layer of the model. p_bin denotes the index of the last component (neuron index) in the Px vector that is greater than a set value (e.g., 0.5). The perturbation symbols denote the leftward and rightward perturbations respectively, and ∇ denotes the gradient operator. α_j denotes the step size: for different values of j, α_j corresponds to different step sizes, e.g. two options α_1 = 1 and α_2 = 2, i.e. j may take the value 1 or 2; in this case one perturbation corresponds to the leftward perturbation with step size 1, and another to the leftward perturbation with step size 2. sgn() denotes the sign function. pos_g denotes a probability: for different values of g, pos_g denotes different probabilities, e.g. three options pos_1 = 1, pos_2 = 0.8 and pos_3 = 0.6, i.e. g may take the value 1, 2 or 3; in this case one perturbation denotes the leftward perturbation with probability 1 corresponding to step size 1, another denotes the leftward perturbation with probability 0.8 corresponding to step size 1, and so on.
rdm(M, pos_g)_nm denotes randomly retaining the elements of the matrix M with probability pos_g. As can be understood, rdm(M, pos_g)_nm serves only to illustrate the meaning of the rdm() function: when random() ≤ pos_g the function yields the corresponding element value, and when random() > pos_g the element value is 0. In particular, applied to the perturbation, this means randomly deciding with probability pos_g, at each pixel position, whether to generate a perturbation, yielding a perturbation matrix in which some pixel positions carry a perturbation and some do not; for example, if random() > pos_g at position (m, n), then the value at position (m, n) in the resulting matrix is 0, i.e. the perturbation at that position is 0.
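The rdm() behaviour described above can be sketched directly; the explicit random-generator argument is an addition of this sketch for reproducibility:

```python
import numpy as np

def rdm(M, pos_g, rng):
    """Randomly retain elements of M with probability pos_g, zero out the rest.

    Mirrors the rdm() description in the text: where random() <= pos_g the
    element is kept, otherwise it becomes 0.
    """
    keep = rng.random(M.shape) <= pos_g
    return np.where(keep, M, 0.0)

rng = np.random.default_rng(42)
M = np.ones((4, 4))
masked = rdm(M, 0.8, rng)
print(masked.mean())  # roughly pos_g of the elements survive
```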
4. Calculating the confidence (the inverse of the adversarial standard deviation std_adv).
The formula for the adversarial standard deviation in this example is as follows:
where N is the total number of groups of adversarial pictures corresponding to the three pictures, i.e. the number of combinations of j, g and l. For different values of j, g and l, each group of adversarial pictures (i.e. each group of corrected images) corresponds to one predicted sight focus position; mean is the average value and var is the variance. In application, each combination corresponds to one group of adversarial pictures, and each group comprises one adversarial picture for each of the three pictures (face picture, left-eye picture and right-eye picture), i.e. every three adversarial pictures form one group of inputs to the neural network model, and each group of inputs corresponds to one corrected predicted sight focus position. In this example, N groups of inputs are obtained; based on the N predicted positions the standard deviation can be calculated, and the confidence is obtained by taking the reciprocal of the standard deviation.
5. The predicted sight focus position Px of the single-frame picture X is processed differently according to the adversarial standard deviation. In this step, the inputs are the picture's predicted sight focus position Px and the confidence 1/std_adv, and the output is the processed sight focus position.
Specifically, if 1/std_adv is greater than the threshold th1, the confidence of the picture's prediction result is high, and the predicted sight focus position Px is output directly; if 1/std_adv is not greater than the threshold th1, the confidence of the picture's prediction result is low, and temporal smoothing, such as Kalman filtering, is applied to the predicted coordinates. The threshold th1 may be configured according to actual requirements, for example 1/63.88.
Of course, in practical applications, the processing may also be performed directly based on the standard deviation. Specifically, if std_adv is less than or equal to the threshold th2, the adversarial standard deviation of the picture's prediction result is small, the confidence is considered high, and the predicted sight focus position Px is output directly; if std_adv is greater than the threshold th2, the confidence of the picture's prediction result is low, and temporal smoothing, such as Kalman filtering, is applied to the predicted coordinates. The threshold th2 may be configured according to actual requirements, for example 63.88.
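The two steps above can be sketched together: compute the confidence as the reciprocal of the standard deviation over the adversarial predictions, then either output Px directly or smooth it. Simple exponential smoothing (with an assumed weight) stands in here for the Kalman filtering mentioned in the text:

```python
import numpy as np

def process_prediction(P_x, adv_positions, th1, previous=None):
    """Confidence = 1/std over adversarial predictions; output P_x directly when
    confident, otherwise fall back to temporal smoothing.

    Exponential smoothing with weight 0.7 replaces Kalman filtering here, and
    the fallback to P_x when no previous frame exists is an assumption.
    """
    std_adv = float(np.std(adv_positions))
    confidence = 1.0 / std_adv if std_adv > 0 else float("inf")
    if confidence > th1:
        return P_x                       # high confidence: use prediction as-is
    if previous is None:
        return P_x
    return 0.7 * previous + 0.3 * P_x    # low confidence: smooth with prior frame

# Tight cluster of adversarial predictions -> high confidence -> Px unchanged
print(process_prediction(120.0, [119.5, 120.2, 120.1], th1=1 / 63.88, previous=100.0))
```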
It can be understood that the neural network model for image processing in the embodiment of the present application may be trained based on the training method provided in any embodiment of the present application. That is to say, the target neural network model to be trained may be a model for outputting the user's sight focus position (or the vector representation corresponding to the focus position); when the model is trained, the prediction result of the model is the predicted sight focus position of the sample image, and the real result is the user's real sight focus position in the sample image, namely the real coordinate point on the screen of the electronic device gazed at by the user in the sample image.
The scheme provided by the embodiment of the application can be applied to various electronic devices, for example, mobile electronic devices such as mobile phones and tablets that have only a single visible-light camera, so that the user's sight line can be estimated on the device based on videos/images. The scheme of the embodiment of the application can effectively improve the interaction between the user's sight line and the device (mobile phone); for example, user data such as pictures can be acquired through the phone camera, and the position on the phone screen watched by the user can then be estimated from the user data.
It should be noted that, in the embodiment of the present application, for the example in the embodiment of the training method and the example in the image processing method, some corresponding parameter interpretations may be referred to each other.
In the embodiment of the training method, a method for denoising a data set and a training method for reducing overfitting are provided, so that the generalization performance of a trained model is improved. The embodiment of the application also provides a training method aiming at the ranking loss function based on the countertraining (namely the training method for enhancing the robustness of the model), so that the sight estimation result of the trained model is more stable, and the problem of jitter is solved in the training stage.
In the embodiment of the image processing method, a testing method for obtaining a stable prediction result is provided: the adversarial-sample standard deviation and the prediction result of the test picture are output, and the prediction result is processed through the adversarial standard deviation. The embodiment of the application also provides a person-specific three-point calibration method (of course, a single-point or multi-point calibration method may also be adopted), so that the prediction result can be adjusted quickly and efficiently.
Based on the same principle, the embodiment of the application also provides an image processing device which comprises an image acquisition module and a sight line focus position determination module. Wherein:
the image acquisition module is used for acquiring a face image of a user;
and the sight focus position determining module is used for obtaining the sight focus position of the user based on the face image by using the neural network model.
Optionally, the gaze focus position determination module is specifically configured to:
acquiring a position adjustment parameter;
obtaining a predicted sight focus position of the user based on the face image by using a neural network model;
and adjusting the predicted sight focus position based on the position adjustment parameter to obtain the adjusted sight focus position.
Optionally, the position adjustment parameter is obtained by:
displaying the calibration object to a user, and acquiring a current face image of the user;
obtaining a predicted sight focus position of the user corresponding to the current face image based on the current face image by using a neural network model;
and determining position adjusting parameters according to the predicted sight focus position of the user corresponding to the current face image and the position of the calibration object.
Optionally, the gaze focus position determination module is specifically configured to:
obtaining a predicted sight focus position of the user based on the face image by using a neural network model;
determining a predicted loss of the predicted gaze focal position;
determining a confidence level of the predicted gaze focal position based on the predicted loss;
if the confidence coefficient is greater than the set threshold value, determining the predicted sight line focus position as the sight line focus position of the user;
if the confidence coefficient is not greater than the set threshold, the predicted sight line focus position is adjusted to obtain the adjusted sight line focus position, or the sight line focus position corresponding to the previous frame of face image is determined as the sight line focus position of the user.
Optionally, when the gaze focus position determination module determines the confidence of the predicted gaze focus position based on the predicted loss, the gaze focus position determination module is specifically configured to:
determining at least two perturbations of the prediction loss to the facial image;
respectively correcting the face images based on each kind of disturbance to obtain at least two corrected images;
obtaining a predicted sight focus position corresponding to each corrected image through a neural network model;
and obtaining the confidence coefficient according to the predicted sight focus position corresponding to each corrected image.
Optionally, when obtaining the confidence level according to the predicted gaze focus position corresponding to each corrected image, the gaze focus position determining module is specifically configured to:
determining a standard deviation according to the predicted sight focus position corresponding to each corrected image;
the inverse of the standard deviation was taken as the confidence.
Optionally, the gaze focus position determination module, when determining at least two perturbations of the prediction loss to the facial image, performs at least one of:
determining a predicted loss of an initial gaze focus position corresponding to the facial image relative to at least two directions; determining the disturbance of the initial sight focus position to the face image in each direction based on the corresponding prediction loss in each direction;
at least two perturbations of the initial gaze focal position to the face image are determined based on the at least two perturbation coefficients.
Optionally, when the gaze focus position determination module obtains the predicted gaze focus position of the user based on the face image by using the neural network model, the gaze focus position determination module is specifically configured to:
cutting the face image to obtain a global image and a local image of the face image;
inputting the global image and the local image into a neural network model to obtain a predicted sight focus position of the user;
the gaze focal position determination module, when determining at least two perturbations of the facial image by the prediction loss, is specifically configured to:
determining at least two kinds of disturbance of prediction loss to each image in a global image and a local image;
correcting the face image based on each kind of disturbance to obtain at least two corrected images, including:
respectively correcting corresponding images based on at least two kinds of disturbance corresponding to each image in the global image and the local image to obtain at least two corrected images corresponding to each image;
the gaze focus position determination module, when obtaining the predicted gaze focus position corresponding to each modified image through the neural network model, is specifically configured to:
inputting each group of corrected images into a neural network model to obtain a predicted sight focus position corresponding to each group of corrected images, wherein each group of corrected images comprises a corrected image corresponding to each image obtained by cutting the face image;
the gaze focus position determination module is specifically configured to, when obtaining the confidence level according to the predicted gaze focus position corresponding to each corrected image:
and obtaining the confidence degree based on the corresponding predicted sight line focus position of each group of modified images.
Optionally, when determining the disturbance of the prediction loss on the face image, the gaze focus position determination module is specifically configured to:
determining the gradient change of the prediction loss to each pixel point in the face image;
determining the disturbance of the prediction loss to each pixel point according to the gradient change corresponding to each pixel point;
correcting the face image based on the disturbance to obtain a corrected image, comprising:
and superposing the disturbance corresponding to each pixel point with the original pixel value of the pixel point corresponding to the face image to obtain the corrected image.
Based on the same principle, the embodiment of the application also provides a training device of the neural network model, and the device comprises a sample obtaining module and a model training module. Wherein:
the system comprises a sample acquisition module, a data acquisition module and a data processing module, wherein the sample acquisition module is used for acquiring a training sample set which comprises sample images;
and the model training module is used for training the initial target neural network model based on each sample image until the loss function is converged to obtain the trained target neural network model.
Optionally, the model training module is specifically configured to:
acquiring a first neural network model;
training the first neural network model at least twice based on each sample image to obtain the first neural network model after each training;
predicting each sample image through the neural network model after each training to obtain the prediction result of each sample image corresponding to the neural network model after each training;
and deleting the sample images in the training sample set based on the difference between the prediction result of each time corresponding to each sample image and the real result of the sample to obtain the processed sample images.
Optionally, when the model training module performs at least two trainings of the first neural network model based on the sample images, the sample images used in each training after the first are obtained by deleting, from the sample images used in the previous training, a set number or a set proportion of sample images whose difference between the prediction result and the real result is small.
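The pruning between trainings described above can be sketched as follows. Per the text, the samples removed each round are those whose prediction-vs-ground-truth difference is smallest; collapsing each sample's error to a single scalar is an assumption of this sketch:

```python
import numpy as np

def prune_samples(samples, labels, predictions, drop_ratio=0.1):
    """Drop the set proportion of samples with the smallest prediction error.

    predictions may aggregate errors over several trainings (e.g. a per-round
    sum); that aggregation is an assumption of this sketch.
    """
    diffs = np.abs(predictions - labels)
    n_drop = int(len(samples) * drop_ratio)
    keep = np.sort(np.argsort(diffs)[n_drop:])  # remove the n_drop smallest diffs
    return [samples[i] for i in keep], labels[keep]

samples = ["img%d" % i for i in range(10)]
labels = np.arange(10, dtype=float)
preds = labels + np.array([0.0, 0.5, 0.1, 2.0, 0.2, 0.05, 1.5, 0.3, 0.7, 0.02])
kept, kept_labels = prune_samples(samples, labels, preds, drop_ratio=0.2)
print(kept)  # the two samples with the smallest error are gone
```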
Optionally, the model training module is specifically configured to:
inputting each sample image into a teacher network model to obtain an output result for each sample image;
and taking each output result as the true result of the corresponding sample image, and training the target neural network model based on each sample image.
Optionally, the teacher network model is any one of the teacher network models randomly selected from the teacher queue;
optionally, the model training module is further configured to, after each time the target neural network model is trained based on each sample image with each output result taken as the true result of the corresponding sample image:
add the target neural network model after that training to the teacher queue;
wherein the teacher queue is empty at initialization, and at initialization the true result of each sample image is the true result corresponding to the label of that sample image.
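The teacher-queue scheme above could be sketched as follows. Teacher models are assumed to expose a `predict(image)` method; this interface and the class name are assumptions for illustration, not fixed by the application:

```python
import random

class TeacherQueue:
    """Self-distillation with a teacher queue.

    The queue starts empty, in which case the labelled ground truth is
    used as the true result. After each training round the trained
    model is pushed onto the queue, and later rounds draw a teacher
    from it at random.
    """

    def __init__(self):
        self.queue = []

    def targets(self, images, labels):
        if not self.queue:                    # queue empty at initialization
            return list(labels)               # fall back to the image labels
        teacher = random.choice(self.queue)   # randomly selected teacher
        return [teacher.predict(img) for img in images]

    def push(self, trained_model):
        self.queue.append(trained_model)
```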
Optionally, when training the initial target neural network model based on each sample image, the model training module is specifically configured to:
after each training, initialize one part of the model parameters of the target neural network model, take the other part of the model parameters together with the initialized part as the new model parameters of the target neural network model, and carry out the next training of the target neural network model.
Optionally, when initializing a part of model parameters of the target neural network model after each training, the model training module is specifically configured to:
determining the importance degree of each filter in the target neural network model;
determining, according to the importance degree of each filter, the target filters whose parameters need to be initialized;
and initializing the model parameters of each target filter.
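The selection step could be sketched as follows. The application does not fix the importance measure, so the L1 norm of each filter's weights is used here purely as a stand-in, and re-initializing the least important filters is one plausible reading; the function and parameter names are illustrative:

```python
def select_target_filters(filter_weights, num_reinit):
    """Rank filters by an importance score (L1 norm of the weights,
    used here as a stand-in) and return the indices of the
    `num_reinit` least important filters for re-initialization."""
    importance = [sum(abs(w) for w in f) for f in filter_weights]
    ranked = sorted(range(len(filter_weights)), key=lambda i: importance[i])
    return ranked[:num_reinit]
```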
Optionally, when initializing the model parameters of each target filter, the model training module is specifically configured to:
decomposing the filter parameter matrix of the neural network layer where a target filter is located to obtain an orthogonal matrix of the filter parameter matrix;
for the neural network layer where each target filter is located, determining the feature vector corresponding to that target filter in the orthogonal matrix of the layer according to the position of the target filter in the layer;
determining the two-norm of the feature vector of each target filter in the same neural network layer;
and for each target filter, determining the initialized parameters of the target filter according to the corresponding feature vector and the corresponding two-norm in the neural network layer to which the target filter belongs.
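The orthogonal re-initialization could be sketched as follows, assuming the layer's filter parameter matrix holds one flattened filter per row and has at least as many input dimensions as filters. QR decomposition is used here as one way to obtain an orthogonal matrix, and rescaling by dividing by the two-norm (a unit-norm initialization) is an assumption, since the application does not fix the exact rule:

```python
import numpy as np

def reinit_target_filters(layer_weights, target_indices):
    """Re-initialize the target filters of one layer.

    The filter parameter matrix is QR-decomposed to obtain an
    orthogonal matrix; each target filter takes the orthogonal vector
    at its own position, rescaled by that vector's two-norm.
    """
    w = np.asarray(layer_weights, dtype=np.float64)
    q, _ = np.linalg.qr(w.T)        # columns of q are orthonormal
    ortho = q.T                     # row i corresponds to filter i
    out = w.copy()
    for i in target_indices:
        vec = ortho[i]
        norm = np.linalg.norm(vec)  # two-norm of the feature vector
        out[i] = vec / norm
    return out
```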
It can be understood that each module provided in the embodiments of the present application may implement the corresponding step of the method provided in the embodiments of the present application. The functions may be implemented by hardware, or by hardware executing corresponding software. The modules may be software and/or hardware, and each module may be implemented individually or by integrating a plurality of modules. For the functional description of each module of the image processing apparatus, reference may be made to the corresponding description of the methods in the foregoing embodiments, and details are not repeated here.
In addition, in practical application, each functional module of the apparatus in the embodiments of the present application may run on a terminal device and/or a server according to the requirements of the practical application.
Based on the same principle, an embodiment of the present application further provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to invoke the computer program to perform the method provided in any embodiment of the present application.
Based on the same principle, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, the computer program implements the method provided in any embodiment of the present application.
Alternatively, fig. 5 shows a schematic structural diagram of an electronic device to which the embodiments of the present application are applicable. As shown in fig. 5, the electronic device 4000 may include a processor 4001 and a memory 4003, the processor 4001 being coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The memory 4003 is used for storing a computer program for executing the present scheme, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement what is shown in any of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.
Claims (20)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910685032.2A CN112307815A (en) | 2019-07-26 | 2019-07-26 | Image processing method and device, electronic equipment and readable storage medium |
| KR1020200047423A KR20210012888A (en) | 2019-07-26 | 2020-04-20 | Method and apparatus for gaze tracking and method and apparatus for training neural network for gaze tracking |
| US16/937,722 US11347308B2 (en) | 2019-07-26 | 2020-07-24 | Method and apparatus with gaze tracking |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910685032.2A CN112307815A (en) | 2019-07-26 | 2019-07-26 | Image processing method and device, electronic equipment and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112307815A true CN112307815A (en) | 2021-02-02 |
Family
ID=74329723
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910685032.2A Pending CN112307815A (en) | 2019-07-26 | 2019-07-26 | Image processing method and device, electronic equipment and readable storage medium |
Country Status (2)
| Country | Link |
|---|---|
| KR (1) | KR20210012888A (en) |
| CN (1) | CN112307815A (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114445663A (en) * | 2022-01-25 | 2022-05-06 | 百度在线网络技术(北京)有限公司 | Method, apparatus and computer program product for detecting challenge samples |
| CN116091541A (en) * | 2022-12-21 | 2023-05-09 | 哲库科技(上海)有限公司 | Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product |
| JP2023108563A (en) * | 2022-01-25 | 2023-08-04 | キヤノン株式会社 | Gaze detection device, display device, control method, and program |
| CN118051772A (en) * | 2024-01-23 | 2024-05-17 | 哈尔滨工程大学 | A robust training method based on phase flipping |
| WO2024245263A1 (en) * | 2023-05-29 | 2024-12-05 | 北京字跳网络技术有限公司 | Method and apparatus for constructing line-of-sight prediction model, and device and storage medium |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104951084A (en) * | 2015-07-30 | 2015-09-30 | 京东方科技集团股份有限公司 | Eye-tracking method and device |
| CN106796449A (en) * | 2014-09-02 | 2017-05-31 | 香港浸会大学 | sight tracking method and device |
| CN108229284A (en) * | 2017-05-26 | 2018-06-29 | 北京市商汤科技开发有限公司 | Eye-controlling focus and training method and device, system, electronic equipment and storage medium |
| CN109271914A (en) * | 2018-09-07 | 2019-01-25 | 百度在线网络技术(北京)有限公司 | Detect method, apparatus, storage medium and the terminal device of sight drop point |
| US20190080474A1 (en) * | 2016-06-28 | 2019-03-14 | Google Llc | Eye gaze tracking using neural networks |
| CN109698901A (en) * | 2017-10-23 | 2019-04-30 | 广东顺德工业设计研究院(广东顺德创新设计研究院) | Atomatic focusing method, device, storage medium and computer equipment |
| CN110008835A (en) * | 2019-03-05 | 2019-07-12 | 成都旷视金智科技有限公司 | Sight prediction technique, device, system and readable storage medium storing program for executing |
- 2019-07-26: CN application CN201910685032.2A filed, published as CN112307815A (status: Pending)
- 2020-04-20: KR application KR1020200047423A filed, published as KR20210012888A (status: Pending)
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106796449A (en) * | 2014-09-02 | 2017-05-31 | 香港浸会大学 | sight tracking method and device |
| CN104951084A (en) * | 2015-07-30 | 2015-09-30 | 京东方科技集团股份有限公司 | Eye-tracking method and device |
| US20190080474A1 (en) * | 2016-06-28 | 2019-03-14 | Google Llc | Eye gaze tracking using neural networks |
| CN108229284A (en) * | 2017-05-26 | 2018-06-29 | 北京市商汤科技开发有限公司 | Eye-controlling focus and training method and device, system, electronic equipment and storage medium |
| CN109698901A (en) * | 2017-10-23 | 2019-04-30 | 广东顺德工业设计研究院(广东顺德创新设计研究院) | Atomatic focusing method, device, storage medium and computer equipment |
| CN109271914A (en) * | 2018-09-07 | 2019-01-25 | 百度在线网络技术(北京)有限公司 | Detect method, apparatus, storage medium and the terminal device of sight drop point |
| CN110008835A (en) * | 2019-03-05 | 2019-07-12 | 成都旷视金智科技有限公司 | Sight prediction technique, device, system and readable storage medium storing program for executing |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114445663A (en) * | 2022-01-25 | 2022-05-06 | 百度在线网络技术(北京)有限公司 | Method, apparatus and computer program product for detecting challenge samples |
| JP2023108563A (en) * | 2022-01-25 | 2023-08-04 | キヤノン株式会社 | Gaze detection device, display device, control method, and program |
| CN116091541A (en) * | 2022-12-21 | 2023-05-09 | 哲库科技(上海)有限公司 | Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product |
| WO2024245263A1 (en) * | 2023-05-29 | 2024-12-05 | 北京字跳网络技术有限公司 | Method and apparatus for constructing line-of-sight prediction model, and device and storage medium |
| CN118051772A (en) * | 2024-01-23 | 2024-05-17 | 哈尔滨工程大学 | A robust training method based on phase flipping |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20210012888A (en) | 2021-02-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112307815A (en) | Image processing method and device, electronic equipment and readable storage medium | |
| CN111325851B (en) | Image processing method and device, electronic device, and computer-readable storage medium | |
| CN114707604B (en) | Twin network tracking system and method based on space-time attention mechanism | |
| CN106157307B (en) | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF | |
| CN112801890B (en) | Video processing method, device and equipment | |
| US12400349B2 (en) | Joint depth prediction from dual-cameras and dual-pixels | |
| CN105046659B (en) | A kind of simple lens based on rarefaction representation is calculated as PSF evaluation methods | |
| CN114973098B (en) | A short video deduplication method based on deep learning | |
| CN107871099A (en) | Face detection method and apparatus | |
| CN113066034A (en) | Face image restoration method and device, restoration model, medium and equipment | |
| CN115187474B (en) | A two-stage dehazing method for dense fog images based on inference | |
| CN112419191A (en) | Image motion blur removing method based on convolution neural network | |
| CN115511708B (en) | Depth map super-resolution method and system based on uncertainty perception feature transmission | |
| CN119131265A (en) | Three-dimensional panoramic scene understanding method and device based on multi-view consistency | |
| CN111445496B (en) | Underwater image recognition tracking system and method | |
| CN115439738A (en) | A method of underwater target detection based on self-supervised collaborative reconstruction | |
| CN111667495A (en) | Image scene analysis method and device | |
| CN112634331B (en) | Optical flow prediction method and device | |
| CN112418279B (en) | Image fusion method, device, electronic equipment and readable storage medium | |
| CN119323741B (en) | Unmanned aerial vehicle video target detection method and system based on space-time correlation | |
| CN109978928A (en) | A kind of binocular vision solid matching method and its system based on Nearest Neighbor with Weighted Voting | |
| CN120612354A (en) | Self-supervised monocular depth estimation method for dynamic scenes based on efficient parameter fine-tuning | |
| Qiu et al. | A GAN-based motion blurred image restoration algorithm | |
| WO2024221818A1 (en) | Definition identification method and apparatus, model training method and apparatus, and device, medium and product | |
| Wang et al. | Image deblurring using fusion transformer-based generative adversarial networks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||