Disclosure of Invention
An object of the present application is to provide a method and an apparatus for detecting a face image, which address the problem that existing face detection models based on deep convolutional neural networks are difficult to deploy directly on a mobile terminal.
In order to achieve the above object, the present application provides a method for detecting a face image, wherein the method includes:
constructing a face detection model, wherein a feature extraction network corresponding to the face detection model comprises a plurality of sequentially connected image feature extraction sections, each image feature extraction section comprises a first processing block and a plurality of second processing blocks which are sequentially connected, the first processing block is used for down-sampling an input feature image by using depth separable convolution and outputting the result, and the second processing block is used for performing channel separation on an input feature image, performing feature extraction on the channel-separated feature image by using depth separable convolution, and then outputting the result;
and deploying the face detection model to a mobile terminal, so that the mobile terminal can acquire an image to be detected and perform face detection on the image to be detected according to the face detection model to acquire a face image in the image to be detected.
Further, constructing a face detection model, comprising:
constructing a feature extraction network and a feature detection network corresponding to the face detection model, wherein the feature extraction network further comprises an image feature pyramid network, the image feature pyramid network performs up-sampling or down-sampling according to the feature images output by the image feature extraction sections, and the feature images obtained through the up-sampling or down-sampling form a multi-level image feature pyramid; the feature detection network acquires the image feature pyramid output by the image feature pyramid network and outputs a classification feature image and a regression feature image according to the image feature pyramid, wherein the classification feature image indicates the probability that each pixel belongs to a face, and the regression feature image provides face bounding box information;
inputting a training image into the feature extraction network to extract the face features, and acquiring a corresponding feature image;
inputting the feature image into the feature detection network for feature detection to obtain a corresponding classification feature image and a corresponding regression feature image;
comparing the classification feature image with the face classification pre-labeled on the training image to determine a classification error;
comparing the regression feature image with a face image frame pre-labeled on the training image to determine a regression feature error;
adjusting model parameters of the face detection model based on the classification error and the regression feature error;
and when a preset model training stopping condition is met, determining the current model parameters of the face detection model as the trained model parameters of the face detection model.
Further, the image feature pyramid network performs up-sampling or down-sampling according to the feature image output by the image feature extraction section, and forms a multi-level image feature pyramid from the feature images obtained through the up-sampling or down-sampling, including:
acquiring the feature image output by the last of the plurality of sequentially connected image feature extraction sections, performing channel adjustment on the feature image through a convolution kernel with a size of 1 × 1, and determining the channel-adjusted feature image as a reference feature image;
down-sampling the reference feature image through a depth separable convolution processing block to obtain a down-sampled feature image;
up-sampling the reference feature image, and performing feature extraction on the resulting feature image through a depth separable convolution processing block to obtain an up-sampled feature image;
and sorting the up-sampled feature image, the reference feature image and the down-sampled feature image by image size to obtain a multi-level image feature pyramid.
Further, down-sampling the reference feature image through the depth separable convolution processing block to obtain a down-sampled feature image further includes:
down-sampling the already down-sampled feature image again through a depth separable convolution processing block to obtain a further down-sampled feature image.
Up-sampling the reference feature image and then performing feature extraction on the resulting feature image through the depth separable convolution processing block to obtain an up-sampled feature image includes the following steps:
up-sampling the reference feature image through a bilinear interpolation algorithm to obtain a first feature image;
acquiring the feature image output by the image feature extraction section preceding the last image feature extraction section, performing channel adjustment on that feature image through a convolution kernel with a size of 1 × 1, and adding the channel-adjusted feature image to the first feature image pixel by pixel to obtain a second feature image;
and performing feature extraction on the second feature image through a depth separable convolution processing block to obtain the up-sampled feature image.
Further, the depth separable convolution processing block includes:
the device comprises a depth separable convolution layer, a batch normalization layer, a convolution layer with convolution kernel size of 1 x 1, a batch normalization layer and an activation function layer which are connected in sequence, wherein the convolution kernel size of the depth separable convolution layer is 3 x 3.
Further, the first processing block is configured to:
inputting the input feature image into a first depth separable convolution processing block for down-sampling to obtain a first down-sampled feature image;
passing the input feature image through a convolution layer with a convolution kernel size of 1 × 1, and inputting the resulting feature image into a second depth separable convolution processing block for down-sampling to obtain a second down-sampled feature image;
passing the first down-sampled feature image and the second down-sampled feature image through respective convolution layers with a convolution kernel size of 1 × 1, and concatenating the results to obtain a first concatenated feature image;
and performing random channel mixing on the first concatenated feature image, and outputting a first random channel-mixed feature image.
Further, the second processing block is configured to:
performing channel separation on the input feature image to obtain channel-separated feature images;
sequentially passing one of the channel-separated feature images through a convolution layer with a convolution kernel size of 1 × 1, a depth separable convolution processing block, and a convolution layer with a convolution kernel size of 1 × 1 to obtain a feature-extracted feature image;
concatenating the remaining channel-separated feature image with the feature-extracted feature image to obtain a second concatenated feature image;
and performing random channel mixing on the second concatenated feature image, and outputting a second random channel-mixed feature image.
Further, the feature detection network acquires an image feature pyramid output by the image feature pyramid network, and outputs a classification feature image and a regression feature image according to the image feature pyramid, including:
the feature detection network acquires a feature image of each level in the image feature pyramid output by the image feature pyramid network;
inputting the feature image of each level into a convolution layer with a convolution kernel size of 3 × 3 to obtain a classification feature image with the number of channels reduced to 2;
inputting the feature image of each level into a convolution layer with a convolution kernel size of 3 × 3 to obtain a regression feature image with the number of channels reduced to 4.
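As a rough illustration (not part of the claimed method), the parameter cost of such 3 × 3 detection-head convolutions can be counted as follows; the 116-channel input width is the pyramid channel count used later in the detailed description, and the bias terms are an assumption.

```python
def conv3x3_params(c_in, c_out, bias=True):
    """Trainable parameters of a single 3 x 3 convolution layer."""
    return 3 * 3 * c_in * c_out + (c_out if bias else 0)

# Assuming 116-channel pyramid feature images and bias terms:
print(conv3x3_params(116, 2))  # 2090 parameters, classification branch
print(conv3x3_params(116, 4))  # 4180 parameters, regression branch
```

Both branches are tiny compared with the backbone, which is why the same two heads can be applied to every pyramid level at negligible cost.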
Based on another aspect of the present application, the present application further provides an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the aforementioned method for detecting a face image.
The present application further provides a computer readable medium, on which computer readable instructions are stored, the computer readable instructions being executable by a processor to implement the aforementioned method for detecting a face image.
Compared with the prior art, the scheme provided by the present application constructs a face detection model based on depth separable convolution and deploys the resulting face detection model to a mobile terminal, so that the mobile terminal can use the deployed face detection model to perform face detection on an acquired image to be detected and obtain the face image in the image to be detected. Compared with a traditional deep convolutional neural network, the number of model parameters and the amount of model computation are reduced, and the overall detection time of the model is shortened, so that the face detection model can be deployed on a mobile terminal with limited storage space and computing resources while meeting the real-time requirement of face detection. In addition, compared with existing lightweight face detection models, the detection accuracy of the model is also improved.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal and the network device each include one or more processors (CPUs), input/output interfaces, network interfaces, and memories.
The memory may include forms of volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 illustrates a method for detecting a face image according to some embodiments of the present application, where the method specifically includes the following steps:
step S101, a face detection model is constructed, wherein a feature extraction network corresponding to the face detection model comprises a plurality of sequentially connected image feature extraction sections, each image feature extraction section comprises a first processing block and a plurality of second processing blocks which are sequentially connected, the first processing block is used for down-sampling an input feature image by using depth separable convolution and outputting the result, and the second processing block is used for performing channel separation on an input feature image, performing feature extraction on the channel-separated feature image by using depth separable convolution, and then outputting the result;
step S102, the face detection model is deployed to a mobile terminal, so that the mobile terminal can obtain an image to be detected and perform face detection on the image to be detected according to the face detection model to obtain a face image in the image to be detected.
The scheme is particularly suitable for scenarios in which a face image is to be detected on a mobile terminal. A feature extraction network corresponding to a face detection model can be constructed using depth separable convolution, the trained face detection model is then deployed to the mobile terminal, and the mobile terminal performs face detection on the acquired image to be detected according to the deployed face detection model to obtain the corresponding face image.
In step S101, a face detection model is first constructed. The face detection model constructs a corresponding network structure based on depth separable convolution, the network structure corresponding to the face detection model may include a feature extraction network and a feature detection network, the feature extraction network is used for performing face-related feature extraction on the image, and the feature detection network is used for performing face detection on a feature image obtained by the feature extraction network to determine a face image therein.
The feature extraction network comprises a plurality of image feature extraction sections (stages), each image feature extraction section firstly receives an input feature image, then performs feature extraction on the input feature image to obtain a new feature image, and then outputs the new feature image. The image feature extraction sections are connected in sequence, and the output feature image of the previous image feature extraction section is used as the input feature image of the next image feature extraction section.
The image feature extraction section comprises a first processing block and a plurality of second processing blocks, which are sequentially connected: the feature image output by the first processing block is used as the input feature image of the first of the second processing blocks, and the output feature image of each second processing block is used as the input feature image of the next second processing block. The first processing block down-samples an input feature image using depth separable convolution and outputs the down-sampled feature image. The second processing block first performs channel separation on the input feature image, then performs feature extraction on the channel-separated feature image using depth separable convolution, and outputs the feature-extracted feature image.
Downsampling, also known as image reduction, is generally performed to generate a thumbnail of the corresponding image. A feature image of size M × N, down-sampled by a factor of S, yields an image of size (M/S) × (N/S), where S is typically a common divisor of M and N. Down-sampling a feature image yields its high-level semantic features at low resolution; because the down-sampled feature image has a low resolution, that is, a small length and width, fewer pixels need to be computed, which reduces computational complexity. Here, high-level semantic features refer to the texture structure, semantic information and the like of the image, whereas the initial convolutional layers of a neural network learn low-level features of the image such as shape and color.
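The size arithmetic described above can be sketched as follows; the concrete M, N and S values are illustrative only.

```python
def downsampled_size(m, n, s):
    """Size of an M x N feature image after down-sampling by a factor S.
    S is assumed to be a common divisor of M and N."""
    assert m % s == 0 and n % s == 0, "S should divide both M and N"
    return (m // s, n // s)

print(downsampled_size(640, 480, 2))   # (320, 240)
print(downsampled_size(224, 224, 32))  # (7, 7)
```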
A depth separable convolution splits an ordinary convolution kernel into two separate convolutions, a depthwise convolution and a pointwise (1 × 1) convolution, performed in turn. Using depth separable convolution greatly reduces the number of multiplications in the convolution computation, and thus greatly reduces the amount of computation.
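A minimal sketch of why the multiplication count drops: a standard k × k convolution needs k·k·C_in multiplications per output pixel per output channel, while the depthwise-plus-pointwise factorization needs only k·k per input channel plus a 1 × 1 channel mixing. The feature map and channel sizes below are illustrative.

```python
def standard_conv_mults(h, w, c_in, c_out, k):
    # k x k x c_in multiplications for each of the h * w * c_out outputs
    return h * w * c_out * k * k * c_in

def depthwise_separable_mults(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise

std = standard_conv_mults(56, 56, 116, 116, 3)
sep = depthwise_separable_mults(56, 56, 116, 116, 3)
print(sep / std)  # exactly (9 + 116) / (9 * 116), roughly 0.12
```

The ratio simplifies to 1/C_out + 1/k², so a 3 × 3 kernel alone already yields close to a 9× reduction.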
In some embodiments of the present application, constructing the face detection model may specifically include the following steps:
1) constructing a feature extraction network and a feature detection network corresponding to the face detection model;
the feature extraction network comprises a plurality of image feature extraction sections and also comprises an image feature pyramid network, wherein the image feature pyramid network performs up-sampling or down-sampling according to feature images output by the image feature extraction sections, and the feature images obtained through the up-sampling or down-sampling form a multi-level image feature pyramid.
The feature detection network acquires the image feature pyramid output by the image feature pyramid network, and outputs a classification feature image and a regression feature image according to the image feature pyramid, wherein the classification feature image indicates the probability that each pixel belongs to a face, and the regression feature image provides face bounding box information.
2) Inputting a training image into the feature extraction network to extract the face features, and acquiring a corresponding feature image;
3) inputting the feature image into the feature detection network for feature detection to obtain a corresponding classification feature image and a corresponding regression feature image;
4) comparing the classification characteristic image with the face classification labeled in advance by the training image to determine a classification error;
5) comparing the regression feature image with a face image frame labeled in advance by the training image to determine a regression feature error;
6) adjusting model parameters of the face detection model based on the classification error and the regression feature error;
7) and when a preset model training stopping condition is met, determining the current model parameters of the face detection model as the trained model parameters of the face detection model.
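In highly simplified form, the update in step 6) amounts to a gradient step on a combination of the two errors. The sketch below is only schematic: the gradients, equal weighting and learning rate are placeholder assumptions, not the model's actual optimizer.

```python
def training_step(params, cls_grads, reg_grads, lr=0.01):
    """One schematic parameter update driven jointly by the classification
    error and the regression feature error (equal weighting assumed)."""
    return [p - lr * (gc + gr)
            for p, gc, gr in zip(params, cls_grads, reg_grads)]

params = [0.5, -0.2]
params = training_step(params, cls_grads=[0.1, 0.0], reg_grads=[0.0, -0.3])
print(params)  # approximately [0.499, -0.197]
```

Training stops once the preset condition of step 7) is met (for example, a maximum number of iterations or convergence of the combined error).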
Fig. 2 shows a schematic structural diagram of a preferred feature extraction network and feature detection network, in which the backbone network and the feature pyramid form the feature extraction network, and the detection head is the feature detection network. The backbone network receives an externally input image containing a human face and obtains corresponding output feature images through 4 image feature extraction sections (stage1–stage4); the output feature images are input into the feature pyramid network to generate a feature pyramid consisting of 6 levels of feature images, and the feature image corresponding to each level of the feature pyramid is then input into the feature detection network.
In some embodiments of the present application, the feature extraction network further adds an initialization block before the plurality of image feature extraction segments, where the initialization block is configured to perform image initialization on an external input image, and then input an initialized feature image into the image feature extraction segment. Preferably, the initialization block may include a plurality of convolutional layers, a first processing block, and a plurality of second processing blocks, which are sequentially connected.
Fig. 3 shows a schematic structural diagram of the image feature extraction sections in a preferred feature extraction network. The original image is processed by an initialization block and then input into the subsequent image feature extraction sections. The initialization block includes 3 convolutional layers, each with a convolution kernel size of 3 × 3; processing block B is a first processing block, processing block A is a second processing block, and there are 3 processing blocks A. After the original image passes through the 3 convolutional layers of size 3 × 3, the resulting feature image has 24 output channels. This feature image is input into processing block B for down-sampling, which reduces the size of the output feature image while keeping the number of channels unchanged; the feature image output by processing block B is then passed sequentially through the 3 processing blocks A, which change neither the size nor the number of channels of the feature image. Preferably, processing block B down-samples by a factor of 2, so that the feature image output after the original image passes through the initialization block is 1/2 the size of the original image.
Each of the following 4 image feature extraction sections (stage1–stage4) is composed of one processing block B and 3 processing blocks A. From the first image feature extraction section to the last, the number of channels of the output feature image is 2 times that of the feature image output by the previous section: for example, stage1 outputs 58 channels, stage2 outputs 116, stage3 outputs 232, and stage4 outputs 464. Likewise, the size of the output feature image is 1/2 that of the feature image output by the previous section: for example, the output of stage1 is 1/4 the size of the original image, stage2 is 1/8, stage3 is 1/16, and stage4 is 1/32. The number of processing blocks A in an image feature extraction section may be set according to actual service requirements, for example 2 or 4: when a stage is composed of 1 processing block B and 2 connected processing blocks A, the amount of network computation is reduced but accuracy is slightly lower; when a stage is composed of 1 processing block B and 4 processing blocks A, the amount of network computation increases but accuracy is relatively higher.
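The channel and size progression just described can be tabulated with a short sketch; the 640 × 640 input size is an illustrative assumption.

```python
def backbone_stage_shapes(image_side=640, base_channels=58, num_stages=4):
    """(channels, side length) of each stage's output: the initialization
    block halves the image, stage1 halves it again, and every later stage
    halves the side and doubles the channels (58 -> 116 -> 232 -> 464)."""
    shapes = []
    channels, side = base_channels, image_side // 4  # stage1 output: 1/4
    for _ in range(num_stages):
        shapes.append((channels, side))
        channels, side = channels * 2, side // 2
    return shapes

print(backbone_stage_shapes())  # [(58, 160), (116, 80), (232, 40), (464, 20)]
```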
In addition, in order to pre-train the backbone network in Fig. 3 on the ImageNet data set, the backbone network is further connected to a convolutional layer with a convolution kernel size of 1 × 1, followed by a global average pooling operation, and finally a fully connected layer that outputs confidence scores for 1000 classes.
In some embodiments of the present application, the image feature pyramid network performs up-sampling or down-sampling according to the feature image output by the image feature extraction section, and forms a multi-level image feature pyramid from the feature image obtained through the up-sampling or down-sampling, which may specifically include the following steps:
1) acquiring the feature image output by the last of the plurality of sequentially connected image feature extraction sections, performing channel adjustment on the feature image through a convolution kernel with a size of 1 × 1, and determining the channel-adjusted feature image as a reference feature image;
2) down-sampling the reference feature image through a depth separable convolution processing block to obtain a down-sampled feature image;
3) up-sampling the reference feature image, and performing feature extraction on the resulting feature image through a depth separable convolution processing block to obtain an up-sampled feature image;
4) and sorting the up-sampled feature image, the reference feature image and the down-sampled feature image by image size to obtain a multi-level image feature pyramid.
Upsampling, also called image enlargement, is mainly intended to enlarge an image and usually adopts an interpolation method, that is, a suitable interpolation algorithm is used to insert new pixels between the pixels of the original image. Up-sampling the feature image through an up-sampling layer can improve the resolution of the feature image and thereby improve the ability to predict the position of the face region.
Fig. 4 shows a preferred procedure for constructing the image feature pyramid, which is generated from the feature images output by the image feature extraction sections. The last image feature extraction section in the figure is stage4, whose output feature image has 464 channels. This feature image passes through an ordinary convolutional layer with a convolution kernel size of 1 × 1, which adjusts the number of channels of the output feature image to 116; the resulting feature image is the reference feature image, denoted P5 in the figure. P5 can be down-sampled and up-sampled respectively: down-sampling yields feature images of smaller size, and up-sampling yields feature images of larger size. Down-sampling P5 through a depth separable convolution processing block yields a feature image P6.
In some embodiments of the present application, the obtained downsampled feature image may be further downsampled by the depth separable convolution processing block, and the obtained feature image is also the downsampled feature image, for example, the feature image P6 is further downsampled by the depth separable convolution processing block, so as to obtain the downsampled feature image P7.
In some embodiments of the present application, the depth separable convolution processing block may include: a depth separable convolution layer, a batch normalization layer (BN layer), a convolution layer with a convolution kernel size of 1 × 1, a batch normalization layer (BN layer), and an activation function layer (e.g., using the ReLU activation function), which are connected in sequence, with the output of each layer serving as the input of the next. Here, the convolution kernel size of the depth separable convolution layer is 3 × 3. The numbers of input and output channels of the depth separable convolution processing block can be set as needed; as shown in Fig. 4, the block has 116 input channels and 116 output channels, and the default down-sampling multiple is 1, i.e., no down-sampling is performed. When down-sampling is needed, 2-fold down-sampling can be performed by setting the down-sampling multiple (stride) to 2; for example, passing P5 through a depth separable convolution processing block with the down-sampling multiple set to 2 yields the feature image P6, whose size is 1/2 that of P5.
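As a rough illustration, the trainable parameters of one such processing block can be counted as below, assuming the convolutions carry no bias (common when followed by batch normalization) and each batch-norm layer contributes one scale and one shift per channel.

```python
def dw_block_params(c_in, c_out):
    """Approximate trainable-parameter count of the processing block:
    3x3 depthwise conv + BN + 1x1 conv + BN (activation has no weights)."""
    depthwise = 3 * 3 * c_in    # one 3 x 3 filter per input channel
    bn1 = 2 * c_in              # batch norm: scale + shift per channel
    pointwise = c_in * c_out    # 1 x 1 convolution across channels
    bn2 = 2 * c_out
    return depthwise + bn1 + pointwise + bn2

print(dw_block_params(116, 116))  # 14964 parameters for the 116-channel block
```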
In some embodiments of the present application, upsampling the reference feature image, and then performing feature extraction on the obtained feature image through the depth separable convolution processing block to obtain the upsampled feature image may include the following steps:
1) up-sampling the reference feature image through a bilinear interpolation algorithm to obtain a first feature image;
2) acquiring the feature image output by the image feature extraction section preceding the last image feature extraction section, performing channel adjustment on that feature image through a convolution kernel with a size of 1 × 1, and adding the channel-adjusted feature image to the first feature image pixel by pixel to obtain a second feature image;
3) and performing feature extraction on the second feature image through a depth separable convolution processing block to obtain the up-sampled feature image.
As shown in Fig. 4, the reference feature image P5 is up-sampled by a factor of 2 through a bilinear interpolation algorithm to obtain a first feature image. The image feature extraction section preceding the last section stage4 is stage3, and the feature image output by stage3 is input into a convolutional layer with a convolution kernel size of 1 × 1; since the feature image output by stage3 has 232 channels while the depth separable convolution processing block accepts 116 channels, the number of channels of the stage3 output must be reduced. The convolutional layer adjusts the number of channels of the input feature image from 232 to 116, the channel-adjusted feature image is added element by element to the first feature image to obtain a second feature image, and the second feature image undergoes feature extraction through a depth separable convolution processing block (DWConv) that performs no down-sampling, yielding the up-sampled feature image P4.
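The up-sample-and-add step can be sketched on a single channel as follows; the bilinear routine below uses the align-corners convention on a toy 2 × 2 grid, purely for illustration.

```python
def bilinear_upsample(img, out_h, out_w):
    """Bilinear interpolation of a 2-D grid (align-corners convention)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1)
        y0 = min(int(y), in_h - 2)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)
            x0 = min(int(x), in_w - 2)
            dx = x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x0 + 1] * dx
            bot = img[y0 + 1][x0] * (1 - dx) + img[y0 + 1][x0 + 1] * dx
            row.append(top * (1 - dy) + bot * dy)
        out.append(row)
    return out

def add_elementwise(a, b):
    """Pixel-by-pixel addition of two equally sized grids."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

up = bilinear_upsample([[0, 2], [4, 6]], 3, 3)
print(up)  # [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
print(add_elementwise(up, [[1] * 3] * 3)[0])  # [1.0, 2.0, 3.0]
```

In the network the same addition is applied channel by channel to the up-sampled P5 and the channel-adjusted stage3 output.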
In some embodiments of the present application, an up-sampled feature image may itself be up-sampled, with the resulting feature image then undergoing feature extraction through a depth separable convolution processing block; the feature image so obtained is also an up-sampled feature image. For example, the feature image P4 is up-sampled to obtain a first feature image, which is added element by element to the feature image (not shown in the figure) obtained by applying a convolution with a kernel size of 1 × 1 to the output of stage2, yielding a second feature image. Here, since the feature image output by stage2 already has 116 channels, the convolutional layer does not need to adjust the number of channels, and the second feature image is passed through a depth separable convolution processing block to obtain the up-sampled feature image P3. Similarly, when up-sampling P3 to obtain P2, since the feature image output by stage1 has 58 channels, the number of channels needs to be adjusted to 116 by the convolutional layer.
The reference feature image P5 is down-sampled to obtain P6 and P7, up-sampled to obtain P4, P3 and P2, and the P2, P3, P4, P5, P6 and P7 are arranged in descending order of image size to obtain an image feature pyramid, as shown in fig. 4.
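The resulting level sizes can be sketched numerically; the 640-pixel square input is an illustrative assumption.

```python
def pyramid_sizes(image_side=640):
    """Side length of each pyramid level for a square input: P2..P5 sit at
    strides 4..32 of the input (matching stage1..stage4), and P6, P7 halve
    the size again at each level."""
    strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}
    return {level: image_side // s for level, s in strides.items()}

print(pyramid_sizes())
# {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20, 'P6': 10, 'P7': 5}
```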
In some embodiments of the present application, the first processing block may be configured to implement the following steps, as shown in fig. 5:
1) inputting the input feature image into a first depth separable convolution processing block for down-sampling to obtain a first down-sampled feature image;
2) inputting the feature image obtained by passing the input feature image through a convolution layer with a convolution kernel size of 1 × 1 into a second depth separable convolution processing block for down-sampling to obtain a second down-sampled feature image;
3) passing the first down-sampled feature image and the second down-sampled feature image through respective convolution layers with a convolution kernel size of 1 × 1 and connecting the results to obtain a first connected feature image;
4) performing random channel mixing on the first connected feature image, and outputting a first random-channel-mixed feature image.
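The four steps above can be sketched as a minimal two-branch down-sampling module. This is an illustrative PyTorch sketch under stated assumptions: batch normalization and activations are omitted, and the "random channel mixing" is realized here as a fixed interleaving channel shuffle, a common deterministic stand-in:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Mix channels by interleaving channel groups (deterministic stand-in
    for the 'random channel mixing' of the text)."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class DownsampleBlock(nn.Module):
    """Sketch of the first processing block: two branches down-sample the
    input with depthwise (depth separable) convolution, pass through 1x1
    convolutions, are connected, and the channels are shuffled."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 2
        # Branch 1: depthwise 3x3 stride-2 convolution, then a 1x1 convolution.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, branch_ch, 1),
        )
        # Branch 2: 1x1 convolution, depthwise 3x3 stride-2, then 1x1.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 1),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1, groups=branch_ch),
            nn.Conv2d(branch_ch, branch_ch, 1),
        )

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # connect
        return channel_shuffle(out)

# Example with the channel counts from the text: stage1 (58 ch) -> stage2 (116 ch).
block = DownsampleBlock(58, 116)
y = block(torch.randn(1, 58, 40, 40))
print(y.shape)  # torch.Size([1, 116, 20, 20])
```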
In some embodiments of the present application, the second processing block may be configured to implement the following steps, as shown in fig. 6:
1) performing channel separation on the input feature image to obtain a channel-separated feature image;
2) sequentially inputting the channel-separated feature image into a convolution layer with a convolution kernel size of 1 × 1, a depth separable convolution processing block, and another convolution layer with a convolution kernel size of 1 × 1 to obtain a feature-extracted feature image;
3) connecting the channel-separated feature image with the feature-extracted feature image to obtain a second connected feature image;
4) performing random channel mixing on the second connected feature image, and outputting a second random-channel-mixed feature image.
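The second processing block can likewise be sketched as a minimal module. This is an illustrative PyTorch sketch under the same assumptions as before (no normalization/activation layers, and a fixed interleaving shuffle standing in for random channel mixing); here the channel separation splits the input into two halves, one of which passes through the 1 × 1 / depthwise / 1 × 1 extraction path:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Mix channels by interleaving channel groups."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class BasicBlock(nn.Module):
    """Sketch of the second processing block: channel separation, feature
    extraction on one half via 1x1 -> depthwise 3x3 -> 1x1 convolutions,
    connection with the other half, then channel shuffle. No down-sampling."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.extract = nn.Sequential(
            nn.Conv2d(half, half, 1),
            nn.Conv2d(half, half, 3, padding=1, groups=half),  # depthwise
            nn.Conv2d(half, half, 1),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                       # channel separation
        out = torch.cat([a, self.extract(b)], dim=1)   # connection
        return channel_shuffle(out)

block = BasicBlock(116)
y = block(torch.randn(1, 116, 20, 20))
print(y.shape)  # torch.Size([1, 116, 20, 20]) -- shape is preserved
```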
In some embodiments of the present application, the feature detection network obtains an image feature pyramid output by the image feature pyramid network, and outputs a classification feature image and a regression feature image according to the image feature pyramid, which may specifically include the following steps:
1) the feature detection network acquires a feature image of each level in the image feature pyramid output by the image feature pyramid network;
2) inputting the feature image of each level into a convolution layer with a convolution kernel size of 3 × 3 to obtain a classification feature image with the number of channels reduced to 2;
3) inputting the feature image of each level into another convolution layer with a convolution kernel size of 3 × 3 to obtain a regression feature image with the number of channels reduced to 4.
Fig. 7 shows a schematic structural diagram of a preferred feature detection network, where Pi is the feature image of any level in the image feature pyramid. The feature image is input into two ordinary convolution layers, each with a convolution kernel size of 3 × 3. One convolution layer reduces the number of channels of the input feature image from 116 to 2; the output 2-channel feature image is the classification feature image, in which the value at each pixel position is the probability that the pixel belongs to a face or to the background. The other convolution layer reduces the number of channels of the input feature image from 116 to 4; the output 4-channel feature image is the regression feature image, which contains the information of the face frame.
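The two-branch head just described can be sketched as follows. This is an illustrative PyTorch sketch; the Softmax over the 2 classification channels and any normalization layers are omitted, and the input channel count of 116 follows the text:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the feature detection network applied to one pyramid level:
    two parallel 3x3 convolutions produce a 2-channel classification map
    (face / background) and a 4-channel regression map (face-frame info)."""
    def __init__(self, in_ch=116):
        super().__init__()
        self.cls_conv = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)  # 116 -> 2
        self.reg_conv = nn.Conv2d(in_ch, 4, kernel_size=3, padding=1)  # 116 -> 4

    def forward(self, pi):
        return self.cls_conv(pi), self.reg_conv(pi)

head = DetectionHead()
cls_map, reg_map = head(torch.randn(1, 116, 20, 20))
print(cls_map.shape, reg_map.shape)
```

The same head would be applied to every level P2 through P7 of the pyramid.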
In some embodiments of the present application, the information of the face frame includes a center point, a width, and a height of the face frame.
In some embodiments of the present application, the classification feature image error may be determined by a Softmax loss function, and the regression feature image error may be determined by a SmoothL1 loss function.
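A sketch of how these two losses might be combined during training (assuming PyTorch, whose `CrossEntropyLoss` applies Softmax internally; the tensor shapes and the equal weighting of the two terms are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Classification error via Softmax (cross-entropy) loss, regression error
# via SmoothL1 loss, applied to flattened per-pixel predictions.
cls_loss_fn = nn.CrossEntropyLoss()  # applies Softmax internally
reg_loss_fn = nn.SmoothL1Loss()

cls_pred = torch.randn(8, 2)            # per-pixel face/background logits
cls_target = torch.randint(0, 2, (8,))  # 0 = background, 1 = face
reg_pred = torch.randn(8, 4)            # predicted face-frame (cx, cy, w, h)
reg_target = torch.randn(8, 4)          # labeled face-frame values

total_loss = cls_loss_fn(cls_pred, cls_target) + reg_loss_fn(reg_pred, reg_target)
```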
In step S102, the face detection model is deployed to a mobile terminal, so that the mobile terminal obtains an image to be detected and performs face detection on the image to be detected according to the face detection model, thereby obtaining a face image in the image to be detected. After the face detection model is trained, its model parameters are determined, and the model is deployed to a mobile terminal, such as a mobile phone or an access control management system, for face detection. The image to be detected carries no pre-labeled pixel classification or face frame information; the face detection model performs face detection on the input image to obtain the corresponding pixel classification and face frame information, from which the face image is obtained.
Some embodiments of the present application also provide an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the aforementioned method of detecting a face image.
Some embodiments of the present application also provide a computer readable medium, on which computer readable instructions are stored, the computer readable instructions being executable by a processor to implement the aforementioned method for detecting a face image.
In summary, the scheme provided by the present application constructs a face detection model based on depth separable convolution and deploys the obtained face detection model to a mobile terminal, so that the mobile terminal can use the deployed face detection model to perform face detection on an acquired image to be detected and obtain the face image therein. Compared with a traditional deep convolutional neural network, the number of model parameters and the model calculation amount are reduced, and the overall detection time of the model is shortened, so that the face detection model can be deployed on a mobile terminal with limited storage space and computing resources while meeting the real-time requirement of face detection. In addition, compared with existing lightweight face detection models, the detection precision of the model is also improved.
In addition, production practice shows that the parameter quantity of the face detection model constructed by the scheme of the present application can be reduced to 0.85M; when the size of the input picture is 480 × 640, the calculation quantity (FLOPs) of the network is only 5.12G, whereas the parameter quantity of a face detection model obtained according to an existing scheme exceeds 20M and its calculation quantity exceeds 100G. The face detection model constructed by the present scheme therefore runs faster, particularly in the absence of GPU acceleration. The face detection model constructed by the present scheme was trained and tested on a face data set, and its accuracy under single-scale testing on the verification set is: easy 91.1%, medium 90.1%, hard 82.3%. In the single-scale test, each picture is input into the network for detection directly, without modifying its size, and detection is performed only once. Compared with the existing lightweight face detection model FaceBoxes, this result shows a very significant precision advantage; the precision of FaceBoxes under the single-scale test is: easy 79.1%, medium 79.4%, hard 71.5%, with a parameter quantity of 1.01M and a calculation quantity of 2.84G. Compared with FaceBoxes, the face detection model constructed by the present scheme has a smaller parameter quantity and higher precision, with only a slight increase in calculation quantity.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.