
CN115966008A - Attack detection method, device and equipment in face recognition - Google Patents

Attack detection method, device and equipment in face recognition Download PDF

Info

Publication number
CN115966008A
CN115966008A CN202211724495.3A
Authority
CN
China
Prior art keywords
model
training
loss
reconstruction
facial features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211724495.3A
Other languages
Chinese (zh)
Inventor
曹佳炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211724495.3A priority Critical patent/CN115966008A/en
Publication of CN115966008A publication Critical patent/CN115966008A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the specification disclose an attack detection method, apparatus and device in face recognition. A training sample containing a partially occluded image is acquired, and the partially occluded image is reconstructed to generate a reconstructed image; a first model is trained according to the difference between the reconstructed image and the training sample; a second model is trained according to the first model, the second model being a lightweight model partially heterogeneous with the first model; and a picture to be recognized containing a human face is acquired and attack detection is performed using the second model, thereby training the lightweight second model based on hierarchical image reconstruction and heterogeneous model distillation.

Description

Attack detection method, device and equipment in face recognition
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, and a device for detecting an attack in face recognition.
Background
With the development of the mobile internet, face recognition is applied more and more widely, but more attack techniques have appeared along with it. For example, deepfakes attacks tamper with or generate a face video using an algorithm, changing the face of person A in the video into the face of person B to mount a risk attack. Due to the diversity of such attacks and their visual invisibility, they have become a significant challenge for face recognition systems. The currently common deepfakes detection methods are generally difficult to deploy on the client side because of their high computational complexity.
Based on this, a scheme that can accurately detect an attack in face recognition at the client is required.
Disclosure of Invention
The embodiments of the specification provide an attack detection method, apparatus, device and storage medium in face recognition, which are used to solve the following technical problem: a scheme is needed that can accurately detect attacks in face recognition at a client.
To solve the above technical problem, one or more embodiments of the present specification are implemented as follows:
in a first aspect, an embodiment of the present specification provides an attack detection method in face recognition, including: acquiring a training sample containing a partial occlusion image, reconstructing the partial occlusion image, and generating a reconstructed image; training and generating a first model according to the difference of the reconstructed image and the training sample; training to generate a second model according to the first model, wherein the second model is a lightweight model partially heterogeneous with the first model; and acquiring a picture to be recognized containing a face, and using the second model to carry out attack detection.
In a second aspect, an embodiment of the present specification provides an attack detection apparatus in face recognition, including: the sample acquisition module is used for acquiring a training sample containing a partial occlusion image, reconstructing the partial occlusion image and generating a reconstructed image; the first model training module is used for generating a first model according to the difference training of the reconstructed image and the training sample; the second model training module is used for training and generating a second model according to the first model, wherein the second model is a lightweight model partially heterogeneous with the first model; and the attack detection module is used for acquiring the picture to be identified containing the face and carrying out attack detection by using the second model.
In a third aspect, one or more embodiments of the present specification provide an electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, embodiments of the present specification provide a non-volatile computer storage medium having stored thereon computer-executable instructions that, when read by a computer, cause one or more processors to perform the method of the first aspect.
At least one technical scheme adopted by one or more embodiments of the specification can achieve the following beneficial effects: a training sample containing a partially occluded image is acquired, and the partially occluded image is reconstructed to generate a reconstructed image; a first model is trained according to the difference between the reconstructed image and the training sample; a second model is trained according to the first model, the second model being a lightweight model partially heterogeneous with the first model; and a picture to be recognized containing a face is acquired and attack detection is performed using the second model, thereby training the lightweight second model based on hierarchical image reconstruction and heterogeneous model distillation, and performing accurate attack detection in face recognition at the client side.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present specification, and that those skilled in the art can obtain other drawings from these drawings without any creative effort.
Fig. 1 is a schematic flowchart of an attack detection method in face recognition according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram of a training architecture of a first model provided in an embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating training of a second model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an attack detection apparatus in face recognition according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
The embodiment of the specification provides an attack detection method, an attack detection device, attack detection equipment and a storage medium in face recognition.
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification. It is obvious that the described embodiments are only a part of the embodiments of the present application, rather than all of them. All other embodiments obtained by a person skilled in the art without creative effort based on the embodiments of the present specification shall fall within the scope of protection of the present application.
The current methods for deepfakes detection can be divided into two categories. The first category comprises deepfakes detection methods based on a single-frame image. These methods take a single-frame image as input and train a deep learning model for classification to judge whether an input sample is a deepfakes attack. Such models have limited detection performance because the input information is limited. The second category comprises deepfakes detection methods based on video data. These methods receive multi-frame video data as input and perform deepfakes detection by combining time-sequence information such as optical flow and key-point stability. Although the accuracy of these methods is improved, their overall computational complexity is too high, so they are difficult to apply on the end side. Based on this, the embodiments of the present specification provide an attack detection scheme that can accurately detect attacks in face recognition at the client side.
In a first aspect, as shown in fig. 1, fig. 1 is a schematic flowchart of an attack detection method in face recognition according to one or more embodiments of the present specification, where the method includes:
s101, obtaining a training sample containing a partial occlusion image, reconstructing the partial occlusion image, and generating a reconstructed image.
A training sample should contain an image of which part is occluded. The occlusion may be performed using another similar image, or using a blank image.
For example, for an eye region in the original image, occlusion is performed using a similar eye image (which may be generated based on an algorithm); or, for the eye region in the original image, a blank image is used for occlusion.
In the training sample, an occlusion flag can be added to mark the occluded area, the flag indicating the pixels that are occluded.
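The sample-construction step above can be sketched as follows; all names and sizes are illustrative, not taken from the patent, and a random array stands in for a real face image:

```python
import numpy as np

def occlude_region(image, top, left, height, width, fill=None):
    """Occlude a rectangular region of `image` and return the occluded
    copy together with a boolean occlusion flag marking which pixels
    were occluded. If `fill` is None the region is blanked with zeros;
    otherwise `fill` (e.g. a similar patch generated by some algorithm)
    is pasted in."""
    occluded = image.copy()
    flag = np.zeros(image.shape[:2], dtype=bool)
    flag[top:top + height, left:left + width] = True
    occluded[top:top + height, left:left + width] = 0 if fill is None else fill
    return occluded, flag

# Hypothetical 64x64 grayscale face image; blank out an "eye" region.
face = np.random.default_rng(0).random((64, 64))
sample, flag = occlude_region(face, top=20, left=12, height=8, width=16)
```

The returned flag plays the role of the occlusion mark: it tells the reconstruction stage which pixels must be filled in.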
The model to be trained may include a module for image reconstruction, such as a content reconstruction module, which can perform image reconstruction based on the extracted feature vector to obtain a reconstructed image.
During the reconstruction process, multiple reconstruction iterations can be adopted. For example, among the pixels indicated by the occlusion flag, the image is first partially reconstructed at the pixels adjacent to the unoccluded face region (i.e., the pixels at the edge of the occluded region); after this partial reconstruction is completed, the partially reconstructed image is fed back into the model as input, the corresponding feature vector for the content reconstruction module is extracted again, and image reconstruction is performed again, so that the whole reconstruction process is completed through multiple iterations.
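The edge-inward, multi-pass reconstruction described above can be sketched as follows. A known-neighbour mean stands in for the learned content reconstruction module (an assumption purely for illustration); what the sketch preserves is the iteration structure: each pass fills only the edge of the occluded hole, then feeds the partially reconstructed image back into the next pass.

```python
import numpy as np

def reconstruct_iteratively(image, flag, max_iters=64):
    """Fill occluded pixels (flag == True) from the edge of the hole
    inward. Each pass reconstructs only the pixels that have at least
    one already-known 4-neighbour, then the partially reconstructed
    image becomes the input of the next pass."""
    img = image.copy()
    unknown = flag.copy()
    for _ in range(max_iters):
        if not unknown.any():
            break
        known = ~unknown
        # Unknown pixels with at least one known 4-neighbour = hole edge.
        padded = np.pad(known, 1)
        neigh_known = (padded[:-2, 1:-1] | padded[2:, 1:-1]
                       | padded[1:-1, :-2] | padded[1:-1, 2:])
        edge = unknown & neigh_known
        for y, x in zip(*np.nonzero(edge)):
            y0, y1 = max(y - 1, 0), min(y + 2, img.shape[0])
            x0, x1 = max(x - 1, 0), min(x + 2, img.shape[1])
            window, wmask = img[y0:y1, x0:x1], known[y0:y1, x0:x1]
            img[y, x] = window[wmask].mean()  # stand-in for the model
        unknown &= ~edge
    return img
```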
S103, training and generating a first model according to the difference between the reconstructed image and the training sample.
Specifically, if the training sample is a normal training sample, when feature extraction is performed after multiple reconstructions, the corresponding facial features (i.e., the reconstruction features) should be as consistent as possible with the facial features of the original image; if the training sample is an attack (adversarial) training sample, the facial features corresponding to the reconstructed image and the facial features of the original image will instead show inconsistency. This consistency can be characterized by the similarity between the features: the higher the similarity, the higher the consistency.
Further, the loss of consistency of the reconstructed features and the facial features may be determined based on the extracted reconstructed features corresponding to the reconstructed image. Training of the first model is performed based on the reconstructed feature consistency loss.
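Since consistency is characterized by feature similarity, the reconstruction feature consistency loss can be sketched as below. Cosine similarity is an assumption here — the patent says only that the loss is based on similarity between the two feature vectors, without fixing the measure:

```python
import numpy as np

def reconstruction_consistency_loss(orig_feat, recon_feat, eps=1e-8):
    """Reconstruction feature consistency loss, sketched as
    1 - cosine similarity between the facial features of the original
    sample and the features re-extracted from the reconstructed image:
    the higher the similarity (consistency), the lower the loss."""
    orig = np.asarray(orig_feat, dtype=float)
    recon = np.asarray(recon_feat, dtype=float)
    cos = orig @ recon / (np.linalg.norm(orig) * np.linalg.norm(recon) + eps)
    return 1.0 - cos
```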
In particular embodiments, the first model may include three parts: the first part is a feature encoding module, the second part is a deepfakes classifier module, and the third part is a content reconstruction module. As shown in fig. 2, fig. 2 is a schematic diagram of the training architecture of the first model provided in the embodiment of the present disclosure.
In the training process, the input of the feature encoding module is the face image contained in a training sample, and its output is the extracted facial features of the training sample; the input of the deepfakes classifier module is the extracted facial features, and its output is the sample classification result; the input of the content reconstruction module is a partially reconstructed image, and its output is the corresponding reconstructed image. The content reconstruction module can complete the whole reconstruction process through repeated iterative reconstruction, and such hierarchical reconstruction is more controllable and stable.
Correspondingly, the loss function can comprise three parts: the first part is the classification loss, determined by classifying according to the facial features; the second part is the reconstruction loss, determined according to the difference between the reconstructed image and the original image (the image of the unoccluded training sample); and the third part is the reconstruction feature consistency loss between the reconstruction features extracted from the reconstructed image and the facial features of the original image. When multiple iterative reconstructions exist in the process, the final image of the multiple iterative reconstructions can be input into the feature encoder and compared with the features of the original image to determine the reconstruction feature consistency loss.
Finally, the classification loss, the reconstruction loss and the reconstruction feature consistency loss are fused, the fused loss value is back-propagated to train the feature encoding module, the deepfakes classifier module and the content reconstruction module, and a usable first model is generated when the loss value converges.
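The fusion step can be written out as a weighted sum; the weights are illustrative hyperparameters, since the patent states only that the three losses are fused, not how they are weighted:

```python
def fused_loss(l_cls, l_rec, l_feat, w_cls=1.0, w_rec=1.0, w_feat=1.0):
    """Fuse the classification, reconstruction and reconstruction-feature
    consistency losses into the single value that is back-propagated.
    The weights w_* are assumed hyperparameters, not from the patent."""
    return w_cls * l_cls + w_rec * l_rec + w_feat * l_feat
```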
The trained first model can perform deepfakes attack detection on a picture to be recognized in two ways. The first way is to extract features of the input picture to be recognized and classify the picture based on the deepfakes classifier. The second way is to perform multi-iteration picture reconstruction on the input picture to be recognized based on the trained content reconstruction module in the first model, extract the facial features of the reconstructed image and of the original image (i.e., the input picture to be recognized), compare their consistency to determine the difference between them, and, when the difference exceeds a threshold T, judge the picture to be recognized to be a deepfakes attack.
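The second detection way — reconstruct, re-extract, compare against the threshold T — can be sketched as below. `extract_features` and `reconstruct` are hypothetical stand-ins for the trained feature encoder and content reconstruction module, and 1 - cosine similarity is an assumed choice of feature difference:

```python
import numpy as np

def detect_by_reconstruction(extract_features, reconstruct, image, threshold):
    """Flag a deepfakes attack when the difference between the features
    of the input and the features of its reconstruction exceeds the
    threshold T described above."""
    orig = extract_features(image)
    recon = extract_features(reconstruct(image))
    cos = orig @ recon / (np.linalg.norm(orig) * np.linalg.norm(recon) + 1e-8)
    return (1.0 - cos) > threshold  # True -> judged a deepfakes attack
```

For a normal sample the reconstruction stays close to the input, so the difference stays under T; for an attack the reconstruction drifts, pushing the difference over T.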
It should be noted that the first model is generally a high-precision, high-complexity model, which is suitable for deployment on the server side with stronger computing power and generally not suitable for deployment on the client side with weaker computing power.
And S105, training and generating a second model according to the first model, wherein the second model is a lightweight model partially heterogeneous with the first model.
As mentioned before, the first model is not suitable for deployment on the client side. Based on this, model distillation based on the first model is needed to obtain a lightweight model for deployment on the client side.
In model distillation, there may be isomorphic model distillation or partially heterogeneous model distillation. In the context of the embodiments of the present description, the encoder of the facial features in the first model is a Transformer neural network; correspondingly, the encoder of the facial features in the second model is a convolutional neural network (CNN). This is because the Transformer can provide good detection capability, while the CNN provides friendly end-side deployment capability; therefore, by changing the encoder to a heterogeneous CNN model, model performance close to that of the Transformer can be obtained through heterogeneous distillation while remaining suitable for deployment on the client side, which better fits actual situations.
Meanwhile, in the second model, in order to realize the model distillation based on the first model, the other part of the structure is still kept as a lightweight model which has the same function as the other structure of the first model.
Specifically, the second model comprises a feature encoder of the CNN type, and further comprises a deepfakes classification module and a hierarchical content reconstruction module which have the same functions as those in the first model. As shown in fig. 3, fig. 3 is an architecture diagram of the training of the second model provided in an embodiment of the present specification, in which the deepfakes classification module and the content reconstruction module contained in the first model are omitted and not shown.
Compared with the first model, the functions of the deepfakes classification module and the hierarchical content reconstruction module contained in the second model are the same; that is, the deepfakes classification module can likewise realize classification, and the hierarchical content reconstruction module can likewise realize image reconstruction as in the first model, but the structures in the second model are lighter.
Lightweight refers to a lighter model structure along a number of different dimensions. First, the module with the same function contains fewer network layers and fewer channels; for example, in the first model the deepfakes classifier module may contain 10 hidden layers, while in the second model there may be only 5 hidden layers. Second, the module with the same function requires less computation; for example, in the first model, a 9×9 convolution kernel may be used when convolving the input image, while in the second model a 3×3 kernel may be used. Third, the module with the same function contains fewer parameters to be trained; for example, in the first model the fully-connected layer in the deepfakes classifier may contain 200 parameters to be trained, while in the second model it may contain only 50.
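The effect of these lightweight choices is easy to quantify for a single convolution layer. The channel counts below are illustrative, not taken from the patent; only the 9×9 versus 3×3 kernel contrast comes from the text above:

```python
def conv_params(in_ch, out_ch, kernel):
    """Trainable parameters of one convolution layer (weights + biases)."""
    return out_ch * (in_ch * kernel * kernel + 1)

# Shrinking the kernel from 9x9 to 3x3 and halving the channels cuts
# one layer's parameter count by more than an order of magnitude.
teacher_layer = conv_params(64, 64, 9)   # first-model-style layer
student_layer = conv_params(32, 32, 3)   # second-model-style layer
```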
In this process, since the first model and the second model encode features in different ways in their feature encoders, a heterogeneous distillation module can be added to the second model in order to make the encoder trained in the second model close to that of the first model.
The input of the heterogeneous distillation module is the feature map generated by the CNN feature extraction module in the second model, and its output approximates the self-attention matrix generated by the Transformer neural network in the first model.
In other words, the first model may generate a first self-attention matrix for the training sample, comprising Q, K and V matrices, based on the Transformer neural network. Q, K and V are obtained by linear transformations of the input feature vectors, where the weight matrices of the linear transformations are learned; these transformations improve the fitting ability of the model. The resulting Q, K and V can be regarded as follows: Q represents the information to be queried, K represents the vectors to be queried against, and V represents the values obtained by the query.
Since the first model is a trained model and the second model obtained by training should approximate the first model, the facial features of the training sample can be obtained based on the CNN, the heterogeneous distillation module can generate a corresponding second self-attention matrix from the facial features obtained by the CNN, the heterogeneous distillation loss L_KD can be determined based on the difference between the first self-attention matrix and the second self-attention matrix, and the second model can be trained according to the heterogeneous distillation loss.
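A minimal sketch of the attention construction and the distillation loss follows. The softmax(QK^T/√d) form is the standard Transformer self-attention; using a mean-squared difference between the two attention matrices for L_KD is an assumption, since the patent specifies only that the loss is based on the difference between them:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(tokens, w_q, w_k):
    """Self-attention map softmax(Q K^T / sqrt(d)) from token features,
    with learned projection matrices w_q, w_k."""
    q, k = tokens @ w_q, tokens @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def heterogeneous_distillation_loss(teacher_attn, student_attn):
    """L_KD sketched as the mean squared difference between the
    teacher's (Transformer) self-attention matrix and the one the
    distillation module derives from the student's CNN features."""
    diff = np.asarray(teacher_attn) - np.asarray(student_attn)
    return float(np.mean(diff ** 2))
```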
Specifically, when the second model is trained, since it also includes the hierarchical content reconstruction module and the deepfakes classification module, which have the same functions as in the first model but are lightweight, the second model can likewise produce, for the training sample, the classification loss L_cls, the reconstruction loss L_rec and the reconstruction feature consistency loss L_feat. At this point, the classification loss, reconstruction loss and reconstruction feature consistency loss of the second model for the training sample can be fused with the heterogeneous distillation loss to train and generate the second model. That is, for the second model, the loss function during its training may be computed as L2 = L_cls + L_rec + L_feat + L_KD, and the network is trained based on this model structure and loss function until the model converges, generating a usable second model.
And S107, acquiring the picture to be recognized containing the face, and performing attack detection by using the second model.
After the second model is obtained through training, the second model can be deployed to the client, and attack detection is performed on the obtained picture to be recognized by using the second model on the client side.
In particular, the second model can also perform deepfakes attack detection in two ways, similar to the first model.
The first method is to acquire a picture to be recognized containing a human face, extract the facial features of the picture to be recognized by adopting the second model, and classify the attack categories of the facial features of the picture to be recognized.
The second way is to reconstruct an image according to the facial features in the picture to be recognized, and to perform attack detection according to the difference between the facial features corresponding to the reconstructed image and the facial features of the picture to be recognized. If a multi-iteration reconstruction mode is adopted in the reconstruction process, attack detection can be performed based on the difference between the facial features corresponding to the multi-iteration reconstructed image and the facial features of the original image; for example, if the difference is greater than a threshold T, a deepfakes attack is determined.
In summary, a training sample containing a partially occluded image is acquired, and the partially occluded image is reconstructed to generate a reconstructed image; a first model is trained according to the difference between the reconstructed image and the training sample; a second model is trained according to the first model, the second model being a lightweight model partially heterogeneous with the first model; and a picture to be recognized containing a human face is acquired and attack detection is performed using the second model. This realizes the training of the lightweight second model based on hierarchical image reconstruction and heterogeneous model distillation, and enables accurate attack detection in face recognition at the client side.
In one embodiment, the second model may also be quantized before attack detection using the second model. The quantization mode may be performed synchronously during the process of training the second model based on the first model, or may be performed after the second model is trained.
Quantization of the second model refers to modifying part of the parameters in the second model from floating-point type to integer type (for example, converting 32-bit floating-point numbers to 8-bit integers, int8); the accuracy of the quantized model is expected to be similar to that before quantization.
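The float32-to-int8 conversion mentioned above can be sketched with a simple symmetric scheme. This is a generic illustration of weight quantization, not the patent's specific method:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric quantization of float32 weights to int8: scale by the
    maximum absolute value, round, and keep the scale so the weights
    can be dequantized at inference time."""
    w = np.asarray(weights, dtype=np.float32)
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid a zero scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale
```

The round-trip error of each weight is bounded by half the scale, which is why accuracy after quantization can stay close to accuracy before it.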
Based on this, the embodiments of the present specification provide a hybrid quantization manner, that is, some layers in the second model are selected for quantization. Specifically, in the training process of the second model, a quantization cost-performance evaluation module is added to evaluate the quantization cost performance of each layer.
In the training process, the parameters of each layer in the second model change as training proceeds. At this time, the parameters in a layer can be mapped into a vector, the L1 norm of that vector can be calculated, and this L1 norm can be used as the local quantization sparse loss corresponding to the layer, for evaluating the quantization cost performance of that layer. Generally speaking, the smaller the local quantization sparse loss, the more parameters in the layer can be quantized to 0, and the higher the quantization cost performance of the layer. In this way, it can be ensured that the quantizable parameters in any layer that needs to be quantized are sufficiently sparse, so as to maintain the accuracy of the model.
Meanwhile, the parameters contained in all layers of the second model can be mapped into another vector, the L1 norm of this vector can be calculated, and this L1 norm can be determined as the global quantization sparse loss corresponding to the model. The smaller the global quantization sparse loss, the more parameters over the whole model can be quantized to 0. In this way, the number of quantizable layers in the second model can be increased as much as possible, improving the lightweight degree of the model and the computational efficiency of the quantized second model.
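The two sparse losses above are both L1 norms over flattened parameter vectors and can be sketched directly; the per-layer list-of-arrays representation is an illustrative choice:

```python
import numpy as np

def local_sparse_loss(layer_params):
    """Local quantization sparse loss for one layer: map the layer's
    parameters to a vector and take its L1 norm. A smaller value means
    more parameters can be driven to 0, i.e. higher quantization cost
    performance for that layer."""
    vec = np.concatenate([np.asarray(p, dtype=float).ravel()
                          for p in layer_params])
    return float(np.abs(vec).sum())

def global_sparse_loss(layers):
    """Global quantization sparse loss: L1 norm over the parameters of
    every layer in the model."""
    return sum(local_sparse_loss(layer) for layer in layers)
```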
In the foregoing quantization manner, the local quantization sparse loss and the global quantization sparse loss may be fused to train and generate the quantized second model. In this case, on the basis of the loss value of the second model described above, L2 = L_cls + L_rec + L_feat + L_KD, the local quantization sparse loss L_r-sparse and the global quantization sparse loss L_w-sparse are added, i.e., L2 = L_cls + L_rec + L_feat + L_KD + L_r-sparse + L_w-sparse. Network training is carried out based on this model structure and loss function until the model converges, and the generated second model is a quantized second model. In this way, quantization of the second model can be realized synchronously during the training of the second model while its performance is kept close to that of the first model, and the training is simpler and more convenient.
Based on the same idea, one or more embodiments of the present specification further provide apparatuses and devices corresponding to the above-described method, as shown in fig. 4 and 5.
In a second aspect, as shown in fig. 4, fig. 4 is a schematic structural diagram of an attack detection apparatus in face recognition according to an embodiment of the present disclosure, where the apparatus includes:
the sample acquisition module 401 acquires a training sample containing a partially occluded image, reconstructs the partially occluded image, and generates a reconstructed image;
a first model training module 403, which trains and generates a first model according to the difference between the reconstructed image and the training sample;
a second model training module 405, which trains and generates a second model according to the first model, wherein the second model is a lightweight model partially heterogeneous to the first model;
and the attack detection module 407 is used for acquiring the picture to be recognized containing the face and performing attack detection by using the second model.
Optionally, the first model training module 403 acquires a training sample containing a human face and extracts the facial features of the training sample; classifies according to the facial features and determines the classification loss; determines the reconstruction loss according to the difference between the reconstructed image and the training sample; extracts the reconstruction features corresponding to the reconstructed image and determines the reconstruction feature consistency loss between the reconstruction features and the facial features; and fuses the classification loss, the reconstruction loss and the reconstruction feature consistency loss to train and generate the first model.
Optionally, in the apparatus, the encoder of the facial features in the first model is a Transformer neural network; accordingly, the encoder of the facial features in the second model is a convolutional neural network (CNN), and the second model is a lightweight model whose other structures function the same as those of the first model.
Optionally, a second model training module 405, determining a first self-attention matrix of the first model for the training sample; acquiring facial features of a training sample by adopting the convolutional neural network CNN, and generating a corresponding second self-attention matrix according to the facial features; and determining heterogeneous distillation loss according to the difference between the first self-attention matrix and the second self-attention matrix, and training to generate a second model according to the heterogeneous distillation loss.
Optionally, the second model training module 405 determines the classification loss, reconstruction loss and reconstruction feature consistency loss of the second model for the training sample; and fuses the classification loss, reconstruction loss and reconstruction feature consistency loss of the second model for the training sample with the heterogeneous distillation loss to train and generate the second model.
Optionally, the apparatus further includes a quantization module 409, which performs quantization evaluation on parameters of any layer included in the second model, and determines a quantization cost performance of the layer; and training and generating a quantized second model according to the quantization cost performance.
Optionally, the quantization module 409 determines a local quantization sparse loss corresponding to an arbitrary layer in the second model according to a quantization cost performance corresponding to the layer; determining global quantization sparse loss corresponding to the model according to the quantization cost performance ratio corresponding to all layers in the second model; and fusing the local quantization sparse loss and the global quantization sparse loss training to generate a quantized second model.
Optionally, the attack detection module 407 obtains a picture to be recognized containing a human face, and extracts the facial features of the picture to be recognized by using the second model; performs attack category classification on the facial features of the picture to be recognized; or reconstructs an image according to the facial features of the picture to be recognized, and performs attack detection according to the difference between the facial features corresponding to the reconstructed image and the facial features of the picture to be recognized.
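The reconstruction-difference branch of attack detection above can be sketched with a simple feature-distance score. The cosine distance and the threshold value are illustrative assumptions; the text does not specify the distance measure or decision rule.

```python
import numpy as np

def reconstruction_attack_score(feat_input, feat_reconstructed):
    """Cosine distance between the facial features of the picture to be
    recognized and those of its reconstruction. Genuine faces tend to
    reconstruct faithfully, so a large distance suggests an attack."""
    a = feat_input / (np.linalg.norm(feat_input) + 1e-8)
    b = feat_reconstructed / (np.linalg.norm(feat_reconstructed) + 1e-8)
    return 1.0 - float(a @ b)

def is_attack(feat_input, feat_reconstructed, threshold=0.3):
    """Flag an attack when the score exceeds a threshold (the value 0.3
    is illustrative, not taken from the text)."""
    return reconstruction_attack_score(feat_input, feat_reconstructed) > threshold
```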
In a third aspect, as shown in fig. 5, fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification, where the electronic device includes:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, based on the same idea, the present specification further provides a non-volatile computer storage medium corresponding to the method described above, and storing computer-executable instructions, which, when read by a computer, cause one or more processors to execute the method according to the first aspect.
In the 1990s, an improvement in a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's improvements in method flows can be regarded as direct improvements in hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating an integrated circuit chip, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to a software compiler used in program development and writing, and the source code to be compiled is written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained by merely logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented by logically programming the method steps such that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Or even the means for performing the functions may be regarded as both a software module for performing the method and a structure within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, the present specification embodiments may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus, device, and non-volatile computer storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the partial description of the method embodiments for relevant points.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is merely one or more embodiments of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to one or more embodiments of the present description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. An attack detection method in face recognition comprises the following steps:
acquiring a training sample containing a partially-occluded image, reconstructing the partially-occluded image, and generating a reconstructed image;
training and generating a first model according to the difference between the reconstructed image and the training sample;
training to generate a second model according to the first model, wherein the second model is a lightweight model which is partially heterogeneous with the first model;
and acquiring a picture to be recognized containing a human face, and performing attack detection by using the second model.
2. The method of claim 1, wherein training to generate a first model according to the difference between the reconstructed image and the training sample comprises:
acquiring a training sample containing a human face, and extracting facial features of the training sample;
classifying according to the facial features and determining classification loss;
determining reconstruction loss according to the difference between the reconstructed image and the training sample;
extracting a reconstruction feature corresponding to the reconstructed image, and determining a reconstruction feature consistency loss according to the difference between the reconstruction feature and the facial features;
and fusing the classification loss, the reconstruction loss and the reconstruction feature consistency loss to train and generate a first model.
3. The method of claim 1, wherein the second model is a lightweight model that is partially heterogeneous with the first model, comprising:
an encoder of the facial features in the first model is a Transformer neural network;
accordingly, the encoder of the facial features in the second model is a convolutional neural network CNN, and the second model is a lightweight model whose other structures function the same as those of the first model.
4. The method of claim 3, wherein training to generate a second model from the first model comprises:
determining a first self-attention matrix of the first model for the training sample;
acquiring facial features of the training sample by using the convolutional neural network CNN, and generating a corresponding second self-attention matrix according to the facial features;
and determining heterogeneous distillation loss according to the difference between the first self-attention matrix and the second self-attention matrix, and training to generate a second model according to the heterogeneous distillation loss.
5. The method of claim 4, wherein training to generate a second model according to the heterogeneous distillation loss comprises:
determining a classification loss, a reconstruction loss and a reconstruction feature consistency loss of the second model for the training sample;
and fusing the classification loss, the reconstruction feature consistency loss and the heterogeneous distillation loss of the second model for the training sample to train and generate the second model.
6. The method of claim 1, wherein prior to attack detection using the second model, the method further comprises:
performing quantization evaluation on the parameters of any layer included in the second model, and determining the quantization cost-performance ratio of the layer;
and training to generate a quantized second model according to the quantization cost-performance ratio.
7. The method of claim 6, wherein training to generate a quantized second model according to the quantization cost-performance ratio comprises:
determining a local quantization sparse loss corresponding to any layer in the second model according to the quantization cost-performance ratio of the layer;
determining a global quantization sparse loss corresponding to the model according to the quantization cost-performance ratios of all the layers in the second model;
and fusing the local quantization sparse loss and the global quantization sparse loss to train and generate a quantized second model.
8. The method of claim 1, wherein obtaining a picture to be recognized including a human face and using the second model for attack detection comprises:
acquiring a picture to be recognized containing a human face, and extracting the facial features of the picture to be recognized by using the second model;
performing attack category classification on the facial features of the picture to be recognized;
or, reconstructing an image according to the facial features of the picture to be recognized, and performing attack detection according to the difference between the facial features corresponding to the reconstructed image and the facial features of the picture to be recognized.
9. An attack detection apparatus in face recognition, comprising:
the sample acquisition module is used for acquiring a training sample containing a partial occlusion image, reconstructing the partial occlusion image and generating a reconstructed image;
the first model training module is used for training and generating a first model according to the difference between the reconstructed image and the training sample;
the second model training module is used for training and generating a second model according to the first model, wherein the second model is a lightweight model partially heterogeneous with the first model;
and the attack detection module is used for acquiring the picture to be identified containing the face and using the second model to carry out attack detection.
10. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
CN202211724495.3A 2022-12-30 2022-12-30 Attack detection method, device and equipment in face recognition Pending CN115966008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211724495.3A CN115966008A (en) 2022-12-30 2022-12-30 Attack detection method, device and equipment in face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211724495.3A CN115966008A (en) 2022-12-30 2022-12-30 Attack detection method, device and equipment in face recognition

Publications (1)

Publication Number Publication Date
CN115966008A true CN115966008A (en) 2023-04-14

Family

ID=87363777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211724495.3A Pending CN115966008A (en) 2022-12-30 2022-12-30 Attack detection method, device and equipment in face recognition

Country Status (1)

Country Link
CN (1) CN115966008A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015161816A1 (en) * 2014-04-25 2015-10-29 Tencent Technology (Shenzhen) Company Limited Three-dimensional facial recognition method and system
WO2020199693A1 (en) * 2019-03-29 2020-10-08 中国科学院深圳先进技术研究院 Large-pose face recognition method and apparatus, and device
CN112668519A (en) * 2020-12-31 2021-04-16 声耕智能科技(西安)研究院有限公司 Abnormal face recognition living body detection method and system based on MCCAE network and Deep SVDD network
CN114360026A (en) * 2022-01-11 2022-04-15 山东大学 A method and system for natural occlusion facial expression recognition with accurate attention
CN114550264A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and face recognition medium
CN114582031A (en) * 2022-02-18 2022-06-03 福建星网天合智能科技有限公司 Living body detection auxiliary method and device for face recognition system
KR20220107120A (en) * 2021-08-25 2022-08-02 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Method and apparatus of training anti-spoofing model, method and apparatus of performing anti-spoofing using anti-spoofing model, electronic device, storage medium, and computer program
CN114943995A (en) * 2022-05-12 2022-08-26 北京百度网讯科技有限公司 Training method of face recognition model, face recognition method and device
CN115273189A (en) * 2022-07-25 2022-11-01 支付宝(杭州)信息技术有限公司 Training method, device and equipment of face recognition model
CN115393186A (en) * 2022-07-22 2022-11-25 武汉工程大学 Face image super-resolution reconstruction method, system, device and medium
CN115512399A (en) * 2021-06-04 2022-12-23 长沙理工大学 Face fusion attack detection method based on local features and lightweight network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015161816A1 (en) * 2014-04-25 2015-10-29 Tencent Technology (Shenzhen) Company Limited Three-dimensional facial recognition method and system
WO2020199693A1 (en) * 2019-03-29 2020-10-08 中国科学院深圳先进技术研究院 Large-pose face recognition method and apparatus, and device
CN112668519A (en) * 2020-12-31 2021-04-16 声耕智能科技(西安)研究院有限公司 Abnormal face recognition living body detection method and system based on MCCAE network and Deep SVDD network
CN115512399A (en) * 2021-06-04 2022-12-23 长沙理工大学 Face fusion attack detection method based on local features and lightweight network
KR20220107120A (en) * 2021-08-25 2022-08-02 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Method and apparatus of training anti-spoofing model, method and apparatus of performing anti-spoofing using anti-spoofing model, electronic device, storage medium, and computer program
CN114360026A (en) * 2022-01-11 2022-04-15 山东大学 A method and system for natural occlusion facial expression recognition with accurate attention
CN114582031A (en) * 2022-02-18 2022-06-03 福建星网天合智能科技有限公司 Living body detection auxiliary method and device for face recognition system
CN114550264A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and face recognition medium
CN114943995A (en) * 2022-05-12 2022-08-26 北京百度网讯科技有限公司 Training method of face recognition model, face recognition method and device
CN115393186A (en) * 2022-07-22 2022-11-25 武汉工程大学 Face image super-resolution reconstruction method, system, device and medium
CN115273189A (en) * 2022-07-25 2022-11-01 支付宝(杭州)信息技术有限公司 Training method, device and equipment of face recognition model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
司春晖: "基于深度学习的遮挡人脸识别方法研究", 中国优秀硕士学位论文, 15 June 2025 (2025-06-15) *
张赛;芮挺;任桐炜;杨成松;邹军华;: "基于监督学习深度自编码器的图像重构", 计算机科学, no. 11, 15 November 2018 (2018-11-15) *
王彦秋: "基于大数据的人脸识别方法", 现代电子技术, 31 December 2021 (2021-12-31) *

Similar Documents

Publication Publication Date Title
Conde et al. Instructir: High-quality image restoration following human instructions
CN117372631B (en) A training method and application method for a multi-view image generation model
CN108765334A (en) A kind of image de-noising method, device and electronic equipment
CN113435585B (en) Service processing method, device and equipment
CN107977634A (en) A kind of expression recognition method, device and equipment for video
CN111652351A (en) Deployment method, device and medium of neural network model
CN112990172B (en) Text recognition method, character recognition method and device
CN117523323B (en) A detection method and device for generating an image
CN115240103A (en) Model training method and device based on video and text
CN111753878A (en) Network model deployment method, equipment and medium
CN112308113A (en) Target identification method, device and medium based on semi-supervision
CN118428404A (en) A knowledge distillation method, device and equipment for a model
CN116543264A (en) Training method of image classification model, image classification method and device
CN115240102A (en) Model training method and device based on images and texts
CN118862176B (en) Desensitization model training method, image desensitization method and device
CN115966008A (en) Attack detection method, device and equipment in face recognition
KR20210109387A (en) Bidirectionally associative dual autoencoder and image translation method using the same
CN117830564A (en) A 3D virtual human model reconstruction method guided by posture distribution
CN117671088A (en) Dance sequence generation method, device, electronic equipment and computer storage medium
CN116630480A (en) A method, device and electronic device for interactive text-driven image editing
CN115966007A (en) Anti-attack detection method, device and equipment
CN116186540A (en) Data processing method, device and equipment
CN116664514A (en) Data processing method, device and equipment
Vivekananthan Emotion Classification of Children Expressions
CN116975678A (en) Classification model training methods and devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 310000 Zhejiang Province, Hangzhou City, Xihu District, Xixi Road 543-569 (continuous odd numbers) Building 1, Building 2, 5th Floor, Room 518

Applicant after: Alipay (Hangzhou) Digital Service Technology Co.,Ltd.

Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Alipay (Hangzhou) Information Technology Co., Ltd.

Country or region before: China

CB02 Change of applicant information