
CN113627328B - Electronic device, image recognition method thereof, system on chip and medium


Info

Publication number
CN113627328B
Authority
CN
China
Prior art keywords
face
image
processor
frame
alignment
Prior art date
Legal status
Active
Application number
CN202110913181.7A
Other languages
Chinese (zh)
Other versions
CN113627328A
Inventor
阮小飞
杨磊
尚峰
黄敦博
刘宇轩
Current Assignee
ARM Technology China Co Ltd
Original Assignee
ARM Technology China Co Ltd
Priority date
Filing date
Publication date
Application filed by ARM Technology China Co Ltd filed Critical ARM Technology China Co Ltd
Priority to CN202110913181.7A
Publication of CN113627328A
Application granted
Publication of CN113627328B
Status: Active

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the field of image processing, and discloses an electronic device, an image recognition method thereof, and a medium. In the method, the electronic device processes the M-th frame image as follows after acquiring it: the first processor performs face detection and face alignment on the M-th frame image to obtain a first frame-selection parameter and a first alignment parameter, while the second processor performs face region-of-interest extraction and face alignment on the M-th frame image using a second frame-selection parameter and a second alignment parameter; the second frame-selection parameter and the second alignment parameter were obtained by the first processor performing face detection and face alignment on the (M-1)-th frame image after the electronic device acquired that frame. The first processor then performs face feature extraction and feature matching on the M-th frame image based on the second processor's face region-of-interest extraction result and face alignment result, and obtains the face recognition result of the M-th frame image. This shortens the end-to-end image processing time to a certain extent and improves the processing efficiency of the algorithm.

Description

Electronic device, image recognition method thereof, system on chip and medium
Technical Field
The present application relates to the field of image processing, and in particular, to an electronic device, an image recognition method thereof, a system on a chip, and a medium.
Background
With the rapid development of artificial intelligence (AI), neural network processing units (NPUs) are becoming increasingly common in intelligent electronic devices.
In a system or chip that includes an NPU and an image signal processor (ISP), the NPU performs image recognition processing, such as face recognition, on an input image using a neural network model.
Generally, in the face recognition process, the NPU applies a series of algorithms to each image acquired from the ISP, typically the following four stages: face detection (including extraction of a region of interest, ROI), face alignment (converting an improperly angled image into a properly angled one), face feature extraction, and face recognition. However, these four stages run in series, one after another, which takes a long time and introduces large delay.
Disclosure of Invention
The embodiment of the application provides electronic equipment, an image recognition method thereof, a system-on-chip and a medium.
In a first aspect, an embodiment of the present application provides an image recognition method. The method is applied to an electronic device that includes a first processor and a second processor, and comprises:
The electronic device performs face recognition on the frame images of a video to be recognized through the first processor and the second processor, where the electronic device processes the M-th frame image as follows after acquiring it:
The first processor performs face detection and face alignment on the M-th frame image to obtain a first frame-selection parameter and a first alignment parameter, while the second processor performs face region-of-interest extraction and face alignment on the M-th frame image using a second frame-selection parameter and a second alignment parameter; the second frame-selection parameter and the second alignment parameter were obtained by the first processor performing face detection and face alignment on the (M-1)-th frame image after the electronic device acquired that frame;
The first processor performs face feature extraction and feature matching on the M-th frame image based on the second processor's face region-of-interest extraction result and face alignment result for the M-th frame image, and obtains the face recognition result of the M-th frame image.
It may be appreciated that the face region-of-interest extraction result may include the target face frame-selection region described in the embodiments below, and the face alignment result may include the target face image adjusted to a preset angle described in the embodiments below.
It will be appreciated that if the M-th frame corresponds to the (N-1)-th frame in the embodiments below, then the (M-1)-th frame corresponds to the (N-2)-th frame; if the M-th frame corresponds to the N-th frame, then the (M-1)-th frame corresponds to the (N-1)-th frame.
In the embodiment of the present application, because the position and pose of a face differ little between two adjacent frames of a high-frame-rate (e.g., 30 fps) video stream, the frame-selection parameter and face alignment parameter that the NPU 104 computed for the previous frame (the (N-1)-th frame) can be applied by the ISP 103 to the next frame (the N-th frame), with the ISP 103 performing face region-of-interest extraction and face alignment on that next frame. Because the frame-selection parameter and face alignment parameter were computed earlier by the NPU 104 and merely transmitted to the ISP 103, the ISP 103 saves the time of computing them itself, so the first time period T1' plus the second time period T2' is smaller than the first time period T1 plus the second time period T2. The total time of the four stages in the embodiment of the present application, T1' + T2' + T3 + T4, is therefore less than the prior-art total, T1 + T2 + T3 + T4. The end-to-end delay caused by serializing the four-stage algorithm is reduced, the end-to-end image processing time is shortened to a certain extent, and the algorithm processing efficiency is improved.
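As a rough illustration of the timing comparison above, the following sketch plugs assumed stage times into the two schedules; all numbers are illustrative assumptions, not measurements from the application.

```python
# Illustrative stage times in milliseconds (assumptions, not measured values).
T1, T2 = 8.0, 4.0    # NPU face detection / face alignment (computing the parameters)
T3, T4 = 6.0, 2.0    # NPU face feature extraction / feature matching
T1p, T2p = 2.0, 1.0  # ISP ROI extraction / alignment with precomputed parameters (T1', T2')

serial = T1 + T2 + T3 + T4         # prior art: four stages in series on the NPU
pipelined = T1p + T2p + T3 + T4    # this scheme: ISP applies frame (M-1)'s parameters

print(f"serial:    {serial} ms")    # 20.0 ms
print(f"pipelined: {pipelined} ms") # 11.0 ms, since T1' < T1 and T2' < T2
```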
In a possible implementation of the first aspect, the first processor is a neural network processor, and the second processor is an image signal processor.
In a possible implementation manner of the first aspect, the face region of interest extraction result includes a target face frame selection region, and the face alignment result includes a target face image adjusted to a preset angle.
It will be appreciated that the target face image adjusted to the preset angle may refer to a front-face image obtained by converting a side-face or improperly angled image.
In a possible implementation of the first aspect, the frame-selection parameter includes coordinate data of the edges of the target face frame-selection region, and the alignment parameter includes affine transformation matrix data for adjusting the target face to a preset angle.
In a possible implementation manner of the first aspect, the target face frame selection area is a rectangle, and the frame selection parameter is two vertex coordinate data on a diagonal line of the target face frame selection area or four vertex coordinate data of the target face frame selection area.
In a possible implementation manner of the first aspect, the video to be recognized is a video whose frame rate is greater than a preset frame rate.
It will be appreciated that in some embodiments, the ISP 103 may obtain the frame images in a video stream from the image sensor 102. The image sensor 102 may also send the frame images to the memory 107 for storage, with the ISP 103 obtaining two adjacent frame images from the memory 107. Further, the video stream may be a high-frame-rate (HFR) video stream, meaning video captured at a picture frequency of 30 frames per second or more, for example 30 FPS, 48 FPS, or 60 FPS. FPS (frames per second) is a term from the imaging field referring to the number of picture frames transmitted per second in an animation or video.
In a high-frame-rate face recognition system, the position and pose of the target face differ little between two adjacent frames. The target face ROI and the face affine transformation of the previous frame may therefore be applied to the target face ROI of the next frame, as described in detail below.
In a possible implementation of the first aspect, the preset frame rate is 30 frames per second.
In a second aspect, an embodiment of the present application provides a readable medium having instructions stored thereon which, when executed by an image signal processor, implement the image recognition method according to the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions for execution by one or more processors of the electronic device; and
A first processor and a second processor for supporting the first processor or the second processor to perform the image recognition method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a system on a chip, including: a memory for storing instructions for execution by one or more processors of the system on a chip; and
A first processor and a second processor for supporting the first processor or the second processor to execute the image recognition method according to any one of the first aspect.
Drawings
Fig. 1 shows a schematic diagram of the NPU performing face recognition on an image;
FIG. 2A is a schematic diagram of the NPU 104 and the ISP 103 performing face recognition on the (N-1)-th and N-th frame images in parallel;
FIG. 2B is a schematic diagram, corresponding to FIGS. 1 and 2A, of the principle of the parallel algorithm processing provided by the embodiment of the application, whose processing efficiency is improved compared with the serial algorithm processing of the prior art;
FIG. 2C is a schematic diagram of the NPU 104 and the ISP 103 performing face recognition on the (N-2)-th through (N+1)-th frame images in parallel;
FIG. 3 illustrates a schematic diagram of an electronic device 100, according to some embodiments of the application;
FIG. 4 illustrates the interaction process in which the NPU 104 performs face feature extraction and face recognition after the ISP 103 has performed face region-of-interest extraction and face alignment on the image data, according to an embodiment of the present application;
FIG. 5 illustrates a schematic diagram of the structure of the ISP 103, according to some embodiments of the application;
FIG. 6 is a schematic diagram of the process by which the general function module processes image data;
Fig. 7 illustrates a schematic diagram of a system-on-chip structure, according to some embodiments of the application.
Detailed Description
Illustrative embodiments of the application include, but are not limited to, electronic devices and image recognition methods and media therefor.
As described above, the background section raises the following technical problem:
When the NPU runs face detection, face alignment, face feature extraction, and face recognition in series, the four stages take a long time and introduce large delay. For example, fig. 1 shows a schematic diagram of the NPU performing face recognition on an image. As shown in fig. 1, the NPU 104 must execute the following stages: face detection in a first time period T1, face alignment in a second time period T2, face feature extraction in a third time period T3, and face recognition in a fourth time period T4. The four-stage algorithm works serially, so the total time is T1 + T2 + T3 + T4.
In the embodiment of the application, part of the algorithm processing is offloaded from the NPU to the ISP for asynchronous processing. The ISP mainly performs face region-of-interest extraction (extracting the face ROI directly according to the frame-selection parameters) and face alignment (converting improperly angled images into properly angled ones directly according to the affine transformation matrix), while the NPU mainly performs face feature extraction and face recognition. The ISP and the NPU work in parallel, which reduces the end-to-end delay caused by serializing the four-stage algorithm, shortens the image processing time to a certain extent, and improves the algorithm processing efficiency. For example, FIG. 2A shows a schematic diagram of the NPU 104 and the ISP 103 performing face recognition on the (N-1)-th and N-th frame images in parallel. As shown in fig. 2A, since the position and pose of the face differ little between two adjacent frames, the frame-selection parameter and face alignment parameter that the NPU 104 computed for the previous frame (the (N-1)-th frame) can be applied by the ISP 103 to the next frame (the N-th frame), with the ISP 103 performing face region-of-interest extraction and face alignment on that next frame. Because these parameters were computed earlier by the NPU 104 and merely transmitted to the ISP 103, the ISP 103 saves the time of computing them, so T1' + T2' is smaller than T1 + T2, and the total four-stage time of the embodiment, T1' + T2' + T3 + T4, is less than the prior-art total, T1 + T2 + T3 + T4. The end-to-end delay caused by serializing the four-stage algorithm is reduced, the image processing time is shortened to a certain extent, and the algorithm processing efficiency is improved.
Fig. 2B, corresponding to fig. 1 and fig. 2A, illustrates the principle of the parallel algorithm processing provided by the embodiment of the application, whose processing efficiency is improved compared with the serial algorithm processing of the prior art. As shown in FIG. 2B, A_{n-1} represents the NPU 104 performing face frame selection on the (N-1)-th frame image, B_{n-1} represents the NPU 104 performing face alignment on the (N-1)-th frame image, C_{n-1} represents the NPU 104 performing face feature extraction on the (N-1)-th frame image, and D_{n-1} represents the NPU 104 performing face recognition on the (N-1)-th frame image.
Similarly, A_n represents the NPU 104 performing face frame selection on the N-th frame image, B_n face alignment, C_n face feature extraction, and D_n face recognition.
A_{n-1}' represents the ISP 103 performing face frame selection on the (N-1)-th frame image using the frame-selection parameters of the previous frame obtained from the NPU 104, and A_n' represents the ISP 103 doing the same for the N-th frame image. B_{n-1}' represents the ISP 103 performing face alignment on the (N-1)-th frame image using the face alignment parameters of the previous frame obtained from the NPU 104, and B_n' represents the ISP 103 doing the same for the N-th frame image.
T1' represents the time the ISP 103 spends extracting the face region of interest (face frame selection) from the current frame (e.g., the (N-1)-th or N-th frame) using the frame-selection parameters of the previous frame obtained from the NPU 104; T2' represents the time the ISP 103 spends performing face alignment on the current frame using the face alignment parameters of the previous frame obtained from the NPU 104; T3 represents the time the NPU 104 spends extracting face features from the current frame; and T4 represents the time the NPU 104 spends performing face recognition on the current frame.
In the prior art, the NPU 104 performs face detection, face alignment, face feature extraction, and face recognition on the (N-1)-th frame image in time T1 + T2 + T3 + T4. In the embodiment of the application, the ISP 103 and the NPU 104 perform face region-of-interest extraction, face alignment, face feature extraction, and face recognition on the (N-1)-th frame image in parallel in time T1' + T2' + T3 + T4. Because the frame-selection and face-alignment parameters were computed earlier by the NPU 104 and transmitted to the ISP 103, the ISP 103 saves the time of computing them, so T1' < T1, T2' < T2, and T1' + T2' + T3 + T4 < T1 + T2 + T3 + T4. The end-to-end delay that would result from the NPU 104 sequentially performing the four stages on the (N-1)-th frame image is reduced, the image processing time is shortened to a certain extent, and the algorithm processing efficiency is improved.
Similarly, for the N-th frame image, the ISP 103 and the NPU 104 perform face region-of-interest extraction, face alignment, face feature extraction, and face recognition in parallel in time T1' + T2' + T3 + T4, which is smaller than the prior-art time T1 + T2 + T3 + T4 in which the NPU 104 performs face detection, face alignment, face feature extraction, and face recognition sequentially. The end-to-end delay is reduced, the end-to-end image processing time is shortened to a certain extent, and the algorithm processing efficiency is improved.
Fig. 2C shows a schematic diagram of the NPU 104 and the ISP 103 asynchronously processing the (N-2)-th through (N+1)-th frame images in parallel. As shown in fig. 2C, for each frame the ISP 103 always uses the frame-selection parameter and face alignment parameter that the NPU 104 computed for the previous frame image to perform face region-of-interest extraction and face alignment on the current frame image, while the NPU 104 concurrently performs face detection and face alignment on the current frame image to produce the parameters for the next frame; the NPU 104 then performs face feature extraction and face recognition on the aligned face delivered by the ISP 103. In this way, the end-to-end delay caused by the NPU 104 serially performing face detection, face alignment, face feature extraction, and face recognition on each of the (N-2)-th through (N+1)-th frame images is reduced, the end-to-end image processing time is shortened to a certain extent, and the algorithm processing efficiency is improved.
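The following sketch mimics this hand-off pattern with two Python threads: the ISP thread consumes the parameters the NPU computed on the previous frame, while the NPU thread refreshes the parameters on the current frame and then consumes the aligned face. The stage functions are hypothetical placeholders; only the scheduling pattern follows the description above.

```python
import threading, queue

params_q = queue.Queue()   # NPU -> ISP: (box, affine) computed on frame i
aligned_q = queue.Queue()  # ISP -> NPU: aligned face crop of frame i+1

def npu_detect_align(frame):                 # placeholder detection/alignment models
    return (10, 10, 110, 110), "affine-matrix"

def isp_crop_and_warp(frame, box, affine):   # placeholder ISP ROI + warp hardware
    return f"aligned({frame})"

def npu_extract_and_match(face):             # placeholder feature extraction + matching
    return f"identity-of-{face}"

def npu_worker(frames):
    for i, frame in enumerate(frames):
        params_q.put(npu_detect_align(frame))          # parameters for the next frame
        if i > 0:                                      # frame 0 has no precomputed params
            print(f"frame {i}:", npu_extract_and_match(aligned_q.get()))

def isp_worker(frames):
    for frame in frames[1:]:                           # frame i uses frame (i-1)'s params
        box, affine = params_q.get()
        aligned_q.put(isp_crop_and_warp(frame, box, affine))

frames = [f"F{i}" for i in range(5)]
t_npu = threading.Thread(target=npu_worker, args=(frames,))
t_isp = threading.Thread(target=isp_worker, args=(frames,))
t_npu.start(); t_isp.start(); t_npu.join(); t_isp.join()
```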
The image recognition method is suitable for various image processing scenarios, such as face recognition of video streams captured at a train station gate, and user identity authentication through facial images in video for a banking system, a mobile phone terminal, quick payment, or a vehicle-mounted face recognition system.
The terms involved in the embodiments of the present application are explained below to facilitate understanding.
(1) Region of interest (ROI): the region to be processed, selected from the image by a box, circle, ellipse, irregular polygon, or the like. In the embodiment of the application, the target face ROI is the region of the processed image that contains the face.
(2) Face detection (Face Detection), also called face frame selection, detects the position of a face in an image and extracts the face region of interest. The input of a face detection algorithm is an image, and the output is one or more target face ROIs, i.e., face frame-selection regions. In general, the output target face ROI is an axis-aligned square, but some face detection techniques output an axis-aligned rectangle or a rotated rectangle; for example, in the embodiment of the present application, as shown in fig. 2A, the NPU 104 detects a rectangular target face ROI in the (N-1)-th frame image.
(3) Face alignment: converting a side-face or improperly angled image into a front-face image. Face alignment improves face recognition accuracy. In the embodiment of the application, the front-face image may be obtained by transforming the side-face or improperly angled image during face alignment.
(4) Face feature extraction (Face Feature Extraction): the process of converting a face image into a fixed-length vector of values. This vector is called a "face feature" and has the ability to characterize the face, for example through eye, mouth, eyebrow, nose, and ear characteristics.
(5) Face recognition (Face Recognition): an algorithm that identifies the identity corresponding to an input face image. It takes a face feature as input and compares it one by one with the features of the N identities registered in a library to find the registered feature with the highest similarity to the input. The highest similarity is then compared with a preset threshold: if it is greater than the threshold, the identity corresponding to that feature is returned; otherwise "not in the library" is returned.
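As a minimal sketch of the 1:N comparison in (5), the snippet below scores a probe feature against a registered gallery by cosine similarity and applies a rejection threshold. The gallery, names, feature dimension, and threshold value are illustrative assumptions.

```python
import numpy as np

def recognize(feature, gallery, names, threshold=0.6):
    """feature: (d,) probe vector; gallery: (N, d) registered features."""
    f = feature / np.linalg.norm(feature)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ f                                  # cosine similarity per identity
    best = int(np.argmax(sims))
    return names[best] if sims[best] > threshold else "not in the library"

gallery = np.random.randn(3, 128)                 # 3 registered identities, 128-D features
names = ["alice", "bob", "carol"]
probe = gallery[1] + 0.05 * np.random.randn(128)  # a noisy copy of "bob"
print(recognize(probe, gallery, names))           # expected: bob
```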
The technical scheme and beneficial effects of the embodiment of the application are further described below with reference to the accompanying drawings.
Fig. 3 illustrates a schematic diagram of an electronic device 100, according to some embodiments of the application. The electronic device 100 may perform image acquisition, face detection (including extracting the target face ROI), face alignment (converting an improperly angled image into a properly angled one), and then face feature extraction and face recognition. As shown in fig. 3, the electronic device 100 includes: a camera 101, an image sensor 102, an ISP 103, an NPU 104, a central processing unit (CPU) 105, a display 106, a memory 107, and an interface module 108. The camera 101 is connected to the image sensor 102, the image sensor 102 is connected to the ISP 103, and the ISP 103, NPU 104, CPU 105, display 106, memory 107, and interface module 108 are coupled via the bus 109. The ISP 103, NPU 104, CPU 105, and memory 107 may be coupled via the bus 109 to form a system on chip (SoC) 1000; in other embodiments, they may be separate devices.
The camera 101 is used to collect the light signals reflected by a scene and present them on the image sensor 102. The camera 101 may be a fixed-focus lens, a zoom lens, a fisheye lens, a panoramic lens, or the like.
The image sensor 102 is configured to convert the optical signal collected by the camera 101 into an electrical signal and generate raw image (RAW) data, for example in Bayer format. The image sensor may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
The ISP 103 is an application-specific integrated circuit (ASIC) for image data processing, used to further process the image data formed by the image sensor 102 for better image quality. In some embodiments, the ISP 103 is also used to perform face region-of-interest extraction and face alignment on the image data.
The NPU 104 is an ASIC designed for deep learning. In some embodiments, it can rapidly process input information by borrowing from biological neural network structures, such as the transfer patterns between human brain neurons. The NPU can run deep-learning model inference, for example extracting face features and recognizing faces with a neural network model.
The CPU 105 may include one or more processing units, for example processing modules or circuits such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a microcontroller unit (MCU), an artificial intelligence (AI) processor, or a field-programmable gate array (FPGA). The different processing units may be separate devices or may be integrated in one or more processors.
The display panel of the display 106 may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro OLED, a quantum-dot light-emitting diode (QLED), or the like. The display 106 may be used to display the image data processed by the ISP 103, and also to display the results produced by the NPU 104, such as face images and face recognition results.
The memory 107 may be used to store data, software programs, and modules, and may be a volatile memory such as random-access memory (RAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM).
The interface module 108 includes an external memory interface, a universal serial bus (USB) interface, and the like. The external memory interface may be used to connect external non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or a combination of the above; it may also connect a removable storage medium, such as a Secure Digital (SD) memory card, to extend the storage capability of the electronic device 100.
The bus 109 is used to couple the ISP 103, NPU 104, CPU 105, display 106, memory 107, and interface module 108. The bus 109 may be an Advanced High-performance Bus (AHB) or another type of data bus.
It should be understood that the structure of the electronic device 100 shown in fig. 3 is merely an example; the technical solution applies to any electronic device 100 that includes an ISP and an NPU. In other embodiments, the electronic device 100 may include more or fewer modules, or combine or split some modules, which is not limited by the embodiments of the present application.
It is understood that electronic device 100 may include, but is not limited to: laptop computers, desktop computers, tablets, cell phones, servers, wearable devices, portable gaming devices, televisions, etc.
In one embodiment, the electronic device 100 may capture images through the camera 101, the image sensor 102, and the ISP 103. A specific scene or person forms an optical signal on the image sensor 102 through the camera 101; the image sensor 102 converts the optical signal into an electrical signal and sends it to the ISP 103, which performs a series of processes on the raw-format image data, including, for example, face region-of-interest extraction and face alignment.
Corresponding to fig. 2A, fig. 4 illustrates the interaction process in which the NPU 104 performs face feature extraction and face recognition after the ISP 103 has performed face region-of-interest extraction and face alignment on the image data, according to an embodiment of the present application, including the following steps.
Step 401: the ISP 103 obtains the (N-1)-th frame image in the video stream.
It will be appreciated that in some embodiments, the ISP 103 may obtain the frame images in a video stream from the image sensor 102. The image sensor 102 may also send the frame images to the memory 107 for storage, with the ISP 103 retrieving them from the memory 107. Further, the video stream may be a high-frame-rate (HFR) video stream, meaning video captured at a picture frequency of 30 frames per second or more, for example 30 FPS, 48 FPS, or 60 FPS. FPS (frames per second) is a term from the imaging field referring to the number of picture frames transmitted per second in an animation or video.
In a high-frame-rate face recognition system, the position and pose of the target face differ little between two adjacent frames. The target face ROI and the face affine transformation of the previous frame may therefore be applied to the target face of the next frame, as described in detail below.
Step 402: the NPU104 reads the N-1 st frame image from the ISP 103.
It is understood that N is a natural number greater than 2. When N takes 2, then NPU104 reads the 1 st frame image from ISP 103. When N takes 3, then NPU104 reads the 2 nd frame image from ISP 103. And so on.
Step 403: the NPU104 performs face detection on the N-1 th frame image to obtain a frame selection parameter of the target face ROI of the N-1 th frame image (hereinafter abbreviated as a face frame selection parameter of the N-1 th frame).
For example, as shown in FIG. 2A, the NPU 104 performs face detection on the (N-1)-th frame image to obtain the frame-selection parameters: the coordinates (x1, y1) of the upper-left vertex a11 and the coordinates (x2, y2) of the lower-right vertex a12. Although only one face is shown in fig. 2A, the technical solution of the present application can also handle multiple faces in one image at the same time.
It can be understood that common face detection networks are basically CNN convolutional neural networks, in which multiple convolution operators perform convolution operations on the whole image to form feature layers with different feature representations. Typical detection networks such as SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), and MTCNN (Multi-task Cascaded Convolutional Neural Network) can be adapted for the face detection field.
In the embodiment of the present application, after running the CNN face detection network, the NPU 104 outputs the frame-selection parameters of the target face ROI, for example the upper-left vertex coordinates (x1, y1) and lower-right vertex coordinates (x2, y2) of the target face ROI in a given xy coordinate system.
In this way, from the upper-left vertex coordinates (x1, y1) and the lower-right vertex coordinates (x2, y2) in the given xy coordinate system, the ISP 103 can frame a target face ROI of the corresponding size, bounded by the two lines through (x1, y1) parallel to the x and y directions and the two lines through (x2, y2) parallel to the x and y directions.
Furthermore, in other embodiments, the NPU 104 directly outputs the four vertex coordinates (x, y) of the ROI during target face detection.
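A hedged sketch of step 403 using OpenCV's DNN module is given below. The model file names are hypothetical placeholders; any SSD-style face detector whose outputs include (x1, y1, x2, y2) corners would produce the frame-selection parameters described above.

```python
import cv2
import numpy as np

# Hypothetical model files; the application does not prescribe a specific network.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "face_ssd.caffemodel")

def detect_faces(frame: np.ndarray, conf_thresh: float = 0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1.0, (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    det = net.forward()  # shape (1, 1, N, 7): [.., .., confidence, x1, y1, x2, y2]
    boxes = []
    for i in range(det.shape[2]):
        if det[0, 0, i, 2] > conf_thresh:
            x1, y1, x2, y2 = det[0, 0, i, 3:7] * np.array([w, h, w, h])
            boxes.append((int(x1), int(y1), int(x2), int(y2)))
    return boxes  # frame-selection parameters: diagonal vertices per detected face
```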
Step 404: the ISP 103 obtains the N-th frame image in the video stream.
It will be appreciated that in other embodiments, step 404 may also occur after step 405, at any time before step 406.
Step 405: the NPU 104 sends the face frame-selection parameters of the (N-1)-th frame to the ISP 103.
It can be understood that, in the embodiment of the present application, the NPU 104 performs face detection on the (N-1)-th frame image and then sends the frame-selection parameters of the target face ROI of the (N-1)-th frame image to the ISP 103.
Step 406: the ISP 103 extracts the face region of interest from the N-th frame image using the face frame-selection parameters of the (N-1)-th frame image, to obtain the target face ROI of the N-th frame image.
For example, as shown in FIG. 2A, from the face frame-selection parameters of the (N-1)-th frame, namely the coordinates (x1, y1) of the upper-left vertex a11 and the coordinates (x2, y2) of the lower-right vertex a12, the ISP determines the upper-left vertex b21 and the lower-right vertex b22 of the target face ROI B1 of the N-th frame image, thereby determining the target face ROI B1 of the N-th frame image. Specifically, in the given xy coordinate system, the ISP 103 takes the two lines through b21 parallel to the x and y directions and the two lines through b22 parallel to the x and y directions, which together enclose a target face ROI of the preset size.
Furthermore, in other embodiments, the ISP determines the four vertex coordinates (x, y) of the target face ROI of the N-th frame image from the four ROI vertex coordinates in the (N-1)-th frame's face frame-selection parameters, and encloses a target face ROI of the preset size accordingly.
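A minimal sketch of the cropping the ISP performs here, using the diagonal vertex coordinates computed on the (N-1)-th frame to cut the ROI out of the N-th frame; the image shape and coordinates are illustrative assumptions.

```python
import numpy as np

def crop_roi(frame: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> np.ndarray:
    """Clamp the box to the image bounds and return the axis-aligned ROI."""
    h, w = frame.shape[:2]
    x1, x2 = max(0, min(x1, x2)), min(w, max(x1, x2))
    y1, y2 = max(0, min(y1, y2)), min(h, max(y1, y2))
    return frame[y1:y2, x1:x2]

frame_n = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for the N-th frame
roi = crop_roi(frame_n, 200, 120, 320, 260)        # parameters computed on frame N-1
print(roi.shape)                                   # (140, 120, 3)
```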
Step 407: the NPU 104 performs face alignment on the face image in the target face ROI of the (N-1)-th frame image to obtain the face alignment parameters of the target face ROI of the (N-1)-th frame image (hereinafter, the (N-1)-th frame face alignment parameters).
It can be appreciated that the NPU 104 runs the facial-feature key point model to obtain the face key point coordinates. Solving the affine transformation (warpAffine) between the face key points output by the model and the provided face template yields an affine transformation matrix, which is the face alignment parameter. The face is thereby adjusted to a preset size and form, and an abnormal viewing angle is changed to a normal one, achieving face alignment. Affine transformations mainly include translation, rotation, scaling (also called scale transformation), shear (also called miscut, shearing, or offset transformation), and flip transformations; the affine transformation matrix is generally a 3x3 matrix containing translation, rotation, and scaling coefficients.
The input of the facial-feature key point model is the target face ROI, and the output is the coordinate sequence of the facial key points. The number of key points is a preset fixed value and can be defined according to different semantics (5, 68, and 90 points are common).
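The sketch below shows one way the alignment parameters of steps 407 and 409 could be solved and applied with OpenCV. The 5-point template coordinates (for a 112x112 crop) and the use of estimateAffinePartial2D are illustrative assumptions, not the application's prescribed implementation.

```python
import cv2
import numpy as np

# Assumed 5-point template (eyes, nose tip, mouth corners) in a 112x112 aligned crop.
TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

def solve_alignment(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (5, 2) key points from the facial-feature key point model."""
    M, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32), TEMPLATE)
    return M  # 2x3 matrix; its homogeneous 3x3 form appends the row [0, 0, 1]

def apply_alignment(face_roi: np.ndarray, M: np.ndarray) -> np.ndarray:
    return cv2.warpAffine(face_roi, M, (112, 112))  # warped, front-facing crop

# On the (N-1)-th frame the NPU would do the equivalent of solve_alignment();
# on the N-th frame the ISP only needs the equivalent of apply_alignment().
```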
Step 408: the NPU104 sends the face alignment parameters for the N-1 frame to the ISP 103.
Step 409: the ISP 103 performs face alignment on the face image in the target face ROI of the N-th frame image using the face alignment parameters of the (N-1)-th frame image, to obtain the N-th frame face-aligned image.
For example, as shown in fig. 2A, the face alignment calculation on the face image in the target face ROI of the (N-1)-th frame image yields the face alignment parameters of the face-aligned image A2, and the ISP 103 aligns the face image in the target face ROI of the N-th frame image according to those parameters to obtain the N-th frame face-aligned image B2.
It will be appreciated that, similar to step 406, the ISP 103 can adjust the face ROI image to a predetermined size and shape through an affine transformation operation, thereby achieving face alignment. In some embodiments, the ISP 103 multiplies the coordinates of the face ROI image by the face alignment parameters (the affine transformation matrix) of the (N-1)-th frame image to obtain the face-aligned image.
Step 410: the ISP 103 transmits the nth frame face alignment image to the NPU 104.
Step 411: the NPU104 performs face feature extraction on the nth frame face alignment image to obtain nth frame face feature data.
It will be appreciated that in some embodiments, the ISP 103 may perform face feature extraction on the face alignment image in the nth frame image by a convolutional neural network model or the like.
Step 412: the NPU104 performs face recognition according to the nth frame face feature data.
The ISP103 can recognize the face features of the face alignment image of the N-th frame image extracted by a convolutional neural network model or the like.
To sum up, the face detection obtains the ROI and the face alignment to obtain affine matrix parameters, the affine matrix parameters are transmitted to the ISP103, the ISP103 directly outputs the image with the ROI and the face alignment to the NPU104 for AI processing, including face feature extraction and face recognition, which is equivalent to the NPU104 mainly performing face feature extraction and face recognition in the end-to-end algorithm processing process of the next frame, so that the system delay is reduced.
If the NPU 104 supports processing multiple models simultaneously, the face detection and face alignment of the current frame, together with face feature extraction and face recognition, can be distributed across 2 or more NPU 104 inference tasks, maximizing resource utilization and reducing system delay.
Further, fig. 5 illustrates a schematic diagram of the structure of the ISP 103, according to some embodiments of the application. As shown in fig. 5, the ISP 103 includes a processor 1031, an image transmission interface 1032, common peripheral devices 1033, an ROI and affine transformation module 1034, and a general function module 1035.
The processor 1031 is used for logic control and scheduling in the ISP 103.
The image transmission interface 1032 is used for transmission of image data.
Common peripheral devices 1033 include, but are not limited to: buses for coupling the modules of the ISP 103 and their controllers, and buses for coupling with other devices, such as an Advanced High-performance Bus (AHB), which enables the ISP to communicate with other devices (e.g., a DSP or CPU) at high performance; and a watchdog unit for monitoring the ISP operating status.
The ROI and affine transformation module 1034 is used to extract the ROI region according to the provided ROI framing parameters, and to obtain the aligned image according to the provided affine transformation matrix.
The general function module 1035 is used to process the images input to the ISP 103, including but not limited to: black level correction (BLC), bad pixel correction (BPC), lens shading correction (LSC), demosaicing (Demosaic), noise reduction (Denoise), automatic white balance (AWB), color correction, gamma correction, color gamut conversion, and the like. When the image sensor delivers image data in RAW format to the ISP 103, it is first processed by the general function module. The general function module may include a RAW domain processing module, an RGB domain processing module, and a YUV domain processing module. Fig. 6 shows the process by which the general function module processes image data, comprising the following steps.
The RAW domain processing module performs bad pixel correction, black level correction, and automatic white balance on the image data.
The RAW-processed image data is RGB-interpolated (demosaiced) to obtain RGB-domain image data, on which the RGB domain processing module then performs gamma correction and color correction.
The RGB-processed image data undergoes color gamut conversion to obtain YUV-domain image data, on which the YUV domain processing module then performs noise reduction, edge enhancement, and brightness/contrast/chroma adjustment.
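To make the three-domain flow concrete, here is a heavily simplified numpy sketch of the general function module: black level subtraction and white balance in the RAW domain, gamma correction in the RGB domain, then conversion to YUV. The gains, bit depth, and BT.601 coefficients are illustrative assumptions, and real ISP stages (demosaicing, color correction, noise reduction) are omitted or collapsed.

```python
import numpy as np

def isp_general(rgb_raw: np.ndarray, black=64, wb=(1.8, 1.0, 1.5), gamma=2.2):
    """rgb_raw: (H, W, 3) stand-in for already-demosaiced 10-bit sensor data."""
    x = np.clip(rgb_raw.astype(np.float32) - black, 0, None) / (1023 - black)
    x *= np.asarray(wb, dtype=np.float32)              # per-channel white-balance gains
    x = np.clip(x, 0.0, 1.0) ** (1.0 / gamma)          # gamma correction
    r, g, b = x[..., 0], x[..., 1], x[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b              # BT.601 luma
    u, v = 0.492 * (b - y), 0.877 * (r - y)            # chroma components
    return np.stack([y, u, v], axis=-1)

raw = np.random.randint(0, 1024, (480, 640, 3))        # synthetic 10-bit input
yuv = isp_general(raw)
print(yuv.shape)                                       # (480, 640, 3)
```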
It can be appreciated that, after being processed by the general function module, the image data can be output to the ROI and affine transformation module 1034 for further processing. The color gamut of the image data output to the ROI and affine transformation module 1034 may be RGB, YUV, or grayscale, which is not limited by the embodiment of the present application.
It will be appreciated that the structure of the ISP 103 shown in fig. 5 is merely an example; those skilled in the art will appreciate that it may include more or fewer modules, and that some modules may be combined or split, which is not limited by the embodiments of the present application.
Embodiments of the present application also provide a system on a chip. Fig. 7 shows a schematic diagram of its structure, according to some embodiments of the present application. As shown in fig. 7, the system on chip 1000 includes an ISP 103, an NPU 104, a CPU 105, and a memory 107, coupled via a bus 109. The ISP 103 can extract the face region of interest and perform face alignment, outputting the face ROI and the corrected face image; the NPU 104 reads the face-aligned image from the ISP 103 and performs face feature extraction and face matching on it. This shortens the image processing time, reduces the end-to-end delay caused by serializing the four-stage algorithm, and improves real-time performance.
It will be appreciated that in the embodiment of fig. 7, some components may be added or removed (e.g., a bus control unit, an interrupt management unit, or a coprocessor), and some components may be split or combined (e.g., integrating the ISP 103 and the NPU 104); the embodiments of the present application are not limited in this regard.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of the present application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope by any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including but not limited to floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage used in transmitting information over the Internet in electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logical unit/module. Physically, a logical unit/module may be one physical unit/module, part of one physical unit/module, or a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not the most important aspect, and the combination of functions implemented by the logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are less closely related to solving the technical problem addressed by the present application, which does not mean that the above device embodiments contain no other units/modules.
It should be noted that, in the examples and descriptions of this patent, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises it.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (9)

1. An image recognition method, wherein the method is applied to an electronic device comprising a first processor and a second processor, the first processor being a neural network processor and the second processor being an image signal processor, the method comprising:
the electronic device performing face recognition on the frame images of a video to be recognized through the first processor and the second processor, wherein the electronic device processes the M-th frame image as follows after acquiring it:
the first processor performs face detection and face alignment on the M-th frame image to obtain a first frame-selection parameter and a first alignment parameter, and the second processor performs face region-of-interest extraction and face alignment on the M-th frame image using a second frame-selection parameter and a second alignment parameter, the second frame-selection parameter and the second alignment parameter having been obtained by the first processor performing face detection and face alignment on the (M-1)-th frame image after the electronic device acquired the (M-1)-th frame image;
the first processor performs face feature extraction and feature matching on the M-th frame image based on the second processor's face region-of-interest extraction result and face alignment result for the M-th frame image, and obtains the face recognition result of the M-th frame image.
2. The method of claim 1, wherein the face region of interest extraction result comprises a target face frame selection region and the face alignment result comprises a target face image adjusted to a preset angle.
3. The method of claim 1, wherein the frame-selection parameter comprises coordinate data of the edges of the target face frame-selection region, and the alignment parameter comprises affine transformation matrix data for adjusting the target face to a preset angle.
4. The method of claim 2, wherein the target face frame selection area is a rectangle, and the frame selection parameter is two vertex coordinate data on a diagonal of the target face frame selection area or four vertex coordinate data of the target face frame selection area.
5. The method of claim 1, wherein the video to be recognized is a video whose frame rate is greater than a preset frame rate.
6. The method of claim 5, wherein the predetermined frame rate is 30 frames per second.
7. A readable medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the image recognition method of any one of claims 1 to 6.
8. An electronic device, comprising:
A memory for storing instructions for execution by one or more processors of the electronic device; and
A first processor and a second processor for supporting the first processor or the second processor to perform the image recognition method of any one of claims 1 to 6.
9. A system on a chip, comprising:
a memory for storing instructions for execution by one or more processors of the system-on-chip; and
A first processor and a second processor for supporting the first processor or the second processor to perform the image recognition method of any one of claims 1 to 6.
Application CN202110913181.7A, priority date 2021-08-10, filing date 2021-08-10. Granted as CN113627328B (Active).

Priority Applications (1)

Application Number: CN202110913181.7A
Priority Date: 2021-08-10; Filing Date: 2021-08-10
Title: Electronic device, image recognition method thereof, system on chip and medium

Applications Claiming Priority (1)

Application Number: CN202110913181.7A
Priority Date: 2021-08-10; Filing Date: 2021-08-10
Title: Electronic device, image recognition method thereof, system on chip and medium

Publications (2)

CN113627328A, published 2021-11-09
CN113627328B, granted 2024-09-13

Family

Family ID: 78384044

Family Applications (1)

CN202110913181.7A (Active, granted as CN113627328B): Electronic device, image recognition method thereof, system on chip and medium

Country Status (1)

Country Link
CN (1) CN113627328B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114170653A (en) * 2021-11-24 2022-03-11 奥比中光科技集团股份有限公司 A face feature extraction method and device, terminal device and storage medium
CN114333030A (en) * 2021-12-31 2022-04-12 科大讯飞股份有限公司 Image processing method, device, equipment and storage medium
CN117011154B (en) * 2022-04-28 2025-09-23 Oppo广东移动通信有限公司 Image processing method, device, electronic device and storage medium
CN114945087B (en) * 2022-06-06 2023-10-03 安谋科技(中国)有限公司 Image processing method, device, equipment and storage medium based on face characteristics

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402135A (en) * 2020-03-17 2020-07-10 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469048A (en) * 2015-11-24 2016-04-06 山东超越数控电子有限公司 Method of increasing face detection performance
CN106506960B (en) * 2016-11-29 2019-06-07 维沃移动通信有限公司 A kind of processing method and mobile terminal of image data
CN109960582B (en) * 2018-06-19 2020-04-28 华为技术有限公司 Method, device and system for realizing multi-core parallelism on TEE side
CN111327814A (en) * 2018-12-17 2020-06-23 华为技术有限公司 An image processing method and electronic device
CN110991369A (en) * 2019-12-09 2020-04-10 Oppo广东移动通信有限公司 Image data processing method and related device
CN111479059B (en) * 2020-04-15 2021-08-13 Oppo广东移动通信有限公司 Photographic processing method, device, electronic device and storage medium
CN111738162A (en) * 2020-06-24 2020-10-02 北京百度网讯科技有限公司 Display method, device, electronic device and storage medium of face image
CN111880711B (en) * 2020-07-31 2022-02-01 Oppo广东移动通信有限公司 Display control method, display control device, electronic equipment and storage medium
CN112257502A (en) * 2020-09-16 2021-01-22 深圳微步信息股份有限公司 A monitoring video pedestrian identification and tracking method, device and storage medium
CN112991208B (en) * 2021-03-11 2024-05-07 Oppo广东移动通信有限公司 Image processing method and device, computer readable medium and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402135A (en) * 2020-03-17 2020-07-10 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium

Also Published As

CN113627328A, published 2021-11-09


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant