CN111597926A - Image processing method and device, electronic device and storage medium
- Publication number
- CN111597926A (application CN202010356731.5A)
- Authority
- CN
- China
- Prior art keywords
- face
- face model
- image
- data
- model
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
Abstract
The application discloses an image processing method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a first face model and reference expression data, wherein the first face model is obtained based on a face; and rendering the expression of the first face model according to the reference expression data to obtain a second face model.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of image processing technology, more and more applications are implemented based on image processing, and expression migration is one of them. By migrating an expression onto a face model, the expression of the face model can be changed, yielding a face model with the migrated expression. How to improve the fidelity of the face model after expression migration is therefore of great significance.
Disclosure of Invention
The application provides an image processing method and device, an electronic device and a storage medium.
In a first aspect, an image processing method is provided, the method comprising:
acquiring a first face model and reference expression data, wherein the first face model is obtained based on a face;
and rendering the expression of the first face model according to the reference expression data to obtain a second face model.
In this aspect, the second face model is obtained according to the reference expression data and the first face model, so that the expression indicated by the reference expression data is migrated onto the first face model; further, a first face model under an arbitrary expression can be obtained by changing the expression indicated by the reference expression data. Because a face model obtained based on a real face is more lifelike than a face model that is not, migrating the expression indicated by the reference expression data onto the first face model yields a more lifelike second face model, making the expression migration effect more natural.
In combination with any embodiment of the present application, the acquiring reference expression data includes:
acquiring a first face image;
extracting face key points from the first face image to obtain face key point information in the first face image;
and obtaining the reference expression data according to the face key point information.
In the embodiment, the face key point information can be obtained by extracting the face key points from the first face image, and the expression data in the first face image can be determined according to the face key point information and used as the reference expression data.
With reference to any embodiment of the present application, obtaining a second face model according to the reference expression data and the first face model includes:
performing feature extraction processing on the first face model to obtain a first feature image;
fusing the first feature image and the reference expression data to obtain a second feature image;
and performing upsampling processing on the second feature image to obtain the second face model.
In this embodiment, the reference expression data and the first face model are fused, so that the expression indicated by the reference expression data is migrated to the first face model, and the second face model is obtained.
In combination with any embodiment of the present application, the acquiring a first face image includes:
acquiring a video stream;
and carrying out face detection processing on the images in the video stream to obtain an image containing a face as the first face image.
In this embodiment, the first face image is obtained by performing face detection processing on the image in the video stream, and further, the expression in the video stream can be migrated to the first face model.
In combination with any embodiment of the present application, the acquiring reference expression data includes:
acquiring first audio data;
and obtaining the reference expression data according to a mapping relation and the information carried in the first audio data, wherein the mapping relation is used for representing the mapping between the information carried in the audio data and the expression data.
In this embodiment, the expression of the first face model may be changed by the first audio data to obtain the second face model.
In combination with any embodiment of the present application, the method further comprises:
acquiring the character attribute of the first face model;
obtaining second audio data according to the character attributes, wherein the information carried in the second audio data is the same as the information carried in the first audio data;
and outputting the second audio data in the process of controlling the second face model to execute the speaking operation.
With reference to any embodiment of the present application, before obtaining the reference expression data according to the mapping relationship and the information carried in the first audio data, the method further includes:
carrying out sound feature extraction processing on the first audio data to obtain feature data;
obtaining the reference expression data according to the mapping relationship and the information carried in the first audio data, including:
obtaining intermediate expression data according to the mapping relation and the information carried in the first audio data;
and adjusting the intermediate expression data according to the feature data to obtain the reference expression data.
With reference to any embodiment of the present application, the acquiring first audio data includes:
collecting voice data through a voice collection component;
performing semantic analysis processing on the voice data to obtain semantic data;
and obtaining the first audio data according to the information carried in the semantic data.
In combination with any embodiment of the present application, the first face model is obtained based on a face, including:
acquiring a second face image and a depth image of the second face image;
and obtaining the first face model according to the second face image and the depth image.
In this embodiment, the first face model is derived from the second face image and the depth image.
With reference to any embodiment of the present application, the obtaining the first face model according to the second face image and the depth image includes:
obtaining a third face model according to the second face image and the depth image;
removing a pixel region belonging to a reference region in the third face model to obtain a fourth face model, wherein the reference region comprises at least one of the following regions: the eye region, the oral cavity region;
filling reference data into the reference region in the fourth face model to obtain the first face model, wherein the reference data includes at least one of the following data: data of the eye region, data of the oral cavity region.
In this embodiment, the fourth face model is obtained by removing the pixel region belonging to the reference region from the third face model, and the first face model is obtained by filling the reference region in the fourth face model with reference data. In this way, in the process of adjusting the expression of the first face model, the data associated with the reference data can be used, which reduces the probability of missing information appearing in the obtained second face model and thereby improves the fidelity of the second face model.
In combination with any embodiment of the present application, the first face model is a three-dimensional face model.
In a second aspect, there is provided an image processing apparatus, the apparatus comprising:
a first obtaining unit, configured to obtain a first face model and reference expression data, where the first face model is obtained based on a face;
and the first processing unit is used for rendering the expression of the first face model according to the reference expression data to obtain a second face model.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring a first face image;
extracting face key points from the first face image to obtain face key point information in the first face image;
and obtaining the reference expression data according to the face key point information.
With reference to any one of the embodiments of the present application, the first processing unit is configured to:
performing feature extraction processing on the first face model to obtain a first feature image;
fusing the first feature image and the reference expression data to obtain a second feature image;
and performing upsampling processing on the second feature image to obtain the second face model.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring a video stream;
and carrying out face detection processing on the images in the video stream to obtain an image containing a face as the first face image.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring first audio data;
and obtaining the reference expression data according to a mapping relation and the information carried in the first audio data, wherein the mapping relation is used for representing the mapping between the information carried in the audio data and the expression data.
In combination with any embodiment of the present application, the apparatus further includes:
the second acquisition unit is used for acquiring the character attribute of the first face model;
the second processing unit is used for obtaining second audio data according to the character attributes, wherein the information carried in the second audio data is the same as the information carried in the first audio data;
and a control unit, configured to output the second audio data in the process of controlling the second face model to execute the speaking operation.
In combination with any embodiment of the present application, the apparatus further includes:
the third processing unit is used for performing sound feature extraction processing on the first audio data to obtain feature data before the reference expression data is obtained according to the mapping relation and the information carried in the first audio data;
the first obtaining unit is used for:
obtaining intermediate expression data according to the mapping relation and the information carried in the first audio data;
and adjusting the intermediate expression data according to the feature data to obtain the reference expression data.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
collecting voice data through a voice collection component;
performing semantic analysis processing on the voice data to obtain semantic data;
and obtaining the first audio data according to the information carried in the semantic data.
In combination with any embodiment of the present application, the first face model is obtained based on a face, including:
acquiring a second face image and a depth image of the second face image;
and obtaining the first face model according to the second face image and the depth image.
With reference to any embodiment of the present application, the obtaining the first face model according to the second face image and the depth image includes:
obtaining a third face model according to the second face image and the depth image;
removing a pixel region belonging to a reference region in the third face model to obtain a fourth face model, wherein the reference region comprises at least one of the following regions: the eye region, the oral cavity region;
filling reference data into the reference region in the fourth face model to obtain the first face model, wherein the reference data includes at least one of the following data: data of the eye region, data of the oral cavity region.
In combination with any embodiment of the present application, the first face model is a three-dimensional face model.
In a third aspect, a processor is provided, which is configured to perform the method according to the first aspect and any one of the possible implementations thereof.
In a fourth aspect, an electronic device is provided, comprising: a processor, transmitting means, input means, output means, and a memory for storing computer program code comprising computer instructions, which, when executed by the processor, cause the electronic device to perform the method of the first aspect and any one of its possible implementations.
In a fifth aspect, there is provided a computer-readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect and any one of its possible implementations.
In a sixth aspect, a computer program product is provided, the computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any one of its possible implementations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of a pixel coordinate system according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another pixel coordinate system provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a face key point provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 8 is a schematic view of a white mold according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic diagram of a hardware structure of an image processing apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The execution subject of the embodiments of the present application is an image processing apparatus. Optionally, the image processing apparatus may be one of the following: a mobile phone, a computer, a server, or a tablet computer.
Before the following explanation, the pixel coordinate systems used in the embodiments of the present application are first defined. In the embodiments of the present application, the pixel coordinate systems include the pixel coordinate system in a two-dimensional image and the pixel coordinate system in a three-dimensional face model.
Referring to fig. 1, the pixel coordinate system in a two-dimensional image is shown: a pixel coordinate system xoy is constructed by taking the lower right corner of face image a as the origin o, the direction parallel to the rows of face image a as the direction of the x axis, and the direction parallel to the columns of face image a as the direction of the y axis. In this pixel coordinate system, the abscissa represents the column number of a pixel in face image a, the ordinate represents its row number, and the units of both may be pixels. For example, if the coordinates of pixel A in fig. 1 are (10, 30), then the abscissa of pixel A is 10 pixels, the ordinate of pixel A is 30 pixels, and pixel A is the pixel in the 10th column and 30th row of face image a.
The pixel coordinate system oxyz in the three-dimensional face model is a three-dimensional coordinate system constructed by taking the midpoint between the two eyes in the three-dimensional face model as the origin o, with the ox axis perpendicular to the median sagittal plane of the three-dimensional face model and pointing toward the left face region of the three-dimensional face model. The median sagittal plane is the plane that passes through the midline of the three-dimensional face model and divides it into left and right symmetrical halves. The division into the left face region and the right face region can be seen in fig. 2: in the three-dimensional face model shown in fig. 2, the median sagittal plane divides the model into the left face region and the right face region. The oy axis is parallel to the median sagittal plane of the three-dimensional face model. The oz axis is perpendicular to the xoy plane, and the direction of the oz axis (hereinafter referred to as the depth direction) is the same as the face orientation of the three-dimensional face model. Hereinafter, the coordinate on the ox axis is referred to as the abscissa, the coordinate on the oy axis as the ordinate, and the coordinate on the oz axis as the vertical coordinate. In this pixel coordinate system, the abscissa represents the column number of a pixel in the three-dimensional face model, the ordinate represents its row number, and the vertical coordinate represents its depth; the units of the abscissa, the ordinate and the vertical coordinate may all be pixels. For example, if the coordinates of pixel A in the three-dimensional face model are (10, 30, 20), then the abscissa of pixel A is 10 pixels, the ordinate is 30 pixels, and the vertical coordinate is 20 pixels.
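To make these conventions concrete, the following Python sketch encodes the 3D pixel coordinate system just defined. It is a minimal illustration under the stated conventions (origin at the eye midpoint, ox toward the left face region, oz along the face orientation); the types and the region test are assumptions for illustration, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Pixel3D:
    x: int  # abscissa: columns from the eye-midpoint origin, positive toward the left face region
    y: int  # ordinate: rows, parallel to the median sagittal plane
    z: int  # vertical coordinate: depth, along the face orientation

def face_side(p: Pixel3D) -> str:
    """Classify a 3D pixel into the left or right face region.

    The ox axis points toward the left face region, so a positive
    abscissa lies to the left of the median sagittal plane.
    """
    if p.x > 0:
        return "left face region"
    if p.x < 0:
        return "right face region"
    return "on the median sagittal plane"

# Example from the text: pixel A at (10, 30, 20) has an abscissa of 10 pixels,
# an ordinate of 30 pixels, and a vertical coordinate (depth) of 20 pixels.
print(face_side(Pixel3D(10, 30, 20)))  # -> left face region
```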
The embodiments of the present application will be described below with reference to the drawings. Referring to fig. 3, fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure.
301. A first face model and reference expression data are acquired.
In the embodiments of the present application, the first face model may be a two-dimensional image containing a face, or a three-dimensional face model. The three-dimensional face model may be a computer aided design (CAD) three-dimensional face model, a three-dimensional face convex hull, or a three-dimensional face point cloud. Whether the first face model is a face image or a three-dimensional face model, it is obtained based on a real face.
In the case where the first face model is a two-dimensional image including a face, as a possible implementation manner of acquiring the first face model, the image processing apparatus captures a face image through an imaging device to obtain the first face model. For example, assume that the image processing apparatus is a mobile phone with a camera. The mobile phone photographs Zhang San's face through the camera to obtain an image containing Zhang San's face as the first face model.
In a case where the first face model is a two-dimensional image including a face, as another possible implementation manner of acquiring the first face model, the image processing apparatus receives, as the first face model, a face image input by a user through an input component, where the input component includes: keyboard, mouse, touch screen, touch pad, audio input device, etc.
Under the condition that the first face model is a two-dimensional image containing a face, as another possible implementation manner for acquiring the first face model, the image processing device receives a face image sent by a first terminal, wherein the first terminal comprises a mobile phone, a computer, a server, a tablet computer and the like.
In a case where the first face model is a three-dimensional face model, as another possible implementation manner of obtaining the first face model, the image processing apparatus receives, as the first face model, a three-dimensional face model input by a user through an input component, where the input component includes: keyboard, mouse, touch screen, touch pad, audio input device, etc.
Under the condition that the first face model is a three-dimensional face model, as another possible implementation manner for acquiring the first face model, the image processing device receives the three-dimensional face model sent by a second terminal, wherein the second terminal comprises a mobile phone, a computer, a server, a tablet computer and the like.
In the case where the first face model is a three-dimensional face model, as another possible implementation manner of acquiring the first face model, the three-dimensional face model may be obtained as the first face model by scanning the face with an imaging device of the image processing apparatus. For example, the image processing apparatus is a tablet computer; the tablet computer's RGB camera photographs Li Si's face to obtain a first image, and its time-of-flight (TOF) camera photographs Li Si's face at the same time to obtain a second image. Based on the first image and the second image, a three-dimensional model of Li Si's face can be obtained as the first face model.
Because the first face model in the embodiments of the present application is obtained based on a real face, it is more lifelike than a face model that is not, where lifelikeness refers to similarity to a real face; face models not obtained from a real face include cartoon face images and cartoon three-dimensional face models. The more lifelike a face model is, the higher its similarity to a real person in the user's visual perception.
In the embodiments of the present application, the reference expression data may be expression data obtained by performing face key point extraction processing on a face image. For example, face key point extraction processing is performed on an image containing Xiao Ming's face to obtain a face mask of Xiao Ming, where the face mask carries Xiao Ming's expression data.
The reference expression data may also be expression data indicated by an expression instruction. For example, the image processing apparatus stores 3 kinds of expression data: smile, smirk, and anger, where expression instruction a indicates smile, expression instruction b indicates smirk, and expression instruction c indicates anger. When the expression instruction received by the image processing apparatus is a, the reference expression data is smile; when it is b, the reference expression data is smirk; when it is c, the reference expression data is anger.
The reference expression data may also be an emoticon carrying expression data; for example, the emoticon may be a smiling face or a frowning face.
In one implementation of obtaining the reference expression data, the image processing apparatus receives the reference expression data input by the user through the input component. The above-mentioned input assembly includes: keyboard, mouse, touch screen, touch pad, audio input device, etc.
In another implementation manner of acquiring the reference expression data, the image processing apparatus receives the reference expression data sent by the third terminal. The third terminal comprises a mobile phone, a computer, a tablet computer, a server and wearable equipment.
In another implementation manner of obtaining the reference expression data, the image processing apparatus receives an expression instruction input by a user, and determines the reference expression data according to the expression instruction.
302. The first face model is rendered according to the reference expression data to obtain a second face model.
In the embodiments of the present application, the five sense organs of the second face model are the same as those of the first face model, the hairstyle of the second face model is the same as that of the first face model, the face contour of the second face model is the same as that of the first face model, the face texture data of the second face model is the same as the face texture data of the first face model, and the expression of the second face model is the same as the expression indicated by the reference expression data.
The face texture data comprises skin color information of face skin, glossiness information of the face skin, wrinkle information of the face skin and texture information of the face skin.
Having the same five sense organs means the position information of the key points of the five sense organs is the same; having the same face contour means the position information of the key points of the face contour is the same. The position information of the face contour key points comprises the coordinates of the face contour key points in the coordinate system of the face model (both the first face model and the second face model), and the position information of the five sense organs comprises the coordinates of the five sense organ key points in the pixel coordinate system.
For example, as shown in fig. 4, the key points of the five sense organs include key points of the eyebrow region, the eye region, the nose region, the mouth region, and the ear region. The face contour keypoints comprise keypoints on a face contour line. It should be understood that the number and the positions of the key points (including the key points of the five sense organs and the key points of the face contour) shown in fig. 4 are only an example provided by the embodiment of the present application, and should not be construed as limiting the present application.
For the same person, the face texture data is fixed; that is, just as fingerprint information and iris information can serve as a person's identity information, the face texture data can also serve as the person's identity information. Because the face texture data of the first face model is the same as that of the second face model, and the expression of the second face model is the same as the expression indicated by the reference expression data, obtaining the second face model according to the reference expression data and the first face model migrates the expression indicated by the reference expression data onto the first face model.
For example, the first face model is Xiao Hong's three-dimensional face model, the expression of the first face model is a smile, and the expression indicated by the reference expression data is anger. The second face model obtained according to the reference expression data and the first face model is still Xiao Hong's three-dimensional face model, but its expression is anger.
As an implementation manner of obtaining the second face model, in the case where the reference expression data is obtained by performing face key point extraction processing on a face image, the reference expression data and the first face model are fused to obtain the second face model. For example, assume the expression in face image a is a smile. Face key point extraction processing is performed on face image a to obtain reference expression data carrying the information of the smiling expression in face image a. The reference expression data and the first face model are fused to obtain a second face model with the smiling expression of face image a.
As another implementation manner of obtaining the second face model, in the case where the reference expression data is expression data indicated by an expression instruction, the first face model is deformed according to the reference expression data to obtain the second face model. The deformation processing comprises at least one of the following: adjusting the shape of the first face model, adjusting the size of the first face model. For example, assume the reference expression data is a smirk; the first face model is deformed to obtain a second face model whose expression is a smirk.
In the embodiment of the application, the second face model is obtained according to the reference expression data and the first face model, so that the expression indicated by the reference expression data is migrated onto the first face model; further, a first face model under an arbitrary expression can be obtained by changing the expression indicated by the reference expression data. Because a face model obtained based on a real face is more lifelike than a face model that is not, migrating the expression indicated by the reference expression data onto the first face model yields a more lifelike second face model, making the expression migration effect more natural.
Referring to fig. 5, fig. 5 is a flowchart illustrating one possible implementation of acquiring the reference expression data in step 301, according to an embodiment of the present disclosure.
501. A first face image is acquired.
In the embodiment of the application, the first face image is an image containing a human face.
The manner of acquiring the first face image may be: the image processing device receives a first face image input by a user through an input component, wherein the input component comprises: keyboard, mouse, touch screen, touch pad, audio input device, etc.
The manner of acquiring the first face image may also be: the image processing device receives a first face image sent by a fourth terminal, wherein the fourth terminal comprises a mobile phone, a computer, a tablet computer, a server and the like.
Acquiring the first face image may further be done by camera capture by the image processing apparatus; for example, the image processing apparatus is a mobile phone, and a face image captured by the camera of the mobile phone is taken as the first face image.
The manner of acquiring the first face image may also be: the image processing apparatus intercepts, from a video stream, a frame containing a human face as the first face image.
The method for obtaining the first face image is not limited in the application.
502. Face key point extraction processing is performed on the first face image to obtain face key point information of the first face image.
In this embodiment, the coordinates of the face key points in the first face image in the pixel coordinate system can be obtained by performing face key point extraction processing on the first face image. The face key point information in the first face image is then determined according to the coordinates of the face key points in the pixel coordinate system. In the embodiments of the present application, the face key points comprise key points of the five sense organs and key points of the face contour.
In one implementation manner of determining the face key point information in the first face image, the face key point extraction processing can be implemented by a convolutional neural network. The convolutional neural network is trained with annotated images as training data, so that the trained convolutional neural network can perform face key point extraction on the first face image. The annotation information of an image in the training data is the coordinates of the face key points in the pixel coordinate system. During training, the convolutional neural network extracts feature data from an image and determines the coordinates of the face key points in the pixel coordinate system according to the feature data. The result obtained by the convolutional neural network during training is supervised with the annotation information as the supervision information, and the parameters of the convolutional neural network are updated to complete training. In this way, the trained convolutional neural network can be used to perform face key point extraction on the first face image and obtain the coordinates of its face key points in the pixel coordinate system.
In another implementation manner of determining the face key point information in the first face image, the face key point extraction processing may be implemented by a face key point extraction algorithm, which may be any one of the following: OpenFace, multi-task cascaded convolutional neural network (MTCNN), tweaked convolutional neural network (TCNN), or tasks-constrained deep convolutional network (TCDCN). The present application does not limit the face key point extraction algorithm used to implement the face key point extraction processing.
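As an illustration of the supervised key point regressor described above, the following sketch shows one way such a convolutional network could be structured and trained. The class name, layer sizes, and the key point count of 106 are assumptions made for the example, not details taken from the patent; this is a minimal sketch, not a definitive implementation.

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Hypothetical regressor: face image -> (N_KEYPOINTS, 2) pixel coordinates."""
    N_KEYPOINTS = 106  # assumed count covering the five sense organs and the face contour

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(          # feature data extracted from the image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, self.N_KEYPOINTS * 2)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.features(img).flatten(1)
        return self.head(feat).view(-1, self.N_KEYPOINTS, 2)

# One supervised step: annotated coordinates act as the supervision information.
model = KeypointNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
image = torch.randn(1, 3, 256, 256)                        # placeholder face image
target = torch.rand(1, KeypointNet.N_KEYPOINTS, 2) * 256   # annotated key points
loss = nn.functional.mse_loss(model(image), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```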
503. The reference expression data is obtained according to the face key point information.
After the face key point information in the first face image is obtained, the reference expression data can be obtained according to the face key point information.
In one implementation manner of obtaining the reference expression data according to the face key point information, a convolutional neural network is trained with annotated images (containing face key point information) as training data, so that the trained convolutional neural network can obtain expression data according to face key point information. The annotation information of an image in the training data is expression data (e.g., smile, or anger). During training, the convolutional neural network extracts feature data from an image and determines the expression data in the image according to the feature data. The result obtained during training is supervised with the annotation information as the supervision information, the parameters of the convolutional neural network are updated, and training is completed; through this training, the convolutional neural network establishes a mapping relationship between face key point information and expression data. The trained convolutional neural network can therefore be used to process the face key point information in the first face image to obtain the expression data in the first face image as the reference expression data.
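A corresponding sketch of the key-point-to-expression mapping described above, assuming the key point information is a tensor of coordinates and reusing the smile/smirk/anger label set from step 301; the class and its sizes are hypothetical.

```python
import torch
import torch.nn as nn

EXPRESSIONS = ["smile", "smirk", "anger"]  # example expression data labels

class ExpressionFromKeypoints(nn.Module):
    """Hypothetical mapping from flattened key point coordinates to expression logits."""
    def __init__(self, n_keypoints: int = 106):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, len(EXPRESSIONS)),
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        return self.mlp(keypoints.flatten(1))

keypoints = torch.rand(1, 106, 2)                # output of the key point extractor
logits = ExpressionFromKeypoints()(keypoints)
print(EXPRESSIONS[logits.argmax(dim=1).item()])  # expression data for the image
```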
In another implementation manner of obtaining the reference expression data according to the face key point information, a face mask is obtained based on the face key point information in the first face image, wherein the face mask carries the face key point information in the first face image. The face mask is used as reference expression data. After obtaining the reference expression data based on this implementation, step 302 specifically includes the following steps:
carrying out feature extraction processing on the first face model to obtain a first feature image;
performing fusion processing on the first feature image and the reference expression data to obtain a second feature image;
and performing upsampling processing on the second feature image to obtain the second face model.
Because the face mask carries the expression information of the first face image, the expression of the first face image can be transferred to the first face model by fusing the face mask and the first face model, and the second face model is obtained.
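The extract-fuse-upsample flow above can be pictured as a small encoder-decoder. The following sketch is one possible shape of that pipeline under assumed channel counts and layer depths; `ExpressionTransferNet` and all of its dimensions are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ExpressionTransferNet(nn.Module):
    """Sketch of: feature extraction -> fusion with the face mask -> upsampling."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # produces the first feature image
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(128 + 1, 128, 1)     # fusion with the 1-channel face mask
        self.decoder = nn.Sequential(              # upsamples back to the second face model
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, face: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(face)                              # first feature image
        mask_small = nn.functional.interpolate(mask, size=feat.shape[-2:])
        fused = self.fuse(torch.cat([feat, mask_small], 1))    # second feature image
        return self.decoder(fused)                             # second face model

face = torch.randn(1, 3, 256, 256)   # first face model rendered as an image tensor
mask = torch.randn(1, 1, 256, 256)   # face mask carrying the reference expression data
second_face = ExpressionTransferNet()(face, mask)
```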
According to this implementation, the reference expression data can be obtained by extracting the face key points of the first face image, and based on the technical scheme provided by this implementation, the expression in the first face image can be migrated onto the first face model.
Referring to fig. 6, fig. 6 is a flowchart illustrating another possible implementation of acquiring the reference expression data in step 301, according to an embodiment of the present disclosure.
601. First audio data is acquired.
In the embodiments of the present application, the first audio data may or may not carry voice information. For example, the first audio data may be voice data, or it may be pure music audio data.
The manner of acquiring the first audio data may be: the image processing device receives first audio data input by a user through an input component, wherein the input component comprises: keyboard, mouse, touch screen, touch pad, audio input device, etc.
The manner of acquiring the first audio data may also be: the image processing device receives first audio data sent by a fifth terminal, wherein the fifth terminal comprises a mobile phone, a computer, a tablet computer, a server and the like.
The acquiring of the first audio data may further be done by microphone collection by the image processing apparatus; for example, the image processing apparatus is a mobile phone, and the audio data collected by the microphone of the mobile phone is taken as the first audio data.
The manner of acquiring the first audio data may also be: the image processing apparatus cuts out audio data from a video stream as first audio data.
The manner of obtaining the first audio data is not limited in the present application.
602. The reference expression data is obtained according to the mapping relationship and the information carried in the first audio data.
In the embodiment of the application, the information carried by the first audio data is the content carried by the first audio data. For example, if the content carried by the first audio data is "hello", the information carried by the first audio data is: hello. For another example, if the content carried by the first audio data is "I am very happy today", the information carried by the first audio data is: I am very happy today. For another example, in the case that the first audio data is pure music audio data, the content carried by the first audio data is the melody type of the first audio data, which may include at least one of: joyful, sad, solemn.
When people say different things, their moods differ, and so do their expressions. For example, when Xiao Hong says "I am very happy today", her mood is pleasant and her expression is a happy smile. For another example, when Xiao Ming says "I lost 100 yuan today", his mood is sad and his expression is a frown. The expressions corresponding to different melody types also differ: a joyful melody corresponds to a happy expression, a sad melody corresponds to a sad expression, and a solemn melody corresponds to a solemn expression. Accordingly, based on the information carried in the first audio data, an expression matching the first audio data can be determined.
In the embodiments of the present application, based on the mapping relationship and the information carried in the first audio data, the expression matching the first audio data can be obtained, and then the reference expression data is obtained. The mapping relationship is used for representing the mapping between information carried in audio data and expression data. For example, the mapping between information carried in audio data and expression data can be seen in Table 1:

| Information carried by audio data | Expression data |
| --- | --- |
| "Hello" | Smile |
| "The stock went up" | Beaming with joy |
| Joyful melody | Happy |

TABLE 1
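As a hedged sketch of how such a mapping relationship could be applied, the snippet below assumes the information carried in the audio has already been recognized as text or a melody label; the dictionary mirrors Table 1, and the function name and default value are invented for the example.

```python
# Mapping relationship: information carried in audio data -> expression data (Table 1).
EXPRESSION_MAPPING = {
    "hello": "smile",
    "the stock went up": "beaming with joy",
    "joyful melody": "happy",
}

def reference_expression(audio_information: str, default: str = "neutral") -> str:
    """Look up the expression data matching the information carried in the audio."""
    return EXPRESSION_MAPPING.get(audio_information.lower(), default)

print(reference_expression("Hello"))  # -> smile
```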
Based on the technical scheme provided by this implementation, the expression of the first face model can be changed through the first audio data to obtain the second face model. As an alternative embodiment, the image processing apparatus may drive the facial expression and the lip pose of the second face model according to the information carried in the first audio data, so as to output that information through the second face model. For example, assume the information carried in the first audio data is: hello. The expression of the first face model can be changed based on the first audio data to obtain a smiling second face model, and at the same time the lip pose of the second face model is changed so that its lip movements express "hello", while the "hello" audio data is output through the image processing apparatus (e.g., through the speaker of the image processing apparatus).
The sound attributes matching face models with different character attributes differ. For example, a young woman's voice is lively while an older woman's voice is soothing, and a young woman's voice is lively while a young man's voice is deep. As an alternative embodiment, the sound attributes of the audio data output by the image processing apparatus may therefore be determined according to the character attributes of the first face model.
In the embodiments of the present application, the character attributes may include gender and age. Optionally, gender includes male and female, and age is divided into the following 7 age groups: 1-10 years old, 11-15 years old, 16-20 years old, 21-30 years old, 31-43 years old, 44-60 years old, and over 60 years old. The sound attributes include timbre, pitch, and loudness.
Optionally, the image processing apparatus may perform feature extraction processing on the first face model to obtain feature data of the first face model, where the feature data includes semantic information of the first face model that can be used to describe its content. The character attributes of the first face model can be obtained according to the feature data, and the sound attributes of the second audio data are then obtained according to the character attributes, where the information carried in the second audio data is the same as the information carried in the first audio data.
For example, assume the character attributes of the first face model are: female, 20 to 30 years old, and the information carried in the first audio data is: hello. The timbre, pitch, and loudness of the second audio data can then be determined according to the character attributes, and the information carried in the second audio data is determined, according to the information carried in the first audio data, to be: hello.
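One way the character-attribute-to-sound-attribute lookup could look, as a sketch; the table keys and the concrete timbre/pitch/loudness values are invented placeholders, not values given by the patent.

```python
from dataclasses import dataclass

@dataclass
class SoundAttributes:
    timbre: str
    pitch_hz: float
    loudness_db: float

# Hypothetical mapping from (gender, age group) to the sound attributes
# used when synthesizing the second audio data.
VOICE_TABLE = {
    ("female", "21-30"): SoundAttributes("lively", 220.0, 60.0),
    ("male", "21-30"): SoundAttributes("deep", 120.0, 60.0),
    ("female", "44-60"): SoundAttributes("soothing", 180.0, 55.0),
}

def sound_attributes(gender: str, age_group: str) -> SoundAttributes:
    return VOICE_TABLE.get((gender, age_group), SoundAttributes("neutral", 160.0, 58.0))

voice = sound_attributes("female", "21-30")  # character attributes of the first face model
```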
As an alternative embodiment, before the image processing apparatus executes step 602, the following steps are further executed:
and performing sound feature extraction processing on the first audio data to obtain feature data.
In an embodiment of the present application, the feature data carries sound feature information of the first audio data, where the sound feature information includes: volume of sound.
In the case of obtaining feature data, the image processing apparatus realizes step 602 by performing the steps of:
61. Intermediate expression data is obtained according to the mapping relationship and the information carried in the first audio data.
in this step, according to the mapping relationship and the information carried in the first audio data, the reference expression data is not obtained, but the intermediate expression data is obtained. The information carried in the intermediate emotion data is the same as the information carried in the first audio data.
62. The intermediate expression data is adjusted according to the feature data to obtain the reference expression data.
Because different sound features correspond to different expressions, the intermediate expression data can be adjusted according to the sound feature information of the first audio data to obtain the reference expression data. For example, the greater the volume, the greater the expression amplitude; the amplitude of the intermediate expression data is therefore adjusted according to the feature data to obtain the reference expression data.
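A minimal sketch of the two-step adjustment in steps 61-62, under the assumption that the intermediate expression data is a set of blendshape-style coefficients in [0, 1] (an assumed representation) and that louder audio scales the expression amplitude up, as described.

```python
def adjust_expression(intermediate: dict[str, float], volume: float,
                      max_volume: float = 1.0) -> dict[str, float]:
    """Scale intermediate expression data by the sound volume to get reference data."""
    scale = max(0.0, min(volume / max_volume, 1.0))  # clamp to [0, 1]
    return {name: value * scale for name, value in intermediate.items()}

# Step 61: intermediate expression data from the mapping relationship (placeholder).
intermediate = {"mouth_smile": 0.8, "brow_raise": 0.4}
# Step 62: adjust by the extracted sound feature (here, a volume of 0.5).
reference = adjust_expression(intermediate, volume=0.5)  # -> halved amplitudes
```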
Optionally, in the case that the image processing apparatus is equipped with a voice collection component (e.g., a microphone), the image processing apparatus implements step 601 by performing the following steps:
63. Voice data is collected through the voice collection component.
the voice data may be a sound made by a person during speaking. The voice data may also be voice data output by a voice terminal, where the voice terminal includes: cell-phone, computer, panel computer, server, wearable equipment. For example, the image processing apparatus picks up voice data by a microphone from a mobile phone and outputs the voice data through a speaker.
64. Semantic analysis processing is performed on the voice data to obtain semantic data.
The semantics of the voice data are extracted by performing semantic analysis processing on the voice data, yielding the semantic data.
65. The first audio data is obtained according to the information carried in the semantic data.
The information carried in the semantic data includes the semantics of the voice data. The first audio data is obtained according to the semantics of the voice data, so as to respond to the voice data.
Based on steps 63 to 65, the expression of a virtual character can be controlled while the virtual character is controlled to carry on a conversation.
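The steps 63-65 pipeline might be wired together as below; `transcribe`, `plan_reply`, and `synthesize` are hypothetical stand-ins for a speech recognizer, a dialogue module, and a text-to-speech engine, none of which are specified by the patent.

```python
def respond_to_voice(voice_samples: bytes) -> bytes:
    """Sketch of steps 63-65: collect voice, analyze semantics, produce first audio data."""
    semantic_data = transcribe(voice_samples)   # step 64: semantic analysis processing
    reply_text = plan_reply(semantic_data)      # respond to the carried information
    return synthesize(reply_text)               # step 65: first audio data

def transcribe(samples: bytes) -> str:
    return "hello"                              # placeholder recognizer

def plan_reply(semantics: str) -> str:
    return "hello" if semantics == "hello" else "I did not catch that"

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")                 # placeholder text-to-speech output

first_audio = respond_to_voice(b"\x00\x01")     # voice data from the collection component
```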
Referring to fig. 7, fig. 7 is a flowchart illustrating a method for obtaining a first face model according to an embodiment of the present application.
701. A second face image and a depth image of the second face image are acquired.
In this embodiment, the depth image of the second face image includes depth information of pixels in the second face image.
The manner of acquiring the second face image may be: the image processing device receives a second face image input by a user through an input component, wherein the input component comprises: keyboard, mouse, touch screen, touch pad, audio input device, etc.
The manner of acquiring the second face image may also be: the image processing device receives a second face image sent by a sixth terminal, wherein the sixth terminal comprises a mobile phone, a computer, a tablet computer, a server and the like.
The acquiring of the second face image may further be done by camera capture by the image processing apparatus; for example, the image processing apparatus is a mobile phone, and a face image captured by the camera of the mobile phone is taken as the second face image.
The manner of acquiring the second face image may also be: the image processing apparatus intercepts, from a video stream, a frame containing a human face as the second face image.
The depth image may be obtained by shooting with a depth camera, where the depth camera may be any one of the following: a structured light camera, a TOF camera, or a binocular stereo vision camera. The depth image may also be acquired by receiving a depth image input by a user through the input component, or by receiving a depth image sent by a seventh terminal, where the seventh terminal includes a mobile phone, a computer, a tablet computer, a server, and the like. In this embodiment, the sixth terminal and the seventh terminal may be the same or different.
In one possible implementation, the image processing apparatus is a mobile phone equipped with an RGB camera and a TOF camera. The mobile phone photographs the face with the RGB camera to obtain the second face image and, while the RGB camera captures the second face image, photographs the face with the TOF camera to obtain the depth image of the second face image.
702. The first face model is obtained according to the second face image and the depth image.
In one possible implementation, the following can be determined according to the face key points (see step 502 for the manner of obtaining face key points): the contour of the three-dimensional face model, and the contours and positions of the five sense organs in the three-dimensional face model. According to the depth information in the depth image, the depth information of the face key points in the face contour model can be determined, yielding a third face model. Since the third face model does not include face texture data, the third face model is a face white model (as shown in fig. 8, which depicts a face white model). The face texture data in the second face image is then fused with the third face model to obtain a face model with face texture data, namely the first face model.
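A simplified sketch of lifting face key points to 3D with the depth image, in the spirit of the white-model construction above; camera intrinsics are deliberately omitted and the arrays are placeholders, so this is an assumption-laden illustration rather than the patent's exact procedure.

```python
import numpy as np

def back_project(keypoints_2d: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Attach a depth value to each 2D face key point to form (x, y, z) vertices."""
    zs = depth[keypoints_2d[:, 1], keypoints_2d[:, 0]]   # index depth by (row, column)
    return np.column_stack([keypoints_2d, zs])

depth = np.random.rand(480, 640).astype(np.float32)         # depth image placeholder
keypoints = np.array([[320, 240], [300, 250], [340, 250]])  # key points as (column, row)
white_model_vertices = back_project(keypoints, depth)       # (N, 3) white-model vertices
```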
A change of expression is realized by deforming the face of the face model, where the deformation of the face includes deformation of the five sense organs, such as rotation of the eyeballs, opening and closing of the mouth, flaring of the nostrils, change in the size of the eye sockets, and change in the position of the eyebrows. Since the information of the five-sense-organ regions in the second face image is limited and expressions are of a wide variety, missing information may appear in the five sense organs of the obtained second face model when the expression of the first face model is adjusted to an expression different from that in the second face image. The "missing information" here refers to the information gap produced by the difference between the expression in the second face image and the expression indicated by the reference expression data.
For example, the expression in the second face image is closed eyes, the expression in the first face model obtained based on the second face image is also closed eyes, and the expression indicated by the reference expression data is open eyes. Thus, the second face model obtained according to the first face model and the reference expression data does not contain information of the eye region. That is, the information of the eye region in the second face model is missing information.
For another example, the expression in the second face image is a closed mouth, the expression in the first face model obtained based on the second face image is likewise a closed mouth, and the expression indicated by the reference expression data is an open-mouthed smile. In this case, the second face model obtained from the first face model and the reference expression data does not contain information of the oral cavity region; that is, the information of the oral cavity region in the second face model is missing information.
Obviously, when information is missing from the second face model, its fidelity decreases. The embodiments of the application therefore provide a technical solution to reduce the probability of missing information in the second face model and thereby improve its fidelity.
As an optional implementation manner, step 702 specifically includes the following steps:
71. and obtaining a third face model according to the second face image and the depth image.
The implementation of this step can refer to step 702, which will not be described herein.
72. And removing the pixel area belonging to the reference area in the third face model to obtain a fourth face model.
In embodiments of the present application, the reference area comprises at least one of: the eye area and the oral cavity area. After the third face model is obtained, the pixel area in the reference area of the third face model is removed to obtain the fourth face model. For example, if the reference area includes both the eye area and the oral cavity area, then after the third face model is obtained, the pixel area in the eye area and the pixel area in the oral cavity area of the third face model are removed to obtain the fourth face model.
73. And filling reference data into the reference area in the fourth face model to obtain the first face model.
In an embodiment of the application, the reference data includes at least one of: data of the eye area and data of the oral cavity area. The reference data is a pixel region matched with the fourth face model. For example, the data of the oral cavity region filled into the fourth face model may cover the same area as the oral cavity region in the fourth face model. For another example, the data of the eye region filled into the fourth face model covers the same area as the data of the eye region removed from the third face model.
Unlike the pixel regions removed from the third face model, each pixel region filled into the fourth face model has at least one associated pixel region (hereinafter referred to as related data), each matched with a different expression. For example, suppose the reference data filled into the fourth face model includes a pixel region within the eye region whose line-of-sight deflection angle is 0 degrees and whose matched expression is a smile. The pixel regions associated with this filled pixel region may include: a pixel region in the eye region with a line-of-sight deflection angle of 30 degrees, whose matched expression is a slight smile; and a pixel region in the eye region with a line-of-sight deflection angle of 60 degrees, whose matched expression is a laugh. In the embodiments of the application, the angle between the shooting direction of the imaging device that acquires the second face image and the line of sight of the person being shot is called the line-of-sight deflection angle. Viewed from above the top of the person's head, the deflection angle is positive when the shooting direction is offset clockwise from the line of sight, and negative when it is offset counterclockwise.
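The sign convention just described can be expressed compactly. Below is a hedged sketch; the top-down 2D projection and the function name are assumptions for illustration, since the disclosure defines only the convention, not an implementation:

```python
import math

def gaze_deflection_deg(shoot_dir, gaze_dir):
    """Signed line-of-sight deflection angle in degrees between the camera's
    shooting direction and the subject's gaze, both given as 2D vectors in a
    top-down (overhead) projection; clockwise offsets are positive."""
    ccw = math.atan2(gaze_dir[0] * shoot_dir[1] - gaze_dir[1] * shoot_dir[0],
                     gaze_dir[0] * shoot_dir[0] + gaze_dir[1] * shoot_dir[1])
    return -math.degrees(ccw)  # negate: clockwise positive, per the convention above
```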
For another example, suppose the reference data filled into the fourth face model includes a pixel region in the oral cavity region covering an area of 15 square inches, with a matched expression of a smile. The pixel regions associated with this filled pixel region may include: a pixel region in the oral cavity region covering an area of 40 square inches, whose matched expression is a laugh; and a pixel region in the oral cavity region covering an area of 5 square inches, whose matched expression is clenched teeth.
The face model obtained by filling the reference area in the fourth face model with the reference data serves as the first face model. Because the reference data and the related data are preset and closely matched, during adjustment of the expression of the first face model the data of the five sense organ regions can be determined from the related data associated with the reference data, which reduces the probability of missing information in the second face model and thereby improves the fidelity of the second face model.
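A minimal sketch of steps 71 to 73, assuming the reference regions are available as boolean masks over the model's texture and that preset reference data with a matched footprint exists; all names are illustrative and not from the disclosure:

```python
import numpy as np

def remove_reference_region(texture, region_mask):
    """Step 72: clear the pixel region belonging to the reference region
    (eye area and/or oral cavity area), turning the third face model's
    texture into the fourth face model's texture."""
    out = texture.copy()
    out[region_mask] = 0
    return out

def fill_reference_data(texture, region_mask, reference_patch):
    """Step 73: fill the cleared reference region with preset reference
    data whose footprint matches the removed region, yielding the first
    face model's texture. During expression adjustment, associated
    variants of reference_patch (other gaze angles, other mouth states)
    can replace it, avoiding missing information in the second face model."""
    out = texture.copy()
    out[region_mask] = reference_patch[region_mask]
    return out
```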
According to the embodiment of the application, the fourth face model is obtained by removing the pixel region belonging to the reference region in the third face model, and the reference data is filled in the reference region in the fourth face model, so that the first face model is obtained. Therefore, in the process of adjusting the expression of the first face model, the related data associated with the reference data can be utilized, so that the probability of occurrence of the condition of missing information in the obtained second face model is reduced, and the fidelity of the second face model is improved.
Based on the technical scheme provided by the embodiment of the application, the embodiment of the application also provides several possible application scenarios.
Scene A: with the popularization of mobile terminals and the rapid development of internet technologies, more and more people use mobile terminals to carry out video calls. Based on the technical scheme provided by the embodiment of the application, the use of the human face model in the video call process can be realized, so that the interestingness of the video call is improved.
Referring to fig. 9, fig. 9 is a schematic flowchart illustrating another image processing method according to an embodiment of the present disclosure.
901. A video stream and a first face model are obtained.
In the embodiment of the application, the image processing device is loaded with the camera, and the video stream can be collected through the camera.
The implementation manner of obtaining the first face model may refer to step 301, or steps 701 to 702, and will not be described here again. Optionally:
902. and carrying out face detection processing on the images in the video stream to obtain an image containing a face as a first face image.
In the embodiment of the application, the images containing a face in the video stream can be determined by performing face detection processing on each frame of the video stream, and each image containing a face is taken as a first face image. For example (example 1), the video stream includes image A, image B, and image C; face detection processing on the video stream determines that image A and image B both contain a face, while image C does not. Based on this result, image A and image B are taken as first face images.
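As an illustrative stand-in for the face detection processing (the disclosure does not name a specific detector), the following sketch uses OpenCV's stock Haar cascade to keep only the frames that contain a face:

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def first_face_images(video_path):
    """Yield each frame of the video stream that contains a face,
    i.e. each candidate first face image."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if len(detector.detectMultiScale(gray, 1.1, 5)) > 0:
            yield frame
    cap.release()
```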
903. And obtaining reference expression data based on the first face image.
The implementation manner of this step can be referred to as step 502, and will not be described herein.
It should be understood that, when the number of first face images is greater than or equal to 2, one reference expression data can be obtained from each first face image. Building on example 1 (example 2): reference expression data D can be obtained from image A, and reference expression data E from image B, where the expression indicated by reference expression data D is the same as the expression in image A, and the expression indicated by reference expression data E is the same as the expression in image B.
904. And obtaining a second face model according to the reference expression data and the first face model.
The implementation manner of this step can be referred to as step 302, and will not be described herein.
It should be understood that, when the number of reference expression data is greater than or equal to 2, a second face model can be obtained from each reference expression data. Continuing example 2: the second face model F can be obtained from reference expression data D and the first face model, and the second face model G from reference expression data E and the first face model, where the expression of the second face model F is the same as the expression in image A, and the expression of the second face model G is the same as the expression in image B.
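Tying steps 901 to 904 together, the sketch below drives the first face model frame by frame; extract_expression and render_expression are hypothetical stand-ins for steps 903 and 904, passed in here as callables, and first_face_images is the detector sketched above:

```python
def animate_face_model(video_path, first_face_model,
                       extract_expression, render_expression):
    """For each detected first face image, derive one reference expression
    data (step 903) and render one second face model (step 904)."""
    for face_image in first_face_images(video_path):     # step 902
        expr = extract_expression(face_image)            # step 903
        yield render_expression(first_face_model, expr)  # step 904
```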
Based on the technical solution provided by the embodiments of the application, during a video call the user's expression can be transferred to the first face model to obtain the second face model, and the video call is then carried out through the second face model. For example, Xiaohong wants to use a face model for a video call with Xiaoming, to make the call more engaging. Before the video call, Xiaohong can scan her face with a mobile phone to obtain a face model of her face (hereinafter referred to as face model a). During the video call, the mobile phone performs face detection processing on the captured video stream to obtain images containing Xiaohong's face as first face images. The mobile phone takes face model a as the first face model and, based on the technical solution provided by the embodiments of the application, transfers the expressions in the first face images to face model a to obtain face model b (i.e., the second face model), and uses face model b for the video call with Xiaoming.
Scene B: with the rapid development of the three-dimensional printing (3D printing) technology, the three-dimensional printing technology is widely applied to the fields such as mold manufacturing, industrial design, and the like. The physical model corresponding to the three-dimensional model can be obtained through the three-dimensional printing technology, so that the method has very important significance on how to efficiently obtain the three-dimensional model with high accuracy.
For example, Xiaoming is pleased with how happily he is laughing in a photo of himself (hereinafter referred to as photo c) and wants a physical model of himself with the expression in photo c. Xiaoming can scan his face with a mobile phone to obtain a face model of his face (hereinafter referred to as face model d). The mobile phone can take photo c as the first face image and face model d as the first face model, and, based on the technical solution provided by the embodiments of the application, transfer the expression in photo c to face model d to obtain face model e (i.e., the second face model). Processing face model e with a three-dimensional printer yields a physical model f whose expression is the same as the expression in photo c.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
The method of the embodiments of the present application is set forth above in detail and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, where the apparatus 1 includes: a first acquisition unit 11, a first processing unit 12, a second acquisition unit 13, a second processing unit 14, a control unit 15, and a third processing unit 16, wherein:
a first obtaining unit 11, configured to obtain a first face model and reference expression data, where the first face model is obtained based on a face;
the first processing unit 12 is configured to render the expression of the first face model according to the reference expression data, so as to obtain a second face model.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring a first face image;
extracting face key points from the first face image to obtain face key point information in the first face image;
and obtaining the reference expression data according to the face key point information.
With reference to any one of the embodiments of the present application, the first processing unit is configured to:
performing feature extraction processing on the first face model to obtain a first feature image;
fusing the first characteristic image and the reference expression data to obtain a second characteristic image;
and performing upsampling processing on the second characteristic image to obtain the second face model.
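As a hedged sketch of the pipeline the first processing unit describes — feature extraction, fusion with the reference expression data, then upsampling — the following PyTorch module shows one plausible shape; the layer sizes and the broadcast-and-concatenate fusion are assumptions, not the disclosed architecture:

```python
import torch
import torch.nn as nn

class ExpressionRenderer(nn.Module):
    def __init__(self, expr_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(   # feature extraction -> first feature image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(64 + expr_dim, 64, 1)   # fusion -> second feature image
        self.decoder = nn.Sequential(   # upsampling -> second face model
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, model_image, expr_vec):
        feat = self.encoder(model_image)                 # first feature image
        expr = expr_vec[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        fused = self.fuse(torch.cat([feat, expr], dim=1))  # second feature image
        return self.decoder(fused)                       # second face model
```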
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring a video stream;
and carrying out face detection processing on the images in the video stream to obtain an image containing a face as the first face image.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
acquiring first audio data;
and obtaining the reference expression data according to a mapping relation and the information carried in the first audio data, wherein the mapping relation is used for representing the mapping between the information carried in the audio data and the expression data.
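A toy sketch of such a mapping relationship follows; the keys, the expression parameterization, and the neutral fallback are invented for demonstration and are not part of the disclosure:

```python
# Hypothetical mapping from information carried in audio data to expression data.
EXPRESSION_MAP = {
    "greeting": {"mouth_open": 0.4, "smile": 0.8, "eye_open": 1.0},
    "apology":  {"mouth_open": 0.2, "smile": 0.0, "eye_open": 0.7},
}

def reference_expression(carried_info, mapping=EXPRESSION_MAP):
    """Look up the expression data mapped to the information carried in
    the first audio data, falling back to a neutral expression."""
    return mapping.get(carried_info,
                       {"mouth_open": 0.0, "smile": 0.0, "eye_open": 1.0})
```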
In combination with any of the embodiments of the present application, the apparatus 1 further includes:
a second obtaining unit 13, configured to obtain a person attribute of the first face model;
a second processing unit 14, configured to obtain second audio data according to the character attribute, where information carried in the second audio data is the same as information carried in the first audio data;
and a control unit 15, configured to output the second audio data while controlling the second face model to perform the speaking operation.
In combination with any of the embodiments of the present application, the apparatus 1 further includes:
a third processing unit 16, configured to perform sound feature extraction processing on the first audio data to obtain feature data before obtaining the reference expression data according to the mapping relationship and the information carried in the first audio data;
the first obtaining unit is used for:
obtaining intermediate expression data according to the mapping relation and the information carried in the first audio data;
and adjusting the intermediate expression data according to the feature data to obtain the reference expression data.
With reference to any embodiment of the present application, the first obtaining unit is configured to:
collecting voice data through a voice collection assembly;
performing semantic analysis processing on the voice data to obtain semantic data;
and obtaining the first audio data according to the information carried in the semantic data.
In combination with any embodiment of the present application, the first face model is obtained based on a face, including:
acquiring a second face image and a depth image of the second face image;
and obtaining the first face model according to the second face image and the depth image.
With reference to any embodiment of the present application, the obtaining the first face model according to the second face image and the depth image includes:
obtaining a third face model according to the second face image and the depth image;
removing a pixel region belonging to a reference region in the third face model to obtain a fourth face model, wherein the reference region comprises at least one of the following regions: eye area, oral area;
filling reference data into a reference area in the fourth face model to obtain the first face model, wherein the reference data includes at least one of the following data: data of eye area, data of oral cavity area.
In combination with any embodiment of the present application, the first face model is a three-dimensional face model.
According to the embodiments of the application, the second face model is obtained from the reference expression data and the first face model, so that the expression indicated by the reference expression data is transferred to the first face model; further, the first face model under any expression can be obtained by changing the expression indicated by the reference expression data. Because a first face model obtained based on a real face is more lifelike than a face model that is not, transferring the expression indicated by the reference expression data to the first face model yields a more lifelike second face model, making the expression transfer effect more natural.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present application may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Fig. 11 is a schematic diagram of a hardware structure of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus 2 includes a processor 21, a memory 22, an input device 23, and an output device 24. The processor 21, the memory 22, the input device 23 and the output device 24 are coupled by a connector, which includes various interfaces, transmission lines or buses, etc., and the embodiment of the present application is not limited thereto. It should be appreciated that in various embodiments of the present application, coupled refers to being interconnected in a particular manner, including being directly connected or indirectly connected through other devices, such as through various interfaces, transmission lines, buses, and the like.
The processor 21 may be one or more Graphics Processing Units (GPUs), and in the case that the processor 21 is one GPU, the GPU may be a single-core GPU or a multi-core GPU. Alternatively, the processor 21 may be a processor group composed of a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. Alternatively, the processor may be other types of processors, and the like, and the embodiments of the present application are not limited.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It is understood that, in the embodiment of the present application, the memory 22 may be used to store not only the relevant instructions, but also relevant data, for example, the memory 22 may be used to store the reference expression data acquired through the input device 23, or the memory 22 may also be used to store a second face model obtained through the processor 21, and the like, and the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that fig. 11 only shows a simplified design of an image processing apparatus. In practical applications, the image processing apparatus may further include other necessary components, including but not limited to any number of input/output devices, processors, memories, etc., and all image processing apparatuses that can implement the embodiments of the application are within the scope of protection of the application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It is also clear to those skilled in the art that the descriptions of the various embodiments of the present application have different emphasis, and for convenience and brevity of description, the same or similar parts may not be repeated in different embodiments, so that the parts that are not described or not described in detail in a certain embodiment may refer to the descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in, or transmitted over, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Versatile Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media that can store program codes, such as a read-only memory (ROM) or a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Claims (13)
1. An image processing method, characterized in that the method comprises:
acquiring a first face model and reference expression data, wherein the first face model is obtained based on a face;
and rendering the expression of the first face model according to the reference expression data to obtain a second face model.
2. The method of claim 1, wherein the obtaining the reference expression data comprises:
acquiring a first face image;
extracting face key points from the first face image to obtain face key point information in the first face image;
and obtaining the reference expression data according to the face key point information.
3. The method of claim 2, wherein the rendering the expression of the first face model according to the reference expression data to obtain a second face model comprises:
performing feature extraction processing on the first face model to obtain a first feature image;
fusing the first characteristic image and the reference expression data to obtain a second characteristic image;
and performing upsampling processing on the second characteristic image to obtain the second face model.
4. The method according to claim 2 or 3, wherein the acquiring a first face image comprises:
acquiring a video stream;
and carrying out face detection processing on the images in the video stream to obtain a first face image containing a face.
5. The method of claim 1, wherein the obtaining the reference expression data comprises:
acquiring first audio data;
and obtaining the reference expression data according to a mapping relation and the information carried in the first audio data, wherein the mapping relation is used for representing the mapping between the information carried in the audio data and the expression data.
6. The method of claim 5, further comprising:
acquiring the character attribute of the first face model;
obtaining second audio data according to the character attributes, wherein the information carried in the second audio data is the same as the information carried in the first audio data;
and outputting the second audio data in the process of controlling the second face model to execute the speaking operation.
7. The method according to claim 5 or 6, wherein before the obtaining the reference expression data according to the mapping relationship and the information carried in the first audio data, the method further comprises:
carrying out sound feature extraction processing on the first audio data to obtain feature data;
obtaining the reference expression data according to the mapping relationship and the information carried in the first audio data, including:
obtaining intermediate expression data according to the mapping relation and the information carried in the first audio data;
and adjusting the intermediate expression data according to the feature data to obtain the reference expression data.
8. The method of any of claims 5 to 7, wherein the obtaining the first audio data comprises:
voice data are collected through a voice collection assembly;
performing semantic analysis processing on the voice data to obtain semantic data;
and obtaining the first audio data according to the information carried in the semantic data.
9. The method of any of claims 1 to 8, wherein the first face model is derived based on a human face, comprising:
acquiring a second face image and a depth image of the second face image;
and obtaining the first face model according to the second face image and the depth image.
10. The method of claim 9, wherein the deriving the first face model from the second face image and the depth image comprises:
obtaining a third face model according to the second face image and the depth image;
removing pixel regions belonging to the reference region in the third face model to obtain a fourth face model;
and filling a reference area in the fourth face model according to the reference data to obtain the first face model.
11. An image processing apparatus, characterized in that the apparatus comprises:
a first obtaining unit, configured to obtain a first face model and reference expression data, where the first face model is obtained based on a face;
and the first processing unit is used for rendering the expression of the first face model according to the reference expression data to obtain a second face model.
12. An electronic device, comprising: processor, transmission means, input means, output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1 to 10.
13. A computer-readable storage medium, in which a computer program is stored, which computer program comprises program instructions which, if executed by a processor, cause the processor to carry out the method of any one of claims 1 to 10.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010356731.5A CN111597926A (en) | 2020-04-29 | 2020-04-29 | Image processing method and device, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010356731.5A CN111597926A (en) | 2020-04-29 | 2020-04-29 | Image processing method and device, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111597926A true CN111597926A (en) | 2020-08-28 |
Family
ID=72187695
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010356731.5A Withdrawn CN111597926A (en) | 2020-04-29 | 2020-04-29 | Image processing method and device, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111597926A (en) |
- 2020-04-29: application CN202010356731.5A filed; published as CN111597926A; status: withdrawn (not active)
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4945805A (en) * | 1988-11-30 | 1990-08-07 | Hour Jin Rong | Electronic music and sound mixing device |
| JP2013104878A (en) * | 2011-11-10 | 2013-05-30 | Yamaha Corp | Music generation device |
| CN105046238A (en) * | 2015-08-17 | 2015-11-11 | 华侨大学 | Facial expression robot multi-channel information emotion expression mapping method |
| CN108399383A (en) * | 2018-02-14 | 2018-08-14 | 深圳市商汤科技有限公司 | Expression moving method, device storage medium and program |
| CN110555126A (en) * | 2018-06-01 | 2019-12-10 | 微软技术许可有限责任公司 | Automatic generation of melodies |
| CN108875633A (en) * | 2018-06-19 | 2018-11-23 | 北京旷视科技有限公司 | Expression detection and expression driving method, device and system and storage medium |
| CN108905193A (en) * | 2018-07-03 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Game manipulates processing method, equipment and storage medium |
| CN110163063A (en) * | 2018-11-28 | 2019-08-23 | 腾讯数码(天津)有限公司 | Expression processing method, device, computer readable storage medium and computer equipment |
| CN110517185A (en) * | 2019-07-23 | 2019-11-29 | 北京达佳互联信息技术有限公司 | Image processing method, device, electronic equipment and storage medium |
| CN111028330A (en) * | 2019-11-15 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Three-dimensional expression base generation method, device, equipment and storage medium |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114419177A (en) * | 2022-01-07 | 2022-04-29 | 上海序言泽网络科技有限公司 | Personalized expression package generation method and system, electronic equipment and readable medium |
| CN114419177B (en) * | 2022-01-07 | 2025-06-17 | 上海序言泽网络科技有限公司 | Personalized expression package generation method, system, electronic device and readable medium |
| CN116977515A (en) * | 2023-08-08 | 2023-10-31 | 广东明星创意动画有限公司 | Virtual character expression driving method |
| CN116977515B (en) * | 2023-08-08 | 2024-03-15 | 广东明星创意动画有限公司 | Virtual character expression driving method |
| CN117478818A (en) * | 2023-12-26 | 2024-01-30 | 荣耀终端有限公司 | Voice communication method, terminal and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109376582B (en) | An Interactive Face Cartoon Method Based on Generative Adversarial Networks | |
| CN109359538B (en) | Training method of convolutional neural network, gesture recognition method, device and equipment | |
| WO2024169314A1 (en) | Method and apparatus for constructing deformable neural radiance field network | |
| US11736756B2 (en) | Producing realistic body movement using body images | |
| CN113362263B (en) | Method, apparatus, medium and program product for transforming an image of a virtual idol | |
| WO2020103700A1 (en) | Image recognition method based on micro facial expressions, apparatus and related device | |
| JP4449723B2 (en) | Image processing apparatus, image processing method, and program | |
| CN114333078A (en) | Living body detection method, living body detection device, electronic apparatus, and storage medium | |
| CN107610209A (en) | Human face countenance synthesis method, device, storage medium and computer equipment | |
| CN113744286B (en) | Virtual hair generation method and device, computer readable medium and electronic device | |
| CN114202615B (en) | Facial expression reconstruction method, device, equipment and storage medium | |
| CN114821675B (en) | Object processing method and system and processor | |
| CN115049016B (en) | Model-driven method and device based on emotion recognition | |
| CN114937115A (en) | Image processing method, face replacement model processing method and device and electronic equipment | |
| CN111108508A (en) | Facial emotion recognition method, smart device and computer-readable storage medium | |
| CN111311733A (en) | Three-dimensional model processing method and device, processor, electronic device and storage medium | |
| CN111597926A (en) | Image processing method and device, electronic device and storage medium | |
| CN114049290A (en) | Image processing method, device, equipment and storage medium | |
| WO2017003031A1 (en) | Method for providing lifelike avatar emoticon-based ultralight data animation creation system, and terminal device providing lifelike avatar emoticon for implementing same | |
| WO2024077791A1 (en) | Video generation method and apparatus, device, and computer readable storage medium | |
| CN117132711A (en) | Digital portrait customizing method, device, equipment and storage medium | |
| CN111597928A (en) | Three-dimensional model processing method and device, electronic device and storage medium | |
| WO2021155666A1 (en) | Method and apparatus for generating image | |
| CN114677476B (en) | A face processing method, device, computer equipment and storage medium | |
| CN115731326B (en) | Virtual character generation method and device, computer-readable medium, and electronic device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20200828 |