
CN116168107A - Face sketch coloring method based on reference picture - Google Patents

Face sketch coloring method based on reference picture

Info

Publication number
CN116168107A
CN116168107A (application CN202310180484.1A)
Authority
CN
China
Prior art keywords
reference picture
sketch
encoder
picture
human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310180484.1A
Other languages
Chinese (zh)
Inventor
刘恒
徐尧
刘涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology AHUT filed Critical Anhui University of Technology AHUT
Priority to CN202310180484.1A priority Critical patent/CN116168107A/en
Publication of CN116168107A publication Critical patent/CN116168107A/en
Pending legal-status Critical Current

Classifications

    • G06T11/10
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face sketch coloring method based on a reference picture, belonging to the technical field of image processing. The method comprises the following steps: 1. building a training set of paired sketches and real pictures; 2. constructing a face sketch encoder to extract the structural features of the face sketch and a reference picture encoder to extract the features of the reference picture, fusing the features with a cross-attention mechanism, and establishing long-range dependencies over the fused features with a Transformer in the decoding stage; 3. training the network model on the constructed encoder-decoder network and the prepared training set; 4. feeding a face sketch and a real face reference picture to the model with the learned parameters to obtain a real picture in the color style of the reference picture. The invention establishes long-range dependencies among the different parts of the synthesized real face, so that the generated face picture is fine and vivid locally and consistent and coherent globally.

Description

Face sketch coloring method based on reference picture
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a face sketch coloring method based on a reference picture.
Background
Coloring a face sketch with the help of a reference picture plays an important role in fields such as artistic creation and film and television media. In the field of deep learning, good progress has been made on image synthesis: compared with traditional methods, deep-learning-based approaches are superior in both the quality and the speed of the generated images. Within image generation, coloring a face sketch based on a reference picture is a research direction of practical value, and the development of deep learning technology makes it possible to pursue strong coloring performance on this task.
Traditional image synthesis techniques render a synthesized real image by means of three-dimensional reconstruction and deformation. However, synthesizing a real image with three-dimensional technology consumes considerable time and money, and its users must master the corresponding three-dimensional modeling and design knowledge, so the barrier to use is high and the practicability is limited. At present, sketch-based image generation algorithms in the traditional field, at home and abroad, generally adopt a search-and-fusion approach, such as Sketch2Photo and Photosketcher. These methods segment the sketch, match the segmented image blocks against an image database, and merge the matched blocks with a fusion algorithm. To match image blocks more accurately, the sketch often has to be annotated manually to narrow the matching range; in addition, traditional fusion algorithms cannot achieve satisfactory global consistency in the fusion stage, and introducing the features of a reference picture further aggravates the difficulty of the synthesis task. The traditional methods therefore have great limitations on the problem of reference-picture-based face sketch coloring.
Researchers proposed Style2paints, a deep learning method for reference-picture-based coloring. Unlike traditional methods, it uses a fully convolutional U-Net-shaped encoder-decoder network, and its feature fusion is simple feature concatenation; even so, Style2paints is a great improvement in effect over the traditional methods. Other researchers proposed a novel image-to-image translation model, ContextualGAN, which concatenates paired sketches and color pictures and then uses the model to locate the test sketch on the joint distribution in order to find the corresponding color picture. This method ensures that the obtained pictures look realistic, but when the back-propagation search path is too long, the color picture obtained at test time can become inconsistent with the input sketch. Yet another reference-picture-based sketch coloring network uses two discriminators: one judges whether the structure of the generated picture is consistent with the input sketch, and the other judges whether its color is consistent with the input reference picture. Although the synthesized picture can maintain good structural and color consistency with the inputs, the fully convolutional structure of the generator makes the model applicable only to simple icon sketches, not to complex face sketches. While these methods have advanced reference-picture-based image coloring, they are not applicable in some cases: they usually do not consider the feature correlation between the input sketch and the reference picture and simply concatenate features, and the local nature of convolutional features causes artifacts and global inconsistencies in the coloring of the finally generated face.
The application with number CN202210111517.2, filed on 26 January 2022 and entitled "A sketch coloring method based on an attention mechanism", differs from existing methods that color sketches with color blocks: the user only needs to input one style reference picture to quickly obtain a high-quality coloring in a similar style. Although that method effectively fuses the reference picture features with the structural features of the input sketch rather than simply concatenating them, the local receptive field of convolution is insufficient to bridge the large domain gap between sketches and color real pictures, so the finally synthesized face can be globally inconsistent, incoherent, and distorted.
The application with number CN201910909387.5, filed on 24 September 2019 and entitled "An image generation method based on sketch", builds a generative adversarial model comprising a generator and a discriminator: the generator contains a downsampling mask residual module, a residual module, an upsampling mask residual module, and a conditional self-attention module, and the discriminator contains more than one sub-discrimination network of different depths. The method ensures that the generated face image has realistic local texture and a complete face structure. Although it proposes using self-attention in the encoder to establish long-range dependencies over image features, conventional convolutional layers are still used in the encoder, so the obtained features remain local features of the image; the global features of the image are lost, which is detrimental to the decoder generating a complete real picture.
Disclosure of Invention
1. Technical problem to be solved by the invention
In order to overcome the problems that the semantic relation between the structural features of the sketch and the style and color features of the reference picture cannot be established, and that long-range dependencies cannot be established among the face features in the decoding stage, which leads to global inconsistency of the face, the invention provides a face sketch coloring method based on a reference picture. The invention effectively establishes the semantic relation between the structural features of the sketch and the color features of the reference picture with cross-attention, and establishes long-range dependencies over the image features in the decoding stage with a Transformer, so that it can adapt to reference-picture-based face sketch coloring scenarios and meet complex real-world requirements.
2. Technical proposal
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
The invention relates to a face sketch coloring method based on a reference picture, which comprises the following steps:
step 1, utilizing a real face picture to manufacture a face sketch and a real face picture training set;
step 2, constructing a face sketch coloring generation countermeasure network model based on a reference picture;
step 3, training the model constructed in the step 2 according to the training data set manufactured in the step 1, and storing model parameters obtained by training;
and 4, taking a human face sketch and a reference picture as network inputs, and reconstructing a human face image with the color style of the reference picture by using the parameters obtained by the learning in the step 3 as output.
Further, in step 1, the CelebA-HQ dataset is used, and the sketch corresponding to each real picture is obtained with the SketchKeras model, so as to build a paired face sketch and real face picture dataset.
Still further, the generative adversarial network constructed in step 2 includes a face sketch encoder, a reference picture encoder, a cross-attention-based feature fusion module, a Swin Transformer block-based decoder, a guide decoder, and a discriminator.
Furthermore, the face sketch encoder, the reference picture encoder, and the Swin Transformer block-based decoder form a U-Net generator network; wherein:
the face sketch encoder acquires the contextual structure information of the sketch;
the reference picture encoder acquires the contextual information of the color style of the reference picture;
the cross-attention-based feature fusion module fuses the sketch features with the reference picture features;
the Swin Transformer block-based decoder decodes the fused features, establishes long-range dependencies over them, and finally generates a real face image in the color style of the reference picture;
the guide decoder guides image generation from the fused features to improve picture quality and to avoid vanishing gradients in the intermediate layers of the network;
the discriminator discriminates the pictures produced by the generator, pushing them to be more realistic and finer in detail.
Further, the face sketch encoder consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2.
Further, the reference picture encoder likewise consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2.
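A minimal PyTorch sketch of such an 8-unit encoder. The patent specifies only the kernel size (4), stride (2), batch normalization, and LeakyReLU slope (0.2); the channel widths, the channel cap, and the padding of 1 are assumptions for illustration:

```python
import torch
import torch.nn as nn

def make_encoder(in_ch=3, base_ch=64, n_units=8):
    """Illustrative 8-unit encoder: each unit halves the spatial size with a
    4x4 stride-2 convolution, then applies BatchNorm and LeakyReLU(0.2)."""
    layers, ch = [], in_ch
    for i in range(n_units):
        out_ch = min(base_ch * (2 ** i), 512)  # channel growth and cap are assumed
        layers += [
            nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        ch = out_ch
    return nn.Sequential(*layers)

encoder = make_encoder(in_ch=1)      # sketch encoder: 1-channel sketch input
x = torch.randn(2, 1, 256, 256)      # a batch of 256x256 face sketches
feat = encoder(x)                    # 256 / 2^8 = 1, so a 1x1 feature map
```

The reference picture encoder would be the same construction with `in_ch=3` for color input.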
Still further, the Swin Transformer block-based decoder consists of 7 decoding units, each containing a transposed convolution layer with kernel size 4 and stride 2; each decoding unit is followed by two consecutive Swin Transformer blocks, and each Swin Transformer block consists of W-MSA, SW-MSA, LN, and MLP.
Still further, the guide decoder uses 8 consecutive transposed convolutions with kernel size 3, stride 2, and padding 1, each followed by a LeakyReLU with negative slope 0.2.
Further, the discriminator consists of 5 groups of convolution layers with kernel size 4, stride 2, and padding 2, a batch normalization layer module, and a LeakyReLU with negative slope 0.2.
Further, in step 4 the network model is optimized with the Adam algorithm, and the loss function of the network comprises three parts: the perceptual loss of the generated picture, the adversarial loss, and the reconstruction loss of the guide decoder; these losses are summed and optimized jointly during training.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following remarkable effects:
(1) In the face sketch coloring method based on a reference picture, the network effectively fuses the color style features of the reference picture with the structural features of the face sketch through the cross-attention feature fusion module.
(2) Considering the large image-domain gap between a sketch and a real picture, the method introduces a decoder that preserves global information to achieve a better result; compared with other convolution-network-based methods, the finally generated face is more consistent and coherent in structure and color globally, and finer and more vivid locally.
(3) By additionally using the guide decoder to raise the quality of the synthesized picture and the discriminator to judge the generated picture, the method pushes the generated picture to be finer and more vivid, and has broad application prospects in fields such as artistic creation and image processing.
Drawings
FIG. 1 is a flow chart of the reference-picture-based face sketch coloring method of the present invention;
FIG. 2 is a flow chart of the sketch dataset production in the present invention;
FIG. 3 is a schematic diagram of the reference-picture-based face sketch coloring generative adversarial network constructed in accordance with the present invention;
FIG. 4 is a detailed view of the feature fusion module of the present invention;
FIG. 5 is a detailed diagram of the Swin Transformer block-based decoder network of the present invention;
FIG. 6 is a detailed view of the discriminator constructed in the present invention.
Detailed Description
For a further understanding of the present invention, it is described in detail below with reference to the drawings and examples.
Example 1
Referring to fig. 1, a face sketch coloring method based on a reference picture in this embodiment specifically includes the following steps:
step 1, a common data set, namely a CelebA-HQ data set is utilized to manufacture a pair of sketch and real picture data sets. The specific steps are shown in fig. 2, namely:
and acquiring a sketch corresponding to the real picture by using a SketchKeras model for each image in the CelebA-HQ dataset.
Step 2, constructing a face sketch coloring generation countermeasure network based on a reference picture;
2-1. A generative adversarial network model is constructed that synthesizes realistic pictures from sketches based on the sketch's global context information. The specific structure is shown in fig. 3: the network consists of a face sketch encoder, a reference picture encoder, a cross-attention-based feature fusion module, a Swin Transformer block-based decoder, a guide decoder, and a discriminator.
The modules operate as follows. First, a U-Net generator network is built from the face sketch encoder, the reference picture encoder, and the Swin Transformer block-based decoder. The face sketch encoder acquires the contextual structure information of the sketch, and the reference picture encoder acquires the contextual information of the reference picture's color style. The cross-attention-based feature fusion module fuses the sketch features with the reference picture features; the Swin Transformer block-based decoder decodes the fused features, establishing long-range dependencies over them, and finally generates a real face image in the color style of the reference picture. In addition, a guide decoder is used to guide image generation from the fused features, improving picture quality and avoiding vanishing gradients in the intermediate layers of the network. The discriminator discriminates the pictures produced by the generator so that they become more realistic and finer in detail. The flow is expressed by formula (1):
F_s = Encoder_s(I_sketch)
F_r = Encoder_r(I_ref)
F_fusion = Fusion(F_s, F_r)
I_img = STB-Decoder(F_fusion)
I_guide = Guide-Decoder(F_fusion)    (1)
where Encoder_s(·) denotes the face sketch encoder module, I_sketch the input sketch, and F_s the structural features obtained by passing the face sketch through the face sketch encoder module; Encoder_r(·) denotes the reference picture encoder module, I_ref the input reference picture, and F_r the reference picture features obtained by encoding the reference picture; Fusion(·) denotes the cross-attention feature fusion module and F_fusion the fused features it produces; STB-Decoder(·) denotes the Swin Transformer block-based decoder and I_img the real face picture it decodes; Guide-Decoder(·) denotes the guide decoder and I_guide the picture obtained by passing F_fusion through it.
2-2. The constructed face sketch encoder module is shown in fig. 3. It consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2. Passing the face sketch through this encoder yields the structural features of the face sketch.
2-3. The constructed reference picture encoder module is shown in fig. 3. It likewise consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2. Passing the reference picture through this encoder yields the color style features of the reference picture.
2-4. The constructed cross-attention feature fusion module is shown in fig. 4. The face sketch encoder produces the structural feature map of the face sketch, and the reference picture encoder produces the color style feature map of the reference picture. The sketch structural feature map passes through one linear layer to obtain the Query, and the reference picture's color style feature map passes through two linear layers to obtain the Key and the Value, respectively. Matrix multiplication of Query and Key yields the attention map; that is, the face structure features are used to retrieve the related content in the reference picture's color style feature map, and the weights in the attention map represent the correlation between different positions. Multiplying the attention map by the Value gives the attended features, which are finally added to the input face sketch structural features to obtain the fused features.
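The fusion step above can be sketched in NumPy. The projection matrices Wq/Wk/Wv stand in for the linear layers, and the 1/sqrt(d) scaling is a common attention convention assumed here (the patent does not state it):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(sketch_feat, ref_feat, Wq, Wk, Wv):
    """Sketch features supply the Query; reference-picture features supply
    Key and Value. The attention map weights reference content by its
    relevance to each sketch position; a residual add keeps the sketch
    structure in the fused output."""
    Q = sketch_feat @ Wq                     # (N, d): N sketch positions
    K = ref_feat @ Wk                        # (M, d): M reference positions
    V = ref_feat @ Wv                        # (M, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)  # (N, M) attention map
    attended = attn @ V                      # reference content per sketch position
    return sketch_feat + attended            # residual add of the sketch structure

rng = np.random.default_rng(0)
d = 8
s = rng.standard_normal((16, d))             # flattened sketch feature map
r = rng.standard_normal((16, d))             # flattened reference feature map
out = cross_attention_fusion(s, r, *(rng.standard_normal((d, d)) for _ in range(3)))
```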
2-5. The Swin Transformer block-based decoder network is shown in fig. 3. It consists of 7 decoding units, each containing a transposed convolution layer with kernel size 4 and stride 2. Each decoding unit is followed by two consecutive Swin Transformer blocks, each composed of W-MSA (window-based multi-head self-attention), SW-MSA (shifted-window-based multi-head self-attention), LN (layer normalization), and MLP (multi-layer perceptron), as shown in fig. 5. This can be expressed by formula (2):
X̂_l = MSA(LN(X_{l-1})) + X_{l-1}
X_l = MLP(LN(X̂_l)) + X̂_l    (2)
where X denotes the input feature map, MSA denotes the W-MSA or SW-MSA unit, X̂_l denotes the intermediate feature obtained through LN and MSA, and l and l-1 index the current and previous iteration in turn. The features produced by the encoder are also decoded with a guide decoder, which consists of 8 consecutive transposed convolutions with kernel size 3, stride 2, and padding 1, each followed by batch normalization and a LeakyReLU with negative slope 0.2, as in formula (3):
f_out = LeakyReLU(BN(Conv(f_in)))    (3)
where f_in denotes the input features, Conv the convolution layer, BN the batch normalization layer, and LeakyReLU the activation function.
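The two residual equations of formula (2) can be sketched with toy stand-ins. The MSA here is single-head attention without window partitioning or learned projections, and the MLP is a bare ReLU, so this only illustrates the residual/LN structure, not a real Swin block:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def toy_msa(x):
    """Single-head self-attention stand-in for W-MSA/SW-MSA (no window
    partition, no learned projections -- illustrative only)."""
    d = x.shape[-1]
    a = np.exp(x @ x.T / np.sqrt(d))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ x

def toy_mlp(x):
    return np.maximum(x, 0.0)  # stand-in for the two-layer MLP

def swin_block(x):
    """Formula (2): X_hat = MSA(LN(X)) + X, then X_out = MLP(LN(X_hat)) + X_hat."""
    x_hat = toy_msa(layer_norm(x)) + x
    return toy_mlp(layer_norm(x_hat)) + x_hat

x = np.random.default_rng(1).standard_normal((49, 32))  # e.g. a 7x7 window of tokens
y = swin_block(x)
```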
2-6. The constructed discriminator is shown in fig. 6. The discriminator network judges the pictures generated by the generator; it consists of 5 groups of convolution layers with kernel size 4, stride 2, and padding 2, a batch normalization layer module, and a LeakyReLU with negative slope 0.2.
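An illustrative PyTorch version of such a discriminator. The patent specifies the 5 conv groups (kernel 4, stride 2, padding 2, BatchNorm, LeakyReLU 0.2); the channel widths and the final 1-channel patch-score head are assumptions:

```python
import torch
import torch.nn as nn

def make_discriminator(in_ch=3, base_ch=64):
    """5 groups of (4x4 conv, stride 2, padding 2) + BatchNorm + LeakyReLU(0.2),
    followed by an assumed 1-channel real/fake prediction head."""
    layers, ch = [], in_ch
    for i in range(5):
        out_ch = min(base_ch * (2 ** i), 512)  # channel schedule assumed
        layers += [
            nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        ch = out_ch
    layers.append(nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1))  # patch scores
    return nn.Sequential(*layers)

D = make_discriminator()
score = D(torch.randn(2, 3, 256, 256))  # patch-wise real/fake score map
```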
Step 3, performing network training according to the training set obtained in the step 1 and the generated countermeasure network constructed in the step 2;
3-1. The network is trained on the PyTorch deep learning platform. The generative adversarial network is initialized in the Xavier manner, and all biases b_i are initialized to 0. The specific process is as follows:
1) After the weights W of the adversarial network are initialized in the Xavier manner, W satisfies the following Gaussian distribution:
W ~ N(0, 1/n)    (4)
where n denotes the number of input units of the layer, i.e., the number of input feature maps of the convolution layer.
2) All biases in the network are initialized to 0, i.e., b_i = 0.
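A NumPy sketch of this initialization. The variance 1/n follows the text's definition of n as the number of input units; the standard Xavier variant 2/(n_in + n_out) is a common alternative:

```python
import numpy as np

def xavier_gaussian(shape, rng):
    """Draw weights from N(0, 1/n), where n is the layer's number of input
    units (fan-in). Biases are initialized separately to zero."""
    n = shape[0]  # fan-in: input units / input feature maps
    return rng.standard_normal(shape) / np.sqrt(n)

rng = np.random.default_rng(0)
W = xavier_gaussian((400, 256), rng)   # e.g. a 400-in, 256-out linear layer
b = np.zeros(256)                      # all biases initialized to 0
```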
3-2. The network model is optimized with the common Adam algorithm. The loss function used to train the model is shown in formula (5):
L_adv = E[log D(y_j)] + E[log(1 - D(G(x_j)))]
L_per = Σ_l ||φ_l(G(x_j)) - φ_l(y_j)||_1
L_guide = ||G_g(x_j) - y_j||_1    (5)
where x_j denotes the input sketch, y_j the corresponding real picture, D the discriminator, G_g the guide generator network, G the generator network, and φ_l(·) a pre-trained VGG network; this embodiment uses intermediate features from the first layers of the VGG network to compute the perceptual distance between features. The above losses are summed and optimized jointly during training; the overall loss of the network is shown in formula (6):
L = L_adv + L_per + L_guide    (6)
During training, the network parameters are updated for a specified number of iterations.
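As a numerical illustration of combining the three loss terms of formulas (5) and (6), the sketch below computes a toy total loss. The weights and the exact functional forms are assumptions for illustration; the patent only names the three terms:

```python
import numpy as np

def combined_loss(d_fake, feat_gen, feat_real, guide_out, target,
                  w_adv=1.0, w_per=1.0, w_rec=1.0):
    """Toy sum of the three loss parts. d_fake: discriminator scores on
    generated images; feat_*: VGG-style intermediate features of generated
    and real pictures; guide_out/target: guide-decoder output and the real
    picture. The weights w_* are assumed (not stated in the patent)."""
    adv = -np.log(np.clip(d_fake, 1e-8, 1.0)).mean()   # generator adversarial term
    per = np.abs(feat_gen - feat_real).mean()          # perceptual (feature) distance
    rec = np.abs(guide_out - target).mean()            # guide-decoder reconstruction
    return w_adv * adv + w_per * per + w_rec * rec

loss = combined_loss(
    d_fake=np.full(8, 0.5),
    feat_gen=np.ones((4, 4)), feat_real=np.zeros((4, 4)),
    guide_out=np.zeros((2, 2)), target=np.ones((2, 2)),
)
```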
And 4, taking a face sketch and a reference picture as network inputs, and reconstructing a face picture with the color style of the reference picture by using the parameters obtained by the learning in the step 3 as output.
The invention and its embodiments have been described above by way of illustration, not limitation; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited thereto. Therefore, if one of ordinary skill in the art, informed by this disclosure, designs structural modes and embodiments similar to this technical scheme without creative effort and without departing from the gist of the invention, they shall fall within the protection scope of the invention.

Claims (10)

1. A face sketch coloring method based on a reference picture, characterized by comprising the following steps:
step 1, utilizing a real face picture to manufacture a face sketch and a real face picture training set;
step 2, constructing a face sketch coloring generation countermeasure network model based on a reference picture;
step 3, training the model constructed in the step 2 according to the training data set manufactured in the step 1, and storing model parameters obtained by training;
and 4, taking a human face sketch and a reference picture as network inputs, and reconstructing a human face image with the color style of the reference picture by using the parameters obtained by the learning in the step 3 as output.
2. The reference picture-based face sketch coloring method according to claim 1, wherein: in step 1, the CelebA-HQ dataset is used, and the sketch corresponding to each real picture is obtained with the SketchKeras model, so as to build a paired face sketch and real face picture dataset.
3. The reference picture-based face sketch coloring method according to claim 1 or 2, wherein: the generative adversarial network constructed in step 2 comprises a face sketch encoder, a reference picture encoder, a cross-attention-based feature fusion module, a Swin Transformer block-based decoder, a guide decoder, and a discriminator.
4. The reference picture-based face sketch coloring method according to claim 3, wherein: the face sketch encoder, the reference picture encoder, the cross-attention-based feature fusion module, and the Swin Transformer block-based decoder form a U-Net generator network; wherein:
the face sketch encoder acquires the contextual structure information of the sketch;
the reference picture encoder acquires the contextual information of the color style of the reference picture;
the cross-attention-based feature fusion module fuses the sketch features with the reference picture features;
the Swin Transformer block-based decoder decodes the fused features, establishes long-range dependencies over them, and finally generates a real face image in the color style of the reference picture;
the guide decoder guides image generation from the fused features to improve picture quality and to avoid vanishing gradients in the intermediate layers of the network;
the discriminator discriminates the pictures produced by the generator, pushing them to be more realistic and finer in detail.
5. The reference picture-based face sketch coloring method according to claim 4, wherein: the face sketch encoder consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2.
6. The reference picture-based face sketch coloring method according to claim 5, wherein: the reference picture encoder consists of 8 encoding units, each containing a convolution layer with kernel size 4 and stride 2, a batch normalization layer, and a LeakyReLU activation with negative slope 0.2.
7. The reference picture-based face sketch coloring method according to claim 6, wherein: the Swin Transformer block-based decoder consists of 7 decoding units, each containing a transposed convolution layer with kernel size 4 and stride 2; each decoding unit is followed by two consecutive Swin Transformer blocks, and each Swin Transformer block consists of W-MSA, SW-MSA, LN, and MLP.
8. The reference picture-based face sketch coloring method according to claim 7, wherein: the guide decoder uses 8 consecutive transposed convolutions with kernel size 3, stride 2, and padding 1, each followed by a LeakyReLU with negative slope 0.2.
9. The reference picture-based face sketch coloring method according to claim 8, wherein: the discriminator consists of 5 groups of convolution layers with kernel size 4, stride 2, and padding 2, a batch normalization layer module, and a LeakyReLU with negative slope 0.2.
10. The reference picture-based face sketch coloring method according to claim 8, wherein: in step 4 the network model is optimized with the Adam algorithm, and the loss function of the network comprises three parts: the perceptual loss of the generated picture, the adversarial loss, and the reconstruction loss of the guide decoder; these losses are summed and optimized jointly during training.
CN202310180484.1A 2023-02-24 2023-02-24 Face sketch coloring method based on reference picture Pending CN116168107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310180484.1A CN116168107A (en) 2023-02-24 2023-02-24 Face sketch coloring method based on reference picture


Publications (1)

Publication Number Publication Date
CN116168107A true CN116168107A (en) 2023-05-26

Family

ID=86419770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310180484.1A Pending CN116168107A (en) 2023-02-24 2023-02-24 Face sketch coloring method based on reference picture


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119312703A (en) * 2024-12-17 2025-01-14 上海飞来飞去展览设计工程有限公司 A digital art design auxiliary method and system based on artificial intelligence
CN119991911A (en) * 2025-04-14 2025-05-13 合肥人工智能与大数据研究院有限公司 Method, system and storage medium for generating cartoon portrait images from sketches based on LDM

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150947A1 (en) * 2016-11-28 2018-05-31 Adobe Systems Incorporated Facilitating sketch to painting transformations
CN112767507A (en) * 2021-01-15 2021-05-07 大连理工大学 Cartoon sketch coloring method based on dynamic memory module and generation confrontation network
CN114494499A (en) * 2022-01-26 2022-05-13 电子科技大学 Sketch coloring method based on attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Heng Liu et al., "Sketch2Photo: Synthesizing photo-realistic images from sketches via global contexts", Engineering Applications of Artificial Intelligence, 22 November 2022 (2022-11-22), pages 1-14 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination