CN119151810A - Image processing method and electronic device
- Publication number: CN119151810A (application CN202411657514.4A)
- Authority: CN (China)
- Prior art keywords: image, text, feature, round, diffusion
- Legal status: Granted
Classifications
- G06T5/60: Image enhancement or restoration using machine learning, e.g. neural networks
- G06T5/70: Image enhancement or restoration; denoising, smoothing
- G06N3/0455: Neural network architectures; auto-encoder networks, encoder-decoder networks
- G06N3/0464: Neural network architectures; convolutional networks [CNN, ConvNet]
- G06V10/806: Image or video recognition or understanding; fusion of extracted features
- G06V10/811: Image or video recognition or understanding; fusion of classification results from classifiers operating on different input data, e.g. multi-modal recognition
Abstract
The disclosure provides an image processing method and an electronic device, relating to the field of terminal technology, capable of generating a correspondingly image-quality-enhanced image from an image quality enhancement requirement freely input by a user. The method comprises: in response to an image quality enhancement trigger operation, obtaining a first-round fused text-image bimodal feature according to an image quality enhancement requirement text feature, a first image feature, a second image feature, a text-guided enhancement model, and a diffusion model; performing N-1 iterations according to the first-round fused text-image bimodal feature, the requirement text feature, the first image feature, the text-guided enhancement model, and the diffusion model to obtain a first enhanced-image coding feature, where the text-guided enhancement model has the capability of fusing, in each iteration, the requirement text feature, the first image feature, and the j-th-round diffusion image feature set; and decoding the first enhanced-image coding feature to obtain the second image.
Description
Technical Field
The disclosure relates to the field of terminal technology, and in particular to an image processing method and an electronic device.
Background
In today's highly digitized age, taking pictures with electronic devices (e.g., mobile phones) or acquiring images through a network has become part of users' daily life. However, during actual acquisition, image quality may be poor due to the influence of various factors. For example, camera shake while photographing may blur an image, raindrops may appear in images taken on rainy days, images acquired from a network often have too low a resolution, and so on.
To address the problem of poor image quality, the related art applies a dedicated function to improve a specific image quality effect. For example, blurred images are improved with a deblurring function, image resolution is raised with a resolution-increasing function, and so on. In the process of improving a specific image quality effect with a dedicated function, the user generally has to perform a large number of repeated operations. This is not only cumbersome but may still fail to fully meet the user's needs. How to provide users with a more convenient and flexible image processing scheme has therefore become an urgent problem to be solved.
Disclosure of Invention
The embodiments of the disclosure provide an image processing method and an electronic device, which can generate a corresponding image-quality-enhanced image from an image quality enhancement requirement freely input by the user, improving the user experience.
In order to achieve the above object, the embodiments of the present disclosure adopt the following technical solutions:
According to a first aspect, the disclosure provides an image processing method, comprising: receiving an image quality enhancement trigger operation on a first image, where the trigger operation includes an image quality enhancement requirement text; in response to the trigger operation, obtaining a first-round fused text-image bimodal feature according to an image quality enhancement requirement text feature, a first image feature, a second image feature, a text-guided enhancement model, and a diffusion model, where the requirement text feature is obtained by encoding the requirement text, the first image feature is obtained by encoding the first image, and the second image feature is obtained by adding noise to the first image and then extracting features from the noise-added result; performing N-1 iterations according to the first-round fused text-image bimodal feature, the requirement text feature, the first image feature, the text-guided enhancement model, and the diffusion model to obtain a first enhanced-image coding feature, where the diffusion model has the capability of denoising the second image feature and extracting features from the denoising result to generate a first-round diffusion image feature set, the text-guided enhancement model has the capability of fusing, in the j-th iteration, the requirement text feature, the first image feature, and the j-th-round diffusion image feature set to generate the j-th-round fused text-image bimodal feature, j is a positive integer with 2 ≤ j ≤ N, and N is a positive integer not less than 2; and decoding the first enhanced-image coding feature to obtain a second image, where the second image is the first image with its image quality enhanced according to the image quality enhancement requirement text.
Based on the image processing method of the first aspect, after the electronic device receives the user's image quality enhancement trigger operation, it can run multiple iterations with the diffusion model and the text-guided enhancement model to generate a second image with enhanced image quality. In each iteration, after the diffusion model generates a diffusion image feature set (i.e., the j-th-round diffusion image feature set), the text-guided enhancement model fuses that set with the image quality enhancement requirement text feature and the first image feature, so that every iteration refers to the requirement text when generating the second image. The resulting second image therefore meets the user's image quality enhancement requirement, and the process is convenient and simple: the user only needs to perform the trigger operation. The sketch below illustrates this flow.
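To make this flow concrete, here is a minimal sketch of the N-round loop in Python. All names (text_encoder, diffusion_model, text_guided_model, decoder, and so on) are illustrative stand-ins rather than components named by the disclosure, and the fusion details are hidden inside the models; it is a sketch of the claimed data flow, not a definitive implementation.

```python
import torch

def enhance_image(first_image, requirement_text, text_encoder, image_encoder,
                  feature_extractor, diffusion_model, text_guided_model,
                  decoder, num_rounds):
    # Encode the image quality enhancement requirement text and the first image.
    text_feat = text_encoder(requirement_text)    # requirement text feature
    image_feat = image_encoder(first_image)       # first image feature
    # Second image feature: add noise to the first image, then extract
    # features from the noise-added result.
    second_feat = feature_extractor(first_image + torch.randn_like(first_image))

    # Round 1: the diffusion model denoises the second image feature and
    # extracts the first-round diffusion image feature set; the text-guided
    # enhancement model fuses it with the text and image features.
    diff_feats = diffusion_model(second_feat)
    fused = text_guided_model(text_feat, image_feat, diff_feats)

    # Rounds 2..N: each round feeds the previous round's fused text-image
    # bimodal feature back into the diffusion model.
    for _ in range(num_rounds - 1):
        diff_feats = diffusion_model(fused)
        fused = text_guided_model(text_feat, image_feat, diff_feats)

    # After round N the fused feature is the first enhanced-image coding
    # feature; decoding it yields the second (enhanced) image.
    return decoder(fused)
```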
In combination with the first aspect, in another possible implementation, obtaining the first-round fused text-image bimodal feature according to the requirement text feature, the first image feature, the second image feature, the text-guided enhancement model, and the diffusion model comprises: processing the requirement text feature and the first image feature with the first processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the first processing module in the first round; denoising the second image feature with the diffusion model and extracting features from the denoising result to generate the first-round diffusion image feature set, where the set contains M features, the diffusion model contains M processing modules, each processing module of the diffusion model corresponds to one feature, and M is a positive integer; fusing the text-image bimodal feature corresponding to the first processing module in the first round with a first diffusion image feature to obtain the fused text-image bimodal feature corresponding to the first processing module in the first round, where the first diffusion image feature is the feature output by the first processing module of the diffusion model; and obtaining the first-round fused text-image bimodal feature according to the fused text-image bimodal feature corresponding to the first processing module in the first round, the features in the first-round diffusion image feature set other than the first diffusion image feature, and the text-guided enhancement model.
Based on this scheme, the text-image bimodal feature is fused with the features output by the diffusion model, so that the image quality enhancement process both considers the characteristics of the first image and incorporates the text feature of the user's image quality enhancement requirement. This fusion makes the enhancement more targeted and avoids the blindness that may arise when processing relies on image features alone, thereby markedly improving the enhancement effect.
In combination with the first aspect, in another possible implementation, performing N-1 iterations according to the first-round fused text-image bimodal feature, the requirement text feature, the first image feature, the text-guided enhancement model, and the diffusion model to obtain the first enhanced-image coding feature comprises: in the j-th iteration, determining the j-th-round diffusion image feature set according to the (j-1)-th-round fused text-image bimodal feature and the diffusion model, where j is a positive integer and 2 ≤ j ≤ N; and obtaining the j-th-round fused text-image bimodal feature according to the j-th-round diffusion image feature set, the requirement text feature, the first image feature, and the text-guided enhancement model, where, when j = N, the j-th-round fused text-image bimodal feature is the first enhanced-image coding feature.
Based on this scheme, each round of the iterative process builds further on the fused text-image bimodal feature of the previous round. As the number of iterations increases, the analysis and adjustment of image features become finer, flaws in the image are progressively removed, and the image quality is enhanced.
In combination with the first aspect, in another possible implementation, obtaining the j-th-round fused text-image bimodal feature according to the j-th-round diffusion image feature set, the requirement text feature, the first image feature, and the text-guided enhancement model comprises: processing the requirement text feature and the first image feature with the first processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the first processing module in the j-th round; fusing the feature output by the first processing module of the diffusion model in the j-th-round diffusion image feature set with the text-image bimodal feature corresponding to the first processing module in the j-th round to obtain the fused text-image bimodal feature corresponding to the first processing module in the j-th round; and obtaining the j-th-round fused text-image bimodal feature according to the fused text-image bimodal feature corresponding to the first processing module in the j-th round, the features in the j-th-round diffusion image feature set other than the feature output by the first processing module of the diffusion model, and the text-guided enhancement model.
Based on this scheme, the first processing module of the text-guided enhancement model processes the image quality enhancement requirement text feature and the first image feature, fusing information from the two modalities, text and image, at an early stage. The subsequent rounds of iteration can then fully draw on the user's image quality enhancement requirement and the characteristics of the image, so that the image quality is enhanced in a more targeted manner.
In combination with the first aspect, in another possible implementation, obtaining the j-th-round fused text-image bimodal feature according to the fused text-image bimodal feature corresponding to the first processing module in the j-th round, the features in the j-th-round diffusion image feature set other than the feature output by the first processing module of the diffusion model, and the text-guided enhancement model comprises: inputting the fused text-image bimodal feature corresponding to the (i-1)-th processing module in the j-th round into the i-th processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the i-th processing module in the j-th round, where the text-guided enhancement model contains M processing modules, i is a positive integer, and 2 ≤ i ≤ M; fusing the feature output by the i-th processing module of the diffusion model in the j-th-round diffusion image feature set with the text-image bimodal feature corresponding to the i-th processing module in the j-th round to obtain the fused text-image bimodal feature corresponding to the i-th processing module in the j-th round; and, when i = M, inputting the fused text-image bimodal feature corresponding to the M-th processing module in the j-th round into the output module of the text-guided enhancement model to obtain the j-th-round fused text-image bimodal feature. Based on this scheme, an exemplary iterative process is provided; see the sketch after this paragraph.
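A minimal sketch of one such round, read from the claim language above: module 1 consumes the text and image features, each later module consumes the previous fused feature, and every module's output is fused with the matching diffusion feature. The module internals and fusion-by-addition are assumptions made for illustration; the disclosure does not fix them.

```python
import torch.nn as nn

class TextGuidedEnhancementRound(nn.Module):
    """One round (round j) of module-wise fusion; structure assumed for illustration."""

    def __init__(self, processing_modules, output_module):
        super().__init__()
        self.blocks = nn.ModuleList(processing_modules)  # M processing modules
        self.output_module = output_module

    def forward(self, text_feat, image_feat, diffusion_feats):
        # Module 1 processes the requirement text feature and the first image
        # feature into a text-image bimodal feature ...
        bimodal = self.blocks[0](text_feat, image_feat)
        # ... which is fused (here: added) with the diffusion model's
        # module-1 output feature.
        fused = bimodal + diffusion_feats[0]
        # Modules i = 2..M each take the (i-1)-th fused feature and fuse
        # their output with the matching diffusion feature.
        for i in range(1, len(self.blocks)):
            bimodal = self.blocks[i](fused)
            fused = bimodal + diffusion_feats[i]
        # The output module maps the M-th fused feature to this round's
        # fused text-image bimodal feature.
        return self.output_module(fused)
```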
With reference to the first aspect, in another possible implementation, the image quality enhancement trigger operation includes image quality enhancement requirement audio, and before responding to the trigger operation, the method further includes: when the trigger operation includes the requirement audio, recognizing the requirement audio to obtain the image quality enhancement requirement text.
Based on this scheme, the variety of image quality enhancement trigger operations is increased: the user can express the image quality enhancement requirement not only by text input but also by voice input, which improves the convenience and efficiency of operation. A minimal dispatch sketch follows.
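A minimal sketch, assuming the trigger is carried as a dict with optional "audio" and "text" entries; asr_model stands in for any speech recognition component and is not an interface defined by the disclosure.

```python
def requirement_text_from_trigger(trigger, asr_model):
    # If the trigger operation carries requirement audio, recognize it first
    # to obtain the requirement text; otherwise use the text directly.
    # (`trigger` layout and `asr_model.transcribe` are illustrative.)
    if trigger.get("audio") is not None:
        return asr_model.transcribe(trigger["audio"])
    return trigger["text"]
```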
With reference to the first aspect, in another possible implementation, the diffusion model is a pre-trained latent (implicit) diffusion model, and the text-guided enhancement model is a U-Net model. This provides examples of the diffusion model and the text-guided enhancement model.
With reference to the first aspect, in another possible implementation, the image quality enhancement requirement text includes, for example, reducing noise, reducing the degree of blur, and increasing resolution. This provides an example of the image quality enhancement requirement text.
In a second aspect, an embodiment of the present disclosure provides an image processing apparatus, which may be applied to an electronic device, for implementing the method in the first aspect. The functions of the image processing apparatus may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, for example, a receiving module, a processing module, and the like.
A receiving module configured to receive an image quality enhancement trigger operation on a first image; the image quality enhancement triggering operation comprises an image quality enhancement requirement text;
the processing module is configured to: in response to the image quality enhancement trigger operation, obtain a first-round fused text-image bimodal feature according to the image quality enhancement requirement text feature, the first image feature, the second image feature, the text-guided enhancement model, and the diffusion model, where the requirement text feature is obtained by encoding the requirement text, the first image feature is obtained by encoding the first image, and the second image feature is obtained by adding noise to the first image and then extracting features from the noise-added result; perform N-1 iterations according to the first-round fused text-image bimodal feature, the requirement text feature, the first image feature, the text-guided enhancement model, and the diffusion model to obtain a first enhanced-image coding feature, where the diffusion model has the capability of denoising the second image feature and extracting features from the denoising result to generate a first-round diffusion image feature set, and the text-guided enhancement model has the capability of fusing, in the j-th iteration, the requirement text feature, the first image feature, and the j-th-round diffusion image feature set to generate the j-th-round fused text-image bimodal feature, j being a positive integer with 2 ≤ j ≤ N; and decode the first enhanced-image coding feature to obtain a second image, the second image being the first image with its image quality enhanced according to the requirement text.
In combination with the second aspect, in one possible implementation, the processing module is further configured to: process the requirement text feature and the first image feature with the first processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the first processing module in the first round; denoise the second image feature with the diffusion model and extract features from the denoising result to generate the first-round diffusion image feature set, where the set contains M features, the diffusion model contains M processing modules, each processing module of the diffusion model corresponds to one feature, and M is a positive integer; fuse the text-image bimodal feature corresponding to the first processing module in the first round with a first diffusion image feature to obtain the fused text-image bimodal feature corresponding to the first processing module in the first round, where the first diffusion image feature is the feature output by the first processing module of the diffusion model; and obtain the first-round fused text-image bimodal feature according to the fused text-image bimodal feature corresponding to the first processing module in the first round, the features in the first-round diffusion image feature set other than the first diffusion image feature, and the text-guided enhancement model.
With reference to the second aspect, in a possible implementation, the processing module is further configured to: in the j-th iteration, determine the j-th-round diffusion image feature set according to the (j-1)-th-round fused text-image bimodal feature and the diffusion model, where j is a positive integer and 2 ≤ j ≤ N; and obtain the j-th-round fused text-image bimodal feature according to the j-th-round diffusion image feature set, the requirement text feature, the first image feature, and the text-guided enhancement model, where, when j = N, the j-th-round fused text-image bimodal feature is the first enhanced-image coding feature.
With reference to the second aspect, in one possible implementation, the processing module is further configured to: process the requirement text feature and the first image feature with the first processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the first processing module in the j-th round; fuse the feature output by the first processing module of the diffusion model in the j-th-round diffusion image feature set with the text-image bimodal feature corresponding to the first processing module in the j-th round to obtain the fused text-image bimodal feature corresponding to the first processing module in the j-th round; and obtain the j-th-round fused text-image bimodal feature according to the fused text-image bimodal feature corresponding to the first processing module in the j-th round, the features in the j-th-round diffusion image feature set other than the feature output by the first processing module of the diffusion model, and the text-guided enhancement model.
With reference to the second aspect, in one possible implementation, the processing module is further configured to: input the fused text-image bimodal feature corresponding to the (i-1)-th processing module in the j-th round into the i-th processing module of the text-guided enhancement model to obtain the text-image bimodal feature corresponding to the i-th processing module in the j-th round, where the text-guided enhancement model contains M processing modules, i is a positive integer, and 2 ≤ i ≤ M; fuse the feature output by the i-th processing module of the diffusion model in the j-th-round diffusion image feature set with the text-image bimodal feature corresponding to the i-th processing module in the j-th round to obtain the fused text-image bimodal feature corresponding to the i-th processing module in the j-th round; and, when i = M, input the fused text-image bimodal feature corresponding to the M-th processing module in the j-th round into the output module of the text-guided enhancement model to obtain the j-th-round fused text-image bimodal feature.
With reference to the second aspect, in one possible implementation manner, the image quality enhancement triggering operation includes image quality enhancement required audio, and the processing module is further configured to identify the image quality enhancement required audio to obtain the image quality enhancement required text when the image quality enhancement triggering operation includes the image quality enhancement required audio.
With reference to the second aspect, in one possible implementation, the diffusion model is a pre-trained latent (implicit) diffusion model, and the text-guided enhancement model is a U-Net model.
With reference to the second aspect, in one possible implementation, the image quality enhancement requirement text includes, for example, reducing noise, reducing the degree of blur, and increasing resolution.
In a third aspect, the present disclosure provides an electronic device comprising a memory, a display screen, and one or more processors, the memory, the display screen being coupled to the processors. Wherein the memory is for storing computer program code comprising computer instructions, the processor being for executing one or more computer instructions stored by the memory to cause the electronic device to perform the image processing method as in any one of the first aspects above when the electronic device is operating.
In a fourth aspect, the present disclosure provides a computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the image processing method of any one of the first aspects.
In a fifth aspect, the present disclosure provides a computer program product for, when run on an electronic device, causing the electronic device to perform the image processing method as in any one of the first aspects.
In a sixth aspect, there is provided an apparatus (for example, a chip system) comprising a processor for supporting a first device in implementing the functionality referred to in the first aspect above. In one possible design, the apparatus further includes a memory for holding the program instructions and data necessary for the first device. When the apparatus is a chip system, it may consist of a chip alone, or may include the chip together with other discrete devices.
It should be appreciated that the advantages of the second to sixth aspects may be referred to in the description of the first aspect, and are not described herein.
Drawings
Fig. 1 is a schematic structural diagram of an image generating network according to an embodiment of the present disclosure.
Fig. 2 is a schematic hardware structure of an electronic device according to an embodiment of the disclosure.
Fig. 3 is a schematic software structure of an electronic device according to an embodiment of the disclosure.
Fig. 4 is a schematic flow chart of an image processing method according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a display interface according to an embodiment of the disclosure.
Fig. 6 is a second schematic diagram of a display interface according to an embodiment of the disclosure.
Fig. 7 is a second flowchart of an image processing method according to an embodiment of the disclosure.
Fig. 8 is a third flowchart of an image processing method according to an embodiment of the disclosure.
Fig. 9 is a block diagram of a chip system according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described below with reference to the drawings. In the description of the present disclosure, unless otherwise indicated, "/" indicates an "or" relationship between the associated objects; for example, A/B may mean A or B. "And/or" merely describes an association relationship between associated objects: A and/or B may represent three cases, namely A alone, both A and B, and B alone, where A and B may be singular or plural.

In the description of the present disclosure, unless otherwise indicated, "a plurality" means two or more. "At least one of" the listed items or the like means any combination of those items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be singular or plural.

In addition, to clearly describe the technical solutions of the embodiments of the present disclosure, words such as "first" and "second" are used to distinguish identical or similar items with substantially the same function and effect. Those skilled in the art will appreciate that these words do not limit quantity or execution order, and do not imply a necessary difference.

Meanwhile, in the embodiments of the present disclosure, words such as "exemplary" or "such as" are used to serve as examples, illustrations, or descriptions. Any embodiment or design described as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs; rather, such words are intended to present related concepts in a concrete fashion that is readily understood.
In addition, the network architecture and the service scenario described in the embodiments of the present disclosure are for more clearly describing the technical solution of the embodiments of the present disclosure, and do not constitute a limitation on the technical solution provided by the embodiments of the present disclosure, and as a person of ordinary skill in the art can know, with evolution of the network architecture and appearance of a new service scenario, the technical solution provided by the embodiments of the present disclosure is equally applicable to similar technical problems.
With the widespread popularity of electronic devices (e.g., mobile phones) and the rapid development of the internet, users often obtain images of poor quality, whether photographing with a mobile phone or acquiring images through the internet or other media. Users therefore need ways to improve image quality.
For example, when photographing with a mobile phone, image quality may suffer from various factors. For instance, when a user holds the phone to take a picture, an unsteady hand can easily produce a blurred photo. In that case, objects in the photo may have blurred outlines and lost detail, seriously affecting the photo's quality and viewing value.
For another example, weather conditions may also affect the shooting result. When photographing in the rain, raindrops may appear in the photo, obscuring part of the scene, making the picture look cluttered, and affecting its color and contrast. In this case, the user needs to remove the raindrops and restore a clear picture.
As another example, the internet is an important platform for information dissemination, from which people often download various images. However, many images are compressed to some extent for ease of storage and transmission. Compressed images often suffer from low definition, color distortion, and blurred details. When users save these images locally and want to use them, for example for printing, display, or further editing, they need to enhance the sharpness and restore the original quality of the images. That is, across the many possible image application scenarios (i.e., the scenarios described above), users have a ubiquitous demand for image quality enhancement.
In the related art, an image generation network may be used to process an image according to its image quality enhancement requirement. Referring to fig. 1, an embodiment of the present disclosure provides a block diagram of an image generation network. The image generation network includes a Stable Diffusion model (SD model), a control network (ControlNet), a text encoder (Text Encoder), and a time encoder (Time Encoder).
The SD model is used to generate a preset image from the input data z_t. The control network adds training parameters in the diffusion stage of the SD model to assist in adjusting the weight parameters of the SD model. The text encoder (Text Encoder) encodes the demand text (Prompt c_t) to obtain a text embedding vector. The time encoder (Time Encoder) converts the time step (Time t) into a time embedding vector.
The input data of the SD model includes the demand text c_t, the time step t, and the input data z_t, and the output data is ε_θ(z_t, t, c_t, c_f), which represents the data generated from the input data z_t under the time step t, the demand text c_t, and the condition input (Condition c_f). The condition input c_f is the input data of the control network. For example, z_t may be a noisy image and the output data ε_θ may be a denoised image.
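For reference, the notation ε_θ(z_t, t, c_t, c_f) matches the published ControlNet formulation, in which the network is trained to predict the noise ε added to a latent z_0; assuming that formulation applies here, the learning objective can be written as:

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)}
  \left[ \, \lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \rVert_2^2 \, \right]
```

where z_t is the noisy latent at time step t, c_t is the demand text, and c_f is the condition input.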
The SD model includes a plurality of encoders and a plurality of decoders in one-to-one correspondence. The encoders progressively downsample the input data z_t, conditioned on the demand text c_t and the time step t, to extract features of different scales. The decoders progressively upsample the features to reconstruct the output data ε_θ.
For example, as shown in fig. 1, the SD model may be a U-shaped network. The plurality of encoders in the SD model includes SD encoder A (SD Encoder Block A), SD encoder B (SD Encoder Block B), SD encoder C (SD Encoder Block C), and SD encoder D (SD Encoder Block D). Each of these 4 encoder blocks comprises three identical sub-blocks. The output data of SD encoder A is a 64×64 matrix, the output data of SD encoder B is 32×32, the output data of SD encoder C is 16×16, the output data of SD encoder D is 8×8, and the output data of the SD model's middle module is 8×8.

The plurality of decoders includes SD decoder D (SD Decoder Block D), SD decoder C (SD Decoder Block C), SD decoder B (SD Decoder Block B), and SD decoder A (SD Decoder Block A). Each of these 4 decoder blocks comprises three identical sub-blocks. The output data of SD decoder D is an 8×8 matrix, the output data of SD decoder C is 16×16, the output data of SD decoder B is 32×32, and the output data of SD decoder A is a 64×64 matrix. That is, SD encoder A and SD decoder A, SD encoder B and SD decoder B, SD encoder C and SD decoder C, and SD encoder D and SD decoder D correspond one to one.

The SD encoder D and the SD decoder D are connected through an SD middle block (SD Middle Block). The scale of the output data of the SD middle block is the same as the scale of the output data of SD encoder D and of SD decoder D, i.e., an 8×8 matrix.
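The stage sizes above describe a standard U-shape. The toy model below makes the encoder/decoder correspondence and the 64→32→16→8→16→32→64 scale path concrete; single convolutions stand in for SD's actual residual/attention blocks, so this illustrates the shape bookkeeping only.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-shape mirroring the stage scales listed above (channels fixed)."""

    def __init__(self, ch=4):
        super().__init__()
        self.enc_a = nn.Conv2d(ch, ch, 3, padding=1)             # out: 64x64
        self.enc_b = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # out: 32x32
        self.enc_c = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # out: 16x16
        self.enc_d = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # out: 8x8
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)               # out: 8x8
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec_d = nn.Conv2d(ch, ch, 3, padding=1)             # out: 8x8
        self.dec_c = nn.Conv2d(ch, ch, 3, padding=1)             # out: 16x16
        self.dec_b = nn.Conv2d(ch, ch, 3, padding=1)             # out: 32x32
        self.dec_a = nn.Conv2d(ch, ch, 3, padding=1)             # out: 64x64

    def forward(self, z):                   # z: (B, ch, 64, 64)
        a = self.enc_a(z)                   # 64x64
        b = self.enc_b(a)                   # 32x32
        c = self.enc_c(b)                   # 16x16
        d = self.enc_d(c)                   # 8x8
        m = self.mid(d)                     # 8x8
        # Each decoder stage fuses the matching encoder output (the
        # one-to-one encoder/decoder correspondence described above).
        out = self.dec_d(m + d)             # 8x8
        out = self.dec_c(self.up(out) + c)  # 16x16
        out = self.dec_b(self.up(out) + b)  # 32x32
        out = self.dec_a(self.up(out) + a)  # 64x64
        return out

# Shape check: TinyUNet()(torch.randn(1, 4, 64, 64)) returns (1, 4, 64, 64).
```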
The input data of the control network is the condition input c_f. For example, the condition input c_f may be an edge map or a semantic segmentation map, or the like.
The control network includes a plurality of zero convolution layers (zero convolution), a plurality of encoders, and a further plurality of zero convolution layers. After the condition input c_f is fed into a zero convolution layer of the control network, the zero convolution layer convolves c_f to obtain a convolution result. The input data z_t of the SD model and the convolution result are then merged by a merging module to obtain a fused feature vector.
The control network then further processes the fused feature vector, the demand text c_t, and the time step t through the plurality of encoders to output feature vectors of different scales. The feature vectors of different scales are then fed into the plurality of decoders of the SD model through the zero convolution layer corresponding to each encoder, to assist in adjusting the weight parameters of the SD model. The feature vectors of different scales may also be referred to as training copies.
For example, referring to fig. 1, the plurality of encoders in the control network includes SD encoder A (SD Encoder Block A), SD encoder B (SD Encoder Block B), SD encoder C (SD Encoder Block C), SD encoder D (SD Encoder Block D), and an SD middle block (SD Middle Block). These correspond one-to-one to SD encoder A, SD encoder B, SD encoder C, SD encoder D, and the SD middle module of the U-shaped network. The scales of the output data of SD encoders A-D in the control network are consistent with those of SD encoders A-D in the U-shaped network.
After each encoder outputs its data (i.e., a feature vector of a given scale), the output can be processed by its corresponding zero convolution layer, and the processed result is fed as training parameters into the corresponding decoder of the SD model, so that the SD model obtains the final output data ε_θ.
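The zero convolution has a well-known concrete form in the published ControlNet design: a 1×1 convolution whose weight and bias are initialized to zero, so that at the start of training the control branch contributes nothing and the pretrained SD weights are undisturbed. A minimal sketch, assuming the zero convolution here follows that design:

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution initialized to all zeros: its output is zero before
    # training, so injecting it into an SD decoder input leaves the
    # pretrained behavior intact until the control branch learns.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Injection point (names illustrative):
# decoder_input = sd_skip_feature + zero_conv(channels)(control_feature)
```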
Combining the structures of the SD model and the control network, it can be seen that the SD model generates a preset image by encoding and decoding the latent variable z_t step by step, integrating the demand text c_t and the time step t at each encoding and decoding step. Through the zero convolutions, the control network injects control conditions into the SD model's generation process, so that a more targeted preset image is generated. That is, under the demand text and condition input given by the user, the above image generation network takes image generation, rather than image quality enhancement, as its core processing task. Consequently, it does not achieve a satisfactory processing effect for image quality enhancement demands.
In other examples, when a user has a need for image quality enhancement, the relevant image processing method often prompts the user to perform a large number of repeated operations (e.g., adjusting noise parameters by sliding, adjusting deblurring parameters by sliding, increasing resolution by sliding, etc.) under a specific function (e.g., noise reduction function, blur removal function, resolution increase function, etc.) to achieve image quality enhancement of the image. This approach is not only cumbersome, but may not fully meet the image quality enhancement needs of the user. Therefore, how to construct a more convenient and flexible image processing scheme, so as to actually meet the image quality enhancement requirement of the user, has become an important problem to be solved currently.
Therefore, the embodiment of the disclosure provides an image processing method that, after receiving the user's image quality enhancement trigger operation, can run multiple iterations with a diffusion model and a text-guided enhancement model to generate a second image with enhanced image quality. In each iteration, after the diffusion model generates a diffusion image feature set (i.e., the j-th-round diffusion image feature set), the text-guided enhancement model fuses that set with the image quality enhancement requirement text feature and the first image feature, so that every iteration refers to the requirement text when generating the second image. The finally obtained second image therefore meets the user's image quality enhancement requirement, and the implementation is convenient and simple: the user only needs to perform the trigger operation.
The technical scheme provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
The technical scheme provided by the application can be applied to electronic devices with an image display function. In some embodiments, the electronic device may be a mobile phone, a tablet computer, a handheld computer, a personal computer (PC), an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device; the embodiments of the present application do not particularly limit the specific type of the electronic device.
For example, taking an electronic device as a mobile phone as an example, fig. 2 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Referring to fig. 2, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a display 193, a subscriber identity module (subscriber identification module, SIM) card interface 194, a camera 195, and the like. The sensor module 180 may include, among other things, a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
The processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and command center of the electronic device. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it may be called directly from memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.
The charge management module 140 is configured to receive a charge input from a power supply device (e.g., a charger, notebook power, etc.). The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device.
The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142. The battery 142 may specifically be a plurality of batteries connected in series. The power management module 141 is used for connecting the battery 142, the charge management module 140 and the processor 110.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the display 193, the camera 195, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor parameters such as battery voltage, current, battery cycle number, battery state of health (leakage, impedance), etc. In other embodiments, the power management module 141 may also be provided in the processor 110.
The external memory interface 120 may be used to connect external non-volatile memory to enable expansion of the memory capabilities of the electronic device. The external nonvolatile memory communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and video are stored in an external nonvolatile memory.
The internal memory 121 may include one or more random access memories (random access memory, RAM) and one or more non-volatile memories (NVM). The random access memory may be read directly from and written to by the processor 110, may be used to store executable programs (e.g., machine instructions) for an operating system or other on-the-fly programs, may also be used to store data for users and applications, and the like. The nonvolatile memory may store executable programs, store data of users and applications, and the like, and may be loaded into the random access memory in advance for the processor 110 to directly read and write. In an embodiment of the present application, the internal memory 121 may have a diffusion model stored therein. The internal memory 121 may also store a correlation model capable of converting an image into a noise image and a text identifier, or may also store noise images and text identifiers corresponding to a plurality of images.
Touch sensors, also known as "touch devices". The touch sensor may be disposed on the display screen 193, and the touch sensor and the display screen 193 form a touch screen, which is also called a "touch screen". The touch sensor is used to monitor touch operations acting on or near it. The touch sensor may communicate the monitored touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 193. In other embodiments, the touch sensor may also be disposed on a surface of the electronic device other than where the display 193 is located.
The pressure sensor is used for sensing a pressure signal and can convert the pressure signal into an electric signal. In some embodiments, the pressure sensor may be provided on the display 193. Pressure sensors are of many kinds, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, etc. When a touch operation is applied to the display screen 193, the electronic apparatus monitors the touch operation intensity according to the pressure sensor. The electronic device may also calculate the location of the touch based on the monitoring signal of the pressure sensor. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example, when a touch operation with a touch operation intensity smaller than a first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. And executing an instruction for newly creating the short message when the touch operation with the touch operation intensity being greater than or equal to the first pressure threshold acts on the short message application icon.
In some embodiments, the electronic device may include 1 or N cameras 195, N being a positive integer greater than 1. In an embodiment of the present application, the type of camera 195 may be differentiated according to hardware configuration and physical location. For example, a camera disposed on the side of the display 193 of the electronic device may be referred to as a front camera, a camera disposed on the side of the rear cover of the electronic device may be referred to as a rear camera, and for example, a camera having a short focal length and a large viewing angle may be referred to as a wide-angle camera, and a camera having a long focal length and a small viewing angle may be referred to as a general camera. The focal length and the visual angle are relative concepts, and are not limited by specific parameters, so that the wide-angle camera and the common camera are also relative concepts, and can be distinguished according to physical parameters such as the focal length, the visual angle and the like.
The electronic device implements display functions through the GPU, the display screen 193, the application processor, and the like. The GPU is a microprocessor for image processing, connecting the display screen 193 and the application processor. The GPU performs mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The electronic device may implement the photographing function through the ISP, the camera 195, the video codec, the GPU, the display screen 193, the application processor, and the like. In the embodiment of the application, the GPU function is used in drawing each image frame, so that the finally displayed picture achieves a better display effect and performance.
The ISP is used to process the data fed back by the camera 195. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, so that the electrical signal is converted into an image visible to naked eyes. ISP can also perform algorithm optimization on noise and brightness of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in the camera 195. The camera 195 is used to capture still images or video.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device selects a frequency bin, the digital signal processor is used to fourier transform the frequency bin energy, and so on.
The display 193 is used to display images, videos, and the like. The display 193 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 193, N being a positive integer greater than 1.
In embodiments of the application, the display 193 may be used to display an interface (e.g., desktop, lock screen interface, etc.) of the electronic device and display images in the interface from images stored in the electronic device (e.g., wallpaper, photographs, etc.), or images captured by any one or more of the cameras 195.
The wireless communication function of the electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied on an electronic device. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 193. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device, including wireless local area networks (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency-modulate and amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
The SIM card interface 194 is used to connect to a SIM card. A SIM card may be inserted into, or removed from, the SIM card interface 194 to bring it into contact with, or separate it from, the electronic device. The electronic device may support one or more SIM card interfaces. The SIM card interface 194 may support a Nano-SIM card, a Micro-SIM card, etc. Multiple cards may be inserted into the same SIM card interface 194 simultaneously. The SIM card interface 194 may also be compatible with external memory cards. The electronic device interacts with the network through the SIM card to implement functions such as calls and data communication. One SIM card corresponds to one subscriber number.
It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
It will be understood, of course, that the above illustration of fig. 2 is merely exemplary of the case where the electronic device is in the form of a cellular phone. If the electronic device is a tablet computer, a handheld computer, a PC, a PDA, a wearable device (e.g., a smart watch, a smart bracelet), etc., the electronic device may include fewer structures than those shown in fig. 2, or may include more structures than those shown in fig. 2, which is not limited herein.
It will be appreciated that in general, implementation of electronic device functions requires software in addition to hardware support. The software system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example, and the software structure of the electronic equipment is illustrated by an example.
Fig. 3 is a schematic diagram of a layered architecture of a software system of an electronic device according to an embodiment of the present application. The layered architecture divides the software into several layers, each with a distinct role and division of labor. The layers communicate with each other through software interfaces (e.g., APIs).
In some examples, referring to fig. 3, in an embodiment of the present application, the software of the electronic device is divided into five layers, from top to bottom: an application layer, a framework layer (also referred to as the application framework layer), a system library and Android runtime layer, a hardware abstraction layer (HAL), and a driver layer (also referred to as the kernel layer). The system library and Android runtime may also be referred to as the local framework layer or the native layer.
The application layer may include a series of applications, among others. As shown in fig. 3, the application layer may include Applications (APP) such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, instant messaging, and short message.
In the embodiment of the application, the gallery application can enhance the picture quality of an image (such as a first image) based on the picture quality enhancement instruction of the user on the image, so that the picture quality of the image is improved.
The framework layer provides an application programming interface (application programming interface, API) and programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions or services. For example, the application framework layer may include an activity manager, a window manager, a content provider, an audio service, a view system, a telephony manager, a resource manager, a notification manager, a package manager, etc., to which embodiments of the application are not limited in any way.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. Such data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display images, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text messaging icon may include a view displaying text and a view displaying an image. In some embodiments, a rendering thread may also be included or started in the view system to complete drawing frame buffering, etc.
The telephony manager is for providing communication functions of the electronic device. For example, the telephony manager may manage the call state (including initiate, connect, hang-up, etc.) of the call application.
The resource manager provides various resources to the application program, such as localization strings, icons, images, layout files, video files, and the like.
The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages that disappear automatically after a short stay without requiring user interaction, for example to announce that a download has finished or to give a message alert. The notification manager may also present notifications in the form of a chart or scrolling text in the system status bar at the top of the screen (e.g., notifications from applications running in the background), or in the form of a dialog window on the screen. For example, text information is prompted in the status bar, a prompt tone sounds, the electronic device vibrates, or an indicator light blinks.
The package manager is used in the Android system to manage application packages. It allows applications to obtain detailed information about installed applications, their services, permissions, and the like. The package manager also manages events such as installation, uninstallation, and upgrade of applications.
In the embodiment of the present application, the framework layer may also include an image quality enhancement service that provides the same functions as the gallery application. When the gallery application is absent from the application layer, cannot be used, or cannot operate on the image whose display effect the user wants to change, the image quality enhancement service performs the same actions on the corresponding image as the gallery application would, so that the image is displayed with an effect consistent with the requested image quality enhancement.
The system library may include a plurality of functional modules, such as a surface manager, media libraries, OpenGL ES, and SGL. The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications. The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files, and may support a variety of audio and video encoding formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG. OpenGL ES is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like. SGL is the drawing engine for 2D drawing. The Android runtime includes a core library and the ART virtual machine, and is responsible for scheduling and management of the Android system. The core library comprises two parts: the functions that the Java language needs to call, and the Android core libraries. The application layer and the application framework layer run in the ART virtual machine. The ART virtual machine compiles the Java files of the application layer and the application framework layer into binary files and executes them, and is responsible for functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The HAL layer is an interface layer between the operating system kernel and the hardware circuitry, and aims to abstract the hardware. It hides the hardware interface details of a specific platform and provides a virtual hardware platform for the operating system, so that the operating system is hardware-independent and can be ported across various platforms. The HAL layer provides a standard interface that exposes device hardware functionality to the higher-level Java API framework (i.e., the framework layer). The HAL layer contains a plurality of library modules, each of which implements an interface for a particular type of hardware component, such as an Audio HAL audio module, a Bluetooth HAL module, a Camera HAL module (also referred to as the camera hardware abstraction module), and a Sensors HAL module (or ISensor service).
The kernel layer is a layer between hardware and software. The kernel layer at least includes a display driver, a camera driver, an audio driver, a sensor driver, a battery driver, and the like, which is not limited in the present application. The sensor driver may specifically include a driver for each sensor included in the electronic device, for example, an ambient light sensor driver. For example, the ambient light sensor driver may promptly send the ambient light sensor's detection data to the sensing module in response to an indication or instruction from the sensing module to obtain the detection data.
The technical scheme provided by the embodiment of the application can be realized in the electronic equipment with the hardware architecture or the software architecture.
An image processing method according to an embodiment of the present application is described below with reference to fig. 4. Fig. 4 is a flowchart of an image processing method according to an embodiment of the present application. Referring to fig. 4, taking an electronic device as an example of a mobile phone, the image processing method may include S401 to S406:
S401, a first application of the mobile phone receives a first input operation for a first image.
The first application may be any application with an image quality enhancement function, for example, the first application may be a gallery application, a camera application, a beauty application, an intelligent assistant application, and the like. The first application may also have an image viewing function.
The first image is an image for which image quality enhancement is required. The first image may be an image that the user views locally on the cell phone (e.g., an image in a gallery application) or an image that the user views on the network (e.g., an image in a browser).
The first input operation carries first input data and is used to trigger the first application to perform image quality enhancement on the first image based on the first input data. The first input operation may also be referred to as an image quality enhancement triggering operation. Exemplary image quality enhancements performed on the first image include, but are not limited to, reducing the noise of the first image, brightening the first image, increasing the resolution of the first image, removing blur in the first image, and removing artifacts in the first image.
The first input data is used to characterize the image quality enhancement requirement of the user. The first input data is first text data (which may also be referred to as the image quality enhancement requirement text) or first audio data (which may also be referred to as the image quality enhancement requirement audio). The first text data expresses the user's image quality enhancement requirement in the form of text. The first audio data expresses the user's image quality enhancement requirement in the form of audio.
The first input operation is used to trigger the first application to perform image quality enhancement on the first image based on the first text data or the first audio data. The first text data may be first universal language text data. First universal language text data refers to text data covering a broad range of language categories, such as multiple languages, language variants, language expression forms, and other language-related information.
For example, the first universal language text data may be "remove raindrops in the picture", "I want to make this picture clearer", or the like. When the first input data is first universal language text data, the user can express the image quality enhancement requirement for the first image freely. In this way, the user does not need to worry about how to phrase the requirement precisely, and can therefore use the image quality enhancement function more conveniently.
When the user needs to enhance the picture quality of the first image, the user can trigger the first input operation, so that the mobile phone responds to the first input operation and turns on the image quality enhancement function for the first image. With this function turned on, the mobile phone may perform, on the first image, the image quality enhancement operation indicated by the first input operation, so that the first image exhibits the corresponding enhancement effect; that is, the subsequent S402-S406 are performed. The first input operation may be any interactive instruction capable of turning on the image quality enhancement function for the first image, such as a text instruction or an audio instruction.
In some examples, when the first input operation includes first text data, the first application of the cell phone may receive the first input operation including the first text data.
For example, in a case where the user needs to turn on the image quality enhancing function, the user may first turn on the display interface of the first image using any feasible triggering operation. In response to the trigger operation, the mobile phone may display a display interface 501 of a first image as shown in fig. 5 (a). The display interface 501 of the first image includes an image quality enhancement function control 502. The image quality enhancement function control 502 is used to implement an image quality enhancement function for the first image. Of course, the display interface 501 of the first image may also include other controls (not shown in the figure), for example, an edit control, a delete control, a forward control, and so on, which will not be described in detail herein.
The user may perform an opening operation (e.g., a clicking operation) for the image quality enhancement function control 502. The mobile phone may turn on the image quality enhancement function for the first image in response to the user's turn-on operation of the image quality enhancement function control 502. At this time, the mobile phone may display an image quality enhancement operation interface 503 as shown in fig. 5 (b). The image quality enhancement operation interface 503 includes an instruction input area 504, a cancel control 507, and a finish control 508.
The instruction input area 504 is configured to receive the first input data input by the user in the instruction input area 504. The user's expected image quality enhancement effect for the first image can then be determined from the first input data.
The cancel control 507 is used to cancel the image quality enhancement and restore the first image to its original state. The cancel control 507 may be triggered at any time, for example after the user enters the first input data.
The finish control 508 is used to trigger the first application to perform image quality enhancement on the first image according to the first input data input by the user in the instruction input area 504. Alternatively, the finish control 508 is used to return to the display interface 501 of the first image after the first application has enhanced the first image according to the first input data. The present disclosure does not limit the function of the finish control; the actual implementation prevails.
In some examples, the instruction input area 504 may include at least one input control, so that the user can input the first input data through the input control. For example, the at least one input control may include a text input control. As another example, the at least one input control may include a voice input control. As yet another example, as shown in fig. 5 (b), the instruction input area 504 includes a text input control 505 and a voice input control 506, where the text input control 505 is shown in the form of a text input box. The present disclosure does not limit the specific form of the input controls in the instruction input area 504; any feasible form may be used.
The text input control 505 is used to receive the first text data entered by the user, for example, "remove raindrops in the picture" or "I want to make this picture clearer". The voice input control 506 is used to receive the first audio data entered by the user, for example, audio containing "remove raindrops in the picture" or "I want to make this picture clearer".
In other embodiments, in the display interface of the first image, the first input operation may also be a voice command input by the user to the mobile phone. The voice command is used for instructing the mobile phone to enhance the image quality of the first image so as to enable the display effect of the first image to be better.
Specifically, in the display interface of the first image, the mobile phone may first receive, with its own microphone, a wake-up instruction of the smart assistant spoken by the user, for example, "hello, YOYO". The handset may then display a smart assistant interface 601 as shown in fig. 6 (a) in response to the smart assistant wakeup instruction.
Then, referring to fig. 6 (b), the smart assistant of the mobile phone invokes the microphone of the mobile phone to acquire and recognize the next voice command of the user, for example, "remove raindrops in the figure", and displays the smart assistant interface 602 as shown in fig. 6 (b).
Finally, referring to the intelligent assistant interface 603 shown in fig. 6 (c), the mobile phone may remove the raindrops in the first image and display the second image in response to the voice command of the user "remove the raindrops in the figure". The second image is an image after removing raindrops in the first image.
Of course, in practice, the wake-up instruction of the voice assistant may be determined by the design of different mobile phones, and the voice instruction input or uttered by the user may be any voice instruction that can indicate that the image quality enhancement function should be turned on. Any other feasible first input operation for enhancing the image quality of the first image is also possible in practice; the above is merely an example and not a specific limitation of the embodiment of the present application.
S402, in response to the first input operation, the first input data is recognized to obtain a first text feature E (also referred to as the image quality enhancement requirement text feature).
The first text feature E is used for characterizing an image quality enhancement text encoding vector corresponding to the first input data.
In some examples, the first text feature E is obtained by encoding first text data. Illustratively, in the case where the first input data is first text data, the first text data is encoded with a text encoder in response to the first input operation, resulting in the first text feature E.
The text encoder (which may also be referred to as a pre-trained text encoder) may be an open-source neural network model. For example, the text encoder may be based on any of bidirectional encoder representations from transformers (BERT), GPT-3, and the like.
In some examples, in the embodiments of the present application, the text encoder used may be trained in any feasible training manner, so long as the finally trained text encoder may generate, based on the first text data, the first text feature E that matches the first text data.
In some examples, as shown in fig. 7, in response to the first input operation, encoding the first text data with the text encoder to obtain the first text feature E may include: first, segmenting the first text data into at least one word; then converting the at least one word into the corresponding word vectors; and then encoding the word vectors to obtain the first text feature E.
The first text feature E is a high-dimensional representation of the first text data. The first text feature E may be stored as an embedding and is a fixed-length vector, for example, 768-dimensional.
By way of example, the process of converting the at least one word into the corresponding word vector may include inputting the at least one word into the word embedding layer of the text encoder (e.g., Word2Vec, GloVe (Global Vectors for Word Representation), or the built-in embedding layer of BERT), which outputs the corresponding word vector.
The process of encoding the word vectors corresponding to the at least one word to obtain the first text feature E may include inputting the word vectors into a neural network in the text encoder (e.g., a multi-layer self-attention Transformer encoder), which outputs the first text feature E.
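As a concrete illustration of this tokenize-embed-encode pipeline, the following minimal sketch uses the open-source transformers library with a BERT-style encoder; the checkpoint name and the mean-pooling step are illustrative assumptions, not a configuration fixed by this disclosure.

```python
# Minimal sketch of S402 (text path): tokenize -> embed -> encode the image
# quality enhancement requirement text into a fixed-length first text feature E.
# The checkpoint name and the mean-pooling step are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
text_encoder.eval()

def encode_text(first_text_data: str) -> torch.Tensor:
    # word segmentation: the tokenizer splits the request into tokens
    inputs = tokenizer(first_text_data, return_tensors="pt")
    with torch.no_grad():
        # the embedding layer maps tokens to word vectors, and the
        # self-attention stack encodes them contextually
        hidden = text_encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    # pool to a fixed-length 768-dimensional vector, as in the example above
    return hidden.mean(dim=1).squeeze(0)

E = encode_text("remove raindrops in the picture")
print(E.shape)  # torch.Size([768])
```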
In some examples, where the first input data is first audio data, the first audio data is recognized using a speech recognition model in response to the first input operation to obtain first text data. The first text data is then encoded with the text encoder to obtain the first text feature E.
Illustratively, the speech recognition model may be any of DeepSpeech, Wav2Vec 2.0, or the like. In the embodiment of the present application, the speech recognition model may be trained in any feasible manner, as long as the finally trained speech recognition model can generate, based on the first audio data, first text data matching the first audio data.
The process of recognizing the first audio data with the speech recognition model to obtain the first text data may include: first, the speech recognition model divides the first audio data into at least one frame segment; then it extracts features from the at least one frame segment to obtain the corresponding acoustic features; and finally it converts the acoustic features corresponding to the at least one frame segment into the first text data.
Illustratively, the acoustic features corresponding to the at least one frame segment may include mel-frequency cepstral coefficients (MFCC), a spectrogram, and the like.
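For illustration only, a minimal sketch of computing MFCC acoustic features with torchaudio follows; the sample rate and coefficient count are assumptions, and a real speech recognition model may compute its acoustic features internally.

```python
# Illustrative computation of MFCC acoustic features for audio frames with
# torchaudio; the sample rate and coefficient count are assumptions.
import torch
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
waveform = torch.randn(1, 16000)        # stand-in for one second of mono audio
acoustic_features = mfcc_transform(waveform)
print(acoustic_features.shape)          # (1, 13, num_frames)
```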
In some examples, the process by which the speech recognition model converts the acoustic features corresponding to the at least one frame segment into text may include converting the acoustic features into the first text data via a neural network (e.g., a recurrent neural network (RNN), a convolutional neural network (CNN), a Transformer, etc.).
After the first text data is obtained from the first audio data, the first text data is encoded with the text encoder to obtain the first text feature E, in the same way as described above for the case where the first input data is text.
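Again for illustration, a sketch of this audio path: an open-source Wav2Vec2 CTC model transcribes the first audio data to text, which can then be fed to the text encoder sketched above. The checkpoint name and the 16 kHz mono input are assumptions.

```python
# Sketch of S402 (audio path): transcribe the first audio data with an
# open-source Wav2Vec2 CTC model, then encode the transcript as before.
# The checkpoint and the 16 kHz mono input are illustrative assumptions.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
asr_model.eval()

def audio_to_text(waveform, sample_rate: int = 16000) -> str:
    # framing and acoustic-feature extraction happen inside the model
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits
    token_ids = torch.argmax(logits, dim=-1)     # greedy CTC decoding
    return processor.batch_decode(token_ids)[0]

# first_text_data = audio_to_text(first_audio_data)   # 1-D waveform array
# E = encode_text(first_text_data)                    # as in the sketch above
```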
S403, inputting the first image into a first image encoder to obtain a first image feature q 1.
In some examples, the first image feature is obtained by encoding the first image. For example, after determining that the first input operation is directed at the first image, the first image may be input into the first image encoder to obtain the first image feature q 1 of the first image.
The first image encoder may also be referred to as a pre-trained first image encoder. The first image encoder is configured to generate a first image feature q 1 of the first image.
The first image encoder may be any one of a CNN, a Transformer, a residual network (ResNet), a visual geometry group network (VGG), an Inception network, etc. In the embodiment of the present application, the first image encoder may be trained in any feasible manner, as long as the finally trained first image encoder can generate, based on the first image, a first image feature q 1 matching the first image.
In some examples, as shown in fig. 7, inputting the first image into the first image encoder to obtain the first image feature q 1 may include: inputting the first image into the first image encoder, extracting features of the first image through a series of convolution, pooling, and fully connected layers, and finally outputting a fixed-length feature vector (i.e., the first image feature q 1).
The first image feature q 1 is illustratively a fixed-length vector, e.g., 2048- or 768-dimensional. The first image feature q 1 is used in the subsequent image quality enhancement task.
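For illustration, a sketch of such a CNN feature extractor using a pretrained ResNet-50 whose classifier head is removed, yielding the 2048-dimensional pooled vector mentioned above; the backbone choice is an assumption rather than the encoder fixed by this disclosure.

```python
# Sketch of S403: extract the first image feature q 1 with a pretrained CNN.
# ResNet-50 with its classifier removed yields the 2048-dimensional pooled
# vector mentioned above; the backbone choice is an illustrative assumption.
import torch
from PIL import Image
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep pooled features
backbone.eval()
preprocess = weights.transforms()

def encode_image(first_image: Image.Image) -> torch.Tensor:
    x = preprocess(first_image).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        return backbone(x).squeeze(0)         # (2048,)

# q1 = encode_image(Image.open("first_image.jpg"))
```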
S404, inputting the first image and a first time sequence into a second image encoder to obtain a second image feature z 1.
The second image feature z 1 is obtained by adding noise to the first image and then extracting features of the noise adding result.
In some examples, the first time sequence T includes t 0, …, t i, …, t s. Inputting the first image and the first time sequence into the second image encoder to obtain the second image feature may include: progressively adding noise to the first image from t = t 0 to t = t s until a noise image is obtained, and then performing feature extraction on the noise image to obtain the second image feature z 1.
In some examples, the second image encoder may be a diffusion model. The noise image may then be the result of the second image encoder adding any feasible type of noise (e.g., Gaussian noise) to an image multiple times. The noise type is determined during training of the second image encoder and remains consistent whenever the encoder is used afterwards, so that at inference time the encoder can exploit the information-loss pattern of the noise it has learned when processing images.
The diffusion model is a generative model that gradually adds noise to data through a forward diffusion process to produce data samples (for example, Gaussian noise images), and then gradually removes the noise through a back diffusion process to restore the original data from the Gaussian noise image.
It will be appreciated that S404 is similar to the forward diffusion process of a diffusion model, i.e., noise is gradually added to the image to be processed (i.e., the first image) over a series of fixed time steps, resulting in a noisy image.
In the embodiment of the present application, the second image encoder may be trained by any feasible training method, so long as the finally trained second image encoder can generate the matching second image feature z 1 based on the first image and the first time sequence.
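The stepwise noising of S404 has a well-known closed form in standard DDPM-style diffusion models; the sketch below illustrates it under an assumed linear noise schedule (the schedule, the step count, and the hypothetical second_image_encoder call are not specified by this disclosure).

```python
# Sketch of S404's stepwise noising in the closed form used by DDPM-style
# diffusion models: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
# The linear schedule, step count, and second_image_encoder are assumptions.
import torch

S = 1000                                  # number of noising steps (assumed)
betas = torch.linspace(1e-4, 0.02, S)     # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Equivalent of applying t noising steps of the first time sequence."""
    eps = torch.randn_like(x0)            # Gaussian noise, as in the text above
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps

# noisy = add_noise(first_image_tensor, S - 1)
# z1 = second_image_encoder(noisy)        # hypothetical feature-extraction call
```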
S405, obtaining a first enhanced image coding feature through a preset number of iterations based on the first text feature E, the first image feature q 1, the text guiding enhancement model B, the second image feature z 1, and the diffusion model P.
The text guiding enhancement model B is used for fusing the first text characteristic E and the first image characteristic q 1.
The preset number of iterations may be N, where N is a positive integer; the value of N is not limited in the present disclosure.
The diffusion model is capable of denoising the second image feature, extracting features from the denoising result to generate the first round diffusion image feature set, and performing N-1 further iterations according to the first round fused text image bimodal feature to generate the N-th round diffusion image feature set.
The text guiding enhancement model is capable of fusing the image quality enhancement requirement text feature, the first image feature, the first round diffusion image feature set, and the N-th round diffusion image feature set to generate the first enhanced image coding feature.
The first enhanced image coding feature is used for representing the coding feature corresponding to the second image after the first image is subjected to image quality enhancement according to the first input data.
The text guiding enhancement model B may be, for example, a U-Net model (a convolutional network originally proposed for biomedical image segmentation). In the embodiment of the present application, the text guiding enhancement model B may be trained in any feasible manner, as long as the finally trained text guiding enhancement model B can generate the fused text image bimodal feature q n' based on the first text feature E and the first image feature q 1.
As shown in fig. 7, after the second image feature z 1 is obtained, the second image feature z 1 may be used as input data of the diffusion model at the time of the first iteration, and the diffusion model P may be passed through a plurality of processing modules to output a plurality of diffusion image features X n.
As can be seen in conjunction with S404, the diffusion model P includes a forward diffusion process and a reverse diffusion process. During forward diffusion, the first image is gradually converted into image features corresponding to the noise image, namely the second image features z 1. Subsequently, the back-diffusion process is started, and the second image feature z 1 is used as input data, and the denoising operation is gradually performed in the plurality of processing modules of the diffusion model P, so as to generate image features (i.e., a plurality of diffusion image features X n) corresponding to the denoised image.
Illustratively, the diffusion model P may be a pre-trained implicit diffusion model, and the weights of the pre-trained implicit diffusion model are locked. This means that the parameters of the pre-trained implicit diffusion model are not updated during a specific task or further training process. The weight parameters of the pre-trained implicit diffusion model have been determined in the pre-training phase, and the fixed weights are directly utilized in later use to perform various operations.
For example, the pre-trained implicit diffusion model P may be pre-trained on large scale unsupervised data, learning generic features and patterns in the training data. In the embodiment of the present application, the adopted diffusion model P can obtain training results through any feasible training means, provided that the diffusion model P after training is completed can successfully generate a plurality of diffusion image features X n according to the second image features z 1.
In view of the foregoing, the diffusion model P is a generation model, so the present application can integrate the image enhancement requirement of the user into the image generation process, thereby obtaining an image that meets the image enhancement requirement of the user.
In some examples, in the process of generating an image by the diffusion model P, the text guide enhancement model B may be introduced to obtain an image meeting the image quality enhancement requirement of the user. The text guiding enhancement model B is used for further processing the first text feature E and the first image feature q 1 and outputting a text image bimodal feature. Since the first text feature E is a code for the image quality enhancement requirement expressed by the user through the first input data, it corresponds to inputting the image quality enhancement requirement of the user into the diffusion model P through the text-guided enhancement model B. In this way, the diffusion model P can be made to generate an image that meets the image quality enhancement requirements of the user.
Illustratively, the diffusion model P and the text-guided enhancement model B may include a plurality of processing modules. Based on the above, after the text image bimodal feature is output by the text guidance enhancement model B, the output result of each processing module of the diffusion model P (for example, the diffusion image feature after removing noise) and the output result of each processing module of the text image bimodal feature of the text guidance enhancement model B may be fused by the hierarchical fusion module f. By the fusion mode, the image quality enhancement requirement of a user can be gradually and effectively fused into the image generation process.
It will be appreciated that the above procedure is an iterative procedure, and that multiple iterative (i.e., N) operations may be performed. In this way, the first enhanced image coding feature finally output by the diffusion model P is more and more rich, not only includes the information of the first image, but also fully reflects the image quality enhancement requirement of the user.
In some examples, the above-mentioned hierarchical fusion module f may be a separate module in the electronic device, or may be a module in the text guiding enhancement model B; the present disclosure does not limit where the hierarchical fusion module f is located.
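Since the disclosure leaves the internal form of the hierarchical fusion module f open, the following is a hypothetical sketch in which two same-scale feature maps are concatenated and projected back with a 1×1 convolution; any fusion satisfying the scale-consistency requirement could be substituted.

```python
# Hypothetical form of the hierarchical fusion module f: concatenate two
# same-scale feature maps and project back with a 1x1 convolution. The
# disclosure only requires scale consistency, so this form is an assumption.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, c: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # c: text image bimodal feature C, x: diffusion image feature X
        return self.proj(torch.cat([c, x], dim=1))   # fused feature C'

f = HierarchicalFusion(channels=64)
fused = f(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```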
Based on this, as shown in fig. 8, in the embodiment of the disclosure, obtaining the first enhanced image coding feature through a preset number of iterations based on the first text feature E, the first image feature q 1, the text guiding enhancement model B, the second image feature z 1, and the diffusion model may specifically include S1-S3.
S1, inputting the second image feature z 1 into a diffusion model P to obtain a first round of diffusion image feature set a 1.
The diffusion model includes M processing modules. The first round diffusion image feature set a 1 consists of the M diffusion image features output by the M processing modules of the diffusion model during the first iteration. Each processing module may include at least one network layer, and the output data of all network layers within the same processing module have a consistent scale.
Illustratively, the first round diffusion image feature set a 1 = {X 11, …, X 1i, …, X 1M}, where X 11 is the feature output by the first processing module of the diffusion model in the first round of iteration, X 1i is the feature output by the i-th processing module, and X 1M is the feature output by the M-th processing module, with i a positive integer.
For ease of understanding, the following description takes M = 3 processing modules as an example.
For example, the diffusion model includes 3 processing modules, and the first round diffusion image feature set a 1 = {X 11, X 12, X 13}. X 11 is the feature output by the first processing module of the diffusion model during the first round of iteration, X 12 is the feature output by the second processing module, and X 13 is the feature output by the third processing module.
The process of inputting the second image feature z 1 into the diffusion model P to obtain the first round diffusion image feature set may include: during the first iteration, inputting the second image feature z 1 into the first processing module of the diffusion model P to obtain X 11 in the first round diffusion image feature set a 1; inputting X 11 into the second processing module of the diffusion model P to obtain X 12; and inputting X 12 into the third processing module of the diffusion model P to obtain X 13. The first, second, and third processing modules of the diffusion model P are used to denoise the second image feature and to extract features from the denoising result, thereby generating the first round diffusion image feature set.
S2, obtaining a first round fused text image bimodal feature q 1' according to the first round diffusion image feature set a 1, the first text feature E, the first image feature q 1, and the text guiding enhancement model B.
The text guiding enhancement model B comprises M processing modules. The output data of each processing module in the diffusion model P is consistent with the scale of the output data of each processing module in the text-guided enhancement model B. Each processing module in the text-guided enhancement model B may include at least one network layer, and the output data of each network layer in the same processing module is uniform in scale.
In some examples, the process of obtaining the first round fused text image bimodal feature q 1' from the first round diffusion image feature set a 1, the first text feature E, the first image feature q 1, and the text guiding enhancement model B may include: first, inputting the first text feature E and the first image feature q 1 into the first processing module of the text guiding enhancement model B to obtain the text image bimodal feature C 11 corresponding to the first processing module in the first round; and then obtaining the first round fused text image bimodal feature q 1' from C 11, the first round diffusion image feature set a 1, and the text guiding enhancement model B.
The process of obtaining the first round fused text image bimodal feature q 1' from C 11, the first round diffusion image feature set a 1, and the text guiding enhancement model B may include the following, for i = 2, 3, …, M: first, C 1(i-1) is fused with X 1(i-1) in the first round diffusion image feature set a 1 to obtain the fused text image bimodal feature C 1(i-1)' corresponding to the (i-1)-th processing module in the first round; then C 1(i-1)' is input into the i-th processing module of the text guiding enhancement model B to obtain the text image bimodal feature C 1i corresponding to the i-th processing module in the first round; and finally C 1i is fused with X 1i in a 1 to obtain the fused text image bimodal feature C 1i' corresponding to the i-th processing module in the first round. When i = M, the fused text image bimodal feature C 1M' corresponding to the M-th processing module in the first round is input into the output module of the text guiding enhancement model B to obtain the first round fused text image bimodal feature q 1'.
For example, when M = 3, the first text feature E and the first image feature q 1 are input into the first processing module of the text guiding enhancement model B, which outputs the text image bimodal feature C 11 corresponding to the first processing module in the first round. C 11 is then fused with X 11 in the first round diffusion image feature set a 1 to obtain the fused text image bimodal feature C 11' corresponding to the first processing module in the first round. C 11' is input into the second processing module of the text guiding enhancement model B, which outputs the text image bimodal feature C 12 corresponding to the second processing module in the first round; C 12 is fused with X 12 in a 1 to obtain the fused text image bimodal feature C 12' corresponding to the second processing module in the first round. C 12' is input into the third processing module of the text guiding enhancement model B, which outputs the text image bimodal feature C 13 corresponding to the third processing module in the first round; C 13 is fused with X 13 in a 1 to obtain the fused text image bimodal feature C 13' corresponding to the third processing module in the first round. Finally, C 13' is input into the output module of the text guiding enhancement model B to obtain the first round fused text image bimodal feature q 1'.
Illustratively, the text image bimodal feature C satisfies the following expression:
C = B(q, E)
For example, C may be C 11, B is a text-guided enhancement model B, q may be the first image feature q 1, and E is the first text feature E.
By fusing the text image bimodal features with the features output by the diffusion model, the image quality enhancement process both takes the features of the first image into account and incorporates the text features of the user's image quality enhancement requirement. This fusion makes the enhancement more targeted and avoids the blindness of processing the image from image features alone, thereby significantly improving the image quality enhancement effect.
S3, obtaining a first enhanced image coding feature according to the bimodal feature q 1' of the first round of fused text image, the first text feature E, the first image feature q 1, the text guiding enhancement model B and the diffusion model P.
That is, N-1 further iterations are performed according to the first round fused text image bimodal feature q 1', the first text feature E, the first image feature q 1, the text guiding enhancement model B, and the diffusion model P to obtain the first enhanced image coding feature.
In some examples, starting from the first round fused text image bimodal feature q 1', the first text feature E, the first image feature q 1, the text guiding enhancement model B, and the diffusion model P, the process of obtaining the first enhanced image coding feature may include: during the j-th round of iteration, determining the j-th round diffusion image feature set a j according to the (j-1)-th round fused text image bimodal feature q j-1' and the diffusion model; and obtaining the j-th round fused text image bimodal feature q j' according to the j-th round diffusion image feature set, the first text feature E, the first image feature q 1, and the text guiding enhancement model B, where j is a positive integer and 2 ≤ j ≤ N. When j = N, the j-th round fused text image bimodal feature q j' is the first enhanced image coding feature.
In the iterative process, each round builds on the fused text image bimodal feature of the previous round. As the number of iterations increases, the analysis and adjustment of the image features become finer, flaws in the image are progressively removed, and the quality of the image is enhanced.
In some examples, determining the j-th round diffusion image feature set a j according to the (j-1)-th round fused text image bimodal feature q j-1' and the diffusion model during the j-th round of iteration may include inputting q j-1' into the diffusion model, which outputs the j-th round diffusion image feature set a j. Illustratively, a j = {X j1, …, X ji, …, X jM}, where X j1 is the feature output by the first processing module of the diffusion model during the j-th round of iteration, X ji is the feature output by the i-th processing module, and X jM is the feature output by the M-th processing module, with i a positive integer.
For example, taking the number of iterations N = 2 and the number of processing modules M = 3 as an example, after the first round fused text image bimodal feature q 1' is obtained, the first iteration is considered complete and the second iteration (j = 2) proceeds. During the second round, q 1' is input into the diffusion model, which outputs the second round diffusion image feature set a 2 = {X 21, X 22, X 23}, where X 21, X 22, and X 23 are the features output by the first, second, and third processing modules of the diffusion model, respectively, during the second round of iteration.
In some examples, in conjunction with S404, during the j-th round of iteration, the process by which the diffusion model outputs the j-th round diffusion image feature set a j is a back diffusion process. During back diffusion, the input data of the diffusion model includes, in addition to the (j-1)-th round fused text image bimodal feature q j-1' (i.e., the input data z j), a requirement text and a second time sequence, so that the diffusion model gradually removes the noise of the noise image based on the second time sequence and generates the image features corresponding to the second image.
For example, in the embodiment of the present application, the requirement text may be set to empty during the back diffusion process. For example, in the j-th round of iteration, the feature X ji output by the i-th processing module of the diffusion model satisfies the following expression:
X ji = P(z j, t, null)
where P is the diffusion model P, z j is the input data of the diffusion model during the j-th round of iteration, t is the second time sequence, and null indicates that the content of the requirement text is empty.
The process of obtaining the j-th round fused text image bimodal feature q j' according to the j-th round diffusion image feature set, the first text feature E, the first image feature q 1, and the text guiding enhancement model B may include: inputting the first text feature E and the first image feature q 1 into the first processing module of the text guiding enhancement model B to obtain the text image bimodal feature C j1 corresponding to the first processing module in the j-th round; then fusing X j1 in the j-th round diffusion image feature set a j with C j1 to obtain the fused text image bimodal feature C j1' corresponding to the first processing module in the j-th round; and finally obtaining the j-th round fused text image bimodal feature q j' according to C j1', the remaining diffusion image features X ji in a j, and the text guiding enhancement model B.
For example, taking the number of iterations N = 2 and the number of processing modules M = 3 as an example: the first text feature E and the first image feature q 1 are input into the first processing module of the text guiding enhancement model B to obtain the text image bimodal feature C 21 corresponding to the first processing module in the second round; X 21 in the second round diffusion image feature set a 2 is then fused with C 21 to obtain the fused text image bimodal feature C 21' corresponding to the first processing module in the second round; and finally the second round fused text image bimodal feature q 2' (i.e., the first enhanced image coding feature) is obtained from C 21', the diffusion image features X 2i in the second round diffusion image feature set, and the text guiding enhancement model B.
The process of obtaining the j-th round fused text image bimodal feature q j' according to the fused text image bimodal feature C j1' corresponding to the first processing module in the j-th round, the diffusion image features X ji in the j-th round diffusion image feature set, and the text guiding enhancement model B may include, for i = 2, 3, …, M: inputting the fused text image bimodal feature C j(i-1)' corresponding to the (i-1)-th processing module in the j-th round into the i-th processing module of the text guiding enhancement model B to obtain the text image bimodal feature C ji corresponding to the i-th processing module in the j-th round; and then fusing the diffusion image feature X ji in the j-th round diffusion image feature set a j with C ji to obtain the fused text image bimodal feature C ji' corresponding to the i-th processing module in the j-th round. When i = M, the fused text image bimodal feature C jM' corresponding to the M-th processing module is input into the output module of the text guiding enhancement model B to obtain the j-th round fused text image bimodal feature q j'.
For example, taking the number of iterations N = 2 and the number of processing modules M = 3 as an example: the fused text image bimodal feature C 21' corresponding to the first processing module in the second round is input into the second processing module of the text guiding enhancement model B to obtain the text image bimodal feature C 22 corresponding to the second processing module in the second round; the diffusion image feature X 22 in the second round diffusion image feature set a 2 is fused with C 22 to obtain the fused text image bimodal feature C 22' corresponding to the second processing module in the second round; C 22' is input into the third processing module of the text guiding enhancement model B to obtain the text image bimodal feature C 23 corresponding to the third processing module in the second round; C 23 is fused with X 23 in a 2 to obtain the fused text image bimodal feature C 23' corresponding to the third processing module in the second round; and C 23' is input into the output module of the text guiding enhancement model B to obtain the second round fused text image bimodal feature q 2' (i.e., the first enhanced image coding feature).
Illustratively, during the j-th round of iteration, the input data z j of the diffusion model satisfies the following expression:
z j = B(q, E, f(X (j-1)M, C (j-1)M))
where B is the text guiding enhancement model B, q may be the first image feature q 1, E is the first text feature E, f is the fusion function, X (j-1)M is the diffusion image feature output by the M-th processing module of the diffusion model in the (j-1)-th round, and C (j-1)M is the text image bimodal feature corresponding to the M-th processing module of the text guiding enhancement model B in the (j-1)-th round.
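Tying S1-S3 together, the following runnable sketch reproduces the N-round control flow with stand-in convolutional modules in place of the pretrained diffusion model P and text guiding enhancement model B; the channel counts, shapes, and the inject() step that consumes (q 1, E) are all illustrative assumptions.

```python
# Runnable sketch of the N-round loop of S405 (S1-S3) with stand-in
# convolutional modules in place of the pretrained diffusion model P and the
# text guiding enhancement model B; channel counts, shapes, and the inject()
# step that consumes (q 1, E) are illustrative assumptions.
import torch
import torch.nn as nn

M, N, CH = 3, 2, 64
p_modules = nn.ModuleList(nn.Conv2d(CH, CH, 3, padding=1) for _ in range(M))  # stand-in P
b_modules = nn.ModuleList(nn.Conv2d(CH, CH, 3, padding=1) for _ in range(M))  # stand-in B
b_out = nn.Conv2d(CH, CH, 1)      # stand-in output module of B
fuse = nn.Conv2d(2 * CH, CH, 1)   # stand-in hierarchical fusion module f

def inject(q1_map: torch.Tensor, e_vec: torch.Tensor) -> torch.Tensor:
    # stand-in for B's first module consuming (q 1, E): broadcast-add E
    return q1_map + e_vec.view(1, -1, 1, 1)

def enhance(q1_map, e_vec, z1, n_rounds=N):
    z = z1
    for _ in range(n_rounds):
        # S1: chain P's modules, collecting the round's feature set a_j
        a_j, h = [], z
        for p in p_modules:
            h = p(h)                                    # X_j1, X_j2, X_j3
            a_j.append(h)
        # S2: guide with the text/image features, fusing at every scale
        c = b_modules[0](inject(q1_map, e_vec))         # C_j1
        c = fuse(torch.cat([c, a_j[0]], dim=1))         # C_j1'
        for i in range(1, M):
            c = b_modules[i](c)                         # C_ji
            c = fuse(torch.cat([c, a_j[i]], dim=1))     # C_ji'
        z = b_out(c)            # q_j', fed back as the next round's input (S3)
    return z                    # after round N: first enhanced image coding feature

out = enhance(torch.randn(1, CH, 32, 32), torch.randn(CH), torch.randn(1, CH, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```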
In some examples, the above S1-S3 procedure is illustrated with the number of iterations N = 2 of the diffusion model and M = 3 processing modules in both the diffusion model and the text guiding enhancement model B. The scale of the output data of each processing module in the diffusion model P is consistent with the scale of the output data of the corresponding processing module in the text guiding enhancement model B.
The first iteration of the diffusion model is as follows: the second image feature z 1 is input into the first processing module of the diffusion model, which outputs X 11. X 11 is input into the second processing module of the diffusion model, which outputs X 12. X 12 is input into the third processing module of the diffusion model, which outputs X 13.
And inputting the first text feature E and the first image feature q 1 into a first processing module of the text guiding enhancement model B to obtain a text image bimodal feature C 11 corresponding to the first processing module in the first round.
And fusing the X 11 with the text image bimodal feature C 11 corresponding to the first processing module in the first round to obtain a fused text image bimodal feature C 11' corresponding to the first processing module in the first round.
And inputting the bimodal feature C 11' of the fused text image corresponding to the first processing module in the first round into the second processing module of the text guiding enhancement model B to obtain the bimodal feature C 12 of the text image corresponding to the second processing module in the first round.
And fusing X 12 with the text image bimodal feature C 12 corresponding to the second processing module in the first round to obtain a fused text image bimodal feature C 12' corresponding to the second processing module in the first round.
And inputting the bimodal feature C 12' of the fused text image corresponding to the second processing module in the first round into a third processing module of the text guiding enhancement model B to obtain the bimodal feature C 13 of the text image corresponding to the third processing module in the first round.
And fusing the X 13 with the text image bimodal feature C 13 corresponding to the third processing module in the first round to obtain a fused text image bimodal feature C 13 ' corresponding to the third processing module in the first round, and inputting the fused text image bimodal feature C 13 ' corresponding to the third processing module in the first round into an output module of the text guiding enhancement model B to obtain a first round fused text image bimodal feature q 1 '.
The second iteration of the diffusion model proceeds as follows: the first-round fused text image bimodal feature q_1' is input into the first processing module of the diffusion model, which outputs X_21; X_21 is input into the second processing module of the diffusion model, which outputs X_22; X_22 is input into the third processing module of the diffusion model, which outputs X_23.

The first text feature E and the first image feature q_1 are input into the first processing module of the text-guided enhancement model B to obtain the text image bimodal feature C_21 corresponding to the first processing module in the second round.

X_21 is fused with the text image bimodal feature C_21 corresponding to the first processing module in the second round to obtain the fused text image bimodal feature C_21' corresponding to the first processing module in the second round.

The fused text image bimodal feature C_21' corresponding to the first processing module in the second round is input into the second processing module of the text-guided enhancement model B to obtain the text image bimodal feature C_22 corresponding to the second processing module in the second round.

X_22 is fused with the text image bimodal feature C_22 corresponding to the second processing module in the second round to obtain the fused text image bimodal feature C_22' corresponding to the second processing module in the second round.

The fused text image bimodal feature C_22' corresponding to the second processing module in the second round is input into the third processing module of the text-guided enhancement model B to obtain the text image bimodal feature C_23 corresponding to the third processing module in the second round.

X_23 is fused with the text image bimodal feature C_23 corresponding to the third processing module in the second round to obtain the fused text image bimodal feature C_23' corresponding to the third processing module in the second round, and C_23' is input into the output module of the text-guided enhancement model B to obtain the second-round fused text image bimodal feature q_2' (the first enhanced image coding feature).
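Illustratively, the full two-round procedure above may be condensed into the following Python sketch; the module interfaces, the fusion function, and all names are assumptions for illustration rather than details fixed by this embodiment:

```python
def run_iterations(diff_modules, enh_modules, enh_output, fuse, z1, q1, E, n_rounds=2):
    """Sketch of the N-round interleaving of the diffusion model and the
    text-guided enhancement model B (n_rounds=2, M=3 in the example above).
    diff_modules / enh_modules: equal-length lists of callables whose
    outputs share the same scale. All interfaces are illustrative."""
    diff_in = z1                          # round 1 starts from the noised feature Z_1
    q_fused = None
    for _ in range(n_rounds):
        # Diffusion pass: modules feed each other, collecting X_j1 .. X_jM.
        x_feats, h = [], diff_in
        for module in diff_modules:
            h = module(h)
            x_feats.append(h)
        # Enhancement pass: B's first module takes (E, q_1); each later
        # module i takes the previous fused bimodal feature.
        c = enh_modules[0](E, q1)
        c_fused = fuse(x_feats[0], c)
        for i in range(1, len(enh_modules)):
            c = enh_modules[i](c_fused)
            c_fused = fuse(x_feats[i], c)
        # B's output module yields the round-j fused feature q_j', which
        # seeds the next diffusion round.
        q_fused = enh_output(c_fused)
        diff_in = q_fused
    return q_fused                        # round N: first enhanced image coding feature
```

With three diffusion modules, three enhancement modules plus an output module, and n_rounds=2, this reproduces the C_11 through q_2' sequence traced above.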
S406, inputting the first enhanced image coding feature into an image decoder to obtain a second image.
The second image is an image obtained by enhancing the image quality of the first image according to the first input data.
After the first enhanced image coding feature is obtained through a plurality of iterations, it may be input into an image decoder (which may also be referred to as a pre-trained image decoder), and the image decoder outputs the second image. The image decoder may be any of a generative adversarial network (GAN), a variational autoencoder (VAE), a U-Net, a ResNet, or the like.
In some examples, in the embodiments of the present application, the image decoder used may be trained in any feasible training manner, so long as the finally trained image decoder can generate, based on the first enhanced image coding feature, a second image matching the first enhanced image coding feature.
In some examples, inputting the first enhanced image coding feature into the image decoder (which may also be referred to as a pre-trained image decoder) and outputting the second image by the image decoder may include: first inputting the first enhanced image coding feature into the image decoder; then, the image decoder performs a series of operations (e.g., deconvolution operations, upsampling operations, transposed convolution operations, etc.) on the input feature to progressively restore it from a high-dimensional compressed state to a second image of the same size as the first image.
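Illustratively, since no specific decoder architecture is fixed here, such a decoder might be sketched in PyTorch as follows; the channel counts, depth, and activations are assumptions, chosen only to show transposed convolutions progressively restoring a compressed feature map to the input resolution:

```python
import torch
import torch.nn as nn

class SketchDecoder(nn.Module):
    """Illustrative decoder only: a small stack of transposed convolutions
    that upsamples a compressed feature map back to image resolution."""
    def __init__(self, in_channels=256, out_channels=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # map outputs to the [0, 1] pixel range
        )

    def forward(self, z):
        # z: (batch, in_channels, H/8, W/8) -> (batch, 3, H, W)
        return self.layers(z)

decoder = SketchDecoder()
image = decoder(torch.randn(1, 256, 32, 32))  # -> (1, 3, 256, 256)
```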
For example, referring to fig. 6, when the first input data for the first image shown in (a) of fig. 6 is "remove the raindrops in the image", a second image shown in (c) of fig. 6 may be obtained through S401 to S406 described above.
Based on the above scheme, after the electronic device receives the image quality enhancement triggering operation of the user, the diffusion model and the text-guided enhancement model can be used over multiple iterations to generate a second image with enhanced image quality. In each iteration, after the diffusion model generates a diffusion image feature set (i.e., the n-th round diffusion image feature set), the text-guided enhancement model fuses that feature set with the image quality enhancement requirement text feature and the first image feature, so that each iteration generates features with reference to the image quality enhancement requirement text. The finally obtained second image therefore meets the image quality enhancement requirement of the user, and the implementation is convenient and simple: the user only needs to perform the image quality enhancement triggering operation.
It should be understood that the steps in the above method embodiments provided by the present disclosure may be completed by hardware integrated logic circuits in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present disclosure may be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
In one example, the units in the above apparatus may be one or more integrated circuits configured to implement the above methods, e.g., one or more ASICs, or one or more DSPs, or one or more FPGAs, or a combination of at least two of these integrated circuit forms.
For another example, when the units in the apparatus are implemented in the form of a processing element scheduling a program, the processing element may be a general-purpose processor, such as a CPU or another processor capable of invoking the program. For another example, the units may be integrated together and implemented in the form of a system-on-chip (SoC).
In one implementation, the means for implementing each corresponding step in the above method may be implemented in the form of a processing element scheduling a program. For example, the apparatus may comprise a processing element and a storage element; the processing element invokes a program stored in the storage element to perform the method of the above method embodiments. The storage element may be a storage element on the same chip as the processing element, i.e., an on-chip storage element.
In another implementation, the program for performing the above method may be stored in a storage element on a different chip from the processing element, i.e., an off-chip storage element. In this case, the processing element invokes or loads the program from the off-chip storage element onto the on-chip storage element, so as to invoke and execute the method of the above method embodiments.
For example, an apparatus, such as an electronic device, may include a processor and a memory for storing instructions executable by the processor. The processor is configured to, when executing the above instructions, cause the electronic device to implement the data processing method of the previous embodiments. The memory may be located within the electronic device or external to the electronic device. There may be one or more processors.
In yet another implementation, the units of the apparatus implementing each step in the above method may be configured as one or more processing elements disposed on the corresponding electronic device, where the processing elements may be integrated circuits, for example one or more ASICs, one or more DSPs, one or more FPGAs, or a combination of these types of integrated circuits. These integrated circuits may be integrated together to form a chip.
For example, embodiments of the present disclosure also provide a chip system, as shown in fig. 9, including at least one processor 901 and at least one interface circuit 902. The processor 901 and the interface circuit 902 may be interconnected by wires. For example, the interface circuit 902 may be used to receive signals from other devices. For another example, the interface circuit 902 may be used to send signals to other devices (e.g., the processor 901).
For example, the interface circuit 902 may read instructions stored in a memory in the device and send the instructions to the processor 901. The instructions, when executed by the processor 901, may cause an electronic device (such as the electronic device shown in fig. 2) to perform the steps of the above-described embodiments. Of course, the system-on-chip may also include other discrete devices, as embodiments of the disclosure are not specifically limited in this regard.
Embodiments of the present disclosure also provide a computer readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by an electronic device, enable the electronic device to implement the data processing method as described above.
Embodiments of the present disclosure also provide a computer program product comprising computer instructions which, when executed in an electronic device as described above, cause the electronic device to implement the data processing method as described above.

From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated; in practical application, the functions may be allocated to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, such as a program. The software product is stored in a program product, such as a computer-readable storage medium, comprising instructions for causing a terminal device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the various embodiments of the disclosure. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
For example, embodiments of the present disclosure may also provide a computer-readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by an electronic device, cause the electronic device to implement a data processing method as in the method embodiments described above.
The foregoing is merely a specific embodiment of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions within the technical scope of the disclosure should be covered in the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (10)
1. An image processing method, applied to an electronic device, comprising:
Receiving an image quality enhancement triggering operation on a first image, wherein the image quality enhancement triggering operation comprises an image quality enhancement requirement text;
Responding to the image quality enhancement triggering operation, and obtaining a first round fused text image bimodal feature according to an image quality enhancement requirement text feature, a first image feature, a second image feature, a text-guided enhancement model and a diffusion model, wherein the image quality enhancement requirement text feature is obtained by encoding the image quality enhancement requirement text, the first image feature is obtained by encoding the first image, and the second image feature is obtained by adding noise to the first image and extracting features from the noise-adding result;

Performing N-1 iterations according to the first round fused text image bimodal feature, the image quality enhancement requirement text feature, the first image feature, the text-guided enhancement model and the diffusion model to obtain a first enhanced image coding feature, wherein the diffusion model has the capability of performing denoising processing on the second image feature and performing feature extraction on the denoising result to generate a first round of diffusion image feature set, and of generating a j-th round diffusion image feature set according to a (j-1)-th round fused text image bimodal feature;

and decoding the first enhanced image coding feature to obtain a second image, wherein the second image is an image obtained by enhancing the image quality of the first image according to the image quality enhancement requirement text.
2. The method of claim 1, wherein the obtaining the first round fused text image bimodal feature according to the image quality enhancement requirement text feature, the first image feature, the second image feature, the text-guided enhancement model, and the diffusion model comprises:

processing the image quality enhancement requirement text feature and the first image feature by using a first processing module of the text-guided enhancement model to obtain a text image bimodal feature corresponding to the first processing module in a first round;
Denoising the second image features by using the diffusion model, extracting features of a denoising result, and generating a first round of diffusion image feature set, wherein the first round of diffusion image feature set comprises M features, the diffusion model comprises M processing modules, each processing module in the diffusion model corresponds to one feature, and M is a positive integer;
Fusing the text image bimodal feature corresponding to the first processing module in the first round with the first diffusion image feature to obtain the fused text image bimodal feature corresponding to the first processing module in the first round;
And obtaining the first round fused text image bimodal feature according to the fused text image bimodal feature corresponding to the first processing module in the first round, the features in the first round of diffusion image feature set other than the first diffusion image feature, and the text-guided enhancement model.
3. The method of claim 1, wherein the performing N-1 iterations according to the first round fused text image bimodal feature, the image quality enhancement requirement text feature, the first image feature, the text-guided enhancement model, and the diffusion model to obtain a first enhanced image coding feature comprises:

In the iteration process of the j-th round, determining a j-th round diffusion image feature set according to the (j-1)-th round fused text image bimodal feature and the diffusion model, wherein j is a positive integer and 2 ≤ j ≤ N;

And obtaining a j-th round fused text image bimodal feature according to the j-th round diffusion image feature set, the image quality enhancement requirement text feature, the first image feature and the text-guided enhancement model, wherein when j=N, the j-th round fused text image bimodal feature is the first enhanced image coding feature.
4. The method of claim 3, wherein the obtaining the j-th round fused text image bimodal feature according to the j-th round diffusion image feature set, the image quality enhancement requirement text feature, the first image feature, and the text-guided enhancement model comprises:

processing the image quality enhancement requirement text feature and the first image feature by using the first processing module of the text-guided enhancement model to obtain a text image bimodal feature corresponding to the first processing module in the j-th round;

Fusing the feature output by the first processing module of the diffusion model in the j-th round diffusion image feature set with the text image bimodal feature corresponding to the first processing module in the j-th round, to obtain the fused text image bimodal feature corresponding to the first processing module in the j-th round;

and obtaining the j-th round fused text image bimodal feature according to the fused text image bimodal feature corresponding to the first processing module in the j-th round, the features in the j-th round diffusion image feature set other than the feature output by the first processing module of the diffusion model, and the text-guided enhancement model.
5. The method according to claim 4, wherein the obtaining the j-th round fused text image bimodal feature according to the fused text image bimodal feature corresponding to the first processing module in the j-th round, the features in the j-th round diffusion image feature set other than the feature output by the first processing module of the diffusion model, and the text-guided enhancement model comprises:

Inputting the fused text image bimodal feature corresponding to the (i-1)-th processing module in the j-th round into the i-th processing module of the text-guided enhancement model to obtain the text image bimodal feature corresponding to the i-th processing module in the j-th round, wherein the text-guided enhancement model comprises M processing modules, i is a positive integer, and 2 ≤ i ≤ M;

Fusing the feature output by the i-th processing module of the diffusion model in the j-th round diffusion image feature set with the text image bimodal feature corresponding to the i-th processing module in the j-th round, to obtain the fused text image bimodal feature corresponding to the i-th processing module in the j-th round; and when i=M, inputting the fused text image bimodal feature corresponding to the M-th processing module in the j-th round into the output module of the text-guided enhancement model to obtain the j-th round fused text image bimodal feature.
6. The method of any one of claims 1-5, wherein the image quality enhancement triggering operation includes image quality enhancement required audio;
Before the responding to the image quality enhancement triggering operation, the method further comprises:
and under the condition that the image quality enhancement triggering operation comprises the image quality enhancement requirement audio, identifying the image quality enhancement requirement audio to obtain the image quality enhancement requirement text.
7. The method of any one of claims 1-5, wherein the diffusion model is a pre-trained latent diffusion model and the text-guided enhancement model is a U-Net model.
8. The method of any one of claims 1-5, wherein the image quality enhancement requirement text is at least one of reducing noise, reducing blur, and improving resolution.
9. An electronic device, comprising:
a touch screen including a touch sensor and a display screen;
One or more processors;
A memory;
Wherein the memory has stored therein one or more computer programs, the one or more computer programs comprising instructions, which when executed by the electronic device, cause the electronic device to perform the image processing method of any of claims 1-8.
10. A computer-readable storage medium having stored thereon computer program instructions, characterized in that,
The computer program instructions, when executed by an electronic device, cause the electronic device to implement the image processing method of any one of claims 1 to 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411657514.4A CN119151810B (en) | 2024-11-20 | 2024-11-20 | Image processing method and electronic device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119151810A true CN119151810A (en) | 2024-12-17 |
| CN119151810B CN119151810B (en) | 2025-03-07 |
Family
ID=93809755
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411657514.4A Active CN119151810B (en) | 2024-11-20 | 2024-11-20 | Image processing method and electronic device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119151810B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070083114A1 (en) * | 2005-08-26 | 2007-04-12 | The University Of Connecticut | Systems and methods for image resolution enhancement |
| WO2021063341A1 (en) * | 2019-09-30 | 2021-04-08 | 华为技术有限公司 | Image enhancement method and apparatus |
| CN117725247A (en) * | 2024-02-07 | 2024-03-19 | 北京知呱呱科技有限公司 | A diffusion image generation method and system based on retrieval and segmentation enhancement |
| US20240193726A1 (en) * | 2022-12-12 | 2024-06-13 | Walmart Apollo, Llc | System and method for enhancing text in images based on super-resolution |
| CN118365530A (en) * | 2024-04-10 | 2024-07-19 | 广州虎牙科技有限公司 | Image quality enhancement method, device and computer readable storage medium |
Non-Patent Citations (1)
| Title |
|---|
| Ye Guosheng et al.: "A Survey of Deep Learning Image Synthesis Research", Journal of Image and Graphics, vol. 28, no. 12, 16 December 2023 (2023-12-16) * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119151810B (en) | 2025-03-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113538273B (en) | Image processing method and image processing device | |
| CN112712470B (en) | Image enhancement method and device | |
| CN114866860B (en) | Video playing method and electronic equipment | |
| CN116193275B (en) | Video processing method and related equipment | |
| CN117078509B (en) | Model training method, photo generation method and related equipment | |
| CN113642359A (en) | Face image generation method and device, electronic equipment and storage medium | |
| CN116703693A (en) | Image rendering method and electronic device | |
| CN117170560B (en) | Image transformation method, electronic device and storage medium | |
| CN116993619B (en) | Image processing method and related equipment | |
| CN117499797B (en) | Image processing method and related equipment | |
| CN118101988B (en) | Video processing method, system and electronic device | |
| CN119151810B (en) | Image processing method and electronic device | |
| CN118469813B (en) | Image processing method, electronic equipment and computer readable storage medium | |
| CN117635466B (en) | Image enhancement method, device, electronic device and readable storage medium | |
| CN115686182A (en) | Processing method of augmented reality video and electronic equipment | |
| CN117153166B (en) | Voice wake-up method, device and storage medium | |
| CN118450269A (en) | Image processing method and electronic device | |
| CN117672190A (en) | Transliteration method and electronic device | |
| CN117290004A (en) | Methods and electronic devices for component preview | |
| CN117764853B (en) | Face image enhancement method and electronic device | |
| CN113452895A (en) | Shooting method and equipment | |
| CN116757963B (en) | Image processing method, electronic device, chip system and readable storage medium | |
| CN117336597B (en) | Video shooting method and related equipment | |
| CN119945815A (en) | Processing method and related device | |
| CN120765514A (en) | Image optimization method, model training method and related equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Country or region after: China. Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040. Applicant after: Honor Terminal Co.,Ltd. Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong. Applicant before: Honor Device Co.,Ltd. Country or region before: China. |
| | GR01 | Patent grant | |