HK40092635A - Image processing method, apparatus, device, and computer-readable storage medium - Google Patents
Description
Technical Field
This application relates to the field of computer technology, and in particular to an image processing method, apparatus, device, and computer-readable storage medium.
Background
Artificial intelligence (AI) has been applied in a wide range of fields, and the technologies it involves include computer vision, speech processing, and natural language processing, among others. Among these, computer vision is of particular significance in image processing applications. For example, computer vision techniques can be used to complete different types of image processing tasks.
To accomplish image adjustment tasks, related techniques generally adjust a target image on the basis of an existing diffusion model combined with descriptive information about the image adjustment, so as to obtain the adjusted image.
In the course of researching and practicing the related techniques, the inventors of this application found that, when performing image adjustment, the related techniques adaptively adjust the descriptive information while keeping the network parameters of the existing diffusion model unchanged, and then adjust the target image based on the adjusted descriptive information. This tends to overlook many image details, reduces the accuracy of image adjustment, and makes the adjusted image inconsistent with actual needs, which is unfavorable for subsequent business.
Summary of the Invention
Embodiments of this application provide an image processing method, apparatus, device, and computer-readable storage medium, which can solve the problem of the adjusted image not matching actual needs and improve the accuracy of image adjustment.
Embodiments of this application provide an image processing method, including:
obtaining a reference image, and obtaining a similar image that is similar to the reference image;
determining difference information between the reference image and the similar image;
determining a target mask map in the reference image for the difference information;
expanding the difference information to obtain a difference description text;
performing local adjustment on the similar image according to the target mask map, the difference description text, and the reference image, to obtain an adjusted target image.
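The five claimed steps can be sketched as a single pipeline. This is only an illustrative outline, not the patented implementation; every callable name here (`find_similar`, `diff_info_fn`, and so on) is a hypothetical placeholder for a component described in the embodiments below.

```python
def adjust_similar_image(reference, find_similar, diff_info_fn,
                         mask_fn, expand_fn, local_adjust_fn):
    """Hypothetical end-to-end sketch of the claimed method.

    Each callable stands in for one step of the method; none of the
    names come from the patent itself.
    """
    similar = find_similar(reference)          # step 1: retrieve a similar image
    diff = diff_info_fn(reference, similar)    # step 2: difference information
    mask = mask_fn(reference, diff)            # step 3: target mask for the difference
    text = expand_fn(diff)                     # step 4: expanded difference description
    return local_adjust_fn(similar, mask, text, reference)  # step 5: local adjustment
```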
Correspondingly, embodiments of this application provide an image processing apparatus, including:
an acquisition unit, configured to obtain a reference image and obtain a similar image that is similar to the reference image;
a first determining unit, configured to determine difference information between the reference image and the similar image;
a second determining unit, configured to determine a target mask map in the reference image for the difference information;
an expansion unit, configured to expand the difference information to obtain a difference description text;
an adjustment unit, configured to perform local adjustment on the similar image according to the target mask map, the difference description text, and the reference image, to obtain an adjusted target image.
In some embodiments, the adjustment unit is further configured to:
perform noise addition on the similar image, and obtain a first similar noise map and a second similar noise map at adjacent time steps of the noise addition;
perform denoising on the first similar noise map according to the target mask map and the difference description text, to obtain a first image;
perform denoising on the second similar noise map according to the target mask map and the reference image, to obtain a second image;
fuse the first image with the second image, to obtain the adjusted target image.
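One common way to realize the noise-addition step is the standard DDPM forward process, where the noise map at time step t is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with eps drawn from a standard normal distribution. The sketch below assumes that formulation (the patent does not fix a specific noise schedule) and treats images as flat lists of pixel values.

```python
import math
import random

def add_noise(x0, alpha_bar_t, rng):
    """DDPM-style forward noising of a flattened image at one time step:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, 1).
    """
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in x0]

def noise_maps_at_adjacent_steps(x0, alpha_bars, t, seed=0):
    """Return noise maps for the adjacent time steps t and t + 1."""
    rng = random.Random(seed)
    first = add_noise(x0, alpha_bars[t], rng)
    second = add_noise(x0, alpha_bars[t + 1], rng)
    return first, second
```

With alpha_bar = 1.0 the image is returned unchanged, and smaller values of alpha_bar mix in proportionally more Gaussian noise.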
In some embodiments, the adjustment unit is further configured to:
perform mask processing on the first similar noise map according to the target mask map, to obtain a first masked noise map;
obtain a difference text vector corresponding to the difference description text, and perform denoising on the first similar noise map according to the difference text vector, to obtain a first feature map;
decode the first feature map, to obtain the first image.
In some embodiments, the adjustment unit is further configured to:
invert the target mask map, to obtain a target inverse mask map corresponding to the target mask map;
perform mask processing on the second similar noise map according to the target inverse mask map, to obtain a second masked noise map;
perform mask processing on the second masked noise map according to the feature map corresponding to the reference image, to obtain a second feature map;
decode the second feature map, to obtain the second image.
In some embodiments, the adjustment unit is further configured to:
perform local fine-tuning on the similar image according to the target mask map and the difference description text, to obtain a first image;
perform local fine-tuning on the similar image according to the target mask map and the reference image, to obtain a second image;
fuse the first image with the second image, to obtain the adjusted target image.
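A minimal sketch of the mask-inversion and fusion operations used by this embodiment, assuming a binary (0/1) mask and flattened images: the first image supplies the masked (edited) region and the second image supplies everything outside it.

```python
def invert_mask(mask):
    """Inverse mask: 1 where the target mask is 0, and vice versa."""
    return [1.0 - m for m in mask]

def fuse(first_image, second_image, mask):
    """Fuse two flattened images: take the first image inside the masked
    region and the second image outside it."""
    return [m * a + (1.0 - m) * b
            for a, b, m in zip(first_image, second_image, mask)]
```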
In some embodiments, the adjustment unit is further configured to:
perform mask processing on the similar image according to the target mask map, to obtain a first similar masked image;
perform noise addition on the first similar masked image, to obtain a first similar noise map;
obtain a difference text vector corresponding to the difference description text;
perform denoising on the first similar noise map according to the difference text vector, to obtain a first feature map;
decode the first feature map, to obtain the first image.
In some embodiments, the adjustment unit is further configured to:
perform multiple consecutive denoising passes on the first similar noise map, incorporating the difference text vector into each denoising pass through an attention mechanism, to obtain the first feature map.
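The attention mechanism referred to here is typically scaled dot-product cross-attention, with an image feature as the query and the difference-text token vectors as keys and values. The following is a single-head, pure-Python sketch under that assumption, not the patent's actual network.

```python
import math

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, text_keys, text_values):
    """Single-head scaled dot-product cross-attention: the image feature
    `query` attends over the difference-text token vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in text_keys]
    weights = softmax(scores)
    out = [0.0] * len(text_values[0])
    for w, value in zip(weights, text_values):
        for i, v in enumerate(value):
            out[i] += w * v
    return out
```

In a denoising network, this computation would run once per denoising pass so that the text conditioning is injected at every step.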
In some embodiments, the adjustment unit is further configured to:
encode the first similar masked image, to obtain an encoded feature map;
perform noise addition on the encoded feature map, to obtain the first similar noise map.
In some embodiments, the adjustment unit is further configured to:
perform, through a first neural network model, local fine-tuning on the similar image based on the target mask map and the difference description text, to generate the first image;
in this case, the image processing apparatus further includes a training unit, configured to:
obtain a sample reference image, a sample similar image, and a first sample target image;
generate a sample target mask map and a sample difference description text for the sample reference image based on difference information between the sample reference image and the sample similar image;
perform, through a preset model, local fine-tuning on the sample similar image based on the sample target mask map and the sample difference description text, to generate a first predicted image;
determine a prediction loss according to the first sample target image and the first predicted image;
iteratively train the preset model based on the prediction loss until a preset condition is met, to obtain the first neural network model.
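The described loop (predict, compare with the sample target, build a prediction loss, iterate until a preset condition is met) can be illustrated with a toy model. Here a single scale parameter stands in for the preset model, mean squared error stands in for the prediction loss, and a fixed step budget stands in for the preset condition; all of these specifics are assumptions for illustration only.

```python
def mse_loss(pred, target):
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train(samples, steps=200, lr=0.1):
    """Toy stand-in for the training loop: a single scale parameter w
    plays the role of the preset model, and training iterates until the
    step budget (the 'preset condition') is exhausted."""
    w = 0.0
    for _ in range(steps):
        for x, target in samples:
            pred = [w * v for v in x]
            # gradient of the MSE prediction loss with respect to w
            grad = sum(2 * (p - t) * v for p, t, v in zip(pred, target, x)) / len(x)
            w -= lr * grad
    return w
```

For a sample pair where the target is exactly twice the input, the loop converges to w = 2, illustrating how the prediction loss drives the model toward reproducing the sample target image.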
In some embodiments, the adjustment unit is further configured to:
invert the target mask map, to obtain a target inverse mask map corresponding to the target mask map;
perform mask processing on the similar image according to the target inverse mask map, to obtain a second similar masked image;
perform noise addition on the second similar masked image, to obtain a second similar noise map, where the time step of the second similar noise map is adjacent to the time step of the first similar noise map;
perform denoising on the second similar noise map according to the feature map corresponding to the reference image, to obtain a second feature map;
decode the second feature map, to obtain the second image.
In some embodiments, the expansion unit is further configured to:
determine, according to the difference information, a difference object of the reference image relative to the similar image;
determine object relationship information between the difference object and other objects in the reference image;
obtain a global description text of the reference image and a target object description text of the difference object;
perform text expansion based on the global description text, the target object description text, and the object relationship information, to obtain the difference description text.
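A minimal sketch of the text-expansion step: the global description, the target-object description, and the object-relationship information are composed into one difference description text. The joining template here is purely an illustrative assumption; the patent does not specify a composition format.

```python
def expand_difference(global_text, object_text, relation_text):
    """Compose an expanded difference description from the three parts
    named above, skipping any part that is empty."""
    parts = [p.strip() for p in (global_text, object_text, relation_text)
             if p and p.strip()]
    return "; ".join(parts)
```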
In some embodiments, the acquisition unit is further configured to:
determine a reference cluster center to which the reference image belongs;
determine a feature category distance between each pre-built image cluster center in a preset database and the reference cluster center;
select a similar image for the reference image based on the feature category distances.
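A sketch of the retrieval step, assuming the feature category distance is Euclidean distance between cluster-center vectors; images from the nearest pre-built cluster(s) are returned as the similar images. The distance metric is an assumption, since the patent only names a "feature category distance".

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_similar(reference_center, cluster_centers, images_by_cluster, k=1):
    """Rank the pre-built cluster centers by distance to the reference
    image's cluster center and return images from the k nearest clusters."""
    ranked = sorted(range(len(cluster_centers)),
                    key=lambda i: euclidean(reference_center, cluster_centers[i]))
    similar = []
    for i in ranked[:k]:
        similar.extend(images_by_cluster[i])
    return similar
```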
In some embodiments, the first determining unit is further configured to:
obtain a first description text corresponding to the reference image;
obtain a second description text corresponding to the similar image;
generate the difference information based on differences between the first description text and the second description text.
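One simple way to derive difference information from the two description texts is a word-level diff. The sketch below uses Python's `difflib` for this, which is an illustrative choice rather than the patent's actual method.

```python
import difflib

def text_difference(first_text, second_text):
    """Word-level difference between two description texts: returns the
    words that appear only in the first (reference) description and the
    words that appear only in the second (similar-image) description."""
    a, b = first_text.split(), second_text.split()
    only_ref, only_sim = [], []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op in ("replace", "delete"):
            only_ref.extend(a[i1:i2])
        if op in ("replace", "insert"):
            only_sim.extend(b[j1:j2])
    return only_ref, only_sim
```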
In some embodiments, the first determining unit is further configured to:
globally describe the reference image through a first preset model, to generate a global description text of the reference image;
process, through a second preset model, the pixel region in which each object in the reference image is located, to obtain an object description text corresponding to each object in the reference image;
determine the first description text corresponding to the reference image according to the global description text of the reference image and the object description text corresponding to each object.
In some embodiments, the second determining unit is further configured to:
determine, according to the difference information, a difference object in the reference image relative to the similar image;
generate the target mask map based on the difference object in the reference image.
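A minimal sketch of mask generation, assuming the difference object has already been localized to a bounding box (y0, x0, y1, x1); the target mask map is 1 inside that region and 0 elsewhere. Real systems would typically use a segmentation model for a tighter mask, as in the mask segmentation scene of Figure 5.

```python
def object_mask(height, width, box):
    """Binary target mask map: 1 inside the difference object's bounding
    box (y0, x0, y1, x1), 0 elsewhere."""
    y0, x0, y1, x1 = box
    return [[1 if y0 <= y < y1 and x0 <= x < x1 else 0
             for x in range(width)]
            for y in range(height)]
```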
In addition, embodiments of this application further provide a computer device, including a processor and a memory, where the memory stores a computer program and the processor is configured to run the computer program in the memory to implement the steps in any of the image processing methods provided in the embodiments of this application.
In addition, embodiments of this application further provide a computer-readable storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to perform the steps in any of the image processing methods provided in the embodiments of this application.
In addition, embodiments of this application further provide a computer program product, including computer instructions that, when executed, implement the steps in any of the image processing methods provided in the embodiments of this application.
In the embodiments of this application, a similar image that is similar to a reference image can first be obtained from existing data; then, a target mask map is generated based on the difference information between the reference image and the similar image, and the difference information is expanded to enrich the difference description text that represents it; finally, the existing similar image is locally fine-tuned jointly using the target mask map, the difference description text, and the reference image, to obtain the fine-tuned target image. In this way, the differences between images can be described in expanded form, and the expanded difference description text and the reference image can serve as constraints for locally adjusting the similar image, improving the accuracy of image adjustment and making the adjusted image consistent with actual needs, which facilitates subsequent business.
Brief Description of the Drawings
To more clearly illustrate the technical solutions in the embodiments of this application, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those skilled in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of a scene of the image processing system provided in an embodiment of this application;
Figure 2 is a schematic flowchart of the steps of the image processing method provided in an embodiment of this application;
Figure 3 is a schematic diagram of a scene of generating a global description text provided in an embodiment of this application;
Figure 4 is a schematic diagram of a scene of generating object description texts in an image provided in an embodiment of this application;
Figure 5 is a schematic diagram of a mask segmentation scene of the mask segmentation model provided in an embodiment of this application;
Figure 6 is a schematic structural diagram of the latent diffusion model provided in an embodiment of this application;
Figure 7 is a schematic structural diagram of the denoising network layer in reverse diffusion provided in an embodiment of this application;
Figure 8 is a schematic flowchart of further steps of the image processing method provided in an embodiment of this application;
Figure 9 is a schematic diagram of the framework of the image processing system provided in an embodiment of this application;
Figure 10 is a schematic structural diagram of the residual network layer provided in an embodiment of this application;
Figure 11 is a schematic diagram of a scene in which difference information is aggregated to generate a difference description text, provided in an embodiment of this application;
Figure 12 is a schematic diagram of a scene of the image fine-tuning process provided in an embodiment of this application;
Figure 13 is a schematic structural diagram of the image processing apparatus provided in an embodiment of this application;
Figure 14 is a schematic structural diagram of the computer device provided in an embodiment of this application.
Detailed Description
The embodiments of this application are described in detail below, and examples of the embodiments are shown in the drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary and are only intended to explain this application; they should not be construed as limiting it.
Some of the processes described in the specification, the claims, and the above drawings contain multiple steps that appear in a particular order, but it should be clearly understood that these steps may be performed out of the order in which they appear herein, or in parallel. The step numbers are merely used to distinguish the different steps and do not themselves denote any order of execution. In addition, descriptions such as "first" and "second" herein are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of this application.
Embodiments of this application provide an image processing method, apparatus, device, and computer-readable storage medium. Specifically, the embodiments of this application are described from the perspective of an image processing apparatus, which can be integrated in a computer device; the computer device may be a server, a user terminal, or the like. The server may be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The user terminal may be, but is not limited to, a smartphone, tablet computer, laptop computer, desktop computer, smart speaker, smartwatch, smart home appliance, vehicle-mounted terminal, smart voice interaction device, aircraft, or the like.
It should be understood that the specific embodiments of this application involve data related to user information, user usage records, user status, and the like. When the above embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It should be noted that the image processing method provided in the embodiments of this application can be applied to any image adjustment scenario; these scenarios are not limited to being implemented through cloud services, big data, artificial intelligence, or a combination thereof, as illustrated by the following embodiments:
The image processing method provided in the embodiments of this application involves artificial intelligence (AI) technology. Artificial intelligence is the theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to machine vision in which cameras and computers replace human eyes to recognize, track, and measure targets, with further graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for inspection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems capable of obtaining information from images or multidimensional data. Computer vision technology typically includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
In the embodiments of this application, image processing can be implemented through technologies such as image processing, image recognition, and image semantic understanding in computer vision, to complete the processing tasks for images, as illustrated by the following embodiments:
It should be noted that, in this image processing scenario, the processing is mainly implemented through artificial neural network (ANN) models, hereinafter referred to simply as "models". The image processing process can include a training phase (A) and an application phase (B) of a model. The training phase and the application phase can be implemented through one device, or a combination of devices, in the image processing system.
For example, referring to Figure 1, a schematic diagram of a scene of the image processing system provided in an embodiment of this application, the system may include a server and/or a terminal. When the system includes only a server or only a terminal, the server or terminal includes a target database, a model training device, and a model application device; when the system is a combination of a terminal and a server, the server may include the target database, the model training device, and the model application device.
The target database can store a large amount of data, including but not limited to image data, which serves as the sample similar images in the model training phase.
(A) Model training phase:
In the model training phase, the model training device can, after obtaining training data as samples, train a preset model based on the obtained training data. Specifically, the model training phase can include preparing training data and training the model.
The process of preparing training data is as follows: first, a sample reference image and the sample target image to be obtained by adjustment are set; then, sample similar images that are similar to the sample reference image can be obtained from the target database; and further, a sample target mask map and a sample difference description text of the sample reference image are generated based on the difference information between the sample reference image and the sample similar image. The training data is thus obtained.
其中,模型的训练可以理解为模型输出的图像与样本目标图像之间的对比学习训练。由于本申请实施例在图像调整过程可以包括:以差异信息的描述文本作为引导条件对相似图像进行微调,同时,以参考图像作为约束条件对相似图像进行微调,这可分别通过用于图像微调的两个模型来实现,因此,模型训练过程可以包括对不同的模型进行训练,此时,样本目标图像包括第一样本目标图像和第二样本目标图像。The training of the model can be understood as comparative learning training between the model's output image and the sample target image. Since the image adjustment process in this embodiment may include: fine-tuning similar images using descriptive text of the difference information as guiding conditions, and simultaneously fine-tuning similar images using a reference image as a constraint condition, which can be achieved separately using two models for image fine-tuning, the model training process may include training different models. In this case, the sample target image includes a first sample target image and a second sample target image.
结合图1所示,以差异文本作为引导条件的模型的训练过程为例。具体的,该模型训练的过程为:通过预设模型基于样本目标掩码图和样本差异描述文本,对样本相似图像进行局部微调,以获取第一预测图像,进而,将第一样本目标图像与第一预测图像进行对比,并在两者存在差异时,根据第一样本目标图像与第一预测图像之间的差异来构建预测损失,以基于该预测损失对预设模型进行训练;按照以上方式对模型进行迭代训练,直至达到预设条件,如预设模型输出的第一预测图像与第一样本目标图像相同,或者迭代训练的次数达到一定数量,又或者预设模型输出的第一预测图像不再变化,等等,得到训练后的第一神经网络模型。Referring to Figure 1, the training process of a model using difference text as a guiding condition is taken as an example. Specifically, the training process of this model is as follows: Based on the sample target mask image and the sample difference description text, the preset model performs local fine-tuning on the sample similar images to obtain the first predicted image. Then, the first sample target image is compared with the first predicted image, and when there is a difference between the two, a prediction loss is constructed based on the difference between the first sample target image and the first predicted image. The preset model is trained based on this prediction loss. The model is iteratively trained in the above manner until the preset conditions are met, such as the first predicted image output by the preset model being the same as the first sample target image, or the number of iterations reaching a certain number, or the first predicted image output by the preset model no longer changing, etc., to obtain the trained first neural network model.
同理,结合图1所示,以参考图像作为约束条件的模型的训练过程为例。具体的,在准备训练数据的过程中除了需要获取样本参考图像、样本目标掩码图和样本相似图像外,还需要获取样本目标掩码图关联的样本目标反掩码图。进而,通过预设模型基于样本目标反掩码图和样本参考图像,对样本相似图像进行局部微调,以获取第二预测图像,进而,将第二样本目标图像与第二预测图像进行对比,并在两者存在差异时,根据第二样本目标图像与第二预测图像之间的差异来构建预测损失,以基于该预测损失对预设模型进行迭代训练,直至达到预设条件,得到训练后的第二神经网络模型。Similarly, referring to Figure 1, let's take the training process of a model using a reference image as a constraint as an example. Specifically, in preparing training data, in addition to obtaining the sample reference image, sample target mask image, and sample similar images, it is also necessary to obtain the sample target inverse mask image associated with the sample target mask image. Then, based on the sample target inverse mask image and the sample reference image, the preset model performs local fine-tuning on the sample similar images to obtain the second predicted image. Then, the second sample target image is compared with the second predicted image, and when there is a difference between the two, a prediction loss is constructed based on the difference between the second sample target image and the second predicted image. The preset model is iteratively trained based on this prediction loss until the preset conditions are met, resulting in the trained second neural network model.
至此,基于模型训练装置的训练过程结束,分别得到第一神经网络模型和第二神经网络模型,该训练得到的第一神经网络模型和第二神经网络模型可以用于参与本申请的图像处理过程。需要说明的是,在模型训练阶段可采用带有条件的隐空间扩散(StableDiffusion,SD)模型进行训练,得到训练后的第一神经网络模型和第二神经网络模型属于SD模型。At this point, the training process based on the model training device is complete, yielding a first neural network model and a second neural network model. These trained models can be used in the image processing process of this application. It should be noted that a conditional latent space diffusion (SD) model can be used for training during the model training phase, resulting in the trained first and second neural network models belonging to the SD model category.
此外,以差异信息的描述文本作为引导条件对相似图像进行微调,同时,以参考图像作为约束条件对相似图像进行微调,还可通过一个模型来实现对图像进行微调,完成图像处理。则,在模型训练阶段,可通过预设模型基于样本目标掩码图、样本差异描述文本和样本参考图像,对样本相似图像进行局部微调,以获取预测图像,进而,将样本目标图像与预测图像进行对比,并在两者存在差异时,根据样本目标图像与预测图像之间的差异来构建预测损失,以基于该预测损失对预设模型进行迭代训练,直至达到预设条件,得到训练后的目标模型。Furthermore, fine-tuning the similar image with the difference description text as the guiding condition while simultaneously using the reference image as the constraint can also be implemented by a single model, which completes the image processing. In that case, during the model training phase, the preset model performs local fine-tuning on the sample similar image based on the sample target mask image, the sample difference description text, and the sample reference image to obtain a predicted image. The sample target image is then compared with the predicted image, and when there is a difference between the two, a prediction loss is constructed based on that difference. The preset model is iteratively trained based on this prediction loss until the preset conditions are met, yielding the trained target model.
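The iterative training procedure described above (construct a prediction loss from the difference between the predicted image and the sample target image, update the model, and stop once a preset condition such as a loss threshold is reached) can be sketched as follows. This is a minimal illustration with a toy stand-in for the preset model; the real model is a conditional latent diffusion network, and all names and the toy predictor here are hypothetical.

```python
import numpy as np

def train_until_converged(predict, params, cond, sample_similar, sample_target,
                          lr=0.1, max_iters=1000, tol=1e-6):
    """Iteratively update params so the predicted image matches the target.

    `predict(params, cond, img)` stands in for the preset model's local
    fine-tuning step (mask / difference text passed via `cond`); the real
    model would be a conditional latent diffusion (SD) network.
    """
    loss = float("inf")
    for _ in range(max_iters):
        pred = predict(params, cond, sample_similar)
        diff = pred - sample_target
        loss = float(np.mean(diff ** 2))   # prediction loss from the difference
        if loss < tol:                     # preset stop condition
            break
        grad = 2 * diff * sample_similar   # exact gradient for this toy linear model
        params = params - lr * np.mean(grad)
    return params, loss

# toy stand-in: the "model" just scales the similar image by a learnable factor
predict = lambda p, cond, img: p * img
similar = np.array([1.0, 2.0, 3.0])
target = 2.0 * similar                     # ideal output: factor 2
params, loss = train_until_converged(predict, 1.0, None, similar, target)
```

With this toy predictor the learned factor converges to 2 within a few iterations; the point is only the loop structure (predict, compare, build loss, update, check stop condition), which mirrors the training of both neural network models above.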
(B)模型的应用阶段:(B) Application phase of the model:
在模型的应用阶段中,可将训练好的第一神经网络模型和第二神经网络模型上传或安装至模型应用装置中,以使得模型应用装置在图像处理过程中运行该第一神经网络模型和第二神经网络模型,以配合完成图像处理的相关流程。具体的,该图像处理流程包括:获取参考图像,并获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。In the application phase of the model, the trained first and second neural network models can be uploaded or installed into the model application device, so that the device runs these models during image processing to complete the relevant image processing workflow. Specifically, the image processing workflow includes: obtaining a reference image and obtaining a similar image that is similar to the reference image; determining the difference information between the reference image and the similar image; determining a target mask image in the reference image corresponding to the difference information; expanding the difference information to obtain a difference description text; and performing local adjustments on the similar image based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
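The five-step workflow just listed can be expressed as a thin pipeline in which each step is an injected callable; in the described system the final step would be backed by the trained neural network models. The following sketch uses hypothetical stand-ins that only trace the dataflow:

```python
def image_processing_pipeline(reference, database, find_similar, diff_info,
                              make_mask, expand_text, local_adjust):
    """Chain the five workflow steps; every step is a callable supplied by
    the caller (toy stand-ins here, trained models in practice)."""
    similar = find_similar(reference, database)          # step 1: retrieve similar image
    diff = diff_info(reference, similar)                 # step 2: difference information
    mask = make_mask(reference, diff)                    # step 3: target mask image
    text = expand_text(diff)                             # step 4: difference description text
    return local_adjust(similar, mask, text, reference)  # step 5: local adjustment

# toy stand-ins that just record how data flows through the pipeline
result = image_processing_pipeline(
    "ref", ["imgA", "imgB"],
    find_similar=lambda r, db: db[0],
    diff_info=lambda r, s: f"diff({r},{s})",
    make_mask=lambda r, d: f"mask({d})",
    expand_text=lambda d: f"text({d})",
    local_adjust=lambda s, m, t, r: f"adjust({s},{m},{t},{r})",
)
```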
需要说明的是,第一神经网络模型和第二神经网络模型主要应用在对相似图像进行局部微调过程中。具体的,可通过第一神经网络模型基于目标掩码图和差异描述文本,对相似图像进行局部微调,生成第一图像;以及通过第二神经网络模型基于目标反掩码图和参考图像,对相似图像进行局部微调,生成第二图像;将第一图像与第二图像进行融合,得到调整后的目标图像。It should be noted that the first and second neural network models are mainly used in the process of local fine-tuning of similar images. Specifically, the first neural network model can perform local fine-tuning of similar images based on the target mask image and the difference description text to generate a first image; and the second neural network model can perform local fine-tuning of similar images based on the target inverse mask image and the reference image to generate a second image; the first image and the second image are then fused to obtain the adjusted target image.
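One natural way to fuse the two locally fine-tuned results is mask-weighted blending: pixels inside the target mask (the difference region) are taken from the first image, and pixels under the inverse mask from the second. The source does not specify the fusion operator, so this blending rule is an assumption for illustration:

```python
import numpy as np

def fuse(first_image, second_image, mask):
    """Fuse the two locally fine-tuned results using the target mask.

    Assumed rule: inside the mask (value 1, the difference region) use the
    first image (guided by the difference description text); under the
    inverse mask (value 0) use the second image (constrained by the
    reference image).
    """
    mask = mask.astype(float)
    return mask * first_image + (1.0 - mask) * second_image

first = np.full((2, 2), 9.0)       # result edited inside the difference region
second = np.full((2, 2), 1.0)      # result preserved outside the region
mask = np.array([[1, 0], [0, 1]])  # 1 = difference region, 0 = inverse mask
fused = fuse(first, second, mask)
```

A soft (non-binary) mask would blend the two results smoothly at the region boundary with the same formula.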
此外,假若以一个模型来实现对图像进行微调时,可将训练后的该目标模型上传或安装至模型应用装置中,以使得模型应用装置在图像处理过程中运行该目标模型,以配合完成图像处理的相关流程,该图像处理流程包括:获取参考图像,并获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。Furthermore, if a single model is used to fine-tune the image, the trained target model can be uploaded or installed into the model application device, so that the device runs the target model during image processing to complete the relevant image processing workflow. This workflow includes: obtaining a reference image and obtaining a similar image that is similar to the reference image; determining the difference information between the reference image and the similar image; determining a target mask image in the reference image corresponding to the difference information; expanding the difference information to obtain a difference description text; and performing local adjustments on the similar image based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
通过以上模型的训练阶段和应用阶段的场景,可以实现本申请的图像处理方法。The image processing method of this application can be realized through the training and application scenarios of the above model.
例如,假设服务器或终端上包括目标数据库、模型训练装置和模型应用装置,服务器或终端可以基于目标数据库中的样本图像数据来准备训练数据,并通过模型训练装置根据训练数据对预设模型进行训练,并将训练后的第一神经网络模型和第二神经网络模型传输到模型应用装置上运行。此时,终端或服务器可以实现如下:获取参考图像,并获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。For example, assuming a server or terminal includes a target database, a model training device, and a model application device, the server or terminal can prepare training data based on the sample image data in the target database, train the preset model on that training data through the model training device, and then transmit the trained first and second neural network models to the model application device for execution. In this case, the terminal or server can perform the following: obtain a reference image and obtain a similar image that is similar to the reference image; determine the difference information between the reference image and the similar image; determine a target mask image in the reference image corresponding to the difference information; expand the difference information to obtain a difference description text; and perform local adjustments on the similar image based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
又如,以终端和服务器组合的系统为例,终端与服务器之间建立有通信连接。其中,服务器可以是由多个物理服务器构成的分布式服务系统,其至少包含目标数据库、模型训练装置和模型应用装置,可在服务器上完成对模型的训练后,通过服务器上运行训练后的第一神经网络模型和第二神经网络模型,或者,通过服务器上运行训练后的目标模型,以实现图像处理流程。具体的,在应用阶段,可通过终端上的客户端向服务器发送参考图像。而服务器在获取参考图像后,可从目标数据库中获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。此后,服务器可将调整得到的目标图像返回给终端。As another example, consider a system combining a terminal and a server, with a communication connection established between them. The server can be a distributed service system composed of multiple physical servers, containing at least a target database, a model training device, and a model application device. After the model is trained on the server, the trained first and second neural network models, or the trained target model, can be run on the server to implement the image processing workflow. Specifically, in the application phase, a reference image can be sent to the server via a client on the terminal. After obtaining the reference image, the server can retrieve a similar image that is similar to the reference image from the target database; determine the difference information between the reference image and the similar image; determine a target mask image in the reference image corresponding to the difference information; expand the difference information to obtain a difference description text; and perform local adjustments on the similar image based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image. The server can then return the adjusted target image to the terminal.
示例性的,结合图1所示,假设终端上安装有图像处理应用(客户端),用户可在图像处理应用上选定图像搜索任务,以执行图像处理过程。具体的,该图像处理过程为:首先,用户可在终端上的客户端页面中选定图像搜索任务,并针对图像搜索任务设定需要搜索的参考图像;进而,客户端将该参考图像传输至服务器。然后,服务器在获取参考图像后,可从目标数据库中获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。最后,服务器将针对相似图像进行局部微调的目标图像返回至客户端,以使得图像搜索业务提供的图像能够与实际需求(参考图像)更符合,以利于图像搜索业务的开展。For example, referring to Figure 1, assume an image processing application (client) is installed on the terminal, and the user can select an image search task in the application to execute the image processing process. Specifically, the process is as follows: first, the user selects an image search task on the client page of the terminal and sets the reference image to be searched for; the client then transmits the reference image to the server. After obtaining the reference image, the server can retrieve a similar image that is similar to the reference image from the target database; determine the difference information between the reference image and the similar image; determine a target mask image in the reference image corresponding to the difference information; expand the difference information to obtain a difference description text; and perform local adjustments on the similar image based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image. Finally, the server returns the target image obtained by locally fine-tuning the similar image to the client, so that the images provided by the image search service better match the actual needs (the reference image), facilitating the image search business.
需要说明的是,以上仅为示例,还可应用于其他图像业务中,此处不做一一赘述。It should be noted that the above are just examples and can be applied to other image services, which will not be elaborated here.
为了便于理解,以下将分别对图像处理方法的各步骤进行详细说明。需说明的是,以下实施例的顺序不作为对实施例优选顺序的限定。For ease of understanding, each step of the image processing method will be described in detail below. It should be noted that the order of the following embodiments is not intended to limit the preferred order of the embodiments.
在本申请实施例中,将从图像处理装置的维度进行描述,以该图像处理装置具体可以集成在计算机设备如终端或服务器中。参见图2,图2为本申请实施例提供的一种图像处理方法的步骤流程示意图,本申请实施例以图像处理装置具体集成在服务器上为例,服务器上的处理器执行图像处理方法对应的程序指令时,具体流程如下:In this embodiment, the description will focus on the image processing device, which can be integrated into a computer device such as a terminal or server. Referring to Figure 2, which is a flowchart illustrating the steps of an image processing method provided in this embodiment, this embodiment takes the image processing device integrated into a server as an example. When the processor on the server executes the program instructions corresponding to the image processing method, the specific process is as follows:
101、获取参考图像,并获取与参考图像相似的相似图像。101. Obtain a reference image, and then obtain similar images that are similar to the reference image.
本申请实施例在获取得到参考图像后,为了获得符合参考图像的实际需求的图像,一般可从现有图像数据中搜寻与参考图像最为相似的图像,以便后续对搜寻到的相似图像作进一步的图像处理,该图像处理过程可以是图像调整,如大幅度调整、局部调整等,以便调整后的目标图像所包含的信息更匹配参考图像,如获得的图像在风格和内容方面与参考图像更为相近,具有可靠性。In this embodiment of the application, after obtaining the reference image, in order to obtain an image that meets the actual needs of the reference image, it is generally possible to search for the image most similar to the reference image from the existing image data so that the searched similar image can be further processed. The image processing process can be image adjustment, such as large-scale adjustment or local adjustment, so that the information contained in the adjusted target image is more in line with the reference image. For example, the obtained image is more similar to the reference image in terms of style and content, and has reliability.
其中,该参考图像可以是包含任意内容信息的图像,如包含水果、餐具、动物、人物、动画等任意一种或多种内容信息的图像,还可以是包含其他形式的内容信息的图像,此处不做一一列举。需要说明的是,该参考图像可以作为图像处理过程的图像调整依据,即可基于该参考图像对其他图像进行调整。The reference image can be an image containing any content information, such as an image containing one or more types of content information, including fruits, tableware, animals, people, and animations. It can also be an image containing other forms of content information, which are not listed here. It should be noted that the reference image can be used as the basis for image adjustment in the image processing process, that is, other images can be adjusted based on the reference image.
其中,该相似图像可以是数据库中与参考图像最为相似的图像,其可以理解为现有数据中在图像内容或图像风格等方面与参考图像最为相似的图像。需要说明的是,本申请实施例以该相似图像作为图像调整的基础数据,即在该相似图像的基础上进行图像调整。The similar image can be the image in the database that is most similar to the reference image. It can be understood as the image in the existing data that is most similar to the reference image in terms of image content or style. It should be noted that this embodiment uses the similar image as the basis for image adjustment; that is, image adjustments are performed based on the similar image.
为了便于理解参考图像和相似图像,以示例方式对这两种图像进行介绍。示例性的,以图像搜索业务为例,客户通过客户端向图像搜索平台发送一张例图,该例图的内容信息为包含两只卧躺状态的猫,该例图可以视为参考图像;进而,图像搜索平台在收到客户发送的例图后,可在本平台的数据库中查找与该例图最为相似的相似图像,以便后续基于该参考图像的相关信息对该相似图像作进一步调整,以尽可能满足客户的图像搜索业务的需求。To facilitate understanding of reference images and similar images, we will introduce these two types of images with examples. For instance, taking an image search service as an example, a customer sends an example image to the image search platform through a client. The example image contains information about two cats lying down; this example image can be considered a reference image. Then, after receiving the example image from the customer, the image search platform can search its database for the most similar image to that example image. This allows the platform to further adjust the similar image based on relevant information from the reference image, in order to best meet the customer's image search needs.
在一些实施方式中,为了从现有数据库中查找出与参考图像相似的相似图像,可以按照特征距离方式来确定任意两张图像之间是否相似,或者通过特征距离来衡量两张图像之间的相似度。例如,步骤101中的“获取与参考图像相似的相似图像”,可以包括:确定参考图像所属的参考聚类中心;确定预设数据库中预构建的每个图像聚类中心与参考聚类中心之间的特征类别距离;基于特征类别距离,为参考图像选取相似的相似图像。In some implementations, to find similar images to a reference image from an existing database, the similarity between any two images can be determined by using feature distance, or the similarity between two images can be measured by feature distance. For example, "obtaining similar images to the reference image" in step 101 may include: determining the reference cluster center to which the reference image belongs; determining the feature category distance between each pre-constructed image cluster center in a preset database and the reference cluster center; and selecting similar images to the reference image based on the feature category distance.
需要说明的是,不同的图像之间所包含的内容信息具有差异,可按照图像中包含的内容信息对图像进行分类,例如,在动物科目下的图像,可按照动物种类对图像划分类别,如划分为猫、狗、老虎、马、鸽子、老鹰以及其他动物的类别。对于属于同一类别的一个或多个图像,可以通过一个或多个图像来计算该类别对应的聚类中心。It should be noted that different images contain different information. Images can be classified according to their content. For example, images under the animal category can be categorized by animal species, such as cats, dogs, tigers, horses, pigeons, eagles, and other animals. For one or more images belonging to the same category, cluster centers for that category can be calculated using one or more images.
其中,该参考聚类中心可以是基于一个或多个参考图像所构建的特征聚类中心,表示这一个或多个参考图像之间特征类别中心点,可以理解为特征均值点。例如,当存在一个参考图像时,可将该参考图像转换为像素点矩阵,该参考图像的像素点矩阵可以视为参考聚类中心;当存在多个参考图像时,可分别确定每个参考图像的像素点矩阵,并结合每个像素点矩阵来计算这多个参考图像的参考聚类中心,如将多个像素点矩阵之间的均值作为参考聚类中心。以上仅为示例,不作为实施本申请的具体限定方式。The reference cluster center can be a feature cluster center constructed based on one or more reference images, representing the feature category center point among these reference images, which can be understood as the feature mean point. For example, when there is one reference image, the reference image can be converted into a pixel matrix, and the pixel matrix of the reference image can be regarded as the reference cluster center; when there are multiple reference images, the pixel matrix of each reference image can be determined separately, and the reference cluster center of these multiple reference images can be calculated by combining each pixel matrix, such as using the mean of multiple pixel matrices as the reference cluster center. The above are merely examples and are not intended to limit the specific implementation of this application.
其中,该图像聚类中心可以是现有数据库中每个类别对应的图像集合的聚类中心,每个图像聚类中心可以根据数据库中每个类别的图像集合的更新而改变,即该图像聚类中心可以是实时构建。示例性的,在预设数据库中可包含食物、动物、植物、交通工具、饰品等科目的图像,每个科目下可包含一个或多个图像类别,每个图像类别对应一个图像聚类中心;以动物为例,假设包含猫类别的图像集合,此时,通过每个像素点矩阵分别表示猫类别下的每个图像,并计算猫类别下的所有像素点矩阵的均值,以作为猫类别对应的图像聚类中心,假设动物科目下还包括狗类别,则该狗类别的图像聚类中心的计算方式与“猫类别的图像聚类中心”的计算方式一致。需要说明的是,对于其他科目下任意类别的图像,其图像聚类中心的计算方式与上述相同,此处不做一一赘述。The image cluster center can be the cluster center of the image set corresponding to each category in the existing database. Each image cluster center can change according to the update of the image set of each category in the database, that is, the image cluster center can be constructed in real time. For example, the preset database may contain images of subjects such as food, animals, plants, vehicles, and ornaments. Each subject may contain one or more image categories, and each image category corresponds to an image cluster center. Taking animals as an example, assuming that there is an image set containing the cat category, each image under the cat category is represented by a matrix of each pixel point, and the mean of all pixel point matrices under the cat category is calculated as the image cluster center corresponding to the cat category. Assuming that the animal subject also includes a dog category, the calculation method for the image cluster center of the dog category is the same as the calculation method for the image cluster center of the cat category. It should be noted that the calculation method for the image cluster center of any category under other subjects is the same as the above, and will not be elaborated here.
为了从预设数据库中选取与该参考图像相似的相似图像,在确定参考图像对应的参考聚类中心后,可分别计算该参考聚类中心与数据库中每个图像聚类中心之间的特征类别距离。进一步的,可根据特征类别距离来选取与参考图像相似的相似图像,具体的,可根据该特征类别距离的大小来判定参考图像的聚类中心与数据库中的哪一个图像聚类中心更相近,以将该相近的目标图像聚类中心的图像类别确定为该参考聚类中心的图像类别;进而,计算参考图像与目标图像聚类中心的图像类别下的每一图像之间的特征距离,需要说明的是,特征距离的大小可以反映任意两个图像之间的相似度,因此,可根据特征距离的大小来选取与参考图像相似的相似图像,例如,选取与参考图像的特征距离最小的图像作为相似图像。To select a similar image from the preset database, after determining the reference cluster center corresponding to the reference image, the feature category distance between the reference cluster center and each image cluster center in the database can be calculated. Further, a similar image can be selected based on the feature category distance. Specifically, the magnitude of the feature category distance determines which image cluster center in the database is closest to the cluster center of the reference image, and the image category of that closest target image cluster center is taken as the image category of the reference cluster center. Then, the feature distance between the reference image and each image under the image category of the target image cluster center is calculated. It should be noted that the magnitude of the feature distance reflects the similarity between any two images; therefore, a similar image can be selected according to the feature distance, for example, by selecting the image with the smallest feature distance to the reference image as the similar image.
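The two-stage selection described above (nearest cluster center first, then nearest image within that category) can be sketched as follows. Cluster centers are computed as the mean of the per-image feature matrices, as in the earlier discussion; the 2-D feature vectors and class names are illustrative stand-ins for real image features:

```python
import numpy as np

def find_similar(reference_feat, images_by_class):
    """Two-stage retrieval: pick the class whose center (mean feature) is
    closest to the reference, then the image in that class with the
    smallest feature distance to the reference."""
    # cluster center of each class = mean of that class's feature matrices
    centers = {c: np.mean(feats, axis=0) for c, feats in images_by_class.items()}
    # stage 1: class whose center is closest to the reference
    best_class = min(centers,
                     key=lambda c: np.linalg.norm(reference_feat - centers[c]))
    # stage 2: within that class, image with the smallest feature distance
    feats = images_by_class[best_class]
    idx = min(range(len(feats)),
              key=lambda i: np.linalg.norm(reference_feat - feats[i]))
    return best_class, idx

images = {
    "cat": [np.array([0.9, 1.2]), np.array([1.4, 0.8])],
    "dog": [np.array([5.1, 4.9]), np.array([4.8, 5.2])],
}
cls, idx = find_similar(np.array([1.0, 1.1]), images)
```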
通过以上方式,可在获取得到参考图像后,从现有图像数据中搜寻与参考图像最为相似的图像,以便后续对搜寻到的图像作进一步的图像调整,以获得更符合参考图像的实际需求的图像,具有可靠性。Using the above method, after obtaining the reference image, the image most similar to the reference image can be found from the existing image data, so that the found image can be further adjusted to obtain an image that better meets the actual needs of the reference image, ensuring reliability.
102、确定参考图像与相似图像之间的差异信息。102. Determine the differences between the reference image and similar images.
在本申请实施例中,为了更准确地对相似图像进行调整,可以确定参考图像与相似图像之间的差异情况,以便后续结合该两者图像之间的差异情况作为图像处理过程中的调整依据,并对相似图像进行调整,提高图像调整的准确性。In this embodiment of the application, in order to adjust similar images more accurately, the differences between the reference image and the similar image can be determined so that the differences between the two images can be used as the basis for adjustment in the image processing process, and the similar image can be adjusted to improve the accuracy of image adjustment.
其中,该差异信息可以是表示参考图像与相似图像之间的特征差异的信息,其不限于包括图像中的存在差异的对象(事物)数量、对象位置和/或对象体态等差异信息。例如,参考图像中包含两只橘猫,第一只橘猫卧在草地上,第二只橘猫在第一只橘猫的周围区域处于奔跑动作的状态,假设相似图像中包含两只橘猫,其中一只橘猫卧在草地上,另一橘猫在距离该卧着的橘猫较远处区域作出奔跑动作的状态,则参考图像与相似图像之间的差异信息可以是两只橘猫之间的位置差异,即位置关系;又如,参考图像中包含两只橘猫和一只蓝猫,相似图像中包含两只橘猫,则参考图像相对于相似图像的差异信息中存在差异对象(一只蓝猫)。需要说明的是,可结合差异对象和对象位置关系来生成该差异信息。以上仅为示例,不作为实施本申请的具体限定方式。The difference information can represent the feature differences between a reference image and a similar image, and is not limited to differences in the number of objects (things) that differ in the image, the location of the objects, and/or the posture of the objects. For example, if the reference image contains two orange cats, the first orange cat is lying on the grass, and the second orange cat is running around the first orange cat, and the similar image contains two orange cats, one of which is lying on the grass, and the other is running at a distance from the lying orange cat, then the difference information between the reference image and the similar image could be the positional difference between the two orange cats, i.e., their positional relationship. As another example, if the reference image contains two orange cats and one blue cat, and the similar image contains two orange cats, then the difference information between the reference image and the similar image includes a different object (the blue cat). It should be noted that this difference information can be generated by combining the difference object and the positional relationship of the objects. The above are merely examples and are not intended to limit the specific implementation of this application.
在一些实施方式中,为了获取参考图像与相似图像之间的差异信息,可以根据参考图像的描述文本和相似图像的描述文本来确定图像之间的差异,从而生成差异信息。例如,步骤102可以包括:In some implementations, to obtain difference information between a reference image and similar images, the differences between the images can be determined based on the descriptive text of the reference image and the descriptive text of the similar images, thereby generating difference information. For example, step 102 may include:
(102.1)获取参考图像对应的第一描述文本;(102.1) Obtain the first descriptive text corresponding to the reference image;
(102.2)获取相似图像对应的第二描述文本;(102.2) Obtain the second descriptive text corresponding to similar images;
(102.3)基于第一描述文本与第二描述文本之间的差异,生成差异信息。(102.3) Generate difference information based on the difference between the first description text and the second description text.
其中,该第一描述文本可以是针对参考图像中的内容信息生成的图像内容描述文本,由于参考图像的内容信息可以包括对象信息和对象所在的环境信息,因此,该第一描述文本不限于包括针对参考图像的整体内容的全局描述文本和针对参考图像中对象的对象描述文本。示例性的,假设参考图像中包含两只橘猫和一只蓝猫,则全局描述文本可以为“草地上有两只橘猫和一只蓝猫,第二只橘猫位于第一只橘猫的右上方草地区域,蓝猫位于第一只橘猫的左上方草地区域”,针对第一只橘猫的对象描述文本为“橘猫卧在草地上”,针对第二只橘猫的对象描述文本为“橘猫在草地上奔跑”,针对蓝猫的对象描述文本为“蓝猫在草地上打滚”,以上全局描述文本和对象描述文本仅为示例,不作为实施本申请的具体限定方式,以上任一描述文本还可根据实际情况进行更详细或更简洁的描述。The first descriptive text can be image content description text generated based on the content information in the reference image. Since the content information of the reference image can include object information and the environment in which the object is located, the first descriptive text is not limited to including global descriptive text for the overall content of the reference image and object descriptive text for objects in the reference image. For example, assuming the reference image contains two orange cats and one blue cat, the global descriptive text can be "There are two orange cats and one blue cat on the grass. The second orange cat is located in the grass area to the upper right of the first orange cat, and the blue cat is located in the grass area to the upper left of the first orange cat." The object descriptive text for the first orange cat is "The orange cat is lying on the grass." The object descriptive text for the second orange cat is "The orange cat is running on the grass." The object descriptive text for the blue cat is "The blue cat is rolling on the grass." The above global descriptive text and object descriptive text are only examples and are not intended as specific limitations for implementing this application. Any of the above descriptive texts can be described in more detail or more concisely according to the actual situation.
其中,该第二描述文本可以是针对相似图像中的内容信息生成的图像内容描述文本,同理,该第二描述文本不限于包括针对相似图像的整体内容的全局描述文本和针对相似图像中对象的对象描述文本。具体示例可参见关于第一描述文本的描述,此处不做限定。The second descriptive text can be image content description text generated based on content information in similar images. Similarly, the second descriptive text is not limited to global descriptive text for the overall content of similar images and object descriptive text for objects in similar images. Specific examples can be found in the description of the first descriptive text, which is not limited here.
具体的,为了获取参考图像与相似图像之间的差异信息,可在分别获取到参考图像的第一描述文本和相似图像的第二描述文本后,根据第一描述文本和第二描述文本之间的描述差异,来学习参考图像与相似图像中内容信息的差异,如确定参考图像与相似图像之间在对象数量上是否存在差异,确定参考图像与相似图像之间的对象位置分布是否存在差异,确定参考图像与相似图像之间的对象体态是否存在差异,等等,需要说明的是,参考图像与相似图像之间的差异可包括以上一种或多种情况,且还可以包括其他差异的情况;进一步的,基于以上确定的差异,获取差异信息。Specifically, in order to obtain the difference information between the reference image and similar images, after obtaining the first descriptive text of the reference image and the second descriptive text of the similar images, the differences in content information between the reference image and the similar images can be learned based on the descriptive differences between the first and second descriptive texts. For example, it can be determined whether there is a difference in the number of objects between the reference image and the similar images, whether there is a difference in the distribution of object positions between the reference image and the similar images, whether there is a difference in the shape of objects between the reference image and the similar images, etc. It should be noted that the differences between the reference image and the similar images may include one or more of the above situations, and may also include other differences. Furthermore, based on the differences determined above, difference information is obtained.
示例性的,假设针对参考图像的第一描述文本为“草地上有两只橘猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的右上方区域,第二只橘猫在草地上奔跑”,假设针对相似图像的第二描述文本为“草地上有两只橘猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的左上方区域,第二只橘猫在草地上奔跑”,由此可得,第一描述文本与第二描述文本之间的差异为第二只橘猫(对象)在图像中的位置分布差异,因此,可基于这两个差异来生成差异信息,该差异信息可以是针对参考图像中的第二只橘猫(差异对象)的信息,如第二只橘猫的位置信息、形状信息、尺寸信息等,此外,由于参考图像中第一只橘猫的左上方区域为草地,而相似图像中第一只橘猫的左上方区域存在第二只橘猫,则差异信息还可包括参考图像中第一只橘猫的左上方区域的草地(差异对象)的信息。For example, suppose the first descriptive text for a reference image is "There are two orange cats on the grass. The first orange cat is lying on the grass, and the second orange cat is located in the upper right area of the first orange cat. The second orange cat is running on the grass." Suppose the second descriptive text for a similar image is "There are two orange cats on the grass. The first orange cat is lying on the grass, and the second orange cat is located in the upper left area of the first orange cat. The second orange cat is running on the grass." Thus, the difference between the first and second descriptive texts is the difference in the positional distribution of the second orange cat (object) in the image. Therefore, difference information can be generated based on these two differences. This difference information can be information about the second orange cat (difference object) in the reference image, such as the position, shape, and size information of the second orange cat. In addition, since the upper left area of the first orange cat in the reference image is grass, and the upper left area of the first orange cat in the similar image contains the second orange cat, the difference information can also include information about the grass (difference object) in the upper left area of the first orange cat in the reference image.
示例性的,假设针对参考图像的第一描述文本为“草地上有两只橘猫和一只蓝猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的右上方区域,第二只橘猫在草地上奔跑,蓝猫位于第一只橘猫的左上方区域,蓝猫在草地上打滚”,假设针对相似图像的第二描述文本为“草地上有两只橘猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的右上方区域,第二只橘猫在草地上奔跑”,由此可得,第一描述文本与第二描述文本之间的差异为猫(对象)的数量差异、猫在图像中的位置分布差异等,因此,可基于这两个差异来生成差异信息,该差异信息可以是针对蓝猫(差异对象)的信息,如蓝猫的对象描述文本和在参考图像中位置信息。For example, suppose the first descriptive text for a reference image is "There are two orange cats and one blue cat on the grass. The first orange cat is lying on the grass, the second orange cat is located to the upper right of the first orange cat and is running on the grass, and the blue cat is located to the upper left of the first orange cat and is rolling on the grass." Suppose the second descriptive text for a similar image is "There are two orange cats on the grass. The first orange cat is lying on the grass, the second orange cat is located to the upper right of the first orange cat and is running on the grass." Thus, the difference between the first and second descriptive texts is the difference in the number of cats (objects) and the difference in the distribution of the cats in the image. Therefore, difference information can be generated based on these two differences. This difference information can be information about the blue cat (difference object), such as the object description text of the blue cat and its position information in the reference image.
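The comparison of the two description texts in the examples above can be sketched as a per-object diff over structured descriptions. Treating each description text as a mapping from object label to object description is a simplifying assumption for illustration; in practice the free-form texts would first be parsed into such a structure:

```python
def difference_info(ref_desc, sim_desc):
    """Derive difference information by comparing per-object descriptions.

    Objects present in the reference description but absent from the
    similar image's description become difference objects; objects present
    in both but described differently yield attribute differences.
    """
    diffs = {}
    for label, text in ref_desc.items():
        if label not in sim_desc:
            diffs[label] = {"type": "missing object", "description": text}
        elif sim_desc[label] != text:
            diffs[label] = {"type": "attribute difference",
                            "reference": text, "similar": sim_desc[label]}
    return diffs

# hypothetical parsed descriptions, mirroring the blue-cat example above
ref = {"orange cat #2": "running in the upper-right grass area",
       "blue cat": "rolling in the upper-left grass area"}
sim = {"orange cat #2": "running in the upper-left grass area"}
diffs = difference_info(ref, sim)
```

Here the blue cat surfaces as a missing difference object and the second orange cat as a positional (attribute) difference, matching the two kinds of differences discussed above.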
在一些实施方式中,由于参考图像和相似图像都属于图像数据,为了准确获取用于描述参考图像和相似图像的文本信息,可通过图文转换方式,获取得到参考图像的第一描述文本和相似图像的第二描述文本。例如,以获取第一描述文本为例,步骤(102.1)可以包括:In some implementations, since both the reference image and the similar image are image data, in order to accurately obtain the text information describing the reference image and the similar image, a first descriptive text for the reference image and a second descriptive text for the similar image can be obtained through image-to-text conversion. For example, to obtain the first descriptive text, step (102.1) may include:
(102.1.1)通过第一预设模型对参考图像进行全局描述,生成参考图像的全局描述文本;(102.1.1) The reference image is globally described using the first preset model, and a global description text of the reference image is generated;
(102.1.2)通过第二预设模型对参考图像中的每个对象所在的像素区域进行处理,得到参考图像中每个对象对应的对象描述文本;(102.1.2) The pixel region where each object is located in the reference image is processed by the second preset model to obtain the object description text corresponding to each object in the reference image;
(102.1.3)根据参考图像的全局描述文本和每个对象对应的对象描述文本,确定参考图像对应的第一描述文本。(102.1.3) Determine the first description text corresponding to the reference image based on the global description text of the reference image and the object description text corresponding to each object.
其中,该全局描述文本可以理解为对图像中包含的内容信息进行整体描述、概括得到的文本,通过该全局描述文本,可以基于文本的方式快速理解图像。示例性的,以参考图像为例,假设参考图像的内容包含草地、飞盘和向飞盘方向跳跃的宠物狗,则该全局描述文本可以是“宠物狗在草地上跳跃起来玩飞盘”。又如,假设参考图像的内容包括草地、卧着的橘猫和奔跑的蓝猫,则全局描述文本可以是“橘猫卧在草地上休息,蓝猫在草地上奔跑玩耍”;以上仅为示例。The global description text can be understood as a comprehensive description and summary of the content information contained in the image. This global description text allows for a quick understanding of the image based on text. For example, taking a reference image as an example, if the reference image contains grass, a frisbee, and a dog jumping towards the frisbee, the global description text could be "The dog is jumping on the grass playing with a frisbee." Or, if the reference image contains grass, a reclining orange cat, and a running blue cat, the global description text could be "The orange cat is resting on the grass, and the blue cat is running and playing on the grass." These are just a few examples.
其中,该对象描述文本可以是用于描述图像中相应对象的特征的文本,其可包括对象的颜色、形状、体态、位置以及其他方面的描述。例如,以参考图像为例,该对象描述文本可以描述图像中对应的对象的位置、动作、体态、形状等信息。需要说明的是,该对象描述文本可以包括对象类别标签和对象描述信息,该对象类别标签用于表示对应的对象所属的类别或名称,对象描述信息用于具体描述对象的动作、位置、形状、体态等等;例如,假设一张图像中包含的内容信息为“一条棕色的狗在草地上玩飞盘”,则对象类别标签可以是“狗”或“棕色的狗”,对象描述信息可以是“棕色的狗在草地上玩飞盘”,以上仅为示例,不作为本申请实施的具体限定。The object description text can be text used to describe the characteristics of the corresponding object in the image, and may include descriptions of the object's color, shape, posture, position, and other aspects. For example, taking a reference image as an example, the object description text can describe the position, action, posture, shape, and other information of the corresponding object in the image. It should be noted that the object description text may include an object category label and object description information. The object category label is used to indicate the category or name to which the corresponding object belongs, and the object description information is used to specifically describe the object's action, position, shape, posture, etc. For example, assuming that the content information contained in an image is "a brown dog playing frisbee on the grass," then the object category label can be "dog" or "brown dog," and the object description information can be "a brown dog playing frisbee on the grass." The above are merely examples and are not intended to limit the specific implementation of this application.
其中,该第一预设模型可以是用于对图像的整体情况进行全局的文本描述的模型,例如,该第一预设模型可以是预训练的冻结图像编码器和大型语言模型(Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,BLIP2),该模型引入了大规模语言模型(Large Language Models,LLM)。The first preset model can be a model for global textual description of the overall situation of the image. For example, the first preset model can be BLIP2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models), which introduces Large Language Models (LLM).
具体的,结合图3所示,该第一预设模型在结构上可包括视觉与语言表征学习(Vision-and-Language Representation Learning)、视觉到语言的生成学习(Vision-to-Language Generative Learning)两部分,其中,该视觉与语言表征学习在结构上包括图像编码器(Image Encoder)和轻量级的查询变压器(Querying Transformer,Q-Former),而视觉到语言的生成学习在结构上包括大规模语言模型(Large Language Models,LLM)。以对参考图像进行全局描述为例,在该第一预设模型中,首先,将参考图像输入到图像编码器,通过图像编码器对该参考图像进行编码处理,得到图像编码结果;进而,将该图像编码结果输入到轻量级的查询变压器,并确定参考图像中的对象类别标签,以在轻量级的查询变压器中与对象类别标签进行融合,得到融合特征结果;最后,将该融合特征结果输入至大规模语言模型进行语言处理,以输出针对该参考图像的全局描述文本。Specifically, as shown in Figure 3, the first preset model can be structurally divided into two parts: Vision-and-Language Representation Learning and Vision-to-Language Generative Learning. The Vision-and-Language Representation Learning includes an Image Encoder and a lightweight Querying Transformer (Q-Former), while the Vision-to-Language Generative Learning includes Large Language Models (LLM). Taking the global description of a reference image as an example, in the first preset model, the reference image is first input into the image encoder, which encodes the reference image to obtain an image encoding result. Then, the image encoding result is input into the lightweight Querying Transformer, the object category labels in the reference image are determined, and the encoding result is fused with the object category labels in the lightweight Querying Transformer to obtain a fused feature result. Finally, the fused feature result is input into a large-scale language model for language processing to output a global description text for the reference image.
其中,该第二预设模型可以是用于对图像中包含对象的图像区域进行文本描述的模型,例如,该第二预设模型可以是图像区域到文本的生成转换器(Generative Region-to-Text Transformer,GRIT)。具体的,结合图4所示,该第二预设模型在结构上可以包括视觉编码器(Visual Encoder)、定位对象的前景物体提取器(Foreground Object Extractor)和文本解码器(Text Decoder)。其中,以对参考图像中每个对象所在的像素区域进行文本描述为例,首先,对参考图像中的每个对象进行识别,以确定参考图像中每个对象所在的像素区域,以及对参考图像中识别到的每个对象进行类别标示,然后,将已确定每个对象的像素区域和类别的参考图像输入至第二预设模型,在该第二预设模型中,通过第二预设模型针对每个对象所在的像素区域进行区域与语言之间的转化处理,输出得到转化处理后的参考图像,该转化处理后的参考图像中包含用于标注每个对象的标记框,以及每个标记框中的对象的对象描述文本。The second preset model can be a model for generating text descriptions of image regions containing objects. For example, the second preset model can be the Generative Region-to-Text Transformer (GRIT). Specifically, as shown in Figure 4, the second preset model can structurally include a Visual Encoder, a Foreground Object Extractor for locating objects, and a Text Decoder. Taking the description of the pixel region where each object is located in the reference image as an example, firstly, each object in the reference image is identified to determine the pixel region where each object is located, and each identified object in the reference image is categorized. Then, the reference image with the determined pixel region and category of each object is input into the second preset model. In the second preset model, region-to-language conversion processing is performed on the pixel region where each object is located, and the converted reference image is output. The converted reference image contains a bounding box labeling each object, and the object description text of the object in each bounding box.
进一步的,在得到参考图像的全局描述文本和对象描述文本后,可将该全局描述文本和对象描述文本合并,以获取针对参考图像的第一描述文本。Furthermore, after obtaining the global description text and object description text of the reference image, the global description text and object description text can be merged to obtain the first description text for the reference image.
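作为示意(非本申请方法的具体限定),以下 Python 草图给出合并全局描述文本与各对象描述文本、得到第一描述文本的一种简单实现,其中的拼接格式为本示例的假设。As an illustrative sketch only (the joining format is an assumption of this example, not fixed by the text above), the following Python merges the global description text with the object description texts to form the first description text:

```python
def build_first_description(global_text, object_descriptions):
    """Merge the global description text with the per-object description
    texts into the first description text of the reference image.
    The "; "-joined format is an assumption of this example."""
    parts = [global_text.strip()]
    parts.extend(d.strip() for d in object_descriptions)
    return "; ".join(p for p in parts if p)


first_text = build_first_description(
    "A brown dog is playing frisbee on the grass",
    ["brown dog: jumping toward the frisbee", "frisbee: red, in the air"],
)
```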
在本申请实施例中,关于相似图像的第二描述文本的获取方式可以参考上述“参考图像的第一描述文本”的获取过程,此处不做一一赘述。In the embodiments of this application, the method for obtaining the second descriptive text of similar images can refer to the above-described process for obtaining the first descriptive text of the reference image, and will not be repeated here.
通过以上方式,可确定参考图像与相似图像之间的差异情况,以便后续结合该两者图像之间的差异情况作为图像处理过程中的调整依据,以便更准确地对相似图像进行调整。By using the above methods, the differences between the reference image and similar images can be determined, so that the differences between the two images can be used as the basis for adjustment in the image processing process, so as to adjust the similar images more accurately.
103、确定参考图像中针对差异信息的目标掩码图。103. Determine the target mask image for the difference information in the reference image.
在本申请实施例中,在确定需要对相似图像进行调整后,可采用局部调整的方式来进行图像处理,为了实现对相似图像进行局部调整,需要获取参考图像中针对差异信息的掩码图,以利用掩码图来参与对相似图像的局部微调,以便后续提高图像调整时的精确性。In this embodiment of the application, after determining that similar images need to be adjusted, image processing can be performed by local adjustment. In order to achieve local adjustment of similar images, it is necessary to obtain a mask image of the difference information in the reference image, so as to use the mask image to participate in the local fine adjustment of similar images, so as to improve the accuracy of subsequent image adjustment.
其中,该目标掩码图可以是针对参考图像相对于相似图像的差异信息生成的像素区域遮掩图,用于在对相似图像进行微调时遮掩相似图像中的部分像素区域,以使得相似图像被遮掩的像素区域为空白(无内容)。The target mask map can be a pixel region mask map generated based on the difference information between the reference image and similar images. It is used to mask part of the pixel region in the similar image when fine-tuning the similar image, so that the masked pixel region of the similar image is blank (without content).
在一些实施方式中,可根据参考图像中相对于相似图像中的差异对象来构建目标掩码图。例如,步骤103可以包括:In some implementations, a target mask map may be constructed based on objects in a reference image that differ from those in similar images. For example, step 103 may include:
(103.1)根据差异信息,确定参考图像中相对于相似图像的差异对象;(103.1) Based on the difference information, identify the objects in the reference image that differ from similar images;
(103.2)基于参考图像中的差异对象,生成目标掩码图。(103.2) Generate a target mask based on the difference objects in the reference image.
具体的,为了获取参考图像中针对差异信息的目标掩码图,在得到差异信息后,可根据该差异信息来确定参考图像中相对于相似图像所包含的差异对象,以根据确定的差异对象来生成目标掩码图。需要说明的是,在生成目标掩码图时,首先,可构建与参考图像同样尺寸的初始掩码图,然后,以差异对象的相关信息作为指示信息,按照指示信息、初始掩码图和参考图像来生成目标掩码图。Specifically, to obtain a target mask map of the reference image based on the difference information, after obtaining the difference information, the difference objects contained in the reference image relative to similar images can be determined based on this difference information, and the target mask map can be generated based on the determined difference objects. It should be noted that when generating the target mask map, firstly, an initial mask map of the same size as the reference image can be constructed. Then, using the relevant information of the difference objects as indicator information, the target mask map is generated according to the indicator information, the initial mask map, and the reference image.
在一些实施方式中,可通过语义分割的方式来获取得到参考图像中针对差异信息的目标掩码图。例如,步骤(103.2)可以包括:构建与参考图像相同尺寸的初始掩码图;获取参考图像中针对差异对象的指示信息,该指示信息不限于包括差异对象的背景的像素区域、标记框和对象描述文本等;通过语义分割模型基于初始掩码图、指示信息和参考图像生成目标掩码图。In some implementations, a target mask for the difference information in the reference image can be obtained through semantic segmentation. For example, step (103.2) may include: constructing an initial mask of the same size as the reference image; obtaining indication information for the difference object in the reference image, which is not limited to including the pixel region of the difference object's background, bounding boxes, and object description text; and generating a target mask based on the initial mask, indication information, and reference image using a semantic segmentation model.
其中,该语义分割模型(Segment Anything)用于按照指示信息生成任意图像的掩码图。结合图5所示,该模型在结构上可包括图像编码器(image encoder)、卷积模块(conv)、融合模块、指示信息的编码器(prompt encoder)、掩码图的解码器(mask decoder);具体的,将参考图像输入到语义分割模型中,通过图像编码器对该参考图像进行编码处理,得到图像向量,同时,通过卷积模块(conv)对初始掩码图进行特征提取,得到掩码图向量;进而,将图像向量与掩码图向量进行融合处理,得到图像融合特征;此外,通过指示信息的编码器(prompt encoder)对指示信息进行编码处理,该指示信息可包含差异对象的背景的像素区域(point)、标记框(box)和对象描述文本(text),得到指示信息的编码特征,并通过掩码图的解码器(mask decoder)对图像融合特征和指示信息的编码特征进行解码处理,输出得到一个或多个预测掩码图;需要说明的是,当输出仅有一个预测掩码图时,将该预测掩码图作为目标掩码图;当输出有多个预测掩码图时,每个预测掩码图具有对应的评分,可选取评分最大的预测掩码图作为目标掩码图。The semantic segmentation model (Segment Anything) is used to generate a mask map for any image according to the indication information. As shown in Figure 5, the model structurally includes an image encoder, a convolutional module (conv), a fusion module, a prompt encoder for the indication information, and a mask decoder. Specifically, the reference image is input into the semantic segmentation model, and the image encoder encodes it to obtain an image vector; at the same time, the convolutional module (conv) extracts features from the initial mask map to obtain a mask vector. The image vector and the mask vector are then fused to obtain image fusion features. In addition, the prompt encoder encodes the indication information, which may include the pixel region (point) of the difference object's background, a bounding box (box), and object description text (text), yielding the encoded features of the indication information. The mask decoder then decodes the image fusion features together with the encoded features of the indication information, outputting one or more predicted masks. It should be noted that when only one predicted mask is output, it is taken as the target mask; when multiple predicted masks are output, each has a corresponding score, and the one with the highest score is selected as the target mask.
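作为示意,以下 Python 草图演示了上述“从一个或多个预测掩码图中确定目标掩码图”的选择逻辑(仅有一个时直接采用,多个时取评分最大者)。A minimal sketch of the selection logic described above for the mask decoder's outputs (a simplified stand-in, not the actual model code):

```python
import numpy as np


def select_target_mask(pred_masks, scores):
    """Return the target mask from the decoder's predictions: the only
    prediction if there is exactly one, otherwise the highest-scoring one."""
    if len(pred_masks) == 1:
        return pred_masks[0]
    return pred_masks[int(np.argmax(scores))]


masks = [np.zeros((4, 4)), np.ones((4, 4))]
best = select_target_mask(masks, [0.3, 0.9])  # picks the second mask
```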
为了便于理解目标掩码图的生成原理,以下将以场景示例来介绍该目标掩码图,具体为:示例性的,假设参考图像中包含两只橘猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的右上方区域奔跑;而相似图像中包含两只橘猫,第一只橘猫卧在草地上,第二只橘猫位于第一只橘猫的左上方区域奔跑;则差异信息包含在参考图像中的第二只橘猫的信息、以及位于第一只橘猫的左上方区域的草地的信息,如位置分布信息、尺寸信息、形状信息等。进而,在构建与参考图像大小相同的初始掩码图后,可针对以上得到的差异信息生成目标掩码图,该目标掩码图主要用于在图像处理时对相似图像中的第二只橘猫的像素区域进行掩码,以及对相似图像中位于第一只橘猫的右上方区域的草地的像素区域进行掩码,则在目标掩码图中,针对这两个像素区域用“0”表示,并在目标掩码图中用“1”表示除了以上两个区域外的所有像素区域。To facilitate understanding of the target mask generation principle, the following scenario example will be used to illustrate the target mask: For example, suppose a reference image contains two orange cats. The first orange cat is lying on the grass, and the second orange cat is running in the area above and to the right of the first orange cat. A similar image contains two orange cats, the first orange cat is lying on the grass, and the second orange cat is running in the area above and to the left of the first orange cat. The difference information includes information about the second orange cat in the reference image and information about the grass in the area above and to the left of the first orange cat, such as its location, size, and shape. Then, after constructing an initial mask of the same size as the reference image, a target mask can be generated based on the obtained difference information. This target mask is mainly used to mask the pixel area of the second orange cat in the similar image and the pixel area of the grass in the area above and to the right of the first orange cat in the similar image during image processing. In the target mask, these two pixel areas are represented by "0", and all pixel areas other than these two are represented by "1".
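作为示意,以下 Python 草图按上述场景构建目标掩码图:差异区域置 0,其余区域置 1;其中的矩形框坐标格式为本示例的假设。A minimal numpy sketch of building the 0/1 target mask in the scenario above; the box coordinate format is an assumption of this example:

```python
import numpy as np


def build_target_mask(image_shape, difference_boxes):
    """Build a target mask the same size as the reference image:
    0 inside each difference region, 1 everywhere else.
    Boxes are (row0, col0, row1, col1) -- an assumed format."""
    mask = np.ones(image_shape, dtype=np.uint8)  # initial mask, all ones
    for r0, c0, r1, c1 in difference_boxes:
        mask[r0:r1, c0:c1] = 0  # zero out the difference region
    return mask


# A 64x64 reference image with two difference regions, e.g. the second
# cat's box and the grass box it should move to (coordinates assumed).
m = build_target_mask((64, 64), [(0, 40, 20, 64), (0, 0, 20, 24)])
```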
通过以上方式,可获取参考图像相对于相似图像的差异信息的目标掩码图,利用目标掩码图作为调整相似图像时的必要元素,以参与对相似图像的局部微调,以便后续提高图像调整时的精确性。By using the above methods, a target mask image can be obtained that shows the difference between the reference image and similar images. This target mask image can then be used as a necessary element when adjusting similar images to participate in local fine-tuning of the similar images, thereby improving the accuracy of subsequent image adjustments.
104、基于差异信息进行扩充,得到差异描述文本。104. Expand upon the difference information to obtain the difference description text.
在本申请实施例中,为了实现对相似图像进行局部调整,除了需要获取参考图像中针对差异信息的掩码图外,还需要获取针对参考图像中的差异信息的相关描述文本,以便后续以该差异信息的相关描述文本作为图像调整的引导条件,以对相似图像进行的局部调整,提高图像调整的准确性。In this embodiment of the application, in order to make local adjustments to similar images, in addition to obtaining the mask image of the difference information in the reference image, it is also necessary to obtain the relevant descriptive text of the difference information in the reference image, so that the relevant descriptive text of the difference information can be used as the guiding condition for subsequent image adjustment, thereby improving the accuracy of image adjustment.
其中,该差异描述文本可以是描述参考图像相对于相似图像的内容差异的文本,其主要是基于差异信息进行文本上的扩充表述得到,可丰富、准确地表示参考图像与相似图像之间的差异。该差异描述文本可作为对相似图像调整的引导条件,以参与指示相似图像的局部调整,提高图像调整的准确性。The difference description text can be text describing the content differences between the reference image and similar images. It is mainly derived from textual expansion based on the difference information, which can enrich and accurately represent the differences between the reference image and similar images. This difference description text can serve as a guiding condition for adjusting similar images, participating in indicating local adjustments to similar images and improving the accuracy of image adjustments.
在一些实施方式中,由于差异信息可以反映参考图像对于相似图像之间的差异,为了获取用于充分描述两个图像之间的差异的描述文本,可以基于参考图像中差异对象与其他对象之间的关系、差异对象的对象描述文本进行扩充描述,以获取针对两个图像之间的差异的描述文本,即差异描述文本。例如,步骤104可以包括:In some implementations, since difference information can reflect the differences between a reference image and similar images, in order to obtain descriptive text that fully describes the differences between two images, the description can be expanded based on the relationship between the differing object and other objects in the reference image, and the object description text of the differing object, to obtain descriptive text for the differences between the two images, i.e., difference description text. For example, step 104 may include:
(104.1)根据差异信息,确定参考图像相对于相似图像的差异对象;(104.1) Based on the difference information, determine the objects of difference between the reference image and similar images;
(104.2)确定差异对象与参考图像中其他对象之间的对象关系信息;(104.2) Determine the object relationship information between the differing object and other objects in the reference image;
(104.3)获取参考图像的全局描述文本和差异对象的目标对象描述文本;(104.3) Obtain the global description text of the reference image and the target object description text of the difference object;
(104.4)基于全局描述文本、目标对象描述文本和对象关系信息进行文本扩充,得到差异描述文本。(104.4) Based on the global description text, the target object description text and the object relationship information, the text is expanded to obtain the difference description text.
其中,该对象关系信息可以是表示差异对象与任意一个对象之间关系的信息,该对象关系不限于包括差异对象与其他对象之间的位置分布关系(如距离、方向)、类别关系(是否属于同一物种属性)等。示例性的,参考图像的对象元素包括草地、一只橘猫和一只蓝猫,橘猫卧在参考图像中心位置的草地区域,蓝猫位于橘猫的右上方区域的草地;相似图像的对象元素同样包括草地、一只橘猫和一只蓝猫,但不同的是,相似图像中的蓝猫位于橘猫的左上方区域的草地;由此可得,参考图像相对于相似图像的差异对象至少可以包括蓝猫,以蓝猫作为差异对象为例,对象关系信息可以包括参考图像中蓝猫与橘猫之间的位置分布关系(如蓝猫位于橘猫的右上方区域)、蓝猫与橘猫之间的类别关系(蓝猫与橘猫属于同一物种类别)。The object relationship information can represent the relationship between the differing object and any other object. This object relationship is not limited to the positional distribution relationship (such as distance and direction) and category relationship (whether they belong to the same species) between the differing object and other objects. For example, the object elements of the reference image include grass, an orange cat, and a blue cat. The orange cat is lying in the grass area in the center of the reference image, and the blue cat is located in the grass area to the upper right of the orange cat. The object elements of the similar image also include grass, an orange cat, and a blue cat, but the difference is that the blue cat in the similar image is located in the grass area to the upper left of the orange cat. Therefore, the differing object of the reference image relative to the similar image can at least include the blue cat. Taking the blue cat as the differing object as an example, the object relationship information can include the positional distribution relationship between the blue cat and the orange cat in the reference image (such as the blue cat being located to the upper right of the orange cat) and the category relationship between the blue cat and the orange cat (the blue cat and the orange cat belong to the same species category).
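作为示意,以下 Python 草图用标记框中心点估计差异对象相对于其他对象的方位关系(如“右上方”),仅为对象关系信息中位置关系部分的一种简化示例,坐标格式为本示例的假设。A simplified sketch (an assumption of this example) of deriving a coarse positional relation, such as "upper right", from two bounding boxes:

```python
def relative_position(box_a, box_b):
    """Coarse position of object A relative to object B using box
    centers; boxes are (x0, y0, x1, y1) with y growing downward.
    A simplified stand-in for the positional relation information."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horiz = "right" if ax > bx else "left"
    vert = "upper" if ay < by else "lower"
    return f"{vert} {horiz}"


# the blue cat above and to the right of the orange cat
rel = relative_position((60, 10, 80, 30), (30, 50, 50, 70))
```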
具体的,为了能够对参考图像与相似图像之间的差异信息进行丰富的描述,首先,可确定参考图像中该差异信息所对应的差异对象,并获取该差异对象在参考图像中的与任意一个其他对象之间的对象关系信息,如确定当前的差异对象与图中其他对象的位置关系、是否属于同一对象类别(物体类别)等;进一步的,参考前述的“第一描述文本”的获取,获取到参考图像的全局描述文本,以及获取差异对象关联的对象描述文本;最后,基于参考图像的全局描述文本、差异对象的描述文本、以及差异对象相对于其他对象之间的对象关系信息,进行文本的扩充描述,以此,实现对参考图像与相似图像之间的差异信息进行充分的描述,得到扩充后的差异描述文本。该差异描述文本丰富地表述了该差异对象在参考图像中其他对象之间的关系以及差异对象的对象状态信息。如此,相对于对象描述文本,该差异描述文本能够更丰富地表示参考图像中差异对象的相关信息,以便后续在对相似图像进行调整时作为引导条件,实现准确地对相似图像进行局部区域的微调,具有可靠性。Specifically, to provide a rich description of the differences between the reference image and similar images, firstly, the object corresponding to the difference in the reference image is identified, and the object relationship information between this object and any other object in the reference image is obtained, such as determining the positional relationship between the current object and other objects in the image, and whether they belong to the same object category (object class). Further, referring to the aforementioned acquisition of the "first descriptive text," the global descriptive text of the reference image and the object descriptive text associated with the object are obtained. Finally, based on the global descriptive text of the reference image, the descriptive text of the object, and the object relationship information between the object and other objects, the text is expanded to provide a comprehensive description of the differences between the reference image and similar images, resulting in the expanded difference descriptive text. This difference descriptive text richly expresses the relationship between the object and other objects in the reference image, as well as the object state information of the object. Thus, compared to the object descriptive text, this difference descriptive text can more comprehensively represent the relevant information of the object in the reference image, serving as a guiding condition when adjusting similar images, enabling accurate and reliable fine-tuning of local areas in similar images.
示例性的,在对差异信息进行文本扩充时,可采用现有的大规模语言处理模型,具体的,可将对象关系信息、全局描述文本、差异对象的对象描述文本传输给大规模语言处理模型,使得大规模语言处理模型基于该对象关系信息,从全局描述文本和差异对象的对象描述文本中挖掘出相关信息,并进一步进行文本扩充,以生成一段包含差异对象的丰富描述的文本,即差异描述文本,以用作后续的图像调整的数据。For example, when augmenting the text of the difference information, an existing large-scale language processing model can be used. Specifically, the object relationship information, global description text, and object description text of the difference objects can be transmitted to the large-scale language processing model. Based on the object relationship information, the large-scale language processing model can mine relevant information from the global description text and the object description text of the difference objects, and further augment the text to generate a text containing a rich description of the difference objects, i.e., difference description text, which can be used as data for subsequent image adjustment.
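作为示意,以下 Python 草图演示了将对象关系信息、全局描述文本与差异对象的对象描述文本组织为大规模语言处理模型输入的一种方式;提示词的具体措辞为本示例的假设,并非本申请方法的限定。A sketch of assembling the inputs handed to the large language model for text expansion; the prompt wording is an assumption of this example:

```python
def build_expansion_prompt(global_text, object_text, relations):
    """Assemble the input handed to a large language model so that it
    can expand the difference information into the difference
    description text. The wording is an assumption of this example."""
    lines = [
        "Expand the following into a rich description of the differing object:",
        f"Global description: {global_text}",
        f"Differing object: {object_text}",
        "Object relations: " + "; ".join(relations),
    ]
    return "\n".join(lines)


prompt = build_expansion_prompt(
    "An orange cat rests on the grass while a blue cat runs and plays",
    "a blue cat running on the grass",
    ["the blue cat is in the upper-right area of the orange cat",
     "both cats belong to the same species category"],
)
```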
通过以上方式,可获取针对参考图像中的差异信息的丰富描述的文本,以便后续利用该差异描述文本作为引导条件,用于指示对相似图像进行的局部调整,提高图像调整的准确性。By using the above methods, rich descriptive text about the differences in the reference image can be obtained. This descriptive text can then be used as a guiding condition to indicate local adjustments to similar images, thereby improving the accuracy of image adjustments.
105、根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。105. Based on the target mask image, the difference description text, and the reference image, perform local adjustments on the similar images to obtain the adjusted target image.
在本申请实施例中,为了获得更符合参考图像的内容信息的目标图像,可在获得与参考图像相似的相似图像后,以相似图像作为待调整图像,进而,以目标掩码图、差异描述文本和参考图像等作为图像处理的引导条件,并基于引导条件来调整相似图像,以对相似图像的局部微调,生成一个更符合参考图像的内容需求的目标图像,具有可靠性。In this embodiment of the application, in order to obtain a target image that better matches the content information of the reference image, after obtaining a similar image that is similar to the reference image, the similar image can be used as the image to be adjusted. Then, the target mask image, the difference description text, and the reference image are used as guiding conditions for image processing. Based on the guiding conditions, the similar image is adjusted to make local fine adjustments to the similar image and generate a target image that better matches the content requirements of the reference image, which has reliability.
其中,该目标图像可以是当前数据库中并未存储有的图像,其主要是在相似图像的基础上进行局部微调得到的。具体的,从预设数据库中查找到与参考图像最相似的相似图像后,由于该相似图像实际上可能与参考图像存在一定的差异,此时,为了获取到与参考图像更匹配的相似图像,可将相似图像作为基础图像,根据参考图像以及针对差异的目标掩码图、差异描述文本对该基础图像作进一步微调,以获得与参考图像更相似的目标图像。The target image can be an image not currently stored in the database; it is primarily obtained by fine-tuning a similar image. Specifically, after finding the most similar image to the reference image from the preset database, since this similar image may actually have some differences from the reference image, in order to obtain a more closely matching similar image, the similar image can be used as a base image. Based on the reference image and the target mask image and difference description text, the base image is further fine-tuned to obtain a target image that is more similar to the reference image.
需要说明的是,在对相似图像进行局部区域的调整时,其主要是以目标掩码、差异描述文本和参考图像作为图像调整的引导条件。其中,目标掩码图主要作用是影响相似图像中属于目标像素区域的像素呈现,该目标像素区域可以是参考图像相对于相似图像存在差异或非差异对应的像素区域。其中,对相似图像的调整可以包括两部分,具体的,第一部分是以差异描述文本作为引导条件,结合目标掩码图对相似图像进行局部调整;第二部分是以参考图像作为引导(约束)条件,结合目标掩码图对相似图像进行局部调整。为了便于理解,以下将对相似图像的调整作具体描述。It should be noted that when adjusting local regions of similar images, the primary guiding conditions are the target mask, the difference description text, and the reference image. The target mask primarily influences the rendering of pixels belonging to the target pixel region within the similar image. This target pixel region can be a pixel region in the reference image that differs from or does not differ from the similar image. The adjustment of similar images can include two parts: first, using the difference description text as a guiding condition, and combining it with the target mask to perform local adjustments; second, using the reference image as a guiding (constraint) condition, and combining it with the target mask to perform local adjustments. For ease of understanding, the adjustment of similar images will be described in detail below.
在本申请实施例中,在对相似图像进行局部微调时,可先将对相似图像进行掩码处理,再对掩码处理结果进行微调处理,得到调整后的目标图像;此外,还可在微调处理过程中对图像进行掩码。需要说明的是,以上图像的微调处理过程不限于通过噪声处理方式来实现,具体可选的实施方式参见如下描述。In this embodiment, when performing local fine-tuning on similar images, the similar images can first be masked, and then the masking result can be fine-tuned to obtain the adjusted target image. Alternatively, the image can be masked during the fine-tuning process. It should be noted that the above image fine-tuning process is not limited to noise processing; specific optional implementation methods are described below.
(A)先对相似图像进行掩码处理,再对掩码处理结果进行微调处理:(A) First, perform masking on similar images, then fine-tune the masking results:
在一些实施方式中,可将差异描述文本和参考图像分别作为相似图像的引导条件,并分别对相似图像进行局部调整,以将两个调整结果融合得到目标图像。例如,步骤105可以包括:In some implementations, the difference description text and the reference image can be used as guiding conditions for similar images, and local adjustments can be made to the similar images respectively, so as to fuse the two adjustment results to obtain the target image. For example, step 105 may include:
(105.A.1)根据目标掩码图和差异描述文本,对相似图像进行局部微调,得到第一图像;(105.A.1) Based on the target mask image and the difference description text, the similar images are locally fine-tuned to obtain the first image;
(105.A.2)根据目标掩码图和参考图像,对相似图像进行局部微调,得到第二图像;(105.A.2) Based on the target mask image and the reference image, the similar image is locally fine-tuned to obtain the second image;
(105.A.3)将第一图像与第二图像进行融合,得到调整后的目标图像。(105.A.3) The first image and the second image are fused to obtain the adjusted target image.
其中,该第一图像可以是相似图像按照差异描述文本进行局部调整后得到的图像,该图像中的内容相对于相似图像存在区别,该第一图像中的部分区域存在空白像素区域,即无内容的区域,具体为与参考图像中差异信息对应的像素区域为空白像素区域。示例性的,假设参考图像和相似图像的内容都是包含蓝猫和橘猫在草地上玩耍,但这两个图像之间的差异是“蓝猫在图像中的位置”,因此,在按照差异描述文本对相似图像进行局部微调时,可按照参考图像中蓝猫所在的像素区域对相似图像中相同位置的像素区域进行调整,将该调整的像素区域定义为差异像素区域,得到第一图像,该第一图像中的差异像素区域会以空白像素区域代替,即无内容;此外,在对相似图像进行微调时,还可将该相似图像中原蓝猫所在的像素区域进行调整,以使得第一图像中原相似图像的蓝猫所在的像素区域以空白像素区域代替。又如,假设参考图像和相似图像包含的内容都是“一个盘子装有食物,刀叉餐具”,但该参考图像和相似图像之间的差异在于刀叉餐具的摆放位置,则在对相似图像进行局部调整时,按照参考图像中刀叉餐具所在的像素区域对相似图像中相同位置的像素区域进行调整,得到的第一图像中被调整的像素区域以空白像素区域代替。The first image can be obtained by locally adjusting a similar image according to the difference description text. The content of this first image differs from that of the similar image. Some areas of the first image contain blank pixel regions, i.e., regions without content. Specifically, the pixel regions corresponding to the difference information in the reference image are blank pixel regions. For example, suppose that both the reference image and the similar image contain a blue cat and an orange cat playing on the grass, but the difference between the two images is "the position of the blue cat in the image". Therefore, when making local fine-tuning of the similar image according to the difference description text, the pixel regions at the same position in the similar image can be adjusted according to the pixel region where the blue cat is located in the reference image. The adjusted pixel regions are defined as the difference pixel regions, resulting in the first image. The difference pixel regions in the first image will be replaced by blank pixel regions, i.e., without content. In addition, when fine-tuning the similar image, the pixel region where the original blue cat was located in the similar image can also be adjusted so that the pixel region where the original blue cat was located in the first image is replaced by blank pixel regions. As another example, suppose that the reference image and the similar image both contain the content "a plate with food, cutlery, and utensils", but the difference between the reference image and the similar image is the placement of the cutlery. When making local adjustments to the similar image, the pixel regions in the similar image at the same positions are adjusted according to the pixel regions where the cutlery is located in the reference image. The adjusted pixel regions in the resulting first image are replaced with blank pixel regions.
其中,第二图像可以是相似图像按照参考图像的引导条件进行局部调整得到的图像,该第二图像中大部分区域为空白像素区域,仅包含针对参考图像中差异信息对应的像素区域为非空白像素区域,该非空白像素区域中的像素组合所呈现的内容为参考图像中的差异信息对应的差异对象的图像内容。示例的,假设参考图像和相似图像的内容都是包含“蓝猫和橘猫在草地上玩耍”,但这两个图像之间的差异是“蓝猫在图像中的位置”,因此,在按照参考图像对相似图像进行局部微调时,可按照参考图像中蓝猫所在的像素区域对相似图像中相同位置的像素区域进行调整,得到的第二图像中非空白像素区域包含参考图像的蓝猫所在的像素区域,所呈现的内容为参考图像中的蓝猫以及该蓝猫的形状、体态、动作等信息,可以理解的是,该第二图像中不包含橘猫和草地的图像内容。又如,假设参考图像和相似图像包含的内容都是“一个盘子装有食物,刀叉餐具”,但该参考图像和相似图像之间的差异在于刀叉餐具的摆放位置,则在对相似图像进行局部调整时,得到的第二图像仅包含参考图像中的刀叉餐具的图像内容。The second image can be obtained by locally adjusting a similar image according to the guiding conditions of a reference image. Most of the area in this second image is blank pixel area, with only the pixel areas corresponding to the difference information in the reference image being non-blank pixel areas. The content presented by the pixel combinations in these non-blank pixel areas is the image content of the difference object corresponding to the difference information in the reference image. For example, suppose both the reference image and the similar image contain the content "a blue cat and an orange cat playing on the grass," but the difference between the two images is "the position of the blue cat in the image." Therefore, when locally fine-tuning the similar image according to the reference image, the pixel areas at the same position in the similar image can be adjusted according to the pixel area where the blue cat is located in the reference image. The resulting second image contains the pixel area where the blue cat is located in the reference image in its non-blank pixel areas, presenting the blue cat in the reference image along with its shape, posture, and actions. It can be understood that the second image does not contain the image content of the orange cat and the grass. As another example, suppose that both the reference image and the similar image contain the content "a plate with food, cutlery", but the difference between the reference image and the similar image is the placement of the cutlery. After making local adjustments to the similar image, the resulting second image will only contain the cutlery content from the reference image.
为了获取与参考图像更相似的目标图像,可包含图像调整和图像融合这两个阶段。具体的,在图像调整阶段,可包含差异描述文本引导和参考图像引导这两部分;其中,第一部分可以是以差异描述文本作为图像调整时的引导条件,结合目标掩码图对相似图像进行调整,以获取得到第一图像,使得该第一图像在差异信息对应的像素区域为空白像素区域,不包含图像内容;其中,第二部分可以是以参考图像作为图像调整的引导条件,并结合目标掩码图对相似图像进行调整,以获取第二图像,使得该第二图像仅包含差异信息对应的像素区域的图像内容。进一步的,将获得的第一图像和第二图像进行叠加融合,以使得第二图像中差异信息对应的像素区域的图像内容与第一图像中差异信息对应的空白像素区域叠加填充,实现第一图像和第二图像之间的图像内容互补,以获得经过局部微调后的目标图像,该目标图像相对于相似图像,其更符合参考图像的图像内容需求,与参考图像更相似。To obtain a target image that is more similar to a reference image, the process can include two stages: image adjustment and image fusion. Specifically, the image adjustment stage can include two parts: difference description text guidance and reference image guidance. The first part can use the difference description text as a guiding condition for image adjustment, combined with a target mask image, to adjust the similar image to obtain a first image where the pixel regions corresponding to the difference information are blank pixel regions, containing no image content. The second part can use the reference image as a guiding condition for image adjustment, combined with a target mask image, to adjust the similar image to obtain a second image where the second image only contains image content in the pixel regions corresponding to the difference information. Furthermore, the obtained first and second images are superimposed and fused, so that the image content in the pixel regions corresponding to the difference information in the second image overlaps and fills the blank pixel regions corresponding to the difference information in the first image, achieving image content complementarity between the first and second images. This results in a target image that has undergone local fine-tuning, and compared to similar images, better meets the image content requirements of the reference image and is more similar to it.
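作为示意,以下 Python 草图演示了上述第一图像与第二图像的叠加融合:目标掩码为 0 的差异区域取第二图像内容,其余区域保留第一图像内容;这仅是对融合阶段的简化示例。A minimal numpy sketch of the fusion stage described above (a simplified stand-in for the actual fusion):

```python
import numpy as np


def fuse_images(first_image, second_image, target_mask):
    """Where the target mask is 0 (the difference regions, left blank
    in the first image), take content from the second image; elsewhere
    keep the first image."""
    if first_image.ndim == 3:            # broadcast mask over channels
        target_mask = target_mask[..., None]
    return np.where(target_mask == 0, second_image, first_image)


first = np.full((4, 4), 7, dtype=np.uint8)     # blank filler value (assumed)
second = np.full((4, 4), 200, dtype=np.uint8)  # difference-object content
mask = np.ones((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 0                             # the difference region
target = fuse_images(first, second, mask)
```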
需要说明的是,在对相似图像进行局部微调时,可通过带有条件的隐空间扩散(Stable Diffusion,SD)模型来实现,具体的,将相似图像输入到带有条件的隐空间扩散模型中,通过该带有条件的隐空间扩散模型对相似图像进行噪声扩散处理,并在噪声扩散过程中引入引导条件进行辅助,以指示图像对相关像素区域进行精确微调,提高图像调整的准确性。It should be noted that when performing local fine-tuning on similar images, a conditional latent space diffusion (SD) model can be used. Specifically, similar images are input into the conditional latent space diffusion model, which performs noise diffusion processing on the similar images. During the noise diffusion process, guiding conditions are introduced to assist in instructing the image to make precise fine-tuning of relevant pixel regions, thereby improving the accuracy of image adjustment.
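作为示意,以下 Python 草图用 numpy 演示掩码引导的扩散修补中常见的单步混合操作:掩码为 1 的已知区域在每个去噪步骤被重置为对应时刻的已知图像,仅掩码为 0 的区域被重新生成;这只是常见做法的简化示例,并非对带条件隐空间扩散模型的具体实现。A simplified numpy sketch of one masked blending step commonly used in mask-guided diffusion inpainting (an illustrative assumption, not the actual SD implementation):

```python
import numpy as np


def masked_denoise_step(denoised_estimate, known_latent_t, mask):
    """One blending step of mask-guided inpainting: regions with
    mask == 1 are reset to the (noised) known image at this timestep,
    so only the mask == 0 regions are actually re-generated."""
    return mask * known_latent_t + (1 - mask) * denoised_estimate


denoised = np.full((4, 4), 0.5)   # placeholder for the model's estimate
known = np.full((4, 4), 0.9)      # noised similar image at timestep t
mask = np.ones((4, 4))
mask[1:3, 1:3] = 0                # region to re-generate
out = masked_denoise_step(denoised, known, mask)
```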
在一些实施方式中,步骤(105.A.1)可以包括:通过第一神经网络模型基于目标掩码图和差异描述文本,对相似图像进行局部微调,生成第一图像。步骤(105.A.2)可以包括:通过第二神经网络模型基于目标掩码图和参考图像,对相似图像进行局部微调,生成第二图像。其中,该第一神经网络模型和第二神经网络模型都是带有条件的隐空间扩散(Stable Diffusion,SD)模型。In some implementations, step (105.A.1) may include: using a first neural network model to locally fine-tune the similar image based on the target mask and the difference description text to generate the first image. Step (105.A.2) may include: using a second neural network model to locally fine-tune the similar image based on the target mask and the reference image to generate the second image. Both the first and second neural network models are conditional latent space diffusion (Stable Diffusion, SD) models.
在一些实施方式中,为了实现通过带有条件的隐空间扩散(Stable Diffusion,SD)模型来对相似图像进行局部微调,需要对带有条件的隐空间扩散模型进行训练,以分别得到用于图像微调的第一神经网络模型和第二神经网络模型。In some implementations, in order to achieve local fine-tuning of similar images using a conditional latent space diffusion (SD) model, the conditional latent space diffusion model needs to be trained to obtain a first neural network model and a second neural network model for image fine-tuning, respectively.
例如,以第一神经网络模型的训练为例,在步骤(105.A.1)之前,还可以包括:获取样本参考图像和样本相似图像,以及第一样本目标图像;基于样本参考图像与样本相似图像之间的差异信息,生成样本参考图像的样本目标掩码图和样本差异描述文本;通过预设模型基于样本目标掩码图、样本差异描述文本,对样本相似图像进行局部微调,生成第一预测图像;根据第一样本目标图像与第一预测图像之间的差异,确定预测损失;基于预测损失对预设模型进行迭代训练,直至达到预设条件,得到第一神经网络模型。For example, taking the training of the first neural network model as an example, before step (105.A.1), it may also include: acquiring a sample reference image and a sample similar image, as well as a first sample target image; generating a sample target mask map and sample difference description text of the sample reference image based on the difference information between the sample reference image and the sample similar image; performing local fine-tuning on the sample similar image based on the sample target mask map and the sample difference description text using a preset model to generate a first prediction image; determining the prediction loss based on the difference between the first sample target image and the first prediction image; and iteratively training the preset model based on the prediction loss until the preset conditions are met to obtain the first neural network model.
又如,在步骤(105.A.2)之前,还可以包括:获取样本参考图像和样本相似图像,以及第二样本目标图像;基于样本参考图像与样本相似图像之间的差异信息,生成样本参考图像的样本目标掩码图,并获取与样本目标掩码图相反的样本目标反掩码图;通过预设模型基于样本目标反掩码图和参考图像,对样本相似图像进行局部微调,生成第二预测图像;根据第二样本目标图像与第二预测图像之间的差异,确定预测损失;基于预测损失对预设模型进行迭代训练,直至达到预设条件,得到第二神经网络模型。For example, before step (105.A.2), the method may further include: acquiring a sample reference image and a sample similar image, as well as a second sample target image; generating a sample target mask image of the sample reference image based on the difference information between the sample reference image and the sample similar image, and acquiring a sample target inverse mask image that is the opposite of the sample target mask image; performing local fine-tuning on the sample similar image based on the sample target inverse mask image and the reference image using a preset model to generate a second prediction image; determining the prediction loss based on the difference between the second sample target image and the second prediction image; and iteratively training the preset model based on the prediction loss until a preset condition is met to obtain a second neural network model.
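上述两段所述的迭代训练流程可用如下PyTorch示意代码表示(仅为示意:PresetModel是带条件隐空间扩散模型的玩具替身,真实模型在隐空间中基于交叉注意力引入掩码与文本/参考图像条件来预测噪声,此处张量均为随机占位数据):The iterative training loop described in the two paragraphs above can be sketched in PyTorch as follows (an illustrative sketch only: `PresetModel` is a toy stand-in for the conditional latent diffusion model, which in reality predicts noise in latent space with cross-attention conditioning on the mask and text/reference image; all tensors here are dummy placeholders):

```python
import torch
import torch.nn as nn

class PresetModel(nn.Module):
    """Toy stand-in for the preset conditional latent-diffusion model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, similar, mask, condition):
        # Condition handling is elided; a real SD model injects the
        # difference-text / reference-image condition via cross-attention.
        return self.net(similar * mask)

model = PresetModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Dummy sample batch: sample similar image, sample target mask, sample target image.
similar = torch.rand(1, 3, 32, 32)
mask = torch.ones(1, 1, 32, 32)
target = torch.rand(1, 3, 32, 32)

for step in range(3):                  # iterate "until a preset condition" in practice
    pred = model(similar, mask, condition=None)
    loss = loss_fn(pred, target)       # prediction loss: sample target vs. prediction
    opt.zero_grad()
    loss.backward()
    opt.step()
```

第二神经网络模型的训练同理,仅将引导条件换为样本目标反掩码图与参考图像。Training the second neural network model follows the same loop, swapping the condition for the sample target inverse mask and the reference image.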
在一些实施方式中,相似图像的局部微调主要是在图像的噪声扩散中引入引导条件,以在噪声扩散中指示对相似图像的目标区域进行局部微调,提高图像调整时的准确性。例如,以差异描述文本作为图像调整的引导条件,通过第一神经网络模型对相似图像进行调整为例,步骤(105.A.1)可以包括:In some implementations, local fine-tuning of similar images primarily involves introducing guiding conditions within the noise diffusion of the image to instruct local fine-tuning of target regions in the similar images, thereby improving the accuracy of image adjustment. For example, using the difference description text as the guiding condition for image adjustment, and adjusting similar images through a first neural network model, step (105.A.1) may include:
(105.A.1.1)根据目标掩码图对相似图像进行掩码处理,得到第一相似掩码图像;(105.A.1.1) Perform masking processing on similar images based on the target mask image to obtain the first similarity mask image;
(105.A.1.2)对第一相似掩码图像进行加噪处理,得到第一相似噪声图;(105.A.1.2) Add noise to the first similarity mask image to obtain the first similarity noise map;
(105.A.1.3)获取差异描述文本对应的差异文本向量;(105.A.1.3) Obtain the difference text vector corresponding to the difference description text;
(105.A.1.4)根据差异文本向量对第一相似噪声图进行减噪处理,得到第一特征图;(105.A.1.4) Perform noise reduction processing on the first similarity noise map based on the difference text vector to obtain the first feature map;
(105.A.1.5)对第一特征图进行解码处理,得到第一图像。(105.A.1.5) Decode the first feature map to obtain the first image.
需要说明的是,目标掩码图的作用为将相似图像中与参考图像的差异对象所在区域的相同位置的目标像素区域进行掩盖,以阻拦相似图像中针对该目标像素区域中的像素的表征,因此,可在通过第一神经网络模型对相似图像进行调整之前,首先,根据目标掩码图对相似图像进行掩码处理,该掩码处理过程可以是将目标掩码图与相似图像之间进行相乘,以使得目标掩码图中每个数值都与相似图像中对应的像素进行相乘,得到第一相似掩码图像。然后,将该第一相似掩码图像输入到第一神经网络模型中,以通过第一神经网络模型对第一相似掩码图进行局部微调,为了便于理解,可结合图6所示,对该第一神经网络模型的图像调整过程进行介绍,具体为:第一神经网络模型对第一相似掩码图像进行编码处理,并将编码结果引入到隐空间中,在隐空间中对编码结果进行前向扩散,该前向扩散可以理解为加噪处理的过程,以获取第一相似噪声图。It should be noted that the purpose of the target mask image is to mask the target pixel regions of the similar image at the same locations as the regions where the difference objects relative to the reference image are located, thereby blocking the representation of pixels in those target pixel regions of the similar image. Therefore, before adjusting the similar image through the first neural network model, the similar image is first masked according to the target mask image. This masking process can be achieved by multiplying the target mask image and the similar image, so that each value in the target mask image is multiplied by the corresponding pixel in the similar image, resulting in a first similarity mask image. Then, this first similarity mask image is input into the first neural network model to perform local fine-tuning of the first similarity mask image. For ease of understanding, the image adjustment process of the first neural network model can be described with reference to Figure 6. Specifically, the first neural network model encodes the first similarity mask image and introduces the encoding result into the latent space. The encoding result is then forward diffused in the latent space. This forward diffusion can be understood as a noise addition process to obtain a first similarity noise map.
Next, before backdiffusion of the first similarity noise map, the difference description text used as a guiding condition is text-encoded to obtain a difference text vector, thus introducing the difference description text into the latent space. Then, backdiffusion (noise reduction) is performed on the first similarity noise map in the latent space, and during backdiffusion, the difference text vector is integrated into the noise of the first similarity noise map for backdiffusion together to obtain the first feature map. Finally, the first feature map is decoded to recover the first feature map in the latent space, thus obtaining the first image.
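上文所述"掩码图中每个数值与对应像素相乘"的掩码处理可用如下示意代码表示(仅为示意,函数名为示例性命名):The element-wise masking just described (each mask value multiplied by the corresponding pixel) can be sketched as follows (an illustrative sketch; the function name is hypothetical):

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise multiply: pixels under mask value 0 are blocked,
    pixels under mask value 1 pass through unchanged."""
    assert image.shape[:2] == mask.shape[:2]
    if mask.ndim == 2 and image.ndim == 3:
        mask = mask[..., None]        # broadcast the mask over channels
    return image * mask

img = np.arange(12, dtype=float).reshape(2, 2, 3)   # toy 2x2 RGB image
m = np.array([[1.0, 0.0],
              [0.0, 1.0]])                          # target mask
masked = apply_mask(img, m)
```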
需要说明的是,在通过第一神经网络模型对第一相似噪声图进行局部微调时,主要是包括前向扩散和反向扩散两个过程,其中,该前向扩散具体是对图像进行噪声化的处理过程,可以理解为逐步加噪处理的过程,而该反向扩散过程是对噪声图进行降噪处理的过程,可以理解为逐步减噪的过程。It should be noted that when the first similar noise map is locally fine-tuned through the first neural network model, two processes are mainly involved: forward diffusion and reverse diffusion. Forward diffusion is the process of noising the image, which can be understood as gradually adding noise, while reverse diffusion is the process of denoising the noise map, which can be understood as gradually removing noise.
在一些实施方式中,为了对相似图像进行前向扩散,需要将相似图像转化为向量,以导入到隐空间中,并在隐空间中进行前向扩散处理。例如,以差异描述文本作为引导条件的模型的前向扩散为例,步骤(105.A.1.2)可以包括:对第一相似掩码图像进行编码处理,得到编码特征图;对编码特征图进行噪声处理,得到第一相似噪声图。In some implementations, in order to perform forward diffusion on the similar image, it needs to be converted into vectors and imported into the latent space, where forward diffusion is performed. For example, taking the forward diffusion of the model guided by the difference description text as an example, step (105.A.1.2) may include: encoding the first similarity mask image to obtain an encoded feature map; and adding noise to the encoded feature map to obtain a first similarity noise map.
具体的,在将第一相似掩码图像输入至第一神经网络模型后,该第一神经网络模型在像素空间中对第一相似掩码图像进行图的特征编码处理,得到编码特征图,该编码特征图为第一相似掩码图像的向量特征矩阵;进而,将该编码特征图传输至隐空间中,在隐空间中对编码特征图进行前向扩散,该前向扩散过程为加噪处理过程,主要是对编码特征图进行逐步加噪处理,并经过多个时间步的加噪处理,实现对第一相似掩码图进行完全噪声化处理,得到全部噪声化的第一相似噪声图。Specifically, after the first similarity mask image is input into the first neural network model, the first neural network model performs feature encoding on the first similarity mask image in the pixel space to obtain an encoded feature map, which is the vector feature matrix of the first similarity mask image. Then, the encoded feature map is transmitted to the latent space, where it is forward diffused. This forward diffusion process is a noise addition process, which mainly involves gradually adding noise to the encoded feature map over multiple time steps, fully noising the first similarity mask image and resulting in a fully noised first similarity noise map.
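上述多时间步逐步加噪在DDPM类扩散模型中有标准的闭式表达,可用如下示意代码表示(仅为示意,线性beta调度与张量形状均为示例性假设,非本申请所限定):The step-by-step noising over T time steps admits a standard closed-form expression in DDPM-style diffusion, sketched below (an illustrative sketch; the linear beta schedule and tensor shapes are assumptions, not specified by this application):

```python
import numpy as np

def forward_diffuse(x0: np.ndarray, t: int, betas: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Closed-form forward step: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]          # cumulative signal retention
    eps = rng.standard_normal(x0.shape)        # Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # common linear schedule
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 4))               # encoded feature map in latent space
z_T = forward_diffuse(z0, T - 1, betas, rng)   # fully noised at the last time step
```

经过T个时间步后,累积保留系数趋近于0,特征图被完全噪声化。After T time steps the cumulative retention coefficient approaches 0, so the feature map is fully noised.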
在一些实施方式中,经过前向的加噪处理得到第一相似噪声图后,对第一相似噪声图进行反向的降(减)噪处理,并在降噪处理过程中结合引导条件对应的特征向量,以得到降噪后的噪声图。例如,以差异描述文本对应的差异文本向量作为引导条件为例,步骤(105.A.1.4)可以包括:对第一相似噪声图进行连续多次降噪处理,并通过注意力机制在每次降噪处理过程中融入差异文本向量,得到第一特征图。In some implementations, after obtaining a first similar noise map through forward noise addition, a reverse noise reduction process is performed on the first similar noise map, and the feature vector corresponding to the guiding condition is combined during the noise reduction process to obtain a noise map after noise reduction. For example, taking the difference text vector corresponding to the difference description text as the guiding condition, step (105.A.1.4) may include: performing multiple consecutive noise reduction processes on the first similar noise map, and incorporating the difference text vector in each noise reduction process through an attention mechanism to obtain a first feature map.
具体的,在对完全噪声化的第一相似噪声图进行反向扩散处理时,需要对第一相似噪声图进行多个回合(时间步)的减噪处理,并在每个减噪处理过程中加入差异描述文本对应的差异文本向量,实现对相似图像的准确微调,直至经历预设数量个时间步的减噪,获得第一特征图。Specifically, when performing reverse diffusion on the fully noised first similar noise map, the first similar noise map needs to undergo multiple rounds (time steps) of noise reduction, and the difference text vector corresponding to the difference description text is added in each round to achieve accurate fine-tuning of the similar image, until a preset number of time steps of noise reduction have been performed and the first feature map is obtained.
示例性的,为了便于理解每个回合的减噪处理过程,以第一回合的减噪处理过程为例,由于第一相似噪声图是完全噪声化的噪声图,假设经历过T个时间步的加噪处理,则该第一相似噪声图可以表示为"ZT"或"XT",在第一回合的减噪处理时,是对第一相似噪声图"ZT"进行减噪处理,并在减噪处理时通过注意力机制融入差异文本向量。结合图6,每个回合的减噪处理过程都是在减噪网络层(Denoising U-Net)中实现,具体参见图7,该减噪网络层(Denoising U-Net)在结构上包括残差网络层(ResNet)和注意力模块,其中,该残差网络层(ResNet)主要用于特征提取,实现逐步减噪的过程,以及注意力模块用于对引导条件对应的特征向量(如差异文本向量)与噪声特征图进行融合,实现对图像微调中的指示和引导。For example, to facilitate understanding of the noise reduction process in each round, take the first round as an example. Since the first similar noise map is a fully noised map, assuming it has undergone T time steps of noise addition, this first similar noise map can be represented as "Z<sub>T</sub>" or "X<sub>T</sub>". In the first round of noise reduction, the first similar noise map "Z<sub>T</sub>" is subjected to noise reduction, and the difference text vector is incorporated through an attention mechanism during the noise reduction process. Referring to Figure 6, the denoising process in each round is implemented in the Denoising U-Net layer. See Figure 7 for details. The Denoising U-Net layer includes a Residual Network layer (ResNet) and an attention module. The Residual Network layer (ResNet) is mainly used for feature extraction to realize the gradual noise reduction process, and the attention module is used to fuse the feature vectors corresponding to the guiding conditions (such as the difference text vector) with the noise feature map to realize the indication and guidance in image fine-tuning.
具体的,该减噪网络层(Denoising U-Net)可由两个残差网络层和两个注意力模块组成,在结构上具体为"残差网络层-注意力模块-残差网络层-注意力模块",以下将结合减噪网络层的具体结构对一个减噪回合的减噪处理过程进行介绍,具体如下:首先,通过第一个残差网络层对第一相似噪声图进行特征提取,将特征提取的第一特征结果传输至第一个注意力模块,以及将差异文本向量传输至第一个注意力模块,通过第一个注意力模块将差异文本向量融入到第一特征结果中,例如,可以是通过注意力机制对差异文本向量进行注意力计算,并将注意力计算结果与第一特征结果进行融合,得到第一初始融合结果。进一步的,通过第二个残差网络层对第一初始融合结果进行特征提取,得到第二特征结果,并将第二特征结果传输至第二个注意力模块,以及将差异文本向量传输至第二个注意力模块,通过第二个注意力模块将差异文本向量融入到第二特征结果中,得到第一融合噪声图。Specifically, the denoising U-Net consists of two residual network layers and two attention modules, structurally arranged as "residual network layer - attention module - residual network layer - attention module". The following describes the denoising process of one round based on the specific structure of the denoising network layer: First, the first residual network layer extracts features from the first similar noise map. The extracted first feature result is then passed to the first attention module, along with the difference text vector. The first attention module integrates the difference text vector into the first feature result. For example, attention can be computed over the difference text vector using an attention mechanism, and the attention result is fused with the first feature result to obtain the first initial fusion result. Further, the second residual network layer extracts features from the first initial fusion result to obtain the second feature result. This second feature result is then passed to the second attention module, along with the difference text vector. The second attention module integrates the difference text vector into the second feature result to obtain the first fused noise map.
按照以上示例,经过多回合(“T-1”次)的减噪处理,得到最终的第“T-1”的融合噪声图,该第“T-1”的融合噪声图就是第一特征图。Following the example above, after multiple rounds ("T-1" times) of noise reduction processing, the final "T-1"th fused noise map is obtained, which is the first feature map.
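上述"残差网络层-注意力模块-残差网络层-注意力模块"的单回合结构可用如下PyTorch示意代码表示(仅为示意:维度、头数以及用线性层简化的残差块均为示例性假设,真实SD U-Net使用卷积残差块与空间交叉注意力):The single-round "residual layer - attention - residual layer - attention" structure above can be sketched in PyTorch as follows (an illustrative sketch: the dimensions, head count, and the linear residual block are assumptions; a real SD U-Net uses convolutional ResNet blocks and spatial cross-attention):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Fuses the guiding condition (e.g. the difference text vector) into the
    noise features: queries come from image features, keys/values from text."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feat, cond):
        out, _ = self.attn(query=feat, key=cond, value=cond)
        return feat + out                      # residual fusion

class ResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)

class DenoisingStep(nn.Module):
    """One round: ResNet -> attention -> ResNet -> attention."""
    def __init__(self, dim):
        super().__init__()
        self.res1, self.attn1 = ResBlock(dim), CrossAttention(dim)
        self.res2, self.attn2 = ResBlock(dim), CrossAttention(dim)

    def forward(self, noisy_feat, text_vec):
        h = self.attn1(self.res1(noisy_feat), text_vec)
        return self.attn2(self.res2(h), text_vec)

dim = 32
step = DenoisingStep(dim)
noisy = torch.randn(1, 16, dim)                # flattened latent tokens
text = torch.randn(1, 8, dim)                  # difference text vector tokens
out = step(noisy, text)
```

将该单回合结构重复"T-1"次即对应上文多回合减噪得到第一特征图的过程。Repeating this single-round structure "T-1" times corresponds to the multi-round denoising above that yields the first feature map.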
进一步的,在得到第一特征图后,对该第一特征图的解码过程,具体可以是:通过解码模块对隐空间中的第一特征图进行解码处理,以在像素空间中将第一特征图恢复为像素矩阵,以使得第一神经网络输出局部微调后的第一图像。Furthermore, after obtaining the first feature map, the decoding process of the first feature map can specifically be as follows: the first feature map in the latent space is decoded by the decoding module to restore the first feature map into a pixel matrix in the pixel space, so that the first neural network outputs the locally fine-tuned first image.
在一些实施方式中,以参考图像作为图像调整的引导条件,通过第二神经网络模型对相似图像进行调整。例如,步骤(105.A.2)可以包括:对目标掩码图进行取反,得到目标掩码图对应的目标反掩码图;根据目标反掩码图对相似图像进行掩码处理,得到第二相似掩码图像;对第二相似掩码图像进行加噪处理,得到第二相似噪声图,其中,所述第二相似噪声图的时间步与所述第一相似噪声图的时间步相邻;根据参考图像对应的特征图对第二相似噪声图进行减噪处理,得到第二特征图;对第二特征图进行解码处理,得到第二图像。In some implementations, a reference image is used as a guiding condition for image adjustment, and similar images are adjusted using a second neural network model. For example, step (105.A.2) may include: inverting the target mask image to obtain a target inverse mask image corresponding to the target mask image; performing masking processing on similar images based on the target inverse mask image to obtain a second similar mask image; adding noise to the second similar mask image to obtain a second similar noise image, wherein the time step of the second similar noise image is adjacent to the time step of the first similar noise image; performing noise reduction processing on the second similar noise image based on the feature map corresponding to the reference image to obtain a second feature map; and decoding the second feature map to obtain a second image.
需要说明的是,以参考图像作为引导条件对相似图像局部微调的目的是:在经过图像微调后,获取包含针对差异对象的表征的第二图像;由于目标掩码图的作用为将相似图像中与参考图像的差异对象所在区域的相同的目标像素区域进行掩盖,对此,需要将目标掩码图进行取反处理,该取反处理过程是将目标掩码图中原来的“0”置换为“1”,同时,将目标掩码图中原来的“1”置换为“0”,以获得目标反掩码图,该目标反掩码图的作用为允许差异对象所在的目标像素区域中的像素的表征,且拒绝非目标像素区域的其他像素的表征。It should be noted that the purpose of using the reference image as a guiding condition to fine-tune the local area of the similar image is to obtain a second image containing representations of the differing objects after image fine-tuning. Since the target mask image is used to cover up the same target pixel regions in the similar image where the differing objects are located compared to the reference image, it is necessary to invert the target mask image. This inversion process involves replacing the original "0"s in the target mask image with "1"s, and simultaneously replacing the original "1"s with "0"s to obtain the target inverse mask image. The target inverse mask image allows the representation of pixels in the target pixel regions where the differing objects are located, while rejecting the representation of other pixels in non-target pixel regions.
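上述"0置换为1、1置换为0"的取反处理可用如下示意代码表示(仅为示意,掩码取值约定:目标掩码图在差异对象区域取0):The 0-to-1 / 1-to-0 inversion just described can be sketched as follows (an illustrative sketch; the convention assumed here is that the target mask is 0 over the difference-object region):

```python
import numpy as np

def invert_mask(mask: np.ndarray) -> np.ndarray:
    """Swap every 0 and 1: the inverse mask keeps only the
    difference-object region and rejects everything else."""
    return 1 - mask

target_mask = np.array([[1, 1, 0],
                        [1, 0, 0]])    # 0 over the difference-object region
inverse_mask = invert_mask(target_mask)
```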
进一步的,在获得目标反掩码图后,首先,根据目标反掩码图对相似图像进行掩码处理,该掩码处理过程可以是将目标反掩码图与相似图像之间进行相乘,以使得目标反掩码图中每个数值都与相似图像中对应的像素进行相乘,得到第二相似掩码图像。然后,将该第二相似掩码图像输入到第二神经网络模型中,以通过第二神经网络模型对第二相似掩码图进行局部微调,为了便于理解,可结合图6所示,对该第二神经网络模型的图像调整过程进行介绍,具体为:第二神经网络模型对第二相似掩码图像进行编码处理,并将编码结果引入到隐空间中,在隐空间中对编码结果进行前向扩散,该前向扩散可以理解为加噪处理的过程,以获取第二相似噪声图。接着,在对第二相似噪声图进行反向扩散之前,对作为引导条件的参考图像进行图像编码处理,以获得参考图像对应的特征图(即向量矩阵),实现将参考图像对应的特征图引入隐空间中;进而,在隐空间中对第二相似噪声图进行反向扩散处理,并在反向扩散处理时,将参考图像对应的特征图融入第二相似噪声图的噪声中进行一起反向扩散,以获得第二特征图。最后,对第二特征图进行解码处理,以将隐空间中第二特征图恢复至像素空间进行表征,以输出得到第二图像。Furthermore, after obtaining the target inverse mask image, firstly, the similar image is masked based on the target inverse mask image. This masking process can involve multiplying the target inverse mask image and the similar image so that each value in the target inverse mask image is multiplied by the corresponding pixel in the similar image, resulting in a second similarity mask image. Then, this second similarity mask image is input into the second neural network model to perform local fine-tuning of the second similarity mask image. For ease of understanding, the image adjustment process of the second neural network model can be described with reference to Figure 6. Specifically, the second neural network model encodes the second similarity mask image and introduces the encoding result into the latent space. The encoding result is then forward diffused in the latent space. This forward diffusion can be understood as a noise-adding process to obtain a second similarity noise map. Next, before reverse diffusion of the second similar noise map, image encoding is performed on the reference image used as a guiding condition to obtain the feature map (i.e., vector matrix) corresponding to the reference image, thus introducing the feature map corresponding to the reference image into the latent space.
Then, reverse diffusion is performed on the second similar noise map in the latent space, and during the reverse diffusion process, the feature map corresponding to the reference image is integrated into the noise of the second similar noise map and reverse-diffused together to obtain the second feature map. Finally, the second feature map is decoded to restore the second feature map in the latent space to the pixel space for representation, so as to output the second image.
需要说明的是,以参考图像的特征图作为约束条件时,对第二相似噪声图进行反向扩散时,其具体处理过程与上述"在减噪网络层(Denoising U-Net)对第一相似噪声图进行反向扩散处理过程"的步骤相同,区别仅在于引导条件由"差异描述文本的差异文本向量"替换为"参考图像的特征图",该反向扩散过程具体可以结合图6和图7以及前述内容进行理解,此处不做一一赘述。It should be noted that when the feature map of the reference image is used as the constraint, the specific procedure for reverse diffusion of the second similar noise map is the same as the above steps for reverse diffusion of the first similar noise map in the denoising U-Net layer; the only difference is that the guiding condition is the "feature map of the reference image" instead of the "difference text vector of the difference description text". This reverse diffusion process can be understood in conjunction with Figures 6 and 7 and the aforementioned content, and will not be elaborated here.
(B)在微调处理过程中对图像进行掩码处理:(B) Masking the image during fine-tuning:
需要说明的是,在对相似图像进行局部微调时,可通过带有条件的隐空间扩散(Stable Diffusion,SD)模型来实现,具体的,将相似图像和目标掩码图输入到带有条件的隐空间扩散模型中,通过该带有条件的隐空间扩散模型对相似图像进行噪声扩散处理,并在噪声扩散过程中引入引导条件(如差异描述文本和参考图像)进行辅助,以指示图像对相关像素区域进行精确微调,提高图像调整的准确性。It should be noted that when performing local fine-tuning on similar images, a conditional latent space diffusion (SD) model can be used. Specifically, the similar image and the target mask image are input into the conditional latent space diffusion model, which performs noise diffusion processing on the similar image. During the noise diffusion process, guiding conditions (such as difference description text and reference images) are introduced to assist in instructing the image to make precise fine-tuning of relevant pixel regions, thereby improving the accuracy of image adjustment.
在一些实施方式中,先对需要微调的相似图像进行连续多个时间步的噪声化处理,获取时间步相邻的两个噪声图,并根据目标掩码图、差异描述文本和参考图像对噪声图像进行微调,以将微调结果融合得到目标图像。例如,步骤105可以包括:In some implementations, the similar images requiring fine-tuning are first subjected to noise processing at multiple consecutive time steps to obtain two noise images at adjacent time steps. The noise images are then fine-tuned based on the target mask image, difference description text, and a reference image, and the fine-tuning results are fused to obtain the target image. For example, step 105 may include:
(105.B.1)对所述相似图像进行加噪处理,并获取所述加噪处理中相邻时间步的第一相似噪声图和第二相似噪声图;(105.B.1) Noise is added to the similar images, and a first similar noise map and a second similar noise map at adjacent time steps are obtained in the noise addition process;
(105.B.2)根据所述目标掩码图和所述差异描述文本,对所述第一相似噪声图进行解噪处理,得到第一图像;(105.B.2) Based on the target mask image and the difference description text, the first similarity noise image is denoised to obtain the first image;
(105.B.3)根据所述目标掩码图和所述参考图像,对所述第二相似噪声图进行解噪处理,得到第二图像;(105.B.3) Based on the target mask image and the reference image, the second similarity noise image is denoised to obtain the second image;
(105.B.4)将所述第一图像与所述第二图像进行融合,得到调整后的目标图像。(105.B.4) The first image and the second image are fused to obtain the adjusted target image.
具体的,可通过带有条件的隐空间扩散(Stable Diffusion,SD)模型来实现对相似图像的局部微调,将相似图像和目标掩码传输给该扩散模型,扩散模型将相似图像经过编码处理得到特征图,并将该特征图导入到隐(潜)空间中进行前向扩散,该前向扩散过程为经历多个时间步的逐渐加噪处理过程,每个时间步视为一次加噪处理,直至得到完全噪声化的噪声图,进而,可取完全噪声化的噪声图和相邻的前一时间步的噪声图,具体可将该相邻的前一时间步的噪声图作为第一相似噪声图,以完全噪声化的噪声图作为第二相似噪声图;进一步的,基于目标掩码图和差异描述文本,对第一相似噪声图进行反向扩散处理,该反向扩散处理过程为连续多次减噪处理,以获取得到第一图像,以及基于目标掩码图和参考图像,对第二相似噪声图进行反向扩散处理,以获取得到第二图像。最后,将第一图像和第二图像进行融合,得到调整后的目标图像。Specifically, a conditional latent space diffusion (SD) model can be used to fine-tune similar images locally. The similar images and the target mask are fed to the diffusion model, which encodes the similar images to obtain feature maps. These feature maps are then imported into the latent space for forward diffusion. This forward diffusion process involves gradual noise addition over multiple time steps, with each time step considered as one noise addition operation, until a fully noisy image is obtained. Then, the fully noisy image and the noise image from the adjacent previous time step are taken. Specifically, the noise image from the adjacent previous time step can be used as the first similar noise image, and the fully noisy image as the second similar noise image. Further, based on the target mask image and the difference description text, the first similar noise image undergoes backdiffusion processing. This backdiffusion process involves multiple consecutive noise reduction operations to obtain the first image. Similarly, based on the target mask image and a reference image, the second similar noise image undergoes backdiffusion processing to obtain the second image. Finally, the first and second images are fused to obtain the adjusted target image.
示例性的,结合图6所示,将相似图像X、目标掩码图、差异描述文本和参考图像传输给带有条件的隐空间扩散(Stable Diffusion,SD)模型,其中,该扩散模型会对相似图像X进行编码处理,以将编码处理结果(特征图Z)导入到隐空间中,通过前向扩散(加噪)处理,假设通过T个时间步的逐步加噪处理,得到完全噪声化的噪声图"ZT",也可以表示为"XT",从而,取噪声图"ZT"作为第二相似噪声图,并取相邻时间步的噪声图"ZT-1"作为第一相似噪声图。进一步的,基于目标掩码图和差异描述文本,对第一相似噪声图进行反向扩散处理,该反向扩散为经历对应数量的时间步的减噪处理,如经历"T-1"个时间步的逐渐减噪处理,并解码得到第一图像;同理,基于目标掩码图和参考图像,对第二相似噪声图进行反向扩散处理,该反向扩散为经历对应数量的时间步的减噪处理,如经历"T"个时间步的逐渐减噪处理,得到第二特征图"Z",并解码得到第二图像。最后,将第一图像和第二图像融合,如叠加、拼接等处理,以得到调整后的目标图像。For example, as shown in Figure 6, a similar image X, a target mask image, a difference description text, and a reference image are transmitted to a conditional latent space diffusion (Stable Diffusion, SD) model. The diffusion model encodes the similar image X to import the encoding result (feature map Z) into the latent space. Through forward diffusion (noising) over T time steps of progressive noise addition, a fully noised noise map "Z<sub>T</sub>", which can also be represented as "X<sub>T</sub>", is obtained. Thus, the noise map "Z<sub>T</sub>" is taken as the second similar noise map, and the noise map "Z<sub>T-1</sub>" of the adjacent time step is taken as the first similar noise map. Furthermore, based on the target mask image and the difference description text, the first similar noise map undergoes reverse diffusion, that is, noise reduction over the corresponding number of time steps, such as "T-1" time steps, and is then decoded to obtain the first image. Similarly, based on the target mask image and the reference image, the second similar noise map undergoes reverse diffusion over the corresponding number of time steps, such as "T" time steps, to obtain the second feature map "Z", which is then decoded to obtain the second image. Finally, the first and second images are fused, such as through overlay or stitching, to obtain the adjusted target image.
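上述双分支流程(加噪得到ZT-1和ZT,一支以文本引导减噪、另一支以参考图像引导减噪,再融合)的数据流可用如下示意代码表示(仅为示意:两个减噪函数是引导式反向扩散分支的占位替身,仅体现掩码与数据流向):The data flow of the two-branch process above (noise to Z<sub>T-1</sub> and Z<sub>T</sub>, denoise one branch under text guidance and the other under reference-image guidance, then fuse) can be sketched as follows (an illustrative sketch: the two denoiser functions are placeholders standing in for the guided reverse-diffusion branches, so only the masking and data flow are meaningful):

```python
import numpy as np

def add_noise(x, rng):
    # One forward-diffusion step: add Gaussian noise.
    return x + rng.standard_normal(x.shape)

# Placeholder guided denoisers; real branches run T-1 / T reverse steps with
# cross-attention on the difference text vector / reference feature map.
def denoise_with_text(z, mask, text_vec):
    return z * mask                    # blanks the difference region

def denoise_with_reference(z, inv_mask, ref_feat):
    return z * inv_mask                # keeps only the difference region

rng = np.random.default_rng(1)
similar = rng.random((8, 8))
target_mask = np.ones((8, 8))
target_mask[2:5, 2:5] = 0.0            # 0 over the difference region

z_prev = add_noise(similar, rng)       # noise map at time step T-1
z_last = add_noise(z_prev, rng)        # adjacent time step T

first = denoise_with_text(z_prev, target_mask, None)
second = denoise_with_reference(z_last, 1 - target_mask, None)
target = first + second                # step (105.B.4): fuse complementary content
```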
需要说明的是,取与第一相似噪声图相邻时间步的噪声图作为第二相似噪声图,在减噪中引入参考图像来进行引导减噪,以使得参考图像与相似图像的特征在相邻时间步“T-1”和“T”时能够保证一致,如大小、尺寸等一致,具有可靠性。It should be noted that the noise map at the time step adjacent to the first similar noise map is taken as the second similar noise map. A reference image is introduced in the noise reduction process to guide the noise reduction, so that the features of the reference image and the similar image can be consistent at adjacent time steps "T-1" and "T", such as size and dimensions, which has reliability.
在一些实施方式中,第一图像由差异描述文本作为引导条件进行微调得到。例如,步骤(105.B.2)可以包括:根据所述目标掩码图对所述第一相似噪声图进行掩码处理,得到第一掩码噪声图;获取所述差异描述文本对应的差异文本向量,并根据所述差异文本向量对所述第一相似噪声图进行减噪处理,得到第一特征图;对所述第一特征图进行解码处理,得到第一图像。In some implementations, the first image is fine-tuned using the difference description text as a guiding condition. For example, step (105.B.2) may include: masking the first similarity noise map according to the target mask map to obtain a first mask noise map; obtaining the difference text vector corresponding to the difference description text, and performing noise reduction processing on the first similarity noise map according to the difference text vector to obtain a first feature map; and decoding the first feature map to obtain the first image.
示例性的,带有条件的隐空间扩散(Stable Diffusion,SD)模型在经历多次加噪处理得到第一相似噪声图"ZT-1"后,将目标掩码图与第一相似噪声图进行相乘,得到第一掩码噪声图;同时,以差异描述文本作为引导条件,对差异描述文本进行文本编码处理,获得差异文本向量;进而,基于该差异文本向量对第一掩码噪声图进行逐步减噪,具体为在每个时间步的减噪处理过程中通过注意力机制将差异文本向量引入到该时间步的噪声中去,持续连续"T-1"个时间步的减噪,直至完全去噪,得到第一特征图,如图6中"Z";最后,将该第一特征图进行解码处理,得到第一图像。需要说明的是,以上仅为示例,还可在完全去噪后再对第一特征图进行掩码处理,关于掩码处理过程的时序,此处具体不做限定。For example, after the Conditional Latent Diffusion (SD) model has undergone multiple noise addition processes to obtain the first similarity noise map "Z<sub>T-1</sub>", the target mask map is multiplied with the first similarity noise map to obtain a first mask noise map. Simultaneously, using the difference description text as a guiding condition, the difference description text is text-encoded to obtain a difference text vector. Then, based on this difference text vector, the first mask noise map is progressively denoised. Specifically, during the denoising process at each time step, an attention mechanism is used to introduce the difference text vector into the noise at that time step. This denoising process continues for "T-1" consecutive time steps until complete denoising is achieved, resulting in a first feature map, as shown in "Z" in Figure 6. Finally, this first feature map is decoded to obtain the first image. It should be noted that the above is only an example; masking can also be performed on the first feature map after complete denoising. The specific timing of the masking process is not limited here.
在一些实施方式中,第二图像由参考图像作为引导条件进行微调得到。步骤(105.B.3)可以包括:对所述目标掩码图进行取反,得到所述目标掩码图对应的目标反掩码图;根据所述目标反掩码图对所述第二相似噪声图进行掩码处理,得到第二掩码噪声图;根据所述参考图像对应的特征图对所述第二掩码噪声图进行减噪处理,得到第二特征图;对所述第二特征图进行解码处理,得到第二图像。In some implementations, the second image is obtained by fine-tuning with the reference image as a guiding condition. Step (105.B.3) may include: inverting the target mask image to obtain a target inverse mask image corresponding to the target mask image; performing masking processing on the second similarity noise map based on the target inverse mask image to obtain a second mask noise map; performing noise reduction processing on the second mask noise map based on the feature map corresponding to the reference image to obtain a second feature map; and decoding the second feature map to obtain the second image.
示例性的,带有条件的隐空间扩散(Stable Diffusion,SD)模型在经历多次加噪处理得到第二相似噪声图"ZT"后,需要对第二相似噪声图进行掩码处理,由于需要从参考图像中提取出图像中差异对象部分的信息,以及屏蔽参考图像和相似图像中其他非差异对象部分的信息,因此,需要对目标掩码图进行取反,以获取与目标掩码图相反的目标反掩码图。进而,将目标反掩码图与第二相似噪声图进行相乘,得到第二掩码噪声图,同时,以参考图像作为引导条件,对参考图像进行图像编码处理,获得参考图像的特征图;进而,基于该参考图像的特征图对第二掩码噪声图进行逐步减噪,具体为在每个时间步的减噪处理过程中,通过注意力机制将参考图像的特征图引入到该时间步的噪声中去,持续连续"T"个时间步的减噪,直至完全去噪,得到第二特征图,如图6中"Z";最后,将该第二特征图进行解码处理,得到第二图像。需要说明的是,以上仅为示例,还可在完全去噪后再对第二特征图进行掩码处理,关于掩码处理过程的时序,此处具体不做限定。For example, after the Conditional Latent Diffusion (SD) model undergoes multiple noise addition processes to obtain the second similarity noise map "Z<sub>T</sub>", it needs to be masked. Since it is necessary to extract information about the difference-object parts from the reference image and to mask information about the other non-difference-object parts in the reference and similar images, the target mask image needs to be inverted to obtain a target inverse mask image that is the opposite of the target mask image. Then, the target inverse mask image is multiplied with the second similarity noise map to obtain the second mask noise map. Simultaneously, using the reference image as a guiding condition, image encoding is performed on the reference image to obtain its feature map. Then, based on this feature map, the second mask noise map is progressively denoised. Specifically, during the denoising process at each time step, an attention mechanism is used to introduce the feature map of the reference image into the noise at that time step. This denoising continues for "T" consecutive time steps until complete denoising is achieved, resulting in the second feature map, as shown in "Z" in Figure 6. Finally, the second feature map is decoded to obtain the second image. It should be noted that the above is only an example; masking can also be performed on the second feature map after complete denoising. The specific timing of the masking process is not limited here.
在一些实施方式中,关于带有条件的隐空间扩散(Stable Diffusion,SD)模型的训练过程具体为:获取样本参考图像、样本相似图像以及样本目标图像,并获取样本参考图像相对于样本相似图像的样本差异描述文本和样本目标掩码图;进而,将样本参考图像、样本相似图像、样本差异描述文本和样本目标掩码图传输给预设的SD模型,分别以样本差异描述文本和样本参考图像作为引导条件,结合样本目标掩码图对样本相似图像进行局部微调,得到预测目标图像;进而,获取预测目标图像与样本目标图像之间的差异,以构建预测损失,并基于预测损失对预设模型进行迭代训练,直至达到预设条件,得到训练后的目标模型,即带有条件的隐空间扩散模型。In some implementations, the training process for a conditional latent space diffusion (SD) model is as follows: A sample reference image, sample similar images, and a sample target image are acquired. A sample difference description text and a sample target mask image are also acquired relative to the sample similar images. The sample reference image, sample similar images, sample difference description text, and sample target mask image are then transmitted to a pre-defined SD model. Using the sample difference description text and the sample reference image as guiding conditions, the sample similar images are locally fine-tuned in conjunction with the sample target mask image to obtain a predicted target image. The difference between the predicted target image and the sample target image is then acquired to construct a prediction loss. The pre-defined model is iteratively trained based on this prediction loss until the pre-defined conditions are met, resulting in the trained target model, i.e., the conditional latent space diffusion model.
由上可知,本申请实施例可先从现有的数据中获取与参考图像相似的相似图像,然后,基于参考图像与相似图像之间的差异信息,生成目标掩码图,以及,针对该差异信息进行扩充,以丰富表示该差异信息的差异描述文本,最后,联合目标掩码、差异描述文本和参考图像对现有的相似图像进行局部微调,以获取微调后的目标图像;以此,可对图像之间的差异进行扩充描述,并针对差异的扩充描述文本和参考图像作为约束来局部调整相似图像,提高图像调整的准确性,使得调整后的图像效果与实际需求相符合,以利于后续其他业务的开展。As can be seen from the above, the embodiments of this application can first obtain similar images to the reference image from existing data, then generate a target mask image based on the difference information between the reference image and the similar image, and expand the difference information to enrich the difference description text representing the difference information. Finally, the existing similar image is locally fine-tuned by combining the target mask, the difference description text, and the reference image to obtain the fine-tuned target image. In this way, the difference description between images can be expanded, and the expanded description text of the difference and the reference image can be used as constraints to locally adjust the similar image, improving the accuracy of image adjustment and making the adjusted image effect conform to the actual needs, so as to facilitate the development of other subsequent businesses.
根据上面实施例所描述的方法,以下将举例作进一步详细说明。Based on the method described in the above embodiments, the following examples will provide further detailed explanations.
本申请实施例以图像处理为例,对本申请实施例提供的图像处理方法作进一步叙述。This application takes image processing as an example to further describe the image processing method provided in this application.
图8是本申请实施例提供的图像处理方法的另一步骤流程示意图,图9是本申请实施例提供的图像处理系统的框架结构的示意图,图10是本申请实施例提供的残差网络层的结构示意图,图11是本申请实施例提供的差异信息汇总生成差异描述文本的场景示意图,图12是本申请实施例提供的图像微调过程的场景示意图。为了便于理解,本申请实施例结合图3-12进行描述。Figure 8 is a flowchart illustrating another step of the image processing method provided in this application embodiment; Figure 9 is a schematic diagram of the framework structure of the image processing system provided in this application embodiment; Figure 10 is a schematic diagram of the structure of the residual network layer provided in this application embodiment; Figure 11 is a schematic diagram of a scenario where difference information is summarized to generate difference description text provided in this application embodiment; and Figure 12 is a schematic diagram of a scenario where image fine-tuning is performed provided in this application embodiment. For ease of understanding, this application embodiment is described in conjunction with Figures 3-12.
在本申请实施例中,将从图像处理装置的维度进行描述,该图像处理装置具体可以集成在计算机设备如服务器中。例如,该计算机设备上的处理器执行图像处理方法对应的程序时,该图像处理方法的具体流程如下:In this embodiment, the description will focus on the image processing apparatus, which can be integrated into a computer device such as a server. For example, when the processor on the computer device executes the program corresponding to the image processing method, the specific flow of the image processing method is as follows:
201、获取参考图像,并获取与参考图像相似的相似图像。201. Obtain a reference image, and then obtain similar images that are similar to the reference image.
在本申请实施例中,为了获得与参考图像更相似的目标图像,可从预设数据库中搜寻与参考图像最为相似的相似图像,以便后续对搜寻到的相似图像作进一步的图像处理,如对相似图像进行局部微调,以获得更贴合参考图像的目标图像。In this embodiment of the application, in order to obtain a target image that is more similar to the reference image, a similar image that is most similar to the reference image can be searched from a preset database so that the searched similar image can be further processed, such as making local fine adjustments to the similar image to obtain a target image that fits the reference image better.
其中,该参考图像可以是包含任意内容的图像。例如,以图像查询业务平台为例,客户向该平台发送一个或多个例图,该例图为本申请实施例的参考图像,平台在收到该例图后,可从现有的数据库中查找与该例图相似的相似图像,以便后续以该相似图像作为基础,结合相似图像与参考图像之间的差异来对相似图像进行局部调整。The reference image can be an image containing any content. For example, taking an image query service platform as an example, a customer sends one or more example images to the platform. These example images are reference images in the embodiments of this application. After receiving the example image, the platform can search for similar images similar to the example image from its existing database, so that it can subsequently use the similar image as a basis and combine the differences between the similar image and the reference image to make local adjustments to the similar image.
其中,该相似图像可以是数据库中与参考图像最为相似的图像,其可以理解为现有数据中在图像内容或图像风格等方面与参考图像最为相似的图像。The similar image can be the image in the database that is most similar to the reference image. It can be understood as the image in the existing data that is most similar to the reference image in terms of image content or image style.
具体的,为了从预设数据库中选取与该参考图像相似的相似图像,获取相似图像的过程可以为:首先,可确定参考图像对应的参考聚类中心,并分别计算该参考聚类中心与现有的数据库中每个图像聚类中心之间的特征类别距离。进一步的,可根据特征类别距离来选取与参考图像相似的相似图像,具体的,可根据该特征类别距离的大小来判定参考图像的聚类中心与数据库中的哪一个图像聚类中心更相近,以将该相近的目标图像聚类中心的图像类别确定为该参考聚类中心的图像类别;进而,计算参考图像与目标图像聚类中心的图像类别下的每一图像之间的特征距离,需要说明的是,特征距离的大小可以反映任意两个图像之间的相似度,因此,可根据特征距离的大小来选取与参考图像相似的相似图像,例如,选取与参考图像的特征距离最小的图像作为相似图像。Specifically, to select similar images from a pre-defined database, the process of obtaining similar images can be as follows: First, the reference cluster center corresponding to the reference image can be determined, and the feature class distance between the reference cluster center and each image cluster center in the existing database can be calculated. Further, similar images can be selected based on the feature class distance. Specifically, the magnitude of the feature class distance can be used to determine which image cluster center in the database is closest to the cluster center of the reference image, thus assigning the image class of that closest target image cluster center to the reference cluster center. Then, the feature distance between the reference image and each image under the image class of the target image cluster center is calculated. It should be noted that the magnitude of the feature distance can reflect the similarity between any two images; therefore, similar images can be selected based on the magnitude of the feature distance, for example, selecting the image with the smallest feature distance to the reference image as the similar image.
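The two-stage retrieval just described — first match the reference cluster center against the database's class cluster centers, then pick the closest image within the winning class — can be sketched as follows. Cosine similarity and Euclidean distance are illustrative choices, since the text does not pin down the exact metrics, and all names here are hypothetical:

```python
import numpy as np

def cosine_sim(a, b):
    """Illustrative similarity between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def find_similar(ref_feat, class_centers, class_images):
    """class_centers: {label: center vector};
    class_images: {label: [(img_id, feature vector), ...]}.
    Step 1: pick the class whose cluster center is closest to the reference.
    Step 2: within that class, pick the image with the smallest feature distance."""
    best_class = max(class_centers, key=lambda c: cosine_sim(ref_feat, class_centers[c]))
    best_id, _ = min(
        class_images[best_class],
        key=lambda item: np.linalg.norm(np.asarray(item[1], float) - np.asarray(ref_feat, float)),
    )
    return best_class, best_id
```

For example, with two class centers and a reference feature close to class "a", `find_similar` returns the nearest image of class "a".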
202、确定参考图像与相似图像之间的差异信息。202. Determine the differences between the reference image and similar images.
其中,该差异信息可以是表示参考图像与相似图像之间的特征差异的信息,其不限于包括图像中的存在差异的对象(事物)数量、对象位置和/或对象体态等差异信息。The difference information can be information representing the feature differences between the reference image and similar images, and is not limited to differences such as the number of objects (things) that differ in the image, the location of the objects, and/or the shape of the objects.
具体的,为了获取参考图像与相似图像之间的差异信息,可通过图文转换的方式获取参考图像的第一描述文本,以及通过图文转换的方式获取相似图像的第二描述文本;进一步的,基于第一描述文本与第二描述文本之间的差异,生成参考图像与相似图像之间的差异信息。Specifically, in order to obtain the difference information between the reference image and similar images, the first descriptive text of the reference image can be obtained through image-to-text conversion, and the second descriptive text of the similar images can be obtained through image-to-text conversion; further, based on the difference between the first descriptive text and the second descriptive text, the difference information between the reference image and the similar images is generated.
为了便于理解该第一描述文本和第二描述文本,以该第一描述文本的为例,对其进行叙述。具体的,该第一描述文本可以包括针对参考图像中整体图像内容的全局描述文本和图像中每个对象的对象描述文本。To facilitate understanding of the first and second descriptive texts, we will take the first descriptive text as an example. Specifically, the first descriptive text may include global descriptive text for the overall image content in the reference image and object descriptive text for each object in the image.
其中,该全局描述文本的生成途径可以基于预训练的冻结图像编码器和大型语言模型(Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,BLIP2)来实现。具体的,该模型在结构上可以包括视觉与语言表征学习(Vision-and-Language Representation Learning)、视觉到语言的生成学习(Vision-to-Language Generative Learning)两部分,其中,该视觉与语言表征学习在结构上包括图像编码器(Image Encoder)和轻量级的查询变压器(Querying Transformer,Q-Former),其中,该视觉到语言的生成学习在结构上包括大规模语言模型(Large Language Models,LLM)。全局描述文本的生成过程为:将参考图像输入到图像编码器,通过图像编码器对该参考图像进行编码处理,得到图像编码结果;进而,将该图像编码结果输入到轻量级的查询变压器,并确定参考图像中的对象类别标签,以在轻量级的查询变压器中与对象类别标签进行融合,得到融合特征结果;最后,将该融合特征结果输入至大规模语言模型进行语言处理,以输出针对该参考图像的全局描述文本。The global descriptive text generation approach can be implemented based on a pre-trained frozen image encoder and large language models (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, BLIP2). Specifically, the model can be structurally divided into two parts: vision-and-language representation learning and vision-to-language generative learning. The vision-and-language representation learning includes an image encoder and a lightweight query transformer (Querying Transformer, Q-Former), while the vision-to-language generative learning includes large language models (LLM). The global descriptive text generation process is as follows: The reference image is input into the image encoder, which encodes the reference image to obtain an image encoding result; then, the image encoding result is input into the lightweight query transformer, and the object category labels in the reference image are determined and fused with the image encoding result in the lightweight query transformer to obtain a fused feature result; finally, the fused feature result is input into a large-scale language model for language processing to output a global descriptive text for the reference image.
其中,该对象描述文本的生成途径可以基于图像区域到文本的生成转换器(Generative Region-to-Text Transformer,GRIT)来实现,该模型在结构上可以包括视觉编码器(Visual Encoder)、定位对象的地标物体提取器(Foreground Object Extractor)和文本解码器(Text Decoder)。具体的,对象描述文本的生成过程为:对参考图像中的每个对象进行识别,以确定参考图像中每个对象所在的像素区域,并标示每个对象的类别,将已确定每个对象的像素区域和类别的参考图像输入到图像区域到文本的生成转换器中,以对像素区域进行区域与语言之间的转化处理,输出得到转化处理后的参考图像,该转化处理后的参考图像中包含用于标注每个对象的标记框,以及每个标记框中的对象的对象描述文本。The object description text generation method can be implemented based on a Generative Region-to-Text Transformer (GRIT). This model structurally includes a Visual Encoder, a Foreground Object Extractor, and a Text Decoder. Specifically, the object description text generation process is as follows: Each object in the reference image is identified to determine its pixel region and class. The reference image with the identified pixel regions and classes is then input into the Generative Region-to-Text Transformer to perform region-to-language conversion. The output is a converted reference image containing bounding boxes for each object and object description text for each object within the bounding box.
以此,可以确定参考图像与相似图像之间的差异情况,以便后续结合该两者图像之间的差异情况作为图像处理过程中的调整依据,并对相似图像进行调整,提高图像调整的准确性。This allows us to determine the differences between the reference image and similar images, so that we can use these differences as a basis for adjustments during image processing and improve the accuracy of image adjustments.
203、确定参考图像中针对差异信息的目标掩码图,并获取与目标掩码图相反的目标反掩码图。203. Determine the target mask image for the difference information in the reference image, and obtain the target inverse mask image that is the opposite of the target mask image.
在本申请实施例中,主要采用局部调整的方式来对相似图像进行图像处理,因此,需要获取参考图像中针对差异信息的掩码图,以利用掩码图来参与对相似图像的局部微调,以便后续提高图像调整时的精确性。In the embodiments of this application, the image processing of similar images is mainly carried out by local adjustment. Therefore, it is necessary to obtain a mask image of the difference information in the reference image so as to use the mask image to participate in the local fine adjustment of the similar image, so as to improve the accuracy of subsequent image adjustment.
需要说明的是,该目标掩码图可以是针对参考图像相对于相似图像的差异信息生成的像素区域遮掩图,用于将相似图像中与参考图像的差异对象所在区域的相同位置的目标像素区域进行掩盖,以阻拦相似图像中针对该目标像素区域中的像素的表征,使得该相似图像中的目标像素区域呈现空白,即无内容。该目标掩码图主要以“0”和“1”进行表示,其中,目标像素区域中的数值为“0”,除了该目标像素区之外的其他区域的数值为“1”。It should be noted that this target mask image can be a pixel region masking image generated based on the difference information between the reference image and similar images. It is used to mask the target pixel regions at the same locations in the similar image where the differing objects are located, thereby blocking the representation of pixels in the target pixel region in the similar image, making the target pixel region in the similar image appear blank, i.e., containing no content. This target mask image is mainly represented by "0" and "1", where the value in the target pixel region is "0", and the value in other regions is "1".
此外,目标反掩码图是与目标掩码图相反的掩码图,用于允许差异对象所在的目标像素区域中的像素的表征,拒绝非目标像素区域的其他像素的表征。可以理解的是,在该目标反掩码图中,该目标像素区域中的数值为“1”,除了该目标像素区之外的其他区域的数值为“0”。Furthermore, the target inverse mask is the opposite of the target mask, used to allow representation of pixels within the target pixel region where the difference object resides, while rejecting representation of pixels outside the target pixel region. Understandably, in this target inverse mask, the value within the target pixel region is "1", and the values in other regions are "0".
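The 0/1 mask and inverse-mask behaviour described above can be illustrated with element-wise multiplication. This is a minimal numpy sketch; the toy 3x3 image and the single-pixel difference region are assumptions for demonstration:

```python
import numpy as np

def apply_mask(image, mask):
    """Element-wise multiply: pixels where mask == 0 are blanked out,
    pixels where mask == 1 are kept."""
    return np.asarray(image, dtype=float) * np.asarray(mask, dtype=float)

# Toy 3x3 image; assume the difference region is the centre pixel.
image = np.arange(1, 10, dtype=float).reshape(3, 3)
mask = np.ones((3, 3))
mask[1, 1] = 0.0          # target mask: 0 inside the difference region
inv_mask = 1.0 - mask     # target inverse mask: 1 inside, 0 outside

first_input = apply_mask(image, mask)       # difference region blanked
second_input = apply_mask(image, inv_mask)  # only the difference region kept
```

Note that the two masked results are complementary: summing them recovers the original image.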
以此,可分别获取针对参考图像中的差异对象的目标掩码图,以及目标反掩码图,以作为调整相似图像时的必要元素,分别参与对相似图像的局部微调,以便后续提高图像调整时的精确性。In this way, the target mask map and the target inverse mask map of the difference objects in the reference image can be obtained respectively, which can be used as necessary elements when adjusting similar images, and participate in the local fine-tuning of similar images to improve the accuracy of subsequent image adjustment.
204、基于差异信息进行文本扩充,得到差异描述文本。204. Based on the difference information, the text is expanded to obtain the difference description text.
在本申请实施例中,可获取针对参考图像中的差异信息的相关描述文本,以便后续以该差异信息的相关描述文本作为图像调整的引导条件,以对相似图像进行的局部调整,提高图像调整的准确性。In this embodiment, relevant descriptive text for the difference information in the reference image can be obtained, so that the relevant descriptive text for the difference information can be used as a guiding condition for subsequent image adjustment, so as to improve the accuracy of image adjustment by making local adjustments to similar images.
其中,该差异描述文本可以是描述参考图像相对于相似图像的内容差异的文本,其主要是基于差异信息进行文本上的扩充表述得到,可丰富、准确地表示参考图像与相似图像之间的差异。该差异描述文本可作为对相似图像调整的引导条件,以参与指示相似图像的局部调整,提高图像调整的准确性。The difference description text can be text describing the content differences between the reference image and similar images. It is mainly derived from textual expansion based on the difference information, which can enrich and accurately represent the differences between the reference image and similar images. This difference description text can serve as a guiding condition for adjusting similar images, participating in indicating local adjustments to similar images and improving the accuracy of image adjustments.
具体的,为了能够对参考图像与相似图像之间的差异信息进行丰富的描述,首先,可确定参考图像中该差异信息所对应的差异对象,并获取该差异对象在参考图像中的与任意一个其他对象之间的对象关系信息,如确定当前的差异对象与图中其他对象的位置关系、是否属于同一对象类别(物体类别)等;进一步的,参考前述的“第一描述文本”的获取,获取到参考图像的全局描述文本,以及获取差异对象关联的对象描述文本;最后,基于参考图像的全局描述文本、差异对象的描述文本、以及差异对象相对于其他对象之间的对象关系信息,进行文本的扩充描述,以此,实现对参考图像与相似图像之间的差异信息进行充分的描述,得到扩充后的差异描述文本。需要说明的是,该差异描述文本丰富地表述了该差异对象在参考图像中其他对象之间的关系以及差异对象的对象状态信息,如此,相对于对象描述文本,该差异描述文本能够更丰富地表示参考图像中差异对象的相关信息。Specifically, to provide a rich description of the differences between the reference image and similar images, firstly, the object corresponding to the difference in the reference image is identified, and the object relationship information between this object and any other object in the reference image is obtained, such as determining the positional relationship between the current object and other objects in the image, and whether they belong to the same object category (object class). Further, referring to the aforementioned acquisition of the "first descriptive text," the global descriptive text of the reference image and the object description text associated with the object are obtained. Finally, based on the global descriptive text of the reference image, the descriptive text of the object, and the object relationship information between the object and other objects, the text is expanded to provide a more comprehensive description of the differences between the reference image and similar images, resulting in the expanded difference description text. It should be noted that this difference description text richly expresses the relationship between the object and other objects in the reference image, as well as the object state information of the object. Thus, compared to the object description text, this difference description text can more comprehensively represent the relevant information of the object in the reference image.
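The expansion step above combines the global description text, the difference object's description text, and the object-relation information into one richer difference description. In the pipeline this text would be produced by a language model; the sketch below only shows the assembly of the inputs, and all function and argument names are hypothetical:

```python
def build_difference_description(global_caption, object_caption, relations):
    """Combine the global caption, the difference-object caption, and the
    object-relation facts into one expanded difference description
    (a simple stand-in for the LLM expansion step)."""
    parts = [f"Scene: {global_caption}.", f"Difference object: {object_caption}."]
    parts.extend(f"Relation: {rel}." for rel in relations)
    return " ".join(parts)
```

For example, `build_difference_description("a set dinner table", "a fork", ["the fork lies on the plate"])` yields one sentence-level description covering the scene, the object, and its relations.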
通过以上方式,可获取针对参考图像中的差异信息的丰富描述的文本,以便后续利用该差异描述文本作为引导条件,用于指示对相似图像进行的局部调整,提高图像调整的准确性。By using the above methods, rich descriptive text about the differences in the reference image can be obtained. This descriptive text can then be used as a guiding condition to indicate local adjustments to similar images, thereby improving the accuracy of image adjustments.
需要说明的是,在对相似图像进行局部区域的调整时,该调整过程包括两部分:第一,以差异描述文本作为引导条件,结合目标掩码图对相似图像进行局部调整,得到第一图像,具体参见以下步骤205;第二,以参考图像作为引导(约束)条件,结合目标反掩码图对相似图像进行局部调整,得到第二图像,具体参见以下步骤206。It should be noted that when adjusting local regions of similar images, the adjustment process includes two parts: First, using the difference description text as a guiding condition, the similar images are locally adjusted in conjunction with the target mask image to obtain the first image, as detailed in step 205; Second, using the reference image as a guiding (constraint) condition, the similar images are locally adjusted in conjunction with the target inverse mask image to obtain the second image, as detailed in step 206.
205、通过第一神经网络模型基于目标掩码图和差异描述文本,对相似图像进行局部微调,生成第一图像。205. Based on the target mask image and the difference description text, the first neural network model performs local fine-tuning on similar images to generate the first image.
在本申请实施例中,在对相似图像进行局部微调时,该第一神经网络模型可以是带有条件的隐空间扩散(Stable Diffusion,SD)模型,具体的,将相似图像输入到带有条件的隐空间扩散模型中,通过该带有条件的隐空间扩散模型对相似图像进行噪声扩散处理,并在噪声扩散过程中引入引导条件进行辅助,以指示图像对相关像素区域进行精确微调,提高图像调整的准确性。In this embodiment of the application, when performing local fine-tuning on similar images, the first neural network model can be a conditional latent space diffusion (SD) model. Specifically, the similar images are input into the conditional latent space diffusion model, and noise diffusion processing is performed on the similar images through the conditional latent space diffusion model. During the noise diffusion process, guiding conditions are introduced to assist in instructing the image to perform precise fine-tuning on relevant pixel regions, thereby improving the accuracy of image adjustment.
具体的,在通过隐空间扩散模型对相似图像进行调整之前,首先,根据目标掩码图对相似图像进行掩码处理,该掩码处理过程可以是将目标掩码图与相似图像之间进行相乘,以使得目标掩码图中每个数值都与相似图像中对应的像素进行相乘,得到第一相似掩码图像。然后,将该第一相似掩码图像输入到隐空间扩散模型中,以通过隐空间扩散模型对第一相似掩码图进行局部微调,为了便于理解,可结合图6和图7所示,对该第一神经网络模型的图像调整过程进行介绍,具体为:隐空间扩散模型对第一相似掩码图像进行编码处理,并将编码结果(第一编码特征图)引入到隐空间中,在隐空间中对编码结果进行前向扩散,该前向扩散可以理解为经过多个时间步的逐步加噪处理的过程,以获取完全噪声化的第一相似噪声图。接着,在对第一相似噪声图进行反向扩散之前,对作为引导条件的差异描述文本进行文本编码处理,以获得差异文本向量,实现将差异描述文本引入隐空间中;进而,在隐空间中通过减噪网络层(Denoising U-Net)对第一相似噪声图进行反向扩散处理,该反向扩散处理可以理解为进行多次(多个时间步)的减噪处理,并在每次减噪处理过程中通过残差网络层(ResNet)对当前时间步的噪声图进行特征提取,并通过注意力机制融入差异文本向量;经过多个时间步的减噪处理,以获得第一特征图。最后,对第一特征图进行解码处理,以将隐空间中第一特征图恢复至像素空间进行像素表征,以得到第一图像。Specifically, before adjusting the similar images using the latent space diffusion model, the similar images are first masked based on the target mask image. This masking process involves multiplying the target mask image and the similar image so that each value in the target mask image is multiplied by the corresponding pixel in the similar image, resulting in a first similarity mask image. Then, this first similarity mask image is input into the latent space diffusion model for local fine-tuning. For ease of understanding, the image adjustment process of this first neural network model can be described in conjunction with Figures 6 and 7. Specifically, the latent space diffusion model encodes the first similarity mask image and introduces the encoding result (first encoded feature map) into the latent space. Forward diffusion is then performed on the encoding result in the latent space. This forward diffusion can be understood as a process of gradually adding noise over multiple time steps to obtain a fully noisy first similarity noise map. Next, before backdiffusion of the first similar noise map, the differential description text, which serves as the guiding condition, is text-encoded to obtain a differential text vector, thus introducing the differential description text into the latent space.
Then, in the latent space, a denoising U-Net layer is used to perform backdiffusion on the first similar noise map. This backdiffusion process can be understood as performing multiple denoising processes (multiple time steps). In each denoising process, a ResNet layer is used to extract features from the noise map at the current time step, and the differential text vector is incorporated through an attention mechanism. After multiple time steps of denoising, the first feature map is obtained. Finally, the first feature map is decoded to restore the first feature map in the latent space to the pixel space for pixel representation, thus obtaining the first image.
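The forward diffusion ("stepwise noising over multiple time steps") has a standard closed form in DDPM-style models; the sketch below assumes that formulation, since the text does not give its noise schedule:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form DDPM forward step (assumed formulation):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I),
    where alpha_bar_t is the cumulative product of (1 - beta) up to step t."""
    alphas = 1.0 - np.asarray(betas, dtype=float)
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(alpha_bar) * np.asarray(x0, dtype=float) + np.sqrt(1.0 - alpha_bar) * eps
```

With a degenerate all-zero schedule, no noise is injected and the latent stays equal to the input; a real schedule gradually drives it toward pure Gaussian noise.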
其中,按照差异描述文本进行局部调整后得到的第一图像相对于相似图像存在区别,具体为该第一图像中与参考图像中差异对象对应的像素区域为空白像素区域,即不存在图像内容。The first image obtained after local adjustment according to the difference description text is different from the similar image. Specifically, the pixel area corresponding to the difference object in the first image is a blank pixel area, that is, there is no image content.
206、通过第二神经网络模型基于目标反掩码图和参考图像,对相似图像进行局部微调,生成第二图像。206. Based on the target inverse mask image and the reference image, the second neural network model performs local fine-tuning on similar images to generate a second image.
具体的,首先,将目标反掩码图与相似图像之间进行相乘,以使得目标反掩码图中每个数值都与相似图像中对应的像素进行相乘,得到第二相似掩码图像。然后,将该第二相似掩码图像输入到第二神经网络模型,在像素空间中对第二相似掩码图像进行编码处理,得到第二编码特征图。接着,将第二编码特征图引入隐空间(潜空间)中,在隐空间中对第二编码特征图进行多个时间步的逐步加噪处理,以获取完全噪声化的第二相似噪声图。进而,对作为引导条件的参考图像进行图像编码处理,以获得参考图像对应的特征图(即向量矩阵),并在隐空间中对第二相似噪声图进行多个时间步的减噪处理,其中,在每个时间步的减噪处理过程中结合注意力机制多次融入参考图像的特征图;经过多个时间步的减噪处理,以获得第二特征图。最后,对第二特征图进行解码处理,具体为将隐空间中第二特征图恢复至像素空间进行像素表征,以得到第二图像。Specifically, firstly, the target inverse mask image is multiplied with the similar image so that each value in the target inverse mask image is multiplied with the corresponding pixel in the similar image, resulting in a second similarity mask image. Then, this second similarity mask image is input into a second neural network model, where it is encoded in pixel space to obtain a second encoded feature map. Next, the second encoded feature map is introduced into the latent space, where it undergoes progressive noise addition at multiple time steps to obtain a fully noisy second similarity noise map. Then, the reference image, used as a guiding condition, is image encoded to obtain its corresponding feature map (i.e., a vector matrix). The second similarity noise map is then subjected to noise reduction at multiple time steps in the latent space, where an attention mechanism is used to repeatedly incorporate the reference image's feature map during each noise reduction step. After multiple noise reduction steps, the second feature map is obtained. Finally, the second feature map is decoded, specifically by restoring it from the latent space to pixel space for pixel representation to obtain the second image.
其中,按照参考图像进行局部调整后得到的第二图像相对于相似图像存在区别,具体为该第二图像中在相似图像中非差异信息对应的像素区域为空白像素区域,而差异信息对应的像素区域不是空白像素区域,即该第二图像仅包含差异信息对应的像素区域的图像内容。The second image obtained by making local adjustments to the reference image differs from the similar image. Specifically, in the second image, the pixel regions corresponding to non-difference information in the similar image are blank pixel regions, while the pixel regions corresponding to difference information are not blank pixel regions. That is, the second image only contains the image content of the pixel regions corresponding to difference information.
需要说明的是,关于第一神经网络模型和第二神经网络模型的训练过程,具体可参照前述实施例的描述,此处不做赘述。It should be noted that the training process of the first neural network model and the second neural network model can be referred to the description in the foregoing embodiments, and will not be repeated here.
207、将第一图像与第二图像进行融合,得到调整后的目标图像。207. Fuse the first image with the second image to obtain the adjusted target image.
在本申请实施例中,在获得以差异描述文本作为引导条件调整得到的第一图像和以参考图像作为引导条件调整得到的第二图像后,可将获得的第一图像和第二图像进行叠加融合,以使得第二图像中差异信息对应的像素区域的图像内容与第一图像中差异信息对应的空白像素区域叠加填充,实现第一图像和第二图像之间的图像内容互补,以获得经过局部微调后的目标图像,该目标图像相对于相似图像,其更符合参考图像的图像内容需求,与参考图像更相似,具有可靠性。In this embodiment, after obtaining a first image adjusted using the difference description text as a guide condition and a second image adjusted using a reference image as a guide condition, the first and second images can be superimposed and fused so that the image content of the pixel area corresponding to the difference information in the second image is superimposed and filled with the blank pixel area corresponding to the difference information in the first image, thereby achieving image content complementarity between the first and second images to obtain a target image after local fine-tuning. This target image, compared to similar images, better meets the image content requirements of the reference image, is more similar to the reference image, and has reliability.
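The overlay fusion of step 207 can be sketched as a mask-weighted composition: since the first image is blank inside the difference region and the second image is blank outside it, the target is `mask * first + (1 - mask) * second`. This is a minimal numpy sketch of that idea; the actual fusion in the model may differ:

```python
import numpy as np

def fuse(first_image, second_image, mask):
    """Compose the two locally adjusted images: keep `first_image` where
    mask == 1 and `second_image` where mask == 0 (the difference region)."""
    m = np.asarray(mask, dtype=float)
    first = np.asarray(first_image, dtype=float)
    second = np.asarray(second_image, dtype=float)
    return m * first + (1.0 - m) * second
```

Because the two inputs are complementary under a 0/1 mask, the fused result fills the first image's blank difference region with the second image's content.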
在一些实施方式中,关于步骤205~206,还可通过以下过程来实现:具体的,可通过带有条件的隐空间扩散(Stable Diffusion,SD)模型来实现对相似图像的局部微调,将相似图像和目标掩码传输给该扩散模型,扩散模型将相似图像经过编码处理得到特征图,并将该特征图导入到隐(潜)空间中进行前向扩散,该前向扩散过程为经历多个时间步的逐渐加噪处理过程,每个时间步视为一次加噪处理,直至得到完全噪声化的噪声图,进而,可取完全噪声化的噪声图和相邻的前一时间步的噪声图,具体可将该相邻的前一时间步的噪声图作为第一相似噪声图,以完全噪声化的噪声图作为第二相似噪声图;进一步的,基于目标掩码图和差异描述文本,对第一相似噪声图进行反向扩散处理,该反向扩散处理过程为连续多次减噪处理,以获取得到第一图像,以及基于目标掩码图和参考图像,对第二相似噪声图进行反向扩散处理,以获取得到第二图像。最后,将第一图像和第二图像进行融合,得到调整后的目标图像。关于该实施方式的叙述可参见前面实施例(B)的描述,此处不做一一赘述。In some implementations, steps 205-206 can also be achieved through the following process: Specifically, a conditional latent space diffusion (SD) model can be used to fine-tune the locality of similar images. The similar images and the target mask are transmitted to the diffusion model. The diffusion model encodes the similar images to obtain feature maps and imports these feature maps into the latent space for forward diffusion. This forward diffusion process involves a gradual noise-adding process over multiple time steps, with each time step considered as one noise-adding process, until a fully noise-generated noise map is obtained. Then, the fully noise-generated noise map and the noise map from the adjacent previous time step can be taken. Specifically, the noise map from the adjacent previous time step can be used as the first similar noise map, and the fully noise-generated noise map can be used as the second similar noise map. Further, based on the target mask map and the difference description text, the first similar noise map is subjected to backdiffusion processing. This backdiffusion processing involves multiple consecutive noise reduction processes to obtain the first image. Based on the target mask map and the reference image, the second similar noise map is subjected to backdiffusion processing to obtain the second image. Finally, the first image and the second image are fused to obtain the adjusted target image. 
For a description of this implementation method, please refer to the preceding embodiment (B), which will not be repeated here.
为了便于对本申请实施例的理解,将以具体的应用场景实例对本申请实施例进行描述。具体的,通过执行以上步骤201-207,以及结合图3-图12,对该应用场景实例进行描述。To facilitate understanding of the embodiments of this application, specific application scenario examples will be used to describe the embodiments of this application. Specifically, the application scenario example will be described by performing the above steps 201-207 and referring to Figures 3-12.
具体的,该图像处理方法主要用于图像局部微调的场景,该图像处理的场景实例具体如下:Specifically, this image processing method is mainly used for scenarios involving local fine-tuning of images. Examples of such scenarios are as follows:
一、结合图9所示,该图像处理系统在框架上包括:训练集(预设数据库)、残差网络的特征提取层(ResNet50)、语义分割层(Segment everything、GRIT以及BLIP2)、大规模语言处理模型和微调训练的隐空间扩散模型(SD)。As shown in Figure 9, the image processing system framework includes: a training set (preset database), a feature extraction layer of a residual network (ResNet50), a semantic segmentation layer (Segment everything, GRIT, and BLIP2), a large-scale language processing model, and a fine-tuned latent space diffusion model (SD).
为了便于理解,结合图9中各处理层对图像处理过程进行概括,具体如下:To facilitate understanding, the image processing process is summarized below with reference to the processing layers in Figure 9:
(1)从训练集(现存储有的图像集合)中获取与例图相似的相似图像,该相似图像为与例图高度相似但关键局部信息(某一差异对象)可能不同或者缺失的图片。需要说明的是,该获取相似图像的过程可以通过残差网络的特征提取层(ResNet50)来实现。(1) Similar images to the example image are retrieved from the training set (the currently stored set of images). These similar images are highly similar to the example image but may differ from it in, or be missing, key local information (a specific difference object). It should be noted that this process of obtaining similar images can be implemented using the feature extraction layer of a residual network (ResNet50).
结合图10所示,该ResNet50作为主干网络(基础网络),其在结构上包括编码器,该编码器是卷积神经网络(CNN),该卷积神经网络(CNN)的特征提取模块由3个卷积层和6个Resblock组成,对于输入的图像(如,参考图像),经过三个卷积层后,该图像的宽(w)和高(h)为原来的1/4,通道数从3变为128,形成一个w/4 * h/4 * 128的特征图,该特征图会再经过由6个ResBlock组成的子网络,生成新的高层语义特征图。其中,每个ResBlock在结构上由两个卷积层和一个直通(identity)层组成,需要说明的是,经过6个ResBlock后,得到的高层语义特征图为w*h*c(比如w/64 * h/64 * 1024)。As shown in Figure 10, the ResNet50 serves as the backbone network (base network). Structurally, it includes an encoder, which is a Convolutional Neural Network (CNN). The feature extraction module of this CNN consists of 3 convolutional layers and 6 ResBlocks. For the input image (e.g., the reference image), after passing through three convolutional layers, the width (w) and height (h) of the image are reduced to 1/4 of their original values, and the number of channels increases from 3 to 128, forming a feature map of w/4 * h/4 * 128. This feature map then passes through a sub-network composed of 6 ResBlocks to generate a new high-level semantic feature map. Each ResBlock consists of two convolutional layers and one identity layer. It should be noted that after passing through 6 ResBlocks, the resulting high-level semantic feature map is w*h*c (e.g., w/64 * h/64 * 1024).
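The shape changes stated above can be tracked with simple arithmetic. The factors (1/4 spatial reduction and 128 channels after the stem, 1/64 and 1024 channels after the six ResBlocks) are taken from the text; the exact strides producing them are an assumption:

```python
def encoder_shapes(w, h):
    """Feature-map shapes through the ResNet50 encoder, per the description:
    stem (3 conv layers): (w, h, 3)       -> (w/4, h/4, 128)
    6 ResBlocks:          (w/4, h/4, 128) -> (w/64, h/64, 1024)"""
    stem = (w // 4, h // 4, 128)
    high_level = (w // 64, h // 64, 1024)
    return stem, high_level
```

For a 256x256 input this gives a 64x64x128 stem output and a 4x4x1024 high-level semantic feature map.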
具体的,首先,将客户提供的例图输入残差网络的特征提取层(ResNet50),以获取参考图像的高层语义特征图,并基于高层语义信息获取参考图像的特征均值;同理,按照以上方式遍历训练集中的图像数据,以获取训练集中每个图像的高层语义特征图,以及获取其特征均值。然后,降低参考图像的高层语义特征图以及训练集中每个图像的高层语义特征图的特征维度,如通过全局最大池化处理,以使得特征图的维度为128维。Specifically, firstly, the example image provided by the client is input into the feature extraction layer of the residual network (ResNet50) to obtain the high-level semantic feature map of the reference image, and the feature mean of the reference image is obtained based on the high-level semantic information. Similarly, the image data in the training set is traversed in the same way to obtain the high-level semantic feature map of each image in the training set, and its feature mean is obtained. Then, the feature dimension of the high-level semantic feature map of the reference image and the high-level semantic feature map of each image in the training set is reduced, for example, by global max pooling, so that the feature map dimension is 128 dimensions.
进一步的,计算训练集中各个类别的图像聚类中心Ac,以及计算例图的聚类中心Bc,并通过聚类中心来计算例图与训练集中各个类别的相似度,具体公式如下:Furthermore, the cluster centers Ac of each category in the training set are calculated, as well as the cluster center Bc of the example image. The similarity between the example image and each category in the training set is then calculated using these cluster centers, as shown in the following formula:
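The similarity formula itself is not reproduced in this text (it appears only as a figure in the original patent). A common choice for comparing cluster centers — stated here as an assumption, not as the patent's exact formula — is cosine similarity between the two center vectors:

```latex
S(A_c, B_c) = \frac{A_c \cdot B_c}{\lVert A_c \rVert \, \lVert B_c \rVert}
```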
最后,对于每张例图,计算训练集中各图像与该例图之间的相似度,并判断该相似度是否大于例图聚类中心Bc与训练集中对应类别的图像聚类中心Ac之间的相似度;将相似度大于该值的图像确定为相似图像,以便后续调整。Finally, for each example image, the similarity between each image in the training set and the example image is computed, and it is judged whether this similarity exceeds the similarity between the example image's cluster center Bc and the cluster center Ac of the corresponding category in the training set; images whose similarity exceeds this value are identified as similar images for subsequent adjustment.
(2)分别获取例图(参考图像)和相似图像的语义信息,以确定两者之间的差异信息。(2) Obtain semantic information of example image (reference image) and similar image respectively to determine the difference information between them.
其中,通过预训练的冻结图像编码器和大型语言模型(Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,BLIP2)来获取该参考图像的全局描述文本。同理,获取相似图像的全局描述文本。Specifically, the global descriptive text of the reference image is obtained through a pre-trained frozen image encoder and a large language model (BLIP2). Similarly, the global descriptive text of similar images is obtained.
其中,通过图像区域到文本的生成转换器(Generative Region-to-Text Transformer,GRIT)获取参考图像中每个对象的对象描述文本。同理,获取相似图像中每个对象的对象描述文本。Specifically, a Generative Region-to-Text Transformer (GRIT) is used to obtain the object description text for each object in the reference image. Similarly, the object description text for each object in similar images is obtained.
进一步的,可根据参考图像与相似图像之间的全局描述文本以及对象描述文本的差异,确定参考图像与相似图像之间的差异信息。Furthermore, the difference information between the reference image and similar images can be determined based on the differences in global description text and object description text between the reference image and similar images.
此外,通过掩码分割模型(segment everything)来获取针对差异信息对应的对象的目标掩码图。In addition, a target mask image for objects corresponding to the difference information is obtained through a mask segmentation model (segment everything).
(3)结合图11所示,将差异信息传导给大规模语言处理模型进行汇总处理,并引导模型生成针对差异信息的关键描述(即差异描述文本)。具体的,让大规模语言处理模型推理图像中物体(图像中的对象)之间的关系和物体的信息,以获取针对差异对象的高质量的文本,即差异描述文本。需要说明的是,该大规模语言处理模型可以是任意类型的语言处理模型,如“Chat-Gpt”,此处不做限定。(3) Referring to Figure 11, the difference information is passed to the large-scale language processing model for aggregation and processing, and the model is guided to generate key descriptions of the difference information (i.e., difference description text). Specifically, the large-scale language processing model is allowed to infer the relationships between objects in the image and the information of the objects to obtain high-quality text for the difference objects, i.e., difference description text. It should be noted that the large-scale language processing model can be any type of language processing model, such as "Chat-Gpt", and is not limited here.
(4)对相似图像进行局部微调,主要包括两部分。具体的,第一,以差异描述文本作为提示信息(prompt)对相似图像进行局部微调;第二,以参考图像作为提示信息(prompt)对相似图像进行局部微调。关于该图像微调的具体介绍如下:(4) Local fine-tuning of similar images mainly includes two parts. Specifically, first, using the difference description text as a prompt, local fine-tuning of similar images is performed; second, using a reference image as a prompt, local fine-tuning of similar images is performed. A detailed introduction to this image fine-tuning is as follows:
结合图12所示,为图像微调过程的场景示意图。具体的,该图像微调的目的是将相似图像作为上路的输入图像,并基于下路客户提供的例图,将相似图像调整为与例图更为相似的目标图像。如图所示,目标图像中不包含人物的手,且刀叉放在餐盘上,这与例图更相似。Figure 12 is a schematic diagram of the image fine-tuning process. Specifically, the purpose of this fine-tuning is to take the similar image as the input image of the upper path and, based on the customer-provided example image on the lower path, adjust the similar image into a target image that more closely resembles the example image. As shown in the figure, the target image no longer contains the person's hand, and the knife and fork are placed on the plate, which is closer to the example image.
首先,将相似图像输入上路,并进行噪声扩散处理,在扩散(如去噪或减噪)中引入差异描述文本作为提示信息(prompt),该提示信息可以理解为引导条件。需要说明的是,在噪声扩散中可随机取一个时间步的噪声扩散图(定义为第一噪声图),将其与目标掩码图(mask)进行结合,以通过逆向扩散得到一个初步与例图相关的图像,即第一图像。需要说明的是,上路扩散过程可以表示如下:First, similar images are input into the upper path and subjected to noise diffusion processing. During diffusion (e.g., denoising or noise reduction), descriptive text describing the differences is introduced as a prompt, which can be understood as a guiding condition. It should be noted that during noise diffusion, a noise diffusion map at a random time step (defined as the first noise map) can be selected and combined with the target mask image to obtain a preliminary image related to the example image through reverse diffusion, i.e., the first image. The upper path diffusion process can be represented as follows:
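The upper-path formula itself is not reproduced in this excerpt. As a hedged sketch under standard DDPM assumptions (which the patent does not necessarily commit to), the forward noising at a randomly sampled time step follows the closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε; the schedule values below are illustrative:

```python
import numpy as np

# Hedged sketch of the forward (noising) process at a random time step,
# using the standard DDPM closed form. Schedule values are assumptions.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)          # cumulative product of (1 - beta_t)

def noise_at_step(x0, t, eps):
    """x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones((4, 4))                         # stand-in for the encoded similar image
t = int(rng.integers(0, T))                  # random time step, as described above
eps = rng.standard_normal(x0.shape)
x_t = noise_at_step(x0, t, eps)              # the "first noise map"
```

At small t the map stays close to the input, while at large t it is dominated by noise, which is what makes a randomly chosen intermediate step a useful starting point for prompt-guided reverse diffusion.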
然后,下路按照上路的方式对相似图像进行噪声扩散,并取相对于上路的噪声扩散图的后一个时间步的噪声扩散图(定义为第二噪声图),并在降噪扩散中以例图(参考图像)作为约束条件,将第二噪声图与目标掩码图(mask)的反掩码图进行结合,经过去噪,得到第二图像。需要说明的是,可以选择时间步相邻的两个噪声图分别作为第一噪声图与第二噪声图,从而确保上路得到的第一图像和下路得到的第二图像的特征能够保证一致性,如图像尺寸一致等,以便后续使得第一图像和第二图像能够准确融合。需要说明的是,下路扩散过程可以表示如下:Then, the lower path performs noise diffusion on the similar image in the same manner as the upper path, and takes the noise diffusion map at the time step immediately after that of the upper path (defined as the second noise map). In the denoising diffusion process, using the example image (reference image) as a constraint, the second noise map is combined with the inverse mask image of the target mask. After denoising, the second image is obtained. It should be noted that two noise maps at adjacent time steps can be selected as the first noise map and the second noise map, respectively, which ensures that the features of the first image obtained from the upper path and the second image obtained from the lower path remain consistent, such as having the same image size, so that the two images can be accurately fused subsequently. The lower path diffusion process can be represented as follows:
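The lower-path formula is likewise omitted from this excerpt. A minimal sketch of the inverse-mask combination it describes (restricting the lower path to the region outside the target mask) might look like the following; all values are illustrative:

```python
import numpy as np

# Hedged sketch: combine the second noise map with the INVERSE of the
# target mask, so the lower path only contributes the region outside
# the difference objects.

def apply_inverse_mask(noise_map, target_mask):
    inverse_mask = 1 - target_mask      # invert the binary target mask
    return noise_map * inverse_mask     # keep only the non-difference region

mask = np.array([[1, 0], [0, 1]], dtype=np.uint8)
second_noise = np.full((2, 2), 0.5)     # stand-in for the second noise map
masked = apply_inverse_mask(second_noise, mask)
```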
最后,将第一图像与第二图像进行融合,得到调整后的目标图像。其中,该图像融合过程可以表示如下:Finally, the first image and the second image are fused to obtain the adjusted target image. This image fusion process can be represented as follows:
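The fusion formula is not shown in this excerpt either. A minimal mask-weighted blend consistent with the description — first image inside the target mask region, second image outside it — can be sketched as follows, with illustrative pixel values:

```python
import numpy as np

# Hedged sketch of the final fusion: take the prompt-guided first image
# inside the target mask and the reference-guided second image outside it.

def fuse(first_image, second_image, target_mask):
    m = target_mask.astype(float)
    return m * first_image + (1.0 - m) * second_image

mask = np.array([[1, 0], [0, 1]], dtype=np.uint8)
first = np.full((2, 2), 2.0)   # edited region (upper, prompt-guided path)
second = np.full((2, 2), 7.0)  # preserved region (lower, reference-guided path)
target = fuse(first, second, mask)
```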
通过执行以上(1)到(4)的场景步骤,可以实现如下:在训练集中找到和客户数据相对类似的图片,然后分别将客户例图输入到多个大模型(segment everything、blip2、grit等)输出相对应的prompt(图片的描述),接着,利用chatgpt对这些描述进行汇总,并对人为认定的较为关键的几个特征重点引导chatgpt添加细节,从而得到最终的描述词;最后使用SD模型对相似图像与客户提供例图的差异区域进行重点涂抹,让图像对局部重新生成,得到目标图像。By executing scenario steps (1) to (4) above, the following is achieved: images relatively similar to the customer data are found in the training set; the customer example images are then fed into several large models (segment everything, BLIP2, GRIT, etc.) to output the corresponding prompts (image descriptions); next, ChatGPT is used to summarize these descriptions, and is specifically guided to add detail for the few features judged most important, yielding the final description; finally, the SD model is used to repaint (inpaint) the regions where the similar image differs from the customer-provided example image, locally regenerating the image to obtain the target image.
通过以上应用场景实例,可实现如下效果:以针对例图与相似图像之间的差异作为引导条件,并对客户提供的少量图片进行局部上的微调,提高图像微调时的准确性,以获取更符合例图的目标图像。Through the above application scenario examples, the following effects can be achieved: using the differences between the example image and similar images as guiding conditions, and making local fine-tuning to a small number of images provided by the customer, the accuracy of image fine-tuning is improved, so as to obtain a target image that is more consistent with the example image.
由以上可知,本申请实施例可先从现有的数据中获取与参考图像相似的相似图像,然后,基于参考图像与相似图像之间的差异信息,生成目标掩码图,以及,针对该差异信息进行扩充,以丰富表示该差异信息的差异描述文本,最后,联合目标掩码、差异描述文本和参考图像对现有的相似图像进行局部微调,以获取微调后的目标图像;以此,可对图像之间的差异进行扩充描述,并针对差异的扩充描述文本和参考图像作为约束来局部调整相似图像,提高图像调整的准确性,使得调整后的图像效果与实际需求相符合,以利于后续其他业务的开展。As can be seen from the above, the embodiments of this application can first obtain similar images to the reference image from existing data, then generate a target mask image based on the difference information between the reference image and the similar image, and expand the difference information to enrich the difference description text representing the difference information. Finally, the existing similar image is locally fine-tuned by combining the target mask, the difference description text, and the reference image to obtain the fine-tuned target image. In this way, the difference description between images can be expanded, and the expanded description text of the difference and the reference image can be used as constraints to locally adjust the similar image, improving the accuracy of image adjustment and making the adjusted image effect conform to the actual needs, so as to facilitate the development of other subsequent businesses.
为了更好地实施以上方法,本申请实施例还提供一种图像处理装置。例如,如图13所示,该图像处理装置可以包括获取单元401、第一确定单元402、第二确定单元403、扩充单元404和调整单元405。To better implement the above methods, embodiments of this application also provide an image processing apparatus. For example, as shown in FIG13, the image processing apparatus may include an acquisition unit 401, a first determination unit 402, a second determination unit 403, an expansion unit 404, and an adjustment unit 405.
获取单元401,用于获取参考图像,并获取与参考图像相似的相似图像;The acquisition unit 401 is used to acquire a reference image and acquire similar images that are similar to the reference image;
第一确定单元402,用于确定参考图像与相似图像之间的差异信息;The first determining unit 402 is used to determine the difference information between the reference image and similar images;
第二确定单元403,用于确定参考图像中针对差异信息的目标掩码图;The second determining unit 403 is used to determine the target mask map for the difference information in the reference image;
扩充单元404,用于基于差异信息进行扩充,得到差异描述文本;The expansion unit 404 is used to expand based on the difference information to obtain the difference description text;
调整单元405,用于根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。The adjustment unit 405 is used to perform local adjustments on similar images based on the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
在一些实施方式中,调整单元405,还用于:对相似图像进行加噪处理,并获取加噪处理中相邻时间步的第一相似噪声图和第二相似噪声图;根据目标掩码图和差异描述文本,对第一相似噪声图进行解噪处理,得到第一图像;根据目标掩码图和参考图像,对第二相似噪声图进行解噪处理,得到第二图像;将第一图像与第二图像进行融合,得到调整后的目标图像。In some embodiments, the adjustment unit 405 is further configured to: add noise to similar images and obtain a first similar noise map and a second similar noise map at adjacent time steps in the noise addition process; perform denoising processing on the first similar noise map according to the target mask map and the difference description text to obtain a first image; perform denoising processing on the second similar noise map according to the target mask map and the reference image to obtain a second image; and fuse the first image and the second image to obtain the adjusted target image.
在一些实施方式中,调整单元405,还用于:根据目标掩码图对第一相似噪声图进行掩码处理,得到第一掩码噪声图;获取差异描述文本对应的差异文本向量,并根据差异文本向量对第一相似噪声图进行减噪处理,得到第一特征图;对第一特征图进行解码处理,得到第一图像。In some embodiments, the adjustment unit 405 is further configured to: perform masking processing on the first similar noise map according to the target mask map to obtain a first mask noise map; obtain the difference text vector corresponding to the difference description text, and perform noise reduction processing on the first similar noise map according to the difference text vector to obtain a first feature map; and perform decoding processing on the first feature map to obtain a first image.
在一些实施方式中,调整单元405,还用于:对目标掩码图进行取反,得到目标掩码图对应的目标反掩码图;根据目标反掩码图对第二相似噪声图进行掩码处理,得到第二掩码噪声图;根据参考图像对应的特征图对第二掩码噪声图进行掩码处理,得到第二特征图;对第二特征图进行解码处理,得到第二图像。In some embodiments, the adjustment unit 405 is further configured to: invert the target mask image to obtain a target inverse mask image corresponding to the target mask image; perform masking processing on the second similar noise image according to the target inverse mask image to obtain a second mask noise image; perform masking processing on the second mask noise image according to the feature image corresponding to the reference image to obtain a second feature image; and perform decoding processing on the second feature image to obtain a second image.
在一些实施方式中,调整单元405,还用于:根据目标掩码图和差异描述文本,对相似图像进行局部微调,得到第一图像;根据目标掩码图和参考图像,对相似图像进行局部微调,得到第二图像;将第一图像与第二图像进行融合,得到调整后的目标图像。In some embodiments, the adjustment unit 405 is further configured to: perform local fine-tuning on similar images based on the target mask image and the difference description text to obtain a first image; perform local fine-tuning on similar images based on the target mask image and the reference image to obtain a second image; and fuse the first image and the second image to obtain an adjusted target image.
在一些实施方式中,调整单元405,还用于:根据目标掩码图对相似图像进行掩码处理,得到第一相似掩码图像;对第一相似掩码图像进行加噪处理,得到第一相似噪声图;获取差异描述文本对应的差异文本向量;根据差异文本向量对第一相似噪声图进行减噪处理,得到第一特征图;对第一特征图进行解码处理,得到第一图像。In some embodiments, the adjustment unit 405 is further configured to: perform masking processing on the similar image according to the target mask image to obtain a first similar mask image; perform noise processing on the first similar mask image to obtain a first similar noise image; obtain the difference text vector corresponding to the difference description text; perform noise reduction processing on the first similar noise image according to the difference text vector to obtain a first feature image; and perform decoding processing on the first feature image to obtain a first image.
在一些实施方式中,调整单元405,还用于:对第一相似噪声图进行连续多次降噪处理,并通过注意力机制在每次降噪处理过程中融入差异文本向量,得到第一特征图。In some implementations, the adjustment unit 405 is further configured to: perform multiple consecutive noise reduction processes on the first similar noise map, and incorporate the difference text vectors in each noise reduction process through an attention mechanism to obtain the first feature map.
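As a hedged illustration of "fusing the difference text vector into each denoising step via an attention mechanism", a bare scaled dot-product cross-attention — image features as queries, text vectors as keys and values — could look like the following; the dimensions and random features are arbitrary assumptions:

```python
import numpy as np

# Hedged sketch: minimal cross-attention in which image tokens attend
# to the difference-text tokens. Not the patent's exact architecture.

def cross_attention(image_feats, text_feats):
    d = image_feats.shape[-1]
    scores = image_feats @ text_feats.T / np.sqrt(d)           # (N_img, N_txt)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ text_feats                                # (N_img, d)

img = np.random.default_rng(1).standard_normal((4, 8))  # 4 image tokens, dim 8
txt = np.random.default_rng(2).standard_normal((3, 8))  # 3 text tokens, dim 8
out = cross_attention(img, txt)
```

In a real denoising network this operation would sit inside each denoising block, so the text condition is re-injected at every step rather than only once at the input.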
在一些实施方式中,调整单元405,还用于:对第一相似掩码图像进行编码处理,得到编码特征图;对编码特征图进行噪声处理,得到第一相似噪声图。In some embodiments, the adjustment unit 405 is further configured to: encode the first similar mask image to obtain an encoded feature map; and perform noise processing on the encoded feature map to obtain a first similar noise map.
在一些实施方式中,调整单元405,还用于:通过第一神经网络模型基于目标掩码图和差异描述文本,对相似图像进行局部微调,生成第一图像;In some embodiments, the adjustment unit 405 is further configured to: perform local fine-tuning on similar images based on the target mask image and the difference description text using a first neural network model to generate a first image;
则图像处理装置还包括训练单元,用于:获取样本参考图像和样本相似图像,以及第一样本目标图像;基于样本参考图像与样本相似图像之间的差异信息,生成样本参考图像的样本目标掩码图和样本差异描述文本;通过预设模型基于样本目标掩码图和样本差异描述文本,对样本相似图像进行局部微调,生成第一预测图像;根据第一样本目标图像与第一预测图像,确定预测损失;基于预测损失对预设模型进行迭代训练,直至达到预设条件,得到第一神经网络模型。The image processing device further includes a training unit, used for: acquiring a sample reference image and a sample similar image, as well as a first sample target image; generating a sample target mask image and sample difference description text of the sample reference image based on the difference information between the sample reference image and the sample similar image; performing local fine-tuning on the sample similar image based on the sample target mask image and the sample difference description text using a preset model to generate a first predicted image; determining the prediction loss based on the first sample target image and the first predicted image; and iteratively training the preset model based on the prediction loss until a preset condition is met to obtain a first neural network model.
在一些实施方式中,调整单元405,还用于:对目标掩码图进行取反,得到目标掩码图对应的目标反掩码图;根据目标反掩码图对相似图像进行掩码处理,得到第二相似掩码图像;对第二相似掩码图像进行加噪处理,得到第二相似噪声图;根据参考图像对应的特征图对第二相似噪声图进行减噪处理,得到第二特征图;对第二特征图进行解码处理,得到第二图像。In some embodiments, the adjustment unit 405 is further configured to: invert the target mask image to obtain a target inverse mask image corresponding to the target mask image; perform masking processing on similar images based on the target inverse mask image to obtain a second similar mask image; perform noise processing on the second similar mask image to obtain a second similar noise image; perform noise reduction processing on the second similar noise image based on the feature map corresponding to the reference image to obtain a second feature map; and perform decoding processing on the second feature map to obtain a second image.
在一些实施方式中,扩充单元404,还用于:根据差异信息,确定参考图像相对于相似图像的差异对象;确定差异对象与参考图像中其他对象之间的对象关系信息;获取参考图像的全局描述文本和差异对象的目标对象描述文本;基于全局描述文本、目标对象描述文本和对象关系信息进行文本扩充,得到差异描述文本。In some embodiments, the expansion unit 404 is further configured to: determine the difference objects of the reference image relative to similar images based on the difference information; determine the object relationship information between the difference objects and other objects in the reference image; obtain the global description text of the reference image and the target object description text of the difference objects; and perform text expansion based on the global description text, the target object description text, and the object relationship information to obtain the difference description text.
在一些实施方式中,获取单元401,还用于:确定参考图像所属的参考聚类中心;确定预设数据库中预构建的每个图像聚类中心与参考聚类中心之间的特征类别距离;基于特征类别距离,为参考图像选取相似的相似图像。In some implementations, the acquisition unit 401 is further configured to: determine the reference cluster center to which the reference image belongs; determine the feature category distance between each image cluster center pre-constructed in the preset database and the reference cluster center; and select similar images for the reference image based on the feature category distance.
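A minimal sketch of selecting the similar-image cluster by distance between cluster centres; Euclidean distance is an assumption here, since the text only speaks of a "feature category distance" without fixing a metric:

```python
import numpy as np

# Hedged sketch: choose the pre-built image cluster whose centre lies
# closest to the reference image's cluster centre in feature space.

def nearest_cluster(reference_center, cluster_centers):
    dists = np.linalg.norm(cluster_centers - reference_center, axis=1)
    return int(np.argmin(dists)), dists

ref_center = np.array([0.0, 0.0])                        # illustrative feature
centers = np.array([[3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]) # pre-built centres
best, distances = nearest_cluster(ref_center, centers)    # index of best cluster
```

Images belonging to the winning cluster would then be the candidate similar images for the reference image.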
在一些实施方式中,第一确定单元402,还用于:获取参考图像对应的第一描述文本;获取相似图像对应的第二描述文本;基于第一描述文本与第二描述文本之间的差异,生成差异信息。In some implementations, the first determining unit 402 is further configured to: obtain a first descriptive text corresponding to a reference image; obtain a second descriptive text corresponding to a similar image; and generate difference information based on the difference between the first descriptive text and the second descriptive text.
在一些实施方式中,第一确定单元402,还用于:通过第一预设模型对参考图像进行全局描述,生成参考图像的全局描述文本;通过第二预设模型对参考图像中的每个对象所在的像素区域进行处理,得到参考图像中每个对象对应的对象描述文本;根据参考图像的全局描述文本和每个对象对应的对象描述文本,确定参考图像对应的第一描述文本。In some embodiments, the first determining unit 402 is further configured to: perform a global description of the reference image using a first preset model to generate a global description text for the reference image; process the pixel region where each object in the reference image is located using a second preset model to obtain an object description text corresponding to each object in the reference image; and determine a first description text corresponding to the reference image based on the global description text of the reference image and the object description text corresponding to each object.
在一些实施方式中,第二确定单元403,还用于:根据差异信息,确定参考图像中相对于相似图像的差异对象;基于参考图像中的差异对象,生成目标掩码图。In some implementations, the second determining unit 403 is further configured to: determine, based on the difference information, a difference object in the reference image relative to a similar image; and generate a target mask image based on the difference object in the reference image.
由以上可知,本申请实施例可从现有的数据中获取与参考图像相似的相似图像,然后,基于参考图像与相似图像之间的差异信息,生成目标掩码图,以及,针对该差异信息进行扩充,以丰富表示该差异信息的差异描述文本,最后,联合目标掩码、差异描述文本和参考图像对现有的相似图像进行局部微调,以获取微调后的目标图像;以此,可对图像之间的差异进行扩充描述,并将针对差异的扩充描述文本和参考图像作为约束条件来局部调整相似图像,提高图像调整的准确性,使得调整后的图像效果与实际需求相符合,以利于后续其他业务的开展。As can be seen from the above, the embodiments of this application can obtain similar images to the reference image from existing data. Then, based on the difference information between the reference image and the similar image, a target mask image is generated. Furthermore, the difference information is expanded to enrich the difference description text representing the difference information. Finally, the existing similar image is locally fine-tuned by combining the target mask, the difference description text, and the reference image to obtain the fine-tuned target image. In this way, the difference description between images can be expanded, and the expanded description text and the reference image can be used as constraints to locally adjust the similar image, improving the accuracy of image adjustment and making the adjusted image effect conform to the actual needs, which is conducive to the development of other subsequent services.
本申请实施例还提供一种计算机设备,如图14所示,其示出了本申请实施例所涉及的计算机设备的结构示意图,具体来讲:This application also provides a computer device, as shown in FIG14, which illustrates a schematic diagram of the structure of the computer device involved in this application embodiment. Specifically:
该计算机设备可以包括一个或者一个以上处理核心的处理器501、一个或一个以上计算机可读存储介质的存储器502、电源503和输入单元504等部件。本领域技术人员可以理解,图14中示出的计算机设备结构并不构成对计算机设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:The computer device may include components such as a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will understand that the computer device structure shown in FIG14 does not constitute a limitation on the computer device, and may include more or fewer components than shown, or combine certain components, or have different component arrangements. Wherein:
处理器501是该计算机设备的控制中心,利用各种接口和线路连接整个计算机设备的各个部分,通过运行或执行存储在存储器502内的软件程序和/或模块,以及调用存储在存储器502内的数据,执行计算机设备的各种功能和处理数据。可选的,处理器501可包括一个或多个处理核心;优选的,处理器501可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器501中。The processor 501 is the control center of the computer device, connecting various parts of the computer device through various interfaces and lines. It performs various functions and processes data by running or executing software programs and/or modules stored in the memory 502, and by calling data stored in the memory 502. Optionally, the processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 501.
存储器502可用于存储软件程序以及模块,处理器501通过运行存储在存储器502的软件程序以及模块,从而执行各种功能应用以及图像处理过程。存储器502可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据计算机设备的使用所创建的数据等。此外,存储器502可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器502还可以包括存储器控制器,以提供处理器501对存储器502的访问。The memory 502 can be used to store software programs and modules. The processor 501 executes various functional applications and image processing processes by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
计算机设备还包括给各个部件供电的电源503,优选的,电源503可以通过电源管理系统与处理器501逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源503还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。The computer equipment also includes a power supply 503 that supplies power to the various components. Preferably, the power supply 503 can be logically connected to the processor 501 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 503 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
该计算机设备还可包括输入单元504,该输入单元504可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。The computer device may also include an input unit 504, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
尽管未示出,计算机设备还可以包括显示单元等,在此不再赘述。具体在本申请实施例中,计算机设备中的处理器501会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器502中,并由处理器501来运行存储在存储器502中的应用程序,从而实现各种功能,如下:Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in the embodiments of this application, the processor 501 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502 to realize various functions, as follows:
获取参考图像,并获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。Acquire a reference image and similar images; determine the difference information between the reference image and the similar images; determine the target mask image in the reference image based on the difference information; expand based on the difference information to obtain the difference description text; perform local adjustments on the similar images according to the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
以上各个操作的具体实施可参见前面的实施例,在此不作赘述。For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.
由此可得,本方案可从现有的数据中获取与参考图像相似的相似图像,然后,基于参考图像与相似图像之间的差异信息,生成目标掩码图,以及,针对该差异信息进行扩充,以丰富表示该差异信息的差异描述文本,最后,联合目标掩码、差异描述文本和参考图像对现有的相似图像进行局部微调,以获取微调后的目标图像;以此,可对图像之间的差异进行扩充描述,并针对差异的扩充描述文本和参考图像作为约束来局部调整相似图像,提高图像调整的准确性,使得调整后的图像效果与实际需求相符合,以利于后续其他业务的开展。Therefore, this solution can obtain similar images from existing data that are similar to the reference image. Then, based on the difference information between the reference image and the similar image, a target mask image is generated. Furthermore, the difference information is expanded to enrich the difference description text representing the difference information. Finally, the existing similar images are locally fine-tuned by combining the target mask, the difference description text, and the reference image to obtain the fine-tuned target image. In this way, the difference description between images can be expanded, and the expanded difference description text and the reference image can be used as constraints to locally adjust similar images, improving the accuracy of image adjustment and ensuring that the adjusted image effect meets the actual requirements, thus facilitating the development of other subsequent businesses.
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
为此,本申请实施例提供一种计算机可读存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本申请实施例所提供的任一种图像处理方法中的步骤。例如,该指令可以执行如下步骤:Therefore, embodiments of this application provide a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to execute steps in any of the image processing methods provided in embodiments of this application. For example, the instructions can execute the following steps:
获取参考图像,并获取与参考图像相似的相似图像;确定参考图像与相似图像之间的差异信息;确定参考图像中针对差异信息的目标掩码图;基于差异信息进行扩充,得到差异描述文本;根据目标掩码图、差异描述文本和参考图像,对相似图像进行局部调整,得到调整后的目标图像。Acquire a reference image and similar images; determine the difference information between the reference image and the similar images; determine the target mask image in the reference image based on the difference information; expand based on the difference information to obtain the difference description text; perform local adjustments on the similar images according to the target mask image, the difference description text, and the reference image to obtain the adjusted target image.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.
其中,该计算机可读存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。The computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
由于该计算机可读存储介质中所存储的指令,可以执行本申请实施例所提供的任一种图像处理方法中的步骤,因此,可以实现本申请实施例所提供的任一种图像处理方法所能实现的有益效果,详见前面的实施例,在此不再赘述。Since the instructions stored in the computer-readable storage medium can execute the steps of any of the image processing methods provided in the embodiments of this application, the beneficial effects that any of the image processing methods provided in the embodiments of this application can achieve can be realized, as detailed in the preceding embodiments, and will not be repeated here.
根据本申请的一个方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例提供的各种可选实现方式中提供的方法。According to one aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in the various optional implementations of the above embodiments.
以上对本申请实施例所提供的一种图像处理方法、装置、设备和计算机可读存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本申请的限制。The foregoing has provided a detailed description of an image processing method, apparatus, device, and computer-readable storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims (19)
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40092635A true HK40092635A (en) | 2023-12-29 |
| HK40092635B HK40092635B (en) | 2024-02-16 |