
CN112102303A - Semantic image analogy method based on a single-image generative adversarial network - Google Patents

Semantic image analogy method based on a single-image generative adversarial network

Info

Publication number
CN112102303A
CN112102303A (application CN202011001562.XA)
Authority
CN
China
Prior art keywords
image
semantic
source
target
loss
Prior art date
Legal status
Granted
Application number
CN202011001562.XA
Other languages
Chinese (zh)
Other versions
CN112102303B (en)
Inventor
Zhiwei Xiong
Jiacheng Li
Dong Liu
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202011001562.XA
Publication of CN112102303A
Application granted
Publication of CN112102303B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06T 7/10: Segmentation; Edge detection
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic image analogy method based on a single-image generative adversarial network. Given any image and its semantic segmentation map, a generative model dedicated to that image can be trained; the model can recombine the source image according to different desired semantic layouts and generate an image conforming to the target semantic layout, achieving the effect of semantic image analogy. The results generated by the method achieve the best visual quality and layout conformity.

Description

A Semantic Image Analogy Method Based on a Single-Image Generative Adversarial Network

Technical Field

The present invention relates to the field of image processing, and in particular to a semantic image analogy method based on a single-image generative adversarial network.

Background

Generative models such as the Variational Auto-Encoder (VAE) and the Generative Adversarial Network (GAN) have made great strides in modeling natural image layouts in a generative manner. By taking additional signals such as class labels, text, edges, or segmentation maps as input, conditional generative models can generate photo-realistic samples in a controllable manner, which is useful in many multimedia applications such as interaction design and artistic style transfer.

Specifically, segmentation maps provide dense pixel-level guidance for generative models and enable users to spatially control the expected instances, which is far more flexible than image-level guidance such as class labels or styles.

The Pix2Pix model proposed by Isola et al. demonstrated the ability of conditional GANs to generate controllable images given dense conditional signals, including sketches and segmentation maps (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976). Wang et al. extended this framework with a coarse-to-fine generator and a multi-scale discriminator to generate images with high-resolution details (Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8798–8807). Park et al. proposed a spatially adaptive normalization technique (SPADE) that uses semantic maps to predict affine transformation parameters for modulating activations in normalization layers (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis With Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337–2346). Typically, these methods require a large training dataset in order to map segmentation class labels to patch appearances across the dataset. However, the appearance of a label instance in the generated images is limited to that label's appearance in the training dataset, which limits the generalization of these models to arbitrary natural images.

On the other hand, recent work on single-image GANs has shown that it is possible to learn a generative model from the internal patch layout of a single image. InGAN defines resizing transformations and trains a generative model that captures internal patch statistics for retargeting (Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. 2019. InGAN: Capturing and Retargeting the "DNA" of a Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4491–4500). SinGAN uses a multi-stage training scheme to train an unconditional generator that can produce images of arbitrary size from noise (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model From a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569–4579). KernelGAN uses a constrained deep linear generator to learn an image-specific degradation kernel for blind super-resolution (Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. 2019. Blind Super-Resolution Kernel Estimation using an Internal-GAN. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS). 284–293). Although these image-specific GANs are dataset-independent and yield promising results, the semantic meaning of patches within a single image remains largely unexplored.

SUMMARY OF THE INVENTION

The purpose of the present invention is to provide a semantic image analogy method based on a single-image generative adversarial network whose generated results achieve the best visual quality and layout conformity.

The purpose of the present invention is achieved through the following technical solution:

A semantic image analogy method based on a single-image generative adversarial network, implemented by a network model composed of an encoder, a generator, an auxiliary classifier, and a discriminator, wherein:

Training phase: during each training iteration, for a given source image and its corresponding source semantic segmentation map, the same random augmentation operation is applied to both, yielding an augmented image and an augmented semantic segmentation map. The source and augmented segmentation maps are passed through the same encoder to extract their respective feature tensors; the semantic feature translation module in the generator then predicts transformation parameters in the image domain from the two feature tensors, and the target image is generated from the source image under the guidance of these parameters. The target image is fed to the discriminator and the auxiliary classifier, which respectively predict score maps for the target and augmented images and the semantic segmentation map of the target image. The total training loss combines an appearance similarity loss between the target and source images, a feature matching loss between the target and augmented images computed from the score maps, and a semantic alignment loss between the target and augmented segmentation maps.

Inference phase: the source image, its corresponding source semantic segmentation map, and a specified semantic segmentation map are input to the semantic image analogy network, which outputs an image with the same semantic layout as the specified segmentation map.

It can be seen from the technical solution provided by the present invention that, given any image and its semantic segmentation map, a generative model dedicated to that image can be trained; the model can recombine the source image according to different desired semantic layouts and generate images conforming to the target semantic layout, achieving the effect of semantic image analogy.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a conceptual illustration of semantic image analogy provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of a semantic image analogy method based on a single-image generative adversarial network provided by an embodiment of the present invention;

Fig. 3 is a computation flowchart of the SFT module provided by an embodiment of the present invention;

Fig. 4 is a visual comparison between the image generation results of the present invention and existing image analogy methods;

Fig. 5 is a visual comparison between the image generation results of the present invention and existing single-image GAN methods;

Fig. 6 is a visual comparison between the image generation results of the present invention and existing semantic image translation methods;

Fig. 7 shows visual results of the present invention on the semantic image analogy task;

Fig. 8 shows visual results of the present invention on the image object removal task;

Fig. 9 shows visual results of the present invention on the face editing task;

Fig. 10 shows visual results of the present invention on the edge-to-image translation task.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a semantic image analogy method based on a single-image generative adversarial network. With a conditional single-image GAN, a generative model is trained that generates semantically controllable images conditioned on segmentation labels from the source image itself rather than from an external dataset. This task is named "semantic image analogy", a variant of "image analogies" (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327–340), and is defined as follows.

Given a source image I and its corresponding semantic segmentation map P, and some other semantic segmentation map P', synthesize a new target image I' such that:

I : P :: I' : P'

In the above formula, :: denotes the analogy relation.

As shown in Fig. 1, the target images I' (the four images in the dashed box) should match both the appearance of the source image I and the layouts of the target segmentation maps P' (the four segmentation maps in the dashed box). The task is to find a transformation from I to I' analogous to the one from P to P'. Furthermore, two metrics are used to evaluate the quality of images generated by a semantic image analogy model: a patch-level distance and a semantic alignment score. The former requires the original image I to be the only source of patches for the generated image I', while the latter requires the generated image I' to have a semantic layout aligned with the target segmentation map P'.
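The patch-level distance described above can be sketched as a brute-force nearest-patch search; a minimal numpy illustration (the function name, patch size, and MSE distance are illustrative choices, not specified by the patent):

```python
import numpy as np

def patch_level_distance(generated, source, patch=8):
    """For each non-overlapping patch of `generated`, find the distance to
    the closest patch of `source` (brute force); a small mean distance
    indicates the generated image is composed of source patches."""
    def patches(img):
        h, w = img.shape[:2]
        return np.array([img[i:i + patch, j:j + patch].ravel()
                         for i in range(0, h - patch + 1, patch)
                         for j in range(0, w - patch + 1, patch)])
    gp, sp = patches(generated), patches(source)
    d = ((gp[:, None, :] - sp[None, :, :]) ** 2).mean(-1)  # pairwise MSE
    return float(d.min(axis=1).mean())  # mean nearest-source-patch distance
```

An image literally copied from source patches scores zero under this metric, which is the behavior the constraint above asks for.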

In practice, one can edit the source segmentation map P or provide another image with similar context to obtain the target segmentation map P'. The generator then produces a semantically aligned target image I' from the source image I, analogous to how P' is obtained from P. Comparisons with existing methods show that the method provided by the present invention is advantageous in both quantitative and qualitative evaluations. Owing to the flexible task setup, the proposed method extends readily to various applications, including object removal, face editing, and sketch-to-image generation for natural images.

As shown in Fig. 2, the main principle of the semantic image analogy method based on a single-image generative adversarial network provided by the present invention is implemented by a network model composed of an encoder, a generator, an auxiliary classifier, and a discriminator, wherein:

Training phase: a self-supervised framework is designed for training a conditional GAN from a single image. During each training iteration, for a given source image I_source and its corresponding source semantic segmentation map P_source, the same random augmentation operation is applied to obtain the augmented image I_aug and the augmented semantic segmentation map P_aug. The source and augmented segmentation maps are passed through the same encoder to extract their respective feature tensors; the semantic feature translation module in the generator then predicts transformation parameters in the image domain from the two feature tensors, and the target image I_target is generated from the source image under the guidance of these parameters. The target image is fed to the discriminator and the auxiliary classifier, which respectively predict score maps for the target and augmented images and the semantic segmentation map of the target image. The total loss combines an appearance similarity loss between the target and source images, a feature matching loss between the target and augmented images computed from the score maps, and a semantic alignment loss between the target and augmented segmentation maps.
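The total loss described above can be sketched as a weighted sum of the three terms; a minimal numpy illustration (the function names, the L1 form of each term, and the weights are illustrative assumptions, not values from the patent):

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, used for each loss term below."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def total_loss(target_img, source_img,
               target_score_maps, aug_score_maps,
               target_seg, aug_seg,
               w_app=10.0, w_fm=1.0, w_sem=1.0):
    """Weighted sum of the three terms named above: appearance similarity
    (target vs. source image), feature matching (discriminator score maps
    of target vs. augmented image), and semantic alignment (predicted
    target segmentation vs. augmented segmentation)."""
    appearance = l1(target_img, source_img)
    feat_match = float(np.mean([l1(t, a) for t, a in
                                zip(target_score_maps, aug_score_maps)]))
    semantic = l1(target_seg, aug_seg)
    return w_app * appearance + w_fm * feat_match + w_sem * semantic
```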

During training, the randomness of the augmentation is gradually increased. Since the generator is homomorphic, the source image can be reconstructed well when P_target is identical to P_source. Here P_target is a general notation; during training P_target = P_aug, so the two are used interchangeably in the training procedure below.

In an embodiment of the present invention, two modes, sampling and reconstruction, alternate during training. In sampling mode, which is the procedure described above, the generator is guided by the augmented segmentation map to generate a target image with the same appearance as the augmented image I_aug and the same semantic layout as the augmented segmentation map P_aug. Reconstruction mode follows the same procedure as sampling mode, but the direct inputs are the given source image and its source segmentation map, and the source image is reconstructed using the source segmentation map.
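The alternation between the two modes can be sketched as follows; `augment` and `generate` are hypothetical stand-ins for the real augmentation pipeline and generator, and even/odd alternation is an illustrative scheduling choice:

```python
import numpy as np

def training_iteration(step, source_img, source_seg, augment, generate):
    """Alternate the two modes: even steps run sampling mode (augmented
    segmentation as guidance, augmented image as the real reference),
    odd steps run reconstruction mode (source segmentation as guidance,
    source image as the reference)."""
    if step % 2 == 0:  # sampling mode
        aug_img, aug_seg = augment(source_img, source_seg)
        target = generate(source_img, source_seg, aug_seg)
        reference = aug_img
    else:              # reconstruction mode
        target = generate(source_img, source_seg, source_seg)
        reference = source_img
    return target, reference
```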

Inference phase: the source image, its corresponding source semantic segmentation map, and a specified semantic segmentation map are input to the semantic image analogy network, which outputs an image with the same semantic layout as the specified segmentation map.

After training, given a semantic segmentation map with an arbitrary shape layout, the network model can generate a target image that matches the given segmentation map, preserving the content of the source image while conforming to the target semantic layout. As shown in Fig. 1, the trained network model can change the shape of the horse in the source image according to a given shape.

For ease of understanding, the principles and procedure of the present invention are described in detail below.

The technical basis of the present invention is a single-image generative adversarial network. For a single image, a generative adversarial network (i.e., a generative model) conditioned on its semantic segmentation map is trained. It mainly comprises the generator, auxiliary classifier, and discriminator mentioned above, and uses a series of novel designs to establish a semantic association between the segmentation map and image pixels, which is then exploited to recombine the image through the segmentation map.

The semantic image analogy task is cast as a patch-level layout matching problem guided by transformations in the semantic segmentation domain. To this end, three main challenges must be addressed: a source of paired data for training a generative model from a single image, a conditioning scheme that carries guidance from the segmentation domain to the image domain, and proper supervision of the generated samples (i.e., the generator's outputs). To accomplish this, a novel method integrating the following three essential parts is proposed:

1) A self-supervised training framework with a progressive data augmentation strategy. By alternating optimization between the augmented and original segmentations, a conditional GAN is successfully trained from a single image and generalizes well to unseen transformations.

2) A semantic feature translation module that translates transformation parameters from the segmentation domain to the image domain.

3) A semantic-aware patch consistency loss that encourages the transformed image to contain only patches from the source image. Together with the semantic alignment constraint, it enables the generator to produce realistic images with the target semantic layout.

As shown in Fig. 1, the main steps of the training phase are as follows:

Step 1: given the source image I_source and its corresponding source semantic segmentation map P_source, first perform random augmentation to obtain the augmented image I_aug and the augmented segmentation map P_aug; then feed the source segmentation map P_source and the augmented segmentation map P_aug into the same encoder E (i.e., E_seg in Fig. 1) to extract their features separately.

In an embodiment of the present invention, the random augmentation operation comprises one or a combination of the following operations: random flipping, resizing, rotation, and cropping. The randomness of these operations increases linearly with the training step; this progressive strategy helps the encoder learn the appearance of the source image in the early iterations of training.
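The progressive augmentation strategy can be sketched as below; the linear ramp and the specific flip/crop parameters are illustrative assumptions, and only flip and crop are shown rather than the full flip/resize/rotate/crop set:

```python
import numpy as np

def augmentation_strength(step, total_steps, max_strength=1.0):
    """Randomness ramps up linearly with the training step."""
    return max_strength * min(step / total_steps, 1.0)

def random_augment(img, seg, step, total_steps, rng):
    """Apply the SAME random flip and crop to the image and its
    segmentation map; crop size shrinks as strength grows."""
    s = augmentation_strength(step, total_steps)
    h, w = img.shape[:2]
    if rng.random() < 0.5 * s:  # random horizontal flip
        img, seg = img[:, ::-1], seg[:, ::-1]
    ch, cw = h - int(s * h // 4), w - int(s * w // 4)  # crop size
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return (img[top:top + ch, left:left + cw],
            seg[top:top + ch, left:left + cw])
```

At step 0 the strength is zero, so early iterations see the unmodified source pair, matching the motivation stated above.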

Step 2: a Semantic Feature Translation (SFT) module is designed to predict the transformation parameters in the image domain from the feature tensors.

The SFT module explicitly translates transformation parameters from the segmentation domain to the image domain, as shown in Fig. 3. The transformation from the source segmentation map P_source to the augmented segmentation map P_aug is modeled as a feature-level linear transformation. Accordingly, an element-wise ratio and an element-wise difference are taken between the feature tensor F_source of the source segmentation map and the feature tensor F_aug of the augmented segmentation map; the resulting feature scaling tensor F_scale and feature shift tensor F_shift are used in the subsequent downsampling stages. For the l-th downsampling stage:

F_scale^(l) = F_aug^(l) / F_source^(l) (element-wise division)

F_shift^(l) = F_aug^(l) - F_source^(l) (element-wise difference)

where F_aug^(l) and F_source^(l) are the feature tensors extracted from F_aug and F_source at the l-th downsampling stage, respectively. For example, if the number of downsampling stages is K, the two feature tensors are divided into K parts, and at each downsampling stage the corresponding part is taken and substituted into the formulas above.
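The element-wise ratio and difference above can be sketched directly in numpy; the `eps` guard against division by zero is an implementation detail not stated in the patent:

```python
import numpy as np

def sft_scale_shift(f_aug, f_source, eps=1e-8):
    """Element-wise ratio and difference of the augmented and source
    segmentation feature tensors, giving the feature scaling tensor
    F_scale and feature shift tensor F_shift."""
    f_scale = f_aug / (f_source + eps)
    f_shift = f_aug - f_source
    return f_scale, f_shift
```

When the augmented features equal the source features (no transformation), this yields a scale of one and a shift of zero, i.e., the identity transformation, consistent with the linear-transformation model above.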

The feature scaling tensor F_scale^(l) and feature shift tensor F_shift^(l) approximate the scaling factor γ_seg^(l) and shift factor β_seg^(l) of the segmentation-map transformation. As shown in Fig. 3, two SFT units model the translation from the segmentation domain to the image domain: they process F_scale^(l) and F_shift^(l) respectively to obtain the image-domain scaling and shift factors (γ_img^(l), β_img^(l)). The parameters of the SFT units are learned during training.

Step 3: from the SFT module, the image-domain scaling and shift factors (γ_img^(l), β_img^(l)) are obtained, and the encoder-decoder part of the generator G maps the source image to the target image under their guidance.

For the (l+1)-th downsampling stage in the generator, the output feature tensor is obtained by:

F_img^(l+1) = DS( γ_img^(l) · (F_img^(l) - mean(F_img^(l))) / std(F_img^(l)) + β_img^(l) )

where DS denotes the downsampling module (i.e., the encoder), and mean and std denote the mean and standard deviation, respectively.
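The modulation-and-downsampling step above can be sketched as follows; `downsample` stands in for the DS convolution block, and the use of global (per-tensor) mean/std statistics is an assumption of this sketch:

```python
import numpy as np

def modulate_and_downsample(f_img, gamma, beta, downsample, eps=1e-8):
    """Normalize the image-domain feature tensor to zero mean and unit
    standard deviation, apply the image-domain scale (gamma) and shift
    (beta) from the SFT units, then pass the result through the DS block."""
    normalized = (f_img - f_img.mean()) / (f_img.std() + eps)
    return downsample(gamma * normalized + beta)
```

With gamma = 1 and beta = 0 the modulation reduces to plain normalization, so an untransformed segmentation map leaves the feature statistics untouched.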

The upsampling module of the generator (i.e., the decoder) then maps the image feature tensor output by the downsampling stages back to the image domain to produce the target image. In an embodiment of the present invention, the generator is an encoder-decoder structure with K downsampling blocks and K upsampling blocks; each block contains a 3×3 convolutional layer and a stride-2 4×4 convolutional or transposed convolutional layer for downsampling or upsampling, and each block also uses spectral normalization, batch normalization, and LeakyReLU activations. Illustratively, the starting channel number is 32, which is doubled during downsampling.

As an example, K = 3 can be set; each of the three downsampling stages receives the feature tensor F^l together with (γ^l_img, β^l_img) and outputs F^{l+1} according to the formula above, while the upsampling blocks take the final downsampled feature tensor as input and output the target image I_target.

Step 4: The discriminator D takes the augmented image I_aug as a real sample and the generated target image I_target as a fake sample. Meanwhile, the generated image is also fed to the auxiliary classifier S to predict its segmentation map.

In the embodiment of the present invention, the discriminator is a fully convolutional PatchGAN (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976), which predicts a score map to distinguish real from fake samples.

In the embodiment of the present invention, the auxiliary classifier (the Segmentation Network in Figure 2) uses a simplified version of the DeepLabV3 architecture (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 11211. 833–851) for semantic segmentation.

Step 5: Construct the loss functions to train the designed self-supervised network.

According to the task setting of semantic image analogy, the generated image should meet two requirements: 1) it is consistent with the content of the source image; and 2) its semantic layout is aligned with the target segmentation map. Therefore, a patch coherence loss is proposed to measure the appearance similarity between the generated image and the source image, and a semantic alignment loss is proposed, measured by the consistency between the segmentation map predicted from the target image by the auxiliary classifier and the target segmentation map. Specifically:

1) The patch coherence loss measures the appearance similarity between the generated image and the source image; this constraint penalizes the generator G whenever it generates a patch that has no counterpart in the source image. It is defined as the average of the lower bounds of the patch distances between the source and target images:

L_patch = (1 / N_target) · Σ_{U_p ⊂ I_target} min_{V_q ⊂ I_source, V_class = U_class} d(U_p, V_q)

where N_target is the number of patches in the target image I_target, I_source denotes the source image, and G(I_source) = I_target; U_class and V_class denote the segmentation labels of patches U_p and V_q, and d(·) is a distance metric function. This loss relaxes the positional dependence of pixel distances; instead, the image is treated as a bag of visual feature words. For each patch in the target image, a nearest-neighbor search finds the most similar source patch with the same class label, and the resulting distances are averaged. We found empirically that features from a pretrained VGG network (Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR)) produce good results, although other feature descriptors are also applicable.
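The class-constrained nearest-neighbor search described above can be sketched as follows. This is a simplified numpy illustration with hypothetical names: patch extraction and the VGG feature distance are omitted, and patches are represented as plain vectors.

```python
import numpy as np

def patch_coherence_loss(target_patches, source_patches,
                         target_labels, source_labels):
    # For every target patch, find the closest source patch that shares its
    # class label, then average the minimum L2 distances.
    dists = []
    for u, u_cls in zip(target_patches, target_labels):
        candidates = [v for v, v_cls in zip(source_patches, source_labels)
                      if v_cls == u_cls]
        if not candidates:  # skip patches with no same-class match
            continue
        dists.append(min(np.linalg.norm(u - v) for v in candidates))
    return float(np.mean(dists))

# Toy 2-vector "patches" with class labels
src = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
tgt = [np.array([0.0, 1.0])]
loss = patch_coherence_loss(tgt, src, target_labels=[1], source_labels=[0, 1])
print(loss)  # nearest same-class (label 1) source patch is [1, 1]: distance 1.0
```

In practice the distance d(·) would be computed on VGG feature descriptors rather than raw vectors, as the text notes.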

2) The auxiliary classifier is used to predict the segmentation map of the target image (i.e., the target semantic segmentation map). The cross-entropy (CE) loss between the predicted segmentation map and the augmented segmentation map is then computed. The semantic alignment loss of the generator is defined as:

L_seg = CE(S(G(I_source)), P_aug) = CE(P_predict, P_aug)

where CE denotes the cross-entropy loss, P_predict = S(G(I_source)) = S(I_target), and P_predict is the target semantic segmentation map (the output of the auxiliary classifier S).
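For concreteness, a toy numpy version of the pixel-wise cross-entropy between a predicted probability map and the augmented label map might look like this (hypothetical helper; it assumes per-pixel softmax probabilities are already given, whereas a real implementation would work on logits):

```python
import numpy as np

def seg_cross_entropy(probs, labels):
    # probs: (H, W, C) per-pixel class probabilities
    # labels: (H, W) integer ground-truth classes from the augmented map
    h, w, _ = probs.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(picked).mean())

# 1x2 "image" with 2 classes; predictions nearly match the labels
probs = np.array([[[0.9, 0.1], [0.2, 0.8]]])
labels = np.array([[0, 1]])
loss = seg_cross_entropy(probs, labels)
print(round(loss, 4))  # -(ln 0.9 + ln 0.8) / 2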

3) The least-squares GAN loss L_GAN is used as an adversarial constraint, and features are extracted from the discriminator to compute the feature matching loss L_fm between the augmented image and the generated image.

The total loss function is:

L_total = L_patch + λ_seg · L_seg + λ_GAN · L_GAN + λ_fm · L_fm

where L_patch denotes the appearance similarity loss, L_seg the semantic alignment loss, L_GAN the adversarial loss, and L_fm the feature matching loss; λ_seg, λ_GAN, and λ_fm are hyperparameters, all set to 1.0 in the experiments.
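The weighted combination above can be written directly; the sketch below uses placeholder loss values and the default weights λ = 1.0 stated in the text:

```python
def total_loss(l_patch, l_seg, l_gan, l_fm,
               lam_seg=1.0, lam_gan=1.0, lam_fm=1.0):
    # L_total = L_patch + lam_seg*L_seg + lam_gan*L_GAN + lam_fm*L_fm
    return l_patch + lam_seg * l_seg + lam_gan * l_gan + lam_fm * l_fm

# Placeholder loss values; with all lambdas = 1.0 the total is just the sum
loss = total_loss(0.5, 0.2, 0.1, 0.3)
print(round(loss, 6))  # 1.1
```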

In the embodiment of the present invention, training alternates between two modes: sampling and reconstruction.

The sampling mode corresponds to steps 1–5 described above: guided by the augmented semantic segmentation map, the generator produces a target image whose appearance matches the augmented image I_aug and whose semantic layout matches the augmented segmentation map P_aug.

The reconstruction mode works the same way as the sampling mode, except that the random operations of step 1 are skipped: the inputs are the given source image and its corresponding source semantic segmentation map, and the source image is reconstructed using that segmentation map. The total loss function differs slightly: the appearance similarity loss L_patch is dropped and replaced by an L1 reconstruction loss between the output reconstructed image and the source image, while the feature matching loss and the semantic alignment loss are computed between the target image and the source image, and between the target segmentation map and the source segmentation map, respectively.
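In reconstruction mode the replacement L1 term is simply the mean absolute pixel difference; a minimal sketch (hypothetical name, toy 2×2 "images"):

```python
import numpy as np

def l1_reconstruction_loss(recon, source):
    # Mean absolute difference between the reconstructed and source images.
    return float(np.abs(recon - source).mean())

source = np.array([[0.0, 0.5], [1.0, 0.25]])
recon  = np.array([[0.1, 0.5], [0.8, 0.25]])
loss = l1_reconstruction_loss(recon, source)
print(round(loss, 6))
```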

In the inference stage, the network parameters are fixed. Given a source image I_source, its corresponding source semantic segmentation map P_source, and a specified semantic segmentation map (which can be obtained by editing P_source or by other means), the two segmentation maps are fed to the encoder E, and steps 2–3 are executed; the resulting image matches the specified segmentation map while preserving the content of the source image.

To verify the effectiveness of the present invention, the performance of the above method is evaluated in terms of both quantitative metrics and visual quality.

The semantic image analogy task is applied to images from different datasets, including COCO-Stuff, ADE20K, CelebAMask-HQ, and the web (i.e., randomly selected natural images from the internet). The results of the above method and the comparison methods are evaluated in two respects: 1) the appearance similarity between the source image and the target image; and 2) the semantic consistency between the target image and the target segmentation map.

To evaluate the appearance similarity between the generated image and the source image, a user study was conducted as follows. Ten pairs of images with the same class labels were randomly selected from the COCO-Stuff dataset. For each pair, one image serves as the source image and the other provides the target layout. The Image Analogies method (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327–340; IA) and the Deep Image Analogy method (Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. 2017. Visual Attribute Transfer Through Deep Image Analogy. ACM Trans. Graph. 36, 4 (2017), 120:1–120:15; DIA) are used to transfer the source image into the layout of the other image. IA and DIA are the two works most closely related to the above method. DIA requires a pair of images as source and target, whereas the above method and IA need only one source image and two segmentation maps. The results were shown in random order, and 20 users were asked to rank appearance similarity with the source image as reference. The average ranking of each method across all images and users (Avg. User Ranking) was then computed. Table 1 shows the superiority of the above method (Ours) over the two competitors.


Table 1. Performance under the semantic alignment metrics and subjective user evaluation.

To evaluate the semantic consistency between the generated image and the target segmentation map, the panoptic segmentation model of Detectron2 is used to predict the segmentation map of each generated image, and then the pixel-wise accuracy (Pixel Accuracy) and mean intersection-over-union (mIoU) against the target segmentation are computed. The images used for evaluation are the same as in the user study. As shown in Table 1, the proposed method achieves the highest accuracy.
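Pixel accuracy and mIoU as used in this evaluation can be computed as follows (a standard sketch with a toy 2×2 label map; function names are illustrative, not from Detectron2):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    # Fraction of pixels whose predicted class matches the target map.
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    # Mean intersection-over-union over classes present in pred or gt.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(pixel_accuracy(pred, gt))  # 0.75
print(round(mean_iou(pred, gt, num_classes=2), 4))
```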

Figures 4, 5, and 6 compare visual quality against the state-of-the-art image analogy algorithms IA and DIA, the single-image generative adversarial network SinGAN (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model From a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569–4579), and the segmentation-map-to-image translation model SPADE (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis With Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337–2346). The above method generates results that are consistent in content and aligned in semantic layout. When the source and target images are semantically dissimilar, IA tends to fill regions with repetitive textures, and DIA produces unrealistic results. Because it does not consider semantic structure, SinGAN editing often alters unedited regions and produces undesired textures, or merely blurs the pasted objects, yielding results very close to the unedited input. While SPADE generates results semantically consistent with the target layout, its content is limited to the training dataset and it loses the appearance of the source image. The images produced by our method are faithful to the source image in appearance and semantically consistent with the target layout.

The above method can semantically manipulate an image through its segmentation map: instances can be moved, resized, or deleted in the source semantic segmentation map to obtain the target layout. As shown in Figure 7, the above method produces high-quality results under arbitrary semantic changes while well preserving the local appearance of the changed instances.

The flexible task setting of semantic image analogy enables various applications. Thanks to the dense conditional input, patches in the image can be recombined with pixel-level control. Figures 8, 9, and 10 show three applications of the above method: 1) object removal, where unwanted objects can be easily removed by changing their class labels in the semantic segmentation map to a background class; 2) face editing, where a face image can be edited by changing the shape of the face in the segmentation map; and 3) edge-to-image generation, where other spatial conditions (e.g., edge maps) can serve as conditional input.

From the description of the above embodiments, those skilled in the art will clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the various embodiments of the present invention.

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A semantic image analogy method based on a single-image generative adversarial network, characterized in that it is realized by a network model consisting of an encoder, a generator, an auxiliary classifier and a discriminator; wherein:
a training stage: during each training iteration, carrying out the same random expansion operation on a given source image and a corresponding source semantic segmentation image to obtain a corresponding enhanced image and an enhanced semantic segmentation image; extracting respective feature tensors of the source semantic segmentation image and the enhanced semantic segmentation image through the same encoder, and predicting transformation parameters in an image domain based on the two feature tensors through a semantic feature conversion module in the generator so as to generate a target image by combining the source image under the guidance of the transformation parameters; the target image is respectively input into the discriminator and the auxiliary classifier, and the score map of the target image and the enhanced image and the target semantic segmentation image corresponding to the target image are respectively predicted; constructing a total loss function by using the appearance similarity loss between a target image and a source image, the feature matching loss between the target image and an enhanced image obtained based on a score map, and the semantic alignment loss between the target semantic segmentation image and the enhanced semantic segmentation image for training;
and (3) an inference stage: and inputting the source image, the corresponding source semantic segmentation image and the appointed semantic segmentation image into a semantic image analogy network, and outputting an image with the same semantic layout as the appointed semantic segmentation image.
2. The method of claim 1, wherein the stochastic augmentation operation comprises one or more of the following operations: random turning, size adjustment, rotation and cutting.
3. The method according to claim 1, wherein predicting transformation parameters in the image domain based on two feature tensors by the semantic feature transformation module in the generator comprises: performing element-wise comparison and subtraction on the feature tensor F_source of the source semantic segmentation image and the feature tensor F_aug of the enhanced semantic segmentation image to obtain a feature scaling tensor F_scale and a feature shift tensor F_shift for the subsequent downsampling stages; for the l-th downsampling stage, computing F^l_scale and F^l_shift, wherein F^l_aug and F^l_source are the feature tensors extracted at the l-th downsampling stage from F_aug and F_source, respectively;
using the feature scaling tensor F^l_scale and the feature shift tensor F^l_shift as the scaling factor and shifting factor of the segmentation-map transformation, modeling the conversion from the segmentation domain to the image domain with two semantic feature transformation modules, which respectively process F^l_scale and F^l_shift to obtain the scaling factor and shifting factor (γ^l_img, β^l_img) of the image domain.
4. The semantic image analogy method based on a single-image generative adversarial network according to claim 3, wherein the output feature tensor of the (l+1)-th downsampling stage in the generator is obtained by the following formula:
F^{l+1} = γ^l_img ⊙ (DS(F^l) − mean(DS(F^l))) / std(DS(F^l)) + β^l_img
wherein DS denotes the downsampling module, and mean and std denote the mean and standard deviation, respectively;
an upsampling module of the generator maps the image feature tensor output by the downsampling stage to the image domain, thereby generating the target image;
the generator is an encoder-decoder structure having K downsampling blocks and K upsampling blocks; each block contains a 3×3 convolutional layer with stride 3 and a 4×4 convolutional or transposed-convolutional layer with stride 2 for downsampling or upsampling, and each block also uses spectral normalization, batch normalization, and LeakyReLU activation operations.
5. The semantic image analogy method based on a single-image generative adversarial network according to claim 1, wherein the appearance similarity between the generated image and the source image is measured by a patch coherence loss, defined as the average of the lower bounds of the patch distances between the source image and the target image:
L_patch = (1 / N_target) · Σ_{U_p ⊂ I_target} min_{V_q ⊂ I_source, V_class = U_class} d(U_p, V_q)
wherein N_target is the number of patches in the target image I_target, I_source denotes the source image, and G(I_source) = I_target; U_class and V_class denote the segmentation labels of patches U_p and V_q, and d(·) is a distance metric function.
6. The semantic image analogy method based on a single-image generative adversarial network according to claim 1 or 3, wherein the semantic alignment loss is expressed as:
L_seg = CE(S(G(I_source)), P_aug)
where CE denotes the cross-entropy loss, I_source denotes the source image, and P_aug denotes the enhanced semantic segmentation image.
7. The method of claim 1, wherein the total loss function is expressed as:
L_total = L_patch + λ_seg · L_seg + λ_GAN · L_GAN + λ_fm · L_fm
wherein L_patch denotes the appearance similarity loss, L_seg denotes the semantic alignment loss, L_fm denotes the feature matching loss, and L_GAN denotes the least-squares GAN loss used as an adversarial constraint; λ_seg, λ_GAN and λ_fm are all hyperparameters.
8. The semantic image analogy method based on a single-image generative adversarial network according to claim 1, characterized in that two modes, sampling and reconstruction, are adopted for alternating training;
in the sampling mode, the generator, guided by the enhanced semantic segmentation image, generates a target image whose appearance is identical to the enhanced image I_aug and whose semantic layout is identical to the enhanced semantic segmentation image P_aug;
the reconstruction mode works the same as the sampling mode, except that the given source image and the corresponding source semantic segmentation image are input directly, and the source image is reconstructed using the source semantic segmentation image; the appearance similarity loss in the total loss function is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are respectively the losses between the target image and the source image, and between the target semantic segmentation image and the source semantic segmentation image.
CN202011001562.XA 2020-09-22 2020-09-22 A Semantic Image Analogy Method Based on Single Image Generative Adversarial Network Active CN112102303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011001562.XA CN112102303B (en) 2020-09-22 2020-09-22 A Semantic Image Analogy Method Based on Single Image Generative Adversarial Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011001562.XA CN112102303B (en) 2020-09-22 2020-09-22 A Semantic Image Analogy Method Based on Single Image Generative Adversarial Network

Publications (2)

Publication Number Publication Date
CN112102303A true CN112102303A (en) 2020-12-18
CN112102303B CN112102303B (en) 2022-09-06

Family

ID=73755788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011001562.XA Active CN112102303B (en) 2020-09-22 2020-09-22 A Semantic Image Analogy Method Based on Single Image Generative Adversarial Network

Country Status (1)

Country Link
CN (1) CN112102303B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818997A (en) * 2021-01-29 2021-05-18 北京迈格威科技有限公司 Image synthesis method and device, electronic equipment and computer-readable storage medium
CN113011429A (en) * 2021-03-19 2021-06-22 厦门大学 Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN113313147A (en) * 2021-05-12 2021-08-27 北京大学 Image matching method based on deep semantic alignment network model
CN113610704A (en) * 2021-09-30 2021-11-05 北京奇艺世纪科技有限公司 Image generation method, device, equipment and readable storage medium
CN114596379A (en) * 2022-05-07 2022-06-07 中国科学技术大学 Image reconstruction method, electronic device and storage medium based on depth image prior
WO2022193495A1 (en) * 2021-03-15 2022-09-22 Huawei Cloud Computing Technologies Co., Ltd. Methods and systems for semantic augmentation of images
CN115761239A (en) * 2023-01-09 2023-03-07 深圳思谋信息科技有限公司 Semantic segmentation method and related device
TWI818891B (en) * 2022-06-02 2023-10-11 鴻海精密工業股份有限公司 Training method and electronic device
CN117765372A (en) * 2024-02-22 2024-03-26 广州市易鸿智能装备股份有限公司 Industrial defect sample image generation method and device, electronic equipment and storage medium
CN118351217A (en) * 2024-04-15 2024-07-16 烟台工程职业技术学院(烟台市技师学院) A method, medium and system for directional drawing of AI painting model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090046095A1 (en) * 2007-08-16 2009-02-19 Southwest Research Institute Image Analogy Filters For Terrain Modeling
CN109377537A (en) * 2018-10-18 2019-02-22 云南大学 Style transfer method for heavy color painting
JP2019046269A (en) * 2017-09-04 2019-03-22 株式会社Soat Machine learning training data generation
CN110197226A (en) * 2019-05-30 2019-09-03 厦门大学 A kind of unsupervised image interpretation method and system
EP3686848A1 (en) * 2019-01-25 2020-07-29 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090046095A1 (en) * 2007-08-16 2009-02-19 Southwest Research Institute Image Analogy Filters For Terrain Modeling
JP2019046269A (en) * 2017-09-04 2019-03-22 株式会社Soat Machine learning training data generation
CN109377537A (en) * 2018-10-18 2019-02-22 云南大学 Style transfer method for heavy color painting
EP3686848A1 (en) * 2019-01-25 2020-07-29 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
CN111489412A (en) * 2019-01-25 2020-08-04 辉达公司 Semantic Image Synthesis for Generating Basic Photorealistic Images Using Neural Networks
CN110197226A (en) * 2019-05-30 2019-09-03 厦门大学 A kind of unsupervised image interpretation method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JING LIAO ET AL: "Visual attribute transfer through deep image analogy", 《ACM TRANSACTIONS ON GRAPHICS (TOG)》 *
NIKOLAY JETCHEV ET AL: "The Conditional Analogy GAN: Swapping Fashion Articles on People Images", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW)》 *
TAMAR ROTT SHAHAM ET AL: "SinGAN: Learning a Generative Model From a Single Natural Image", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
张素素等: "融合语义标签和噪声先验的图像生成", 《计算机应用》 *
淦艳等: "生成对抗网络及其应用研究综述", 《小型微型计算机系统》 *
王坤峰等: "平行图像:图像生成的一个新型理论框架", 《模式识别与人工智能》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818997A (en) * 2021-01-29 2021-05-18 北京迈格威科技有限公司 Image synthesis method and device, electronic equipment and computer-readable storage medium
WO2022193495A1 (en) * 2021-03-15 2022-09-22 Huawei Cloud Computing Technologies Co., Ltd. Methods and systems for semantic augmentation of images
US11593945B2 (en) 2021-03-15 2023-02-28 Huawei Cloud Computing Technologies Co., Ltd. Methods and systems for semantic augmentation of images
CN113011429A (en) * 2021-03-19 2021-06-22 厦门大学 Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN113011429B (en) * 2021-03-19 2023-07-25 厦门大学 A Semantic Segmentation Method for Real-time Street View Image Based on Semantic Alignment of Phased Features
CN113313147B (en) * 2021-05-12 2023-10-20 北京大学 An image matching method based on deep semantic alignment network model
CN113313147A (en) * 2021-05-12 2021-08-27 北京大学 Image matching method based on deep semantic alignment network model
CN113610704A (en) * 2021-09-30 2021-11-05 北京奇艺世纪科技有限公司 Image generation method, device, equipment and readable storage medium
CN113610704B (en) * 2021-09-30 2022-02-08 北京奇艺世纪科技有限公司 Image generation method, device, equipment and readable storage medium
CN114596379A (en) * 2022-05-07 2022-06-07 中国科学技术大学 Image reconstruction method, electronic device and storage medium based on depth image prior
TWI818891B (en) * 2022-06-02 2023-10-11 鴻海精密工業股份有限公司 Training method and electronic device
CN115761239A (en) * 2023-01-09 2023-03-07 深圳思谋信息科技有限公司 Semantic segmentation method and related device
CN117765372A (en) * 2024-02-22 2024-03-26 广州市易鸿智能装备股份有限公司 Industrial defect sample image generation method and device, electronic equipment and storage medium
CN117765372B (en) * 2024-02-22 2024-05-14 广州市易鸿智能装备股份有限公司 Industrial defect sample image generation method and device, electronic equipment and storage medium
CN118351217A (en) * 2024-04-15 2024-07-16 烟台工程职业技术学院(烟台市技师学院) A method, medium and system for directional drawing of AI painting model
CN118351217B (en) * 2024-04-15 2024-11-01 烟台工程职业技术学院(烟台市技师学院) A method, medium and system for directional drawing of AI painting model

Also Published As

Publication number Publication date
CN112102303B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN112102303B (en) A Semantic Image Analogy Method Based on Single Image Generative Adversarial Network
Gao et al. Get3d: A generative model of high quality 3d textured shapes learned from images
Lin et al. Magic3d: High-resolution text-to-3d content creation
Vahdat et al. Lion: Latent point diffusion models for 3d shape generation
Yifan et al. Patch-based progressive 3d point set upsampling
Demir et al. Patch-based image inpainting with generative adversarial networks
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Shen et al. Clipgen: A deep generative model for clipart vectorization and synthesis
Din et al. Effective removal of user-selected foreground object from facial images using a novel GAN-based network
CN118379468A (en) A method and system for local editing of three-dimensional scenes based on neural radiance fields
Yu et al. End-to-end partial convolutions neural networks for Dunhuang grottoes wall-painting restoration
Zhou et al. Neural texture synthesis with guided correspondence
CN114565738A (en) Point cloud completion method based on local geometric consistency and characteristic consistency
Zheng et al. Semantic layout manipulation with high-resolution sparse attention
Chen et al. ShaDDR: interactive example-based geometry and texture generation via 3D shape detailization and differentiable rendering
Zhong et al. Image inpainting using diffusion models to restore eaves tile patterns in Chinese heritage buildings
Xiao et al. Image inpainting network for filling large missing regions using residual gather
Li et al. Scraping textures from natural images for synthesis and editing
Zhang et al. MFFNet: Single facial depth map refinement using multi-level feature fusion
CN112652059B (en) Improved target detection and 3D reconstruction method based on Mesh R-CNN model
Ji et al. A dense-gated U-Net for brain lesion segmentation
Xu et al. Draw2edit: Mask-free sketch-guided image manipulation
Li et al. ZRDNet: Zero-reference image defogging by physics-based decomposition–reconstruction mechanism and perception fusion
Ren et al. Example-based image synthesis via randomized patch-matching
Jahoda et al. Prism: Progressive restoration for scene graph-based image manipulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant