
CN114429603A - Lightweight video semantic segmentation method, system, device and storage medium - Google Patents


Info

Publication number: CN114429603A
Application number: CN202210068739.0A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: frame image, network, propagation, feature, semantic segmentation
Inventors: 王子磊, 庄嘉帆
Applicant and current assignee: University of Science and Technology of China (USTC)


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses a lightweight video semantic segmentation method, system, device and storage medium. Exploiting the fact that the image domain and the feature domain share the same distortion pattern, the key frame image and its semantic features are propagated jointly, and a distortion-aware network is designed to identify distorted regions by comparing the propagated frame with the current frame. Based on the identified distorted regions, a feature correction network extracts from the current frame the essential information lost to distortion in the propagated features, replaces the distorted regions of the propagated features, and retains the original propagated features in the other regions, thereby achieving accurate correction of the propagated features.

Description

Lightweight video semantic segmentation method, system, device and storage medium

Technical Field

The present invention relates to the technical field of video semantic segmentation, and in particular to a lightweight video semantic segmentation method, system, device and storage medium.

Background

With the development of video surveillance and deep learning technology, video semantic segmentation has received increasing attention. The goal of the video semantic segmentation task is to classify every pixel in a video clip, producing a fine-grained parsing of the scene and target objects in the video. Unlike image semantic segmentation, video semantic segmentation can exploit the temporal-correlation prior present in video data, using the correlation between adjacent frames to guide the segmentation of the current frame, thereby reducing redundant computation and improving segmentation performance.

In current research on video semantic segmentation, the mainstream approach is to select a small number of representative key frames for full semantic segmentation; for non-key frames, motion vector fields (e.g., residual maps and optical flow) are used to propagate and reuse the key frame features, reducing the computation needed to segment the current frame and hence the average computation over the whole video clip. In the patent "A Real-time Semantic Video Segmentation Method", the residual map, motion vectors and RGB image are obtained through video decoding; if the current frame is an I-frame, a semantic segmentation network performs full semantic segmentation on the RGB image, and if it is a P-frame, the motion vectors are used to propagate the segmentation result of the previous frame to the current frame. In the patents "Adaptive Keyframe Selection Method in Video Semantic Segmentation" and "A Video Semantic Segmentation Method Based on Optical Flow Feature Fusion", an adaptive key frame selection network is built to pick representative key frames, and an optical flow network then predicts the optical flow between the key frame and the current frame to propagate and reuse the key frame features.

However, in weakly textured regions and regions containing small objects, the predicted motion vector field is often inaccurate, so the features become distorted when propagated under its guidance, which directly degrades video semantic segmentation accuracy.

Summary of the Invention

The purpose of the present invention is to provide a lightweight video semantic segmentation method, system, device and storage medium that can accurately identify and correct distorted regions, greatly improving the accuracy and robustness of semantic segmentation, while the whole process introduces only a small amount of additional computation and has high computational efficiency.

The purpose of the present invention is achieved through the following technical solutions:

A lightweight video semantic segmentation method, comprising:

if the current frame image is a non-key frame image, using an optical flow estimation network to estimate the optical flow between the previous frame image and the current frame image, and using the optical flow to apply pixel-level displacements to the previous frame image and to its corresponding semantic features respectively, obtaining a propagated frame image and propagated features;

using a distortion-aware network to compare the feature difference between the propagated frame image and the current frame image and predict the distorted regions in the propagated features;

using a feature correction network to extract correction information from the current frame image based on the predicted distorted regions in the propagated features, and replace those distorted regions, obtaining corrected features;

performing semantic segmentation on the corrected features through a semantic segmentation network.

A lightweight video semantic segmentation system, implemented based on the foregoing method, the system comprising:

a feature and image joint propagation module, configured to, when the current frame image is a non-key frame image, use an optical flow estimation network to estimate the optical flow between the previous frame image and the current frame image, and use the optical flow to apply pixel-level displacements to the previous frame image and to its corresponding semantic features respectively, obtaining a propagated frame image and propagated features;

a distortion-aware network, configured to compare the feature difference between the propagated frame image and the current frame image and predict the distorted regions in the propagated features;

a feature correction network, configured to extract correction information from the current frame image based on the predicted distorted regions in the propagated features, and replace those distorted regions, obtaining corrected features;

a semantic segmentation network, configured to perform semantic segmentation on the corrected features.

A processing device, comprising: one or more processors; and a memory for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the aforementioned method.

A readable storage medium storing a computer program, wherein the aforementioned method is implemented when the computer program is executed by a processor.

It can be seen from the above technical solutions provided by the present invention that, exploiting the fact that the image domain and the feature domain share the same distortion pattern, the key frame image and the semantic features are propagated jointly, and a distortion-aware network is designed to identify distorted regions by comparing the propagated frame with the current frame. Based on the identified distorted regions, a feature correction network is designed to extract from the current frame the essential information lost to distortion in the propagated features, replace the distorted regions of the propagated features, and retain the original propagated features in the other regions, thereby achieving accurate correction of the propagated features.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a framework diagram of a lightweight video semantic segmentation method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the DMNet network structure provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the DMNet network training strategy provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the propagated feature correction process provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the CFNet network structure and training process provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of the network training strategy provided by an embodiment of the present invention;

Fig. 7 is a schematic diagram of a lightweight video semantic segmentation system provided by an embodiment of the present invention;

Fig. 8 is a schematic diagram of a processing device provided by an embodiment of the present invention.

Detailed Description of Embodiments

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

First, the terms that may be used herein are explained as follows:

The terms "comprising", "including", "containing", "having" and descriptions with similar meanings should be construed as non-exclusive inclusions. For example, an inclusion of certain technical feature elements (such as raw materials, components, ingredients, carriers, dosage forms, materials, dimensions, parts, components, mechanisms, devices, steps, processes, methods, reaction conditions, processing conditions, parameters, algorithms, signals, data, products or articles, etc.) should be construed to include not only the explicitly listed technical feature elements but also other technical feature elements known in the art that are not explicitly listed.

The term "consisting of" excludes any technical feature element not explicitly listed. If this term is used in a claim, it makes the claim closed, so that the claim contains no technical feature elements other than those explicitly listed, except for conventional impurities associated therewith. If the term appears only in one clause of a claim, it limits only the elements explicitly listed in that clause; elements recited in other clauses are not excluded from the claim as a whole.

A lightweight video semantic segmentation solution provided by the present invention is described in detail below. Contents not described in detail in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not indicated in the embodiments, they follow conventional conditions in the art or the conditions suggested by the manufacturer.

Embodiment 1

As shown in Fig. 1, which is a framework diagram of a lightweight video semantic segmentation method, different processing is applied depending on whether the current frame is a key frame or a non-key frame. For non-key frame images, feature reuse is realized by frame-by-frame propagation: specifically, an optical flow network computes the optical flow field between the current frame and the previous frame, the semantic features of the previous frame are shifted pixel-wise to realize feature propagation, and the semantic segmentation result of the current frame is obtained by reusing the key frame's semantic features, reducing the overall average computation. For key frame images, a pre-trained semantic segmentation network (NetSeg) directly performs the full semantic segmentation operation (which can be achieved with conventional techniques).
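For illustration, the key frame / non-key frame dispatch described above can be sketched as follows; the function arguments (`full_segment`, `flow_estimate`, `warp`, `correct`) are placeholders standing in for NetSeg, the optical flow network, and the propagation and correction steps, and are not part of the patent:

```python
def segment_video(frames, N, full_segment, flow_estimate, warp, correct):
    """Return per-frame semantic features, reusing key frame features.

    Every N-th frame is a key frame and receives full segmentation;
    other frames are segmented by propagating (and correcting) the
    previous frame's features under the estimated optical flow.
    """
    features = []
    prev_frame, prev_feat = None, None
    for t, frame in enumerate(frames):
        if t % N == 0:  # key frame: full segmentation
            feat = full_segment(frame)
        else:           # non-key frame: propagate, then correct
            flow = flow_estimate(prev_frame, frame)
            warped_frame = warp(prev_frame, flow)
            warped_feat = warp(prev_feat, flow)
            feat = correct(warped_frame, frame, warped_feat)
        features.append(feat)
        prev_frame, prev_feat = frame, feat
    return features


# Tiny smoke run with trivial stand-ins: "features" are just tagged frames.
out = segment_video(
    list(range(7)), N=3,
    full_segment=lambda f: ("full", f),
    flow_estimate=lambda a, b: b - a,
    warp=lambda x, flow: x,
    correct=lambda wf, f, wfeat: ("propagated", f),
)
```

Here frames 0, 3 and 6 are segmented fully, while the rest reuse propagated features.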

The core of the embodiments of the present invention lies in the handling of feature propagation for non-key frame images, which mainly comprises the following three parts:

1. An optical flow estimation network estimates the optical flow between the previous frame image and the current frame image, and the optical flow is used to apply pixel-level displacements to the previous frame image and to its corresponding semantic features respectively, obtaining a propagated frame image and propagated features.

In an embodiment of the present invention, taking time t+1 as an example, the optical flow estimation network estimates the optical flow Flow_t between the previous frame image F_t and the current frame image F_{t+1}, which represents the relative displacement between every pixel of F_{t+1} and the corresponding pixel of F_t; here t and t+1 denote two adjacent time steps. Guided by the optical flow, the previous frame image F_t and its corresponding semantic features f_t are each shifted at the pixel level, propagating both the image and the semantic features and obtaining the propagated frame image F^w_{t+1} and the propagated features f^w_{t+1}.
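A minimal sketch of the pixel-level displacement (warping) step, here implemented with nearest-neighbour backward sampling on NumPy arrays; a real implementation would typically use bilinear sampling, and the function name `warp` is illustrative:

```python
import numpy as np

def warp(x, flow):
    """Backward-warp x (H x W or H x W x C) with flow (H x W x 2).

    For each output pixel, the flow gives the relative displacement back
    to its source pixel in x; rounded to the nearest neighbour here for
    simplicity, with sources clipped to the image border.
    """
    h, w = x.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return x[src_y, src_x]

# A uniform flow of (+1, 0) makes every pixel sample one column to its right.
img = np.arange(9).reshape(3, 3)
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0
warped = warp(img, flow)
```

The same function applies unchanged to a feature map of shape H x W x C, which is what makes the joint propagation of image and features convenient.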

In an embodiment of the present invention, if the previous frame is a key frame, its semantic features f_t are extracted from the previous frame image F_t by the semantic segmentation network; if the previous frame is a non-key frame, its features were obtained by propagation from the frame before it (i.e., frame t-1). The key frame selection rule is to select every N-th frame as a key frame, where the value of N can be set by those skilled in the art according to the actual situation or experience; Fig. 1 shows an example in which frames t and t+3 are key frames.

2. A distortion-aware network (Distortion Map Network, DMNet) compares the feature differences between the propagated frame image and the current frame image and predicts the distorted regions in the propagated features.

In an embodiment of the present invention, the distortion-aware network extracts the features of the propagated frame image and of the current frame image, denoted f^w_{t+1} and f_{t+1} respectively. After each is normalized, the pixel-level cosine similarity of the two features is computed; the smaller the similarity, the more severe the distortion. After normalization, the distortion map M_{t+1}, which contains the distorted regions of the propagated features, is predicted. The prediction of the distortion map M_{t+1} is expressed as:

    f̂_{t+1}(p) = f_{t+1}(p) / ||f_{t+1}(p)||,   f̂^w_{t+1}(p) = f^w_{t+1}(p) / ||f^w_{t+1}(p)||

    S_{t+1}(p) = ⟨f̂_{t+1}(p), f̂^w_{t+1}(p)⟩ = f̂_{t+1}(p)^T f̂^w_{t+1}(p)

    M_{t+1}(p) = (1 − S_{t+1}(p)) / 2

where f̂_{t+1} denotes the normalized features of the current image and f̂^w_{t+1} the normalized features of the propagated frame image; T is the transpose symbol; p denotes a single pixel, and S_{t+1}(p) is the cosine similarity of the single pixel p at the same position in f̂_{t+1} and f̂^w_{t+1}; the cosine similarities of all pixels constitute the cosine similarity matrix S_{t+1}; and ⟨·,·⟩ denotes the cosine similarity operation.

In an embodiment of the present invention, the above expressions are evaluated for the features of all pixels to predict the distortion map, i.e., p ∈ I, where I denotes the set of all pixels. The distortion map has the same size as the frame image and characterizes the degree of distortion of the frame; it contains distorted regions and non-distorted (i.e., normal) regions. When a pixel p lies in a distorted region, its distortion value M_{t+1}(p) is large; when p lies in a normal region, its distortion value M_{t+1}(p) is small.

As shown in Fig. 2, which is a schematic diagram of the distortion-aware network, the network is provided with a feature extractor for extracting the features of the propagated frame image and the current frame image. To keep the computational cost low, the feature extractor comprises four separable convolutional layers, each paired with a batch normalization layer and an activation layer; compared with the semantic segmentation network, the overall computation of the distortion-aware network is almost negligible. Using the distortion-aware network, the distorted regions present in the propagated features can be identified quickly and accurately, providing guidance for the subsequent correction of the propagated features.

Those skilled in the art will understand that separable convolutional layers are an existing type of convolutional layer with a lower computational cost than conventional convolutional layers.

In an embodiment of the present invention, a supervised training scheme is designed for the distortion-aware network. As shown in Fig. 3, during training the previous frame image F_t and its corresponding semantic features f_t are propagated to the current time step, yielding the propagated frame F^w_{t+1} and the propagated features f^w_{t+1} (taking time t+1 as an example). The semantic segmentation network extracts the semantic features f_{t+1} of the current frame image F_{t+1}; semantic segmentation is then performed separately on the current-frame features f_{t+1} and on the propagated features f^w_{t+1}, and an exclusive-or (XOR) operation over the two segmentation results produces a difference map, which serves as the supervision signal for training the distortion-aware network.
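The construction of this supervision signal can be sketched as follows; the logits here are illustrative stand-ins for the outputs of the segmentation head:

```python
import numpy as np

def difference_map(logits_cur, logits_warp):
    """Binary map marking pixels where the two segmentations disagree.

    logits_* : H x W x num_classes class scores. The XOR of the two
    per-pixel argmax label maps is 1 exactly where the labels differ,
    which serves as the training target for the distortion-aware network.
    """
    labels_cur = np.argmax(logits_cur, axis=-1)
    labels_warp = np.argmax(logits_warp, axis=-1)
    return (labels_cur != labels_warp).astype(np.uint8)

logits_a = np.zeros((2, 2, 3))
logits_a[..., 0] = 1.0             # every pixel predicted as class 0
logits_b = np.zeros((2, 2, 3))
logits_b[..., 0] = 1.0
logits_b[0, 1, 2] = 5.0            # one pixel flips to class 2
diff = difference_map(logits_a, logits_b)
```

Only the flipped pixel is marked, i.e. exactly the region where propagation damaged the prediction.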

3. A feature correction network (Feature Correction Module, FCM) extracts correction information from the current frame image based on the predicted distorted regions in the propagated features, replaces those distorted regions, and obtains the corrected features.

Because distortion occurs during propagation, the semantic information in the distorted regions of the propagated features has been destroyed, and the missing semantic information must be extracted from the current frame as a replacement. To keep the computational cost low, the present invention designs a lightweight network (referred to as the CFNet network) for extracting the correction information from the current frame image.

As shown in Fig. 4, the correction information extracted from the current frame image is denoted f^c_{t+1}. Using the predicted distortion map M_{t+1}, which contains the distorted regions of the propagated features, as the weight, a weighted sum with the propagated features f^w_{t+1} yields the corrected features f̃_{t+1}, expressed as:

    f̃_{t+1} = M_{t+1} ⊙ f^c_{t+1} + (1 − M_{t+1}) ⊙ f^w_{t+1}

where ⊙ denotes pixel-wise multiplication.
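The replacement step then reduces to a per-pixel convex combination of the two feature maps, e.g.:

```python
import numpy as np

def correct_features(feat_warp, feat_corr, dmap):
    """Blend propagated and correction features using the distortion map.

    feat_warp, feat_corr : H x W x C; dmap : H x W with values in [0, 1].
    Distorted pixels (dmap near 1) are replaced by the correction
    features; normal pixels (dmap near 0) keep the propagated features.
    """
    w = dmap[..., None]  # broadcast the per-pixel weight over channels
    return w * feat_corr + (1.0 - w) * feat_warp

feat_warp = np.zeros((2, 2, 3))    # stand-in propagated features
feat_corr = np.ones((2, 2, 3))     # stand-in correction information
dmap = np.array([[1.0, 0.0], [0.5, 0.0]])
out = correct_features(feat_warp, feat_corr, dmap)
```

A fully distorted pixel takes the correction value, an undistorted pixel keeps the propagated value, and intermediate weights interpolate between the two.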

The three images shown at the top of Fig. 1 all represent corrected feature maps.

As shown in Fig. 5, the CFNet network uses an encoder-decoder structure, in which the encoder consists of ten convolutional layers and the decoder consists of four deconvolutional layers; each convolutional and deconvolutional layer is paired with a batch normalization layer and an activation layer. Compared with the semantic segmentation network, the overall computation of the CFNet network is likewise negligible.

To make the CFNet network focus more on feature extraction in the distorted regions, the present invention designs a targeted loss function, namely using the distortion map M_{t+1} to train the CFNet network. Specifically, the distortion map M_{t+1} can be used to weight the cross-entropy loss, reducing the loss weight of non-distorted regions; the loss function is expressed as:

    L = − Σ_{p∈I} M_{t+1}(p) · log P_{t+1}(p)

where P_{t+1}(p) is the probability predicted from the correction information f^c_{t+1}, i.e., the probability of the corresponding semantic class obtained by passing the correction information f^c_{t+1} through the classifier of the semantic segmentation network; p denotes a single pixel and I the set of all pixels. The distortion values of pixels in distorted regions are larger than those of pixels in normal regions, so the loss concentrates on the distorted regions. Through this loss weighting, the lightweight CFNet network can accurately extract the features of the distorted regions, thereby reducing the additional computation introduced by the whole feature correction scheme.
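A sketch of the distortion-weighted cross-entropy; it assumes, as is conventional, that the log-probability of the ground-truth class is taken at each pixel (the softmax and indexing details are choices of this sketch, not specified in the text):

```python
import numpy as np

def weighted_ce_loss(logits, labels, dmap, eps=1e-8):
    """Cross-entropy weighted per pixel by the distortion map.

    logits : H x W x num_classes class scores,
    labels : H x W integer ground-truth labels,
    dmap   : H x W distortion weights; pixels in distorted regions
    (large dmap) dominate the loss, normal regions are down-weighted.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    h, w = labels.shape
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(np.sum(dmap * -np.log(p_true + eps)))

logits = np.zeros((2, 2, 3))
logits[..., 1] = 4.0                       # confident prediction of class 1
labels = np.ones((2, 2), dtype=int)        # ground truth is class 1
loss_uniform = weighted_ce_loss(logits, labels, np.ones((2, 2)))
loss_masked = weighted_ce_loss(logits, labels, np.zeros((2, 2)))
```

With all weights zero the loss vanishes entirely, which is the limiting case of down-weighting non-distorted regions.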

In addition, it should be noted that the various features involved in the above computations (propagated features, image features, correction information, etc.) and the various images (e.g., frame images, distortion maps) all have the same size; therefore, a single pixel is uniformly denoted by p throughout.

After the processing of parts 1 to 3 above, the semantic segmentation network (NetSeg) performs semantic segmentation on the corrected features; this can be achieved with conventional techniques and is not described further here.

The above scheme of the embodiments of the present invention mainly has the following advantages. First, it can be conveniently embedded into existing video semantic segmentation frameworks; by predicting the distorted regions of the propagated features and correcting them in a targeted manner, it efficiently solves the feature distortion problem caused by inaccurate optical flow estimation, thereby greatly improving semantic segmentation accuracy. Second, both the distorted-region prediction network and the CFNet network have lightweight designs, which greatly reduce the additional computational cost introduced, giving high computational efficiency.

For ease of understanding, the complete workflow in an application scenario of the present invention is described in detail below with reference to a specific embodiment.

In the first stage, a video semantic segmentation dataset containing a large number of video clips is prepared; one frame of each video clip is selected for pixel-level annotation, and the dataset is then divided into a training set and a test set.

第二阶段,使用深度学习框架,建立网络模型,并根据选定的数据集确定网络结构参数,如图1所示。网络框架主要由语义分割网络、光流估计网络、DMNet和CFNet组成。语义分割网络直接使用现有的语义分割网络,例如,可以采用DeepLabv3+作为语义分割网络,因为它在准确性和效率方面都有很好的性能,并使用FlowNet2-S网络用于光流估计。In the second stage, a deep learning framework is used to build a network model and determine the network structure parameters based on the selected dataset, as shown in Figure 1. The network framework is mainly composed of semantic segmentation network, optical flow estimation network, DMNet and CFNet. The semantic segmentation network directly uses the existing semantic segmentation network, for example, DeepLabv3+ can be adopted as the semantic segmentation network because it has good performance in terms of accuracy and efficiency, and the FlowNet2-S network is used for optical flow estimation.
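As a rough, non-authoritative illustration of how the four components (semantic segmentation network, optical flow estimation network, DMNet, CFNet) fit together, the following Python sketch wires up stand-in functions for each module. All function names and the toy computations inside them are hypothetical placeholders; real versions would be deep models such as DeepLabv3+ and FlowNet2-S:

```python
import numpy as np

def estimate_flow(prev_frame, cur_frame):
    # Optical flow stand-in: zero displacement per pixel, shape (H, W, 2).
    h, w = cur_frame.shape[:2]
    return np.zeros((h, w, 2), dtype=np.float32)

def warp(tensor, flow):
    # Identity warp for zero flow; a real implementation samples the
    # tensor at positions shifted by the flow field.
    return tensor.copy()

def extract_features(frame):
    # Segmentation-backbone stand-in: per-pixel intensity as a 1-channel feature.
    return frame.astype(np.float32)[..., None]

def predict_distortion_map(warped_frame, cur_frame):
    # DMNet stand-in: normalized absolute difference as a "distortion" score.
    diff = np.abs(warped_frame - cur_frame).astype(np.float32)
    return diff / (diff.max() + 1e-6)

def correct_features(prop_feat, distortion, cur_frame):
    # CFNet/FCM stand-in: blend correction information from the current
    # frame into the propagated feature, weighted by the distortion map.
    correction = extract_features(cur_frame)
    m = distortion[..., None]
    return m * correction + (1.0 - m) * prop_feat

def process_frame(prev_frame, prev_feat, cur_frame, is_key):
    if is_key:
        return extract_features(cur_frame)           # full segmentation path
    flow = estimate_flow(prev_frame, cur_frame)      # 1) joint propagation
    warped_frame = warp(prev_frame, flow)
    prop_feat = warp(prev_feat, flow)
    distortion = predict_distortion_map(warped_frame, cur_frame)  # 2) DMNet
    return correct_features(prop_feat, distortion, cur_frame)     # 3) FCM
```

With identical adjacent frames the distortion map is zero everywhere, so the corrected feature equals the propagated feature, matching the intent that correction only replaces distorted regions.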

第三阶段,训练网络模型。如图6所示,训练阶段,针对光流估计网络、扭曲感知网络、CFNet网络以及语义分割网络进行训练;训练流程描述为:The third stage is to train the network model. As shown in Figure 6, in the training stage, the optical flow estimation network, distortion perception network, CFNet network and semantic segmentation network are trained; the training process is described as:

1)语义分割网络与光流估计网络各自进行预训练,使用选定数据集对预训练的语义分割网络进行微调训练;扭曲感知网络使用生成的扭曲区域标注进行训练。1) The semantic segmentation network and the optical flow estimation network are pre-trained separately, and the pre-trained semantic segmentation network is fine-tuned using the selected dataset; the distortion-aware network is trained using the generated distortion region annotations.

示例性的,使用数据集ImageNet对语义分割网络进行预训练;使用飞椅数据集对光流估计网络进行预训练。Exemplarily, the semantic segmentation network is pre-trained using the dataset ImageNet; the optical flow estimation network is pre-trained using the flying chair dataset.

2)固定语义分割网络与扭曲感知网络,将预训练的光流估计网络与随机初始化参数的CFNet网络共同训练,共同训练时采用双深度监督(DDS)策略。2) The semantic segmentation network and the distortion perception network are fixed, and the pre-trained optical flow estimation network and the CFNet network with random initialization parameters are jointly trained, and the dual depth supervision (DDS) strategy is used in the joint training.

所述双深度监督策略包括:在每个训练样本的关键帧图像和非关键帧图像间加入了一个中间帧图像,减少了特征传播距离,更利于测试阶段的逐帧特征传播。而且,利用中间帧的伪标签可以得到额外的监督信号,有利于提升模型的分割性能。The double-depth supervision strategy includes: adding an intermediate frame image between the key frame image and the non-key frame image of each training sample, which reduces the feature propagation distance and is more conducive to frame-by-frame feature propagation in the testing phase. Moreover, additional supervision signals can be obtained by using the pseudo-labels of intermediate frames, which is beneficial to improve the segmentation performance of the model.

所述双深度监督策略包括：将训练样本的关键帧图像记为F1，非关键帧图像记为F3，则二者之间加入一个额外的中间帧图像F2（从数据集采样得到）；通过预训练的光流估计网络估计关键帧图像F1与中间帧图像F2的光流，利用光流对关键帧图像F1及其语义特征进行像素级位移，得到传播帧图像 $\tilde{F}_2$ 与传播特征 $\tilde{f}_2$；利用训练后的扭曲感知网络，对比所述传播帧图像 $\tilde{F}_2$ 与中间帧图像F2的特征差异，预测包含传播特征 $\tilde{f}_2$ 中的扭曲区域的扭曲图 $M_2$（具体方式参见前文），再通过CFNet网络从中间帧图像F2提取矫正信息，结合传播特征 $\tilde{f}_2$ 与扭曲图 $M_2$ 获得矫正后特征 $\bar{f}_2$（具体方式参见前文）；对 $\bar{f}_2$ 与 $\tilde{f}_2$ 施加语义约束，监督信号由F2的伪标签(Pseudo Label)提供，如图6所示，$\bar{f}_2$、$\tilde{f}_2$ 与F2的伪标签之间的损失分别表示为LC、LP，F2的伪标签通过前述1)训练后的语义分割网络提取得到；类似的，然后计算由F2到F3的传播特征 $\tilde{f}_3$，并经过特征矫正模块FCM获得矫正后特征 $\bar{f}_3$，对 $\bar{f}_3$ 与 $\tilde{f}_3$ 施加语义约束，监督信号由F3的标签（来自数据集）提供，同样的，$\bar{f}_3$、$\tilde{f}_3$ 与F3的标签之间的损失也分别表示为LC、LP。此部分的损失LC、LP分别为矫正后特征与标签（或伪标签）的损失、传播特征与标签（或伪标签）的损失。

The double-depth supervision strategy is as follows: denote the key frame image of a training sample as F1 and the non-key frame image as F3, and insert an additional intermediate frame image F2 (sampled from the dataset) between them. The pre-trained optical flow estimation network estimates the optical flow between the key frame image F1 and the intermediate frame image F2; guided by this flow, pixel-level displacement is applied to F1 and its semantic features, yielding the propagated frame image $\tilde{F}_2$ and the propagated feature $\tilde{f}_2$. The trained distortion-aware network compares the feature difference between $\tilde{F}_2$ and the intermediate frame image F2, and predicts the distortion map $M_2$ that marks the distorted regions of $\tilde{f}_2$ (see above for details). The CFNet network then extracts correction information from F2 and combines it with the propagated feature $\tilde{f}_2$ and the distortion map $M_2$ to obtain the corrected feature $\bar{f}_2$ (see above for details). Semantic constraints are imposed on $\bar{f}_2$ and $\tilde{f}_2$, with the supervision signal provided by the pseudo label of F2, as shown in Fig. 6; the losses between $\bar{f}_2$, $\tilde{f}_2$ and the pseudo label of F2 are denoted LC and LP, respectively, where the pseudo label of F2 is extracted by the semantic segmentation network trained in step 1) above. Similarly, the propagated feature $\tilde{f}_3$ from F2 to F3 is then computed and passed through the feature correction module FCM to obtain the corrected feature $\bar{f}_3$; semantic constraints are imposed on $\bar{f}_3$ and $\tilde{f}_3$, with the supervision signal provided by the ground-truth label of F3 (from the dataset), and the losses between $\bar{f}_3$, $\tilde{f}_3$ and the label of F3 are likewise denoted LC and LP. Here LC is the loss between the corrected feature and the (pseudo) label, and LP is the loss between the propagated feature and the (pseudo) label.

第四阶段，对于测试集中的每个视频片段，等间距采样关键帧，然后将每个视频片段逐帧输入网络。若输入帧是关键帧，直接使用图像分割网络对该帧进行语义分割，并保留语义特征（例如，前文提到的语义特征 $f_t$），通过光流网络传播到下一帧，同时将图像通过相同光流传播到下一帧。若输入帧是非关键帧，首先将传播帧和当前帧输入DMNet得到预测的扭曲区域，然后将扭曲区域、传播特征和当前帧输入FCM，FCM中的CFNet从当前帧图像中提取矫正信息并和扭曲区域加权，从而实现对传播特征的矫正。最后将矫正后特征送入图像分割网络的分类器，从而得到分割结果。同时，若下一帧仍为非关键帧，则将传播特征继续向下一帧传播。In the fourth stage, for each video clip in the test set, key frames are sampled at equal intervals, and each clip is fed into the network frame by frame. If the input frame is a key frame, the image segmentation network segments it directly, and its semantic feature (e.g., the semantic feature $f_t$ mentioned above) is retained and propagated to the next frame through the optical flow network, while the image is propagated to the next frame through the same optical flow. If the input frame is a non-key frame, the propagated frame and the current frame are first fed into DMNet to obtain the predicted distorted regions; the distorted regions, the propagated feature, and the current frame are then fed into the FCM, whose CFNet extracts correction information from the current frame image and weights it against the distorted regions, thereby correcting the propagated feature. Finally, the corrected feature is sent to the classifier of the image segmentation network to obtain the segmentation result. Meanwhile, if the next frame is still a non-key frame, the propagated feature continues to be propagated to it.
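The frame-by-frame inference loop of the fourth stage can be sketched as follows. The heavy steps (full segmentation, flow warping, DMNet, FCM) are stubbed out as toy computations; only the key-frame scheduling and the carrying of the propagated feature from frame to frame reflect the described procedure:

```python
import numpy as np

def segment_clip(frames, key_interval=5):
    # Frame-by-frame inference for one clip: key frames take the full
    # segmentation path and refresh the cached feature; non-key frames
    # reuse the feature propagated from the previous frame and corrected
    # against the current frame (both steps stubbed here).
    paths, prev_feat = [], None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            prev_feat = frame.astype(np.float32)       # full NetSeg feature
            paths.append("key")
        else:
            propagated = prev_feat                     # flow warp (stub)
            corrected = 0.5 * propagated + 0.5 * frame # DMNet + FCM (stub)
            prev_feat = corrected                      # carry to next frame
            paths.append("non-key")
    return paths

print(segment_clip([np.zeros((2, 2))] * 6, key_interval=3))
# → ['key', 'non-key', 'non-key', 'key', 'non-key', 'non-key']
```

Note that the propagated feature is updated on every non-key frame, so error accumulation between key frames is exactly what the distortion prediction and correction steps are designed to suppress.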

实施例二Embodiment 2

本发明还提供一种轻量级视频语义分割系统,其主要基于前述实施例一提供的方法实现,如图7所示,该系统主要包括:The present invention also provides a lightweight video semantic segmentation system, which is mainly implemented based on the method provided in the first embodiment. As shown in FIG. 7 , the system mainly includes:

特征与图像联合传播模块,用于在当前帧图像为非关键帧图像时,利用光流估计网络估计前一帧图像与当前帧图像的光流,利用光流分别对前一帧图像及其对应的语义特征进行像素级位移,得到传播帧图像与传播特征;The feature and image joint propagation module is used to estimate the optical flow of the previous frame image and the current frame image by using the optical flow estimation network when the current frame image is a non-key frame image, and use the optical flow to respectively analyze the previous frame image and its corresponding image. Pixel-level displacement is performed on the semantic features of , and the propagation frame images and propagation features are obtained;

扭曲感知网络,用于对比所述传播帧图像与当前帧图像的特征差异,预测传播特征中的扭曲区域;Distortion perception network, used to compare the feature difference between the propagation frame image and the current frame image, and predict the warped area in the propagation feature;

特征矫正网络,用于基于预测出的传播特征中的扭曲区域,从当前帧图像中提取矫正信息,对预测出的传播特征中的扭曲区域进行替换,获得矫正后的特征;The feature correction network is used to extract correction information from the current frame image based on the distorted region in the predicted propagation feature, replace the distorted region in the predicted propagation feature, and obtain the corrected feature;

语义分割网络,用于对所述矫正后的特征进行语义分割。A semantic segmentation network for performing semantic segmentation on the rectified features.

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，仅以上述各功能模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能模块完成，即将系统的内部结构划分成不同的功能模块，以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above division into functional modules is used as an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above.

此外,系统各模块所涉及的相关技术细节可参见前述实施例一中的介绍,此处不做赘述。In addition, for the relevant technical details involved in each module of the system, reference may be made to the introduction in the foregoing Embodiment 1, which will not be repeated here.

实施例三Embodiment 3

本发明还提供一种处理设备，如图8所示，其主要包括：一个或多个处理器；存储器，用于存储一个或多个程序；其中，当所述一个或多个程序被所述一个或多个处理器执行时，使得所述一个或多个处理器实现前述实施例提供的方法。The present invention also provides a processing device, as shown in FIG. 8 , which mainly includes: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the methods provided by the foregoing embodiments.

进一步的,所述处理设备还包括至少一个输入设备与至少一个输出设备;在所述处理设备中,处理器、存储器、输入设备、输出设备之间通过总线连接。Further, the processing device further includes at least one input device and at least one output device; in the processing device, the processor, the memory, the input device, and the output device are connected through a bus.

本发明实施例中,所述存储器、输入设备与输出设备的具体类型不做限定;例如:In this embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:

输入设备可以为触摸屏、图像采集设备、物理按键或者鼠标等;The input device can be a touch screen, an image capture device, a physical button or a mouse, etc.;

输出设备可以为显示终端;The output device can be a display terminal;

存储器可以为随机存取存储器(Random Access Memory,RAM),也可为非不稳定的存储器(non-volatile memory),例如磁盘存储器。The memory may be a random access memory (Random Access Memory, RAM), or a non-volatile memory (non-volatile memory), such as a disk memory.

实施例四Embodiment 4

本发明还提供一种可读存储介质,存储有计算机程序,当计算机程序被处理器执行时实现前述实施例提供的方法。The present invention also provides a readable storage medium storing a computer program, and when the computer program is executed by a processor, the methods provided by the foregoing embodiments are implemented.

本发明实施例中可读存储介质作为计算机可读存储介质,可以设置于前述处理设备中,例如,作为处理设备中的存储器。此外,所述可读存储介质也可以是U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。In this embodiment of the present invention, the readable storage medium, as a computer-readable storage medium, may be provided in the aforementioned processing device, for example, as a memory in the processing device. In addition, the readable storage medium may also be a U disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, or an optical disk, and other mediums that can store program codes.

以上所述，仅为本发明较佳的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明披露的技术范围内，可轻易想到的变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应该以权利要求书的保护范围为准。The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed herein shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1.一种轻量级视频语义分割方法,其特征在于,包括:1. a lightweight video semantic segmentation method, is characterized in that, comprises: 若当前帧图像为非关键帧图像,利用光流估计网络估计前一帧图像与当前帧图像的光流,利用光流分别对前一帧图像及其对应的语义特征进行像素级位移,得到传播帧图像与传播特征;If the current frame image is a non-key frame image, the optical flow estimation network is used to estimate the optical flow of the previous frame image and the current frame image, and the optical flow is used to perform pixel-level displacement of the previous frame image and its corresponding semantic features respectively to obtain the propagation Frame image and propagation characteristics; 利用扭曲感知网络,对比所述传播帧图像与当前帧图像的特征差异,预测传播特征中的扭曲区域;Using the distortion perception network, compare the feature difference between the propagation frame image and the current frame image, and predict the distortion area in the propagation feature; 利用特征矫正网络,基于预测出的传播特征中的扭曲区域,从当前帧图像中提取矫正信息,对预测出的传播特征中的扭曲区域进行替换,获得矫正后的特征;Using the feature correction network to extract correction information from the current frame image based on the distorted regions in the predicted propagation features, and replace the distorted regions in the predicted propagation features to obtain corrected features; 通过语义分割网络对所述矫正后的特征进行语义分割。Semantic segmentation is performed on the rectified features through a semantic segmentation network. 2.根据权利要求1所述的一种轻量级视频语义分割方法,其特征在于,所述利用光流估计网络估计前一帧图像与当前帧图像的光流,利用光流分别对前一帧图像及其对应的语义特征进行像素级位移,得到传播帧图像与传播特征包括:2. A lightweight video semantic segmentation method according to claim 1, wherein the optical flow of the previous frame image and the current frame image is estimated by using an optical flow estimation network, and the optical flow of the previous frame image is estimated by using the optical flow. Pixel-level displacement is performed on the frame image and its corresponding semantic features, and the propagation frame image and propagation features include: 利用光流估计网络估计前一帧图像Ft与当前帧图像Ft+1的光流,用于表征当前帧图像Ft+1每一个像素点与前一帧图像Ft对应像素点之间的相对位移;其中,t与t+1表示两个相邻时刻,如果前一帧是关键帧,则语义特征
$f_t$ 通过语义分割网络从前一帧图像Ft中提取得到，如果前一帧是非关键帧，则是由之前最近的关键帧图像传播得到的；在光流的引导下，将前一帧图像Ft与其对应的语义特征 $f_t$ 分别进行像素级的特征位移，实现图像与语义特征的传播，得到传播帧图像 $\tilde{F}_{t+1}$ 和传播特征 $\tilde{f}_{t+1}$。The optical flow between the previous frame image Ft and the current frame image Ft+1 is estimated by the optical flow estimation network and characterizes, for each pixel of Ft+1, its relative displacement with respect to the corresponding pixel of Ft, where t and t+1 denote two adjacent moments. If the previous frame is a key frame, the semantic feature $f_t$ is extracted from Ft by the semantic segmentation network; if the previous frame is a non-key frame, $f_t$ is propagated from the most recent key frame image. Guided by the optical flow, pixel-level feature displacement is applied to the previous frame image Ft and its corresponding semantic feature $f_t$ respectively, propagating both the image and the semantic feature, and yielding the propagated frame image $\tilde{F}_{t+1}$ and the propagated feature $\tilde{f}_{t+1}$.
3.根据权利要求1所述的一种轻量级视频语义分割方法,其特征在于,所述利用扭曲感知网络,对比所述传播帧图像与当前帧图像的差异,预测传播特征中的扭曲区域包括:3. A kind of lightweight video semantic segmentation method according to claim 1, is characterized in that, described using distortion perception network, compare the difference between described propagation frame image and current frame image, predict the warped area in propagation characteristic include: 通过扭曲感知网络分别提取所述传播帧图像与当前帧图像的特征,并分别进行归一化后,计算两个特征的像素级余弦相似度,归一化后预测出扭曲图
$M_{t+1}$，其中包含了传播特征中的扭曲区域，扭曲图 $M_{t+1}$ 的预测方式表示为：

$\hat{f}_{t+1}(p) = \dfrac{f_{t+1}(p)}{\left\| f_{t+1}(p) \right\|_2},\qquad \hat{\tilde{f}}_{t+1}(p) = \dfrac{\tilde{f}_{t+1}(p)}{\left\| \tilde{f}_{t+1}(p) \right\|_2}$

$S_{t+1}(p) = \langle \hat{f}_{t+1}(p),\ \hat{\tilde{f}}_{t+1}(p) \rangle = \hat{f}_{t+1}(p)^{T}\, \hat{\tilde{f}}_{t+1}(p)$

$M_{t+1}(p) = \dfrac{1 - S_{t+1}(p)}{2}$

其中，$\hat{f}_{t+1}$ 表示归一化后的当前图像的特征，$\hat{\tilde{f}}_{t+1}$ 表示归一化后的传播帧图像的特征；T为转置符号；p表示单个像素，$S_{t+1}(p)$ 表示特征 $\hat{f}_{t+1}$ 与 $\hat{\tilde{f}}_{t+1}$ 中相同位置的单个像素p的余弦相似度，所有像素的余弦相似度 $S_{t+1}(p)$ 构成余弦相似度矩阵 $S_{t+1}$，$\langle\,\rangle$ 表示计算余弦相似度的符号；扭曲图的尺寸与帧图像尺寸相同，表征帧图像的扭曲程度，扭曲图中包含扭曲区域与正常区域，扭曲区域中像素的扭曲值大于正常区域像素的扭曲值。

The features of the propagated frame image and the current frame image are extracted by the distortion-aware network and normalized respectively; the pixel-wise cosine similarity of the two features is then computed and, after normalization, the distortion map $M_{t+1}$ containing the distorted regions of the propagated feature is predicted, as expressed by the formulas above. Here $\hat{f}_{t+1}$ denotes the normalized feature of the current image and $\hat{\tilde{f}}_{t+1}$ the normalized feature of the propagated frame image; T is the transpose symbol; p denotes a single pixel, and $S_{t+1}(p)$ is the cosine similarity of the single pixel p at the same position in $\hat{f}_{t+1}$ and $\hat{\tilde{f}}_{t+1}$, with the similarities of all pixels forming the cosine similarity matrix $S_{t+1}$; $\langle\,\rangle$ denotes the cosine similarity operation. The distortion map has the same size as the frame image and characterizes its degree of distortion; it contains distorted regions and normal regions, and the distortion value of pixels in distorted regions is greater than that of pixels in normal regions.
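The pixel-wise cosine-similarity distortion map of claim 3 can be sketched in a few lines of numpy. The function name is hypothetical, and the final $(1 - S)/2$ normalization into $[0, 1]$ is an assumption about how the similarity is turned into a distortion score:

```python
import numpy as np

def distortion_map(feat_cur, feat_prop, eps=1e-8):
    # Normalize both (H, W, C) feature maps per pixel, take the pixel-wise
    # cosine similarity S, and map it to a distortion score in [0, 1];
    # pixels where the features disagree receive high distortion values.
    fc = feat_cur / (np.linalg.norm(feat_cur, axis=-1, keepdims=True) + eps)
    fp = feat_prop / (np.linalg.norm(feat_prop, axis=-1, keepdims=True) + eps)
    s = (fc * fp).sum(axis=-1)        # cosine similarity per pixel p
    return (1.0 - s) / 2.0            # assumed normalization into [0, 1]
```

Identical features give similarity 1 and distortion 0; opposite features give similarity −1 and distortion 1, which matches the claim's statement that distorted-region pixels carry larger distortion values than normal-region pixels.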
4.根据权利要求1所述的一种轻量级视频语义分割方法,其特征在于,所述扭曲感知网络设有特征提取器,用于提取所述传播帧图像与当前帧图像的特征,特征提取器包括四个可分离的卷积层,每个卷积层搭配一个批归一化层和激活层;对所述扭曲感知网络进行有监督的训练;训练时,利用语义分割网络提取当前帧图像Ft+1的语义特征ft+1,并对当前帧图像Ft+1的语义特征ft+1以及传播特征
$\tilde{f}_{t+1}$ 分别进行语义分割，利用异或操作得到两类语义分割结果的差异图，将差异图作为扭曲感知网络训练的监督信号。During training, the semantic segmentation network extracts the semantic feature $f_{t+1}$ of the current frame image Ft+1; semantic segmentation is performed on $f_{t+1}$ and on the propagated feature $\tilde{f}_{t+1}$ respectively, an XOR operation yields the difference map of the two segmentation results, and this difference map serves as the supervision signal for training the distortion-aware network.
5.根据权利要求3所述的一种轻量级视频语义分割方法,其特征在于,所述基于预测出的传播特征中的扭曲区域,从当前帧图像中提取矫正信息,对预测出的传播特征中的扭曲区域进行替换,获得矫正后的特征包括:5. A lightweight video semantic segmentation method according to claim 3, wherein, based on the distorted region in the predicted propagation feature, correction information is extracted from the current frame image, and the predicted propagation The distorted areas in the feature are replaced, and the corrected features include: 将当前帧图像中提取的矫正信息记为
$c_{t+1}$，以预测出的包含传播特征中的扭曲区域的扭曲图 $M_{t+1}$ 为权重，与传播特征 $\tilde{f}_{t+1}$ 进行加权求和，获得矫正后的特征 $\bar{f}_{t+1}$，表示为：The correction information extracted from the current frame image is denoted $c_{t+1}$; taking the predicted distortion map $M_{t+1}$, which contains the distorted regions of the propagated feature, as the weight, it is summed with the propagated feature $\tilde{f}_{t+1}$ in a weighted manner to obtain the corrected feature $\bar{f}_{t+1}$, expressed as:

$\bar{f}_{t+1} = M_{t+1} \odot c_{t+1} + (1 - M_{t+1}) \odot \tilde{f}_{t+1}$
其中,⊙表示逐像素相乘。where ⊙ represents pixel-by-pixel multiplication.
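Claim 5's weighted sum is straightforward to sketch in numpy; the function name is hypothetical, and the distortion map is assumed to be a single-channel map broadcast over the feature channels:

```python
import numpy as np

def correct_feature(prop_feat, correction, dmap):
    # Where the distortion map is high, the propagated feature is replaced
    # by the correction information extracted from the current frame; where
    # it is low, the propagated feature is kept. The pixel-wise product ⊙
    # broadcasts the (H, W) map over the (H, W, C) feature channels.
    m = dmap[..., None]
    return m * correction + (1.0 - m) * prop_feat
```

At the extremes, a map of all ones returns the correction information and a map of all zeros returns the propagated feature unchanged, so only the predicted distorted regions are actually replaced.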
6.根据权利要求3所述的一种轻量级视频语义分割方法,其特征在于,所述特征矫正网络中设有CFNet网络,用于从当前帧图像中提取矫正信息
$c_{t+1}$；The feature correction network is provided with a CFNet network for extracting the correction information $c_{t+1}$ from the current frame image;

所述CFNet网络使用编码器-解码器结构，其中，编码器由十个卷积层组成，解码器由四个反卷积层组成，每个卷积层与反卷积层均搭配一个批归一化层与激活层；The CFNet network uses an encoder-decoder structure, in which the encoder consists of ten convolutional layers and the decoder consists of four deconvolutional layers, each convolutional and deconvolutional layer being paired with a batch normalization layer and an activation layer;

使用扭曲图 $M_{t+1}$ 训练所述CFNet网络，损失函数表示为：The CFNet network is trained using the distortion map $M_{t+1}$, with the loss function expressed as:

$L_{CF} = -\dfrac{1}{|I|} \sum_{p \in I} M_{t+1}(p)\, \log P_{t+1}(p)$

其中，$P_{t+1}$ 是由矫正信息 $c_{t+1}$ 预测得到的概率；p表示单个像素，I表示所有像素点的集合；扭曲区域中像素的扭曲值大于正常区域像素的扭曲值。where $P_{t+1}$ is the probability predicted from the correction information $c_{t+1}$; p denotes a single pixel and I denotes the set of all pixels; the distortion value of pixels in distorted regions is greater than that of pixels in normal regions.
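A distortion-weighted loss of the kind described in claim 6 can be sketched as follows. Both the function name and the exact weighting form (distortion map times negative log-probability, averaged over all pixels) are assumptions reconstructed from the claim text, not the patent's verbatim formula:

```python
import numpy as np

def cfnet_loss(dmap, prob, eps=1e-8):
    # Distortion-weighted negative log-likelihood: dmap (H, W) holds the
    # distortion value of each pixel, prob (H, W) the probability predicted
    # from the correction information. Pixels in distorted regions (large
    # dmap values) dominate the loss, steering CFNet to produce correction
    # information that is reliable exactly where it will be used.
    return -(dmap * np.log(prob + eps)).sum() / dmap.size
```

Confident correct predictions (probability near 1) drive the loss toward zero, while low probabilities in heavily distorted regions are penalized most.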
7.根据权利要求1或4或6所述的一种轻量级视频语义分割方法,其特征在于,训练阶段,针对光流估计网络、扭曲感知网络、CFNet网络以及语义分割网络进行训练;训练流程描述为:7. a kind of lightweight video semantic segmentation method according to claim 1 or 4 or 6, is characterized in that, training phase, for optical flow estimation network, distortion perception network, CFNet network and semantic segmentation network are trained; The process is described as: 语义分割网络与光流估计网络各自进行预训练,使用选定数据集对预训练的语义分割网络进行微调训练;扭曲感知网络使用生成的扭曲区域标注进行训练;The semantic segmentation network and the optical flow estimation network are pre-trained separately, and the pre-trained semantic segmentation network is fine-tuned using the selected dataset; the distortion-aware network is trained using the generated distortion region annotations; 之后,固定语义分割网络与扭曲感知网络,将预训练的光流估计网络与随机初始化参数的CFNet网络共同训练,共同训练时采用双深度监督策略;其中,CFNet网络为特征矫正网络的一部分,用于从当前帧图像中提取矫正信息;After that, the semantic segmentation network and the distortion perception network are fixed, and the pre-trained optical flow estimation network and the CFNet network with random initialization parameters are jointly trained, and the dual-depth supervision strategy is adopted during the joint training; among them, the CFNet network is a part of the feature correction network. for extracting correction information from the current frame image; 所述双深度监督策略包括:在每个训练样本的关键帧图像和非关键帧图像间加入了一个中间帧,利用中间帧的伪标签作为额外的监督信号。The double-depth supervision strategy includes: adding an intermediate frame between the key frame image and the non-key frame image of each training sample, and using the pseudo-label of the intermediate frame as an additional supervision signal. 8.一种轻量级视频语义分割系统,其特征在于,基于权利要求1~7任一项所述的方法实现,该系统包括:8. 
A lightweight video semantic segmentation system, characterized in that, implemented based on the method according to any one of claims 1 to 7, the system comprising: 特征与图像联合传播模块,用于在当前帧图像为非关键帧图像时,利用光流估计网络估计前一帧图像与当前帧图像的光流,利用光流分别对前一帧图像及其对应的语义特征进行像素级位移,得到传播帧图像与传播特征;The feature and image joint propagation module is used to estimate the optical flow of the previous frame image and the current frame image by using the optical flow estimation network when the current frame image is a non-key frame image, and use the optical flow to respectively analyze the previous frame image and its corresponding image. Pixel-level displacement is performed on the semantic features of , and the propagation frame images and propagation features are obtained; 扭曲感知网络,用于对比所述传播帧图像与当前帧图像的特征差异,预测传播特征中的扭曲区域;Distortion perception network, used to compare the feature difference between the propagation frame image and the current frame image, and predict the warped area in the propagation feature; 特征矫正网络,用于基于预测出的传播特征中的扭曲区域,从当前帧图像中提取矫正信息,对预测出的传播特征中的扭曲区域进行替换,获得矫正后的特征;The feature correction network is used to extract correction information from the current frame image based on the distorted region in the predicted propagation feature, replace the distorted region in the predicted propagation feature, and obtain the corrected feature; 语义分割网络,用于对所述矫正后的特征进行语义分割。A semantic segmentation network for performing semantic segmentation on the rectified features. 9.一种处理设备,其特征在于,包括:一个或多个处理器;存储器,用于存储一个或多个程序;9. A processing device, comprising: one or more processors; a memory for storing one or more programs; 其中,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器实现如权利要求1~7任一项所述的方法。Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-7. 10.一种可读存储介质,存储有计算机程序,其特征在于,当计算机程序被处理器执行时实现如权利要求1~7任一项所述的方法。10. 
A readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.
CN202210068739.0A 2022-01-20 2022-01-20 Lightweight video semantic segmentation method, system, device and storage medium Pending CN114429603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210068739.0A CN114429603A (en) 2022-01-20 2022-01-20 Lightweight video semantic segmentation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210068739.0A CN114429603A (en) 2022-01-20 2022-01-20 Lightweight video semantic segmentation method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN114429603A true CN114429603A (en) 2022-05-03

Family

ID=81313233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210068739.0A Pending CN114429603A (en) 2022-01-20 2022-01-20 Lightweight video semantic segmentation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114429603A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378348A (en) * 2019-07-11 2019-10-25 北京悉见科技有限公司 Instance of video dividing method, equipment and computer readable storage medium
EP3608844A1 (en) * 2018-08-10 2020-02-12 Naver Corporation Methods for training a crnn and for semantic segmentation of an inputted video using said crnn
CN111310594A (en) * 2020-01-20 2020-06-19 浙江大学 Video semantic segmentation method based on residual error correction
CN111476781A (en) * 2020-04-08 2020-07-31 浙江大学 A method and device for identifying concrete cracks based on video semantic segmentation technology
CN113838014A (en) * 2021-09-15 2021-12-24 南京工业大学 Aircraft engine damage video detection method based on double spatial distortion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAFAN ZHUANG等: "Video Semantic Segmentation With Distortion-Aware Feature Correction", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》, 31 August 2021 (2021-08-31), pages 3128 - 3135 *
韩利丽;孟朝晖;: "基于深度学习的视频语义分割综述", 计算机系统应用, no. 12, 15 December 2019 (2019-12-15) *

Similar Documents

Publication Publication Date Title
CN110910391B (en) A Double-Module Neural Network Structure Video Object Segmentation Method
CN110287819A (en) Moving target detection method based on low-rank and sparse decomposition in dynamic background
Zhang et al. Image composition assessment with saliency-augmented multi-pattern pooling
CN112232397B (en) Knowledge distillation method, device and computer equipment for image classification model
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
Wu et al. MENet: Lightweight multimodality enhancement network for detecting salient objects in RGB-thermal images
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN115273154B (en) Thermal infrared pedestrian detection method, system and storage medium based on edge reconstruction
CN114494973A (en) Training method, system, equipment and storage medium of video semantic segmentation network
CN118942162B (en) Dual-path multi-label spatiotemporal action detection method based on foreground semantic enhancement
CN110827265A (en) Image anomaly detection method based on deep learning
Zhou et al. Hybrid knowledge distillation for RGB-T crowd density estimation in smart surveillance systems
CN110866938A (en) Full-automatic video moving object segmentation method
CN111931603A (en) Human body action recognition system and method based on double-current convolution network of competitive combination network
CN110458115A (en) A timing-based multi-frame integrated target detection algorithm
CN114332122A (en) Cell counting method based on attention mechanism segmentation and regression
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN119516160A (en) Single-point supervised infrared small target detection method based on hybrid pseudo-label generation
CN116994006B (en) Collaborative saliency detection method and system for fusing image saliency information
CN112784745B (en) Video salient object detection method based on confidence adaptive and differential enhancement
CN117237842A (en) Pseudo tag generated video significance detection method based on time sequence features
Xiang et al. InvFlow: Involution and multi-scale interaction for unsupervised learning of optical flow
CN114444597A (en) Visual tracking method and device based on progressive fusion network
CN114429603A (en) Lightweight video semantic segmentation method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220503

WD01 Invention patent application deemed withdrawn after publication