Disclosure of Invention
To address the above problems, the invention provides a video saliency target detection system and method based on a spatio-temporal convolutional neural network, which improve the accuracy of saliency target prediction while maintaining efficiency, without splitting temporal and spatial features.
A video saliency target detection system based on a spatio-temporal convolutional neural network comprises: a spatial feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module, and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
The low-level semantic information link module is used for extracting low-level spatial features and removing redundant background information from them;
The feature fusion and up-sampling module is used for fusing the low-level spatial features with the space-time consistency features and expanding the feature map to the size of the input video frames;
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual module and a dilated convolution pyramid pooling module;
The residual module is used for modeling the spatial features;
The dilated convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The space-time consistency feature enhancement module comprises: a bidirectional ConvLSTM module, an attention module I, an attention module II and a splicing module;
The bidirectional ConvLSTM module is used for modeling the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
The attention module I is used for weighting the feature points in the feature map obtained by the forward unit of the bidirectional ConvLSTM module;
The attention module II is used for weighting the feature points in the feature map obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the feature map obtained by the forward unit of the bidirectional ConvLSTM module with the feature map obtained by the backward unit, and obtaining the video feature frames with space-time consistency through a tanh activation function.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting the low-level spatial features output by the fourth convolution layer of the residual module;
The link module II is used for extracting the low-level spatial features output by the third convolution layer of the residual module;
The link module III is used for extracting the low-level spatial features output by the second convolution layer of the residual module.
A video saliency target detection method based on a spatio-temporal convolutional neural network is implemented with the above video saliency target detection system and comprises the following steps:
Step 1: collecting a video containing T frames of images and extracting the spatial features of the video frames;
Step 2: extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
Step 3: extracting low-level spatial features using depthwise separable convolution operations;
Step 4: performing feature fusion and up-sampling on the low-level spatial features and the space-time consistency features to obtain high-level feature maps for the T video frames;
Step 5: decoding the high-level feature maps to obtain the saliency target mask corresponding to each image in the video sequence.
The step 1 is specifically expressed as follows: a pre-trained residual module is adopted to model the spatial features; the residual module uses the first five layer groups of the ResNet-50 network with the down-sampling operation of the fifth group removed; the features output by the residual module are then input into the dilated convolution pyramid pooling module to extract multi-scale spatial features and obtain a spatial feature map.
The step 2 comprises the following steps:
Step 2.1: the forward unit of the bidirectional ConvLSTM module is adopted to model the space-time correlation between the current frame and the forward frame on the spatial features output by the spatial feature extraction module, so as to obtain the output of the forward unit;
Step 2.2: the output of the forward unit is sent to the attention module I, and the feature points in the feature map obtained by the forward unit are weighted to obtain a feature map G1 containing the contrast between non-salient and salient targets;
Step 2.3: the feature map weighted by the attention module I is input into the backward unit to model the space-time correlation between the current frame and the backward frame;
Step 2.4: the output of the backward unit is sent to the attention module II, and the feature points in the feature map obtained by the backward unit are weighted to obtain a feature map G2 containing the contrast between non-salient and salient targets;
Step 2.5: the feature maps G1 and G2 are spliced and input into a convolution layer with a 3×3 kernel for feature extraction, and the video feature frames with space-time consistency are obtained through a tanh activation function.
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed based on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input into the first attention distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_i,u = q_u · k_i,u   (1)
A = softmax(D)   (2)
where q_u represents a one-dimensional tensor (a single position) in Q; k_i,u denotes the feature points in K that share the same abscissa or ordinate as q_u; d_i,u represents the relationship between the feature points in each channel of Q and the feature points in K; and softmax represents the activation function.
The tensors A and V are then input into the second attention distribution calculation layer; the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is added to the original feature map as a weight distribution, so as to obtain a high-level feature map containing the contrast between salient and non-salient objects.
The step 3 comprises the following steps:
Step 3.1: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are sent in parallel to the second layer of link modules I, II and III, and the detail features of the salient targets are extracted by a depthwise separable convolution with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are sent in parallel to the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel is adopted to adjust the channels of the feature maps obtained in the second layer.
The step 4 is specifically expressed as follows: the splicing, fusion and up-sampling of the features in the feature fusion and up-sampling module are realized in a recursive form: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the resulting features are fused with the low-level spatial features output by link module III to obtain feature frames of the same size as the input pictures.
The step 5 is specifically expressed as follows: the feature frames obtained by the feature fusion and up-sampling module are raised in dimension through a convolution layer with a 3×3 kernel, pixel-level classification is then performed through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to each video frame.
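To make the data flow of steps 1 to 5 concrete, a minimal PyTorch-style sketch of the overall forward pass is given below. It is an illustrative composition only: the class and argument names (VideoSODNet, spatial_extractor, st_enhancer and so on) are hypothetical placeholders, not the exact implementation of the invention.

```python
# Hypothetical end-to-end forward pass for steps 1 to 5; all module classes and
# argument names are illustrative placeholders, not the invention's exact code.
import torch.nn as nn

class VideoSODNet(nn.Module):
    def __init__(self, spatial_extractor, st_enhancer, link_modules, fusion, decoder):
        super().__init__()
        self.spatial_extractor = spatial_extractor       # step 1: ResNet-50 (first 5 groups) + ASPP
        self.st_enhancer = st_enhancer                   # step 2: bidirectional ConvLSTM + CCA
        self.link_modules = nn.ModuleList(link_modules)  # step 3: link modules I, II, III
        self.fusion = fusion                             # step 4: recursive fusion + up-sampling
        self.decoder = decoder                           # step 5: 3x3 conv, 1x1 conv, sigmoid

    def forward(self, clip):                             # clip: (T, 3, H, W) video frames
        f_aspp, low_feats = self.spatial_extractor(clip)          # spatial + layer 2/3/4 features
        f_cl = self.st_enhancer(f_aspp)                           # space-time consistency features
        f_links = [link(x) for link, x in zip(self.link_modules, low_feats)]
        fused = self.fusion(f_cl, f_links, out_size=clip.shape[-2:])
        return self.decoder(fused)                                # saliency target masks (T, 1, H, W)
```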
The beneficial effects of the invention are as follows:
The invention provides a video saliency target detection system and method based on a spatio-temporal convolutional neural network, namely a video-oriented saliency target detection method with higher efficiency and better accuracy. A lightweight backbone network and a dilated convolution pyramid pooling module are adopted to extract the spatial features of the salient target; a recurrent neural network embedded with a criss-cross self-attention mechanism is then adopted to extract space-time consistency features, and the feature values of the feature maps are weighted at the same time to improve the contrast between salient and non-salient features, which avoids, to a certain extent, the interference of background information with foreground information and improves the prediction accuracy of the method; meanwhile, a low-level semantic information link module is adopted to fuse the low-level spatial features with the space-time consistency features, so that the loss of low-level spatial features is reduced as much as possible and the prediction at object edges is more accurate. Compared with traditional video salient object detection methods, the proposed method balances speed and accuracy and is better suited to practical video salient object detection.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples of specific embodiments.
As shown in fig. 1, a video saliency target detection system based on a spatio-temporal convolutional neural network comprises: a spatial feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module, and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
The low-level semantic information link module is used for extracting low-level spatial features and removing redundant background information from them;
The feature fusion and up-sampling module is used for fusing the low-level spatial features with the space-time consistency features and expanding the feature map to the size of the input video frames. The structure of the feature fusion and up-sampling module is shown in fig. 4: the low-level spatial features output by link modules I, II and III and the space-time consistency features output by the space-time consistency feature enhancement module are fused and up-sampled sequentially in a recursive manner, where Conv represents a convolution operation, F_l2, F_l3 and F_l4 respectively represent the low-level spatial features derived from the outputs of the 2nd, 3rd and 4th convolution layers of the residual module, S_i (i = 1, ..., T) represents the resulting saliency target mask, and F_CL represents the obtained space-time consistency features.
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual module and a dilated convolution pyramid pooling module;
The residual module is used for modeling the spatial features;
The dilated convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The structure of the spatial feature extraction module is shown in fig. 2: a video containing T frames of images is input, and the video frames pass through the residual module and the dilated convolution pyramid pooling module in turn to extract spatial features, where I_i (i = 1, ..., T) represents the input video frames, ASPP represents the dilated convolution pyramid pooling module, and F_aspp represents the obtained spatial features.
The space-time consistency feature enhancement module comprises: a bidirectional ConvLSTM module, an attention module I, an attention module II and a splicing module;
The bidirectional ConvLSTM module is used for modeling the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
The attention module I is used for weighting the feature points in the feature map obtained by the forward unit of the bidirectional ConvLSTM module;
The attention module II is used for weighting the feature points in the feature map obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the feature map obtained by the forward unit of the bidirectional ConvLSTM module with the feature map obtained by the backward unit, and obtaining the video feature frames with space-time consistency through a tanh activation function.
The structure of the space-time consistency feature enhancement module is shown in fig. 3: the spatial features output by the spatial feature extraction module are sent in turn to the forward unit of the bidirectional ConvLSTM module, the attention module I, the backward unit of the bidirectional ConvLSTM module and the attention module II to extract space-time consistency features, and the feature points in the resulting feature maps are weighted, where CCA_i (i = 1, 2, ..., T) represents a criss-cross attention module, ConvLSTM denotes a convolutional long short-term memory network, F_aspp denotes the spatial features obtained by the spatial feature extraction module, and F_CL denotes the obtained space-time consistency features.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting the low-level spatial features output by the fourth convolution layer of the residual module;
The link module II is used for extracting the low-level spatial features output by the third convolution layer of the residual module;
The link module III is used for extracting the low-level spatial features output by the second convolution layer of the residual module.
A video saliency target detection method based on a spatio-temporal convolutional neural network is implemented with the above video saliency target detection system and comprises the following steps:
Step 1: the video containing T frames of images is sent to the spatial feature extraction module, and the spatial features of the video frames are extracted from coarse to fine. The spatial feature extraction module comprises a residual module and a dilated convolution pyramid pooling module. First, a pre-trained residual module is adopted to perform a preliminary modeling of the spatial features; the residual module uses the first five layer groups of the ResNet-50 network, with the down-sampling operation of the fifth group removed. The features output by the residual module are then input into the dilated convolution pyramid pooling module to extract multi-scale spatial features and obtain a spatial feature map. Let I = {I_1, I_2, ..., I_T} represent the video containing T frames of images; the above modeling process is as follows:
F_aspp^t = M_aspp(M_res(I_t)), t = 1, ..., T
where M_res represents the backbone network, M_aspp represents the dilated convolution pyramid pooling module, and F_aspp^t represents the resulting spatial features.
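A minimal sketch of this step is given below. It assumes a recent torchvision build that provides resnet50 with the replace_stride_with_dilation option and the ASPP block from its DeepLabV3 implementation; the atrous rates, channel sizes and the mapping of the patent's layer numbering onto torchvision's layer1-layer4 are illustrative assumptions rather than the exact configuration of the invention.

```python
# Sketch of step 1 (M_res followed by M_aspp); hyper-parameters are illustrative.
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.segmentation.deeplabv3 import ASPP

class SpatialFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # First five layer groups of ResNet-50; dilation replaces the stride of the
        # fifth group (layer4), i.e. its down-sampling is removed.
        backbone = resnet50(weights="IMAGENET1K_V1",
                            replace_stride_with_dilation=[False, False, True])
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.aspp = ASPP(in_channels=2048, atrous_rates=[6, 12, 18], out_channels=256)

    def forward(self, frames):                  # frames: (T, 3, H, W)
        x2 = self.layer1(self.stem(frames))     # assumed "2nd convolution layer" output
        x3 = self.layer2(x2)                    # assumed "3rd convolution layer" output
        x4 = self.layer3(x3)                    # assumed "4th convolution layer" output
        x5 = self.layer4(x4)                    # fifth group, stride removed via dilation
        f_aspp = self.aspp(x5)                  # multi-scale spatial features F_aspp
        return f_aspp, (x4, x3, x2)             # fed to link modules I, II and III
```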
Step 2: the spatial features extracted in step 1 are sent to the space-time consistency feature enhancement module to further learn deeper space-time consistency features, and the feature values in the feature maps are weighted to improve the contrast between salient and non-salient features; this comprises the following steps:
Step 2.1: the forward unit of the bidirectional ConvLSTM module is adopted to model the space-time correlation between the current frame and the forward frame on the spatial features output by the spatial feature extraction module;
The modeling process of the forward unit is shown in the formula:
H_t^f = ConvLSTM(F_aspp^t, H_(t-1)^f)
where t represents the current frame, H_t^f represents the output of the forward unit, H_(t-1)^f represents the features of the frame preceding the current frame, and F_aspp^t represents the spatial features obtained in step 1.
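ConvLSTM is not part of core PyTorch, so a minimal cell sketch is given below; it follows the standard convolutional LSTM gate formulation, and the kernel size and channel counts are illustrative.

```python
# Minimal ConvLSTM cell sketch (ConvLSTM is not part of core PyTorch).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                       # previous hidden and cell states
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                      # new cell state
        h = o * torch.tanh(c)                              # new hidden state H_t
        return h, c

# The forward unit applies the cell frame by frame, e.g.:
#   for t in range(T): h, c = cell(f_aspp[t:t + 1], (h, c))
```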
Step 2.2: the output of the forward unit is sent to the attention module I, and the feature points in the feature map obtained by the forward unit are weighted to obtain a feature map G1 containing the contrast between non-salient and salient targets;
The weighting process is as follows:
G1_t = M_CCA(H_t^f)
where G1_t represents the feature map containing the contrast between non-salient and salient objects, and M_CCA represents the criss-cross attention module I.
Step 2.3: the feature map weighted by the attention module I is input into the backward unit to model the space-time correlation between the current frame and the backward frame;
The modeling process of the backward unit is shown in the formula:
H_t^b = ConvLSTM(G1_t, H_(t+1)^b)
where t represents the current frame, H_t^b represents the output of the backward unit, and H_(t+1)^b represents the features of the frame following the current frame.
Step 2.4: the output of the backward unit is sent to the attention module II, and the feature points in the feature map obtained by the backward unit are weighted to obtain a feature map G2 containing the contrast between non-salient and salient targets;
The weighting process is as follows:
G2_t = M_CCA(H_t^b)
where G2_t represents the feature map containing the contrast between non-salient and salient objects, and M_CCA represents the criss-cross attention module II.
Step 2.5: the feature maps G1 and G2 are spliced and input into a convolution layer with a 3×3 kernel for feature extraction, and the video feature frames with space-time consistency are obtained through a tanh activation function;
The modeling process is shown in the formula:
F_CL^t = tanh(Conv([G1_t, G2_t]))
where [·, ·] represents the splicing (concatenation) operation, Conv represents the 3×3 convolution, and F_CL^t represents the space-time consistency features obtained by the space-time consistency feature enhancement module.
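Putting steps 2.1 to 2.5 together, the sketch below shows one possible arrangement of the forward unit, attention module I, backward unit, attention module II and splicing module. It reuses the ConvLSTMCell sketch above and a CrissCrossAttention module sketched after the CCA formulas below; shapes, channel sizes and state initialization are illustrative assumptions.

```python
# Possible arrangement of steps 2.1-2.5 (illustrative, not the invention's exact code).
import torch
import torch.nn as nn

class STConsistencyEnhancer(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.fwd_cell = ConvLSTMCell(ch, ch)        # forward unit
        self.bwd_cell = ConvLSTMCell(ch, ch)        # backward unit
        self.cca_fwd = CrissCrossAttention(ch)      # attention module I
        self.cca_bwd = CrissCrossAttention(ch)      # attention module II
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, f_aspp):                      # f_aspp: (T, C, H, W)
        T, C, H, W = f_aspp.shape
        h = c = f_aspp.new_zeros(1, C, H, W)
        g1 = []
        for t in range(T):                          # steps 2.1-2.2
            h, c = self.fwd_cell(f_aspp[t:t + 1], (h, c))
            g1.append(self.cca_fwd(h))              # G1_t
        h = c = f_aspp.new_zeros(1, C, H, W)
        g2 = [None] * T
        for t in reversed(range(T)):                # steps 2.3-2.4
            h, c = self.bwd_cell(g1[t], (h, c))
            g2[t] = self.cca_bwd(h)                 # G2_t
        g1, g2 = torch.cat(g1), torch.cat(g2)
        return torch.tanh(self.fuse(torch.cat([g1, g2], dim=1)))   # F_CL, step 2.5
```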
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed based on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input into the first attention distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_i,u = q_u · k_i,u   (1)
A = softmax(D)   (2)
where q_u represents a one-dimensional tensor (a single position) in Q; k_i,u denotes the feature points in K that share the same abscissa or ordinate as q_u; d_i,u represents the relationship between the feature points in each channel of Q and the feature points in K; and softmax represents the activation function.
The tensors A and V are then input into the second attention distribution calculation layer; the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is added to the original feature map as a weight distribution, so as to obtain a high-level feature map containing the contrast between salient and non-salient objects;
The modeling process is shown in the formula:
f_out = a_u · v_i,u + f_input
where a_u represents a one-dimensional tensor (a single position) in A; v_i,u denotes the points in V that share the same abscissa or ordinate as a_u; f_out ∈ F_out, and F_out denotes the space-time consistency features containing the contrast between non-salient and salient object features.
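A sketch of such a criss-cross attention module is given below. It is a simplified variant that computes the column (same abscissa) and row (same ordinate) attention separately and adds the two aggregated results back to the input; the original CCNet formulation normalizes the full criss-cross path jointly, and the channel-reduction factor here is an illustrative assumption.

```python
# Simplified criss-cross attention sketch (not the exact CCNet implementation).
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // reduction, 1)   # three parallel 1x1 convolutions
        self.k = nn.Conv2d(ch, ch // reduction, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Column attention: each column is treated as a sequence of length H.
        q_c = q.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        k_c = k.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        v_c = v.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        a_c = torch.softmax(q_c @ k_c.transpose(1, 2), dim=-1)      # eqs. (1)-(2)
        out_c = (a_c @ v_c).reshape(B, W, H, C).permute(0, 3, 2, 1)
        # Row attention: each row is treated as a sequence of length W.
        q_r = q.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        k_r = k.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        v_r = v.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        a_r = torch.softmax(q_r @ k_r.transpose(1, 2), dim=-1)
        out_r = (a_r @ v_r).reshape(B, H, W, C).permute(0, 3, 1, 2)
        return out_c + out_r + x                     # attention-weighted V added to f_input
```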
Step 3: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the spatial feature extraction module are input into the low-level semantic information link module composed of Ghost modules to further extract low-level spatial features and remove their redundant background information; this comprises the following steps:
Step 3.1: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are sent in parallel to the second layer of link modules I, II and III, and the detail features of the salient targets are extracted by a depthwise separable convolution with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are sent in parallel to the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel is adopted to adjust the channels of the feature maps obtained in the second layer.
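A sketch of one link module following steps 3.1 to 3.3 is given below (a Ghost-style block: an ordinary convolution with normalization, a cheap 3×3 depthwise-separable convolution with normalization and ReLU, and a 1×1 channel-adjustment convolution); the kernel size of the first layer and the channel sizes are illustrative assumptions.

```python
# Sketch of one low-level semantic information link module (steps 3.1-3.3).
import torch.nn as nn

class LinkModule(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        # Layer 1: convolution followed by normalization (step 3.1).
        self.layer1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch))
        # Layer 2: 3x3 depthwise-separable convolution, each conv followed by BN, then ReLU (step 3.2).
        self.layer2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch),  # depthwise
            nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 1),                            # pointwise
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Layer 3: 1x1 convolution for channel adjustment (step 3.3).
        self.layer3 = nn.Conv2d(mid_ch, out_ch, 1)

    def forward(self, x):
        return self.layer3(self.layer2(self.layer1(x)))
```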
Step 4: the low-level semantic features output by the low-level semantic information link module in step 3 and the space-time consistency features output by the space-time consistency feature enhancement module in step 2 are input into the feature fusion and up-sampling module for fusion, and the feature map is expanded to the size of the input video frames. Specifically, the splicing, fusion and up-sampling of the features in the feature fusion and up-sampling module are realized in a recursive form: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the resulting features are fused with the low-level spatial features output by link module III to obtain feature frames of the same size as the input pictures.
The modeling process is shown in the formulas:
F_t^4 = up(Conv([F_CL^t, F_l4]))
F_t^3 = up(Conv([F_t^4, F_l3]))
F_t = up(Conv([F_t^3, F_l2]))
where F_CL^t represents the space-time consistency features obtained in step 2; F_l4, F_l3 and F_l2 represent the low-level semantic features obtained by passing the outputs of the 4th, 3rd and 2nd convolution layers of the spatial feature extraction module through the low-level semantic information link module, respectively; F_t represents the finally obtained feature frame; Conv represents a convolution operation; up represents an up-sampling operation; and [·, ·] represents splicing.
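A sketch of the recursive fusion and up-sampling is given below, assuming that F_CL and the three link-module outputs share a common channel count; the bilinear interpolation mode and the choice of matching each intermediate result to the spatial size of the next link-module feature are illustrative assumptions.

```python
# Sketch of the recursive feature fusion and up-sampling of step 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUpsample(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv4 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse F_CL with link module I output
        self.conv3 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse with link module II output
        self.conv2 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse with link module III output

    def forward(self, f_cl, f_links, out_size):
        f_l4, f_l3, f_l2 = f_links                         # outputs of link modules I, II, III
        x = F.interpolate(self.conv4(torch.cat([f_cl, f_l4], dim=1)),
                          size=f_l3.shape[-2:], mode="bilinear", align_corners=False)
        x = F.interpolate(self.conv3(torch.cat([x, f_l3], dim=1)),
                          size=f_l2.shape[-2:], mode="bilinear", align_corners=False)
        x = F.interpolate(self.conv2(torch.cat([x, f_l2], dim=1)),
                          size=out_size, mode="bilinear", align_corners=False)
        return x                                           # F_t at the input-frame resolution
```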
Step 5: the feature maps obtained in step 4 are sent to the decoder for decoding to obtain the salient target mask corresponding to each image in the video sequence. Specifically, the feature frame obtained by the feature fusion and up-sampling module is raised in dimension through a convolution layer with a 3×3 kernel, pixel-level classification is then performed through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to the video frame.
The modeling process is shown in the formula:
S_t = δ_sigmoid(Conv(F_t))
where δ_sigmoid represents the sigmoid activation function and S_t represents the saliency target mask of the corresponding video frame.
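A sketch of the decoder is given below; the intermediate channel count used for the dimension-raising 3×3 convolution is an illustrative assumption.

```python
# Decoder sketch for step 5: 3x3 convolution, 1x1 convolution, sigmoid normalization.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_ch=256, mid_ch=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),  # dimension-raising 3x3 convolution
            nn.Conv2d(mid_ch, 1, 1),                 # 1x1 convolution, pixel-level classification
            nn.Sigmoid(),                            # delta_sigmoid normalization
        )

    def forward(self, f_t):                          # f_t: (T, C, H, W)
        return self.head(f_t)                        # S_t: saliency target masks, (T, 1, H, W)
```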
The invention is based on an encoder-decoder structure: spatial feature extraction and space-time consistency feature extraction are performed at the encoding end, and feature fusion and saliency target prediction are performed at the decoding end, improving the accuracy of saliency target prediction while maintaining efficiency. In the space-time consistency feature enhancement module, a ConvLSTM structure embedded with a two-layer criss-cross attention mechanism (DCA_ConvLSTM for short) is designed based on the self-attention mechanism; this module introduces global information in both the spatial and temporal dimensions and weights the feature values corresponding to the saliency map to improve the contrast between salient and non-salient features, which avoids, to a certain extent, the interference of background information with foreground information, improves the prediction accuracy of the method, and simultaneously yields the space-time consistency features.
Furthermore, the invention provides a low-level semantic information link module composed of Ghost modules. As the depth of the neural network changes, the semantics expressed by the features extracted at different layers are not identical. A deep convolution layer can cover a larger receptive field with a smaller convolution kernel and therefore extracts higher-level semantic features, while a shallow convolution layer has a smaller receptive field and its features reflect local detail information of the image, such as contour information. Video saliency target detection is a pixel-level prediction task, and some object edges cannot be predicted accurately if detail information is missing. Therefore, the invention provides the low-level semantic information link module composed of Ghost modules, so that the loss of low-level semantic features is reduced as much as possible and the prediction at object edges is more accurate.
To demonstrate the effectiveness of the video salient object detection method of the present invention, the proposed method and 7 other advanced salient object detection methods were tested on the DAVIS, VOS and FBMS datasets; the visualization results are shown in fig. 5. The 7 methods are RFCN, DSS, PiCA, SSA, FCNS, FGRN and PDB; GT is the ground-truth label. In fig. 5, the first row shows the original video frames, the last row (GT) shows the ground-truth labels, the second-to-last row shows the detection results of the video saliency target detection method proposed by the present invention, and the remaining rows show the prediction results of saliency target detection methods previously proposed in the field. As can be seen from the comparison results in fig. 5, the proposed method is more accurate in locating the contours of salient objects and predicting their details. In addition, for scenes with multiple salient targets and complex backgrounds, the proposed method also achieves a good detection effect.