Disclosure of Invention
To address the above problems, the invention provides a video saliency target detection system and method based on a spatio-temporal convolutional neural network, which improve the accuracy of saliency target prediction while maintaining efficiency, without splitting temporal and spatial features.
A video saliency target detection system based on a spatio-temporal convolutional neural network comprises: a spatial feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module, and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
The low-level semantic information link module is used for extracting low-level spatial features and removing redundant background information from them;
The feature fusion and up-sampling module is used for fusing the low-level spatial features with the space-time consistency features and expanding the feature map to the size of the input video frames;
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual module and a dilated convolution pyramid pooling module;
The residual module is used for modeling the spatial features;
The dilated convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The space-time consistency feature enhancement module comprises: a bidirectional ConvLSTM module, an attention module I, an attention module II and a splicing module;
The bidirectional ConvLSTM module is used for modeling the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
The attention module I is used for weighting the feature points in the feature map obtained by the forward unit of the bidirectional ConvLSTM module;
The attention module II is used for weighting the feature points in the feature map obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the feature map obtained by the forward unit of the bidirectional ConvLSTM module with the feature map obtained by the backward unit, and obtaining the video feature frames with space-time consistency through a tanh activation function.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting the low-level spatial features output by the fourth convolution layer of the residual module;
The link module II is used for extracting the low-level spatial features output by the third convolution layer of the residual module;
The link module III is used for extracting the low-level spatial features output by the second convolution layer of the residual module.
A video saliency target detection method based on a spatio-temporal convolutional neural network is implemented with the above video saliency target detection system and comprises the following steps:
Step 1: collecting a video containing T frames of images and extracting the spatial features of the video frames;
Step 2: extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
Step 3: extracting low-level spatial features using depthwise separable convolution operations;
Step 4: performing feature fusion and up-sampling on the low-level spatial features and the space-time consistency features to obtain high-level feature maps for the T video frames;
Step 5: decoding the high-level feature maps to obtain the saliency target mask corresponding to each image in the video sequence.
The step 1 is specifically expressed as follows: a pre-trained residual module is adopted to model the spatial features; the residual module uses the first five layer groups of the ResNet-50 network with the down-sampling operation of the fifth group removed; the features output by the residual module are then input into the dilated convolution pyramid pooling module to extract multi-scale spatial features and obtain a spatial feature map.
The step 2 comprises the following steps:
Step 2.1: the forward unit of the bidirectional ConvLSTM module is adopted to model the space-time correlation between the current frame and the forward frame on the spatial features output by the spatial feature extraction module, so as to obtain the output of the forward unit;
Step 2.2: the output of the forward unit is sent to the attention module I, and the feature points in the feature map obtained by the forward unit are weighted to obtain a feature map G1 containing the contrast between non-salient and salient targets;
Step 2.3: the feature map weighted by the attention module I is input into the backward unit to model the space-time correlation between the current frame and the backward frame;
Step 2.4: the output of the backward unit is sent to the attention module II, and the feature points in the feature map obtained by the backward unit are weighted to obtain a feature map G2 containing the contrast between non-salient and salient targets;
Step 2.5: the feature maps G1 and G2 are spliced and input into a convolution layer with a 3×3 kernel for feature extraction, and the video feature frames with space-time consistency are obtained through a tanh activation function.
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed based on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input into the first attention distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_i,u = q_u · k_i,u   (1)
A = softmax(D)   (2)
where q_u represents a one-dimensional tensor (a single position) in Q; k_i,u denotes the feature points in K that share the same abscissa or ordinate as q_u; d_i,u represents the relationship between the feature points in each channel of Q and the feature points in K; and softmax represents the activation function.
The tensors A and V are then input into the second attention distribution calculation layer; the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is added to the original feature map as a weight distribution, so as to obtain a high-level feature map containing the contrast between salient and non-salient objects.
The step 3 comprises the following steps:
Step 3.1: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are sent in parallel to the second layer of link modules I, II and III, and the detail features of the salient targets are extracted by a depthwise separable convolution with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are sent in parallel to the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel is adopted to adjust the channels of the feature maps obtained in the second layer.
The step 4 is specifically expressed as follows: the splicing, fusion and up-sampling of the features in the feature fusion and up-sampling module are realized in a recursive form: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the resulting features are fused with the low-level spatial features output by link module III to obtain feature frames of the same size as the input pictures.
The step 5 is specifically expressed as follows: the feature frames obtained by the feature fusion and up-sampling module are raised in dimension through a convolution layer with a 3×3 kernel, pixel-level classification is then performed through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to each video frame.
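To make the data flow of steps 1 to 5 concrete, a minimal PyTorch-style sketch of the overall forward pass is given below. It is an illustrative composition only: the class and argument names (VideoSODNet, spatial_extractor, st_enhancer and so on) are hypothetical placeholders, not the exact implementation of the invention.

```python
# Hypothetical end-to-end forward pass for steps 1 to 5; all module classes and
# argument names are illustrative placeholders, not the invention's exact code.
import torch.nn as nn

class VideoSODNet(nn.Module):
    def __init__(self, spatial_extractor, st_enhancer, link_modules, fusion, decoder):
        super().__init__()
        self.spatial_extractor = spatial_extractor       # step 1: ResNet-50 (first 5 groups) + ASPP
        self.st_enhancer = st_enhancer                   # step 2: bidirectional ConvLSTM + CCA
        self.link_modules = nn.ModuleList(link_modules)  # step 3: link modules I, II, III
        self.fusion = fusion                             # step 4: recursive fusion + up-sampling
        self.decoder = decoder                           # step 5: 3x3 conv, 1x1 conv, sigmoid

    def forward(self, clip):                             # clip: (T, 3, H, W) video frames
        f_aspp, low_feats = self.spatial_extractor(clip)          # spatial + layer 2/3/4 features
        f_cl = self.st_enhancer(f_aspp)                           # space-time consistency features
        f_links = [link(x) for link, x in zip(self.link_modules, low_feats)]
        fused = self.fusion(f_cl, f_links, out_size=clip.shape[-2:])
        return self.decoder(fused)                                # saliency target masks (T, 1, H, W)
```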
The beneficial effects of the invention are as follows:
The invention provides a video saliency target detection system and method based on a spatio-temporal convolutional neural network, namely a video-oriented saliency target detection method with higher efficiency and better accuracy. A lightweight backbone network and a dilated convolution pyramid pooling module are adopted to extract the spatial features of the salient target; a recurrent neural network embedded with a criss-cross self-attention mechanism is then adopted to extract space-time consistency features, and the feature values of the feature maps are weighted at the same time to improve the contrast between salient and non-salient features, which avoids, to a certain extent, the interference of background information with foreground information and improves the prediction accuracy of the method; meanwhile, a low-level semantic information link module is adopted to fuse the low-level spatial features with the space-time consistency features, so that the loss of low-level spatial features is reduced as much as possible and the prediction at object edges is more accurate. Compared with traditional video salient object detection methods, the proposed method balances speed and accuracy and is better suited to practical video salient object detection.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples of specific embodiments.
As shown in fig. 1, a video saliency target detection system based on a spatio-temporal convolutional neural network comprises: a spatial feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module, and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and weighting the feature values in the feature maps;
The low-level semantic information link module is used for extracting low-level spatial features and removing redundant background information from them;
The feature fusion and up-sampling module is used for fusing the low-level spatial features with the space-time consistency features and expanding the feature map to the size of the input video frames. The structure of the feature fusion and up-sampling module is shown in fig. 4: the low-level spatial features output by link modules I, II and III and the space-time consistency features output by the space-time consistency feature enhancement module are fused and up-sampled sequentially in a recursive manner, where Conv represents a convolution operation, F_l2, F_l3 and F_l4 respectively represent the low-level spatial features derived from the outputs of the 2nd, 3rd and 4th convolution layers of the residual module, S_i (i = 1, ..., T) represents the resulting saliency target mask, and F_CL represents the obtained space-time consistency features.
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual module and a dilated convolution pyramid pooling module;
The residual module is used for modeling the spatial features;
The dilated convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The structure of the spatial feature extraction module is shown in fig. 2: a video containing T frames of images is input, and the video frames pass through the residual module and the dilated convolution pyramid pooling module in turn to extract spatial features, where I_i (i = 1, ..., T) represents the input video frames, ASPP represents the dilated convolution pyramid pooling module, and F_aspp represents the obtained spatial features.
The space-time consistency feature enhancement module comprises: a bidirectional ConvLSTM module, an attention module I, an attention module II and a splicing module;
The bidirectional ConvLSTM module is used for modeling the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
The attention module I is used for weighting the feature points in the feature map obtained by the forward unit of the bidirectional ConvLSTM module;
The attention module II is used for weighting the feature points in the feature map obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the feature map obtained by the forward unit of the bidirectional ConvLSTM module with the feature map obtained by the backward unit, and obtaining the video feature frames with space-time consistency through a tanh activation function.
The structure of the space-time consistency feature enhancement module is shown in fig. 3: the spatial features output by the spatial feature extraction module are sent in turn to the forward unit of the bidirectional ConvLSTM module, the attention module I, the backward unit of the bidirectional ConvLSTM module and the attention module II to extract space-time consistency features, and the feature points in the resulting feature maps are weighted, where CCA_i (i = 1, 2, ..., T) represents a criss-cross attention module, ConvLSTM denotes a convolutional long short-term memory network, F_aspp denotes the spatial features obtained by the spatial feature extraction module, and F_CL denotes the obtained space-time consistency features.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting the low-level spatial features output by the fourth convolution layer of the residual module;
The link module II is used for extracting the low-level spatial features output by the third convolution layer of the residual module;
The link module III is used for extracting the low-level spatial features output by the second convolution layer of the residual module.
A video saliency target detection method based on a spatio-temporal convolutional neural network is implemented with the above video saliency target detection system and comprises the following steps:
Step 1: the video containing T frames of images is sent to the spatial feature extraction module, and the spatial features of the video frames are extracted from coarse to fine. The spatial feature extraction module comprises a residual module and a dilated convolution pyramid pooling module. First, a pre-trained residual module is adopted to perform a preliminary modeling of the spatial features; the residual module uses the first five layer groups of the ResNet-50 network, with the down-sampling operation of the fifth group removed. The features output by the residual module are then input into the dilated convolution pyramid pooling module to extract multi-scale spatial features and obtain a spatial feature map. Let I = {I_1, I_2, ..., I_T} represent the video containing T frames of images; the above modeling process is as follows:
F_aspp^t = M_aspp(M_res(I_t)), t = 1, ..., T
where M_res represents the backbone network, M_aspp represents the dilated convolution pyramid pooling module, and F_aspp^t represents the resulting spatial features.
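A minimal sketch of this step is given below. It assumes a recent torchvision build that provides resnet50 with the replace_stride_with_dilation option and the ASPP block from its DeepLabV3 implementation; the atrous rates, channel sizes and the mapping of the patent's layer numbering onto torchvision's layer1-layer4 are illustrative assumptions rather than the exact configuration of the invention.

```python
# Sketch of step 1 (M_res followed by M_aspp); hyper-parameters are illustrative.
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.segmentation.deeplabv3 import ASPP

class SpatialFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # First five layer groups of ResNet-50; dilation replaces the stride of the
        # fifth group (layer4), i.e. its down-sampling is removed.
        backbone = resnet50(weights="IMAGENET1K_V1",
                            replace_stride_with_dilation=[False, False, True])
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.aspp = ASPP(in_channels=2048, atrous_rates=[6, 12, 18], out_channels=256)

    def forward(self, frames):                  # frames: (T, 3, H, W)
        x2 = self.layer1(self.stem(frames))     # assumed "2nd convolution layer" output
        x3 = self.layer2(x2)                    # assumed "3rd convolution layer" output
        x4 = self.layer3(x3)                    # assumed "4th convolution layer" output
        x5 = self.layer4(x4)                    # fifth group, stride removed via dilation
        f_aspp = self.aspp(x5)                  # multi-scale spatial features F_aspp
        return f_aspp, (x4, x3, x2)             # fed to link modules I, II and III
```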
Step 2: the spatial features extracted in step 1 are sent to the space-time consistency feature enhancement module to further learn deeper space-time consistency features, and the feature values in the feature maps are weighted to improve the contrast between salient and non-salient features; this comprises the following steps:
Step 2.1: the forward unit of the bidirectional ConvLSTM module is adopted to model the space-time correlation between the current frame and the forward frame on the spatial features output by the spatial feature extraction module;
The modeling process of the forward unit is shown in the formula:
H_t^f = ConvLSTM(F_aspp^t, H_(t-1)^f)
where t represents the current frame, H_t^f represents the output of the forward unit, H_(t-1)^f represents the features of the frame preceding the current frame, and F_aspp^t represents the spatial features obtained in step 1.
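ConvLSTM is not part of core PyTorch, so a minimal cell sketch is given below; it follows the standard convolutional LSTM gate formulation, and the kernel size and channel counts are illustrative.

```python
# Minimal ConvLSTM cell sketch (ConvLSTM is not part of core PyTorch).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                       # previous hidden and cell states
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                      # new cell state
        h = o * torch.tanh(c)                              # new hidden state H_t
        return h, c

# The forward unit applies the cell frame by frame, e.g.:
#   for t in range(T): h, c = cell(f_aspp[t:t + 1], (h, c))
```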
Step 2.2: the output of the forward unit is sent to the attention module I, and the feature points in the feature map obtained by the forward unit are weighted to obtain a feature map G1 containing the contrast between non-salient and salient targets;
The weighting process is as follows:
G1_t = M_CCA(H_t^f)
where G1_t represents the feature map containing the contrast between non-salient and salient objects, and M_CCA represents the criss-cross attention module I.
Step 2.3: the feature map weighted by the attention module I is input into the backward unit to model the space-time correlation between the current frame and the backward frame;
The modeling process of the backward unit is shown in the formula:
H_t^b = ConvLSTM(G1_t, H_(t+1)^b)
where t represents the current frame, H_t^b represents the output of the backward unit, and H_(t+1)^b represents the features of the frame following the current frame.
Step 2.4: the output of the backward unit is sent to the attention module II, and the feature points in the feature map obtained by the backward unit are weighted to obtain a feature map G2 containing the contrast between non-salient and salient targets;
The weighting process is as follows:
G2_t = M_CCA(H_t^b)
where G2_t represents the feature map containing the contrast between non-salient and salient objects, and M_CCA represents the criss-cross attention module II.
Step 2.5: the feature maps G1 and G2 are spliced and input into a convolution layer with a 3×3 kernel for feature extraction, and the video feature frames with space-time consistency are obtained through a tanh activation function;
The modeling process is shown in the formula:
F_CL^t = tanh(Conv([G1_t, G2_t]))
where [·, ·] represents the splicing (concatenation) operation, Conv represents the 3×3 convolution, and F_CL^t represents the space-time consistency features obtained by the space-time consistency feature enhancement module.
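Putting steps 2.1 to 2.5 together, the sketch below shows one possible arrangement of the forward unit, attention module I, backward unit, attention module II and splicing module. It reuses the ConvLSTMCell sketch above and a CrissCrossAttention module sketched after the CCA formulas below; shapes, channel sizes and state initialization are illustrative assumptions.

```python
# Possible arrangement of steps 2.1-2.5 (illustrative, not the invention's exact code).
import torch
import torch.nn as nn

class STConsistencyEnhancer(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.fwd_cell = ConvLSTMCell(ch, ch)        # forward unit
        self.bwd_cell = ConvLSTMCell(ch, ch)        # backward unit
        self.cca_fwd = CrissCrossAttention(ch)      # attention module I
        self.cca_bwd = CrissCrossAttention(ch)      # attention module II
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, f_aspp):                      # f_aspp: (T, C, H, W)
        T, C, H, W = f_aspp.shape
        h = c = f_aspp.new_zeros(1, C, H, W)
        g1 = []
        for t in range(T):                          # steps 2.1-2.2
            h, c = self.fwd_cell(f_aspp[t:t + 1], (h, c))
            g1.append(self.cca_fwd(h))              # G1_t
        h = c = f_aspp.new_zeros(1, C, H, W)
        g2 = [None] * T
        for t in reversed(range(T)):                # steps 2.3-2.4
            h, c = self.bwd_cell(g1[t], (h, c))
            g2[t] = self.cca_bwd(h)                 # G2_t
        g1, g2 = torch.cat(g1), torch.cat(g2)
        return torch.tanh(self.fuse(torch.cat([g1, g2], dim=1)))   # F_CL, step 2.5
```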
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed based on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input into the first attention distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_i,u = q_u · k_i,u   (1)
A = softmax(D)   (2)
where q_u represents a one-dimensional tensor (a single position) in Q; k_i,u denotes the feature points in K that share the same abscissa or ordinate as q_u; d_i,u represents the relationship between the feature points in each channel of Q and the feature points in K; and softmax represents the activation function.
The tensors A and V are then input into the second attention distribution calculation layer; the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is added to the original feature map as a weight distribution, so as to obtain a high-level feature map containing the contrast between salient and non-salient objects;
The modeling process is shown in the formula:
f_out = a_u · v_i,u + f_input
where a_u represents a one-dimensional tensor (a single position) in A; v_i,u denotes the points in V that share the same abscissa or ordinate as a_u; f_out ∈ F_out, and F_out denotes the space-time consistency features containing the contrast between non-salient and salient object features.
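A sketch of such a criss-cross attention module is given below. It is a simplified variant that computes the column (same abscissa) and row (same ordinate) attention separately and adds the two aggregated results back to the input; the original CCNet formulation normalizes the full criss-cross path jointly, and the channel-reduction factor here is an illustrative assumption.

```python
# Simplified criss-cross attention sketch (not the exact CCNet implementation).
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // reduction, 1)   # three parallel 1x1 convolutions
        self.k = nn.Conv2d(ch, ch // reduction, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Column attention: each column is treated as a sequence of length H.
        q_c = q.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        k_c = k.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        v_c = v.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        a_c = torch.softmax(q_c @ k_c.transpose(1, 2), dim=-1)      # eqs. (1)-(2)
        out_c = (a_c @ v_c).reshape(B, W, H, C).permute(0, 3, 2, 1)
        # Row attention: each row is treated as a sequence of length W.
        q_r = q.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        k_r = k.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        v_r = v.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        a_r = torch.softmax(q_r @ k_r.transpose(1, 2), dim=-1)
        out_r = (a_r @ v_r).reshape(B, H, W, C).permute(0, 3, 1, 2)
        return out_c + out_r + x                     # attention-weighted V added to f_input
```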
Step 3: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the spatial feature extraction module are input into the low-level semantic information link module composed of Ghost modules to further extract low-level spatial features and remove their redundant background information; this comprises the following steps:
Step 3.1: the spatial features of different granularities obtained from the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are sent in parallel to the second layer of link modules I, II and III, and the detail features of the salient targets are extracted by a depthwise separable convolution with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are sent in parallel to the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel is adopted to adjust the channels of the feature maps obtained in the second layer.
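A sketch of one link module following steps 3.1 to 3.3 is given below (a Ghost-style block: an ordinary convolution with normalization, a cheap 3×3 depthwise-separable convolution with normalization and ReLU, and a 1×1 channel-adjustment convolution); the kernel size of the first layer and the channel sizes are illustrative assumptions.

```python
# Sketch of one low-level semantic information link module (steps 3.1-3.3).
import torch.nn as nn

class LinkModule(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        # Layer 1: convolution followed by normalization (step 3.1).
        self.layer1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch))
        # Layer 2: 3x3 depthwise-separable convolution, each conv followed by BN, then ReLU (step 3.2).
        self.layer2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch),  # depthwise
            nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 1),                            # pointwise
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Layer 3: 1x1 convolution for channel adjustment (step 3.3).
        self.layer3 = nn.Conv2d(mid_ch, out_ch, 1)

    def forward(self, x):
        return self.layer3(self.layer2(self.layer1(x)))
```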
Step 4: the low-level semantic features output by the low-level semantic information link module in step 3 and the space-time consistency features output by the space-time consistency feature enhancement module in step 2 are input into the feature fusion and up-sampling module for fusion, and the feature map is expanded to the size of the input video frames. Specifically, the splicing, fusion and up-sampling of the features in the feature fusion and up-sampling module are realized in a recursive form: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the resulting features are fused with the low-level spatial features output by link module III to obtain feature frames of the same size as the input pictures.
The modeling process is shown in the formulas:
F_t^4 = up(Conv([F_CL^t, F_l4]))
F_t^3 = up(Conv([F_t^4, F_l3]))
F_t = up(Conv([F_t^3, F_l2]))
where F_CL^t represents the space-time consistency features obtained in step 2; F_l4, F_l3 and F_l2 represent the low-level semantic features obtained by passing the outputs of the 4th, 3rd and 2nd convolution layers of the spatial feature extraction module through the low-level semantic information link module, respectively; F_t represents the finally obtained feature frame; Conv represents a convolution operation; up represents an up-sampling operation; and [·, ·] represents splicing.
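A sketch of the recursive fusion and up-sampling is given below, assuming that F_CL and the three link-module outputs share a common channel count; the bilinear interpolation mode and the choice of matching each intermediate result to the spatial size of the next link-module feature are illustrative assumptions.

```python
# Sketch of the recursive feature fusion and up-sampling of step 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUpsample(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv4 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse F_CL with link module I output
        self.conv3 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse with link module II output
        self.conv2 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse with link module III output

    def forward(self, f_cl, f_links, out_size):
        f_l4, f_l3, f_l2 = f_links                         # outputs of link modules I, II, III
        x = F.interpolate(self.conv4(torch.cat([f_cl, f_l4], dim=1)),
                          size=f_l3.shape[-2:], mode="bilinear", align_corners=False)
        x = F.interpolate(self.conv3(torch.cat([x, f_l3], dim=1)),
                          size=f_l2.shape[-2:], mode="bilinear", align_corners=False)
        x = F.interpolate(self.conv2(torch.cat([x, f_l2], dim=1)),
                          size=out_size, mode="bilinear", align_corners=False)
        return x                                           # F_t at the input-frame resolution
```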
Step 5: the feature maps obtained in step 4 are sent to the decoder for decoding to obtain the salient target mask corresponding to each image in the video sequence. Specifically, the feature frame obtained by the feature fusion and up-sampling module is raised in dimension through a convolution layer with a 3×3 kernel, pixel-level classification is then performed through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to the video frame.
The modeling process is shown in the formula:
S_t = δ_sigmoid(Conv(F_t))
where δ_sigmoid represents the sigmoid activation function and S_t represents the saliency target mask of the corresponding video frame.
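A sketch of the decoder is given below; the intermediate channel count used for the dimension-raising 3×3 convolution is an illustrative assumption.

```python
# Decoder sketch for step 5: 3x3 convolution, 1x1 convolution, sigmoid normalization.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_ch=256, mid_ch=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),  # dimension-raising 3x3 convolution
            nn.Conv2d(mid_ch, 1, 1),                 # 1x1 convolution, pixel-level classification
            nn.Sigmoid(),                            # delta_sigmoid normalization
        )

    def forward(self, f_t):                          # f_t: (T, C, H, W)
        return self.head(f_t)                        # S_t: saliency target masks, (T, 1, H, W)
```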
The invention is based on an encoder-decoder structure: spatial feature extraction and space-time consistency feature extraction are performed at the encoding end, and feature fusion and saliency target prediction are performed at the decoding end, improving the accuracy of saliency target prediction while maintaining efficiency. In the space-time consistency feature enhancement module, a ConvLSTM structure embedded with a two-layer criss-cross attention mechanism (DCA_ConvLSTM for short) is designed based on the self-attention mechanism; this module introduces global information in both the spatial and temporal dimensions and weights the feature values corresponding to the saliency map to improve the contrast between salient and non-salient features, which avoids, to a certain extent, the interference of background information with foreground information, improves the prediction accuracy of the method, and simultaneously yields the space-time consistency features.
Furthermore, the invention provides a low-level semantic information link module composed of Ghost modules. As the depth of the neural network changes, the semantics expressed by the features extracted at different layers are not identical. A deep convolution layer can cover a larger receptive field with a smaller convolution kernel and therefore extracts higher-level semantic features, while a shallow convolution layer has a smaller receptive field and its features reflect local detail information of the image, such as contour information. Video saliency target detection is a pixel-level prediction task, and some object edges cannot be predicted accurately if detail information is missing. Therefore, the invention provides the low-level semantic information link module composed of Ghost modules, so that the loss of low-level semantic features is reduced as much as possible and the prediction at object edges is more accurate.
To demonstrate the effectiveness of the video salient object detection method of the present invention, the proposed method and 7 other advanced salient object detection methods were tested on the DAVIS, VOS and FBMS datasets; the visualization results are shown in fig. 5. The 7 methods are RFCN, DSS, PiCA, SSA, FCNS, FGRN and PDB; GT is the ground-truth label. In fig. 5, the first row shows the original video frames, the last row (GT) shows the ground-truth labels, the second-to-last row shows the detection results of the video saliency target detection method proposed by the present invention, and the remaining rows show the prediction results of saliency target detection methods previously proposed in the field. As can be seen from the comparison results in fig. 5, the proposed method is more accurate in locating the contours of salient objects and predicting their details. In addition, for scenes with multiple salient targets and complex backgrounds, the proposed method also achieves a good detection effect.