
CN114926760B - Video saliency target detection system and method based on space-time convolutional neural network - Google Patents


Info

Publication number
CN114926760B
CN114926760B (Application CN202210501874.XA)
Authority
CN
China
Prior art keywords
module
space
features
feature
convolution
Prior art date
Legal status
Active
Application number
CN202210501874.XA
Other languages
Chinese (zh)
Other versions
CN114926760A (en)
Inventor
雷为民
姜怡晗
侯玉莹
张伟
叶文慧
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202210501874.XA priority Critical patent/CN114926760B/en
Publication of CN114926760A publication Critical patent/CN114926760A/en
Application granted granted Critical
Publication of CN114926760B publication Critical patent/CN114926760B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The present invention provides a video salient target detection system and method based on a spatiotemporal convolutional neural network. The system includes a spatial feature extraction module, a spatiotemporal consistency feature enhancement module, a feature fusion and upsampling module, a low-level semantic information link module, and a decoder. A recurrent neural network embedded with a criss-cross self-attention mechanism extracts spatiotemporally consistent features, and the feature values in the feature maps are weighted to enhance the contrast between salient and non-salient features, thereby avoiding interference of background information with foreground information to a certain extent. At the same time, a low-level semantic information link module fuses low-level spatial features with the spatiotemporally consistent features, reducing the loss of low-level spatial features and making the prediction of object edges more accurate. Compared with traditional video salient target detection methods, the present invention balances speed and accuracy and is better suited to practical video salient target detection.

Description

Video saliency target detection system and method based on space-time convolutional neural network
Technical Field
The invention belongs to the technical field of digital image processing, and particularly relates to a video saliency target detection system and method based on a space-time convolutional neural network.
Background
With the rapid development of the internet and communication technology, people can acquire more and more external information, and it is estimated that more than 80% of this information consists of visual resources. As the requirements on information quality grow, the resolution of images and videos continues to increase, so video analysis and related tasks require ever larger computing and storage resources. When analyzing and processing a video, people tend to pay attention to only part of its content: in conversation-type videos with a fixed background, viewers are more attentive to the people appearing in the video, while in surveillance videos they are more concerned with newly appearing objects. If the objects or regions of interest can be mined in advance and limited resources preferentially allocated to these regions, the capability of analyzing and processing video can be greatly improved. How to efficiently mine the information people care most about from massive data has therefore become a major hot spot in the field of computer vision. Salient object detection based on human visual attention can accurately find the most attractive regions in an image or video, which makes this field an important research direction.
Salient object detection in video is divided into conventional methods and deep-learning-based methods. Most conventional video salient object detection methods rely on handcrafted low-level features for heuristic saliency reasoning, so they cannot handle complex video sequences that require knowledge and semantic reasoning, and they generally suffer from poor detection performance or high detection cost. Deep-learning-based salient object detection mainly comprises three parts: spatial feature extraction, temporal feature extraction and space-time feature fusion. Although such methods are gradually replacing conventional ones thanks to their high detection accuracy, freedom from preprocessing and good real-time performance, existing deep-learning-based salient object detection methods still have the following problems:
Problem 1: during propagation through the neural network, the detected feature maps give the same attention to every pixel, i.e. the network treats the extracted features equally, so regions belonging to the background interfere with the prediction of salient objects and degrade the detection performance of the network;
Problem 2: when performing the final saliency prediction, most methods use only high-level space-time features and ignore the detail information carried by low-level semantic features. Salient object detection is a pixel-level prediction task, and without sufficient detail information the edges of some objects cannot be predicted accurately.
Chinese patent CN109784183A provides a video salient object detection method based on a cascade convolutional network and optical flow: spatial features are first extracted by a convolutional neural network, an optical flow field is then extracted by an optical flow method, and finally the two are spliced and sent to a dynamic optimization network for pixel-level classification, yielding a saliency map for each video frame. Compared with conventional video salient object detection methods, this method greatly improves prediction accuracy in experiments. However, although it extracts spatial features through a cascade convolutional network and temporal features through an optical flow method and then simply splices them to detect salient objects, it essentially severs the temporal and spatial features of the video, which leads to low detection accuracy and poor real-time performance. Furthermore, the optical flow method is computationally very expensive, which necessarily reduces video processing efficiency in real applications. Meanwhile, because the convolution operation used to extract information treats all features in the network equally, background information interferes with the salient object information to a certain extent, so the detection effect is not ideal.
Disclosure of Invention
In view of these problems, the invention provides a video saliency target detection system and method based on a space-time convolutional neural network, which improves the accuracy of salient object prediction while maintaining efficiency, without splitting the temporal and spatial features.
A video saliency target detection system based on a spatio-temporal convolutional neural network, comprising: the device comprises a space feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and carrying out weighting operation on feature values in the feature images;
The low-level semantic information link module is used for extracting low-level spatial features and removing background redundant information of the low-level spatial features;
The feature fusion and up-sampling module is used for fusing low-level spatial features and space-time consistency features and expanding a feature map to be the same as an input video in size;
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual error module and a cavity convolution pyramid pooling module;
The residual error module is used for carrying out modeling operation on the space characteristics;
and the cavity convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The space-time consistency feature enhancement module comprises: bidirectional ConvLSTM module, attention module I, attention module II, and splicing module;
The bidirectional ConvLSTM module is used for carrying out modeling operation according to the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
the attention module I is used for weighting characteristic points in the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module;
the attention module II is used for weighting characteristic points in the characteristic diagram obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module and the characteristic diagram obtained by the backward unit, and obtaining the video characteristic frame with space-time consistency through the tanh activation function.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting low-level space features output by a fourth convolution layer in the residual error module;
The link module II is used for extracting low-level space features output by a third convolution layer in the residual error module;
And the link module III is used for extracting low-level spatial features output by the second convolution layer in the residual error module.
The method for detecting the video saliency target based on the space-time convolutional neural network is realized based on a video saliency target detection system based on the space-time convolutional neural network, and comprises the following steps:
step 1: collecting a video containing a T frame image, and extracting spatial characteristics of a video frame;
step 2: extracting space-time consistency characteristics of video frames, and carrying out weighting operation on characteristic values in a characteristic diagram;
step 3: extracting low-level spatial features according to depth separable convolution operation;
Step 4: performing feature fusion and up-sampling operation on the low-level spatial features and the space-time consistency features to obtain a high-level feature map containing the T-frame video;
step 5: and decoding the advanced feature map to obtain a saliency target mask corresponding to each image in the video sequence.
Step 1 is specifically expressed as: a pre-trained residual module models the spatial features; the residual module uses the first 5 layer groups of the residual network ResNet-50 with the down-sampling operation of the fifth group removed; the features output by the residual module are then input into the cavity convolution pyramid pooling module to extract multi-scale spatial features and obtain a spatial feature map.
The step 2 comprises the following steps:
Step 2.1: carrying out space-time correlation modeling operation between the current frame and the forward frame by adopting a forward unit of the bidirectional convLSTM module to the space features output by the space feature extraction module, so as to obtain an output result of the forward unit;
Step 2.2: sending the output result of the forward unit into an attention module I, and weighting the characteristic points in the characteristic diagram obtained by the forward unit to obtain a characteristic diagram G1 containing non-salient targets and salient target contrast;
step 2.3: after being weighted by the attention module I, the obtained feature map is input into a backward unit to carry out space-time correlation modeling operation between the current frame and the backward frame;
step 2.4: sending the output result of the backward unit into an attention module II, and weighting the characteristic points in the characteristic diagram obtained by the backward unit to obtain a characteristic diagram G2 containing non-salient targets and salient target contrast;
Step 2.5: splicing the feature maps G1 and G2, feeding the result into a convolution layer with a 3×3 kernel for feature extraction, and obtaining video feature frames with space-time consistency through a tanh activation function.
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input to the first attention-distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_{i,u} = q_u · k_{i,u}    (1)
A = softmax(D)             (2)
where q_u denotes a one-dimensional tensor of Q; k_{i,u} denotes the feature points of K sharing the same abscissa or ordinate as q_u; d_{i,u} denotes the relationship between the feature points of each channel of Q and the feature points of K; and softmax denotes the activation function;
the obtained feature tensors A and V are input into the second attention-distribution calculation layer, the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is then added to the original feature map as a weight distribution, yielding the high-level feature map containing the contrast between salient and non-salient objects.
The step3 comprises the following steps:
Step 3.1: the spatial features of different granularities obtained by the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are fed in parallel into the second layer of link modules I, II and III, where detail features of the salient objects are extracted by a depthwise separable convolution operation with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are fed in parallel into the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel performs channel adjustment on the feature map obtained by the second layer.
Step 4 is specifically expressed as: the splicing, fusion and up-sampling of the features in the feature fusion and up-sampling module are realized in a recursive form, as follows: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and the fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the output features are then fused with the low-level spatial features output by link module III, yielding feature frames of the same size as the input pictures.
Step 5 is specifically expressed as: the feature frames obtained by the feature fusion and up-sampling module undergo a dimension-lifting operation through a convolution layer with a 3×3 kernel, then pixel-level classification through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to each video frame.
The beneficial effects of the invention are as follows:
The invention provides a video saliency target detection system and method based on a space-time convolutional neural network, a video-oriented salient object detection method with higher efficiency and better accuracy. A lightweight backbone network and a cavity convolution pyramid pooling module extract the spatial features of salient objects; a recurrent neural network embedded with a criss-cross self-attention mechanism then extracts space-time consistency features, while the feature values of the feature maps are weighted to increase the contrast between salient and non-salient features, so that interference of background information with foreground information is avoided to a certain extent and prediction accuracy is improved; meanwhile, a low-level semantic information link module fuses low-level spatial features with the space-time consistency features, so that the loss of low-level spatial features is reduced as much as possible and prediction at object edges is more accurate. Compared with traditional video salient object detection methods, the method balances speed and accuracy and is better suited to practical video salient object detection.
Drawings
FIG. 1 is a block diagram of a video saliency target detection system based on a space-time convolutional neural network in the present invention;
FIG. 2 is a schematic diagram of a spatial feature extraction module according to the present invention;
FIG. 3 is a schematic diagram of a space-time consistent feature enhancement module in accordance with the present invention;
FIG. 4 is a schematic diagram of a feature fusion and upsampling module according to the present invention;
FIG. 5 is a comparison of the visualization results of the method of the present invention with other salient object detection methods.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples of specific embodiments.
As shown in fig. 1, a video saliency target detection system based on a space-time convolutional neural network includes: the device comprises a space feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and carrying out weighting operation on feature values in the feature images;
The low-level semantic information link module is used for extracting low-level spatial features and removing background redundant information of the low-level spatial features;
The feature fusion and up-sampling module is used for fusing low-level spatial features and space-time consistency features and expanding the feature map to the same size as the input video. Its structure is shown in FIG. 4: the low-level spatial features output by link modules I, II and III and the space-time consistency features output by the space-time consistency feature enhancement module are fused and up-sampled sequentially in a recursive manner, where Conv denotes the convolution operation, F_l2, F_l3 and F_l4 denote the low-level spatial features output by the 2nd, 3rd and 4th convolution layers of the residual module, S_i (i = 1, ..., T) denotes the resulting saliency target masks, and F_CL denotes the resulting space-time consistency features.
The decoder is used for decoding the feature map to obtain a saliency target mask corresponding to each image in the video sequence.
The spatial feature extraction module comprises: a residual error module and a cavity convolution pyramid pooling module;
The residual error module is used for carrying out modeling operation on the space characteristics;
and the cavity convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map.
The structure of the spatial feature extraction module is shown in FIG. 2: a video containing T frames is input, and the video frames pass through the residual module and the cavity convolution pyramid pooling module in turn to extract spatial features, where I_i (i = 1, ..., T) denotes the input video frames, ASPP denotes the cavity convolution pyramid pooling module, and F_aspp denotes the obtained spatial features.
The space-time consistency feature enhancement module comprises: bidirectional ConvLSTM module, attention module I, attention module II, and splicing module;
The bidirectional ConvLSTM module is used for carrying out modeling operation according to the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
the attention module I is used for weighting characteristic points in the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module;
the attention module II is used for weighting characteristic points in the characteristic diagram obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module and the characteristic diagram obtained by the backward unit, and obtaining the video characteristic frame with space-time consistency through the tanh activation function.
The structure of the space-time consistency feature enhancement module is shown in FIG. 3: the spatial features output by the spatial feature extraction module are sent in turn to the forward unit of the bidirectional ConvLSTM module, attention module I, the backward unit of the bidirectional ConvLSTM module and attention module II to extract space-time consistency features, and the feature points in the resulting feature maps are weighted, where CCA_i (i = 1, 2, ..., T) denotes a criss-cross attention module, ConvLSTM denotes the convolutional long short-term memory network, F_aspp denotes the spatial features obtained by the spatial feature extraction module, and F_CL denotes the obtained space-time consistency features.
The low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting low-level space features output by a fourth convolution layer in the residual error module;
The link module II is used for extracting low-level space features output by a third convolution layer in the residual error module;
And the link module III is used for extracting low-level spatial features output by the second convolution layer in the residual error module.
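Before turning to the method steps, the following minimal PyTorch sketch shows how the five modules described above compose into one forward pass over a T-frame clip. The class names, argument names, tensor shapes and the convention of returning the per-layer low-level features from the spatial extractor are illustrative assumptions, not code from the patent.

```python
import torch.nn as nn

class VideoSODNet(nn.Module):
    """Structural sketch of the five-module pipeline (names and shapes assumed)."""
    def __init__(self, spatial, enhance, links, fuse, decoder):
        super().__init__()
        self.spatial = spatial    # spatial feature extraction (residual module + ASPP)
        self.enhance = enhance    # bidirectional ConvLSTM + criss-cross attention
        self.links = links        # low-level semantic information link modules I-III
        self.fuse = fuse          # recursive feature fusion and up-sampling
        self.decoder = decoder    # 3x3 conv -> 1x1 conv -> sigmoid

    def forward(self, frames):                     # frames: (T, 3, H, W)
        f_aspp, lows = self.spatial(frames)        # high-level spatial + low-level features
        f_cl = self.enhance(f_aspp)                # space-time consistency features
        lows = [link(x) for link, x in zip(self.links, lows)]
        f_t = self.fuse(f_cl, *lows)               # fused features at input resolution
        return self.decoder(f_t)                   # per-frame saliency masks S_1..S_T
```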
A video saliency target detection method based on a space-time convolutional neural network is realized based on a video saliency target detection system based on the space-time convolutional neural network, and comprises the following steps:
Step 1: the video containing T frames is sent to the spatial feature extraction module, and the spatial features of the video frames are extracted from coarse to fine. The spatial feature extraction module comprises a residual module and a cavity convolution pyramid pooling module. First, a pre-trained residual module performs preliminary modeling of the spatial features; the residual module uses the first 5 layer groups of the ResNet-50 network with the down-sampling operation of the fifth group removed. The features output by the residual module are then input into the cavity convolution pyramid pooling module to extract multi-scale spatial features and obtain a low-level spatial feature map. Let I = {I_1, I_2, ..., I_T} denote the video containing T frames; the modeling process is as follows:
F_aspp = M_aspp(M_res(I))
where M_res denotes the backbone network, M_aspp denotes the cavity convolution pyramid pooling module, and F_aspp denotes the resulting spatial feature map.
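As a concrete reading of step 1, the sketch below builds the backbone from torchvision's ResNet-50, using replace_stride_with_dilation to remove the fifth group's down-sampling as described, and follows it with a small ASPP head. The dilation rates, output channel width and the choice of which intermediate outputs are exposed for the link modules are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SpatialFeatureExtractor(nn.Module):
    def __init__(self, out_ch=256):
        super().__init__()
        # First five layer groups of a pre-trained ResNet-50; dilation replaces the
        # stride of the fifth group so its down-sampling is removed (step 1).
        net = resnet50(weights="IMAGENET1K_V1",
                       replace_stride_with_dilation=[False, False, True])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # Cavity (atrous) convolution pyramid pooling over the 2048-channel output.
        rates = (1, 6, 12, 18)                      # illustrative dilation rates
        self.aspp = nn.ModuleList(
            nn.Conv2d(2048, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):                           # x: (T, 3, H, W)
        x = self.stem(x)
        f2 = self.layer1(x)                         # 2nd group -> link module III
        f3 = self.layer2(f2)                        # 3rd group -> link module II
        f4 = self.layer3(f3)                        # 4th group -> link module I
        f5 = self.layer4(f4)
        f_aspp = self.project(torch.cat([b(f5) for b in self.aspp], dim=1))
        return f_aspp, (f4, f3, f2)                 # F_aspp plus low-level features
```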
Step 2: the spatial features extracted in step 1 are sent to the space-time consistency feature enhancement module to further learn deeper space-time consistency features, and the feature values in the feature map are weighted to increase the contrast between salient and non-salient features, comprising the following steps:
Step 2.1: carrying out space-time correlation modeling operation between the current frame and the forward frame on the space features output by the space feature extraction module by adopting a forward unit of the bidirectional ConvLSTM module;
The modeling process of the forward unit is shown in the formula:
H_f^t = ConvLSTM_f(F_aspp^t, H_f^(t-1))
where t denotes the current frame, H_f^t denotes the output of the forward unit, H_f^(t-1) denotes the features of the frame preceding the current frame, and F_aspp^t denotes the spatial features obtained in step 1.
Step 2.2: sending the output result of the forward unit into an attention module I, and weighting the characteristic points in the characteristic diagram obtained by the forward unit to obtain a characteristic diagram G1 containing non-salient targets and salient target contrast;
The weighting process is as follows:
G_1^t = CCA_I(H_f^t)
where G_1^t denotes the feature map containing the contrast between non-salient and salient objects, and CCA_I denotes the criss-cross attention computation of attention module I.
Step 2.3: after being weighted by the attention module I, the obtained feature map is input into a backward unit to carry out space-time correlation modeling operation between the current frame and the backward frame;
The modeling process of the backward unit is shown in the formula:
H_b^t = ConvLSTM_b(G_1^t, H_b^(t+1))
where t denotes the current frame and H_b^t denotes the output of the backward unit.
Step 2.4: sending the output result of the backward unit into an attention module II, and weighting the characteristic points in the characteristic diagram obtained by the backward unit to obtain a characteristic diagram G2 containing non-salient targets and salient target contrast;
The weighting process is as follows:
G_2^t = CCA_II(H_b^t)
where G_2^t denotes the feature map containing the contrast between non-salient and salient objects, and CCA_II denotes the criss-cross attention computation of attention module II.
Step 2.5: splicing the feature maps G1 and G2, feeding the result into a convolution layer with a 3×3 kernel for feature extraction, and obtaining video feature frames with space-time consistency through a tanh activation function;
The modeling process is shown in the formula:
F_CL^t = tanh(Conv_3×3([G_1^t, G_2^t]))
where [·, ·] denotes channel-wise concatenation and F_CL^t denotes the space-time consistency features obtained by the space-time consistency feature enhancement module.
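A sketch of steps 2.1–2.5 under stated assumptions: a standard single-layer ConvLSTM cell stands in for each direction of the bidirectional ConvLSTM, the two attention modules are injected as arguments (for example the criss-cross attention sketched after the attention-module description below), and the hidden width is illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.hidden = hidden
        self.gates = nn.Conv2d(channels + hidden, 4 * hidden, 3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class SpatioTemporalEnhancement(nn.Module):
    def __init__(self, channels, hidden, attention_1, attention_2):
        super().__init__()
        self.fwd = ConvLSTMCell(channels, hidden)   # forward unit
        self.bwd = ConvLSTMCell(hidden, hidden)     # backward unit
        self.cca1 = attention_1                     # attention module I
        self.cca2 = attention_2                     # attention module II
        self.fuse = nn.Conv2d(2 * hidden, channels, 3, padding=1)

    def forward(self, feats):                       # feats: (T, C, H, W) from the ASPP
        T, C, H, W = feats.shape
        hid = self.fwd.hidden
        zeros = lambda: (feats.new_zeros(1, hid, H, W), feats.new_zeros(1, hid, H, W))
        # Steps 2.1-2.2: forward unit models current/previous-frame correlation, then CCA weights it.
        h, c = zeros()
        g1 = []
        for t in range(T):
            h, c = self.fwd(feats[t:t + 1], (h, c))
            g1.append(self.cca1(h))
        # Steps 2.3-2.4: backward unit runs over the weighted maps in reverse temporal order.
        h, c = zeros()
        g2 = [None] * T
        for t in reversed(range(T)):
            h, c = self.bwd(g1[t], (h, c))
            g2[t] = self.cca2(h)
        # Step 2.5: concatenate G1 and G2, 3x3 convolution, tanh activation.
        return torch.tanh(self.fuse(torch.cat([torch.cat(g1), torch.cat(g2)], dim=1)))
```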
The attention module I and the attention module II in step 2 are criss-cross attention (CCA) modules constructed on a self-attention mechanism. The input features pass through three parallel convolution layers with 1×1 kernels to obtain three feature tensors Q, K and V; Q and K are then input to the first attention-distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_{i,u} = q_u · k_{i,u}    (1)
A = softmax(D)             (2)
where q_u denotes a one-dimensional tensor of Q; k_{i,u} denotes the feature points of K sharing the same abscissa or ordinate as q_u; d_{i,u} denotes the relationship between the feature points of each channel of Q and the feature points of K; and softmax denotes the activation function.
The obtained feature tensors A and V are input into the second attention-distribution calculation layer, the attention distribution between A and V is calculated according to formulas (1) and (2), and the attention is then added to the original feature map as a weight distribution, yielding a high-level feature map containing the contrast between salient and non-salient objects.
The modeling process is shown in the formula:
f_out = a_u · v_{i,u} + f_input
where a_u denotes a one-dimensional tensor of A; v_{i,u} denotes the points of V sharing the same abscissa or ordinate as a_u; f_out ∈ F_out; and F_out denotes the space-time consistency features containing the contrast between non-salient and salient object features.
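The sketch below implements the criss-cross attention computation of formulas (1) and (2) together with the residual addition f_out = a_u · v_{i,u} + f_input. It follows the common CCNet-style row/column formulation; the Q/K channel reduction to C/8 is an assumption, and for brevity it does not mask the self-position that the row and column branches share.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)   # three parallel 1x1 convolutions
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Row branch: affinities among the W positions sharing a row (eq. (1)).
        q_r = q.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        k_r = k.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        e_r = torch.bmm(q_r, k_r.transpose(1, 2))              # (B*H, W, W)
        # Column branch: affinities among the H positions sharing a column.
        q_c = q.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        k_c = k.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        e_c = torch.bmm(q_c, k_c.transpose(1, 2))              # (B*W, H, H)
        # Joint softmax over each pixel's row + column neighbours (eq. (2)).
        e_r = e_r.view(B, H, W, W)
        e_c = e_c.view(B, W, H, H).permute(0, 2, 1, 3)         # (B, H, W, H)
        attn = F.softmax(torch.cat([e_r, e_c], dim=-1), dim=-1)
        a_r, a_c = attn[..., :W], attn[..., W:]
        # Aggregate V with the attention weights and add back to the input (residual).
        v_r = v.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        out_r = torch.bmm(a_r.reshape(B * H, W, W), v_r).view(B, H, W, C)
        v_c = v.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        out_c = torch.bmm(a_c.permute(0, 2, 1, 3).reshape(B * W, H, H), v_c)
        out_c = out_c.view(B, W, H, C).permute(0, 2, 1, 3)
        return (out_r + out_c).permute(0, 3, 1, 2) + x         # f_out = A·V + f_input
```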
Step 3: the spatial features of different granularities obtained by the 2nd, 3rd and 4th convolution layers in the spatial feature extraction module are input into the low-level semantic information link module composed of Ghost modules to further extract low-level spatial features and remove their redundant background information, comprising the following steps:
Step 3.1: the spatial features of different granularities obtained by the 2nd, 3rd and 4th convolution layers of the residual module in the spatial feature extraction module are fed in parallel into the first layer of link modules I, II and III, where a convolution operation is performed first, followed by a normalization operation;
Step 3.2: the normalized spatial features are fed in parallel into the second layer of link modules I, II and III, where detail features of the salient objects are extracted by a depthwise separable convolution operation with a 3×3 kernel; the result of each convolution operation is sent to a normalization layer and finally passed through a ReLU activation function;
Step 3.3: the results obtained in step 3.2 are fed in parallel into the third layer of link modules I, II and III for a convolution operation, where a convolution layer with a 1×1 kernel performs channel adjustment on the feature map obtained by the second layer.
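A sketch of one Ghost-style link module following steps 3.1–3.3. The first-layer kernel size and the channel widths are assumptions; the depthwise 3×3 plus pointwise 1×1 pair realizes the depthwise separable convolution, with a normalization layer after each convolution and a final ReLU, as described.

```python
import torch.nn as nn

class LinkModule(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        # Layer 1: convolution followed by normalization (step 3.1).
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch))
        # Layer 2: 3x3 depthwise separable convolution, normalization after each conv, ReLU (step 3.2).
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch),
            nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True))
        # Layer 3: 1x1 convolution for channel adjustment (step 3.3).
        self.adjust = nn.Conv2d(mid_ch, out_ch, 1)

    def forward(self, x):
        return self.adjust(self.depthwise(self.reduce(x)))

# Link modules I-III in parallel over the 4th-, 3rd- and 2nd-layer features
# (input channel counts assume a ResNet-50 backbone: 1024, 512, 256).
links = nn.ModuleList([LinkModule(1024, 256, 256),
                       LinkModule(512, 256, 256),
                       LinkModule(256, 256, 256)])
```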
Step 4: the low-level semantic features output by the low-level semantic information link module in step 3 and the space-time consistency features output by the space-time consistency feature enhancement module in step 2 are input to the feature fusion and up-sampling module for fusion, and the feature map is expanded to the size of the input video frames. Specifically, the splicing, fusion and up-sampling of the features are realized in a recursive form: the low-level spatial features output by link module I are spliced with the space-time consistency features output by the splicing module, and the fusion is completed through one convolution operation and one up-sampling operation; the output features are then fused with the low-level spatial features output by link module II; the output features are then fused with the low-level spatial features output by link module III, yielding feature frames of the same size as the input pictures.
The modeling process is shown in the formulas:
F^(1) = up(Conv([F_CL, F_l4]))
F^(2) = up(Conv([F^(1), F_l3]))
F_t = up(Conv([F^(2), F_l2]))
where F_CL denotes the space-time consistency features obtained through step 2; F_l4, F_l3 and F_l2 denote the low-level semantic features obtained by passing the outputs of the 4th, 3rd and 2nd convolution layers of the spatial feature extraction module through the low-level semantic link modules; F_t denotes the finally obtained feature frame; Conv denotes the convolution operation; and up denotes the up-sampling operation.
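A sketch of the recursive fusion and up-sampling of step 4. Bilinear interpolation, up-sampling each stage to the next feature map's resolution, and the final ×4 up-sampling back to the input-frame size (consistent with a ResNet-style quarter-resolution second-layer output) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUpsample(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # F_CL with link module I output
        self.conv2 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # ... with link module II output
        self.conv3 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # ... with link module III output

    def forward(self, f_cl, f_l4, f_l3, f_l2):
        x = self.conv1(torch.cat([f_cl, f_l4], dim=1))     # F^(1) before up-sampling
        x = F.interpolate(x, size=f_l3.shape[-2:], mode="bilinear", align_corners=False)
        x = self.conv2(torch.cat([x, f_l3], dim=1))        # F^(2) before up-sampling
        x = F.interpolate(x, size=f_l2.shape[-2:], mode="bilinear", align_corners=False)
        x = self.conv3(torch.cat([x, f_l2], dim=1))
        # Final up-sampling so F_t matches the input-frame resolution.
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
```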
Step 5: the feature map obtained in step 4 is sent to the decoder for decoding to obtain the saliency target mask corresponding to each image in the video sequence. Specifically, the feature frames obtained by the feature fusion and up-sampling module undergo a dimension-lifting operation through a convolution layer with a 3×3 kernel, then pixel-level classification through a convolution layer with a 1×1 kernel, and finally the classification result is normalized by a sigmoid function to obtain the saliency target mask corresponding to each video frame.
The modeling process is shown in the formula:
S_t = δ_sigmoid(Conv(F_t))
where δ_sigmoid denotes the sigmoid activation function and S_t denotes the saliency target mask of the resulting video frame.
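A sketch of the decoder in step 5, implementing S_t = δ_sigmoid(Conv(F_t)) as a dimension-lifting 3×3 convolution, a 1×1 pixel-level classification convolution and a sigmoid; the lifted channel width is an assumption.

```python
import torch.nn as nn

class SaliencyDecoder(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 2 * in_ch, 3, padding=1),  # dimension-lifting 3x3 convolution
            nn.Conv2d(2 * in_ch, 1, 1),                 # pixel-level classification, 1x1 kernel
            nn.Sigmoid())                               # normalize to a saliency mask in [0, 1]

    def forward(self, f_t):                             # f_t: (T, C, H, W)
        return self.head(f_t)                           # (T, 1, H, W) masks S_1..S_T
```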
The invention is based on an encoder-decoder structure: spatial feature extraction and space-time consistency feature extraction are performed at the encoding end, and feature fusion and saliency target prediction are performed at the decoding end, improving the accuracy of salient object prediction while maintaining efficiency. In the space-time information enhancement module, a ConvLSTM structure embedded with a double-layer criss-cross attention computing mechanism (DCA_ConvLSTM for short) is designed based on the self-attention mechanism; this module introduces global information in both the spatial and temporal dimensions and weights the feature values corresponding to the saliency map to increase the contrast between salient and non-salient features, so that interference of background information with foreground information is avoided to a certain extent, the prediction accuracy of the method is improved, and space-time consistency features are obtained at the same time.
Furthermore, the invention provides a low-level semantic information link module composed of Ghost modules. As the depth of the neural network changes, the semantics expressed by the features extracted at different layers are not identical: a deep convolution layer can cover a larger receptive field with a small convolution kernel and therefore extracts higher-level semantic features, while a shallow convolution layer has a smaller receptive field and extracts features reflecting local detail information of the image, such as contour information. Video salient object detection is a pixel-level prediction task that cannot be predicted accurately at some object edges if detail information is missing. Therefore, the invention provides the low-level semantic information link module composed of Ghost modules, so that the loss of low-level semantic features is reduced as much as possible and prediction at object edges is more accurate.
To demonstrate the effectiveness of the video salient object detection method of the present invention, the method presented herein and 7 other advanced salient object detection methods (RFCN, DSS, PiCA, SSA, FCNS, FGRN and PDB) were tested on the DAVIS, VOS and FBMS data sets; the visualization results are shown in FIG. 5. In FIG. 5, the first row shows the original video frames, the last row (GT) shows the ground-truth labels, the second-to-last row shows the detection results of the video salient object detection method proposed by the present invention, and the remaining rows show the prediction results of earlier salient object detection methods in the field. As the comparison in FIG. 5 shows, the proposed method localizes the contours of salient objects and predicts their details more accurately. In addition, for scenes with multiple salient objects and complex backgrounds, the proposed method still achieves a good detection effect.

Claims (6)

1. A video saliency target detection system based on a space-time convolutional neural network, comprising: the device comprises a space feature extraction module, a space-time consistency feature enhancement module, a feature fusion and up-sampling module, a low-level semantic information link module and a decoder;
The spatial feature extraction module is used for extracting spatial features of the video frames;
The space-time consistency feature enhancement module is used for extracting the space-time consistency features of the video frames and carrying out weighting operation on feature values in the feature images;
The low-level semantic information link module is used for extracting low-level spatial features and removing background redundant information of the low-level spatial features;
The feature fusion and up-sampling module is used for fusing low-level spatial features and space-time consistency features and expanding a feature map to be the same as an input video in size;
the decoder is used for decoding the feature images to obtain a saliency target mask corresponding to each image in the video sequence;
The spatial feature extraction module comprises: a residual error module and a cavity convolution pyramid pooling module;
The residual error module is used for carrying out modeling operation on the space characteristics;
the cavity convolution pyramid pooling module is used for extracting multi-scale spatial features to obtain a spatial feature map;
the space-time consistency feature enhancement module comprises: bidirectional ConvLSTM module, attention module I, attention module II, and splicing module;
The bidirectional ConvLSTM module is used for carrying out modeling operation according to the space-time correlation between the current frame and the forward frame and between the current frame and the backward frame;
the attention module I is used for weighting characteristic points in the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module;
the attention module II is used for weighting characteristic points in the characteristic diagram obtained by the backward unit of the bidirectional ConvLSTM module;
The splicing module is used for splicing the characteristic diagram obtained by the forward unit of the bidirectional ConvLSTM module and the characteristic diagram obtained by the backward unit, and obtaining a video characteristic frame with space-time consistency through a tanh activation function;
The attention module I and the attention module II are criss-cross attention (CCA) modules constructed based on a self-attention mechanism; the input features pass through three parallel convolution layers with a convolution kernel size of 1×1 to obtain three feature tensors Q, K and V; Q and K are then input to the first attention-distribution calculation layer to obtain the attention distribution A between Q and K, calculated as follows:
d_{i,u} = q_u · k_{i,u}    (1)
A = softmax(D)             (2)
where q_u denotes a one-dimensional tensor of Q; k_{i,u} denotes the feature points of K sharing the same abscissa or ordinate as q_u; d_{i,u} denotes the relationship between the feature points of each channel of Q and the feature points of K; and softmax denotes the activation function;
inputting the obtained feature tensors A and V into a second attention distribution calculation layer, calculating the attention distribution between A and V according to formulas (1) and (2), and then adding the attention as weight distribution into the original feature map to obtain a high-level feature map containing the contrast of the salient object and the non-salient object;
the low-level semantic information link module includes: a link module I, a link module II and a link module III;
The link module I is used for extracting low-level space features output by a fourth convolution layer in the residual error module;
The link module II is used for extracting low-level space features output by a third convolution layer in the residual error module;
the link module III is used for extracting low-level space features output by a second convolution layer in the residual error module;
The spatial features with different granularities obtained by the 2 nd, 3 rd and 4 th convolution layers of the residual error module in the spatial feature extraction module are parallelly fed into the first layer of the link module I, II and III, the convolution operation is carried out first, and the normalization operation is used for carrying out data normalization after the convolution operation;
The normalized spatial features are parallelly sent to a second layer of the link modules I, II and III, detail features of the salient objects are extracted by adopting depth separable convolution operation, the size of a convolution kernel is 3*3, the result of each convolution operation is sent to the normalization layer, and finally a ReLU activation function is passed;
And the results are sent to a third layer of the link modules I, II and III in parallel to carry out convolution operation, and a convolution layer with the convolution kernel size of 1*1 is adopted to carry out channel adjustment on the feature map obtained by the second layer.
2. A method for detecting a video saliency target based on a space-time convolutional neural network, realized based on the video saliency target detection system based on a space-time convolutional neural network according to claim 1, characterized in that the method comprises the following steps:
step 1: collecting a video containing a T frame image, and extracting spatial characteristics of a video frame;
step 2: extracting space-time consistency characteristics of video frames, and carrying out weighting operation on characteristic values in a characteristic diagram;
step 3: extracting low-level spatial features according to depth separable convolution operation;
Step 4: performing feature fusion and up-sampling operation on the low-level spatial features and the space-time consistency features to obtain a high-level feature map containing the T-frame video;
step 5: and decoding the advanced feature map to obtain a saliency target mask corresponding to each image in the video sequence.
3. The method for detecting video saliency target based on space-time convolutional neural network according to claim 2, wherein the step 1 is specifically expressed as: modeling the spatial features by adopting a pre-trained residual error module, removing the downsampling operation of a fifth layer by using the first 5 groups of layers of residual error networks Resnet-50 by the residual error module, and then inputting the features output by the residual error module into a cavity convolution pyramid pooling module to extract the multi-scale spatial features to obtain a spatial feature map.
4. The method for detecting video saliency target based on space-time convolutional neural network according to claim 2, wherein the step 2 comprises:
Step 2.1: carrying out space-time correlation modeling operation between the current frame and the forward frame by adopting a forward unit of the bidirectional convLSTM module to the space features output by the space feature extraction module, so as to obtain an output result of the forward unit;
Step 2.2: sending the output result of the forward unit into an attention module I, and weighting the characteristic points in the characteristic diagram obtained by the forward unit to obtain a characteristic diagram G1 containing non-salient targets and salient target contrast;
step 2.3: after being weighted by the attention module I, the obtained feature map is input into a backward unit to carry out space-time correlation modeling operation between the current frame and the backward frame;
step 2.4: sending the output result of the backward unit into an attention module II, and weighting the characteristic points in the characteristic diagram obtained by the backward unit to obtain a characteristic diagram G2 containing non-salient targets and salient target contrast;
step 2.5: and splicing the characteristic graphs G1 and G2, inputting a layer of convolution layer with a convolution kernel of 3*3 for characteristic extraction, and obtaining a video characteristic frame with space-time consistency through a tanh activation function.
5. The method for detecting video saliency target based on space-time convolutional neural network according to claim 2, wherein the step 3 comprises:
step 3.1: the spatial features with different granularities obtained by the 2 nd, 3 rd and 4 th convolution layers of the residual modules in the spatial feature extraction module are parallelly fed into the first layer of the link modules I, II and III, convolution operation is carried out first, and normalization operation is carried out after the convolution operation;
Step 3.2: the normalized spatial features are parallelly sent to a second layer of the link modules I, II and III, detail features of the salient objects are extracted by adopting depth separable convolution operation, the size of a convolution kernel is 3*3, the result of each convolution operation is sent to the normalization layer, and finally a ReLU activation function is passed;
step 3.3: and 3.2, sending the result obtained in the step 3.2 into a third layer of the link modules I, II and III in parallel for convolution operation, and adopting a convolution layer with a convolution kernel size 1*1 to carry out channel adjustment on the feature map obtained in the second layer.
6. The method for detecting video saliency target based on space-time convolutional neural network according to claim 2, wherein the step 4 is specifically expressed as: the recursive form is adopted to realize the splicing fusion and upsampling operation of the features in the feature fusion and upsampling module, and the realization process is as follows: splicing the low-level space features output by the link module I and the space-time consistency features output by the splicing module, and completing fusion through one-layer convolution operation and up-sampling operation; fusing the output characteristics with the low-level space characteristics output by the link module II; fusing the output characteristics with the low-level space characteristics output by the link module III to obtain a characteristic frame with the same size as the input picture;
The step 5 is specifically expressed as: and carrying out dimension lifting operation on the feature frame obtained by the feature fusion and up-sampling module through a convolution layer with a convolution kernel of 3*3, then carrying out pixel-level classification through a convolution layer with a convolution kernel of 1*1, and finally carrying out normalization operation on the classification result through a sigmoid function to obtain a saliency target mask corresponding to the video frame.
CN202210501874.XA 2022-05-10 2022-05-10 Video saliency target detection system and method based on space-time convolutional neural network Active CN114926760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210501874.XA CN114926760B (en) 2022-05-10 2022-05-10 Video saliency target detection system and method based on space-time convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210501874.XA CN114926760B (en) 2022-05-10 2022-05-10 Video saliency target detection system and method based on space-time convolutional neural network

Publications (2)

Publication Number Publication Date
CN114926760A CN114926760A (en) 2022-08-19
CN114926760B true CN114926760B (en) 2024-07-02

Family

ID=82809179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210501874.XA Active CN114926760B (en) 2022-05-10 2022-05-10 Video saliency target detection system and method based on space-time convolutional neural network

Country Status (1)

Country Link
CN (1) CN114926760B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512135A (en) * 2022-09-20 2022-12-23 海信电子科技(武汉)有限公司 A Salient Object Detection Method in Image
CN116630841B (en) * 2023-04-06 2025-11-07 喀什地区电子信息产业技术研究院 Attention-based target detection method and system
CN117197437B (en) * 2023-09-13 2026-01-06 上海应用技术大学 A fully supervised salient target detection method
CN118711105B (en) * 2024-06-28 2025-10-28 西安电子科技大学 Lightweight video saliency prediction method based on spatiotemporal octave convolution module
CN119625597B (en) * 2024-11-13 2025-10-17 河北师范大学 Video saliency target detection method with cooperative enhancement of attention and image
CN119251256B (en) * 2024-12-05 2025-02-07 成都与睿创新科技有限公司 A method, system, device and medium for locating and tracking bleeding points in minimally invasive surgery

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107749066A (en) * 2017-11-10 2018-03-02 深圳市唯特视科技有限公司 A kind of multiple dimensioned space-time vision significance detection method based on region
CN111242003A (en) * 2020-01-10 2020-06-05 南开大学 Video salient object detection method based on multi-scale constrained self-attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN112149459B (en) * 2019-06-27 2023-07-25 哈尔滨工业大学(深圳) Video saliency object detection model and system based on cross attention mechanism
CN111523410B (en) * 2020-04-09 2022-08-26 哈尔滨工业大学 Video saliency target detection method based on attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107749066A (en) * 2017-11-10 2018-03-02 深圳市唯特视科技有限公司 A kind of multiple dimensioned space-time vision significance detection method based on region
CN111242003A (en) * 2020-01-10 2020-06-05 南开大学 Video salient object detection method based on multi-scale constrained self-attention mechanism

Also Published As

Publication number Publication date
CN114926760A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
CN114926760B (en) Video saliency target detection system and method based on space-time convolutional neural network
CN113688723A (en) A pedestrian target detection method based on improved YOLOv5 in infrared images
CN112801027A (en) Vehicle target detection method based on event camera
CN114445705B (en) An efficient object detection method for aerial images based on dense area perception
CN111753732A (en) A vehicle multi-target tracking method based on target center point
CN118262093A (en) A hierarchical cross-modal attention and cascaded aggregate decoding approach for RGB-D salient object detection
CN115577768A (en) Semi-supervised model training method and device
CN114973202B (en) A traffic scene obstacle detection method based on semantic segmentation
CN115100405A (en) Pose estimation-oriented occlusion scene target detection method
Zheng et al. DCU-NET: Self-supervised monocular depth estimation based on densely connected U-shaped convolutional neural networks
CN117612072A (en) A video understanding method based on dynamic spatiotemporal graph
CN117727093A (en) Video human body posture estimation method and system based on example cross-frame consistency
CN120510538B (en) A small target detection method for aerial photography with complex background and dynamic interference
CN117237842A (en) Pseudo tag generated video significance detection method based on time sequence features
CN112598043B (en) A Cooperative Saliency Detection Method Based on Weakly Supervised Learning
Gao et al. Dual attention guided multi-scale fusion network for RGB-D salient object detection
Hu et al. Monocular depth estimation with boundary attention mechanism and Shifted Window Adaptive Bins
CN116563749A (en) Video action detection method based on spatio-temporal information and video context information mining
CN111476353B (en) A saliency-introducing GAN image super-resolution method
CN115331171A (en) Crowd counting method and system based on depth information and significance information
Zhang et al. A double feature fusion network with progressive learning for sharper inpainting
Xu et al. Leveraging Neural Radiance Field and Semantic Communication for Robust 3D Reconstruction
CN113269139A (en) Self-learning large-scale police officer image classification model aiming at complex scene
Yusiong et al. A semi-supervised approach to monocular depth estimation, depth refinement, and semantic segmentation of driving scenes using a siamese triple decoder architecture
CN120318503B (en) Multi-mode target detection method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant