CN109815911A

CN109815911A - Video moving object detection system, method and terminal based on deep fusion network

Info

Publication number: CN109815911A
Application number: CN201910078362.5A
Authority: CN
Inventors: 陈立; 蔡春磊; 张小云; 高志勇
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2019-01-26
Filing date: 2019-01-26
Publication date: 2019-05-28
Anticipated expiration: 2039-01-26
Also published as: CN109815911B

Abstract

The present invention provides a video moving object detection system based on a deep fusion network, comprising: a video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, that is, the video scene feature representation, and sends it to the deep fusion module; a basic result detection module, which receives the video sequence as input, detects moving objects using basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module; and a deep fusion module, which receives the video scene feature representation and the basic detection results, optimally fuses them using a deep neural network, and outputs the final detection result. A video moving object detection method and a terminal are also provided. The present invention can obtain high-accuracy detection results.

Description

Video moving object detection system, method and terminal based on deep fusion network

Technical field

The present invention relates to the technical field of video moving object detection, and in particular to a video moving object detection system, method and terminal based on a deep fusion network.

Background

Video moving object detection can serve as the first stage of video image processing and video content analysis. It provides preliminary analysis results for subsequent operations and helps improve the performance of the whole video processing and analysis system, so it is a vital technology.

Researchers have proposed a large number of methods for the video moving object detection problem. However, most of this work targets one specific scene or one class of scenes and is built on feature engineering with hand-designed operators. These traditional methods fall into types such as statistical-model-based, clustering-based and sparse-representation-based methods. At present no traditional method handles all kinds of scenes robustly; most are effective only for certain scenes and perform poorly on others.

A small number of video moving object detection methods based on deep learning have emerged recently. Their biggest difference from traditional methods is that parameters need not be tuned manually; instead, the detection model is learned automatically from data. For example, Wang et al. designed a semi-automatic video moving object detection algorithm using a deep convolutional network. This method first requires manually annotating the detection results of some key frames and then trains a deep convolutional neural network on the annotations; after training, it automatically analyzes the remaining video frames to obtain their moving object segmentation results. The method can produce highly accurate detection results, but it requires manual intervention and cannot run fully automatically.

The greatest difficulty in obtaining a detection model with deep learning is the scarcity of training data: without enough annotated data, a neural network cannot be trained effectively. No description or report of technology similar to the present invention has been found so far, and no similar material has been collected at home or abroad.

Summary of the invention

In view of the above shortcomings of the prior art, the present invention provides a video moving object detection system, method and terminal based on a deep fusion network, which combines traditional methods with deep learning technology and can obtain very robust detection results for a variety of scenes.

The present invention is achieved by the following technical solutions.

According to one aspect of the invention, a video moving object detection system based on a deep fusion network is provided, including the following modules:

A video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, i.e. the video scene feature representation, and sends it to the deep fusion module;

A basic result detection module, which receives the video sequence as input, detects moving objects using basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module;

A deep fusion module, which receives the video scene feature representation and the basic detection results, performs optimal fusion using a deep neural network, and outputs the final detection result.

Preferably, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor to extract the features of each video frame, and then stacks the features of the frames to form a group of descriptors that describe the video scene, i.e. the video scene feature representation.

Preferably, there are multiple basic detectors, and each basic detector uses one traditional motion detection method to detect moving objects, yielding multiple corresponding basic detection results.

Preferably, there are four basic detectors, and the basic detectors respectively use the following traditional motion detection methods:

A pixel-based adaptive semantic association segmentation method;

A foreground-background segmentation method based on edge detection;

A background segmentation method based on a shared model;

A background segmentation method based on weighted sample points.

Preferably, the deep fusion module receives the video scene feature representation as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then applies a pixel-by-pixel linear weighting to the basic detection results according to the optimal fusion weight map.

According to another aspect of the present invention, a video moving object detection method based on a deep fusion network is provided, including the following steps:

S1: sequentially read the current frame of the video and multiple frames preceding it as the video sequence input;

S2: analyze each frame in the input video sequence with a feature extractor to obtain multiple groups of video frame features, and stack these groups of features along the channel direction to form a descriptor of the video scene features, i.e. the video scene descriptor; perform moving object analysis on the input video sequence with traditional motion detection methods to obtain the basic detection results;

S3: input the video scene descriptor and the basic detection results obtained in S2 into the deep fusion network; the deep fusion network analyzes the video scene descriptor to obtain an optimal fusion weight map, and uses the optimal fusion weight map to perform linear weighted fusion of the basic detection results.

Preferably, the deep fusion network is based on a deep convolutional network: the input video scene descriptor passes through four convolutional layers and one Soft-Max layer to produce the optimal fusion weight map, according to which the basic detection results are linearly weighted pixel by pixel.

Preferably, the video moving object detection method based on a deep fusion network further includes off-line training of the feature extractor and the deep fusion network, with the following steps:

Video clips are randomly sampled from the training videos as the source of the predicted motion masks and, together with the annotated masks of the real moving objects, i.e. the real motion masks, form training pairs; multiple training pairs form a training set; the training videos in the training pairs are randomly cropped to obtain training samples, and the training samples are then randomly flipped left-right and up-down to augment the training set;

One training pair is used as input at a time, and the parameters of the feature extractor and the deep fusion network are jointly optimized with the stochastic gradient descent algorithm; learning proceeds for multiple rounds over all training pairs in the training set until the loss converges.

Preferably, the loss function used in the stochastic gradient descent algorithm is the mean squared error between the predicted motion mask and the real motion mask.

Preferably, the parameter update rate of the deep fusion network is set to 100 to 10000 times the parameter update rate of the feature extractor.

According to a third aspect of the invention, a detection terminal is provided, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can perform the above video moving object detection method based on a deep fusion network.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention makes full use of a variety of existing traditional video moving object detection systems, improving effectiveness across different scenes;

2. The present invention makes full use of deep learning technology, improving the ability to describe high-level semantic features of video images;

3. The present invention learns the parameters of the system automatically from data, so it does not require the parameter tuning that feature-engineering-based approaches need;

4. The present invention combines traditional methods with deep learning methods to obtain a robust, high-performance video moving object detection system and method with higher detection accuracy across various scenes;

5. The present invention combines the efficiency of traditional methods on specific scenes with the strong ability of deep learning to extract and express features of video image content: a deep fusion network optimally fuses multiple traditional detection results according to the video scene features, so that robust detection results can be obtained for various scenes.

Description of the drawings

Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 is a structural block diagram of the video moving object detection system based on a deep fusion network provided by an embodiment of the invention;

Fig. 2 is a flowchart of the video moving object detection method based on a deep fusion network provided by an embodiment of the invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept. These all fall within the protection scope of the present invention.

An embodiment of the present invention provides a video moving object detection system based on a deep fusion network, including the following modules:

Module one: the video feature extraction module receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, i.e. the video scene feature representation, and sends it to the deep fusion module, which uses it to optimally fuse the basic detection results;

Module two: the basic result detection module receives the video sequence as input, detects moving objects using basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module;

Module three: the deep fusion module receives the video scene feature representation and the basic detection results, performs optimal fusion using a deep neural network, and outputs the final detection result.

In some preferred embodiments, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor to extract the features of each video frame, and then stacks the features of the frames to form a group of descriptors that can describe the video scene.

Further, the basic result detection module has multiple basic detectors, each of which uses one traditional motion detection method to detect moving objects, yielding multiple corresponding basic detection results. In an embodiment, there may be four basic detectors, or another number. For example, when there are four basic detectors, the motion detection methods may be, but are not limited to, the following: PWACS, EFIC, SharedModel and WeSamBE. PWACS is a pixel-based adaptive semantic association segmentation method; EFIC is a foreground-background segmentation method based on edge detection; SharedModel is a background segmentation method based on a shared model; WeSamBE is a background segmentation method based on weighted sample points. All of these are traditional, non-deep-learning methods for moving object segmentation. Of course, different embodiments may use different motion detection methods; by fusing the results of several traditional motion detection methods, embodiments of the invention can obtain a more robust detection result.

In some preferred embodiments, the deep fusion module receives the video scene feature representation (video feature descriptor) as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then linearly weights the basic detection results pixel by pixel according to this optimal fusion weight map.

An embodiment of the present invention also provides a video moving object detection method based on a deep fusion network, with steps including:

Step 1: sequentially read the current frame of the video and multiple frames preceding it (for example 16 frames; the number depends on the input format of the specific implementation and may change for a different implementation) as the video sequence input;

Step 2: analyze each frame in the input video sequence with the feature extractor to obtain multiple groups of video frame features (16 groups for 16 frames), and stack these groups along the channel direction to form a descriptor of the video scene features; at the same time, perform moving object analysis on the input video sequence with traditional motion detection methods to obtain the basic detection results;

Step 3: input the video scene descriptor and the basic detection results obtained in step 2 into the deep fusion network; the deep fusion network further analyzes the video scene descriptor to obtain an optimal fusion weight map, and finally uses this optimal fusion weight map to perform linear weighted fusion of the basic detection results.

In step 1, the video sequence input serves as the system input and is a video clip consisting of the current frame and multiple frames preceding it.

In step 2, the output of the feature extractor is a set of deep-learning-based feature maps.

In step 3, the optimal fusion is based on a deep convolutional network. The final step of the fusion is a linear weighting operation based on the optimal fusion weight map.

Further, the method may also include an off-line training stage for the feature extractor and the deep fusion network, as follows:

Step 1: video clips are randomly sampled from the training videos as the source of the predicted motion masks and, together with the annotated masks of the real moving objects, i.e. the real motion masks, form training pairs; multiple training pairs form a training set; the training videos in the training pairs are randomly cropped to obtain training samples, and the samples are then randomly flipped left-right and up-down to augment the training set;

Step 2: one training pair is used as input at a time, and the parameters of the feature extractor and the deep fusion network are jointly optimized with the stochastic gradient descent algorithm; learning proceeds for multiple rounds over all training pairs in the training set until the loss converges.

In step 1, the size of a training sample may be 128x128 or another size, depending on the available computing resources; if resources allow, a larger size such as 256x256 or 512x512 may be used.
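The cropping and flipping described for step 1 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the function name `random_crop_and_flip`, the `(T, H, W, C)` clip layout and the 0.5 flip probability are introduced here for illustration; only the 128x128 crop size and the left-right and up-down flips come from the text.

```python
import numpy as np


def random_crop_and_flip(clip, mask, size=128, rng=None):
    """Apply one random crop and random flips to a clip and its mask.

    clip: array of shape (T, H, W, C), the frames of one video clip.
    mask: array of shape (H, W), the annotated motion mask.
    The same crop window and flips are applied to both, so they stay aligned.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = mask.shape
    if height < size or width < size:
        raise ValueError("frame is smaller than the requested crop size")

    # Pick the top-left corner of the crop window uniformly at random.
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    clip = clip[:, top:top + size, left:left + size]
    mask = mask[top:top + size, left:left + size]

    # Left-right flip with probability 0.5 (assumed value).
    if rng.random() < 0.5:
        clip, mask = clip[:, :, ::-1], mask[:, ::-1]
    # Up-down flip with probability 0.5 (assumed value).
    if rng.random() < 0.5:
        clip, mask = clip[:, ::-1], mask[::-1]
    return clip, mask


# Hypothetical input: 16 RGB frames of 240x320 pixels and one binary mask.
rng = np.random.default_rng(0)
clip = rng.random((16, 240, 320, 3)).astype(np.float32)
mask = (rng.random((240, 320)) > 0.5).astype(np.float32)
cropped_clip, cropped_mask = random_crop_and_flip(clip, mask, size=128, rng=rng)
print(cropped_clip.shape, cropped_mask.shape)
```

Applying identical geometric transforms to the frames and the mask is what keeps each pixel of the annotation aligned with the pixel it labels.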

In step 2, the loss function used in the stochastic gradient descent algorithm may be the mean squared error between the predicted motion mask and the real motion mask. Further, the update rate of the deep fusion network parameters is set to 100 to 10000 times the update rate of the feature extractor parameters. The joint optimization in step 2 applies gradient descent to the error of the basic detection results and optimizes iteratively. Once the optimal model parameters obtained by training are saved, they can be used directly in the video moving object detection method.

On this basis, the technical solution of the present invention is described in further detail below with reference to the drawings and a specific example.

As shown in Fig. 1, the video moving object detection system based on a deep fusion network in an embodiment of the invention includes three types of modules: a video feature extraction module (video feature extraction network), a basic result detection module, and a deep fusion module (deep fusion network).

In the present embodiment, the system contains one video feature extraction module and one deep fusion module; the types and number of basic detection systems in the basic result detection module can be chosen flexibly according to the characteristics of the specific scene and the performance of the processing platform.

In the present embodiment, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor, analyzes all video frames in a video clip one by one, and stacks the resulting feature maps as the descriptor of the video features.
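Stacking the per-frame feature maps along the channel direction can be illustrated with a few lines of NumPy. The sketch assumes each frame has already been passed through the feature extractor and yields a `(C, H, W)` array; the random arrays, the 512-channel count and the 8x8 spatial size are placeholders chosen for illustration, and `stack_frame_features` is a hypothetical helper name.

```python
import numpy as np


def stack_frame_features(frame_features):
    """Concatenate per-frame feature maps of shape (C, H, W) along the channel axis."""
    shapes = {features.shape for features in frame_features}
    if len(shapes) != 1:
        raise ValueError("all frames must yield feature maps of the same shape")
    # Frame 0 occupies channels [0, C), frame 1 occupies [C, 2C), and so on.
    return np.concatenate(frame_features, axis=0)


# Placeholder features for a 16-frame clip (stand-ins for real extractor output).
rng = np.random.default_rng(0)
features = [rng.standard_normal((512, 8, 8)).astype(np.float32) for _ in range(16)]
descriptor = stack_frame_features(features)
print(descriptor.shape)  # (8192, 8, 8)
```

Because the frames are concatenated rather than averaged, the descriptor keeps the temporal order of the clip in its channel layout.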

In the present embodiment, the basic result detection module uses four basic detection systems: PWACS, EFIC, SharedModel and WeSamBE, which show complementary performance in dynamic-background, night, camera-shake and infrared scenes.

In the present embodiment, the deep fusion module consists mainly of four convolutional layers cascaded with one Soft-Max layer. The module receives the video descriptor as input, analyzes it further to obtain the optimal fusion weight map, and then linearly weights the basic detection results pixel by pixel according to this weight map.

As shown in Fig. 2, in an embodiment, the method for detecting video moving objects with the video moving object detection system based on a deep fusion network has the following steps:

Step 1: sequentially read 16 frames, consisting of the current frame and the frames preceding it, as the system input (video sequence input);

Step 2: analyze each frame with the VGG-16 network, and stack the last-layer feature maps obtained by passing each frame through the network along the channel direction to form a descriptor of the video features;

At the same time, the basic result detection module analyzes the system input and produces 4 basic detection results, denoted B(n), n = 1, 2, 3, 4, one for each of the four basic detection methods;

Step 3: input the video descriptor and the basic detection results from step 2 into the deep fusion network. The deep fusion network further analyzes the video descriptor to obtain the optimal fusion weight map M, and finally uses M to perform the linear weighted fusion of the basic detection results given by formula (1):

$$P = \sum_{n=1}^{4} M(n) \odot B(n) \qquad (1)$$

In formula (1), B(n) denotes the n-th basic detection result and M(n) denotes the weighting coefficients for the n-th basic detection result; both are two-dimensional images of the same size as the input video frame. ⊙ denotes element-wise multiplication. Formula (1) therefore takes a pixel-by-pixel weighted average of the four basic detection results as the final prediction result P.
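The pixel-by-pixel weighted fusion can be illustrated with a short NumPy sketch. It assumes the convolutional layers have already produced one score map per basic detector (`weight_scores`, a hypothetical name) and that the Soft-Max layer normalizes across the detector axis so that the weights at each pixel sum to 1; the text does not state the normalization axis, so this is an assumption.

```python
import numpy as np


def softmax(scores, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=axis, keepdims=True)


def fuse(basic_results, weight_scores):
    """Fuse N basic detection maps into one prediction.

    basic_results: array (N, H, W), the maps B(n).
    weight_scores: array (N, H, W), pre-softmax scores (assumed conv output).
    Returns the prediction P of shape (H, W) and the weight maps M of shape (N, H, W).
    """
    weights = softmax(weight_scores, axis=0)           # M(n), sums to 1 per pixel
    prediction = (weights * basic_results).sum(axis=0)  # sum over n of M(n) * B(n)
    return prediction, weights


# Toy example: four 2x2 detection maps with equal scores for every detector.
B = np.stack([np.ones((2, 2)), np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))])
scores = np.zeros((4, 2, 2))
P, M = fuse(B, scores)
print(P)  # every pixel is 0.5 because each detector gets weight 0.25
```

With learned scores, a detector that is reliable in the current scene receives a weight close to 1 at the pixels where it should be trusted, which is how the scene descriptor steers the fusion.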

In the present embodiment, the parameters of the feature extractor and the deep fusion network are trained off-line as follows:

Step 1: video clips randomly sampled from the training videos, together with the annotated masks of the real moving objects, form training pairs. The training videos are randomly cropped to obtain 128x128 training samples, and the samples are then randomly flipped left-right and up-down to augment the training set;

Step 2: the parameters of the whole system are jointly optimized with the stochastic gradient descent algorithm until the loss converges;

The optimization method in step 2 is the Adam optimization method. The loss function is set as formula (2):

In formula (2), H and W denote the height and width of the image, and G denotes the real motion annotation mask;
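Assuming formula (2) is the per-pixel mean squared error that the description names as the loss, i.e. L = (1 / (H·W)) · Σ over all pixels (i, j) of (P(i, j) − G(i, j))², it can be computed as in the following sketch. The function name `mse_loss` and the toy masks are illustrative assumptions.

```python
import numpy as np


def mse_loss(predicted_mask, true_mask):
    """Mean squared error between a predicted mask P and a real mask G, both (H, W)."""
    predicted_mask = np.asarray(predicted_mask, dtype=np.float64)
    true_mask = np.asarray(true_mask, dtype=np.float64)
    if predicted_mask.shape != true_mask.shape:
        raise ValueError("P and G must have the same shape")
    height, width = true_mask.shape
    # Sum of squared per-pixel errors, normalized by the number of pixels H * W.
    return float(((predicted_mask - true_mask) ** 2).sum() / (height * width))


# Toy 2x2 example: squared errors are 0.01, 0.04, 0.36 and 0.0.
P = np.array([[0.9, 0.2], [0.4, 0.0]])
G = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = mse_loss(P, G)
print(loss)  # approximately 0.1025
```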

In step 2, the learning rate of the parameters in the video feature extraction module is set to 10^-7, and the learning rate of the deep fusion network is set to 10^-4. After training converges, the parameters are saved and are loaded directly for use in actual operation.
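These two learning rates differ by a factor of 1000, which lies within the 100 to 10000 range given for the ratio of update rates. The sketch below shows one way to apply separate rates to the two parameter groups. For brevity it uses a plain gradient step instead of Adam, and the dictionary keys and the function `gradient_step` are hypothetical names introduced for illustration.

```python
import numpy as np

# Learning rates from the embodiment: 10^-7 for the feature extractor,
# 10^-4 for the deep fusion network.
LEARNING_RATES = {"feature_extractor": 1e-7, "fusion_network": 1e-4}


def gradient_step(params, grads, learning_rates=LEARNING_RATES):
    """Apply one plain gradient-descent update with a separate rate per group."""
    return {
        name: params[name] - learning_rates[name] * grads[name]
        for name in params
    }


# Toy parameters and gradients with identical values in both groups.
params = {"feature_extractor": np.ones(3), "fusion_network": np.ones(3)}
grads = {"feature_extractor": np.full(3, 2.0), "fusion_network": np.full(3, 2.0)}
updated = gradient_step(params, grads)

ratio = LEARNING_RATES["fusion_network"] / LEARNING_RATES["feature_extractor"]
print(ratio)  # close to 1000
```

Keeping the extractor's rate far smaller means the pre-trained features change only slightly while the newly initialized fusion layers do most of the learning.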

Based on the above method, an embodiment of the invention also provides a detection terminal, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can perform the above video moving object detection method based on a deep fusion network.

In the video moving object detection system, method and detection terminal based on a deep fusion network provided by the above embodiments of the invention, after a video sequence is input into the system, the video feature extraction operation and the basic result detection operation are carried out simultaneously, and the deep fusion module then optimally fuses the multiple basic detection results according to the video features. The above embodiments build the feature extraction module and the deep fusion module from deep convolutional networks and train them on a large amount of data to obtain optimal model parameters, so that moving object detection can be carried out automatically in practical applications; experimental results show that the system can obtain highly accurate detection results.

The specific parameters in the above embodiments only illustrate how the technical solution of the invention may be implemented; other embodiments of the invention may use other specific parameters, which has no essential effect on realizing the invention.

It should be noted that the steps of the method provided by the invention can be implemented with the corresponding modules, devices and units of the system, and those skilled in the art may refer to the technical solution of the system to implement the step flow of the method; that is, the embodiments of the system can be understood as preferred examples of implementing the method, which will not be repeated here.

Those skilled in the art know that, in addition to implementing the system provided by the invention and its devices purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system provided by the invention and its devices can therefore be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component; the devices for realizing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.

Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the above specific embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A video moving object detection system based on a deep fusion network, characterized by comprising:

a video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, i.e. the video scene feature representation, and sends it to the deep fusion module;

a basic result detection module, which receives the video sequence as input, detects moving objects using basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module;

a deep fusion module, which receives the video scene feature representation and the basic detection results, performs optimal fusion using a deep neural network, and outputs the final detection result.

2. The video moving object detection system based on a deep fusion network according to claim 1, characterized in that the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor to extract the features of each video frame, and then stacks the features of the frames to form a group of descriptors that describe the video scene, i.e. the video scene feature representation.

3. The video moving object detection system based on a deep fusion network according to claim 1, characterized in that there are multiple basic detectors, and each basic detector uses one traditional motion detection method to detect moving objects, yielding multiple corresponding basic detection results.

4. The video moving object detection system based on a deep fusion network according to claim 3, characterized in that there are four basic detectors, and the basic detectors respectively use the following traditional motion detection methods:

a pixel-based adaptive semantic association segmentation method;

a foreground-background segmentation method based on edge detection;

a background segmentation method based on a shared model;

a background segmentation method based on weighted sample points.

5. The video moving object detection system based on a deep fusion network according to claim 1, characterized in that the deep fusion module receives the video scene feature representation as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then linearly weights the basic detection results pixel by pixel according to the optimal fusion weight map.

6. A video moving object detection method based on a deep fusion network, characterized by comprising:

S2: analyzing each frame in the input video sequence with a feature extractor to obtain multiple groups of video frame features, and stacking these groups of features along the channel direction to form a descriptor of the video scene features, i.e. the video scene descriptor; performing moving object analysis on the input video sequence with traditional motion detection methods to obtain the basic detection results;

S3: inputting the video scene descriptor and the basic detection results obtained in S2 into the deep fusion network; the deep fusion network analyzes the video scene descriptor to obtain an optimal fusion weight map, and uses the optimal fusion weight map to perform linear weighted fusion of the basic detection results.

7. The video moving object detection method based on a deep fusion network according to claim 6, characterized in that the deep fusion network is based on a deep convolutional network: the input video scene descriptor passes through four convolutional layers and one Soft-Max layer to produce the optimal fusion weight map, according to which the basic detection results are linearly weighted pixel by pixel.

8. The video moving object detection method based on a deep fusion network according to claim 6 or 7, characterized by further comprising off-line training of the feature extractor and the deep fusion network, wherein:

video clips are randomly sampled from the training videos as the source of the predicted motion masks and, together with the annotated masks of the real moving objects, i.e. the real motion masks, form training pairs; multiple training pairs form a training set; the training videos in the training pairs are randomly cropped to obtain training samples, and the training samples are then randomly flipped left-right and/or up-down to augment the training set;

one training pair is used as input at a time, and the parameters of the feature extractor and the deep fusion network are jointly optimized with the stochastic gradient descent algorithm; learning proceeds for multiple rounds over all training pairs in the training set until the loss converges.

9. The video moving object detection method based on a deep fusion network according to claim 8, characterized in that the loss function used in the stochastic gradient descent algorithm is the mean squared error between the predicted motion mask and the real motion mask; and/or

the parameter update rate of the deep fusion network is set to 100 to 10000 times the parameter update rate of the feature extractor.

10. A detection terminal, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, can perform the method of any one of claims 6-9.