Summary of the invention
To address the above shortcomings of the prior art, the present invention provides a video moving object detection system, method and terminal based on a deep fusion network, which combine traditional methods with deep learning techniques to obtain highly robust detection results across a variety of scenes.
The present invention is achieved by the following technical solutions.
According to one aspect of the present invention, a video moving object detection system based on a deep fusion network is provided, comprising the following modules:
a video feature extraction module, which receives a video sequence input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, i.e. a video scene feature representation, and sends it to the deep fusion module;
a base detection module, which receives the video sequence input, detects moving objects using base detectors to obtain corresponding base detection results, and sends them to the deep fusion module;
a deep fusion module, which receives the video scene feature representation and the base detection results, performs optimal fusion using a deep neural network, and outputs the final detection result.
Preferably, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor to extract the features of each video frame, and then stacks the per-frame features to form a set of descriptors describing the video scene, i.e. the video scene feature representation.
Preferably, there are multiple base detectors, each of which uses one traditional motion detection method to detect moving objects, yielding multiple corresponding base detection results.
Preferably, there are four base detectors, which respectively use the following traditional motion detection methods:
a pixel-based adaptive semantic association segmentation method;
a foreground-background segmentation method based on edge detection;
a background segmentation method based on a shared model;
a background segmentation method based on sample-point weighting.
Preferably, the deep fusion module receives the video scene feature representation as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then applies pixel-wise linear weighting to the base detection results according to the optimal fusion weight map.
According to another aspect of the present invention, a video moving object detection method based on a deep fusion network is provided, comprising the following steps:
S1: sequentially reading the current frame of the video and multiple frames preceding it as the video sequence input;
S2: analyzing each frame of the input video sequence with a feature extractor to obtain multiple groups of video frame features, and stacking these groups along the channel direction to form a descriptor of the video scene features, i.e. the video scene descriptor; and performing moving object analysis on the input video sequence with traditional motion detection methods to obtain base detection results;
S3: inputting the video scene descriptor and the base detection results obtained in S2 into the deep fusion network; the deep fusion network analyzes the video scene descriptor to obtain an optimal fusion weight map, and uses the optimal fusion weight map to perform linear weighted fusion of the base detection results.
Preferably, the deep fusion network is based on a deep convolutional network: the input video scene descriptor is passed through four convolutional layers and one Soft-Max layer to obtain the optimal fusion weight map, and pixel-wise linear weighting is then applied to the base detection results according to the optimal fusion weight map.
Preferably, the moving object detection method based on the deep fusion network further comprises offline training of the feature extractor and the deep fusion network, with the following steps:
randomly sampling video clips from the training videos as the input for predicting motion masks, and pairing each clip with the annotation mask of the real moving objects, i.e. the ground-truth motion mask, to form a training pair, multiple training pairs forming a training set; randomly cropping the training video of each training pair to obtain training samples, and then randomly flipping the training samples left-right and up-down to augment the training set;
taking one training pair as input at a time, jointly optimizing the parameters of the feature extractor and the deep fusion network using a stochastic gradient descent algorithm, and learning over all training pairs in the training set for multiple epochs until the loss converges.
Preferably, the loss function used in the stochastic gradient descent algorithm is the mean squared error between the predicted motion mask and the ground-truth motion mask.
Preferably, the parameter update rate (learning rate) of the deep fusion network is set to 100 to 10000 times the parameter update rate of the feature extractor.
According to a third aspect of the present invention, a detection terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the above video moving object detection method based on a deep fusion network.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention makes full use of multiple existing traditional video moving object detection systems, improving effectiveness across different scenes;
2. The present invention makes full use of deep learning techniques, improving the ability to describe high-level semantic features of video images;
3. Because the parameters of the system are learned automatically from data, the present invention does not require the manual parameter tuning needed by approaches based on feature engineering;
4. The present invention combines traditional methods with deep learning methods to obtain a robust, high-performance video moving object detection system and method with higher detection accuracy across various scenes;
5. The present invention combines the efficient performance of traditional methods in specific scenes with the powerful ability of deep learning to represent features of video image content; using a deep fusion network, it optimally fuses multiple traditional detection results according to the video scene features, and can thereby obtain robust detection results for a wide variety of scenes.
Specific embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.
An embodiment of the present invention provides a video moving object detection system based on a deep fusion network, comprising the following modules:
Module one: a video feature extraction module, which receives a video sequence input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video, i.e. a video scene feature representation, and sends it to the deep fusion module, where it is used to optimally fuse the base detection results;
Module two: a base detection module, which receives the video sequence input, detects moving objects using base detectors to obtain corresponding base detection results, and sends them to the deep fusion module;
Module three: a deep fusion module, which receives the video scene feature representation and the base detection results, performs optimal fusion using a deep neural network, and outputs the final detection result.
In certain preferred embodiments, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor to extract the features of each video frame, and then stacks the per-frame features to form a set of descriptors capable of describing the video scene.
Further, the base detection module contains multiple base detectors, each of which uses one traditional motion detection method to detect moving objects, yielding multiple corresponding base detection results. In embodiments, the number of base detectors may be four or any other number. For example, when there are four base detectors, the motion detection methods may be, but are not limited to, the following: PWACS, EFIC, SharedModel and WeSamBE. PWACS is a pixel-based adaptive semantic association segmentation method; EFIC is a foreground-background segmentation method based on edge detection; SharedModel is a background segmentation method based on a shared model; WeSamBE is a background segmentation method based on sample-point weighting. All of the above are traditional, non-deep-learning methods for moving object detection. Of course, different motion detection methods may be used in different embodiments; by fusing the results of several traditional motion detection methods, the embodiments of the present invention obtain a more robust detection result.
In certain preferred embodiments, the deep fusion module receives the video scene feature representation (the video feature descriptor) as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then applies pixel-wise linear weighting to the base detection results according to the optimal fusion weight map.
An embodiment of the present invention also provides a video moving object detection method based on a deep fusion network, comprising the following steps:
Step 1: sequentially reading the current frame of the video and multiple frames preceding it (for example 16 frames; the number depends on the input format of the specific implementation and may change for a different implementation) as the video sequence input;
Step 2: analyzing each frame of the input video sequence with a feature extractor to obtain multiple groups of video frame features (with 16 frames, 16 groups are obtained), and stacking these groups along the channel direction to form a descriptor of the video scene features; at the same time, performing moving object analysis on the input video sequence with traditional motion detection methods to obtain base detection results;
Step 3: inputting the video scene descriptor and the base detection results obtained in step 2 into the deep fusion network; the deep fusion network further analyzes the video scene descriptor to obtain an optimal fusion weight map, and finally uses the optimal fusion weight map to perform linear weighted fusion of the base detection results.
In step 1, the video sequence input serves as the system input and is a video clip comprising the current frame and multiple frames preceding it.
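The clip construction in step 1 can be sketched as a sliding window over the frame stream. This is only an illustration under stated assumptions: the function name `sliding_clips` and the choice to emit nothing until a full window is available are hypothetical and not specified by the embodiment; integers stand in for decoded frames.

```python
from collections import deque

def sliding_clips(frames, clip_len=16):
    """Yield, for each frame, a clip holding that frame and the frames before it.

    clip_len counts the current frame as well (16 in the example embodiment).
    """
    window = deque(maxlen=clip_len)      # the oldest frame is dropped automatically
    for frame in frames:
        window.append(frame)
        if len(window) == clip_len:      # assumption: wait until enough history exists
            yield list(window)           # the last element is the current frame

# Integers stand in for decoded video frames.
clips = list(sliding_clips(range(20), clip_len=16))
```

With 20 input frames and a window of 16, five clips are produced, the first ending at frame 15 and the last at frame 19.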
In step 2, the feature extractor outputs a set of deep-learning-based feature maps.
In step 3, the optimal fusion is based on a deep convolutional network. The final step of the fusion is a linear weighting operation based on the optimal fusion weight map.
Further, the method may also include an offline training stage for the feature extractor and the deep fusion network, as follows:
Step 1: randomly sampling video clips from the training videos as the input for predicting motion masks, and pairing each clip with the annotation mask of the real moving objects, i.e. the ground-truth motion mask, to form a training pair, multiple training pairs forming a training set; randomly cropping the training video of each training pair to obtain training samples, and then randomly flipping the samples left-right and up-down to augment the training set;
Step 2: taking one training pair as input at a time, jointly optimizing the parameters of the feature extractor and the deep fusion network using a stochastic gradient descent algorithm, and learning over all training pairs in the training set for multiple epochs until the loss converges.
In step 1, the training sample size may be 128x128 or another size, depending on the available computing resources; if computing resources allow, a larger size such as 256x256 or 512x512 may be used.
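The cropping and flipping described for training step 1 can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the array layouts (clip as frames x height x width x channels, mask as height x width), the flip probability of 0.5 and the function name `augment` are assumptions. The one point the sketch is meant to show is that the clip and its annotation mask must receive the same crop and the same flips so that they stay aligned.

```python
import numpy as np

def augment(clip, mask, size=128, rng=None):
    """Apply one random crop and random flips identically to a clip and its mask.

    clip: array of shape (T, H, W, C); mask: array of shape (H, W).
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W = mask.shape
    top = rng.integers(0, H - size + 1)       # upper bound is exclusive
    left = rng.integers(0, W - size + 1)
    clip = clip[:, top:top + size, left:left + size]
    mask = mask[top:top + size, left:left + size]
    if rng.random() < 0.5:                    # left-right flip (width axis)
        clip, mask = clip[:, :, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                    # up-down flip (height axis)
        clip, mask = clip[:, ::-1], mask[::-1]
    return clip, mask

clip = np.random.rand(16, 160, 200, 3)
mask = (np.random.rand(160, 200) > 0.5).astype(np.float32)
c, m = augment(clip, mask, rng=np.random.default_rng(0))
```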
In step 2, the loss function used in the stochastic gradient descent algorithm may be the mean squared error between the predicted motion mask and the ground-truth motion mask. Further, the parameter update rate (learning rate) of the deep fusion network is set to 100 to 10000 times that of the feature extractor. The joint optimization in step 2 applies gradient descent to the error of the detection result, optimizing iteratively. Once the optimal model parameters obtained by training have been saved, they can be used directly in the video moving object detection method.
Based on the above, the technical solution of the present invention is described in further detail below with reference to the accompanying drawings and specific examples.
As shown in Fig. 1, in one embodiment of the present invention, the video moving object detection system based on a deep fusion network comprises three types of modules: a video feature extraction module (video feature extraction network), a base detection module, and a deep fusion module (deep fusion network).
In this embodiment, the system contains one video feature extraction module and one deep fusion module; the type and number of base detection systems in the base detection module can be selected flexibly according to the characteristics of the specific scene and the performance of the processing platform.
In this embodiment, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor, analyzes all video frames of a video clip in turn, and stacks the resulting feature maps to form the video feature descriptor.
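The channel-wise stacking of per-frame feature maps can be sketched as below. The function `extract_features` is a hypothetical stand-in that only mimics the shape behaviour of a convolutional feature extractor (a downsampled multi-channel map); it is not VGG-16, and the channel count and downsampling factor are assumptions chosen to keep the example small.

```python
import numpy as np

def extract_features(frame):
    """Hypothetical stand-in for a pre-trained extractor: (H, W, 3) -> (C, h, w)."""
    gray = frame.mean(axis=2)                 # (H, W)
    pooled = gray[::4, ::4]                   # crude 4x spatial downsampling
    return np.stack([pooled, pooled ** 2])    # C = 2 toy feature channels

def build_descriptor(clip):
    """Stack the feature maps of all frames along the channel axis."""
    feats = [extract_features(frame) for frame in clip]   # T maps of shape (C, h, w)
    return np.concatenate(feats, axis=0)                  # shape (T * C, h, w)

clip = [np.random.rand(128, 128, 3) for _ in range(16)]
descriptor = build_descriptor(clip)
```

For 16 frames of 128x128 pixels, the toy extractor yields a descriptor with 16 x 2 = 32 channels of size 32x32; with a real network the channel count per frame would be that of its last feature map.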
In this embodiment, the base detection module uses four base detection systems: PWACS, EFIC, SharedModel and WeSamBE, which have complementary performance in dynamic-background, night, camera-jitter and infrared scenes.
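How several base detectors are run on the same clip and their masks collected can be sketched as follows. The two detectors here (`frame_difference` and `mean_background`) are deliberately simple hypothetical stand-ins, not the four methods named in this embodiment; only the pattern of producing one mask per detector and stacking them into B(n) is intended.

```python
import numpy as np

def frame_difference(clip, thresh=0.1):
    """Toy detector: pixels that changed between the last two frames."""
    return (np.abs(clip[-1] - clip[-2]) > thresh).astype(np.float32)

def mean_background(clip, thresh=0.1):
    """Toy detector: pixels that differ from the mean of the earlier frames."""
    background = clip[:-1].mean(axis=0)
    return (np.abs(clip[-1] - background) > thresh).astype(np.float32)

def run_base_detectors(clip, detectors):
    """Run every detector on the same clip and stack the masks, shape (N, H, W)."""
    return np.stack([detect(clip) for detect in detectors])

clip = np.zeros((16, 4, 4))       # 16 grayscale frames with a static background
clip[-1, 1, 2] = 1.0              # one pixel changes in the current frame
B = run_base_detectors(clip, [frame_difference, mean_background])
```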
In this embodiment, the deep fusion module mainly consists of four convolutional layers and one Soft-Max layer connected in cascade. The module receives the video descriptor as input, analyzes it further to obtain the optimal fusion weight map, and then applies pixel-wise linear weighting to the base detection results according to the weight map.
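The role of the Soft-Max layer can be illustrated with a short sketch. It assumes that the last convolutional layer outputs one score map per base detector and that the Soft-Max is taken per pixel across the detector axis, so that the weights at each pixel are non-negative and sum to one; the text does not state the axis explicitly, so this is an assumption, and `softmax_weights` is a hypothetical name.

```python
import numpy as np

def softmax_weights(scores):
    """Per-pixel softmax over the detector axis; scores has shape (N, H, W)."""
    shifted = scores - scores.max(axis=0, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

scores = np.random.randn(4, 8, 8)     # stand-in for the output of the last conv layer
M = softmax_weights(scores)           # weight maps, one per base detector
```

Under this assumption the subsequent linear weighting is a convex combination of the base detection results at every pixel.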
As shown in Fig. 2, in one embodiment, the method for detecting moving objects in video using the video moving object detection system based on a deep fusion network comprises the following steps:
Step 1: sequentially reading 16 frames, comprising the current frame and the frames preceding it, as the system input (video sequence input);
Step 2: analyzing each frame with the VGG-16 network, and stacking the last-layer feature maps obtained for each frame along the channel direction to form a descriptor of the video features;
at the same time, analyzing the system input with the base detection module to obtain 4 base detection results, denoted B(n), n = 1, 2, 3, 4, which are the detection results of the four base detection methods;
Step 3: inputting the video descriptor from step 2 and the base detection results into the deep fusion network. The deep fusion network further analyzes the video descriptor to obtain the optimal fusion weight map M, and finally uses the weight map M to perform the linear weighted fusion of the base detection results given by formula (1):

$$P = \sum_{n=1}^{4} M(n) \odot B(n) \qquad (1)$$

In formula (1), B(n) denotes the n-th base detection result and M(n) denotes the weighting coefficients corresponding to the n-th base detection result; both are two-dimensional images of the same size as the input video frame. ⊙ denotes element-wise multiplication. Formula (1) therefore takes the pixel-wise weighted average of the four base detection results as the final prediction result P.
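The pixel-wise linear weighted fusion of formula (1) can be written in a few lines. This is a sketch under the assumption that the base detection results B(n) and the weight maps M(n) are stored as arrays of shape (N, H, W); the function name `fuse` and the tiny 1x2 example image are illustrative only.

```python
import numpy as np

def fuse(B, M):
    """Final prediction P: sum over n of M(n) * B(n), multiplied element-wise."""
    if B.shape != M.shape:
        raise ValueError("B and M must have the same shape (N, H, W)")
    return (M * B).sum(axis=0)

# Four base detection results for a 1x2 image (1 = moving, 0 = background).
B = np.array([[[1.0, 0.0]],
              [[1.0, 1.0]],
              [[0.0, 0.0]],
              [[1.0, 0.0]]])
M = np.full((4, 1, 2), 0.25)          # equal weights at every pixel
P = fuse(B, M)                        # [[0.75, 0.25]]
```

With equal weights the fusion reduces to a plain average of the four masks; the learned weight map lets each pixel favour whichever detectors suit the scene.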
In this embodiment, the offline training of the parameters of the feature extractor and the deep fusion network comprises the following steps:
Step 1: randomly sampling video clips from the training videos and pairing them with the annotation masks of the real moving objects to form training pairs. The training videos are randomly cropped to obtain 128x128 training samples, and the samples are then randomly flipped left-right and up-down to augment the training set;
Step 2: jointly optimizing the parameters of the whole system using a stochastic gradient descent algorithm until the loss converges.
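The optimization method in step 2 is the Adam optimization method. The loss function is set as formula (2):

$$L = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \bigl(P(i,j) - G(i,j)\bigr)^2 \qquad (2)$$

In formula (2), H and W denote the height and width of the image, P is the predicted motion mask, and G denotes the ground-truth motion annotation mask.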
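The loss can be illustrated numerically. The sketch assumes, in line with the mean squared error named earlier in this description, that the loss is the squared difference between the predicted mask P and the ground-truth mask G averaged over all H x W pixels; the function name `mse_loss` is hypothetical.

```python
import numpy as np

def mse_loss(P, G):
    """Mean squared error between predicted mask P and ground-truth mask G."""
    H, W = G.shape
    return float(((P - G) ** 2).sum() / (H * W))

P = np.array([[0.75, 0.25],
              [1.00, 0.00]])
G = np.array([[1.0, 0.0],
              [1.0, 0.0]])
loss = mse_loss(P, G)     # (0.0625 + 0.0625 + 0 + 0) / 4 = 0.03125
```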
In step 2, the learning rate for the parameters of the video feature extraction module is set to 10^-7, and the learning rate of the deep fusion network is set to 10^-4. After training converges, the parameters are saved and are loaded directly in actual use.
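The use of different learning rates for the two networks can be sketched with per-group update steps. The sketch reads the two rates as 1e-7 for the feature extractor and 1e-4 for the deep fusion network, and it uses a plain gradient-descent step for brevity rather than the Adam method named in this embodiment; the dictionary layout and the function name `sgd_step` are assumptions.

```python
import numpy as np

# Assumed learning rates: 1e-7 for the feature extractor, 1e-4 for the fusion network.
LEARNING_RATES = {"extractor": 1e-7, "fusion": 1e-4}

def sgd_step(params, grads, lrs=LEARNING_RATES):
    """One plain gradient-descent step with a separate learning rate per group."""
    return {name: params[name] - lrs[name] * grads[name] for name in params}

params = {"extractor": np.ones(3), "fusion": np.ones(3)}
grads = {"extractor": np.full(3, 2.0), "fusion": np.full(3, 2.0)}
updated = sgd_step(params, grads)
ratio = LEARNING_RATES["fusion"] / LEARNING_RATES["extractor"]   # about 1000
```

The ratio of about 1000 lies inside the 100 to 10000 range stated for the update rates, so the pre-trained extractor is only fine-tuned gently while the fusion network learns quickly.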
Based on the above method, an embodiment of the present invention also provides a detection terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the above video moving object detection method based on a deep fusion network.
In the video moving object detection system, method and detection terminal based on a deep fusion network provided by the above embodiments of the present invention, once the video sequence has been input into the system, video feature extraction and base detection are carried out at the same time, and the deep fusion module then optimally fuses the multiple base detection results according to the video features. The above embodiments build the feature extraction module and the deep fusion module from deep convolutional networks and train them on large amounts of data to obtain optimal model parameters, so that moving object detection can be performed automatically in practical applications; experimental results show that the system obtains highly accurate detection results.
The specific parameters in the above embodiments are given only to illustrate the implementation of the technical solution of the present invention; other specific parameters may be used in further embodiments without any essential effect on the realization of the invention.
It should be noted that the steps of the method provided by the present invention can be implemented by the corresponding modules, devices, units and the like of the system, and those skilled in the art can refer to the technical solution of the system to implement the step flow of the method; that is, the embodiments of the system can be regarded as preferred examples for implementing the method, and are not described again here.
Those skilled in the art will appreciate that, in addition to implementing the system and its devices, modules and units provided by the present invention purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system and its devices provided by the present invention can therefore be regarded as a hardware component, and the devices included therein for realizing various functions can also be regarded as structures within the hardware component; the devices for realizing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the above particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.