
CN106686472B - A method and system for generating high frame rate video based on deep learning - Google Patents


Info

Publication number
CN106686472B
CN106686472B (application CN201611241691.XA; published as CN106686472A)
Authority
CN
China
Prior art keywords
frame
video
neural networks
convolutional neural
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611241691.XA
Other languages
Chinese (zh)
Other versions
CN106686472A (en)
Inventor
Wang Xinggang (王兴刚)
Luo Hao (罗浩)
Jiang Yujing (姜玉静)
Liu Wenyu (刘文予)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201611241691.XA priority Critical patent/CN106686472B/en
Publication of CN106686472A publication Critical patent/CN106686472A/en
Application granted granted Critical
Publication of CN106686472B publication Critical patent/CN106686472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0127Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for generating high frame-rate video based on deep learning, comprising: generating a training sample set from one or more original high frame-rate video clips; training a dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain an optimized dual-channel convolutional neural network, the dual-channel convolutional neural network model being a convolutional neural network formed by fusing two convolutional channels; and, using the optimized dual-channel convolutional neural network, generating an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video. The method is end-to-end throughout and requires no post-processing of the video frames; the frame-rate conversion quality is good, the synthesized video is smooth, and the method is robust to problems such as jitter during video capture and video scene changes.

Description

A method and system for generating high frame-rate video based on deep learning
Technical field
The invention belongs to the field of computer vision, and more particularly relates to a method and system for generating high frame-rate video based on deep learning.
Background art
As technology develops, people can obtain video ever more conveniently. Owing to hardware limitations, however, most video is captured by non-professional equipment, and its frame rate is generally only 24-30 fps. High frame-rate video is smoother and delivers a better visual experience. If users directly upload high frame-rate video to the web, traffic consumption rises and so does their cost. If they upload low frame-rate video instead, dropped frames are unavoidable during transmission because of network conditions, and the larger the video, the more likely this is, so the video quality at the remote end cannot be effectively guaranteed, which greatly harms the user experience. A reasonable processing scheme is therefore needed at the remote end to post-process uploaded video, so that its quality meets users' needs and further improves their experience.
Summary of the invention
Aiming at the above defects or improvement needs of the prior art, the present invention provides a method for generating high frame-rate video based on deep learning. Its object is to convert low frame-rate video into high frame-rate video, thereby solving the technical problem that frames dropped from low frame-rate video during network transmission degrade video quality and harm the user experience.
To achieve the above object, according to one aspect of the present invention, a method for generating high frame-rate video based on deep learning is provided, comprising the following steps:
(1) generating a training sample set from one or more original high frame-rate video clips, wherein the training sample set contains multiple video-frame sets, each video-frame set comprises two training frames and one control frame, the two training frames are two video frames separated by one or more frames in a high frame-rate video clip, and the control frame is any one of the frames in the interval between the two training frames; the frame rate of the high frame-rate video clip is above a set frame-rate threshold;
(2) training a dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain an optimized dual-channel convolutional neural network; wherein the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels, the two convolutional channels respectively receive the two video frames of a video-frame set and convolve them separately, the model fuses the convolution results of the two channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the control frame of the video-frame set;
(3) using the optimized dual-channel convolutional neural network, generating an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video.
In one embodiment of the invention, each convolutional channel in the dual-channel convolutional neural network model contains k convolutional layers, where k > 0, and each convolutional layer is described mathematically as:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes convolution, F_{i-1} is the output of layer i-1, Z_i(Y) is the output after the convolution at layer i, W_i is the convolution kernel parameter of layer i, and B_i is the bias parameter of layer i.
In one embodiment of the invention, within each convolutional channel a ReLU activation layer follows each of the first k-1 convolutional layers to preserve the sparsity of the network, described mathematically as:
F_i(Y) = max(0, Z_i).
In one embodiment of the invention, the feature response maps obtained from the two video frames after the last convolutional layer are fused by adding the values at corresponding positions.
In one embodiment of the invention, a Sigmoid activation layer follows the feature response map produced by the fusion operation to map the pixel values of the picture into the range 0-1, described mathematically as:
F(Y) = 1 / (1 + e^{-Z(Y)}).
In one embodiment of the invention, the convolution kernel parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, and the base learning rate is initialized to 1e-6; the base learning rate is reduced by a factor of 10 after m iteration epochs, where m is a preset value.
In one embodiment of the invention, training the dual-channel convolutional neural network model by regression of the predicted frame against the control frame of the video-frame set specifically comprises:
using the error between the predicted frame and the control frame to train the dual-channel convolutional neural network by error backpropagation, with least-squares error as the optimization objective, described mathematically as:
L = (1/n) * sum_{i=1}^{n} ||Y_i - Y*_i||^2
where i denotes the i-th sample picture, n is the size of the training set, Y_i is the video frame predicted by the network, and Y*_i is the ground-truth value of the corresponding video frame.
In one embodiment of the invention, k is 3; the first convolutional layer has 64 convolution kernels of size 9×9, a stride of 1 pixel, and a padding of 4, the padding being the number of rings of zeros added around the feature map; the second convolutional layer has 32 convolution kernels of size 1×1, a stride of 1 pixel, and a padding of 0; the third convolutional layer has 3 convolution kernels of size 5×5, a stride of 1, and a padding of 2.
According to another aspect of the invention, a system for generating high frame-rate video based on deep learning is also provided, comprising a training-sample-set generation module, a dual-channel convolutional neural network optimization module, and a high frame-rate video generation module, wherein:
the training-sample-set generation module generates a training sample set from one or more high frame-rate video clips; the training sample set contains multiple video-frame sets, each comprising two training frames and one control frame; the two training frames are two video frames separated by one or more frames in a high frame-rate video clip, and the control frame is any one of the frames in the interval between the two training frames; the frame rate of the high frame-rate video clip is above a set frame-rate threshold;
the dual-channel convolutional neural network optimization module trains the dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain the optimized dual-channel convolutional neural network; the dual-channel convolutional neural network model is a convolutional neural network fusing two channels: the two channels respectively receive the two video frames of a video-frame set and convolve them separately, the model fuses the results of the two channels' convolutions and outputs a predicted frame, and the model is trained by regression of the predicted frame against the control frame of the video-frame set;
the high frame-rate video generation module uses the optimized dual-channel convolutional neural network to generate an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video.
In one embodiment of the invention, each convolutional channel in the dual-channel convolutional neural network model contains k convolutional layers, where k > 0, and each convolutional layer is described mathematically as:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes convolution, F_{i-1} is the output of layer i-1, Z_i(Y) is the output after the convolution at layer i, W_i is the convolution kernel parameter of layer i, and B_i is the bias parameter of layer i.
In general, compared with the prior art, the technical scheme conceived by the present invention achieves the following technical effects:
(1) The feature extraction and frame prediction of the invention are obtained through supervised learning on training samples, without manual intervention, and can better fit spatial-difference information in large-scale data scenarios;
(2) The whole process of the invention is end-to-end: it uses the self-learning capability of convolutional neural networks to learn the model parameters, and is concise and efficient, overcoming the time-consuming, labor-intensive, and often ineffective character of traditional video frame-rate conversion.
Brief description of the drawings
Fig. 1 is a flowchart of the deep-learning-based video frame-rate conversion method of the invention, where F_i denotes the output of layer i; Y_{t-1}, Y_t, and Y_{t+1} denote three consecutive video frames; Y_t serves as the ground truth for computing the error; and Prediction denotes the video frame predicted by the network.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to illustrate the invention and are not intended to limit it. In addition, the technical features involved in the various embodiments described below may be combined with one another as long as they do not conflict.
The technical terms of the invention are first explained and illustrated below:
Convolutional neural network (CNN): a neural network that can be used for tasks such as image classification and regression. Its particularity lies in two aspects: on the one hand, the connections between its neurons are not fully connected; on the other hand, the weights of connections between certain neurons within the same layer are shared. Such a network usually consists of convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers extract hierarchical features of the image, and the fully connected layers classify or regress on the extracted features. The network parameters comprise the convolution kernels and the parameters and biases of the fully connected layers, and they can be learned from data by the backpropagation algorithm.
Backpropagation algorithm (BP): a common method for training artificial neural networks, used in combination with an optimization method such as gradient descent. The method computes the gradient of the loss function with respect to all weights in the network; this gradient is fed back to the optimization method, which uses it to update the weights so as to minimize the loss function. The algorithm mainly comprises two phases: forward propagation of the excitation, and backpropagation of the error together with the weight update.
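As a concrete illustration of these two phases, the following minimal sketch is an assumption for illustration only (it uses PyTorch autograd on a single scalar weight; it is not part of the patent). It runs a forward pass to compute the loss, backpropagates the gradient, and applies one gradient-descent update:

    import torch

    w = torch.tensor(2.0, requires_grad=True)          # a single weight
    x, target, lr = torch.tensor(3.0), torch.tensor(5.0), 0.1
    loss = (w * x - target) ** 2                       # forward pass: loss = (wx - y)^2 = 1.0
    loss.backward()                                    # backward pass: dloss/dw = 2x(wx - y) = 6.0
    with torch.no_grad():
        w -= lr * w.grad                               # weight update: w = 2.0 - 0.1 * 6.0 = 1.4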
With the arrival of the big-data era, the scale of video databases keeps growing, and a solution to this problem becomes ever more urgent. Deep neural networks can analyze data in a way that well approximates how the human brain works, and in recent years deep learning has been applied successfully across the fields of computer vision; video frame-rate conversion, however, has seen no substantial research, and traditional frame-rate conversion methods are complex and costly in time and labor. The present invention therefore proposes a deep-learning-based video frame-rate conversion method. The whole process is end-to-end, simple and efficient, and strongly robust to problems such as video jitter and scene changes.
As shown in Fig. 1, the deep-learning-based video frame-rate conversion method of the present invention may comprise the following steps:
(1) Generate a training sample set from one or more original high frame-rate video clips. The training sample set contains multiple video-frame sets, each comprising two training frames and one control frame; the two training frames are two video frames separated by one or more frames in a high frame-rate video clip, and the control frame is any one of the frames in the interval between the two training frames. The frame rate of the high frame-rate video clip is above a set frame-rate threshold.
Specifically, video-frame sets can be extracted from the high frame-rate video clips in a certain proportion to obtain the training sample set.
The training sample set is composed of multiple video-frame sets, each containing two training frames and one control frame. The control frame is chosen as the middle frame between the two training frames, or the frame closest to the middle. In the common case, three consecutive frames are taken: the middle frame is the control frame and the other two are the training frames. If the frame rate is high enough, two frames separated by several frames can also be taken as the training frames (how many depends on the frame rate; the gap must not be too large), and any one frame in the interval can be chosen as the control frame. For example, if the training video's frame rate is 60 and the video has N frames, then, sampling with a one-frame gap, a frame is taken at random from frames 2 to N-1 as the ground truth (the control frame), and its two neighboring frames are fed into the network as the training sample (the two training frames). Samples can likewise be built with a multi-frame gap, which suits video of lower frame rate, i.e., converting lower frame-rate video into high frame-rate video. A sketch of this sampling procedure follows.
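The sketch below is a hypothetical Python rendering of this sampling step (the frame container, the helper name make_training_triplets, and the gap parameter are illustrative assumptions, not the patent's code). It also applies the divide-by-255 normalization described further below:

    import random
    import numpy as np

    def make_training_triplets(frames, gap=1):
        """Build (training frame A, training frame B, control frame) triplets.

        frames: list of H x W x 3 uint8 arrays from one high frame-rate clip.
        gap=1 takes three consecutive frames (the middle one is the control
        frame); gap>1 spaces the training frames further apart and picks a
        random in-between frame as the control frame.
        """
        normalize = lambda f: f.astype(np.float32) / 255.0   # map pixels to [0, 1]
        triplets = []
        for t in range(len(frames) - gap - 1):
            frame_a = frames[t]
            frame_b = frames[t + gap + 1]
            control = frames[random.randint(t + 1, t + gap)]  # any frame in the interval
            triplets.append((normalize(frame_a), normalize(frame_b), normalize(control)))
        return triplets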
(2) Train a dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain an optimized dual-channel convolutional neural network. The dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels: the two channels respectively receive the two video frames of a video-frame set and convolve them separately; the model fuses the convolution results of the two channels and outputs a predicted frame; and the model is trained by regression of the predicted frame against the control frame of the video-frame set.
A dual-channel convolutional neural network must first be designed and implemented. Specifically:
The dual-channel convolutional neural network model established here is a convolutional neural network fusing two convolutional channels. Each channel contains k convolutional layers, k > 0 and preferably 3, which convolve the two video-frame pictures (the training frames) separately. The first convolutional layer has 64 convolution kernels of size 9×9, a stride of 1 pixel, and a padding of 4, the padding being the number of rings of zeros added around the feature map. The second convolutional layer has 32 convolution kernels of size 1×1, a stride of 1 pixel, and a padding of 0. The third convolutional layer has 3 convolution kernels of size 5×5, a stride of 1, and a padding of 2. The convolutional layer is described mathematically as:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input picture is layer 0), * denotes convolution, F_{i-1} is the output of layer i-1, Z_i(Y) is the output after the convolution at layer i, W_i is the convolution kernel parameter of layer i, and B_i is the bias parameter of layer i;
Of the three convolutional layers, the first and second are each followed by a ReLU activation layer to preserve the sparsity of the network, described mathematically as:
F_i(Y) = max(0, Z_i).
The feature response maps obtained from the two video-frame pictures after the third convolutional layer are fused by adding the values at corresponding positions.
After this fusion operation, the resulting feature response map is followed by a Sigmoid activation layer to map the pixel values of the picture into the range 0-1, described mathematically as:
F(Y) = 1 / (1 + e^{-Z(Y)}).
The complete architecture is sketched below.
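The following is a hypothetical PyTorch rendering of the architecture just described (the framework, the NCHW tensor layout, and the class names are assumptions; the layer sizes 64 of 9×9, 32 of 1×1, and 3 of 5×5, the ReLU placement, the additive fusion, and the final Sigmoid follow the text above):

    import torch
    import torch.nn as nn

    class ChannelCNN(nn.Module):
        """One convolutional channel: 3 conv layers, ReLU after the first two."""
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 64, kernel_size=9, stride=1, padding=4)
            self.conv2 = nn.Conv2d(64, 32, kernel_size=1, stride=1, padding=0)
            self.conv3 = nn.Conv2d(32, 3, kernel_size=5, stride=1, padding=2)

        def forward(self, y):
            f1 = torch.relu(self.conv1(y))    # F_1 = max(0, Z_1)
            f2 = torch.relu(self.conv2(f1))   # F_2 = max(0, Z_2)
            return self.conv3(f2)             # Z_3: no ReLU after the last layer

    class DualChannelCNN(nn.Module):
        """Two channels fused by element-wise addition, then a Sigmoid."""
        def __init__(self):
            super().__init__()
            self.channel_a = ChannelCNN()
            self.channel_b = ChannelCNN()

        def forward(self, frame_prev, frame_next):
            fused = self.channel_a(frame_prev) + self.channel_b(frame_next)
            return torch.sigmoid(fused)       # map pixel values into (0, 1)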
Before training the dual-channel convolutional neural network, each pixel value in the video frames must be divided by 255 for normalization, so that normalized pixel values lie between 0 and 1.
Also, before training, the parameters of the convolutional neural network must be initialized: the convolution kernel parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, and the base learning rate is initialized to 1e-6; the base learning rate is reduced by a factor of 10 after m iteration epochs, where m is a preset value. For example, with the preferred value m = 2, the learning rate is 1e-6 during the first m iteration epochs and 1e-7 after the m-th epoch, remaining constant thereafter.
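A sketch of this initialization and learning-rate schedule, reusing the hypothetical DualChannelCNN defined above (plain SGD is an assumption; the patent specifies only error backpropagation):

    import torch.nn as nn
    import torch.optim as optim

    def init_weights(module):
        # convolution kernels ~ N(0, 1), biases = 0, as specified above
        if isinstance(module, nn.Conv2d):
            nn.init.normal_(module.weight, mean=0.0, std=1.0)
            nn.init.zeros_(module.bias)

    m = 2                                                # the preferred value of m
    model = DualChannelCNN()
    model.apply(init_weights)
    optimizer = optim.SGD(model.parameters(), lr=1e-6)   # base learning rate 1e-6
    # divide the learning rate by 10 once, after epoch m, then keep it constant
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[m], gamma=0.1)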
Specifically, the error between the network's prediction and the control frame can be used to train the dual-channel convolutional neural network by error backpropagation. The least-squares error is used as the optimization objective, described mathematically as:
L = (1/n) * sum_{i=1}^{n} ||Y_i - Y*_i||^2
where i denotes the i-th sample picture, n is the size of the training set, Y_i is the video frame predicted by the network, and Y*_i is the ground-truth value of the corresponding video frame.
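Putting the pieces together, one epoch of training with the least-squares objective might look as follows (a sketch assuming the triplets above have been batched into N x 3 x H x W tensors by a hypothetical loader):

    import torch.nn.functional as F

    def train_epoch(model, loader, optimizer, scheduler):
        model.train()
        for frame_a, frame_b, control in loader:
            optimizer.zero_grad()
            prediction = model(frame_a, frame_b)      # fused, sigmoid-mapped output
            loss = F.mse_loss(prediction, control)    # least-squares error vs. control frame
            loss.backward()                           # error backpropagation
            optimizer.step()
        scheduler.step()                              # advance the learning-rate schedule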
(3) Using the optimized dual-channel convolutional neural network, generate an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video.
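A sketch of this interpolation step, again with the hypothetical model above: every adjacent pair of low frame-rate frames yields one inserted frame, roughly doubling the frame rate (interleaving a single frame per pair is an assumption for illustration; the patent allows an inserted frame for any two adjacent frames):

    import torch

    @torch.no_grad()
    def double_frame_rate(model, frames):
        """frames: list of 1 x 3 x H x W tensors in [0, 1]. Returns interleaved frames."""
        model.eval()
        output = [frames[0]]
        for prev_frame, next_frame in zip(frames, frames[1:]):
            inserted = model(prev_frame, next_frame)  # predicted in-between frame
            output.extend([inserted, next_frame])
        return output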
It will be readily understood by those skilled in the art that the foregoing describes only preferred embodiments of the present invention and does not limit it; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (3)

1. A method for generating high frame-rate video based on deep learning, characterized in that the method comprises the following steps:
(1) generating a training sample set from one or more original high frame-rate video clips, the training sample set containing multiple video-frame sets, each video-frame set comprising two training frames and one control frame, the two training frames being two video frames separated by one or more frames in a high frame-rate video clip, and the control frame being any one of the frames in the interval between the two training frames; the frame rate of the high frame-rate video clip being above a set frame-rate threshold;
(2) training a dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain an optimized dual-channel convolutional neural network; wherein the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels, the two convolutional channels respectively receive the two video frames of a video-frame set and convolve them separately, the model fuses the convolution results of the two channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the control frame of the video-frame set; wherein,
each convolutional channel in the dual-channel convolutional neural network model contains k convolutional layers, where k > 0, and each convolutional layer is described mathematically as:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes convolution, F_{i-1} is the output of layer i-1, Z_i(Y) is the output after the convolution at layer i, W_i is the convolution kernel parameter of layer i, and B_i is the bias parameter of layer i;
within each convolutional channel, a ReLU activation layer follows each of the first k-1 convolutional layers to preserve the sparsity of the network, described mathematically as:
F_i(Y) = max(0, Z_i);
the convolution kernel parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, and the base learning rate is initialized to 1e-6; the base learning rate is reduced by a factor of 10 after m iteration epochs, where m is a preset value;
k is 3; the first convolutional layer has 64 convolution kernels of size 9×9, a stride of 1 pixel, and a padding of 4, the padding being the number of rings of zeros added around the feature map; the second convolutional layer has 32 convolution kernels of size 1×1, a stride of 1 pixel, and a padding of 0; the third convolutional layer has 3 convolution kernels of size 5×5, a stride of 1, and a padding of 2;
a Sigmoid activation layer follows the feature response map produced by the fusion operation to map the pixel values of the picture into the range 0-1, described mathematically as:
F(Y) = 1 / (1 + e^{-Z(Y)});
training the dual-channel convolutional neural network model by regression of the predicted frame against the control frame of the video-frame set specifically comprises:
using the error between the predicted frame and the control frame to train the dual-channel convolutional neural network by error backpropagation, with least-squares error as the optimization objective, described mathematically as:
L = (1/n) * sum_{i=1}^{n} ||Y_i - Y*_i||^2
where i denotes the i-th sample picture, n is the size of the training set, Y_i is the video frame predicted by the network, and Y*_i is the ground-truth value of the corresponding video frame;
(3) using the optimized dual-channel convolutional neural network, generating an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video.
2. The method for generating high frame-rate video based on deep learning according to claim 1, characterized in that the feature response maps obtained from the two video frames after the last convolutional layer are fused by adding the values at corresponding positions.
3. A system for generating high frame-rate video based on deep learning, characterized by comprising a training-sample-set generation module, a dual-channel convolutional neural network optimization module, and a high frame-rate video generation module, wherein:
the training-sample-set generation module is configured to generate a training sample set from one or more high frame-rate video clips, the training sample set containing multiple video-frame sets, each video-frame set comprising two training frames and one control frame, the two training frames being two video frames separated by one or more frames in a high frame-rate video clip, and the control frame being any one of the frames in the interval between the two training frames; the frame rate of the high frame-rate video clip being above a set frame-rate threshold;
the dual-channel convolutional neural network optimization module is configured to train the dual-channel convolutional neural network model on the multiple video-frame sets in the training sample set to obtain the optimized dual-channel convolutional neural network; wherein the dual-channel convolutional neural network model is a convolutional neural network fusing two channels, the two channels respectively receive the two video frames of a video-frame set and convolve them separately, the model fuses the results of the two channels' convolutions and outputs a predicted frame, and the model is trained by regression of the predicted frame against the control frame of the video-frame set;
the high frame-rate video generation module is configured to use the optimized dual-channel convolutional neural network to generate an inserted frame from any two adjacent video frames of a low frame-rate video, thereby producing a video with a frame rate higher than that of the low frame-rate video;
each convolutional channel in the dual-channel convolutional neural network model contains k convolutional layers, where k > 0, and each convolutional layer is described mathematically as:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes convolution, F_{i-1} is the output of layer i-1, Z_i(Y) is the output after the convolution at layer i, W_i is the convolution kernel parameter of layer i, and B_i is the bias parameter of layer i.
CN201611241691.XA 2016-12-29 2016-12-29 A method and system for generating high frame rate video based on deep learning Active CN106686472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611241691.XA CN106686472B (en) 2016-12-29 2016-12-29 A method and system for generating high frame rate video based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611241691.XA CN106686472B (en) 2016-12-29 2016-12-29 A method and system for generating high frame rate video based on deep learning

Publications (2)

Publication Number Publication Date
CN106686472A CN106686472A (en) 2017-05-17
CN106686472B (en) 2019-04-26

Family

ID=58872327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611241691.XA Active CN106686472B (en) 2016-12-29 2016-12-29 A method and system for generating high frame rate video based on deep learning

Country Status (1)

Country Link
CN (1) CN106686472B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9756375B2 (en) 2015-01-22 2017-09-05 Microsoft Technology Licensing, Llc Predictive server-side rendering of scenes
CN107481209B (en) * 2017-08-21 2020-04-21 北京航空航天大学 An image or video quality enhancement method based on convolutional neural network
CN107613299A (en) * 2017-09-29 2018-01-19 杭州电子科技大学 A Method of Using Generative Networks to Improve Frame Rate Upconversion
CN107886081B (en) * 2017-11-23 2021-02-02 武汉理工大学 Two-way U-Net deep neural network intelligent classification identification method for dangerous behaviors in mines
CN108111860B (en) * 2018-01-11 2020-04-14 安徽优思天成智能科技有限公司 Lost frame prediction and recovery method of video sequence based on deep residual network
CN108322685B (en) * 2018-01-12 2020-09-25 广州华多网络科技有限公司 Video frame insertion method, storage medium and terminal
CN108600655A (en) * 2018-04-12 2018-09-28 视缘(上海)智能科技有限公司 A kind of video image synthetic method and device
CN108600762B (en) * 2018-04-23 2020-05-15 中国科学技术大学 Progressive video frame generation method combining motion compensation and neural network algorithm
CN108830812B (en) * 2018-06-12 2021-08-31 福建帝视信息科技有限公司 Video high frame rate reproduction method based on grid structure deep learning
CN108810551B (en) * 2018-06-20 2021-01-12 Oppo(重庆)智能科技有限公司 Video frame prediction method, terminal and computer storage medium
CN108961236B (en) * 2018-06-29 2021-02-26 国信优易数据股份有限公司 Circuit board defect detection method and device
CN110780664A (en) * 2018-07-25 2020-02-11 格力电器(武汉)有限公司 Robot control method and device and sweeping robot
CN109379550B (en) * 2018-09-12 2020-04-17 上海交通大学 Convolutional neural network-based video frame rate up-conversion method and system
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN109120936A (en) * 2018-09-27 2019-01-01 贺禄元 A kind of coding/decoding method and device of video image
US10924525B2 (en) 2018-10-01 2021-02-16 Microsoft Technology Licensing, Llc Inducing higher input latency in multiplayer programs
CN109360436B (en) * 2018-11-02 2021-01-08 Oppo广东移动通信有限公司 Video generation method, terminal and storage medium
CN110163061B (en) * 2018-11-14 2023-04-07 腾讯科技(深圳)有限公司 Method, apparatus, device and computer readable medium for extracting video fingerprint
CN111371983A (en) * 2018-12-26 2020-07-03 清华大学 A kind of video online stabilization method and system
CN109922372B (en) * 2019-02-26 2021-10-12 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
JP7201073B2 (en) * 2019-04-01 2023-01-10 株式会社デンソー Information processing equipment
CN110636221A (en) * 2019-09-23 2019-12-31 天津天地人和企业管理咨询有限公司 System and method for super frame rate of sensor based on FPGA
CN112584158B (en) * 2019-09-30 2021-10-15 复旦大学 Video quality enhancement method and system
CN114730372A (en) * 2019-11-27 2022-07-08 Oppo广东移动通信有限公司 Method and apparatus for stylizing video, and storage medium
CN113630621B (en) 2020-05-08 2022-07-19 腾讯科技(深圳)有限公司 Video processing method, related device and storage medium
RU2747965C1 (en) * 2020-10-05 2021-05-18 Самсунг Электроникс Ко., Лтд. Frc occlusion processing with deep learning
US11889227B2 (en) 2020-10-05 2024-01-30 Samsung Electronics Co., Ltd. Occlusion processing for frame rate conversion using deep learning
CN113516050A (en) * 2021-05-19 2021-10-19 江苏奥易克斯汽车电子科技股份有限公司 Method and device for scene change detection based on deep learning
CN113420771B (en) * 2021-06-30 2024-04-19 扬州明晟新能源科技有限公司 Colored glass detection method based on feature fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202285412U (en) * 2011-09-02 2012-06-27 深圳市华美特科技有限公司 Low frame rate transmission or motion image twinkling elimination system
CN104102919A (en) * 2014-07-14 2014-10-15 同济大学 Image classification method capable of effectively preventing convolutional neural network from being overfit
CN105787510A (en) * 2016-02-26 2016-07-20 华东理工大学 System and method for realizing subway scene classification based on deep learning
CN106022237A (en) * 2016-05-13 2016-10-12 电子科技大学 Pedestrian detection method based on end-to-end convolutional neural network
CN106228124A (en) * 2016-07-17 2016-12-14 西安电子科技大学 SAR image object detection method based on convolutional neural networks

Also Published As

Publication number Publication date
CN106686472A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN106686472B (en) A method and system for generating high frame rate video based on deep learning
CN104217214B (en) RGB‑D Human Behavior Recognition Method Based on Configurable Convolutional Neural Network
Zhao et al. Real-time and light-weighted unsupervised video object segmentation network
Feng et al. SGANVO: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks
CN110110624B (en) Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN116012950B (en) Skeleton action recognition method based on multi-heart space-time attention pattern convolution network
CN110572696A (en) A Video Generation Method Combining Variational Autoencoders and Generative Adversarial Networks
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN108805083A (en) The video behavior detection method of single phase
CN107862376A (en) A kind of human body image action identification method based on double-current neutral net
CN111310707A (en) Skeleton-based method and system for recognizing attention network actions
CN110175951A (en) Video Style Transfer method based on time domain consistency constraint
Tan et al. Bidirectional long short-term memory with temporal dense sampling for human action recognition
CN109993820B (en) Automatic animation video generation method and device
CN110059598A (en) The Activity recognition method of the long time-histories speed network integration based on posture artis
CN112767519B (en) A controllable expression generation method combined with style transfer
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
CN110110686A (en) Based on the human motion recognition methods for losing double-current convolutional neural networks more
CN113570036B (en) Hardware Accelerator Architecture Supporting Dynamic Neural Network Sparse Models
CN119963954A (en) A method for fusion of infrared and visible light images based on diffusion model
CN119206865A (en) A lightweight behavior recognition method based on skeleton data
CN111160170B (en) Self-learning human behavior recognition and anomaly detection method
Huo et al. 3D skeleton aware driver behavior recognition framework for autonomous driving system
CN116109509A (en) Real-time low-illumination image enhancement method and system based on pixel-by-pixel gamma correction
WO2020001046A1 (en) Video prediction method based on adaptive hierarchical kinematic modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant