
CN112446233A - Action identification method and device based on multi-time scale reasoning - Google Patents

Action identification method and device based on multi-time scale reasoning

Info

Publication number
CN112446233A
CN112446233A
Authority
CN
China
Prior art keywords
video
frame
feature vector
time scale
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910799120.5A
Other languages
Chinese (zh)
Inventor
邹月娴
张粲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201910799120.5A
Publication of CN112446233A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses an action recognition method and device based on multi-time scale reasoning. The method includes the following steps: step 1, sparsely sample the video sequence according to a sampling coefficient N to obtain the original images of the sampling frames; step 2, process the original image of each sampling frame with a convolutional neural network to obtain a frame-level feature vector representing that frame; step 3, perform k time scale time sequence pooling on the frame-level feature vectors to obtain the video-level feature vector at the k-th time scale; step 4, send the video-level feature vectors at the different time scales into multi-layer perceptrons to obtain the action recognition confidence rate at each time scale; step 5, fuse the action recognition confidence rate results at the different time scales with an inference function to obtain the action recognition result of the video sequence. The invention can improve the accuracy of video action recognition through multi-time scale reasoning.


Description

Action identification method and device based on multi-time scale reasoning
Technical Field
The invention relates to a visual perception and artificial intelligence technology, in particular to an action recognition method and device based on multi-time scale reasoning.
Background
Video-based motion recognition technology mainly recognizes motion existing in an original picture sequence by processing the original picture sequence. Motion recognition is becoming an important research direction in the field of visual perception and artificial intelligence. Video-based motion recognition technology has many potential applications in real-world scenes, such as: abnormal behavior recognition, rehabilitation training, intelligent nursing, tumble detection and the like in video monitoring.
In recent years, owing to the continuous development of deep learning, deep neural networks have made great progress in image classification, even exceeding human recognition accuracy on the ImageNet data set of 1000 picture categories. At present, mainstream deep-learning-based video action recognition technologies fall mainly into two types: networks based on the two-stream architecture and networks based on three-dimensional convolution.
A two-stream network is divided into two branches, a spatial network and a temporal network. The input of the spatial network is the original RGB three-channel color image; the input of the temporal network is a stack of optical flow images, where each optical flow image contains both horizontal and vertical components. The two branches can be merged in several ways, for example feature fusion in which the last few layers share weights and bias parameters, or late fusion in which the two branches are trained separately and their final classification scores are fused. The advantage of this approach is high recognition accuracy; the disadvantage is that the temporal branch requires optical flow images as input, and extracting optical flow is time-consuming and occupies a large amount of storage space.
The input of a network based on three-dimensional convolution is the original RGB three-channel color image. Unlike traditional image recognition backbones, which process spatial information entirely with two-dimensional convolution, processing a video sequence requires down-sampling along the temporal dimension, so the convolution operations in the network include not only two-dimensional spatial convolutions but also convolutions over time; a neural network that performs convolution over time and space simultaneously is called a three-dimensional convolutional network. The advantages of this approach are end-to-end trainability and no need to extract optical flow images in advance as a temporal motion representation; the disadvantages are a huge number of parameters and low recognition accuracy.
Although deeper neural network models and larger-scale data sets have enabled rapid progress in action recognition, modeling temporal information remains a major challenge because of the complexity of video temporal information. Effectively modeling the temporal information in video is therefore very important for action recognition and other video-based intelligent visual perception tasks.
Disclosure of Invention
The invention provides an action recognition method and device based on multi-time scale reasoning, addressing the lack of multi-time scale modeling of temporal information in current mainstream action recognition methods. By performing multi-time scale reasoning on the frame-level high-level semantic information extracted by a two-dimensional convolutional neural network, the method can improve the accuracy of action recognition; only the original RGB color picture frames are needed as input, no pre-computed optical flow is needed as auxiliary motion information, and the running speed of the method and device meets the requirement of real-time action recognition in video. The technical scheme adopted by the invention is as follows:
a motion recognition method based on multi-time scale reasoning specifically comprises the following steps:
step 1, performing sparse sampling on a video sequence according to a sampling coefficient N to obtain an original image of a sampling frame;
step 2, processing the original image of each sampling frame by using a convolutional neural network to obtain a frame level feature vector for representing each sampling frame;
step 3, performing k time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the k time scale;
step 4, sending the video-level feature vectors under different time scales into a multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5, fusing the action recognition confidence rate results under different time scales by using a reasoning function to obtain an action recognition result of the video sequence.
Further, the sampling coefficient N in step 1 is a preset integer greater than 1. Sparse sampling is performed on the video sequence according to the sampling coefficient N: N frames in total are extracted from the video at equal intervals as sampling frames, and the original 3-channel RGB image of each frame is referred to as the original image of the sampling frame.
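As a concrete illustration of this sampling step, the following minimal sketch extracts N equally spaced RGB frames from a video file; the helper name and the use of OpenCV and NumPy are assumptions for illustration, not part of the patent.

```python
# Illustrative sketch of step 1 (equal-interval sparse sampling).
# The function name and the OpenCV/NumPy usage are assumptions, not the patent's own code.
import cv2
import numpy as np

def sparse_sample(video_path: str, n: int) -> np.ndarray:
    """Extract N frames at equal intervals from a video, returned as RGB images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=n).astype(int)   # N equally spaced frame indices
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to the 3-channel RGB original image of the sampling frame
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)                                   # shape: (N, H, W, 3)
```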
Further, the convolutional neural network in step 2 includes convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers, and the like; the input of the convolutional neural network is the original image of a sampling frame, and its output is a frame-level feature vector used as the spatial semantic representation of that frame.
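A frame-level feature extractor of this kind can be sketched as below; the choice of a torchvision ResNet-50 backbone and the resulting feature length D = 2048 are assumptions for illustration, since the patent specifies only a generic two-dimensional convolutional network.

```python
# Sketch of step 2: a 2-D CNN maps each sampling frame to a D-dimensional frame-level feature vector.
# Using a torchvision ResNet-50 is an assumption; any 2-D convolutional backbone with
# convolution, batch normalization, ReLU and pooling layers would fit the description.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Identity()           # drop the classifier head, keep the D = 2048 features
backbone.eval()

frames = torch.randn(8, 3, 224, 224)  # N = 8 sampled RGB frames (stand-in data)
with torch.no_grad():
    frame_features = backbone(frames)     # shape: (N, D) = (8, 2048)
```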
Further, the k time scale time sequence pooling in step 3 consists of a dilated (hole) max pooling and an average pooling operation; with the length of the frame-level feature vector set to D, the video-level feature vector obtained after k time scale time sequence pooling also has length D.
Further, the multi-layer perceptron in step 4 is a 2-layer fully-connected network; its input dimension equals the length D of the video-level feature vector, and its output dimension is the number of action categories c, representing the confidence rate that the video-level feature vector is judged as each of the c action categories.
Further, the inference function in step 5 is used to fuse the action recognition results under different time scales, so as to obtain the final recognition result of the video sequence.
The k time scale time sequence pooling specifically comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
after the k time scale time sequence pooling, k feature vectors of length D are obtained, expressed as the vector set F(k);
step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
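A minimal sketch of this k time scale time sequence pooling, assuming PyTorch, that N is divisible by k, and an illustrative function name, is given below.

```python
# Sketch of the k time scale time sequence pooling: dilated (hole) max pooling with
# kernel size k, dilation N/k and stride 1 along the time axis, followed by average
# pooling of the pooled vectors. PyTorch and the helper name are illustrative assumptions.
import torch
import torch.nn.functional as F

def timescale_pool(frame_features: torch.Tensor, k: int) -> torch.Tensor:
    """frame_features: (N, D) frame-level vectors -> V(k): length-D video-level vector."""
    n, d = frame_features.shape
    x = frame_features.t().unsqueeze(0)                       # (1, D, N): pool along time per channel
    # F(k) = dilated_maxpool(k, N/k, 1){f_1, ..., f_N}
    pooled = F.max_pool1d(x, kernel_size=k, stride=1, dilation=n // k)
    # V(k) = meanpool(F(k)): average the pooled vectors over the time axis
    return pooled.mean(dim=2).squeeze(0)                      # (D,)

features = torch.randn(8, 2048)        # N = 8 sampled frames, D = 2048 (stand-in values)
v2 = timescale_pool(features, k=2)     # video-level feature vector at time scale k = 2
```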
The specific steps for obtaining the action recognition confidence rate results at different time scales are as follows: for each time scale k there is a dedicated multi-layer perceptron m_k(x; W), where x is the input vector and W denotes the learnable parameters of the multi-layer perceptron; V(k) is sent into the multi-layer perceptron m_k(x; W) corresponding to the k-th time scale to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
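A sketch of such a per-scale classifier m_k(x; W) follows, assuming PyTorch; the hidden width of 512 and the class count of 101 are illustrative assumptions, since the patent fixes only the input length D and the output length c.

```python
# Sketch of the per-scale classifier m_k(x; W): a 2-layer fully-connected perceptron
# mapping the length-D video-level vector V(k) to c class confidence rates.
# The hidden width (512) and class count (101) are illustrative assumptions.
import torch
import torch.nn as nn

class ScaleClassifier(nn.Module):
    def __init__(self, d: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, v_k: torch.Tensor) -> torch.Tensor:
        return self.mlp(v_k)           # S(k): one confidence rate per action category

# one dedicated classifier per time scale, e.g. for the pyramid scales 1, 2, 4
classifiers = {k: ScaleClassifier(d=2048, num_classes=101) for k in (1, 2, 4)}
s2 = classifiers[2](torch.randn(2048))   # S(2) for a stand-in V(2)
```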
The specific steps of fusing the action recognition confidence rate results at different time scales using the inference function are as follows: the inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
Specifically, the action recognition result of the video sequence is obtained by taking the category with the maximum confidence rate in the confidence rate fusion result.
Specifically, the spatial resolution of the feature vector is down-sampled to 1 × 1, and the length is the number of channels D.
Specifically, the n time scales are selected in a pyramid increasing manner, that is, k = 2^(n-1), where n is a positive integer.
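A short sketch of this inference step, combining the pyramid scale selection, the fusion function, and the final arg-max, might look as follows; the weighted-sum form of I follows claim 2, and the class count is a stand-in.

```python
# Sketch of the multi-time scale inference: pyramid scales k = 2^(m-1), a weighted-sum
# inference function with weight coefficient k (per claim 2), and an arg-max over the
# fused confidence rates. Shapes and names are illustrative assumptions.
import torch

def infer_action(s_by_scale: dict) -> int:
    """s_by_scale maps time scale k -> S(k), a length-c confidence rate vector."""
    # VIP(n) = I(S(1), S(2), ..., S(2^(n-1))), here a weighted sum with weight k
    vip = sum(k * s_k for k, s_k in s_by_scale.items())
    return int(torch.argmax(vip))      # category with the maximum fused confidence rate

n = 3                                   # number of time scales
scales = [2 ** m for m in range(n)]     # pyramid time scales 1, 2, 4
confidences = {k: torch.randn(101) for k in scales}   # stand-in S(k) vectors, c = 101
predicted_class = infer_action(confidences)
```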
The invention also provides a motion recognition device based on multi-time scale reasoning, which can be used for motion recognition in video signals or image sequences. The technical scheme is as follows:
the device comprises a sparse sampling unit, a spatial semantic representation extraction unit, a multi-time scale time sequence pooling unit, a classification unit and an inference unit; the sparse sampling unit is used for performing sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames; the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame; the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale; the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales; and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain the action recognition result of the video sequence.
Specifically, the output of the sparse sampling unit is used as the input of the spatial semantic representation extraction unit; the output of the spatial semantic representation extraction unit is used as the input of a plurality of scale time sequence pooling units; the output of the specific time scale time sequence pooling unit is connected with a specific classification unit; the outputs of the plurality of classification units serve as the inputs of the inference unit.
Due to the adoption of the technical means, the invention has the following advantages and beneficial effects:
1. the input of the method is only the original 3-channel RGB color sampling frames; compared with the traditional two-stream network, there is no need to spend a large amount of extra computing resources and time in advance to compute optical flow images as input, so the real-time performance of the method is guaranteed; moreover, the whole network can be trained end to end, the tasks are more relevant, and the learning process focuses more on improving the accuracy of action recognition;
2. the convolution neural network part of the method adopts two-dimensional convolution, compared with the traditional three-dimensional convolution network, the parameter quantity is small, the occupied space of the final network model is small, and the method can be applied to embedded equipment;
3. the invention provides an action recognition method based on multi-time scale reasoning, which integrates confidence information of different time scales and utilizes a reasoning function to carry out reasoning fusion so as to obtain a final action recognition result; the method can fully mine the time sequence relation information in the video signal, effectively avoid the error judgment caused by single time scale identification, and improve the accuracy of action identification;
4. the method is highly flexible: several hyperparameters are left open for configuration, and a more scene-appropriate hyperparameter set can be selected according to the specific application scenario;
5. the device of the invention does not need an optical flow extraction part, so the hardware configuration requirement is low, the construction cost is low and the maintenance is easier.
Drawings
Fig. 1 shows a general flow chart of the method of the invention.
Fig. 2 shows a network architecture diagram of the method of the invention.
FIG. 3 shows a schematic diagram of a multi-time scale inference process of the method of the present invention.
Fig. 4 shows a schematic view of the device according to the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Fig. 1 is a general flowchart illustrating a method for recognizing an action based on multi-time scale inference according to an example, which specifically includes the following steps:
step 1: sparse sampling S1, performing sparse sampling on the video sequence according to the sampling coefficient N to obtain an original image of a sampling frame;
step 2: a convolutional neural network processing S2, which is to process the original image of each sampling frame by using a convolutional neural network to obtain a frame-level feature vector representing each sampling frame;
and step 3: pooling processing S3 of time sequences with different time scales, namely pooling processing of the frame-level feature vectors with the time scales of k to obtain video-level feature vectors under the kth time scale;
and 4, step 4: the multilayer perceptron obtains the action recognition confidence rate S4 related to the time scale, and sends the video-level feature vectors under different time scales into the multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5: and performing inference function fusion operation S5, fusing the action recognition confidence rate results under different time scales by using an inference function, and obtaining an action recognition result of the video sequence.
FIG. 2 is a network architecture diagram of the action recognition method based on multi-time scale reasoning according to an example, which clarifies the data dimensions after each operation. The data dimensions are written as "C × T × W × H", i.e. "number of channels × temporal length × spatial width × spatial height":
1 - Input: an original video sequence of length L consisting of 3-channel RGB color images, so the data dimension is 3 × L × W × H;
2 - Sparse sampling: the original images of N sampling frames are extracted from the original video sequence according to the sampling coefficient N, a preset integer greater than 1; N frames in total are taken from the video at equal intervals as the sampling frames, the original 3-channel RGB image of each frame is called the original image of the sampling frame, and the data dimension is 3 × N × W × H;
3 - Convolutional neural network: the original image of each sampling frame is processed by a convolutional neural network comprising convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers, and the like; the input of the network is the original image of a sampling frame and its output is a frame-level feature vector serving as the spatial semantic representation of that frame;
4 - Frame-level feature vectors representing each sampling frame: with the number of output channels of the last layer of the network set to D and the spatial dimension down-sampled to 1 × 1, the data dimension is D × N × 1;
5 - k time scale time sequence pooling: comprises dilated (hole) max pooling and average pooling; with the frame-level feature vector length set to D, the video-level feature vector after k time scale time sequence pooling also has length D, and the video-level feature vectors over n time scales have dimension D × n × 1;
6 - Inference function fuses the confidence rate results: each multi-layer perceptron is a 2-layer fully-connected network whose input dimension equals the video-level feature vector length D and whose output dimension is the number of action categories c, representing the confidence rate of the video-level feature vector for each of the c categories; the inference function fuses the action recognition results at different time scales; with c action categories, the fused confidence rate result has dimension c;
7 - The category with the highest confidence rate is selected as the final action recognition result.
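The dimension bookkeeping above can also be traced end to end in one short, self-contained sketch; the ResNet-50 backbone, the hidden width, and the weighted-sum fusion are illustrative assumptions carried over from the sketches earlier in this description.

```python
# Self-contained sketch tracing the shapes listed above for one video:
# N sampling frames (3 x N x W x H) -> frame features (D x N) -> V(k) per scale (D) ->
# S(k) per scale (c) -> fused confidence rates (c) -> predicted category.
# Backbone choice, hidden width and weighted-sum fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

N, D, c, scales = 8, 2048, 101, (1, 2, 4)

backbone = resnet50()
backbone.fc = nn.Identity()
backbone.eval()
heads = nn.ModuleDict({
    str(k): nn.Sequential(nn.Linear(D, 512), nn.ReLU(), nn.Linear(512, c))
    for k in scales
})

frames = torch.randn(N, 3, 224, 224)              # original images of the N sampling frames
with torch.no_grad():
    feats = backbone(frames)                      # (N, D) frame-level feature vectors
    x = feats.t().unsqueeze(0)                    # (1, D, N) for pooling along the time axis
    fused = torch.zeros(c)
    for k in scales:
        f_k = F.max_pool1d(x, kernel_size=k, stride=1, dilation=N // k)
        v_k = f_k.mean(dim=2).squeeze(0)          # V(k): length-D video-level feature vector
        s_k = heads[str(k)](v_k)                  # S(k): length-c confidence rates
        fused = fused + k * s_k                   # weighted-sum inference function
    prediction = int(fused.argmax())              # final action category
```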
Fig. 3 is a diagram illustrating a multi-timescale inference process according to an example, which specifically includes the following steps:
the k time scale time sequence pooling treatment specifically comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
k feature vectors of length D are then obtained after the k time scale time sequence pooling, expressed as the vector set F(k).
Step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
The specific steps for obtaining the action recognition confidence rate results under different time scales are as follows:
and 3, step 3: for each of said time scales k, there is a specific multi-layer perceptron mk(x; W); wherein, x is an input vector, and W is a parameter which can be learnt in the multilayer perceptron; sending the V (k) into a multi-layer perceptron m corresponding to the k time scalek(x; W) to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
The specific steps of fusing the action recognition confidence rate results under different time scales by using the inference function are as follows:
step 4: The inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
Fig. 4 is a schematic diagram illustrating a motion recognition apparatus based on multi-time scale inference, which may be used for motion recognition in a video signal or a sequence of images, according to an example. The technical scheme is as follows:
the device comprises a sparse sampling unit 1, a spatial semantic representation extraction unit 2, a multi-time scale time sequence pooling unit 3, a classification unit 4 and an inference unit 5; the sparse sampling unit is used for performing sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames; the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame; the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale; the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales; and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain the action recognition result of the video sequence.
Specifically, the output of the sparse sampling unit is used as the input of the spatial semantic representation extraction unit; the output of the spatial semantic representation extraction unit is used as the input of a plurality of scale time sequence pooling units; the output of the specific time scale time sequence pooling unit is connected with a specific classification unit; the outputs of the plurality of classification units serve as the inputs of the inference unit.
The foregoing examples are given solely for the purpose of illustrating the invention and are not to be construed as limiting its embodiments. Other variations and modifications will suggest themselves to those skilled in the art upon reading the foregoing description, and it is neither necessary nor possible to exhaustively enumerate all embodiments here; all such obvious variations and modifications are deemed to be within the scope of the invention.

Claims (9)

1. A motion recognition method based on multi-time scale reasoning comprises the following steps:
step 1, performing sparse sampling on a video sequence according to a sampling coefficient N to obtain an original image of a sampling frame;
step 2, processing the original image of each sampling frame by using a convolutional neural network to obtain a frame level feature vector for representing each sampling frame;
step 3, performing k time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the k time scale;
step 4, sending the video-level feature vectors under different time scales into a multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5, fusing the action recognition confidence rate results under different time scales by using a reasoning function to obtain an action recognition result of the video sequence.
2. The method of claim 1, wherein:
in step 1, the sampling coefficient N is a preset integer greater than 1; sparse sampling is performed on the video sequence according to the sampling coefficient N, N frames in total are extracted from the video at equal intervals as sampling frames, and the original 3-channel RGB image of each frame is called the original image of the sampling frame;
in step 2, the convolutional neural network comprises convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers and the like; the input of the convolutional neural network is the original image of a sampling frame, and its output is a frame-level feature vector used as the spatial semantic representation of that frame;
in step 3, the k-time scale time sequence pooling process comprises a hole maximum pooling and average pooling operation; setting the length of the frame-level feature vector as D, and keeping the length of the video-level feature vector after k time scale time sequence pooling as D;
in step 4, the multilayer perceptron is a 2-layer fully-connected network; its input dimension equals the length D of the video-level feature vector, and its output dimension is the number of action categories c, representing the confidence rate that the video-level feature vector is judged as each of the c action categories;
in step 5, the inference function is used for fusing action recognition results under different time scales so as to obtain a final recognition result of the video sequence; the inference function adopts a weighted sum function, and the weight coefficient is k.
3. The method according to claim 1 or 2, wherein the k-time scale time series pooling comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
after the k time scale time sequence pooling, k feature vectors of length D are obtained, expressed as the vector set F(k);
step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
4. The method according to claim 1 or 2, wherein the specific steps for obtaining the action recognition confidence rate results at different time scales are: for each time scale k there is a dedicated multi-layer perceptron m_k(x; W), where x is the input vector and W denotes the learnable parameters of the multi-layer perceptron; V(k) is sent into the multi-layer perceptron m_k(x; W) corresponding to the k-th time scale to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
5. The method according to claim 1 or 2, wherein the step of fusing the action recognition confidence rate results at different time scales using the inference function comprises: the inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
6. The method according to claim 1 or 2, wherein the action recognition result of the video sequence is obtained by taking the category with the highest confidence rate in the confidence rate fusion result.
7. The method of any of claims 1 to 4, wherein the spatial resolution of the feature vector is down-sampled to 1 x 1 and the length is the number of channels D.
8. The method of claim 5, wherein the n time scales are selected in a pyramid increasing manner, i.e., k = 2^(n-1), where n is a positive integer.
9. An action recognition device based on multi-time scale reasoning, comprising:
the sparse sampling unit is used for carrying out sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames;
the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame;
the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale;
the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales;
and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain an action recognition result of the video sequence.
CN201910799120.5A 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning Pending CN112446233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910799120.5A CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910799120.5A CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Publications (1)

Publication Number Publication Date
CN112446233A true CN112446233A (en) 2021-03-05

Family

ID=74741788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910799120.5A Pending CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Country Status (1)

Country Link
CN (1) CN112446233A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20200125852A1 (en) * 2017-05-15 2020-04-23 Deepmind Technologies Limited Action recognition in videos using 3d spatio-temporal convolutional neural networks
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 Video Behavior Recognition Method and System Based on Hierarchical Dynamic Depth Projection Difference Image Representation
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李艳荻; 徐熙平: "Human action recognition algorithm based on decision-level fusion of spatial-temporal domain features" (基于空-时域特征决策级融合的人体行为识别算法), 光学学报 (Acta Optica Sinica), no. 08, 28 March 2018 (2018-03-28) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221884B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN115393775A (en) * 2022-09-14 2022-11-25 西安理工大学 Multi-scale feature fusion behavior identification method

Similar Documents

Publication Publication Date Title
CN113688723B (en) A pedestrian target detection method in infrared images based on improved YOLOv5
CN109446923B (en) Deeply supervised convolutional neural network behavior recognition method based on training feature fusion
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN106991372B (en) Dynamic gesture recognition method based on mixed deep learning model
CN104281853B (en) A kind of Activity recognition method based on 3D convolutional neural networks
CN106709461B (en) Activity recognition method and device based on video
CN105787458B (en) The infrared behavior recognition methods adaptively merged based on artificial design features and deep learning feature
Fu et al. Fast crowd density estimation with convolutional neural networks
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN104992223B (en) Intensive population estimation method based on deep learning
CN114220154B (en) A method for micro-expression feature extraction and recognition based on deep learning
CN110490136B (en) Knowledge distillation-based human behavior prediction method
CN110110624A (en) A kind of Human bodys' response method based on DenseNet network and the input of frame difference method feature
CN112766062B (en) Human behavior identification method based on double-current deep neural network
CN106897738A (en) A kind of pedestrian detection method based on semi-supervised learning
CN107657204A (en) The construction method and facial expression recognizing method and system of deep layer network model
CN110188653A (en) Behavior recognition method based on local feature aggregation coding and long short-term memory network
CN110110648A (en) Method is nominated in view-based access control model perception and the movement of artificial intelligence
CN113065645A (en) Twin attention network, image processing method and device
CN113255394B (en) Pedestrian re-identification method and system based on unsupervised learning
CN113688761A (en) Pedestrian behavior category detection method based on image sequence
CN108647599A (en) In conjunction with the Human bodys' response method of 3D spring layers connection and Recognition with Recurrent Neural Network
CN109829414A (en) A kind of recognition methods again of the pedestrian based on label uncertainty and human body component model
CN117351542A (en) Facial expression recognition method and system
CN112560668A (en) Human behavior identification method based on scene prior knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210305