
CN112446233A - Action identification method and device based on multi-time scale reasoning - Google Patents

Action identification method and device based on multi-time scale reasoning

Info

Publication number
CN112446233A
CN112446233A
Authority
CN
China
Prior art keywords
video
frame
feature vector
time scale
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910799120.5A
Other languages
Chinese (zh)
Inventor
邹月娴
张粲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201910799120.5A
Publication of CN112446233A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses an action recognition method and device based on multi-time scale reasoning. The method includes the following steps: step 1, sparsely sample the video sequence according to a sampling coefficient N to obtain the original images of the sampling frames; step 2, process the original image of each sampling frame with a convolutional neural network to obtain a frame-level feature vector representing that frame; step 3, perform k time scale time sequence pooling on the frame-level feature vectors to obtain the video-level feature vector at the k-th time scale; step 4, send the video-level feature vectors at the different time scales into multi-layer perceptrons to obtain the action recognition confidence rate at each time scale; step 5, fuse the action recognition confidence rate results at the different time scales with an inference function to obtain the action recognition result of the video sequence. The invention can improve the accuracy of video action recognition through multi-time scale reasoning.


Description

Action identification method and device based on multi-time scale reasoning
Technical Field
The invention relates to a visual perception and artificial intelligence technology, in particular to an action recognition method and device based on multi-time scale reasoning.
Background
Video-based motion recognition technology mainly recognizes motion existing in an original picture sequence by processing the original picture sequence. Motion recognition is becoming an important research direction in the field of visual perception and artificial intelligence. Video-based motion recognition technology has many potential applications in real-world scenes, such as: abnormal behavior recognition, rehabilitation training, intelligent nursing, tumble detection and the like in video monitoring.
In recent years, owing to the continuous development of deep learning, deep neural networks have made great progress in image classification, even exceeding human recognition accuracy on the ImageNet data set of 1000 picture categories. At present, mainstream deep-learning-based video action recognition technologies fall mainly into two types: networks based on the two-stream architecture and networks based on three-dimensional convolution.
A two-stream network is divided into two branches, a spatial network and a temporal network. The input of the spatial network is the original RGB three-channel color image; the input of the temporal network is a stack of optical flow images, where each optical flow image contains both horizontal and vertical components. The two branches can be merged in several ways, for example feature fusion in which the last few layers share weights and bias parameters, or late fusion in which the two branches are trained separately and their final classification scores are fused. The advantage of this approach is high recognition accuracy; the disadvantage is that the temporal branch requires optical flow images as input, and extracting optical flow is time-consuming and occupies a large amount of storage space.
The input of a network based on three-dimensional convolution is the original RGB three-channel color image. Unlike traditional image recognition backbones, which process spatial information entirely with two-dimensional convolution, processing a video sequence requires down-sampling along the temporal dimension, so the convolution operations in the network include not only two-dimensional spatial convolutions but also convolutions over time; a neural network that performs convolution over time and space simultaneously is called a three-dimensional convolutional network. The advantages of this approach are end-to-end trainability and no need to extract optical flow images in advance as a temporal motion representation; the disadvantages are a huge number of parameters and low recognition accuracy.
Although deeper neural network models and larger-scale data sets have enabled rapid progress in action recognition, modeling temporal information remains a major challenge because of the complexity of video temporal information. Effectively modeling the temporal information in video is therefore very important for action recognition and other video-based intelligent visual perception tasks.
Disclosure of Invention
The invention provides an action recognition method and device based on multi-time scale reasoning, addressing the lack of multi-time scale modeling of temporal information in current mainstream action recognition methods. By performing multi-time scale reasoning on the frame-level high-level semantic information extracted by a two-dimensional convolutional neural network, the method can improve the accuracy of action recognition; only the original RGB color picture frames are needed as input, no pre-computed optical flow is needed as auxiliary motion information, and the running speed of the method and device meets the requirement of real-time action recognition in video. The technical scheme adopted by the invention is as follows:
a motion recognition method based on multi-time scale reasoning specifically comprises the following steps:
step 1, performing sparse sampling on a video sequence according to a sampling coefficient N to obtain an original image of a sampling frame;
step 2, processing the original image of each sampling frame by using a convolutional neural network to obtain a frame level feature vector for representing each sampling frame;
step 3, performing k time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the k time scale;
step 4, sending the video-level feature vectors under different time scales into a multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5, fusing the action recognition confidence rate results under different time scales by using a reasoning function to obtain an action recognition result of the video sequence.
Further, the sampling coefficient N in step 1 is a preset integer greater than 1. Sparse sampling is performed on the video sequence according to the sampling coefficient N: N frames in total are extracted from the video at equal intervals as sampling frames, and the original 3-channel RGB image of each frame is referred to as the original image of the sampling frame.
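As a concrete illustration of this sampling step, the following minimal sketch extracts N equally spaced RGB frames from a video file; the helper name and the use of OpenCV and NumPy are assumptions for illustration, not part of the patent.

```python
# Illustrative sketch of step 1 (equal-interval sparse sampling).
# The function name and the OpenCV/NumPy usage are assumptions, not the patent's own code.
import cv2
import numpy as np

def sparse_sample(video_path: str, n: int) -> np.ndarray:
    """Extract N frames at equal intervals from a video, returned as RGB images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=n).astype(int)   # N equally spaced frame indices
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to the 3-channel RGB original image of the sampling frame
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)                                   # shape: (N, H, W, 3)
```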
Further, the convolutional neural network in step 2 includes convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers, and the like; the input of the convolutional neural network is the original image of a sampling frame, and its output is a frame-level feature vector used as the spatial semantic representation of that frame.
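A frame-level feature extractor of this kind can be sketched as below; the choice of a torchvision ResNet-50 backbone and the resulting feature length D = 2048 are assumptions for illustration, since the patent specifies only a generic two-dimensional convolutional network.

```python
# Sketch of step 2: a 2-D CNN maps each sampling frame to a D-dimensional frame-level feature vector.
# Using a torchvision ResNet-50 is an assumption; any 2-D convolutional backbone with
# convolution, batch normalization, ReLU and pooling layers would fit the description.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Identity()           # drop the classifier head, keep the D = 2048 features
backbone.eval()

frames = torch.randn(8, 3, 224, 224)  # N = 8 sampled RGB frames (stand-in data)
with torch.no_grad():
    frame_features = backbone(frames)     # shape: (N, D) = (8, 2048)
```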
Further, the k time scale time sequence pooling in step 3 consists of a dilated (hole) max pooling and an average pooling operation; with the length of the frame-level feature vector set to D, the video-level feature vector obtained after k time scale time sequence pooling also has length D.
Further, the multi-layer perceptron in step 4 is a 2-layer fully-connected network; its input dimension equals the length D of the video-level feature vector, and its output dimension is the number of action categories c, representing the confidence rate that the video-level feature vector is judged as each of the c action categories.
Further, the inference function in step 5 is used to fuse the action recognition results under different time scales, so as to obtain the final recognition result of the video sequence.
The k time scale time sequence pooling specifically comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
after the k time scale time sequence pooling, k feature vectors of length D are obtained, expressed as the vector set F(k);
step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
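A minimal sketch of this k time scale time sequence pooling, assuming PyTorch, that N is divisible by k, and an illustrative function name, is given below.

```python
# Sketch of the k time scale time sequence pooling: dilated (hole) max pooling with
# kernel size k, dilation N/k and stride 1 along the time axis, followed by average
# pooling of the pooled vectors. PyTorch and the helper name are illustrative assumptions.
import torch
import torch.nn.functional as F

def timescale_pool(frame_features: torch.Tensor, k: int) -> torch.Tensor:
    """frame_features: (N, D) frame-level vectors -> V(k): length-D video-level vector."""
    n, d = frame_features.shape
    x = frame_features.t().unsqueeze(0)                       # (1, D, N): pool along time per channel
    # F(k) = dilated_maxpool(k, N/k, 1){f_1, ..., f_N}
    pooled = F.max_pool1d(x, kernel_size=k, stride=1, dilation=n // k)
    # V(k) = meanpool(F(k)): average the pooled vectors over the time axis
    return pooled.mean(dim=2).squeeze(0)                      # (D,)

features = torch.randn(8, 2048)        # N = 8 sampled frames, D = 2048 (stand-in values)
v2 = timescale_pool(features, k=2)     # video-level feature vector at time scale k = 2
```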
The specific steps for obtaining the action recognition confidence rate results at different time scales are as follows: for each time scale k there is a dedicated multi-layer perceptron m_k(x; W), where x is the input vector and W denotes the learnable parameters of the multi-layer perceptron; V(k) is sent into the multi-layer perceptron m_k(x; W) corresponding to the k-th time scale to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
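A sketch of such a per-scale classifier m_k(x; W) follows, assuming PyTorch; the hidden width of 512 and the class count of 101 are illustrative assumptions, since the patent fixes only the input length D and the output length c.

```python
# Sketch of the per-scale classifier m_k(x; W): a 2-layer fully-connected perceptron
# mapping the length-D video-level vector V(k) to c class confidence rates.
# The hidden width (512) and class count (101) are illustrative assumptions.
import torch
import torch.nn as nn

class ScaleClassifier(nn.Module):
    def __init__(self, d: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, v_k: torch.Tensor) -> torch.Tensor:
        return self.mlp(v_k)           # S(k): one confidence rate per action category

# one dedicated classifier per time scale, e.g. for the pyramid scales 1, 2, 4
classifiers = {k: ScaleClassifier(d=2048, num_classes=101) for k in (1, 2, 4)}
s2 = classifiers[2](torch.randn(2048))   # S(2) for a stand-in V(2)
```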
The specific steps of fusing the action recognition confidence rate results at different time scales using the inference function are as follows: the inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
Specifically, the action recognition result of the video sequence is obtained by taking the category with the maximum confidence rate in the confidence rate fusion result.
Specifically, the spatial resolution of the feature vector is down-sampled to 1 × 1, and the length is the number of channels D.
Specifically, the n time scales are selected in a pyramid increasing manner, that is, k = 2^(n-1), where n is a positive integer.
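A short sketch of this inference step, combining the pyramid scale selection, the fusion function, and the final arg-max, might look as follows; the weighted-sum form of I follows claim 2, and the class count is a stand-in.

```python
# Sketch of the multi-time scale inference: pyramid scales k = 2^(m-1), a weighted-sum
# inference function with weight coefficient k (per claim 2), and an arg-max over the
# fused confidence rates. Shapes and names are illustrative assumptions.
import torch

def infer_action(s_by_scale: dict) -> int:
    """s_by_scale maps time scale k -> S(k), a length-c confidence rate vector."""
    # VIP(n) = I(S(1), S(2), ..., S(2^(n-1))), here a weighted sum with weight k
    vip = sum(k * s_k for k, s_k in s_by_scale.items())
    return int(torch.argmax(vip))      # category with the maximum fused confidence rate

n = 3                                   # number of time scales
scales = [2 ** m for m in range(n)]     # pyramid time scales 1, 2, 4
confidences = {k: torch.randn(101) for k in scales}   # stand-in S(k) vectors, c = 101
predicted_class = infer_action(confidences)
```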
The invention also provides a motion recognition device based on multi-time scale reasoning, which can be used for motion recognition in video signals or image sequences. The technical scheme is as follows:
the device comprises a sparse sampling unit, a spatial semantic representation extraction unit, a multi-time scale time sequence pooling unit, a classification unit and an inference unit; the sparse sampling unit is used for performing sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames; the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame; the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale; the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales; and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain the action recognition result of the video sequence.
Specifically, the output of the sparse sampling unit is used as the input of the spatial semantic representation extraction unit; the output of the spatial semantic representation extraction unit is used as the input of a plurality of scale time sequence pooling units; the output of the specific time scale time sequence pooling unit is connected with a specific classification unit; the outputs of the plurality of classification units serve as the inputs of the inference unit.
Due to the adoption of the technical means, the invention has the following advantages and beneficial effects:
1. the input of the method is only the original 3-channel RGB color sampling frames; compared with the traditional two-stream network, there is no need to spend a large amount of extra computing resources and time in advance to compute optical flow images as input, so the real-time performance of the method is guaranteed; moreover, the whole network can be trained end to end, the tasks are more relevant, and the learning process focuses more on improving the accuracy of action recognition;
2. the convolution neural network part of the method adopts two-dimensional convolution, compared with the traditional three-dimensional convolution network, the parameter quantity is small, the occupied space of the final network model is small, and the method can be applied to embedded equipment;
3. the invention provides an action recognition method based on multi-time scale reasoning, which integrates confidence information of different time scales and utilizes a reasoning function to carry out reasoning fusion so as to obtain a final action recognition result; the method can fully mine the time sequence relation information in the video signal, effectively avoid the error judgment caused by single time scale identification, and improve the accuracy of action identification;
4. the method is highly flexible: several hyperparameters are left open for configuration, and a more scene-appropriate hyperparameter set can be selected according to the specific application scenario;
5. the device of the invention does not need an optical flow extraction part, so the hardware configuration requirement is low, the construction cost is low and the maintenance is easier.
Drawings
Fig. 1 shows a general flow chart of the method of the invention.
Fig. 2 shows a network architecture diagram of the method of the invention.
FIG. 3 shows a schematic diagram of a multi-time scale inference process of the method of the present invention.
Fig. 4 shows a schematic view of the device according to the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Fig. 1 is a general flowchart illustrating a method for recognizing an action based on multi-time scale inference according to an example, which specifically includes the following steps:
step 1: sparse sampling S1, performing sparse sampling on the video sequence according to the sampling coefficient N to obtain an original image of a sampling frame;
step 2: a convolutional neural network processing S2, which is to process the original image of each sampling frame by using a convolutional neural network to obtain a frame-level feature vector representing each sampling frame;
and step 3: pooling processing S3 of time sequences with different time scales, namely pooling processing of the frame-level feature vectors with the time scales of k to obtain video-level feature vectors under the kth time scale;
and 4, step 4: the multilayer perceptron obtains the action recognition confidence rate S4 related to the time scale, and sends the video-level feature vectors under different time scales into the multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5: and performing inference function fusion operation S5, fusing the action recognition confidence rate results under different time scales by using an inference function, and obtaining an action recognition result of the video sequence.
FIG. 2 is a network architecture diagram of the action recognition method based on multi-time scale reasoning according to an example, which clarifies the data dimensions after each operation. The data dimensions are written as "C × T × W × H", i.e. "number of channels × temporal length × spatial width × spatial height":
1 - Input: an original video sequence of length L consisting of 3-channel RGB color images, so the data dimension is 3 × L × W × H;
2 - Sparse sampling: the original images of N sampling frames are extracted from the original video sequence according to the sampling coefficient N, a preset integer greater than 1; N frames in total are taken from the video at equal intervals as the sampling frames, the original 3-channel RGB image of each frame is called the original image of the sampling frame, and the data dimension is 3 × N × W × H;
3 - Convolutional neural network: the original image of each sampling frame is processed by a convolutional neural network comprising convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers, and the like; the input of the network is the original image of a sampling frame and its output is a frame-level feature vector serving as the spatial semantic representation of that frame;
4 - Frame-level feature vectors representing each sampling frame: with the number of output channels of the last layer of the network set to D and the spatial dimension down-sampled to 1 × 1, the data dimension is D × N × 1;
5 - k time scale time sequence pooling: comprises dilated (hole) max pooling and average pooling; with the frame-level feature vector length set to D, the video-level feature vector after k time scale time sequence pooling also has length D, and the video-level feature vectors over n time scales have dimension D × n × 1;
6 - Inference function fuses the confidence rate results: each multi-layer perceptron is a 2-layer fully-connected network whose input dimension equals the video-level feature vector length D and whose output dimension is the number of action categories c, representing the confidence rate of the video-level feature vector for each of the c categories; the inference function fuses the action recognition results at different time scales; with c action categories, the fused confidence rate result has dimension c;
7 - The category with the highest confidence rate is selected as the final action recognition result.
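The dimension bookkeeping above can also be traced end to end in one short, self-contained sketch; the ResNet-50 backbone, the hidden width, and the weighted-sum fusion are illustrative assumptions carried over from the sketches earlier in this description.

```python
# Self-contained sketch tracing the shapes listed above for one video:
# N sampling frames (3 x N x W x H) -> frame features (D x N) -> V(k) per scale (D) ->
# S(k) per scale (c) -> fused confidence rates (c) -> predicted category.
# Backbone choice, hidden width and weighted-sum fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

N, D, c, scales = 8, 2048, 101, (1, 2, 4)

backbone = resnet50()
backbone.fc = nn.Identity()
backbone.eval()
heads = nn.ModuleDict({
    str(k): nn.Sequential(nn.Linear(D, 512), nn.ReLU(), nn.Linear(512, c))
    for k in scales
})

frames = torch.randn(N, 3, 224, 224)              # original images of the N sampling frames
with torch.no_grad():
    feats = backbone(frames)                      # (N, D) frame-level feature vectors
    x = feats.t().unsqueeze(0)                    # (1, D, N) for pooling along the time axis
    fused = torch.zeros(c)
    for k in scales:
        f_k = F.max_pool1d(x, kernel_size=k, stride=1, dilation=N // k)
        v_k = f_k.mean(dim=2).squeeze(0)          # V(k): length-D video-level feature vector
        s_k = heads[str(k)](v_k)                  # S(k): length-c confidence rates
        fused = fused + k * s_k                   # weighted-sum inference function
    prediction = int(fused.argmax())              # final action category
```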
Fig. 3 is a diagram illustrating a multi-timescale inference process according to an example, which specifically includes the following steps:
the k time scale time sequence pooling treatment specifically comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
k feature vectors of length D are then obtained after the k time scale time sequence pooling, expressed as the vector set F(k).
Step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
The specific steps for obtaining the action recognition confidence rate results under different time scales are as follows:
and 3, step 3: for each of said time scales k, there is a specific multi-layer perceptron mk(x; W); wherein, x is an input vector, and W is a parameter which can be learnt in the multilayer perceptron; sending the V (k) into a multi-layer perceptron m corresponding to the k time scalek(x; W) to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
The specific steps of fusing the action recognition confidence rate results under different time scales by using the inference function are as follows:
step 4: The inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
Fig. 4 is a schematic diagram illustrating a motion recognition apparatus based on multi-time scale inference, which may be used for motion recognition in a video signal or a sequence of images, according to an example. The technical scheme is as follows:
the device comprises a sparse sampling unit 1, a spatial semantic representation extraction unit 2, a multi-time scale time sequence pooling unit 3, a classification unit 4 and an inference unit 5; the sparse sampling unit is used for performing sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames; the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame; the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale; the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales; and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain the action recognition result of the video sequence.
Specifically, the output of the sparse sampling unit is used as the input of the spatial semantic representation extraction unit; the output of the spatial semantic representation extraction unit is used as the input of a plurality of scale time sequence pooling units; the output of the specific time scale time sequence pooling unit is connected with a specific classification unit; the outputs of the plurality of classification units serve as the inputs of the inference unit.
The foregoing examples are given solely for the purpose of illustrating the invention and are not to be construed as limiting its embodiments. Other variations and modifications will suggest themselves to those skilled in the art upon reading the foregoing description, and it is neither necessary nor possible to exhaustively enumerate all embodiments here; all such obvious variations and modifications are deemed to be within the scope of the invention.

Claims (9)

1. A motion recognition method based on multi-time scale reasoning comprises the following steps:
step 1, performing sparse sampling on a video sequence according to a sampling coefficient N to obtain an original image of a sampling frame;
step 2, processing the original image of each sampling frame by using a convolutional neural network to obtain a frame level feature vector for representing each sampling frame;
step 3, performing k time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the k time scale;
step 4, sending the video-level feature vectors under different time scales into a multilayer perceptron to obtain the action recognition confidence rate under the time scale;
and 5, fusing the action recognition confidence rate results under different time scales by using a reasoning function to obtain an action recognition result of the video sequence.
2. The method of claim 1, wherein:
in step 1, the sampling coefficient N is a preset integer greater than 1; sparse sampling is performed on the video sequence according to the sampling coefficient N, N frames in total are extracted from the video at equal intervals as sampling frames, and the original 3-channel RGB image of each frame is called the original image of the sampling frame;
in step 2, the convolutional neural network comprises convolutional layers, batch normalization layers, ReLU layers, Concat layers, pooling layers and the like; the input of the convolutional neural network is the original image of a sampling frame, and its output is a frame-level feature vector used as the spatial semantic representation of that frame;
in step 3, the k-time scale time sequence pooling process comprises a hole maximum pooling and average pooling operation; setting the length of the frame-level feature vector as D, and keeping the length of the video-level feature vector after k time scale time sequence pooling as D;
in step 4, the multilayer perceptron is a 2-layer fully-connected network; its input dimension equals the length D of the video-level feature vector, and its output dimension is the number of action categories c, representing the confidence rate that the video-level feature vector is judged as each of the c action categories;
in step 5, the inference function is used for fusing action recognition results under different time scales so as to obtain a final recognition result of the video sequence; the inference function adopts a weighted sum function, and the weight coefficient is k.
3. The method according to claim 1 or 2, wherein the k-time scale time series pooling comprises the following steps:
step 1: setting the sampling coefficient to be N, and the frame-level feature vector output by the convolutional neural network to be: { f1,f2,…,fN}; wherein f isiA vector length of (i =1,2, …, N) is D; firstly, performing the maximum value pooling on N frame-level feature vectors in time sequence, where the kernel size is k, the hole coefficient is N/k, the step size is 1, and may be represented as:
F(k) = dilated_maxpool(k, N/k, 1){f_1, f_2, …, f_N};
after the k time scale time sequence pooling, k feature vectors of length D are obtained, expressed as the vector set F(k);
step 2: performing a time-sequence average pooling on the k length-D eigenvectors in f (k), which can be expressed as:
V(k)=meanpool(F(k));
V(k) is a single feature vector of length D, which is the video-level feature vector at the k-th time scale.
4. The method according to claim 1 or 2, wherein the specific steps for obtaining the action recognition confidence rate results at different time scales are: for each time scale k there is a dedicated multi-layer perceptron m_k(x; W), where x is the input vector and W denotes the learnable parameters of the multi-layer perceptron; V(k) is sent into the multi-layer perceptron m_k(x; W) corresponding to the k-th time scale to obtain:
S(k) = m_k(V(k); W);
The dimension of S(k) is consistent with the action category number c, and S(k) represents the action recognition confidence rate at the k-th time scale.
5. The method according to claim 1 or 2, wherein the step of fusing the action recognition confidence rate results at different time scales using the inference function comprises: the inference function I(x) fuses the confidence rates at different time scales to obtain the final video-level confidence rate result; if n different time-scale confidence rate results are to be fused, the n-level time-scale confidence rate fusion can be expressed as:
VIP(n) = I(S(1), S(2), …, S(2^(n-1)));
VIP(n) is the action recognition confidence rate fusion result over the n time scales, and the dimension of the fusion result is consistent with the action category number c.
6. The method according to claim 1 or 2, wherein the action recognition result of the video sequence is obtained by taking the category with the highest confidence rate in the confidence rate fusion result.
7. The method of any of claims 1 to 4, wherein the spatial resolution of the feature vector is down-sampled to 1 x 1 and the length is the number of channels D.
8. The method of claim 5, wherein the n time scales are selected in a pyramid increasing manner, i.e., k = 2^(n-1), where n is a positive integer.
9. An action recognition device based on multi-time scale reasoning, comprising:
the sparse sampling unit is used for carrying out sparse sampling processing on the video sequence to obtain original images of a plurality of sampling frames;
the spatial semantic representation extraction unit is used for carrying out abstraction processing on the sampling frames by utilizing a convolutional neural network to obtain the frame-level feature vector for representing each sampling frame;
the multi-time scale time sequence pooling unit is used for performing multi-time scale time sequence pooling processing on the frame level feature vector to obtain a video level feature vector under the specific time scale;
the classification unit is used for classifying the video-level feature vectors under different time scales to obtain the action recognition confidence rates under different time scales;
and the reasoning unit is used for reasoning the action recognition confidence rates under different time scales to obtain an action recognition result of the video sequence.
CN201910799120.5A 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning Pending CN112446233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910799120.5A CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910799120.5A CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Publications (1)

Publication Number Publication Date
CN112446233A true CN112446233A (en) 2021-03-05

Family

ID=74741788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910799120.5A Pending CN112446233A (en) 2019-08-28 2019-08-28 Action identification method and device based on multi-time scale reasoning

Country Status (1)

Country Link
CN (1) CN112446233A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20200125852A1 (en) * 2017-05-15 2020-04-23 Deepmind Technologies Limited Action recognition in videos using 3d spatio-temporal convolutional neural networks
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 Video Behavior Recognition Method and System Based on Hierarchical Dynamic Depth Projection Difference Image Representation
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李艳荻; 徐熙平: "Human action recognition algorithm based on decision-level fusion of spatial-temporal domain features" (基于空-时域特征决策级融合的人体行为识别算法), 光学学报 (Acta Optica Sinica), no. 08, 28 March 2018 (2018-03-28) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221884B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN115393775A (en) * 2022-09-14 2022-11-25 西安理工大学 Multi-scale feature fusion behavior identification method

Similar Documents

Publication Publication Date Title
CN113688723B (en) A pedestrian target detection method in infrared images based on improved YOLOv5
CN109446923B (en) Deeply supervised convolutional neural network behavior recognition method based on training feature fusion
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN106991372B (en) Dynamic gesture recognition method based on mixed deep learning model
CN104281853B (en) A kind of Activity recognition method based on 3D convolutional neural networks
CN106709461B (en) Activity recognition method and device based on video
CN105787458B (en) The infrared behavior recognition methods adaptively merged based on artificial design features and deep learning feature
Fu et al. Fast crowd density estimation with convolutional neural networks
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN104992223B (en) Intensive population estimation method based on deep learning
CN114220154B (en) A method for micro-expression feature extraction and recognition based on deep learning
CN110490136B (en) Knowledge distillation-based human behavior prediction method
CN110110624A (en) A kind of Human bodys' response method based on DenseNet network and the input of frame difference method feature
CN112766062B (en) Human behavior identification method based on double-current deep neural network
CN106897738A (en) A kind of pedestrian detection method based on semi-supervised learning
CN107657204A (en) The construction method and facial expression recognizing method and system of deep layer network model
CN110188653A (en) Behavior recognition method based on local feature aggregation coding and long short-term memory network
CN110110648A (en) Method is nominated in view-based access control model perception and the movement of artificial intelligence
CN113065645A (en) Twin attention network, image processing method and device
CN113255394B (en) Pedestrian re-identification method and system based on unsupervised learning
CN113688761A (en) Pedestrian behavior category detection method based on image sequence
CN108647599A (en) In conjunction with the Human bodys' response method of 3D spring layers connection and Recognition with Recurrent Neural Network
CN109829414A (en) A kind of recognition methods again of the pedestrian based on label uncertainty and human body component model
CN117351542A (en) Facial expression recognition method and system
CN112560668A (en) Human behavior identification method based on scene prior knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210305