Disclosure of Invention
The present disclosure provides a video classification method, apparatus, electronic device, storage medium, and program product for solving at least one of the above problems.
According to a first aspect of the embodiments of the present disclosure, a video classification method is provided, which includes: obtaining a target video frame sequence; performing tree sampling on the target video frame sequence to obtain a video frame sequence and a key frame with a two-layer structure; performing feature extraction processing on the video frame sequence based on a convolutional neural network model to obtain a time sequence feature; performing feature extraction processing on the key frame based on an impulse neural network model to obtain an impulse feature; performing fusion processing on the time sequence feature and the impulse feature to obtain a video fusion feature; and performing classification processing according to the video fusion feature to obtain category information of the target video frame sequence.
The convolutional neural network model comprises a first convolution layer and a second convolution layer, wherein performing the feature extraction processing on the video frame sequence based on the convolutional neural network model to obtain the time sequence feature comprises: performing gray scale processing on the video frame sequence to obtain a gray scale video frame sequence; performing feature extraction processing on the gray scale video frame sequence based on the first convolution layer to obtain a short-term time sequence feature; and performing feature extraction processing on the short-term time sequence feature based on the second convolution layer to obtain a long-term time sequence feature serving as the time sequence feature.
Optionally, performing the feature extraction processing on the key frame based on the impulse neural network model to obtain the impulse feature comprises: performing pulse coding processing on the key frame to obtain a key frame pulse sequence; and performing dynamic feature extraction processing on the key frame pulse sequence based on the impulse neural network model to obtain the impulse feature.
Optionally, the convolutional neural network model performs nonlinear data operations using a ReLU activation function, and the impulse neural network model performs nonlinear data operations using LIF neurons as its activation function.
The convolutional neural network model and the impulse neural network model are obtained through training as follows: a sample video frame sequence and sample category information are obtained; tree sampling is performed on the sample video frame sequence to obtain a sampling video frame sequence and a sample key frame with a two-layer structure; feature extraction processing is performed on the sampling video frame sequence based on the convolutional neural network model to be trained to obtain sample time sequence features; feature extraction processing is performed on the sample key frame based on the impulse neural network model to be trained to obtain sample impulse features; fusion processing is performed on the sample time sequence features and the sample impulse features to obtain a sample video fusion feature; classification processing is performed according to the sample video fusion feature to obtain prediction category information of the sample video frame sequence; a loss value between the prediction category information and the sample category information is determined based on a loss function; and back propagation updating is performed on the convolutional neural network model to be trained and the impulse neural network model to be trained by means of the loss value, so as to obtain the convolutional neural network model and the impulse neural network model.
Optionally, the sample video frame sequence is a pre-processed video frame sequence, the pre-processing including at least one of a data augmentation process, a resizing process, a normalization process, and an outlier rejection process.
Optionally, the back propagation update of the impulse neural network model to be trained uses a surrogate gradient in which α is the learning rate and x is the loss value.
According to a second aspect of the embodiments of the present disclosure, there is provided a video classification device, including: an acquisition unit configured to acquire a target video frame sequence; a sampling unit configured to perform tree sampling on the target video frame sequence to obtain a video frame sequence and a key frame with a two-layer structure; a first extraction unit configured to perform feature extraction processing on the video frame sequence based on a convolutional neural network model to obtain a time sequence feature; a second extraction unit configured to perform feature extraction processing on the key frame based on an impulse neural network model to obtain an impulse feature; a fusion unit configured to fuse the time sequence feature and the impulse feature to obtain a video fusion feature; and a classification unit configured to perform classification processing according to the video fusion feature to obtain category information of the target video frame sequence.
Optionally, the convolutional neural network model comprises a first convolutional layer and a second convolutional layer, the first extraction unit is further configured to perform gray processing on the video frame sequence to obtain a gray video frame sequence, perform feature extraction processing on the gray video frame sequence based on the first convolutional layer to obtain short-term time sequence features, and perform feature extraction processing on the short-term time sequence features based on the second convolutional layer to obtain long-term time sequence features serving as the time sequence features.
Optionally, the second extraction unit is further configured to perform pulse encoding processing on the key frames to obtain a key frame pulse sequence, and perform dynamic feature extraction processing on the key frame pulse sequence based on the pulse neural network model to obtain the pulse features.
Optionally, the convolutional neural network model performs nonlinear data operations using a ReLU activation function, and the impulse neural network model performs nonlinear data operations using LIF neurons as its activation function.
The convolutional neural network model and the impulse neural network model are obtained through training as follows: a sample video frame sequence and sample category information are obtained; tree sampling is performed on the sample video frame sequence to obtain a sampling video frame sequence and a sample key frame with a two-layer structure; feature extraction processing is performed on the sampling video frame sequence based on the convolutional neural network model to be trained to obtain sample time sequence features; feature extraction processing is performed on the sample key frame based on the impulse neural network model to be trained to obtain sample impulse features; fusion processing is performed on the sample time sequence features and the sample impulse features to obtain a sample video fusion feature; classification processing is performed according to the sample video fusion feature to obtain prediction category information of the sample video frame sequence; a loss value between the prediction category information and the sample category information is determined based on a loss function; and back propagation updating is performed on the convolutional neural network model to be trained and the impulse neural network model to be trained by means of the loss value, so as to obtain the convolutional neural network model and the impulse neural network model.
Optionally, the sample video frame sequence is a pre-processed video frame sequence, the pre-processing including at least one of a data augmentation process, a resizing process, a normalization process, and an outlier rejection process.
Optionally, the back propagation update of the impulse neural network model to be trained uses a surrogate gradient in which α is the learning rate and x is the loss value.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising at least one processor, at least one memory storing computer executable instructions, wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform a video classification method according to an exemplary embodiment of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, instructions in which, when executed by at least one processor, cause the at least one processor to perform a video classification method according to an exemplary embodiment of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, cause the at least one processor to perform a video classification method according to an exemplary embodiment of the present disclosure.
According to the video classification method and device, the electronic device, and the storage medium of the present disclosure, the convolutional neural network model and the impulse neural network model are used simultaneously: a video frame sequence and a key frame respectively suited to the two models are first sampled from the target video frame sequence; impulse features are then extracted from the key frame using the impulse neural network model, which also helps control power consumption; and these are combined with the time sequence features extracted from the video frame sequence by the convolutional neural network model to construct video fusion features with stronger characterization capability. In this way, the feature expression capability of the video can be improved, thereby improving the accuracy of video classification and better completing video classification tasks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any of the items", and "all of the items". For example, "comprising at least one of A and B" covers three parallel cases: (1) comprising A, (2) comprising B, and (3) comprising A and B. Likewise, "at least one of the first step and the second step is executed" covers three parallel cases: (1) the first step is executed, (2) the second step is executed, and (3) both the first step and the second step are executed.
Hereinafter, a video classification method and apparatus, an electronic device, and a storage medium according to exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a video classification method according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, in step S101, a target video frame sequence is acquired.
The target video frame sequence is a sequence of video frames in the target video that need to be classified. As an example, the target video frame sequence is obtained by decoding the target video. The target video can be a color RGB three-channel video shot by a common camera, and the target video can be obtained from a video data set or a database.
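A minimal sketch of obtaining a target video frame sequence by decoding a video file with OpenCV is given below. The file path, frame limit, and RGB conversion are illustrative assumptions rather than requirements of the disclosure.

```python
import cv2
import numpy as np

def decode_video(path: str, max_frames: int = 64) -> np.ndarray:
    """Decode up to `max_frames` RGB frames from the video at `path`."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame_bgr = cap.read()          # OpenCV returns frames in BGR order
        if not ok:                          # end of stream or decode failure
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)                 # shape: (T, H, W, 3)
```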
In step S102, a target video frame sequence is tree-sampled to obtain a video frame sequence and a key frame of a two-layer structure.
Tree sampling organizes a sequence of video frames into a tree-like structure, which helps to efficiently select and process video frames. The tree structure can be regarded as a multi-level sampling of the time series, so that the extraction of key information becomes more systematic. When a two-layer structure is extracted, a plurality of video frames are usually selected from the original video frame sequence (here, the target video frame sequence) according to a certain rule, and the video frame sequence formed by these frames constitutes the finer first layer. The key frames are representative frames extracted from the first-layer video frame sequence, used to summarize the video content or capture important moments, and constitute the second layer.
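A minimal sketch of such two-layer tree sampling is shown below, assuming a uniform-stride selection rule for both layers; the disclosure only requires "a certain rule", so the strides used here are illustrative.

```python
import numpy as np

def tree_sample(frames: np.ndarray, layer1_stride: int = 2, layer2_stride: int = 4):
    """Return (video_frame_sequence, key_frames) forming a two-layer structure."""
    # Layer 1: the finer-grained video frame sequence sampled from the target sequence.
    video_frame_sequence = frames[::layer1_stride]
    # Layer 2: representative key frames sampled again from layer 1.
    key_frames = video_frame_sequence[::layer2_stride]
    return video_frame_sequence, key_frames
```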
In step S103, feature extraction processing is performed on the video frame sequence based on the convolutional neural network model, so as to obtain a time sequence feature.
As the most commonly used type of neural network model at present, the convolutional neural network model can extract discriminative time sequence features, which helps to guarantee the basic effect of video classification.
Optionally, the convolutional neural network model comprises a first convolutional layer and a second convolutional layer, and step S103 comprises the steps of carrying out gray scale processing on a video frame sequence to obtain a gray scale video frame sequence, carrying out feature extraction processing on the gray scale video frame sequence based on the first convolutional layer to obtain short-term time sequence features, and carrying out feature extraction processing on the short-term time sequence features based on the second convolutional layer to obtain long-term time sequence features serving as time sequence features.
The video frame sequence is first subjected to gray scale processing, which reduces multiple color channels to a single luminance channel. This greatly reduces the data volume and computational requirements, lowers the complexity of the model, saves computing resources, and improves inference speed. In addition, in video classification tasks color information may not be critical and may even introduce interference (for example, color changes under different conditions may unnecessarily increase the sensitivity of the model to color variation; as another example, in specific medical image analysis or satellite image analysis, color information may mask or interfere with important structural information), whereas the luminance information (i.e., gray scale information) of the video may be more important. Gray scale processing therefore helps the model better extract the more recognizable luminance-change characteristics, thereby improving the video classification effect. Furthermore, by first extracting the short-term time sequence feature and then further extracting the long-term time sequence feature on that basis, rather than extracting the two separately, the short-term and long-term time sequence features are fused naturally, yielding a time sequence feature that reflects the characteristics of the target video in both the short-term and long-term dimensions and providing a useful reference for video classification.
As an example, the convolutional neural network model may include a plurality of parallel network structures, each including a first convolutional layer and a second convolutional layer, so that the same video frame sequence is subjected to parallel feature extraction to obtain a plurality of time sequence features, and the time sequence features are averaged (Mean operation is performed), and the average value is taken as a final time sequence feature of the video frame sequence. The present disclosure is not limited in this regard.
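A minimal PyTorch sketch of these parallel convolutional branches is given below. The layout of the frame/time dimension is not fixed by the text, so this sketch makes the simplifying assumption that the grayscale frames are stacked along the channel dimension of 2D convolutions; kernel sizes and channel widths are illustrative.

```python
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    def __init__(self, num_frames: int = 16, channels: int = 64):
        super().__init__()
        # First convolution layer: captures short-term temporal patterns.
        self.short_term = nn.Sequential(
            nn.Conv2d(num_frames, channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Second convolution layer (stack): aggregates into long-term patterns.
        self.long_term = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, gray_frames: torch.Tensor) -> torch.Tensor:
        # gray_frames: (batch, num_frames, H, W)
        short = self.short_term(gray_frames)
        long = self.long_term(short)
        return long.flatten(1)               # time sequence feature, shape (batch, channels)

class ParallelCNN(nn.Module):
    def __init__(self, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(CNNBranch() for _ in range(num_branches))

    def forward(self, gray_frames: torch.Tensor) -> torch.Tensor:
        # Average (Mean operation) over the features of the parallel branches.
        return torch.stack([b(gray_frames) for b in self.branches]).mean(dim=0)
```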
In step S104, feature extraction processing is performed on the key frame based on the impulse neural network model, so as to obtain impulse features.
The impulse neural network (Spiking Neural Network, SNN) is attracting increasing attention from researchers because its operating mechanism is closer to that of the human brain. A relatively popular way to construct an impulse neural network is to convert it from a deep convolutional network. As a neural network model with a high degree of biological plausibility, the impulse neural network can simulate the human brain's ability to process visual information, and therefore shows remarkable potential in small-target recognition tasks. Impulse features are extracted from the key frame based on the impulse neural network model and combined with the time sequence features extracted in step S103, so that richer video features can be obtained, the feature expression of the video is improved, the accuracy of video classification is increased, and the video classification task can be completed better.
Optionally, step S104 includes performing pulse coding processing on the key frame to obtain a key frame pulse sequence, and performing dynamic feature extraction processing on the key frame pulse sequence based on the impulse neural network model to obtain the impulse features. Because pulse coding is used to obtain the key frame pulse sequence, the resulting data is better suited to processing by the impulse neural network model. When the key frame pulse sequence is processed, the spatial characteristics of the key frame can be processed statically while information is read dynamically along the time dimension; by simulating this dynamic processing capability, the impulse neural network model can extract impulse features reflecting the pulse characteristics of the key frame while keeping energy consumption low, which helps improve video classification accuracy.
As an example, after pulse encoding processing, a plurality of key frame pulse sequences may be obtained, and dynamic feature extraction processing may be performed on each of the key frame pulse sequences using a pulse neural network model, so as to obtain pulse features including a plurality of time sequences, and the pulse features may be averaged (Mean operation is performed) to obtain final pulse features.
By way of example, the pulse coding includes Poisson coding of the key frame or other types of coding. Specifically, the number of time steps T is set to N, where N is a natural number greater than 1 (e.g., N = 16), and the probability distribution of the number of pulses generated by Poisson coding within the interval [t, t+τ] is as follows:
P(k) = ((λτ)^k / k!) · e^(−λτ),
where λ is the reciprocal of each pixel value after image normalization, k is the number of pulses occurring within the interval [t, t+τ], and τ is the time interval.
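A minimal sketch of Poisson (rate) coding for a key frame is given below. It assumes the common convention that the firing probability per time step is taken proportional to the normalized pixel intensity, and the time-step count follows the N = 16 example above; both choices are illustrative rather than the disclosure's exact formulation.

```python
import torch

def poisson_encode(key_frame: torch.Tensor, num_steps: int = 16) -> torch.Tensor:
    """key_frame: (H, W) or (C, H, W) tensor with values normalized to [0, 1].
    Returns a binary pulse sequence of shape (num_steps, *key_frame.shape)."""
    # At each time step, a pixel fires a pulse with probability equal to its intensity.
    probs = key_frame.clamp(0.0, 1.0).expand(num_steps, *key_frame.shape)
    return torch.bernoulli(probs)            # 0/1 spike train over num_steps steps
```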
In step S105, the temporal feature and the pulse feature are fused, so as to obtain a video fusion feature.
As an example, when video fusion features are obtained by fusion, concat operations may be performed on the timing features and the pulse features, so that the two are spliced together to obtain video fusion features. Of course, other fusion processes may be employed, and this disclosure is not limited in this regard.
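A minimal sketch of the Concat fusion of the time sequence feature and the pulse feature is shown below; the feature dimensions are illustrative.

```python
import torch

temporal_feature = torch.randn(1, 64)        # output of the convolutional branch
pulse_feature = torch.randn(1, 64)           # output of the spiking branch
video_fusion_feature = torch.cat([temporal_feature, pulse_feature], dim=1)  # (1, 128)
```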
In step S106, classification processing is performed according to the video fusion feature, so as to obtain category information of the target video frame sequence.
This step performs the final video classification based on the video fusion feature obtained in the previous step. As an example, the video fusion feature may be input into multiple fully connected layers with gradually reduced dimensions, so as to further reduce and distill the video fusion feature; the last fully connected layer acts as a classifier, for example a softmax classifier, whose dimension equals the number of identifiable categories, so that the probability that the target video frame sequence belongs to each category can be obtained. For the multi-classification problem, the category with the highest probability can be determined as the category information of the target video frame sequence; alternatively, the category with the highest probability can be adopted only when that probability exceeds a certain threshold, and when all probability values output by the classifier are smaller than the threshold, the category of the target video frame sequence is regarded as unrecognized.
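A minimal sketch of such a classification step is given below: fully connected layers with gradually reduced dimensions, a softmax over categories, and an optional confidence threshold. The layer sizes, category count, and threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),                       # last layer: one output per category
)

def classify(video_fusion_feature: torch.Tensor, threshold: float = 0.5):
    """video_fusion_feature: (1, 128). Returns the category index, or None if unrecognized."""
    probs = torch.softmax(classifier(video_fusion_feature), dim=1)
    conf, category = probs.max(dim=1)
    # Below the threshold the category is treated as unrecognized (returned as None).
    return category.item() if conf.item() >= threshold else None
```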
The video classification method of the exemplary embodiment of the present disclosure takes into account that the information extracted by a traditional convolutional neural network model alone is relatively limited, and that although the impulse neural network model shows remarkable potential in small-target recognition tasks, using it alone to extract video features in practice suffers from information loss, so that the video recognition effect is poor. By using the convolutional neural network model and the impulse neural network model simultaneously, first sampling the target video frame sequence to obtain a video frame sequence and a key frame respectively suited to the two models, then extracting impulse features from the key frame with the impulse neural network model, whose use also helps control power consumption, and combining these with the time sequence features extracted from the video frame sequence by the convolutional neural network model to construct video fusion features with stronger characterization capability, the feature expression capability of the video can be improved, thereby improving the accuracy of video classification and better completing video classification tasks.
It should be understood that, in fig. 1 and the text description related to fig. 1, different steps are labeled with serial numbers, so as to facilitate distinguishing between different steps and not to limit the execution order of the different steps. The execution sequence of the steps is defined by the execution logic precedence, for example, step S101 is performed to obtain the target video frame sequence, and then step S102 is performed to tree sample the target video frame sequence. Regarding step S103 and step S104, the feature extraction process is performed on different video data by using different models, so that there is no logical precedence relation, and thus in actual implementation, the feature extraction process may be performed in parallel or sequentially, which is not limited in the present disclosure.
In some embodiments, optionally, the convolutional neural network model uses a ReLU (Rectified Linear Unit) activation function for its nonlinear data operations, and the impulse neural network model uses LIF (Leaky Integrate-and-Fire) neurons as the activation function for its nonlinear data operations. Using the ReLU activation function and LIF neurons as the respective activation functions for the nonlinear data operations of the two models satisfies the different computational requirements of each model.
Specifically, the current equation of the LIF neuron model is expressed as:
I(t) = I_R + I_C,
where I(t) is the current through the LIF neuron model at time t, I_R is the current through the resistor R, and I_C is the current through the capacitor C. Substituting I_R = (U(t) − U_res)/R and I_C = C·dU(t)/dt into the above formula and rearranging gives the membrane potential equation:
τ_m · dU(t)/dt = −(U(t) − U_res) + R·I(t), with U(0) = U_0,
where U(t) is the voltage across the LIF neuron model at time t, U_res is the resting potential, U_0 is the membrane potential at t = 0, and τ_m is the membrane time constant, τ_m = RC.
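A minimal sketch of a discrete-time LIF neuron derived from the membrane equation above (τ_m·dU/dt = −(U − U_res) + R·I) is shown below. The time step, firing threshold, and reset-to-rest behavior are illustrative assumptions.

```python
import torch

def lif_step(u: torch.Tensor, current: torch.Tensor,
             u_res: float = 0.0, u_thresh: float = 1.0,
             tau_m: float = 2.0, r: float = 1.0, dt: float = 1.0):
    """One Euler step of the LIF dynamics; returns (new membrane potential, spikes)."""
    u = u + (dt / tau_m) * (-(u - u_res) + r * current)   # leaky integration
    spikes = (u >= u_thresh).float()                      # at most one pulse per time step
    u = torch.where(spikes.bool(), torch.full_like(u, u_res), u)  # reset after firing
    return u, spikes
```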
In some embodiments, optionally, the convolutional neural network model and the impulse neural network model are obtained through training as follows: a sample video frame sequence and sample category information are obtained; tree sampling is performed on the sample video frame sequence to obtain a sampling video frame sequence and a sample key frame with a two-layer structure; feature extraction processing is performed on the sampling video frame sequence based on the convolutional neural network model to be trained to obtain sample time sequence features; feature extraction processing is performed on the sample key frame based on the impulse neural network model to be trained to obtain sample impulse features; fusion processing is performed on the sample time sequence features and the sample impulse features to obtain a sample video fusion feature; classification processing is performed according to the sample video fusion feature to obtain prediction category information of the sample video frame sequence; a loss value between the prediction category information and the sample category information is determined based on a loss function; and back propagation updating is performed on the convolutional neural network model to be trained and the impulse neural network model to be trained by means of the loss value, so as to obtain the convolutional neural network model and the impulse neural network model.
In this embodiment, the convolutional neural network model to be trained and the impulse neural network model to be trained are used to perform the same video classification processing on the sample video frame sequence as in the foregoing embodiments, the resulting category information is recorded as prediction category information, and the loss value between the prediction category information and the labeled sample category information is calculated as a reference for updating the two models to be trained. Training of the two models can then be achieved through back propagation updates, improving the feature representation capability and further improving video classification performance.
As an example, the loss function is a cross entropy loss function. For a binary classification problem, the cross entropy loss function is expressed as follows:
Loss=-(ylog(p)+(1-y)log(1-p))。
Where y is a label corresponding to sample class information, y=1 for positive class samples and y=0 for negative class samples. p is the prediction category information output by the final model, namely the probability value that the sample video frame sequence is of a positive category.
For multi-classification problems, the cross entropy loss function is expressed as follows:
Loss = −Σ_{c=1}^{C} y_c·log(p_c),
where c denotes a category, C denotes the number of categories that the model can identify, y_c denotes the label of the sample category information corresponding to category c (y_c = 1 indicates that the sample video frame sequence belongs to category c; otherwise it does not belong to category c), and p_c is the component of the prediction category information output by the model corresponding to category c, i.e., the probability that the sample video frame sequence belongs to category c.
In practical application, the value of the loss function can be obtained for each of a plurality of sample video frame sequences, and the average of these values is then calculated and used as the loss value for updating the model parameters.
As an example, a loss threshold may be configured. When the loss value is greater than or equal to the loss threshold, a back propagation update is performed to update the model parameters; the models with updated parameters are then used as the models to be trained, the above training steps are repeated to obtain an updated loss value, and the comparison with the loss threshold continues. When the loss value becomes less than the loss threshold, training can end, and the latest models are used as the trained convolutional neural network model and impulse neural network model. For example, since the loss value obtained the first time is usually greater than or equal to the loss threshold, the back propagation update may be performed directly after the loss value is obtained for the first time, after which a new loss value is calculated with the updated models and compared with the loss threshold.
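A minimal sketch of this training procedure is shown below. It assumes a combined `model` object wrapping the convolutional branch, the impulse branch, fusion, and the classifier, and a `data_loader` yielding (sampled frames, key-frame pulses, labels); these names and the threshold value are illustrative, while the cross entropy loss and the loss-threshold stopping rule come from the text.

```python
import torch
import torch.nn as nn

def train(model, data_loader, optimizer, loss_threshold: float = 0.05):
    criterion = nn.CrossEntropyLoss()
    while True:
        for sampled_frames, key_frame_pulses, labels in data_loader:
            logits = model(sampled_frames, key_frame_pulses)   # prediction category info
            loss = criterion(logits, labels)                   # loss vs. sample category info
            if loss.item() < loss_threshold:                   # training is complete
                return model
            optimizer.zero_grad()
            loss.backward()                                    # back propagation update
            optimizer.step()
```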
In some embodiments, the sample video frame sequence is optionally a pre-processed video frame sequence, the pre-processing including at least one of a data augmentation process, a resizing process, a normalization process, and an outlier rejection process.
In these embodiments, performing the data augmentation process expands the number of sample video frames, which ensures the number of samples required for model training and reduces the risk of model over-fitting. Performing the resizing process makes the size of the sample video frames input into the models conform to the data size standard the models accept, ensuring that the sample video frame sequence is processed effectively. Specifically, the video frames in the original sample video frame sequence generally already conform to the size standard, but the video frames produced by the data augmentation process usually need to be resized; at this point, either only the newly added video frames may be resized, or the entire video frame sequence obtained after data augmentation may be subjected to the resizing process, which is not limited in this disclosure. The normalization process reduces the risk of data imbalance among samples, improves sample quality, and helps improve robustness in the subsequent model training. The outlier rejection process reduces the interference of abnormal values with the video classification process.
As an example, the data augmentation process may be implemented by means such as flipping, rotation, panning, scaling, noise perturbation, and brightness/contrast transformation, or by expanding the number of samples with a generative adversarial network.
As an example, the goal of the resizing process is related to the data size criteria that the model allows to enter. In particular, the video frame size to be processed is often small, and video frame pixel expansion processing may be performed to increase the size thereof. If the video frame to be processed is larger in size, the size of the video frame can be reduced by removing pixels in the edge area.
As an example, for each pixel in a sample video frame, the normalization processing of the sample video frame is achieved by calculating a normalized pixel value as follows:
z_norm = (z − min(z)) / (max(z) − min(z)),
where z_norm is the pixel value after normalization, z is the pixel value before normalization, min(z) is the minimum pixel value in the current sample video frame, and max(z) is the maximum pixel value in the current sample video frame.
The sample video frame sequence is obtained by performing the data augmentation process on the sample video frames, resizing the augmented video frames to obtain a plurality of sample video frames, and performing the normalization process on the plurality of sample video frames to obtain the sample video frame sequence.
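A minimal sketch of this preprocessing pipeline (augmentation, resizing, then per-frame min-max normalization) is shown below; the particular augmentation (horizontal flip) and target size are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess(frames: np.ndarray, size=(128, 128)) -> np.ndarray:
    """frames: (T, H, W) or (T, H, W, C). Returns normalized frames of the target size."""
    augmented = list(frames) + [np.fliplr(f) for f in frames]   # data augmentation
    resized = [cv2.resize(f, size) for f in augmented]          # resizing
    out = []
    for f in resized:
        f = f.astype(np.float32)
        out.append((f - f.min()) / (f.max() - f.min() + 1e-8))  # per-frame min-max (z_norm)
    return np.stack(out)
```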
In some embodiments, optionally, the back propagation update of the impulse neural network model to be trained uses a surrogate gradient in which α is the learning rate and x is the loss value.
In this embodiment, because the impulse neural network model involves a discontinuous function, its exact gradient is difficult to compute. Performing the back propagation update of the impulse neural network model with a surrogate gradient ensures that the update process can proceed reliably.
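Because the disclosure's exact surrogate-gradient formula is not reproduced in this text, the sketch below uses a commonly adopted sigmoid-shaped surrogate purely as an illustration: the forward pass is the non-differentiable step (spike) function, and the backward pass replaces its derivative with a smooth approximation. The sharpness parameter `alpha` here is an illustrative choice and is not the α defined above.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential, alpha: float = 2.0):
        ctx.save_for_backward(membrane_potential)
        ctx.alpha = alpha
        return (membrane_potential >= 0).float()          # hard threshold in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        sig = torch.sigmoid(ctx.alpha * u)
        surrogate = ctx.alpha * sig * (1 - sig)           # smooth stand-in for the derivative
        return grad_output * surrogate, None

# Usage: spikes = SurrogateSpike.apply(membrane_potential)
```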
Fig. 2 is a schematic diagram of a training flow of convolutional neural network model and impulse neural network model in accordance with an exemplary embodiment of the present disclosure.
Referring to fig. 2, training the convolutional neural network model and the impulse neural network model mainly includes the following 5 steps.
(1) Initialize the network parameters (corresponding to the neural network parameters in fig. 2) of the convolutional neural network model and the impulse neural network model.
(2) Input the preprocessed gray sampling video frame sequence into the convolutional neural network model to extract sample time sequence features; perform Poisson coding on the preprocessed sample key frame to obtain a pulse sequence and input it into the impulse neural network model to extract sample impulse features; after the sample time sequence features and the sample impulse features are fused, output the prediction category information corresponding to the sample video.
(3) Calculate, based on the cross entropy loss function, the loss value (corresponding to the error in fig. 2) between the obtained prediction category information and the sample category information of the sample video frame sequence.
(4) If the loss value is greater than or equal to the loss threshold, execute step (5) to continue training; training is completed when the loss value is less than the loss threshold.
(5) Perform a back propagation update on the convolutional neural network model and the impulse neural network model using the loss value, repeat steps (2) and (3) based on the updated convolutional neural network model and the updated impulse neural network model to iteratively update the loss value, and then return to step (4).
In this embodiment, a ReLU activation function is used in the back propagation update process to prevent gradient vanishing and gradient explosion during model training.
Next, a video classification method according to one embodiment of the present disclosure will be described.
The video classification method of this embodiment is based on a video classification model that classifies and recognizes the target video to obtain category information as the classification recognition result. As shown in fig. 3, the video classification model is constructed from an upper convolutional neural network model and a lower impulse neural network model. The convolutional neural network comprises three parallel network structures, each of which comprises a first convolution layer (corresponding to the first block that the gray video frame sequence is fed into in fig. 3; the number at the bottom of the block indicates the number of output neurons of that layer, and the same applies to the other blocks, which are not described one by one) and a second convolution layer (corresponding to the 9 blocks after the first convolution layer). In this fused-pulse-feature video classification method, the video fusion feature is obtained mainly through the following 3 steps.
(1) Input the gray video frame sequence into the first convolution layer of each network structure of the convolutional neural network model to obtain a short-term time sequence feature; input the short-term time sequence feature into the second convolution layer of each network structure for further processing to obtain a long-term time sequence feature fused with the short-term time sequence feature, which is taken as the time sequence feature; and average (perform a Mean operation on) the three time sequence features output by the three network structures to obtain the final time sequence feature.
(2) Input the key frame pulse sequence into the impulse neural network model to obtain pulse features over a plurality of time steps, and then average these pulse features (perform a Mean operation) to obtain the pulse feature of the video key frame.
(3) Fuse the time sequence feature output by the convolutional neural network model and the pulse feature output by the impulse neural network model through a Concat operation to obtain the video fusion feature.
The impulse neural network model is used to extract pulse features containing time sequence information from the key frames. The time sequence features output by the convolutional neural network model and the pulse features output by the impulse neural network model are fused, the predicted value is calculated and output through the fully connected layers, and the convolutional neural network model and the impulse neural network model are iteratively trained in turn; once the network converges, a usable video classification model is obtained for subsequently executing video classification tasks.
In this embodiment, the impulse neural network model includes at least one first convolution kernel (corresponding to the first block that the key frame pulse sequence is fed into in fig. 3) and a plurality of residual blocks (corresponding to the 9 blocks after the first convolution kernel). The residual blocks are built from a plurality of second convolution kernels; that is, the residual blocks take the form of convolution kernels, which are described as second convolution kernels to distinguish them from the preceding first convolution kernel, and the activation layer included in each residual block is set to LIF neurons. The convolution kernels in the convolutional neural network model are the same as the first convolution kernel in the impulse neural network model, and the residual blocks in the convolutional neural network model are the same as those in the impulse neural network model, except that the convolutional neural network model performs nonlinear data operations using a ReLU activation function, whereas the impulse neural network model uses LIF neurons as the activation function. In other words, the convolutional neural network model and the impulse neural network model differ in the activation functions used in their residual blocks; in addition, the impulse neural network model has a time step dimension, i.e., an LIF neuron can emit multiple pulses, with at most one pulse per time step.
In this particular embodiment, the sizes of the first convolution kernel and the second convolution kernels may be selected according to user requirements; for example, the first convolution kernel is a 5×5 convolution kernel, and the second convolution kernels include at least one of a 3×3 convolution kernel and a 1×1 convolution kernel.
In this particular embodiment, the impulse neural network model includes one 5×5 convolution kernel (i.e., the first convolution kernel) and three residual blocks (i.e., the 9 blocks after the first convolution kernel), where each residual block consists of two 3×3 convolution kernels and one 1×1 convolution kernel, and the activation function uses LIF neurons, which can balance biological plausibility and practicality.
In this specific embodiment, the convolutional neural network model includes three parallel network structures; like the impulse neural network model, each network structure includes one 5×5 convolution kernel and three residual blocks, which ensures the effective transmission of key features. Each residual block consists of two 3×3 convolution kernels and one 1×1 convolution kernel and uses ReLU as the activation function, effectively alleviating the problems of gradient vanishing and gradient explosion.
In this particular embodiment, the video classification model further includes three fully connected layers, whose numbers of output neurons are 1024, 256, and the number of categories (classes), respectively, and each fully connected layer is followed by a ReLU activation function.
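A minimal PyTorch sketch of the structure described in this embodiment is given below: a 5×5 convolution followed by three residual blocks (each with two 3×3 convolutions and one 1×1 convolution), used with ReLU activations in the parallel convolutional branches and with an LIF-style activation in the spiking branch, plus fully connected layers of 1024 and 256 neurons before the class output. The channel widths, the pooling, the frames-as-channels input layout, and the LIF activation module (a ReLU placeholder here) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int, activation: nn.Module):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), activation,
            nn.Conv2d(channels, channels, 3, padding=1), activation,
            nn.Conv2d(channels, channels, 1),             # the 1x1 convolution kernel
        )
        self.act = activation

    def forward(self, x):
        return self.act(self.body(x) + x)                 # residual connection

def make_branch(in_channels: int, channels: int, activation: nn.Module) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, channels, 5, padding=2), activation,   # first 5x5 kernel
        ResidualBlock(channels, activation),
        ResidualBlock(channels, activation),
        ResidualBlock(channels, activation),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class VideoClassifier(nn.Module):
    def __init__(self, num_frames: int, num_classes: int, channels: int = 64,
                 lif_activation: nn.Module = nn.ReLU()):  # placeholder for a true LIF module
        super().__init__()
        # Three parallel branches over the grayscale frame sequence (ReLU activations).
        self.cnn_branches = nn.ModuleList(
            make_branch(num_frames, channels, nn.ReLU()) for _ in range(3))
        # One spiking-style branch over the key-frame pulse sequence (LIF in practice).
        self.snn_branch = make_branch(1, channels, lif_activation)
        self.head = nn.Sequential(
            nn.Linear(2 * channels, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, gray_frames, pulse_sequence):
        # gray_frames: (batch, num_frames, H, W); pulse_sequence: (T, batch, 1, H, W)
        temporal = torch.stack([b(gray_frames) for b in self.cnn_branches]).mean(0)
        pulses = torch.stack([self.snn_branch(p) for p in pulse_sequence]).mean(0)
        return self.head(torch.cat([temporal, pulses], dim=1))   # Concat fusion + FC head
```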
In this embodiment, the video classification model is obtained by training with the training method described above, using the sample videos as training samples and the video fusion features as training features. Relevant parameters in the training process include using stochastic gradient descent (SGD) as the optimizer, a learning rate of 0.1, a sample batch size of 32, and a maximum of 30 training epochs.
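A short sketch of this stated training configuration is shown below; `model` and `train_dataset` are assumed to exist and are illustrative names.

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)          # SGD, learning rate 0.1
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # batch size 32
num_epochs = 30                                                  # maximum training epochs
```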
According to the fused-pulse-feature video recognition method provided by this embodiment, the impulse neural network model and the convolutional neural network model are constructed from at least one first convolution kernel and a plurality of residual blocks; the activation function in the impulse neural network model is constructed with LIF neurons, and the convolution kernels in the convolutional neural network model perform nonlinear data operations with a ReLU activation function, so that the problems of gradient vanishing and gradient explosion during network training are alleviated while the effective transmission of key features is ensured.
Fig. 4 is a block diagram of a video classification device according to an exemplary embodiment of the present disclosure. Referring to fig. 4, the apparatus includes an acquisition unit 401, a sampling unit 402, a first extraction unit 403, a second extraction unit 404, a fusion unit 405, and a classification unit 406.
The acquisition unit 401 may acquire a target video frame sequence.
The sampling unit 402 may perform tree sampling on the target video frame sequence to obtain a video frame sequence and a key frame with a two-layer structure.
The first extraction unit 403 may perform feature extraction processing on the video frame sequence based on the convolutional neural network model, to obtain a time sequence feature.
The second extraction unit 404 may perform feature extraction processing on the key frame based on the impulse neural network model to obtain impulse features.
The fusion unit 405 may perform fusion processing on the temporal feature and the pulse feature to obtain a video fusion feature.
The classification unit 406 may perform classification processing according to the video fusion feature to obtain category information of the target video frame sequence.
Optionally, the convolutional neural network model includes a first convolutional layer and a second convolutional layer, and the first extraction unit 403 may further perform gray processing on the video frame sequence to obtain a gray video frame sequence, perform feature extraction processing on the gray video frame sequence based on the first convolutional layer to obtain a short-term time sequence feature, perform feature extraction processing on the short-term time sequence feature based on the second convolutional layer to obtain a long-term time sequence feature, and perform fusion processing on the short-term time sequence feature and the long-term time sequence feature to obtain the time sequence feature.
Optionally, the second extraction unit 404 may perform pulse encoding processing on the key frame to obtain a key frame pulse sequence, and perform dynamic feature extraction processing on the key frame pulse sequence based on the pulse neural network model to obtain pulse features.
Optionally, the convolutional neural network model performs nonlinear data operations using a ReLU activation function, and the impulse neural network model performs nonlinear data operations using LIF neurons as its activation function.
The convolutional neural network model and the impulse neural network model are obtained through training as follows: a sample video frame sequence and sample category information are obtained; tree sampling is performed on the sample video frame sequence to obtain a sampling video frame sequence and a sample key frame with a two-layer structure; feature extraction processing is performed on the sampling video frame sequence based on the convolutional neural network model to be trained to obtain sample time sequence features; feature extraction processing is performed on the sample key frame based on the impulse neural network model to be trained to obtain sample impulse features; fusion processing is performed on the sample time sequence features and the sample impulse features to obtain a sample video fusion feature; classification processing is performed according to the sample video fusion feature to obtain prediction category information of the sample video frame sequence; a loss value between the prediction category information and the sample category information is determined based on a loss function; and back propagation updating is performed on the convolutional neural network model to be trained and the impulse neural network model to be trained by means of the loss value, so as to obtain the convolutional neural network model and the impulse neural network model.
Optionally, the sample video frame sequence is a pre-processed video frame sequence, the pre-processing including at least one of a data augmentation process, a resizing process, a normalization process, and an outlier rejection process.
Optionally, the back propagation update of the impulse neural network model to be trained uses a surrogate gradient in which α is the learning rate and x is the loss value.
The specific manner in which the individual units perform the operations in relation to the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method and will not be described in detail here.
Fig. 5 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, the electronic device includes at least one memory 501 and at least one processor 502, the at least one memory 501 having stored therein computer-executable instructions that, when executed by the at least one processor 502, cause the at least one processor to perform the video classification method described in the above exemplary embodiments.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the above set of instructions. Here, the electronic device is not necessarily a single electronic device; it may also be any device or aggregate of circuits capable of executing the above instructions (or instruction set), alone or in combination. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In an electronic device, the processor 502 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 502 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 502 may execute instructions or code stored in the memory 501, wherein the memory 501 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 501 may be integrated with the processor 502, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, memory 501 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory 501 and the processor 502 may be operatively coupled or may communicate with each other, for example, through an I/O port, network connection, etc., such that the processor 502 is able to read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the video classification method described in the above exemplary embodiments. Examples of computer-readable storage media here include read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drive (HDD), solid state disk (SSD), card memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, hard disk, solid state disk, and any other device configured to store, in a non-transitory manner, a computer program and any associated data, data files, and data structures, and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed in computer devices such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising computer instructions which, when executed by at least one processor, cause the at least one processor to perform the video classification method described in the above exemplary embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.