CN118540515A - Video frame stream processing method and system - Google Patents
- Publication number: CN118540515A (application number CN202410760158.2A)
- Authority
- CN
- China
- Prior art keywords
- model
- video frame
- lstm
- cnn
- frame stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
- H04L63/0435—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply symmetric encryption, i.e. same key used for encryption and decryption
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/40—Network security protocols
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2347—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving video stream encryption
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4408—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving video stream encryption, e.g. re-encrypting a decrypted video stream for redistribution in a home network
Abstract
The invention discloses a video frame stream processing method and system in the field of multimedia security. The method includes: dividing a video frame stream into key frames and non-key frames; detecting and analyzing the video frame stream in real time with a CNN-LSTM combination model to identify possible security threats and detect abnormal changes in the images; establishing a feedback mechanism during transmission, with corresponding feedback and processing carried out according to the results output by the CNN-LSTM model; performing end-to-end encrypted transmission of the key frames and non-key frames using an AES symmetric encryption algorithm and a partial-encryption mode, respectively; and performing identity authentication before transmission begins. Image features are extracted by the CNN, time-sequence information is modeled by the LSTM, and training combines a linear function with a loss function, so that security threats and abnormal changes in video frame images are effectively identified. The feedback mechanism, driven by the results output by the CNN-LSTM model, improves the ability to respond to security threats.
Description
Technical Field
The invention belongs to the field of multimedia security, and particularly relates to a video frame stream processing method and a system thereof.
Background
With the continuous development of computer and network technology, a user can watch not only video files stored on their own terminal device but also video files on the network side, which are stored on a network-side video server. When the user wants to watch a video file stored on the network-side video server, a video transmission request is sent to the video server through the terminal device; after receiving the request, the video server transmits the corresponding video file to the terminal device, which plays it. A video file consists of a sequence of ordered video frames (i.e., single still pictures), and therefore a video file may also be referred to as a video frame stream.
However, during video transmission, security threats, such as data leakage, tampering, or interception, may be faced. Therefore, appropriate security measures need to be taken during video transmission, and a video frame stream processing method is proposed for this purpose.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a video frame stream processing method and a system thereof, which solve the problems in the prior art.
The aim of the invention can be achieved by the following technical scheme:
A video frame stream processing method, comprising the steps of:
dividing a video frame stream into key frames and non-key frames;
Detecting and analyzing the video frame stream in real time by using a CNN-LSTM combined model, identifying possible security threats and detecting abnormal changes in the image;
in the transmission process, a feedback mechanism is established, and corresponding feedback and processing are carried out according to the result output by the CNN-LSTM model;
Performing end-to-end encryption transmission on the key frames and the non-key frames by adopting an AES symmetric encryption algorithm and a partial encryption mode respectively;
identity authentication is performed before transmission begins.
Further, the CNN-LSTM combination model includes:
convolution layer: for extracting features in the image;
pooling layer: used to reduce the size of the feature map and the number of parameters while retaining key information;
stacking of convolution-pooling layers: for progressively extracting and abstracting image features;
Flatten layer: flattens the feature map output by the convolution layers into a one-dimensional vector serving as the input of the LSTM layer;
LSTM layer: for modeling timing information in the image sequence, capturing a temporal correlation between image frames;
hidden layer: for increasing the depth and complexity of the network;
output layer: and connecting the output of the LSTM layer to the full connection layer, and finally outputting the prediction result of the model.
Further, the linear function of the CNN-LSTM combination model is as follows:
z_j = Σ_{i=1}^{N} w_{ij} x_i + b_j
where z_j is the weighted input sum of the j-th neuron of the fully connected layer, w_{ij} is the weight connecting the i-th input feature to the j-th neuron, x_i is the i-th element of the input feature vector, b_j is the bias term of the j-th neuron, and N is the dimension of the input feature vector;
The loss function of the CNN-LSTM combination model is as follows:
L = -(1/M) Σ_{i=1}^{M} Σ_{j=1}^{K} y_{ij} log(ŷ_{ij})
where L represents the difference between the predicted result output by the model and the real label; y_{ij} represents the label value of the j-th class in the real label of sample i; ŷ_{ij} represents the predicted probability of the j-th class in the model output for sample i; M is the number of samples and K is the number of classes;
Further, the specific steps of detecting and analyzing the video frame stream in real time by using the CNN-LSTM combination model include:
S21, preprocessing the acquired video frames, including image size adjustment and normalization operation;
S22, using the preprocessed video frames, adjusting the structure and hyperparameters of the CNN-LSTM combination model through repeated iterative training and verification, thereby optimizing the CNN-LSTM combination model;
S23, inputting the video frames into the trained and optimized CNN-LSTM model, analyzing and monitoring each frame in real time through the model, detecting abnormal changes, object occlusion and scene changes in the images, and identifying security threats.
Further, the step of optimizing the CNN-LSTM combination model includes:
s221, dividing the data set into a training set, a verification set and a test set; the training set is used for parameter training of the model, the verification set is used for super-parameter adjustment and performance evaluation of the model, and the test set is used for performance evaluation of the final model;
s222, training the deep learning model by using a training set, and updating parameters of the model through a back propagation algorithm and an optimizer in the training process so that a loss function of the model is gradually reduced;
s223, evaluating the trained model using the verification set, and adjusting the hyperparameters of the model according to its performance on the verification set, wherein the hyperparameters include: learning rate, batch size, network structure, and regularization parameters;
And S224, monitoring the performance of the model on the verification set, and stopping training when the performance is not improved any more.
Further, only the sensitive area is encrypted when encrypting the non-key frames.
A video frame stream processing system, comprising:
Classification module: divides a video frame stream into key frames and non-key frames;
Detection and analysis module: detects and analyzes the video frame stream in real time using a CNN-LSTM combination model, identifying possible security threats and detecting abnormal changes in the images;
Transmission feedback module: establishes a feedback mechanism during transmission, carrying out corresponding feedback and processing according to the results output by the CNN-LSTM model;
Transmission encryption module: performs end-to-end encrypted transmission of the key frames and non-key frames using an AES symmetric encryption algorithm and a partial-encryption mode, respectively;
Identity authentication module: performs identity authentication before transmission begins.
A computer storage medium storing a readable program which, when run, performs a video frame stream processing method as described above.
An electronic device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform an operation corresponding to the video frame stream processing method.
A computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to one of the video frame stream processing methods described above.
The invention has the beneficial effects that:
1. the video frame stream is divided into the key frames and the non-key frames, and different encryption modes are adopted for the key frames and the non-key frames respectively for encryption, wherein the non-key frames are encrypted in a partial encryption mode (only the sensitive area is encrypted), so that the influence of the encryption process on bandwidth and performance is reduced, and meanwhile, the cost is reduced.
2. The CNN-LSTM combination model is used to detect and analyze the video frame stream: the CNN in the model extracts image features, the LSTM models time-sequence information, and training combines a linear function with a loss function, so that security threats and abnormal changes in video frame images can be effectively identified.
3. By establishing a feedback mechanism, corresponding feedback and processing are carried out according to the result output by the CNN-LSTM model, so that the coping capability of the security threat is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flow chart of a video frame stream processing method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
As shown in fig. 1, a video frame stream processing method includes the following steps:
s1, before transmission, dividing a video frame stream into key frames (I frames) and non-key frames (P frames and B frames);
Key frames contain complete image information, while non-key frames usually contain only partial image information. By classifying the frames, different security treatments can be applied to key frames and non-key frames, protecting video content more effectively.
Meanwhile, since key frames have higher image quality and importance, security processing such as encryption, transmission and storage of key frames may require more resources and computational cost, whereas the security processing of non-key frames can be lighter, saving resources and bandwidth. By allocating resources in this way, the overall performance and efficiency of the system can be improved.
The step of classifying the video frame stream specifically comprises the following steps:
S11, extracting frames: first, image data of each frame is extracted from a video frame stream.
S12, frame type identification: each extracted frame is identified to determine which type it belongs to. In common video coding standards, frames are generally classified into the following types:
Key frame (I frame): key frames are important frames in a video sequence; they contain complete image information and are independent of the data of other frames. Typically a key frame appears once every few frames and serves as a reference image, with subsequent frames encoding changes relative to it. Whether a frame is a key frame may be determined by examining the frame header information or a specific flag.
Non-key frames (P frames, B frames): non-key frames are encoded depending on the data of other frames, typically storing only differences from a previous frame or from the frames before and after. In video coding standards such as H.264, non-key frames are generally divided into predicted frames (P frames) and bi-directionally predicted frames (B frames). P frames rely on the data of a previous frame for encoding, while B frames rely on the data of both the previous and subsequent frames; B frames therefore typically achieve higher compression rates but are also more complex to decode.
S13, frame classification: each frame is classified as either a key frame or a non-key frame according to the type of frame.
S14, marking a frame: each frame is marked indicating the frame type to which it belongs.
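The classification steps S11-S14 can be sketched in a few lines of Python. This is a hypothetical illustration: it assumes a decoder has already extracted each frame's coding type ('I', 'P' or 'B'), which a real implementation would read from the frame header or codec metadata.

```python
# Hypothetical sketch of steps S11-S14. It assumes each frame's coding type
# ('I', 'P' or 'B') has already been extracted by a decoder; a real system
# would read this from the frame header or a codec-specific flag.

def classify_frames(frames):
    """Label each (frame_id, frame_type) pair as key or non-key (S13/S14)."""
    labeled = []
    for frame_id, frame_type in frames:
        is_key = (frame_type == "I")       # only I-frames are key frames
        labeled.append({"id": frame_id, "type": frame_type, "key": is_key})
    return labeled

stream = [(0, "I"), (1, "P"), (2, "B"), (3, "P"), (4, "I")]
labeled = classify_frames(stream)
key_ids = [f["id"] for f in labeled if f["key"]]
print(key_ids)
```

The key/non-key flag attached here is what the later encryption step would consult when choosing between full AES encryption and partial encryption.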
S2, detecting and analyzing the video frame stream in real time using the CNN-LSTM combination model, identifying possible security threats such as data tampering and malicious attacks, and detecting abnormal changes in the images, such as object occlusion and scene changes, thereby improving the security monitoring capability of the video content;
the method comprises the following specific steps:
S21, preprocessing the acquired video frames, including image resizing and normalization, to ensure that the input data matches the requirements of the training model;
s22, using the preprocessed video frames, adjusting the structure and hyperparameters of the CNN-LSTM combination model through repeated iterative training and verification, optimizing the model's performance to improve its accuracy and generalization ability;
The CNN-LSTM combination model comprises:
Convolution layer: the method is used for extracting features in the image, including low-level features such as edges and textures and high-level features such as objects and scenes;
pooling layer: used to reduce the size of the feature map and the number of parameters while retaining key information;
stacking of convolution-pooling layers: a stack of multiple convolution-pooling layers for progressively extracting and abstracting image features;
Flatten layer: flattens the feature map output by the convolution layers into a one-dimensional vector serving as the input of the LSTM layer;
LSTM layer: for modeling timing information in the image sequence, capturing a temporal correlation between image frames;
Hidden layer: a stack of LSTM layers for increasing the depth and complexity of the network;
output layer: and connecting the output of the LSTM layer to the full connection layer, and finally outputting the prediction result of the model.
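The layer stack above can be sketched numerically. The following is a rough illustration only (toy sizes and random weights, not the patent's trained model): one forward pass through convolution, ReLU, pooling, flattening, a per-frame LSTM step, and a softmax output layer.

```python
import numpy as np

# Toy numeric sketch of the layer stack: conv -> ReLU -> pool -> flatten ->
# LSTM step per frame -> fully connected softmax output. Sizes and weights
# are arbitrary illustrative values.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def conv2d(img, kern):
    """Valid 2-D cross-correlation (single channel, single kernel)."""
    H, W = img.shape
    kh, kw = kern.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

def maxpool2(x):
    """2x2 max pooling with stride 2 (dimensions assumed even)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def lstm_step(x, h, c, Wx, Uh, b):
    """One LSTM step; gates stacked as [input, forget, cell, output]."""
    n = h.size
    z = Wx @ x + Uh @ h + b
    i, f = sigmoid(z[:n]), sigmoid(z[n:2 * n])
    g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
    c = f * c + i * g                    # update the cell state
    return o * np.tanh(c), c

T, H, W, hidden, classes = 3, 8, 8, 4, 2
kern = rng.standard_normal((3, 3))
feat_dim = ((H - 2) // 2) * ((W - 2) // 2)   # 6x6 after conv, 3x3 after pool

Wx = 0.1 * rng.standard_normal((4 * hidden, feat_dim))
Uh = 0.1 * rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)
Wo = 0.1 * rng.standard_normal((classes, hidden))

h, c = np.zeros(hidden), np.zeros(hidden)
for _ in range(T):                       # iterate over the frame sequence
    frame = rng.standard_normal((H, W))
    feat = maxpool2(np.maximum(conv2d(frame, kern), 0.0))  # conv+ReLU+pool
    h, c = lstm_step(feat.ravel(), h, c, Wx, Uh, b)        # flatten -> LSTM

probs = softmax(Wo @ h)                  # output layer: class probabilities
print(probs)
```

A production model would of course stack several convolution-pooling layers and LSTM layers; the sketch keeps one of each to show how the feature map is flattened into the LSTM input.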
The output of the CNN-LSTM combination model is connected to a fully connected layer, a nonlinear transformation is applied using the ReLU activation function, and the output of the fully connected layer is classified by a softmax function, which outputs the probability of each class. The linear function of the CNN-LSTM combination model is:
z_j = Σ_{i=1}^{N} w_{ij} x_i + b_j
where z_j is the weighted input sum of the j-th neuron of the fully connected layer, w_{ij} is the weight connecting the i-th input feature to the j-th neuron, x_i is the i-th element of the input feature vector, b_j is the bias term of the j-th neuron, and N is the dimension of the input feature vector.
On the basis of this linear function, the output a_j of each neuron is obtained by applying the nonlinear ReLU activation:
a_j = f(z_j) = max(0, z_j)
where f(z_j) is the activation function.
The probability p_j of category j is then calculated by the softmax function:
p_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}
where K is the number of output categories.
The CNN-LSTM combination model uses a cross-entropy loss function (Cross-Entropy Loss) to measure the difference between the predicted result output by the model and the real label; its expression is:
L = -(1/M) Σ_{i=1}^{M} Σ_{j=1}^{K} y_{ij} log(ŷ_{ij})
where L represents the difference between the predicted result output by the model and the real label; y_{ij} represents the label value of the j-th class in the real label of sample i (1 means the sample belongs to the class, 0 means it does not); ŷ_{ij} represents the predicted probability of the j-th class in the model output for sample i; M is the number of samples and K is the number of classes.
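The fully connected layer, ReLU, softmax, and cross-entropy expressions above can be checked numerically with a small sketch; the weights and inputs below are arbitrary toy numbers, not values from the patent.

```python
import numpy as np

def dense(x, W, b):
    """z_j = sum_i w_ij * x_i + b_j for every neuron j of the layer."""
    return W @ x + b

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_prob):
    """Mean cross-entropy over a batch of one-hot labels."""
    return -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

x = np.array([1.0, 2.0, -1.0])          # input feature vector, N = 3
W = np.array([[0.5, -0.2, 0.1],         # weights w_ij for K = 2 classes
              [0.3, 0.8, -0.5]])
b = np.array([0.1, -0.1])               # bias terms b_j

p = softmax(relu(dense(x, W, b)))       # class probabilities
y = np.array([[1.0, 0.0]])              # one-hot true label (class 0)
loss = cross_entropy(y, p[np.newaxis, :])

print(p, float(loss))
```

Because the true label points at the class with the lower predicted probability here, the loss comes out well above zero; a confident correct prediction drives it toward zero.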
Image features are extracted by the CNN, time-sequence information is modeled by the LSTM, and training combines the linear function with the loss function, so that security threats and abnormal changes in video frame images can be effectively identified.
The step of optimizing the CNN-LSTM combination model comprises the following steps:
s221, dividing the data set into a training set, a verification set and a test set; the training set is used for parameter training of the model, the verification set is used for super-parameter adjustment and performance evaluation of the model, and the test set is used for performance evaluation of the final model.
S222, training the deep learning model by using a training set, wherein in the training process, parameters of the model are updated through a back propagation algorithm and an optimizer (Adam) so that a loss function of the model is gradually reduced;
S223, evaluating the trained model using the verification set and adjusting its hyperparameters according to the performance on the verification set. The hyperparameters include the following. Learning rate: controls the step size of parameter updates; too large a value may cause the model to oscillate, while too small a value slows convergence.
Batch size: the number of samples used for each iterative training affects the convergence rate and generalization ability of the model.
Network structure: including the number of network layers, the number of neurons per layer, the size of the convolution kernel, etc.
Regularization parameters: e.g., the weights of L1 and L2 regularization.
S224, monitoring the performance of the model on the verification set, and stopping training when the performance is not improved any more so as to prevent over fitting; if the data set is smaller in size, the generalization capability of the model can be evaluated in a cross-validation mode, so that the robustness of the model is further improved.
Through repeated iterative training and verification, the hyperparameters and structure of the model are adjusted, so that the performance of the deep learning model can be gradually optimized and its accuracy and generalization ability improved.
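Step S224's early-stopping rule (stop when validation performance no longer improves) can be sketched as follows. The per-epoch validation losses are stand-in values, and the `patience` parameter is an assumed detail not specified in the patent.

```python
# Hypothetical early-stopping loop for step S224: stop training once the
# validation loss has not improved for `patience` consecutive epochs.

def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch at which training stops (1-indexed).

    `val_losses` stands in for the per-epoch validation loss a real
    training loop would measure after each pass over the training set.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0    # improvement: reset the counter
        else:
            stale += 1               # no improvement this epoch
        if stale >= patience:
            return epoch             # performance has plateaued: stop
    return len(val_losses)

losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60]
print(train_with_early_stopping(losses))
```

With the losses above, training stops at epoch 5: epochs 4 and 5 both fail to beat the best loss of 0.6 reached at epoch 3.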
S23, inputting video frames into a CNN-LSTM model after training and optimizing, and analyzing and monitoring each frame in real time through the model; the model may detect abnormal changes in the image, object occlusions, scene changes, etc., and identify potential security threats.
And S3, in the transmission process, a feedback mechanism is established, and corresponding feedback and processing are carried out according to the result output by the CNN-LSTM model.
If an abnormal condition is detected, timely taking measures, such as triggering an alarm, taking preventive measures or automatically adjusting a security policy, to deal with potential security threats; the feedback mechanism is improved through continuous iteration, so that the coping capability and the transmission efficiency of the security threat are improved.
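A minimal sketch of such a feedback handler, assuming the CNN-LSTM model emits a per-frame threat probability; the thresholds and action names are illustrative, not from the patent.

```python
# Hypothetical feedback handler: maps the CNN-LSTM model's per-frame
# threat probability to an action. Thresholds and actions are illustrative.

def handle_frame(threat_prob, alarm_at=0.9, warn_at=0.6):
    if threat_prob >= alarm_at:
        return "trigger_alarm"        # e.g., halt transmission, notify operator
    if threat_prob >= warn_at:
        return "tighten_policy"       # e.g., switch to stricter encryption
    return "pass"                     # frame looks normal

actions = [handle_frame(p) for p in (0.10, 0.65, 0.95)]
print(actions)
```

In a deployed system the thresholds themselves would be among the values tuned through the iterative feedback described above.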
S4, performing end-to-end encryption transmission on the video key frames and the non-key frames by adopting an AES symmetric encryption algorithm and a partial encryption mode respectively, so as to ensure the data security in the transmission process; and meanwhile, secure transmission and management of the secret key are ensured so as to prevent data leakage and tampering.
The step of using an AES symmetric encryption algorithm to perform end-to-end encrypted transmission of video key frames includes:
1) And (3) key generation: the sender and receiver negotiate a key, and a secure key agreement protocol is used to generate a shared key.
2) Encrypting key frames: the sender encrypts each key frame with the shared key to protect its content; the AES (Advanced Encryption Standard) symmetric encryption algorithm is used to encrypt the key frames.
3) Transmitting encrypted data: the sender transmits the encrypted key frame to the receiver, so that the data security in the transmission process is ensured.
4) Decrypting the received data: the receiving party uses the same shared secret key to decrypt the received encrypted data, and the original key frame content is restored.
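Steps 1)-4) amount to a symmetric encrypt/transmit/decrypt roundtrip. The sketch below demonstrates that roundtrip using a hash-based XOR keystream from the standard library as a stand-in cipher, so the example is self-contained; a real implementation would use AES (e.g., AES-GCM from a vetted cryptography library) as the patent specifies.

```python
import hashlib
import os

def keystream(key, nonce, length):
    """Hash-based keystream: a stdlib stand-in for AES-CTR, used only so
    this sketch runs without third-party packages. NOT for production use;
    the patent specifies AES for key frames."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key, nonce, data):
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

decrypt = encrypt   # symmetric: XOR with the same keystream inverts itself

key = os.urandom(32)     # 1) shared key from the negotiation step
nonce = os.urandom(12)   # unique value per key frame
key_frame = b"complete image data of an I-frame"

ct = encrypt(key, nonce, key_frame)          # 2) sender encrypts
restored = decrypt(key, nonce, ct)           # 3)-4) receiver decrypts

print(restored == key_frame)
```

The structure mirrors the four steps exactly: both sides hold the same key, so the receiver's decryption is the same symmetric operation the sender applied.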
The step of using a partial encryption method to perform end-to-end encrypted transmission on the video non-key frames comprises the following steps:
1) And (3) key generation: the sender and receiver still negotiate the key, generating a shared key.
2) Encrypting the non-key frames: unlike key frames, non-key frames can generally be encrypted in a more lightweight manner, because their content is less important than that of key frames; encryption may be performed in a partial-encryption mode (encrypting only the sensitive areas) to reduce the impact of the encryption process on bandwidth and performance.
3) Transmitting encrypted data: the sender transmits the encrypted non-key frame to the receiver, so that the data security in the transmission process is ensured.
4) Decrypting the received data: the receiving party uses the same shared secret key to decrypt the received encrypted data, and the original non-key frame content is restored.
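The partial-encryption idea (encrypting only a sensitive region of a non-key frame) can be sketched as follows. The XOR keystream again stands in for AES so the example stays self-contained, and the region coordinates are arbitrary illustrative values.

```python
import hashlib
import numpy as np

def xor_bytes(key, data):
    """Stdlib stand-in for a symmetric stream cipher (NOT real AES)."""
    ks = b""
    counter = 0
    while len(ks) < len(data):
        ks += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, ks[:len(data)]))

def crypt_region(frame, key, top, left, h, w):
    """Encrypt (or, applied again, decrypt) only the h x w sensitive region."""
    out = frame.copy()
    region = out[top:top + h, left:left + w]
    mixed = np.frombuffer(xor_bytes(key, region.tobytes()), dtype=np.uint8)
    out[top:top + h, left:left + w] = mixed.reshape(h, w)
    return out

frame = np.arange(64, dtype=np.uint8).reshape(8, 8)   # toy grayscale frame
key = b"shared-session-key"                           # from key negotiation

enc = crypt_region(frame, key, 2, 2, 4, 4)   # encrypt the sensitive region
dec = crypt_region(enc, key, 2, 2, 4, 4)     # XOR keystream is its own inverse

print(np.array_equal(dec, frame))            # roundtrip restores the frame
```

Only the 4x4 region is transformed; the rest of the frame is transmitted in the clear, which is what keeps the bandwidth and performance cost of non-key-frame encryption low.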
It is worth mentioning that encryption of key frames typically introduces a greater performance overhead, as its content is more important and requires more stringent protection. Encryption of non-key frames may employ a lighter weight encryption algorithm to reduce the impact of the encryption process on transmission performance.
S5, identity authentication is carried out before transmission starts, so that both sides of the transmission are ensured to be legal and trusted; the identity of the transmitting party can be verified by adopting digital certificates, token authentication and other modes so as to prevent unauthorized access and malicious attack.
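A minimal token-authentication sketch for this step, using HMAC from the standard library as a stand-in. The patent mentions digital certificates and token authentication but does not specify a scheme, so the details below are assumptions.

```python
import hashlib
import hmac
import os

# Hypothetical token scheme: the receiver verifies that the sender holds a
# pre-shared secret before any frames are transmitted. The patent does not
# detail a scheme; this HMAC construction is an assumed illustration.

secret = os.urandom(32)

def issue_token(secret, client_id):
    return hmac.new(secret, client_id.encode(), hashlib.sha256).hexdigest()

def verify_token(secret, client_id, token):
    expected = issue_token(secret, client_id)
    return hmac.compare_digest(expected, token)   # constant-time comparison

tok = issue_token(secret, "camera-01")
print(verify_token(secret, "camera-01", tok))   # legitimate sender
print(verify_token(secret, "camera-02", tok))   # wrong identity is rejected
```

`hmac.compare_digest` is used instead of `==` so that token comparison does not leak timing information to an attacker.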
The method of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded over a network for storage in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that the computer, processor, microprocessor controller or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the method described herein. Furthermore, when a general-purpose computer accesses code for implementing the method shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for performing the method shown herein.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.
Claims (10)
1. A method for processing a video frame stream, comprising the steps of:
dividing a video frame stream into key frames and non-key frames;
detecting and analyzing the video frame stream in real time by using a CNN-LSTM combined model, identifying possible security threats and detecting abnormal changes in the images;
establishing a feedback mechanism during transmission, and performing corresponding feedback and processing according to the results output by the CNN-LSTM model;
performing end-to-end encrypted transmission of the key frames and the non-key frames using an AES symmetric encryption algorithm and a partial encryption mode, respectively;
performing identity authentication before transmission begins.
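The frame-division step of claim 1 can be sketched as a simple frame-difference classifier; a deployed system would more likely reuse the codec's I/P/B frame types, and the threshold below is a hypothetical illustration only:

```python
def split_frames(frames, threshold=30.0):
    """Classify frame indices: a frame whose mean absolute difference from
    the previous frame exceeds `threshold` is treated as a key frame."""
    key, non_key = [], []
    prev = None
    for idx, frame in enumerate(frames):  # frame: flat list of pixel values
        if prev is None:
            key.append(idx)  # the first frame is always a key frame
        else:
            diff = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
            (key if diff > threshold else non_key).append(idx)
        prev = frame
    return key, non_key
```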
2. The video frame stream processing method according to claim 1, wherein the CNN-LSTM combination model includes:
convolution layer: for extracting features in the image;
pooling layer: the method is used for reducing the size of the feature map, reducing the number of parameters and simultaneously keeping key information;
stacking of convolution-pooling layers: for progressively extracting and abstracting image features;
Flatten layer: flattening the feature map output by the convolution layers into a one-dimensional vector that serves as the input to the LSTM layer;
LSTM layer: for modeling timing information in the image sequence, capturing a temporal correlation between image frames;
hidden layer: for increasing the depth and complexity of the network;
output layer: and connecting the output of the LSTM layer to the full connection layer, and finally outputting the prediction result of the model.
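The layer stack of claim 2 can be sketched by propagating tensor shapes through each stage; the concrete sizes below (a 64×64 RGB input, 3×3 kernels, 2×2 pooling, two conv-pool stages, a 16-frame sequence) are hypothetical and only illustrate how the Flatten layer bridges the CNN and LSTM parts:

```python
def conv2d_shape(h, w, c_out, k=3, stride=1, pad=0):
    """Output (height, width, channels) of a square-kernel convolution."""
    out = lambda d: (d + 2 * pad - k) // stride + 1
    return out(h), out(w), c_out

def pool_shape(h, w, c, k=2):
    """Output shape of non-overlapping k x k pooling."""
    return h // k, w // k, c

# Stacked conv-pool blocks over a hypothetical 64x64 RGB frame.
h, w, c = 64, 64, 3
for c_out in (16, 32):            # two conv-pool stages
    h, w, c = conv2d_shape(h, w, c_out)
    h, w, c = pool_shape(h, w, c)

flat_dim = h * w * c              # Flatten layer: one vector per frame
seq_len = 16                      # LSTM consumes a sequence of such vectors
lstm_input_shape = (seq_len, flat_dim)
```

The point of the trace is that the LSTM never sees a 2-D feature map: each frame is reduced to a single `flat_dim`-length vector, and the LSTM models the sequence of those vectors across frames.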
3. The video frame stream processing method according to claim 2, wherein the linear function of the CNN-LSTM combination model is:
$z_j = \sum_{i=1}^{N} w_{ij} x_i + b_j$
where $z_j$ is the weighted input sum of the $j$-th neuron of the fully connected layer, $w_{ij}$ is the weight connecting the $i$-th input feature to the $j$-th neuron, $x_i$ is the $i$-th element of the input feature vector, $b_j$ is the bias term of the $j$-th neuron, and $N$ is the dimension of the input feature vector;
the loss function of the CNN-LSTM combination model is the cross-entropy loss:
$L = -\sum_{i}\sum_{j} y_{ij} \log \hat{y}_{ij}$
where $L$ represents the difference between the predicted result output by the model and the true labels; $y_{ij}$ is the label value of the $j$-th class in the true label of sample $i$; and $\hat{y}_{ij}$ is the predicted probability of the $j$-th class in the model output for sample $i$.
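The two formulas in claim 3 can be checked numerically with a short pure-Python sketch (cross-entropy summed over samples and classes without batch averaging, matching the per-sample description above):

```python
import math

def dense(x, w, b):
    """Fully connected layer: z_j = sum_i w[i][j] * x[i] + b[j]."""
    return [sum(w[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def cross_entropy(y_true, y_pred):
    """L = -sum_i sum_j y_ij * log(y_hat_ij)."""
    return -sum(y_ij * math.log(p_ij)
                for row_t, row_p in zip(y_true, y_pred)
                for y_ij, p_ij in zip(row_t, row_p))
```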
4. The method for processing a video frame stream according to claim 2, wherein the specific steps of detecting and analyzing the video frame stream in real time by using the CNN-LSTM combination model include:
S21, preprocessing the acquired video frames, including image size adjustment and normalization operation;
S22, utilizing the preprocessed video frame, adjusting the structure and super parameters of the CNN-LSTM combined model through repeated iterative training and verification, and optimizing the CNN-LSTM combined model;
S23, inputting the video frames into the trained and optimized CNN-LSTM model, analyzing and monitoring each frame in real time through the model, detecting abnormal changes, object occlusion, and scene changes in the images, and identifying security threats.
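Step S21 (resizing and normalization) might look like the following stdlib-only sketch; nearest-neighbour resizing on a list-of-lists grayscale image stands in for what a real pipeline would do with `cv2.resize` or similar, and the target size is a hypothetical choice:

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list-of-lists image."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

def normalize(img, max_val=255.0):
    """Scale pixel values into [0, 1] for model input."""
    return [[p / max_val for p in row] for row in img]

def preprocess(frame, size=(4, 4)):
    """S21: resize, then normalize, one frame at a time."""
    return normalize(resize_nearest(frame, *size))
```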
5. The video frame stream processing method according to claim 4, wherein the step of optimizing the CNN-LSTM combining model comprises:
S221, dividing the data set into a training set, a validation set, and a test set; the training set is used for parameter training of the model, the validation set is used for hyperparameter adjustment and performance evaluation of the model, and the test set is used for performance evaluation of the final model;
S222, training the deep learning model using the training set, updating the parameters of the model through a back-propagation algorithm and an optimizer during training so that the loss function of the model gradually decreases;
S223, evaluating the trained model using the validation set, and adjusting the hyperparameters of the model according to its performance on the validation set, wherein the hyperparameters include: learning rate, batch size, network structure, and regularization parameters;
S224, monitoring the performance of the model on the validation set, and stopping training when the performance no longer improves.
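The early-stopping rule in step S224 can be sketched as follows; the patience value (how many non-improving epochs to tolerate) is a hypothetical choice not specified by the claims:

```python
def early_stopping_epochs(val_losses, patience=2):
    """Return the number of epochs actually trained: stop once the
    validation loss has failed to improve for `patience` epochs."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0   # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch        # performance no longer improves
    return len(val_losses)
```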
6. A video frame stream processing method according to claim 1, characterized in that only sensitive areas are encrypted when encrypting non-key frames.
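Claim 6's partial encryption (encrypting only sensitive regions of non-key frames) can be illustrated with the sketch below; a hashlib-based keystream serves as a stdlib stand-in for the AES stream cipher the method actually calls for, and the region coordinates are hypothetical:

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Deterministic keystream; a real system would use AES-CTR instead."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def crypt_region(frame: bytearray, width: int, region, key: bytes, nonce: bytes):
    """XOR-encrypt (or decrypt) only the pixels inside `region` = (x, y, w, h)
    of a flat grayscale frame; pixels outside the region are untouched."""
    x, y, w, h = region
    ks = keystream(key, nonce, w * h)
    for row in range(h):
        for col in range(w):
            idx = (y + row) * width + (x + col)
            frame[idx] ^= ks[row * w + col]
    return frame
```

Because the operation is an XOR with the same keystream, applying `crypt_region` twice with the same key and nonce restores the original region, while the rest of the frame is never modified.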
7. A video frame stream processing system, comprising:
a classification module: for dividing a video frame stream into key frames and non-key frames;
a detection and analysis module: for detecting and analyzing the video frame stream in real time by using a CNN-LSTM combined model, identifying possible security threats and detecting abnormal changes in the images;
a transmission feedback module: for establishing a feedback mechanism during transmission and performing corresponding feedback and processing according to the results output by the CNN-LSTM model;
a transmission encryption module: for performing end-to-end encrypted transmission of the key frames and the non-key frames using an AES symmetric encryption algorithm and a partial encryption mode, respectively;
an identity authentication module: for performing identity authentication before transmission begins.
8. A computer storage medium storing a readable program, characterized in that, when the program runs, the video frame stream processing method according to any one of claims 1 to 6 is executed.
9. An electronic device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to a video frame stream processing method according to any one of claims 1 to 6.
10. A computer program product comprising computer instructions that instruct a computing device to perform operations corresponding to a video frame stream processing method as claimed in any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410760158.2A CN118540515A (en) | 2024-06-13 | 2024-06-13 | Video frame stream processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118540515A true CN118540515A (en) | 2024-08-23 |
Family
ID=92380986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410760158.2A Pending CN118540515A (en) | 2024-06-13 | 2024-06-13 | Video frame stream processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118540515A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119723426A (en) * | 2025-02-25 | 2025-03-28 | 深圳百通玄武技术有限公司 | Intelligent analysis method and system for video stream based on neural network |
CN119723426B (en) * | 2025-02-25 | 2025-06-20 | 深圳市极客智能科技有限公司 | Intelligent analysis method and system for video stream based on neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||