
CN120201147A - A surveillance video review method and device based on multimodal model - Google Patents


Info

Publication number
CN120201147A
CN120201147A (application CN202510404509.0A)
Authority
CN
China
Prior art keywords
features
video
model
modal
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510404509.0A
Other languages
Chinese (zh)
Inventor
刘世杰
王鹏
陈海峰
韩丰景
郭文艳
易平科
马嘉兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Online Information Technology Co Ltd
Original Assignee
China Unicom Online Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Online Information Technology Co Ltd filed Critical China Unicom Online Information Technology Co Ltd
Priority to CN202510404509.0A priority Critical patent/CN120201147A/en
Publication of CN120201147A publication Critical patent/CN120201147A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • H04N5/91Television signal processing therefor
    • H04N5/93Regeneration of the television signal or of selected parts thereof
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of surveillance video processing, and provides a surveillance video review method and device based on a multimodal model. The method comprises: slicing real-time surveillance videos of different service scenes into segments of a specified duration to generate a video segment sequence; generating a unique index identifier for each video segment in the sequence and establishing a mapping table between index identifiers and segments; extracting multimodal features from the videos with CNN and VGGish models and global dependency features with a Transformer model; fusing the extracted multimodal features with the global dependency features to obtain fusion features, and generating a natural language description from them; constructing a cross-modal retrieval vector space and storing it in a vector database to form a storage path for each video segment; and receiving a query sentence input by a user, converting it into a vector representation, and performing a similarity search in the vector database to obtain the video segments corresponding to the query. The invention enables more accurate retrieval and improves review efficiency.

Description

Surveillance video review method and device based on a multimodal model
Technical Field
The invention belongs to the technical field of video processing, and provides a monitoring video review method and device based on a multi-mode model.
Background
With the wide application of monitoring cameras in the fields of home care, public safety, traffic management, business, etc., the amount of monitoring video data has increased exponentially. Traditional monitoring review mode relies on manual frame-by-frame retrieval, which is inefficient and prone to missing key information. In order to improve the retrieval efficiency and the intelligent level of the monitoring video, an intelligent monitoring review technology is generated.
An existing method obtains initial pictures of a camera at different rotation angles in advance, screens abnormal events by comparing the real-time monitoring picture with the initial pictures, and visually displays the abnormal events and their related information as a progress bar during playback. By comparing initial and real-time pictures, this method achieves intelligent extraction and visual display of abnormal events, lets users check abnormal events in playback video intuitively and efficiently, and improves the user experience of the video playback function to some extent. However, because abnormal events are screened by comparing the initial picture with the real-time picture, events may be missed or falsely detected due to limits on the quality and coverage of the initial pictures. Moreover, screening by picture difference alone performs no semantic understanding of video content and cannot support natural language queries or the analysis of complex scenes. Although progress-bar visualization of abnormal events improves review efficiency, the user still has to view video clips frame by frame and cannot quickly locate content matching a specific semantic description. Finally, the prior art focuses mainly on abnormal-event detection and provides no comprehensive understanding and indexing of video content, so it adapts poorly to diversified review requirements (such as queries for specific people, behaviors, or scenes) and has limited extensibility.
Therefore, there is a need to provide a method and apparatus for review of surveillance video based on a multi-modal model to solve the above-mentioned problems.
Disclosure of Invention
The invention provides a surveillance video review method and device based on a multimodal model, aimed at the technical problems of the prior art: it cannot support queries based on natural language or the review needs of complex scenes; relevant review clips cannot be located quickly and must be reviewed frame by frame, making review inefficient; and users cannot accurately query and analyze specific people, behaviors, or scenes.
A first aspect of the invention provides a surveillance video review method based on a multimodal model, comprising: slicing real-time surveillance videos of different service scenes into segments of a specified duration to generate a video segment sequence; generating a unique index identifier for each video segment in the sequence and establishing a mapping table between index identifiers and segments; extracting multimodal features from the videos with CNN and VGGish models and global dependency features with a Transformer model; fusing the extracted multimodal features with the global dependency features to obtain final fusion features and generating a natural language description of them; storing the resulting vectors in a vector database to form a storage path for each video segment; and receiving a query sentence input by a user, converting it into a vector representation, and performing a similarity search in the vector database to obtain the video segments corresponding to the query.
A second aspect of the invention provides a surveillance video review device based on a multimodal model, comprising a generation module, an extraction module, and a query module. The generation module slices real-time surveillance videos of different service scenes into segments of a specified duration to generate video segment sequences, generates a unique index identifier for each video segment, and establishes a mapping table between index identifiers and segments. The extraction module extracts multimodal features from the videos with CNN and VGGish models and global dependency features with a Transformer model, fuses the two to obtain final fusion features, generates a natural language description of the fusion features, and stores the cross-modal retrieval vector space in a vector database to form a storage path for each video segment; the multimodal features include audio features and behavior recognition features. The query module receives a query sentence input by a user, converts it into a vector representation, and performs a similarity search in the vector database to obtain the video segments corresponding to the query.
The third aspect of the invention provides an electronic device, which comprises one or more processors, a storage device and a display device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the monitoring video review method based on the multi-mode model.
A fourth aspect of the present invention provides a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal model-based surveillance video review method of the first aspect of the present invention.
The embodiment of the invention has the following advantages:
Compared with the prior art, the invention proposes a pipeline-parallel architecture of video segment generation, multimodal model analysis, and vector encoding, with zero-copy transfer of segment data through a GPU memory-sharing mechanism, ensuring that semantic indexing and vectorized storage complete as soon as a segment does. This replaces the serial mode in which segment storage and semantic analysis are separated, effectively reduces end-to-end processing latency, and avoids the storage-bandwidth pressure of reading video segments a second time. Visual, audio, and other modal features are fused into unified semantic vectors to construct a cross-modal retrieval vector space; the multimodal features are further fused with the extracted global dependency features, and each video segment is represented by the fusion features, so retrieval is more accurate and a "search video by text" retrieval mode is achieved. Within each segment, indexes are built on timestamps to support time-range queries. The data is semantically analyzed, semantic feature vectors are extracted, and a semantic vector index is constructed with a vector indexing algorithm (using an open-source vector database such as Milvus). The data objects are stored in an object storage system (OSS), the storage address of each data object is recorded, and an index is built for quick positioning. A three-level association of slice-timestamp index, semantic-vector index, and object-storage-address index locates video segments rapidly from a natural language description, effectively improving video retrieval efficiency.
In addition, the multimodal model deeply analyzes video content and generates detailed semantic descriptions, supporting natural language queries and the review needs of complex scenes. By storing video segment indexes and semantic descriptions in a vector database, a user can quickly locate relevant review clips from a description instead of viewing frame by frame, which markedly improves review efficiency. Comprehensive understanding and indexing of video content supports diversified review requirements, satisfies accurate queries and analysis of specific people, behaviors, or scenes, and improves extensibility and applicability.
In addition, the multi-mode model carries out semantic understanding on the video clips, so that dependence on initial picture comparison is effectively avoided, and accuracy and coverage range of scene abnormal event detection are improved.
Drawings
FIG. 1 is a flowchart illustrating steps of an exemplary multi-modal model-based surveillance video review method of the present invention;
FIG. 2 is a schematic diagram of the model principle of the monitoring video review method based on the multi-mode model of the present invention;
FIG. 3 is a block diagram of a multi-modal model based surveillance video review device of the present invention;
FIG. 4 is a schematic structural view of an embodiment of an electronic device according to the present invention;
fig. 5 is a schematic diagram of an embodiment of a computer readable medium according to the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In view of the above problems, the present invention provides a monitoring video review method based on a multi-modal model.
The method abstracts the monitoring video into a structured mathematical model, firstly carries out slicing treatment on the real-time monitoring video stream, generates video fragments of a time interval, and establishes a unique index for each fragment. And carrying out semantic understanding on the video clips through the multi-modal model, generating natural language description, converting the natural language description into high-dimensional feature vectors, storing the high-dimensional feature vectors in a vector database, and simultaneously storing the video clips in an object storage system. In the review query stage, query sentences input by a user are converted into feature vectors, similarity retrieval is carried out, the most relevant video segment indexes are located, and finally, corresponding video contents are extracted from object storage, so that complete technical closed loop from video slicing and semantic understanding to efficient retrieval is realized. By carrying out segmentation, indexing and semantic understanding on the monitoring video and combining the vector model and the vector database, intelligent retrieval based on natural language description is realized, and the management and review efficiency of the monitoring video is greatly improved.
Example 1
The following describes the present invention in detail with reference to fig. 1 and 2.
FIG. 1 is a flowchart illustrating steps of an exemplary multi-modal model-based video surveillance review method of the present invention.
As shown in fig. 1, in step S101, real-time monitoring videos of different service scenes are fragmented according to a specified duration, and a video fragment sequence is generated.
In a specific embodiment, in a traffic monitoring scene, real-time traffic monitoring video is segmented according to a specified duration. The specified duration is within the range of 2 seconds to 30 seconds.
For example, the input is a real-time traffic monitoring video stream V, and V is sliced according to a specified duration ∆t (5 seconds in this embodiment), generating a series of video segments {v_1, v_2, ..., v_n}, where each segment v_i (i denotes the i-th video segment, i a positive integer, specifically 1, 2, ..., n) corresponds to a time interval:

{v_1, v_2, ..., v_n} = Slice(V, ∆t)

where V denotes the input real-time traffic monitoring video stream, ∆t denotes the specified duration, and Slice denotes the slicing operation that cuts V into segments of duration ∆t.
It should be noted that, in the present invention, the selection of the specified duration is determined according to the specific service requirement and the characteristics of the monitoring scenario. The selection of the specified duration takes into account the frequency of occurrence of events (if the frequency of occurrence of events in the monitored scene is high, the duration of the fragments is configured to be short so as to capture more details of the event), the limitations of storage and computation resources (shorter durations of fragments increase the overhead of storage and computation, and thus select the appropriate duration within the allowed range of resources), the granularity of the traffic analysis (the duration of fragments matches the granularity of the traffic analysis, e.g., if a minute-level of behavior is required to be analyzed, the duration of fragments may be configured to be minute-level). In addition, for traffic monitoring scenes, in traffic monitoring videos, the running track, the illegal behaviors and the like of the vehicles need to be captured, and short-time behaviors of the vehicles, such as red light running, illegal lane changing and the like, can be captured well by using a specified duration of 5 seconds. For a market security monitoring scene, in the market security monitoring, the behavior, abnormal events and the like of a customer are required to be monitored, the walking path, the stay time and the like of the customer can be captured by using the appointed duration of 10 seconds, and meanwhile, the abnormal behavior such as overlong stay and the like can be timely found. 
For a factory production monitoring scene, in a factory production monitoring video, the running state of a production line, the operation condition of equipment and the like are required to be monitored, and the running period of the production line can be better reflected by using a specified duration of 30 seconds, so that the analysis of the production efficiency is facilitated. For home care monitoring scenes, in home care video, sudden events such as falling of old people, accidental injuries of children or sudden discomfort of patients are involved, the events usually occur between a few seconds and tens of seconds, and the whole process of a key event can be captured while the timeliness is ensured by using a designated time length of 10 seconds. The foregoing is illustrative only and is not to be construed as limiting the invention.
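As an illustration of step S101, the following is a minimal Python sketch of fixed-duration slicing, not the patent's implementation: the `VideoSegment` and `slice_stream` names are hypothetical, and the live stream is abstracted to a known total duration.

```python
from dataclasses import dataclass

@dataclass
class VideoSegment:
    start_s: float  # segment start offset within the stream, in seconds
    end_s: float    # segment end offset, in seconds

def slice_stream(total_duration_s: float, delta_t_s: float) -> list[VideoSegment]:
    """Slice a stream of known duration into fixed-length segments,
    {v_1, ..., v_n} = Slice(V, delta_t); the last segment may be shorter."""
    if not 2 <= delta_t_s <= 30:
        raise ValueError("specified duration should be within [2, 30] seconds")
    segments, start = [], 0.0
    while start < total_duration_s:
        end = min(start + delta_t_s, total_duration_s)
        segments.append(VideoSegment(start, end))
        start = end
    return segments
```

With ∆t = 5 s, a 23-second stream yields five segments, the last one 3 seconds long, matching the traffic-monitoring example above.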
Next, in step S102, a unique index identifier is further generated for each video clip in the video clip sequence, and a mapping relationship table between the index identifier and each video clip is established.
Specifically, for each video clip v_i a unique index id_i is generated, and a mapping relation between index and clip is established:

id_i = Index(v_i)

M = {(id_i, v_i) | i = 1, 2, ..., n}

where id_i denotes the index of the i-th video clip v_i, i a positive integer, specifically 1, 2, ..., n; and M denotes the mapping relation table between indexes and video clips. The index uses the start timestamp of the slice, accurate to the millisecond (e.g., 20250321145678901), recording the start time of the video slice.
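The millisecond-precision index identifier and mapping table can be sketched as follows; this is an illustration under the timestamp format described above, and the function names are hypothetical.

```python
from datetime import datetime

def make_index_id(start: datetime) -> str:
    """Index identifier id_i derived from the segment's start timestamp,
    accurate to the millisecond: YYYYMMDDHHMMSSmmm."""
    return start.strftime("%Y%m%d%H%M%S") + f"{start.microsecond // 1000:03d}"

def build_mapping(starts: list[datetime]) -> dict[str, int]:
    """Mapping table M: index identifier id_i -> segment position i."""
    return {make_index_id(s): i + 1 for i, s in enumerate(starts)}
```

Because identifiers sort lexicographically by time, a timestamp index over them directly supports the time-range queries mentioned later.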
Next, in step S103, CNN and VGGish models are used to extract multimodal features from the real-time surveillance videos of different service scenes, a Transformer model is used to extract global dependency features, and the extracted multimodal features are fused with the global dependency features to obtain fusion features. A natural language description of the final fusion features is then generated to construct a cross-modal retrieval vector space, which is stored in a vector database to form a storage path for each video segment. The multimodal features include audio features and behavior recognition features.
Specifically, multimodal features are extracted from the real-time surveillance videos of the different service scenes.
A multimodal model performs semantic understanding of each video segment and generates a corresponding natural language description:

D_i = Describe(v_i)

where Describe denotes the model generating the natural language description of each video clip, and D_i denotes the natural language description of video clip v_i.
Specifically, the multimodal model combines a convolutional neural network (CNN), an audio feature extraction model (VGGish), and a Transformer architecture for semantic understanding of each video segment and generation of the corresponding natural language description. Its core is to extract behavior-related local features of each video segment (such as character actions, facial expressions, glass breaking, background music) through the CNN (corresponding to "ResNet" in FIG. 2) and VGGish, capture the global dependency relations of each segment through the Transformer attention mechanism, and finally generate the natural language description. See FIG. 2.
It should be noted that in this example the audio feature extraction model is the VGGish model, a model built on the VGG network structure. A training data set is built from the surveillance video data of each application scene to which the invention applies, annotated with key characters, risk or abnormal behaviors, facial expressions, and key events (such as glass breaking or pained facial expressions), and incremental training is performed on the existing VGGish model to obtain the audio feature extraction model.
To prepare a video segment for input to the multimodal model, the clip v_i is first divided into frames, and behavior recognition features are extracted from each frame with a pre-trained CNN model. Suppose the clip v_i contains T frames, with the feature vector of each frame f_t ∈ R^{d_f}; the clip is then characterized by F = [f_1, f_2, ..., f_T] ∈ R^{T×d_f}, where d_f denotes the dimension of the behavior recognition feature extracted from each frame by the pre-trained CNN model and T denotes the number of time steps.
It should be noted that the convolutional neural network (CNN) model is built with ResNet (Residual Neural Network), whose residual connections (also called skip connections) mitigate the vanishing-gradient and degradation problems of deep neural network training, making it particularly suitable for tasks such as image classification, target detection, and feature extraction.
Local features of each frame (e.g., character actions, facial expressions) are extracted by the multi-layer convolutional part of the CNN model, using a one-dimensional convolution kernel W_c ∈ R^{k×d_c}, where k is the kernel size and d_c the output feature dimension. The convolution operation is:

y_t = ReLU(W_c · f_{t:t+k-1} + b_c)

where y_t denotes the t-th feature vector obtained after the one-dimensional convolution; f_{t:t+k-1} denotes the features from frame t to frame t+k-1; b_c is a bias term; and W_c is the one-dimensional convolution kernel of the convolutional part.
Multi-layer convolution (34 layers preferred in this example) extracts local features of different scales from each video frame, i.e., the various local features relevant to behavior recognition: bottom-level features such as object contours, color gradients, and simple textures; middle-level features such as the eyes, mouth, and hand movements of a face; and higher-level, semantic visual features such as running, fighting, or opening a door.
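The one-dimensional convolution above can be sketched in NumPy under simplifying assumptions: a single layer with ReLU activation over pre-extracted frame features, rather than the 34-layer ResNet the patent prefers.

```python
import numpy as np

def temporal_conv1d(F: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """1-D convolution over the time axis of frame features F of shape (T, d_f):
    y_t = ReLU(W . F[t:t+k] + b), mirroring the local-feature step above."""
    T, d_f = F.shape
    k, d_in, d_out = W.shape  # kernel size, input dim, output feature dim
    assert d_in == d_f
    out = np.zeros((T - k + 1, d_out))
    for t in range(T - k + 1):
        window = F[t:t + k]  # frames t .. t+k-1
        out[t] = np.maximum(0.0, np.einsum("kf,kfo->o", window, W) + b)
    return out
```

Each output step mixes k consecutive frames, which is how short temporal patterns (a hand movement, a door opening) become single feature vectors.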
Then, audio features are extracted for each video clip. The audio data is input to a pre-trained audio extraction model (the VGGish model), which outputs 128-dimensional feature vectors through a 4-layer convolutional neural network and a fully connected layer, extracting features such as acoustic texture, temporal patterns, and semantic encodings from the audio in each video segment. Let the audio clip corresponding to video clip v_i yield feature vectors a_t from the audio model; the audio clip is then characterized by A = [a_1, a_2, ..., a_T] ∈ R^{T×d_a}, where d_a denotes the dimension of each audio feature vector, d_a ∈ [64, 256], 128 dimensions in this example.
Then, multimodal feature fusion is performed, and a Transformer model is used for global dependency feature extraction.
Specifically, the video features F (behavior recognition features in this example) and the audio features A are fused multimodally. First, the audio features are time-aligned to ensure consistent time steps with the video features. The video and audio features are then mapped to the same dimensional space through fully connected layers:

F' = W_f F + b_f

A' = W_a A + b_a

where F' denotes the video features obtained after mapping F; A' denotes the audio features obtained after mapping A; W_f and W_a are trainable weight matrices; and b_f and b_a are bias terms. The mapped features are combined by weighted summation to obtain the multimodal fusion feature M':

M' = α (F' + A') + β (F' ⊙ A') + γ · CrossAttn(F', A')

where M' denotes the multimodal fusion feature obtained after weighted summation; α is a learnable weight that, during model training, balances the contributions of behavior recognition features and audio features; β is the interaction-term weight controlling local feature interaction strength; γ is the attention weight adjusting the global dependency; ⊙ denotes the Hadamard product (element-wise multiplication); and CrossAttn is a lightweight cross attention belonging to a single-head attention mechanism. The trainable parameters α, β, γ lie in the range [0.2, 0.9] and are all included in the back-propagation chain rule; continuously optimizing them by gradient descent during training avoids the modal suppression or overfitting caused by extreme weights and lets the model adaptively learn the importance of different modal features.
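A minimal NumPy sketch of the weighted fusion is given below. It is an illustration, not the patent's implementation: the lightweight cross-attention term is assumed here to be a plain single-head scaled dot-product with video queries attending to audio keys/values, and the fixed weights stand in for the learnable α, β, γ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(Fp: np.ndarray, Ap: np.ndarray,
         alpha=0.5, beta=0.3, gamma=0.2) -> np.ndarray:
    """M' = alpha*(F' + A') + beta*(F' (Hadamard) A') + gamma*CrossAttn(F', A'),
    with F', A' of shape (T, d) already mapped to the same dimension."""
    d = Fp.shape[-1]
    attn = softmax(Fp @ Ap.T / np.sqrt(d)) @ Ap  # single-head cross attention
    return alpha * (Fp + Ap) + beta * (Fp * Ap) + gamma * attn
```

Setting β = γ = 0 reduces the fusion to a plain weighted sum, which makes the role of the interaction and attention terms easy to probe in isolation.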
The multimodal fusion feature is position-encoded to preserve temporal ordering, yielding the model input for global dependency feature extraction:

X = PW(DW(M' + P))

where X ∈ R^{L×d} denotes the model input for extracting global dependency features, with t the time step, M_1 its behavior recognition component and M_2 its audio component; P ∈ R^{L×d} is a learnable position matrix, obtained during model training and optimization, that encodes time-position information; L is the input sequence length and d the feature dimension of the model input; DW is the depthwise convolution of a depthwise separable convolutional layer, with depthwise kernel K_d convolved along the time dimension and kernel size k from 1 to 6 (k preferably 3); and PW is the pointwise convolution with a 1×1 kernel.
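The position encoding plus depthwise separable convolution can be sketched as follows; this is an assumed reading of the step (learnable additive position matrix, then depthwise conv along time with 'same' padding, then a pointwise 1×1 projection), with hypothetical argument names.

```python
import numpy as np

def ds_conv_with_pos(M: np.ndarray, P: np.ndarray,
                     K_d: np.ndarray, W_p: np.ndarray, k: int = 3) -> np.ndarray:
    """X = pointwise(depthwise(M' + P)): add the learnable position matrix P,
    apply one depthwise kernel per channel along time, then a 1x1 projection."""
    X = M + P                               # (L, d) position-encoded input
    L, d = X.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))    # 'same' padding along time
    dw = np.zeros_like(X)
    for t in range(L):                      # depthwise: channels do not mix
        dw[t] = np.einsum("kd,kd->d", Xp[t:t + k], K_d)
    return dw @ W_p                         # pointwise 1x1 convolution
```

The depthwise stage captures local temporal context per feature channel cheaply; the pointwise stage is where the channels mix, which is what makes the layer far lighter than a full convolution.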
It should be noted that the core of the Transformer model is the self-attention mechanism.
Specifically, let X be the model input. First, the query (Q), key (K), and value (V) corresponding to the model input X are computed:

Q = X W_Q,  K = X W_K,  V = X W_V

where W_Q, W_K, and W_V are trainable weight matrices.

Next, the self-attention score of the model input X is computed:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where Attention(Q, K, V) denotes the self-attention score of the model input X, Q denotes the query vector corresponding to X, K the key corresponding to X, V the value corresponding to X, and d_k the key dimension.
In this example, the multi-head attention mechanism divides the model input X into h subspaces; each head computes independent attention, and the results are finally concatenated and linearly transformed:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where MultiHead denotes the multi-head attention mechanism in the Transformer model; W_O ∈ R^{h·d_v×d_c} is the output transformation matrix; i denotes the i-th attention head and h the number of heads in the multi-head attention mechanism, both positive integers, with i specifically 1, 2, ..., h and h in the range 3 to 10; d_c denotes the dimension of the input features; and d_v denotes the dimension of each attention head's value vector.
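The scaled dot-product and multi-head computations above follow the standard Transformer formulation, sketched here in NumPy with single-sequence (unbatched) inputs for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, Wq, Wk, Wv, Wo, h: int):
    """Project X, split Q/K/V column-wise into h subspaces, attend per head,
    then concatenate the heads and apply the output transform W_O."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = [attention(*(m.reshape(m.shape[0], h, -1)[:, i, :]
                         for m in (Q, K, V)))
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo
```

For a single-token sequence the softmax weight is 1, so the output is just the (projected) value, a quick sanity check on any attention implementation.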
Further, the multi-modal fusion features are fused with the global dependency features extracted by the Transformer model. The fusion strategy adopts weighted summation, followed by a nonlinear transformation through a fully connected layer, yielding the final fusion features:
z = σ(W_z·[M'; MultiHead(Q, K, V)] + b_z);
wherein z represents the final fusion feature obtained by secondarily fusing the multi-modal fusion features and the extracted global dependency features; W_z and b_z are trainable parameters obtained in the model optimization process; [M'; MultiHead(Q, K, V)] represents a splicing operation; M' represents the multi-modal fusion feature obtained by weighted fusion of the multiple features; MultiHead(Q, K, V) represents the multi-head attention mechanism in the Transformer model; σ is a nonlinear activation function.
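The concatenation-plus-fully-connected fusion step can be sketched as follows; the weights, bias and toy feature values are hypothetical stand-ins, and a sigmoid is assumed as the nonlinearity:

```python
import math

def fuse(m_prime, attn_out, W, b):
    """Concatenate [M'; MultiHead] and apply a fully connected layer with a sigmoid nonlinearity."""
    x = m_prime + attn_out                  # splicing operation: list concatenation
    z = []
    for row, bias in zip(W, b):             # one output unit per weight row
        s = sum(w * v for w, v in zip(row, x)) + bias
        z.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid squashes to (0, 1)
    return z

m_prime = [0.2, 0.5]    # toy multi-modal fusion feature
attn = [0.1, 0.9]       # toy global dependency feature
W = [[0.5, -0.5, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]]  # hypothetical trained weights
b = [0.0, 0.1]
z = fuse(m_prime, attn, W, b)
```

In a trained system W and b would come from the optimization process; here they only demonstrate the shape of the computation.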
The obtained final fusion feature z is passed through a Transformer decoder to generate a natural language description D, expressed by the following expression:
D = TransformerDecoder(z);
wherein the convolution kernel size k is in the range of 2 to 6, with k preferably 3; the output feature dimension d is within the range [128, 512], with d preferably 512; the number of multi-head attention heads h is in the range [3, 10], with h preferably 8; the position-encoding dimension is the same as the CNN output feature dimension; and the number of decoder layers is in the range [2, 16], preferably 4 layers.
Specifically, natural language descriptions are generated for the video segments of different business scenes and converted into text vector representations (i.e., high-dimensional feature vectors) to construct a cross-modal retrieval vector space, which is stored in a vector database; the video segments themselves are stored in an object storage system.
Each video clip v_i is stored in the object store O, which generates its storage path p_i, using the following expression:
p_i = Store(O, v_i);
wherein p_i represents the storage path corresponding to the i-th video clip, and O represents an object storage system (e.g., AWS S3, a cloud OSS, etc.).
The natural language description d_i is converted into a text vector by a vector model, and the index i of the i-th video clip v_i, the i-th feature vector e_i, and the storage path p_i of the i-th video clip v_i are logged into the vector database DB, expressed by the following expressions:
e_i = E(d_i);
DB.insert(i, e_i, p_i);
wherein E represents the vector model for converting natural language descriptions into text vectors, and DB represents the vector database supporting efficient feature-vector storage and retrieval. For example, the vector model employs BERT, whose input is a natural language description, typically a piece of text or a sentence corresponding to the video clip to be processed, such as "a person in red clothing enters a room"; the output is the feature vector corresponding to the video clip to be processed, i.e., a vectorized representation of the input text, with a dimension of, for example, 768. The vector database adopts the open-source vector database Milvus, which can associate the segment timestamp index, the semantic vector index and the object storage address index to form a one-to-one multi-level index relationship, so as to realize multi-dimensional retrieval based on time, semantics and storage paths. The user enters a timestamp or a semantic vector; the corresponding video segment index is found through the timestamp index or the semantic vector index, the storage path of the video segment is then found through the object storage address index, and the specific video segment is accurately located at millisecond speed.
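A minimal in-memory stand-in for the multi-level index relationship (clip index ↔ timestamp ↔ semantic vector ↔ storage path); in the described system this role is played by Milvus plus object storage, and the class, field names and sample path here are illustrative only:

```python
class ClipIndex:
    """Toy three-level index: timestamp index, semantic vector index, storage-path index."""
    def __init__(self):
        self.by_timestamp = {}   # slice timestamp -> clip index
        self.vectors = {}        # clip index -> semantic vector
        self.paths = {}          # clip index -> object storage path

    def insert(self, idx, timestamp, vector, path):
        self.by_timestamp[timestamp] = idx
        self.vectors[idx] = vector
        self.paths[idx] = path

    def lookup_by_time(self, timestamp):
        # timestamp -> clip index -> storage path
        idx = self.by_timestamp[timestamp]
        return self.paths[idx]

db = ClipIndex()
db.insert(0, "2025-01-01T08:00:00", [0.1, 0.9], "s3://bucket/clip_0.mp4")
path = db.lookup_by_time("2025-01-01T08:00:00")
```

A semantic lookup would scan (or, in a real vector database, approximately search) `self.vectors` for the entry nearest to a query vector before following the same path lookup.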
Next, in step S104, a query sentence input by a user is received, the query sentence is converted into a vector representation, and a similarity search is performed in the vector database, so as to obtain a video clip corresponding to the query sentence.
Specifically, for a query statement q input by the user, e.g., q = "a person wearing red clothes enters a room", the vector model BERT is used to convert the query statement q into a query vector A (i.e., it is converted into a vector representation), and a similarity search is performed in the vector database (e.g., Milvus). The search algorithm uses Euclidean-distance similarity: the Euclidean distance, i.e., the straight-line distance, between the query vector A of the query statement and each vector in the vector database is calculated, and the smaller the distance, the higher the similarity. It is calculated using the following expression:
d(A, B) = √( Σ_{i=1}^{n} (a_i − b_i)² );
wherein d(A, B) represents the Euclidean distance, i.e., the straight-line distance, between the query vector of the query statement and a vector in the vector database; A and B are two n-dimensional vectors; a_i represents the value of the query vector A in the i-th dimension, and b_i represents the value of the database vector B in the i-th dimension.
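The Euclidean-distance nearest-neighbour search can be sketched in a few lines of Python; the three database vectors are toy values:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, database):
    """Index of the vector closest to the query: smallest distance = highest similarity."""
    return min(range(len(database)), key=lambda i: euclidean(query, database[i]))

vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]]
i_star = nearest([1.0, 1.0], vectors)   # exact match at index 1
```

A production vector database replaces this linear scan with an approximate index, but the distance criterion is the same.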
Specifically, the index i* of the nearest video clip is found; then, according to the index i*, the corresponding video clip v* is obtained from the object store O.
The specific expression is as follows:
A = E(q);
i* = NearestSearch(DB, A, 1);
p* = Lookup(T, i*);
v* = Fetch(O, p*);
wherein NearestSearch(DB, A, 1) represents searching the vector database DB for the feature vector closest to A (quantity 1) and returning the corresponding index i*; Lookup(T, i*) represents obtaining the corresponding storage path p* from the mapping relation table T according to the index i*; Fetch(O, p*) represents retrieving the video clip v* from the object store O according to the path p*.
The above steps are integrated into a complete formulation:
v* = Fetch(O, Lookup(T, NearestSearch(DB, E(q), 1)));
wherein v* represents the video clip corresponding to the query statement q input by the user; q represents the query statement input by the user; Fetch(O, p*) represents retrieving the video clip from the object store O according to the path p*; Lookup(T, i*) represents obtaining the corresponding storage path from the mapping relation table T according to the index i*; NearestSearch(DB, E(q), 1) represents searching the vector database DB for the nearest feature vector and returning the corresponding index i*.
Compared with the prior art, the proposed pipeline-parallel architecture of video slice generation, large-model analysis and vector coding realizes zero-copy transfer of slice data through a GPU video-memory sharing mechanism, ensuring that semantic indexing and vectorized storage are completed as soon as a slice is generated. This effectively replaces the serial mode in which slice storage and semantic analysis are separated, reduces end-to-end processing latency, and avoids the storage-bandwidth pressure caused by reading video slices a second time. Multi-modal features such as visual features and audio features are fused to obtain multi-modal features and generate unified semantic vectors, constructing a cross-modal retrieval vector space; the multi-modal features are further fused with the extracted global dependency features to obtain fusion features, and each video slice is characterized using the fusion features so that retrieval can be performed more accurately, effectively realizing a "search video by text" retrieval mode. A timestamp-based index is built for each slice to support time-range queries. The data is semantically analyzed, semantic feature vectors are extracted, and a semantic vector index is constructed using a vector indexing algorithm (with an open-source vector database such as Milvus). The data objects are stored in an object storage system (OSS), the storage address of each data object is recorded, and an index is built for quick positioning. A three-level association structure of slice timestamp index, semantic vector index and object storage address index is established, so that video segments can be rapidly located from a natural language description, effectively improving video retrieval efficiency.
In addition, the multi-modal model is used to deeply analyze the video content and generate detailed semantic descriptions, supporting natural-language queries and the review requirements of complex scenes. By storing video segment indexes and semantic descriptions in a vector database, a user can quickly locate the relevant review segments from a description, avoiding frame-by-frame viewing and remarkably improving review efficiency. This provides comprehensive understanding and indexing of the video content, supports diversified review requirements, satisfies the user's precise querying and analysis of specific persons, behaviors or scenes, and improves expansibility and applicability.
In addition, the multi-modal model performs semantic understanding on the video clips, effectively avoiding dependence on initial-picture comparison and improving the accuracy and coverage of scene abnormal-event detection.
Example 2
The following are examples of the apparatus of the present invention that may be used to perform the method embodiments of the present invention. For details not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method of the present invention.
Fig. 3 is a schematic structural diagram of an example of a monitoring video review device based on a multi-mode model according to the present invention. The monitoring video review device will be described with reference to fig. 3. The monitoring video review device is used for executing the monitoring video review method according to the first aspect of the invention.
As shown in fig. 3, the surveillance video review device 300 includes a generation processing module 310, a setup module 320, an extraction processing module 330, and a query retrieval module 340.
In a specific embodiment, the generation processing module 310 is configured to segment the real-time surveillance video of different service scenarios according to a specified duration, and generate a video clip sequence. The establishing module 320 further generates a unique index identifier for each video clip in the video clip sequence, and establishes a mapping relationship table between the index identifiers and each video clip. The extraction processing module 330 extracts multi-modal features from the real-time monitoring video of different service scenes by adopting a CNN model, extracts global dependency features by adopting a Transformer model, fuses the extracted multi-modal features and the global dependency features to obtain final fusion features, further generates a natural language description of the final fusion features to construct a vector space for cross-modal retrieval, and stores the vector space into a vector database to form a storage path of each video segment, wherein the multi-modal features comprise audio features and behavior recognition features. The query retrieval module 340 is configured to receive a query sentence input by a user, convert the query sentence into a vector representation, and perform a similarity search in the vector database to obtain the video clip corresponding to the query sentence.
According to an alternative embodiment, before fusing the extracted multimodal features and the global dependency features to obtain fusion features, the extracted multimodal features are weighted and fused by adopting the following expression to obtain multimodal fusion features:
M' = α·M1 + β·M2 + γ·(M1 ⊙ M2) + δ·CrossAttn(M1, M2);
wherein M' represents the multi-modal fusion feature obtained after weighted summation; α and β are learnable weight parameters that, during the training of the model, balance the contributions of the behavior recognition features and the audio features; γ is the interaction-term weight, controlling the local feature interaction strength; δ is the attention weight, used to adjust the global dependency; these trainable parameters are all contained in the back-propagation chain rule, lie in the range [0.2, 0.9] to avoid modal suppression or overfitting caused by extreme weights, and are optimized by gradient descent during the training of the model so that the model can adaptively learn the importance among different modal features; ⊙ represents the Hadamard product, i.e., element-wise multiplication; CrossAttn(M1, M2) is a lightweight cross-attention, belonging to a single-head attention mechanism.
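As a toy illustration of the weighted fusion above (the weights and feature values are invented for the example, and the real cross-attention output is replaced by a fixed stand-in vector):

```python
def weighted_fuse(m1, m2, alpha, beta, gamma, delta, cross):
    """M' = alpha*M1 + beta*M2 + gamma*(M1 (.) M2) + delta*cross, element-wise.

    The gamma term is the Hadamard (element-wise) product interaction;
    `cross` stands in for the cross-attention output."""
    return [alpha * a + beta * b + gamma * (a * b) + delta * c
            for a, b, c in zip(m1, m2, cross)]

m1 = [1.0, 0.5]     # toy behavior recognition feature
m2 = [0.2, 0.8]     # toy audio feature
cross = [0.3, 0.3]  # hypothetical cross-attention output
m_fused = weighted_fuse(m1, m2, 0.5, 0.4, 0.3, 0.2, cross)
```

In training, the four weights would be learned by gradient descent rather than fixed as here.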
According to an alternative embodiment, the multimodal fusion feature is position coded, and model input for extracting global dependency features is obtained while time sequence information is maintained:
Z = M' + PE;
PE = PWConv(DWConv(P)) = K_p ∗ (K_d ∗ P);
wherein Z represents the model input used for extracting global dependency features; t is a time step; M1 represents the behavior recognition feature and M2 represents the audio feature; PE represents the time-position information obtained by position-encoding the model input used for extracting global dependency features; P is a learnable position base matrix; d represents the feature dimension of the model input; L is the input sequence length of the model input; DWConv is a depthwise separable convolutional layer, in which K_d is a depthwise convolution kernel and ∗ is a convolution operation along the time dimension; the convolution kernel has a size of 1 to 6; PWConv is the pointwise convolution layer, and K_p is a 1×1 convolution kernel.
According to an alternative embodiment, the multi-modal fusion feature is fused with the global dependency feature extracted by the Transformer model, so as to obtain a final fusion feature:
z = σ(W_z·[M'; MultiHead(Q, K, V)] + b_z);
wherein z represents the final fusion feature obtained by secondarily fusing the multi-modal feature and the extracted global dependency feature; W_z and b_z are trainable parameters; [M'; MultiHead(Q, K, V)] represents a splicing operation; M' represents the multi-modal fusion feature obtained by weighted fusion of the multiple features; MultiHead(Q, K, V) represents the multi-head attention mechanism in the Transformer model; σ is a nonlinear activation function.
According to an alternative embodiment, the input X is divided into h subspaces by using a multi-head attention mechanism, each head calculates independent attention, and finally the independent attention is spliced and linearly transformed to obtain an output serving as a global dependency characteristic:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W_O, with head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V);
wherein MultiHead(Q, K, V) represents the output of the multi-head attention mechanism; W_O is the output transformation matrix; i denotes the i-th attention head and h the number of attention heads in the multi-head attention mechanism; i and h are positive integers, with i = 1, 2, …, h and h in the range 3 to 10; d_c represents the dimension of the input feature, and d_v represents the dimension of the Value vector of each attention head.
According to an alternative embodiment, when receiving a query statement input by a user, the query statement is converted into a vector representation and retrieved using the following expression:
v* = Fetch(O, Lookup(T, NearestSearch(DB, E(q), 1)));
wherein v* represents the video clip corresponding to the query statement q input by the user; q represents the query statement input by the user; Fetch(O, p*) represents retrieving the video clip from the object store O according to the path p*; Lookup(T, i*) represents obtaining the corresponding storage path from the mapping relation table T according to the index i*; NearestSearch(DB, E(q), 1) represents searching the vector database DB for the feature vector closest to the query vector (quantity 1) and returning the corresponding index i*.
According to an alternative embodiment, a unique index identifier, namely a slice timestamp index, is generated for each video segment in the video segment sequence, and a mapping relation table between the index identifiers and the video segments is established.
The natural language description of the final fusion features of each video segment is converted into a semantic vector index using a vector model.
The object storage address formed when each video segment is stored, the slice timestamp index and the semantic vector index are associated; when retrieving, the corresponding video segment index is found through the timestamp index or the semantic vector index, and the storage path of the video segment is then found through the object storage address index, so as to determine the video segment to be retrieved.
Note that, since the surveillance video review method performed by the surveillance video review apparatus of fig. 3 is substantially the same as the surveillance video review method in the example of fig. 1, the description of the same portions is omitted.
Compared with the prior art, the proposed pipeline-parallel architecture of video slice generation, large-model analysis and vector coding realizes zero-copy transfer of slice data through a GPU video-memory sharing mechanism, ensuring that semantic indexing and vectorized storage are completed as soon as a slice is generated. This effectively replaces the serial mode in which slice storage and semantic analysis are separated, reduces end-to-end processing latency, and avoids the storage-bandwidth pressure caused by reading video slices a second time. Multi-modal features such as visual features and audio features are fused to obtain multi-modal features and generate unified semantic vectors, constructing a cross-modal retrieval vector space; the multi-modal features are further fused with the extracted global dependency features to obtain fusion features, and each video slice is characterized using the fusion features so that retrieval can be performed more accurately, effectively realizing a "search video by text" retrieval mode. A timestamp-based index is built for each slice to support time-range queries. The data is semantically analyzed, semantic feature vectors are extracted, and a semantic vector index is constructed using a vector indexing algorithm (with an open-source vector database such as Milvus). The data objects are stored in an object storage system (OSS), the storage address of each data object is recorded, and an index is built for quick positioning. A three-level association structure of slice timestamp index, semantic vector index and object storage address index is established, so that video segments can be rapidly located from a natural language description, effectively improving video retrieval efficiency.
In addition, the multi-modal model is used to deeply analyze the video content and generate detailed semantic descriptions, supporting natural-language queries and the review requirements of complex scenes. By storing video segment indexes and semantic descriptions in a vector database, a user can quickly locate the relevant review segments from a description, avoiding frame-by-frame viewing and remarkably improving review efficiency. This provides comprehensive understanding and indexing of the video content, supports diversified review requirements, satisfies the user's precise querying and analysis of specific persons, behaviors or scenes, and improves expansibility and applicability.
In addition, the multi-modal model performs semantic understanding on the video clips, effectively avoiding dependence on initial-picture comparison and improving the accuracy and coverage of scene abnormal-event detection.
Fig. 4 is a schematic structural view of an embodiment of an electronic device according to the present invention.
As shown in fig. 4, the electronic device is in the form of a general purpose computing device. The processor may be one or a plurality of processors and work cooperatively. The invention does not exclude that the distributed processing is performed, i.e. the processor may be distributed among different physical devices. The electronic device of the present invention is not limited to a single entity, but may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executable by the processor to enable an electronic device to perform the method, or at least some of the steps of the method, of the present invention.
The memory includes volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may be non-volatile memory, such as Read Only Memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for exchanging data between the electronic device and an external device. The I/O interface may be a bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
It should be understood that the electronic device shown in fig. 4 is only one example of the present invention, and the electronic device of the present invention may further include elements or components not shown in the above examples. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a man-machine interaction element such as a button, a keyboard, and the like. The electronic device may be considered as covered by the invention as long as the electronic device is capable of executing a computer readable program in a memory for carrying out the method or at least part of the steps of the method.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, as shown in fig. 5, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several commands to cause a computing device (may be a personal computer, a server, or a network device, etc.) to perform the above-described method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. The readable storage medium can also be any readable medium that can communicate, propagate, or transport the program for use by or in connection with the command execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The computer-readable medium carries one or more programs, which when executed by one of the devices, cause the computer-readable medium to implement the data interaction methods of the present disclosure.
Those skilled in the art will appreciate that the modules may be distributed throughout several devices as described in the embodiments, and that corresponding variations may be implemented in one or more devices that are unique to the embodiments. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and which includes several commands to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present invention.
It should be noted that the foregoing detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components unless context indicates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for monitoring video review based on a multimodal model, the method comprising:
Slicing the real-time surveillance videos of different service scenes according to a specified duration to generate a video segment sequence;
generating a unique index identifier for each video segment in the video segment sequence, and establishing a mapping relation table of the index identifiers and each video segment;
Extracting multi-modal features from the real-time monitoring videos of different service scenes by adopting CNN and VGGish models, extracting global dependency features by adopting a Transformer model, fusing the extracted multi-modal features and the global dependency features to obtain final fusion features, further generating a natural language description of the final fusion features to construct a vector space for cross-modal retrieval, and storing the vector space into a vector database to form a storage path of each video segment, wherein the multi-modal features comprise audio features and behavior recognition features;
receiving a query sentence input by a user, converting the query sentence into a vector representation, and performing a similarity search in the vector database to obtain the video segment corresponding to the query sentence.
2. The surveillance video review method of claim 1, comprising:
Before the extracted multi-modal features and the global dependency relationship features are fused to obtain fusion features, the following expression is adopted to perform weighted fusion on the extracted multi-modal features to obtain multi-modal fusion features:
M' = α·M1 + β·M2 + γ·(M1 ⊙ M2) + δ·CrossAttn(M1, M2);
wherein M' represents the multi-modal fusion feature obtained after weighted summation; α and β are learnable weight parameters that, during the training of the model, balance the contributions of the behavior recognition features and the audio features; γ is the interaction-term weight, controlling the local feature interaction strength; δ is the attention weight, used to adjust the global dependency; these trainable parameters are all contained in the back-propagation chain rule, lie in the range [0.2, 0.9] to avoid modal suppression or overfitting caused by extreme weights, and are optimized by gradient descent during the training of the model so that the model can adaptively learn the importance among different modal features; ⊙ represents the Hadamard product, i.e., element-wise multiplication; CrossAttn(M1, M2) is a lightweight cross-attention, belonging to a single-head attention mechanism.
3. The surveillance video review method of claim 2, comprising:
performing position coding on the multi-mode fusion features, and obtaining model input for extracting global dependency features while maintaining time sequence information:
Z = M' + PE;
PE = PWConv(DWConv(P)) = K_p ∗ (K_d ∗ P);
wherein Z represents the model input used for extracting global dependency features; t is a time step; M1 represents the behavior recognition feature and M2 represents the audio feature; PE represents the time-position information obtained by position-encoding the model input used for extracting global dependency features; P is a learnable position base matrix; d represents the feature dimension of the model input; L is the input sequence length of the model input; DWConv is a depthwise separable convolutional layer, in which K_d is a depthwise convolution kernel and ∗ is a convolution operation along the time dimension; the convolution kernel has a size of 1 to 6; PWConv is the pointwise convolution layer, and K_p is a 1×1 convolution kernel.
4. A surveillance video review method as claimed in claim 3, comprising:
Fusing the multi-mode fusion feature with the global dependency feature extracted by the transducer model to obtain a final fusion feature:
z = σ(W_z·[M'; MultiHead(Q, K, V)] + b_z);
wherein z represents the final fusion feature obtained by secondarily fusing the multi-modal feature and the extracted global dependency feature; W_z and b_z are trainable parameters; [M'; MultiHead(Q, K, V)] represents a splicing operation; M' represents the multi-modal fusion feature obtained by weighted fusion of the multiple features; MultiHead(Q, K, V) represents the multi-head attention mechanism in the Transformer model; σ is a nonlinear activation function.
5. A surveillance video review method as claimed in claim 3, comprising:
dividing the input X into h subspaces by adopting a multi-head attention mechanism, calculating independent attention by each head, and finally splicing and linearly transforming to obtain an output serving as a global dependency characteristic:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W_O, with head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V);
wherein MultiHead(Q, K, V) represents the output of the multi-head attention mechanism; W_O is the output transformation matrix; i denotes the i-th attention head and h the number of attention heads in the multi-head attention mechanism; i and h are positive integers, with i = 1, 2, …, h and h in the range 3 to 10; d_c represents the dimension of the input feature, and d_v represents the dimension of the Value vector of each attention head.
6. The surveillance video review method of claim 1, comprising:
when receiving a query statement input by a user, converting the query statement into a vector representation and searching using the following expression:

V q = Fetch(Path(Search(v q )));

wherein V q represents the video segment corresponding to the query statement q input by the user; q represents the query statement input by the user; v q is the vector representation of q; Fetch(p) represents fetching the video segment from object storage according to storage path p; Path(i) represents obtaining the corresponding storage path from the mapping relation table according to index i; and Search(v q ) represents searching the vector database for the feature vector closest to v q and returning the corresponding index i.
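The Search → Path → Fetch chain of claim 6 can be sketched as follows. Cosine similarity over an in-memory array stands in for the vector-database search, and the function names mirror the claim text but are otherwise illustrative:

```python
import numpy as np

def search(v_q, vectors):
    """Return the index of the stored feature vector closest to v_q
    (cosine similarity stands in for the vector-database search)."""
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = v_q / np.linalg.norm(v_q)
    return int(np.argmax(V @ q))

def retrieve_clip(v_q, vectors, path_table, fetch):
    """V_q = Fetch(Path(Search(v_q))): nearest-neighbour index, then
    mapping-table lookup, then object-storage fetch."""
    idx = search(v_q, vectors)
    path = path_table[idx]   # Path(i): index -> storage path
    return fetch(path)       # Fetch(p): load clip from object storage
```

In a deployment, `search` would be a vector-database query and `fetch` an object-storage client call; the composition is the same.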
7. The surveillance video review method of claim 1, further comprising:
generating a unique index identifier, namely a slicing timestamp index, for each video fragment in the video fragment sequence, and establishing a mapping relation table of the index identifier and each video fragment;
converting natural language description of final fusion characteristics of each video segment into semantic vector indexes by using a vector model;
and associating object storage addresses, slice time stamp indexes and semantic vector indexes formed when each video segment is stored, finding a corresponding video segment index through the time stamp indexes or the semantic vector indexes when the video segment is searched, and finding a storage path of the video segment through the object storage address indexes so as to determine the video segment to be searched.
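A minimal sketch of the triple association described in claim 7, assuming plain dictionaries stand in for the mapping relation table (class, field and method names are illustrative):

```python
class ClipIndex:
    """Associates each video segment's object-storage address with its
    slice-timestamp index and its semantic-vector index, so that either
    index resolves to the segment's storage path."""

    def __init__(self):
        self.by_timestamp = {}  # slice-timestamp index -> clip id
        self.by_vector = {}     # semantic-vector index -> clip id
        self.paths = {}         # clip id -> object-storage address

    def register(self, clip_id, ts_index, vec_index, storage_path):
        self.by_timestamp[ts_index] = clip_id
        self.by_vector[vec_index] = clip_id
        self.paths[clip_id] = storage_path

    def path_by_timestamp(self, ts_index):
        return self.paths[self.by_timestamp[ts_index]]

    def path_by_vector(self, vec_index):
        return self.paths[self.by_vector[vec_index]]
```

Either retrieval route (timestamp or semantic vector) ends at the same storage path, which is the property the claim relies on.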
8. A monitoring video review device based on a multimodal model, characterized in that it performs the monitoring video review method of any one of claims 1 to 7, the monitoring video review device comprising:
The generation processing module is used for slicing the real-time monitoring video of different service scenes according to the appointed time length to generate a video fragment sequence;
the building module is used for further generating a unique index identifier for each video segment in the video segment sequence and building a mapping relation table of the index identifiers and each video segment;
The extraction processing module is used for extracting multi-modal features from real-time monitoring videos of different service scenes by adopting CNN and VGGish models, extracting global dependency features by adopting a Transformer model, fusing the extracted multi-modal features and the global dependency features to obtain final fusion features, further generating a natural language description of the final fusion features to construct a vector space for cross-modal retrieval, and storing the vector space into a vector database to form a storage path for each video segment, wherein the multi-modal features comprise audio features and behavior recognition features;
And the query retrieval module is used for receiving query sentences input by a user, converting the query sentences into vector representations, and carrying out similarity search in the vector database to obtain video fragments corresponding to the sentences to be queried.
9. The surveillance video review device of claim 8, comprising:
before the extracted multi-modal features and the global dependency relationship features are fused to obtain the fusion features, performing weighted fusion on the extracted multi-modal features using the following expression to obtain the multi-modal fusion feature:

M′ = α·M 1 + β·M 2 + γ·(M 1 ⊙ M 2 ) + δ·CrossAttn(M 1 , M 2 );

wherein M′ represents the multi-modal fusion feature obtained after weighted summation; α and β are learnable weight parameters which, during model training, balance the contributions of the behavior recognition features and the audio features; γ is the interaction-term weight, controlling the strength of local feature interaction; δ is the attention weight, used to adjust the global dependency; the training parameters α, β, γ and δ are all included in the back-propagation chain rule and take values in the range [0.2, 0.9], avoiding modal suppression or overfitting caused by extreme weights, and are optimized by gradient descent during model training so that the model can adaptively learn the importance of the different modal features; ⊙ denotes the Hadamard product, i.e. element-wise multiplication; CrossAttn(·, ·) is a lightweight cross attention belonging to a single-head attention mechanism.
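The weighted fusion of claim 9 can be sketched as below. The four-term form (two modality terms, a Hadamard interaction term, and a single-head cross-attention term) is a reconstruction from the symbol definitions in the claim, and all weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(M1, M2, Wq, Wk, Wv):
    """Lightweight single-head cross attention: M1 queries attend to M2."""
    Q, K, V = M1 @ Wq, M2 @ Wk, M2 @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def weighted_fusion(M1, M2, alpha, beta, gamma, delta, Wq, Wk, Wv):
    """M' = a*M1 + b*M2 + g*(M1 (.) M2) + d*CrossAttn(M1, M2):
    modality terms, Hadamard interaction term, global-dependency term."""
    return (alpha * M1 + beta * M2
            + gamma * (M1 * M2)  # Hadamard product (element-wise)
            + delta * cross_attention(M1, M2, Wq, Wk, Wv))
```

In training, alpha/beta/gamma/delta would be learnable scalars clipped to [0.2, 0.9] and updated by gradient descent, as the claim describes; here they are fixed constants for illustration.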
10. The surveillance video review device of claim 8, comprising:
performing position coding on the multi-modal fusion features to obtain, while retaining timing information, the model input used for extracting global dependency features:

X t = DSConv([M 1,t ; M 2,t ]), where DSConv(·) = PWConv(DWConv(·));

Input = X + PE, with PE taken from the learnable position base matrix P;

wherein X denotes the model input used for extracting global dependency features; t is the time step; M 1 represents the behavior recognition features and M 2 represents the audio features; PE represents the time-position information obtained by position-coding the model input for extracting global dependency features; P is a learnable position base matrix; d represents the feature dimension of the model input; L is the input sequence length of the model input; DSConv(·) is a depthwise separable convolution layer, in which the depthwise convolution kernel, applied as a convolution operation along the time dimension, has a size ranging from 1 to 6, and the pointwise convolution layer uses a 1×1 convolution kernel.
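A sketch of the depthwise separable convolution over the time dimension plus a learnable position matrix, under the reading of claim 10 above; the kernel size and all array dimensions here are illustrative:

```python
import numpy as np

def depthwise_separable_conv(X, depth_kernels, point_weights):
    """Depthwise stage: one 1xk kernel per channel, slid along the time
    axis with zero padding; pointwise stage: a 1x1 convolution that
    mixes channels (a plain matrix multiply per time step)."""
    L, d = X.shape
    k = depth_kernels.shape[1]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    depth_out = np.empty((L, d))
    for c in range(d):                 # per-channel temporal filtering
        for t in range(L):
            depth_out[t, c] = Xp[t:t + k, c] @ depth_kernels[c]
    return depth_out @ point_weights   # 1x1 pointwise projection

def with_positions(X, P):
    """Add the learnable position base matrix P (L x d) so the model
    input keeps its timing information."""
    return X + P[: X.shape[0]]
```

The depthwise stage filters each feature channel independently along time; the pointwise stage then recombines channels, which is what makes the layer "separable" and cheap relative to a full temporal convolution.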
CN202510404509.0A 2025-04-01 2025-04-01 A surveillance video review method and device based on multimodal model Pending CN120201147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510404509.0A CN120201147A (en) 2025-04-01 2025-04-01 A surveillance video review method and device based on multimodal model

Publications (1)

Publication Number Publication Date
CN120201147A true CN120201147A (en) 2025-06-24

Family

ID=96073803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510404509.0A Pending CN120201147A (en) 2025-04-01 2025-04-01 A surveillance video review method and device based on multimodal model

Country Status (1)

Country Link
CN (1) CN120201147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120892603A (en) * 2025-10-09 2025-11-04 南京邮电大学 A video retrieval method based on a retrieval agent
CN121188238A (en) * 2025-11-25 2025-12-23 飞狐信息技术(天津)有限公司 Construction method of video information base, video searching method and related device

Similar Documents

Publication Publication Date Title
US12361971B2 (en) Deep learning-based video editing method, related device, and storage medium
CN111294646B (en) Video processing method, device, equipment and storage medium
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN108986186B (en) Method and system for converting text into video
WO2021238631A1 (en) Article information display method, apparatus and device and readable storage medium
CN111783712B (en) Video processing method, device, equipment and medium
CN113705293B (en) Image scene recognition method, device, equipment and readable storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN118966329A (en) A knowledge base construction method based on video content reading and analysis
CN120201147A (en) A surveillance video review method and device based on multimodal model
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN117521012B (en) False information detection method based on multimodal context hierarchical and step-by-step alignment
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
CN116975340A (en) Information retrieval methods, devices, equipment, program products and storage media
US20250148218A1 (en) Automatically generating descriptions of augmented reality effects
US12288377B2 (en) Computer-based platforms and methods for efficient AI-based digital video shot indexing
CN118551077A (en) Natural language interaction security video retrieval system and device based on large generated model
US20250148816A1 (en) Model fine-tuning for automated augmented reality descriptions
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
Shahabaz et al. Increasing importance of joint analysis of audio and video in computer vision: a survey
CN119938985A (en) A text-video retrieval method inspired by human brain episodic memory pathways
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN116932788B (en) Cover image extraction method, device, equipment and computer storage medium
CN120409657B (en) Method and system for constructing character knowledge graph driven by multimodal large model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination