Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention.
The invention provides a full media fusion method based on complementary fusion. Referring to Fig. 1, which is a flowchart of the full media fusion method based on complementary fusion provided by an embodiment of the disclosure, the method comprises steps S101 to S106, wherein:
S101, collecting media data of different types from a plurality of media sources, and performing preliminary formatting processing on the media data, wherein the media data comprises text, images, audio, video and live stream data;
S102, dividing the formatted media data into a plurality of processable media units according to a preset segmentation strategy, wherein the media units refer to relatively independent media fragments in a time or space range, and each media unit comprises at least one of a text, a frame of a video, a picture or a group of pictures, and a section of audio;
S103, respectively extracting features of different types of media data to generate basic feature vectors corresponding to the media units;
S104, generating a pseudo-query vector associated with each media unit based on the basic feature vector by using a pseudo-query module;
S105, inputting the basic feature vector and the pseudo-query vector into an implicit interaction module and outputting a fused media representation vector; storing the media representation vector and the corresponding media metadata into a vector database, and establishing an index for the media representation vector in the vector database;
And S106, in response to a fusion request for a target media unit, acquiring media resources matched with the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matched media resources according to the timestamp information of the matched media resources, so as to perform cross-media display of the target media unit.
For S101 described above:
In a full media fusion scenario, media data from different sources vary in type and are non-uniform in format, and directly performing subsequent analysis on them is prone to recognition or processing difficulties. Collecting the media data from different sources in a unified manner and performing preliminary formatting lays the foundation for subsequent feature extraction and cross-media matching.
In implementations, different types of media data may be obtained from multiple media sources including text databases, image storage services, audio recording and storage servers, video streaming platforms, live broadcast servers, and the like.
For example, text data can be obtained from an existing document library, a news source or a document uploaded by a user through an API interface, image data can be read from materials uploaded by an image storage or photographing device, audio and video data can be loaded from a streaming platform or a local file, and streaming fragments can be captured through a real-time acquisition port for live streaming data.
In a specific implementation, the preliminary formatting processing of the collected original media data according to the respective media types may include:
for text, removing redundant whitespace, unifying the encoding format (e.g., UTF-8), and performing necessary cleaning;
for images, performing proportional scaling or unifying the width-height resolution, and unifying the image coding format (e.g., JPEG/PNG);
for audio, unifying the sampling rate, bit rate, or audio coding format (e.g., MP3/WAV);
for video, performing preliminary transcoding on frame rate, resolution, or encoding format (e.g., H.264/H.265);
for live streaming, slicing the real-time stream into segments of a predetermined duration (e.g., 5 seconds or 10 seconds) and storing them as temporary buffer files for subsequent processing.
Thus, through the above-described processing, a base media file or data stream is generated that is available for subsequent analysis and feature extraction.
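By way of non-limiting illustration, the following minimal sketch shows one possible form of this preliminary formatting, assuming the ffmpeg command-line tool is available and using illustrative target parameters (UTF-8 text, 16 kHz mono WAV audio, 720p/25 fps H.264 video) that are not mandated by the method:

```python
# Minimal sketch of the preliminary formatting step (S101); file paths and
# target parameters are illustrative only, and ffmpeg is assumed to be installed.
import subprocess
import unicodedata

def format_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode to UTF-8, normalize Unicode, and collapse redundant whitespace per line."""
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

def format_audio(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Unify the sampling rate and re-encode audio to mono WAV via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst],
        check=True,
    )

def format_video(src: str, dst: str, height: int = 720, fps: int = 25) -> None:
    """Scale video to a target resolution/frame rate and transcode to H.264."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"scale=-2:{height}", "-r", str(fps),
         "-c:v", "libx264", "-c:a", "aac", dst],
        check=True,
    )
```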
For S102 described above:
After the unified formatting is completed, the whole pieces or segments of media data still need to be further divided into smaller "media units", so that in subsequent steps the system can perform finer-grained feature extraction and matching on segments of different types and different temporal/spatial ranges.
In a specific implementation, the formatted media data is divided into a plurality of processable media units according to a preset segmentation strategy.
Text data may be divided at paragraph or sentence granularity, with each paragraph corresponding to one media unit. For image data, each image or group of images is regarded as one media unit. For audio or video data, a temporal segmentation strategy may be adopted, for example, every 10 seconds or every key-frame interval forming one media unit. For live stream data, the live stream may likewise be segmented into a plurality of relatively independent media units according to the same temporal segmentation strategy.
It should be noted that, in the present disclosure, a media unit refers to a relatively independent media segment in a temporal or spatial range, and may include one or more of a text, a frame of a video, a picture or a group of pictures, and a piece of audio. Through this division, the media units can serve as the processing objects in subsequent steps, simplifying the management and analysis of multi-modal data.
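For illustration only, a minimal sketch of such a segmentation strategy is given below; the MediaUnit structure, the paragraph delimiter, and the 10-second window are assumptions introduced for the example rather than requirements of the method:

```python
# Minimal sketch of the segmentation step (S102): paragraph-level units for
# text, fixed time windows for audio/video/live streams.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MediaUnit:
    media_type: str                # "text", "image", "audio", "video", "live"
    source_id: str                 # identifier of the original media file or stream
    start: Optional[float] = None  # start time in seconds (temporal media only)
    end: Optional[float] = None    # end time in seconds (temporal media only)
    payload: Optional[str] = None  # text content or a file reference

def split_text(source_id: str, text: str) -> List[MediaUnit]:
    """One media unit per non-empty paragraph (blank-line delimited, an assumption)."""
    return [MediaUnit("text", source_id, payload=p.strip())
            for p in text.split("\n\n") if p.strip()]

def split_temporal(source_id: str, media_type: str,
                   duration: float, window: float = 10.0) -> List[MediaUnit]:
    """Fixed time-window segmentation for audio, video, or live slices."""
    units, t = [], 0.0
    while t < duration:
        units.append(MediaUnit(media_type, source_id, start=t,
                               end=min(t + window, duration)))
        t += window
    return units
```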
For S103 described above:
The key to multi-modal fusion is to convert different types of media into a comparable, retrievable vector representation. Without unified feature vector representation, it is difficult to achieve efficient alignment and complementation across text, image, audio-video, etc. media.
In a specific implementation, for each media unit obtained by division, a feature extraction algorithm adapted to the media type of each media unit needs to be adopted to obtain a basic feature vector. For example:
For text feature extraction, a vectorization operation may be performed on text units using a pre-trained language model (e.g., BERT, RoBERTa) or a Transformer-based semantic encoder to generate semantic expression vectors.
For image feature extraction, convolution/self-attention analysis may be performed on the images using a CNN (e.g., ResNet, VGG) or a Vision Transformer to extract visual feature vectors.
For audio feature extraction, Mel-spectrogram or MFCC feature analysis may be performed on the audio media units (including live stream segments), and the audio feature vectors may be obtained in combination with RNN or Transformer structures.
For video feature extraction, the spatio-temporal feature vectors of the video media units can be extracted through key-frame extraction or models such as a 3D convolutional network or Video Transformer.
It should be emphasized that if a video unit is long, average pooling may be performed over multiple frames to obtain the overall representation.
Thus, through the multi-modal feature extraction described above, the present disclosure ultimately generates a corresponding base feature vector for each media unit and stores it in memory or temporary storage for subsequent processing.
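As a non-limiting example, the following sketch extracts basic feature vectors with off-the-shelf encoders (a BERT checkpoint for text and ResNet-50 for images); the particular checkpoints, libraries, and pooling choices are illustrative assumptions:

```python
# Minimal sketch of S103: basic feature vectors F_base for text and image units.
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

weights = ResNet50_Weights.DEFAULT
image_encoder = resnet50(weights=weights).eval()
image_encoder.fc = torch.nn.Identity()   # keep the 2048-d pooled visual feature
preprocess = weights.transforms()

@torch.no_grad()
def text_feature(text: str) -> torch.Tensor:
    """Basic feature vector for a text media unit ([CLS] embedding)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return text_encoder(**inputs).last_hidden_state[:, 0].squeeze(0)

@torch.no_grad()
def image_feature(path: str) -> torch.Tensor:
    """Basic feature vector for an image media unit."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return image_encoder(img).squeeze(0)
```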
For S104 described above:
A basic feature vector alone can only reflect the content of the media unit itself; it cannot reflect the complementary information that the media unit may need from other media resources. If complex cross comparisons were performed directly between all media units in the online stage, the computational cost would be high and real-time performance insufficient.
In implementations, the present disclosure inputs the basic feature vector of each media unit into a pseudo-query module (e.g., a lightweight Transformer decoder or a self-attention-based sub-network) to generate a pseudo-query vector associated with that media unit. The pseudo-query vector may be understood as a latent expression of "the needs or points of association the media unit may have with other media".
In implementations, contrast learning or reconstruction loss may be employed during the training phase such that the pseudo-query vector captures the core semantics and potentially interaction information of the media unit.
In a specific implementation, the generation manner of the pseudo query vector may be:
inputting the basic feature vector;
in the intermediate process, the pseudo-query module performs multi-head self-attention calculation or a decoding operation on the basic feature vector according to parameters obtained by training;
outputting a pseudo-query vector associated with the media unit.
For example, if the media unit is a piece of text describing a cooking process, the pseudo-query vector may include potential needs for related ingredients, cooking temperatures, and durations; for an image of a particular scene, the pseudo-query vector may express an association to information such as the location, the subject, or a video that can be linked to it.
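The following minimal sketch illustrates one possible realization of such a pseudo-query module as a lightweight attention sub-network built on a learned query token; the dimensions, the learned-query design, and the class name PseudoQueryModule are assumptions of the example:

```python
# Minimal sketch of the pseudo-query module of S104: maps a basic feature
# vector (or frame sequence) F_base to a pseudo-query vector Q_pseudo.
import torch
import torch.nn as nn

class PseudoQueryModule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.learned_query = nn.Parameter(torch.randn(1, 1, dim))  # "demand" seed token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, f_base: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); a single vector can be passed with seq_len = 1
        query = self.learned_query.expand(f_base.size(0), -1, -1)
        out, _ = self.attn(query, f_base, f_base)   # attend over the unit's own features
        return self.proj(out.squeeze(1))            # Q_pseudo: (batch, dim)

# Usage (illustrative): q_pseudo = PseudoQueryModule()(f_base.unsqueeze(0).unsqueeze(0))
```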
As an alternative implementation, the generating the pseudo-query vector associated with each media unit includes:
performing attention mechanism operation on the basic feature vector to generate potential demand characterization;
Generating a pseudo-query vector based on the potential demand characterization, and optimizing the pseudo-query vector with a reconstruction loss or a contrast loss so that the pseudo-query vector can characterize the association features between the media unit and other media resources.
After the multi-modal feature extraction is completed, each media unit has a basic feature vector, denoted as F_base. To further generate pseudo-query vectors that characterize the association features between the media unit and other media resources, the present disclosure introduces two steps: an attention mechanism operation and a reconstruction/contrast loss optimization.
In a specific implementation, F_base is input into a lightweight "attention generation sub-module", which can be implemented using a multi-head self-attention structure and comprises at least the following key components:
1. Linear mapping layer:
F_base is mapped to three vector spaces: query (Q), key (K), and value (V). Specifically, a trainable weight matrix W_Q may be set to generate the query vector Q = F_base·W_Q, and K and V may be generated similarly.
Because the feature vector of a media unit is often high-dimensional, the linear mapping layer can reduce parameters while ensuring that information is not lost; it may be implemented with dimension reduction or with the dimension kept unchanged.
2. Attention calculation layer:
The attention calculation layer performs scaled dot-product attention on Q, K, and V:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
where the attention operation distributes learnable weights among the input features to highlight the information most relevant to the current task; softmax is a normalization function that maps the input vector to the (0, 1) interval with components summing to 1, yielding attention weights in the form of a probability distribution; and d_k is the dimension of the key vector.
It is emphasized that, for the case of only a single vector input (i.e., the F_base of a single media unit), the present embodiment may treat F_base as a batch, introduce a virtual sequence length, or introduce fixed position coding to ensure that the attention computation remains well-defined. For example, if F_base corresponds to a sequence of video key frames, multi-vector attention operations can also be performed at the frame level.
3. Multi-head merging layer:
If multi-head attention is adopted, the outputs generated by the heads are concatenated along the channel dimension and then linearly mapped to obtain an output vector A. The output vector A can be regarded as a "potential demand representation", an abstract representation of the media unit and its external association elements after the attention mechanism.
Through the attention mechanism operation, the obtained potential demand representation A can reflect the potential matching or complementary demand of the media unit to other media resources, and initially characterizes the 'which external information the media unit is possibly associated with'.
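For clarity, the three components above (linear mapping, scaled dot-product attention, multi-head merging) can be sketched from scratch as follows; the head count and dimensions are illustrative assumptions:

```python
# Minimal sketch of the attention generation sub-module producing the
# potential demand representation A from F_base.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGenerationSubmodule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)   # W_Q (linear mapping layer)
        self.w_k = nn.Linear(dim, dim)   # W_K
        self.w_v = nn.Linear(dim, dim)   # W_V
        self.w_o = nn.Linear(dim, dim)   # multi-head merging layer

    def forward(self, f_base: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); a single F_base can be fed with seq_len = 1
        b, n, _ = f_base.shape
        def split(x):  # (b, n, dim) -> (b, heads, n, d_k)
            return x.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(f_base)), split(self.w_k(f_base)), split(self.w_v(f_base))
        # Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = F.softmax(scores, dim=-1) @ v
        merged = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(merged)          # potential demand representation A
```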
Furthermore, studies have found that in practical system implementations, a mere self-attention operation does not guarantee that A reflects cross-media correlation features accurately enough. Accordingly, the present disclosure further optimizes A by either a reconstruction loss (Reconstruction Loss) or a contrast loss (Contrastive Loss) to yield the final pseudo-query vector Q_pseudo.
For the reconstruction loss:
In implementations, if the system has additional labeling or context information about the media units (e.g., key labels, summaries, corresponding subtitles, etc.), A can be input into a "mini-decoder" or classifier, which forms constraints by predicting or reconstructing the additional information.
For example, when the media unit is a text with a manually marked subject tag, the system can require A to predict the subject tag; if the prediction succeeds, A captures the core semantics of the text more accurately.
The loss function can be designed as cross entropy, mean squared error, or the like, and the parameters of the attention generation sub-module and the merging layer are optimized through back propagation, so that A has higher semantic fitness.
For the contrast loss:
In a specific implementation, if the system has positive and negative sample pairs (for example, two video segments of the same scene, or a video segment and the text that describes it at the same timestamp), a contrastive learning approach may be adopted so that the vector distance between A and a matched sample becomes closer and the vector distance between A and an unmatched sample becomes farther.
For example, assuming that media unit A and media unit B are complementary resources, they can be labeled as a positive sample pair so that the Q_pseudo generated for each is drawn closer; a random unrelated media unit C forms a negative sample pair with them, and its Q_pseudo is pushed farther apart. In this way, the representation becomes stronger at distinguishing associated from unassociated media.
Common contrast loss formulations such as InfoNCE or Triplet Loss enable A, after a number of iterations, to learn the discriminative ability required for cross-media retrieval or matching.
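A minimal sketch of an InfoNCE-style contrast loss consistent with the above description is given below; the in-batch-negative scheme and the temperature value are illustrative assumptions:

```python
# Minimal sketch of the contrastive optimization used to turn A into Q_pseudo:
# matched cross-media pairs are pulled together, in-batch negatives pushed apart.
import torch
import torch.nn.functional as F

def info_nce_loss(q_pseudo: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """q_pseudo, positives: (batch, dim); row i of each tensor forms a matched pair."""
    q = F.normalize(q_pseudo, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.t() / temperature                    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```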
Further, in a specific implementation, the pseudo-query vector Q_pseudo can be obtained stably after the A generated by the self-attention mechanism is further iteratively trained through back propagation.
If the reconstruction loss is used, the system can treat the output A as Q_pseudo after training is completed; if the contrast loss is used, a linear transformation layer (MLP) or a normalization operation can be added after A to obtain the final Q_pseudo.
It should be noted that, at the implementation level, the attention module and the decoder/contrast network may be trained end-to-end or in stages in the training phase, so as to ensure that each parameter update reduces the prediction error or increases the discrimination, thereby enabling the pseudo-query vector to retain the original feature information of the media unit while having the ability to "sense demand" for other media.
It is emphasized that, in embodiments of the present invention, the use of the reconstruction loss (Reconstruction Loss) or the contrast loss (Contrastive Loss) is not merely a conventional algorithmic choice for model training; rather, it is deeply coupled with the multimedia fusion application scenario and plays a key technical role in cross-modal data processing and online real-time presentation.
For example, relying solely on "basic feature vectors" to index or match media content makes it difficult to adequately express the complementary needs that one media unit may have for information in other modalities. As a result, large-scale cross-modal interactions or complex alignments are often required in the online phase, leading to a significant increase in delay. In the present invention, the concept of "pseudo-query vectors" is introduced so that each media unit learns its "potential needs for external media content" during an offline or small-batch training phase. In order for the pseudo-query vector to represent cross-media association or complementation requirements more accurately and stably, the invention uses the reconstruction loss or the contrast loss together with the attention mechanism, overcoming the limitation that a plain perceptron or a simple classification loss cannot fully capture the "demand-supply" semantic interaction among multiple modalities.
For example, in an actual multi-modal service, subtitles, keyword tags, meta-event descriptions, etc. are often configured for video segments or audio segments, or there are cross-labels (e.g., text descriptions of a segment of video segment corresponding thereto) on different modalities for the same content.
The present invention exploits the reconstruction loss so that the pseudo-query vector must be able to "reconstruct" or "predict" these multi-modal additional labels. Once the model learns, during the training phase, to reconstruct the auxiliary information from the pseudo-query vector, it indicates that the pseudo-query vector does capture the inherent semantic connections between the media units and the external complementary resources.
In addition, in the application level, the use of reconstruction loss greatly improves the accuracy of injecting complementary demand perception for each media unit in the off-line stage, and reduces redundant inter-mode comparison and expense in on-line inquiry, thereby being beneficial to resource scheduling and fusion presentation in high concurrency scenes such as live broadcast, on-line on-demand and the like in real time.
Further, in order to accelerate online retrieval and ensure fusion accuracy, the invention can distinguish cross-modal positive and negative sample pairs in the training stage (such as an audio segment and the text narrating it, or video frames and matched image illustrations).
As described above, when it is determined that "media units A and B have complementary relationship in a certain application scenario", they are labeled as positive sample pairs, so that the corresponding pseudo-query vector distances are closer, otherwise, if they are not related, they are labeled as negative sample pairs, and the vector distances are pulled apart.
Thus, contrast loss is no longer merely a generic machine learning method, but in combination with the specific requirements of the multimedia fusion application, a pre-learning of the "cross-modality association" is achieved during offline training. When the system is online, the system can quickly locate other media resources with highest cross-media supplement degree only by executing vector retrieval on the pseudo query vector of the target media unit. Compared with the similar scheme without using contrast loss, the method can greatly improve both the cross-modal matching degree and the online query speed.
In the invention, the reconstruction loss or the comparison loss is used for optimizing the 'pseudo query vector', so that the system can enable a media unit to learn 'how to carry out complementary matching with other media resources' in an offline stage or in a small batch training update. At this time, the optimal fusion object and presentation mode can be determined only by simple vector similarity operation in the online stage.
This technical contribution directly serves online multimedia fusion needs. For example, in a live e-commerce scenario, the commodity video clip currently shown by a host can be fused in real time with the most relevant commodity poster or a matched usage instruction in the form of a popup window; in an educational live scenario, courseware, exercises, or supplementary audio for the corresponding chapter can be recommended in time, significantly improving the user's acquisition efficiency and interactive experience.
Unlike simply using reconstruction loss or contrastive learning to optimize a neural network, the present method emphasizes that this training strategy is integrated into the whole flow of multi-modal feature extraction, implicit interaction, and vector database retrieval, so as to reduce the online computation load and improve the complementary efficiency of multi-modal content. In other words, the use of the reconstruction/contrast loss is not confined to the pure deep-learning algorithm level; it is combined with the system architecture requirements of offline-online fusion of multimedia data, producing specific technical effects for cross-media complementation such as reduced latency, bandwidth savings, improved retrieval accuracy, and higher user satisfaction.
Further, the present embodiment may train the pseudo-query generation subsystem in an offline environment (e.g., GPU/TPU cluster), process existing media data on a large scale, and store the generated pseudo-query vectors together in a vector database;
When a new media unit is collected by the system and basic feature extraction is completed, the attention module and the contrast/reconstruction network are immediately called to generate a pseudo-query vector, which is stored into the database for real-time retrieval;
In addition, if the system finds, according to user feedback, that certain pseudo-query vectors do not match actual demands, it can periodically (or in real time) retrain and update the parameters of the attention mechanism in small batches, so that the system adapts to the environment and new media.
In this way, the pseudo-query vector can more accurately represent the association characteristics between each media unit and other media resources and serve as the basis for subsequent cross-media complementary display. Compared with the traditional single feature vector method, the method has the advantages that the attribute of sensing the external media requirement is injected into each media unit in the off-line stage, so that the large-scale calculation amount in the on-line retrieval process is reduced, and the robustness and the response speed of the system in the multi-mode fusion scene are improved.
For S105 described above:
In the offline stage, the "basic feature vector" and the "pseudo-query vector" are further fused, so that the final representation vector of each media unit (i.e., the "fused media representation vector") contains both its own features and a representation of its requirements for, or associations with, other media. Consequently, real-time interaction among all media is not needed during subsequent online retrieval, and the amount of computation is greatly reduced.
In implementations, the present disclosure inputs the "base feature vector" and the "pseudo-query vector" for each media unit together into an implicit interaction module. For example, implicit interactions may be implemented using a multi-layer transducer or multi-headed self-attention network.
Wherein the operation of implicit interaction comprises:
merging, via a self-attention or cross-attention mechanism, the "self-characteristics of the media unit" with the "pseudo-query information the media unit expects or requires", to generate a fused media representation vector with stronger "cross-media complementation awareness";
The purpose of implicit interactions is to inject a perception of other media modalities or content requirements for each media unit during the offline phase, thus eliminating the need for extensive cross-modal complex computations during online retrieval.
Finally, the fused media representation vector output by the implicit interaction module can better represent the semantic features of the media unit and the potential complementary relation between the semantic features and other modalities, and provides a basis for subsequent retrieval and synthesis.
Further, the present disclosure stores the fused media representation vector with media metadata information (e.g., media type, timestamp, frame index, text paragraph ID, live stream slice ID, etc.) for the media unit into a vector database.
Illustratively, the vector database may employ structures such as Faiss, Milvus, or HNSW to support large-scale vector similarity searches.
Further, after storage is complete, an Approximate Nearest Neighbor (ANN) or hash index may be constructed from the fused media representation vector to quickly perform subsequent search matching operations. At this point, each media unit has unique identification and index entries in the vector database for online phase lookup.
As an alternative embodiment, the outputting the fused media representation vector includes:
splicing the basic feature vector and the pseudo-query vector to form an input sequence;
Processing the input sequence through a multi-head self-attention network to generate a fused media representation vector;
Outputting the fused media representation vector for indexing and retrieval.
In order to acquire the fused media representation vector and facilitate subsequent indexing and retrieval, the following operations may be further performed in addition to the foregoing generation and optimization of the pseudo-query vector:
After the generation of the pseudo-query vector Q_pseudo and the basic feature vector F_base is completed, the system concatenates the two to form the input sequence.
By way of example, the following operational steps may be employed:
The first step is serialization: when both F_base and Q_pseudo are single vectors, the concatenation can be performed directly along the vector dimension; if either includes timing information (e.g., a sequence of video key frames or audio frames), the frame vectors can be combined in temporal or spatial order, and the pseudo-query vector is then appended to the head or tail of the sequence to form the input sequence S_in.
By doing so, it can be ensured that the multimodal information and the 'demand/association' feature are presented in the same input tensor, and the subsequent processing layer does not need to execute complex cross-tensor operation.
Next, in the multi-head self-attention network processing, the system inputs the input sequence S_in into a multi-head self-attention network (a Transformer encoder or similar structure). Each attention head separately calculates the attention weights between the vectors in the sequence to capture temporal or semantic dependencies.
If the underlying feature vector itself has a position code, such as a video frame index or text paragraph ID, this information can be retained along with the pseudo-query vector when spliced.
After the multi-head self-attention network is executed, a set of context-enhanced vectors H_ctx is generated. In this embodiment, the vector that best represents the overall sequence information (e.g., the [CLS] position vector or the average-pooling result over all vectors) may be selected as the fused media representation vector F_fused.
Finally, the fused media representation vector F_fused is output to a storage or retrieval module for subsequent index retrieval.
In particular, in order to facilitate subsequent indexing, F_fused may be normalized (e.g., L2-normalized) or dimension-reduced (e.g., via PCA) to generate a representation vector F_final. Compared with the original basic feature vector F_base and the pseudo-query vector Q_pseudo, F_final better represents the fused features of the multi-modal information and the cross-media requirements, and is suitable for the similarity or distance measurement used in subsequent retrieval.
Thus, through the above splicing, multi-head self-attention operation, and vector output processing, the invention can generate the fused media representation vector and provide a feature representation with stronger complementary awareness for the indexing and retrieval flow of the subsequent vector database.
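The following minimal sketch illustrates one possible implementation of the implicit interaction module: Q_pseudo is prepended to the (possibly multi-frame) F_base sequence, processed by a Transformer encoder, and the vector at the pseudo-query position is taken, L2-normalized, as the fused representation; the layer counts and dimensions are assumptions of the example:

```python
# Minimal sketch of the implicit interaction module of S105.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitInteractionModule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f_base: torch.Tensor, q_pseudo: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); q_pseudo: (batch, dim)
        s_in = torch.cat([q_pseudo.unsqueeze(1), f_base], dim=1)  # input sequence S_in
        h_ctx = self.encoder(s_in)                                # context-enhanced vectors
        f_fused = h_ctx[:, 0]                    # vector at the pseudo-query position
        return F.normalize(f_fused, dim=-1)      # L2-normalized representation for retrieval
```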
Referring to Fig. 2, which is a flowchart of a method for creating an index in a vector database according to an embodiment of the present disclosure, as an optional implementation manner, storing the media representation vector and the corresponding media metadata into the vector database and creating an index for the media representation vector in the vector database includes steps S201 to S203, wherein:
S201, establishing a mapping relation between the media representation vector and the media metadata, wherein the media metadata comprises a media type, a time stamp and a frame index;
s202, storing the media representation vector and media metadata thereof to a vector database;
and S203, indexing the media representation vector based on an approximate nearest neighbor search algorithm or a hash index algorithm.
For S201 described above:
In an implementation, when generating the fused media representation vector F_fused or F_final, the corresponding media metadata is recorded synchronously, including:
media types (such as text, video, audio, image, or live stream slices);
Time stamp (in video or audio scene, identify start and stop time of the clip in original file; in live scene, identify actual playing or recording time);
frame index (if video key frames, frame number or continuous frame range may be recorded).
By creating a "mapping structure" (e.g., key-value or JSON format) containing the above media metadata for each fused media representation vector, refined matching can be achieved by vector similarity plus media metadata conditional screening at a later time of retrieval.
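By way of example, such a mapping structure could take the following key-value form; all field names and values shown are illustrative assumptions:

```python
# Minimal sketch of the key-value mapping between a fused media representation
# vector and its media metadata (S201); field names are hypothetical.
import json

entry = {
    "media_id": "video_000123_seg_0007",          # unique identifier in the vector database
    "media_type": "video",                        # text / image / audio / video / live
    "timestamp": {"start": 70.0, "end": 80.0},    # seconds within the source file
    "frame_index": {"first": 1750, "last": 2000}, # key-frame range, if applicable
    "source": "vod://course_42/lecture_3.mp4",
}
print(json.dumps(entry, indent=2))
```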
For S202 described above:
In implementations, a database or engine supporting large-scale vector search may be selected, such as Faiss, Milvus, or HNSW. F_fused is written into the database together with the mapping structure, and each entry is assigned a unique identifier (e.g., media_id).
When a user query or a system fusion request occurs, the stored vector table is searched in a similarity calculation mode, and a plurality of items which are the most similar are output.
The metadata fields (media type, timestamp, frame index) can then be used to further filter or sort the search results.
For S203 described above:
in a specific implementation, to improve the retrieval efficiency, the index construction process may be performed on all media representation vectors written into the database:
ANN (Approximate Nearest Neighbor) indexing: all vectors are partitioned, or a hierarchical graph structure (e.g., HNSW) is built, in the high-dimensional space to achieve O(log N) or sub-linear approximate search at query time.
Hash indexing: algorithms such as LSH (Locality-Sensitive Hashing) are used to bucket the vectors in the high-dimensional space, and exact comparison is performed only within the same or nearby hash buckets at query time, thereby improving query speed.
The indexing process is typically performed off-line in batches and may also be incrementally updated as new media data is added.
For example, if the live data continuously generates new media units, the system may periodically (or in real time) insert the newly generated fused media representation vectors into the database and update the ANN or hash index structure to ensure timeliness of retrieval.
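For illustration, the following sketch builds an HNSW-based ANN index with Faiss and inserts newly generated vectors incrementally; the side list used here in place of a full metadata store, and the specific HNSW parameter, are simplifying assumptions:

```python
# Minimal sketch of S203: Faiss HNSW index with incremental insertion.
import faiss
import numpy as np

dim = 768
index = faiss.IndexHNSWFlat(dim, 32)      # 32 = HNSW graph connectivity parameter
row_to_media_id = []                      # Faiss row position -> media_id (metadata key)

def insert(f_fused: np.ndarray, media_id: str) -> None:
    """Incrementally add one fused media representation vector to the index."""
    index.add(f_fused.reshape(1, -1).astype("float32"))
    row_to_media_id.append(media_id)

def search(query_vec: np.ndarray, top_k: int = 10):
    """Approximate nearest-neighbour search; returns (media_id, squared L2 distance) pairs."""
    dists, rows = index.search(query_vec.reshape(1, -1).astype("float32"), top_k)
    return [(row_to_media_id[r], float(d))
            for r, d in zip(rows[0], dists[0]) if r != -1]
```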
Therefore, through the mapping relation establishment, storage and index construction operation, the method and the device can realize vector retrieval after fusion in the large multi-mode media library, and combine media metadata to perform accurate matching, so that cross-media fusion or alignment is more convenient. Compared with the traditional mode of only storing the original document or file name, the method of the invention remarkably improves the cross-mode retrieval speed and precision, and can be applied to various application scenes such as online education, video retrieval, intelligent monitoring, social media content recommendation and the like.
For S106 described above:
After receiving the fusion request, a recommendation or a synthesis result of cross media needs to be provided for a user or a downstream application in time. By retrieving the similarity of the "fused media representation vectors" in combination with the media metadata (e.g., time stamps, frame indexes), other media that can complement or enhance its content with the target media unit can be quickly located.
In a specific implementation, when the system receives a fusion request for a certain target media unit, other media units with vector similarity of the fused media representation with the target media unit higher than a preset threshold value are first searched in a vector database. And screening out media resources which contain corresponding time stamp information and have complementary relation with the target media unit from the retrieval result. And synchronizing and synthesizing with the target media unit according to the obtained time stamp or frame information of the media resource.
For example, if the target media unit is the nth minute of the video and the matched asset is a piece of audio or subtitle text at the corresponding time, the audio or subtitle text may be embedded into the video play stream;
for another example, if the target media unit is a segment of live stream, image/text information of the same time segment or similar semantic points is found in the matching resource for real-time superposition display.
In a specific implementation, the present disclosure performs the synthesis as a cross-modal combination or simultaneous rendering of the target media unit and the matched media resources.
For example, the video and the text are displayed in a split screen mode in the same playing page, the audio is overlapped into the video stream to form a new multimedia stream, and corresponding images or text descriptions are popped up at key moments in a live scene.
The finally output cross-media presentation results can be played, browsed or interacted in a client or front-end interface.
In this way, the system fuses the potential complementary demands of different media units before storage by the pseudo query vector generation and implicit interaction module in the off-line stage, thereby reducing the large-scale calculation amount in the on-line stage.
As an alternative embodiment, the cross-media presentation comprises:
receiving a fusion request for the target media unit;
searching candidate media units with the media representation vector similarity with the target media unit higher than a preset threshold value in the vector database;
Splicing and fusing the target media unit and the candidate media unit according to a set fusion strategy;
and outputting the fusion result to the front end for display.
In particular implementations, during the online run phase, the system waits or listens for converged requests from clients, upper layer business modules, or third party applications. The request includes the following information:
The target media unit identification, such as video_segment_id, audio_clip_id, text_paragraph_id, or another unique ID;
Fusion preference, e.g., whether the user specifies that subtitles need to be spliced, images inserted, or multi-camera video clips switched;
Terminal/platform information, such as whether the requester is a mobile terminal, a PC, or an AR/VR device; the fusion strategy can be differentiated according to this information.
In this embodiment, a unified API interface, such as POST /media/fusion_request, is set at the server; when an external call is made, the system parses the request parameters and writes them into a queue or an in-memory cache, so as to trigger the subsequent retrieval and fusion process.
In a specific implementation, after parsing out the target media unit identity, the following steps are performed in the vector database:
The corresponding "fused media representation vector" F_fused is found by the target media unit ID. If the system stores the media representation vectors in a partitioned or sharded manner, the node storing the target entry must be located first.
Based on the representation vector of the target media unit, an approximate nearest neighbor (ANN) search or a hash-bucket search is performed to obtain a set of candidate media units {Candidate_1, Candidate_2, ...} whose similarity is above a preset threshold (e.g., 0.8).
In addition, coarse ranking (ANN) followed by fine ranking (exact dot product or cosine similarity) can be performed to ensure both retrieval efficiency and accuracy.
For the retrieved candidate media units, the system may rank the similarity from high to low and perform a second filtering based on metadata information (e.g., time stamp, frame index, media type). For example, in a video subtitle scene, only text or audio units with the same or similar time stamps as the current video period may be retained.
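A minimal sketch of this threshold-plus-metadata screening is shown below; the 0.8 threshold, the metadata layout, and the conversion from squared L2 distance to cosine similarity (valid only for L2-normalized vectors) are illustrative assumptions:

```python
# Minimal sketch of the second-stage filtering: similarity threshold, then
# time-stamp overlap with the target media unit, then ranking.
from typing import Dict, List, Tuple

def filter_candidates(hits: List[Tuple[str, float]],
                      metadata: Dict[str, dict],
                      target_span: Tuple[float, float],
                      threshold: float = 0.8) -> List[dict]:
    """hits: (media_id, squared L2 distance) pairs returned by the ANN search."""
    results = []
    for media_id, dist in hits:
        similarity = 1.0 - dist / 2.0            # cosine similarity for unit-norm vectors
        if similarity < threshold:
            continue
        meta = metadata.get(media_id, {})
        ts = meta.get("timestamp")
        # keep only resources whose time span overlaps the target media unit
        if ts and (ts["end"] < target_span[0] or ts["start"] > target_span[1]):
            continue
        results.append({"media_id": media_id, "similarity": similarity, **meta})
    return sorted(results, key=lambda r: r["similarity"], reverse=True)
```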
In a specific implementation, the present application maintains a configurable list of fusion policies in this embodiment, including:
Splicing mode: temporal splicing (timestamp merging), image/video overlay, split-screen, and the like;
Priority or weight: for example, subtitles are displayed preferentially in video+text scenes, while audio/video length alignment is prioritized in audio+video scenes;
Device characteristics: if the user side is a mobile device, segmented preloading may be selected, whereas a PC or a high-performance AR/VR device can allow more complex three-dimensional superposition or multi-window rendering.
Aiming at splicing treatment:
if the candidate media unit contains timestamp information, the system may splice the target media unit with the candidate media unit at the same or similar time period. For example, in a live review scene, video and subtitles at the same time are combined.
For multiple segments of images or videos retrieved simultaneously, overlay or picture-in-picture processing is performed according to frame indexes to generate a new composite media stream.
If the candidate media unit is text, the text content can be displayed in real time under the video playing picture or in a side rail in a manner of rolling captions, bubble prompts or barrages.
For audio+video fusion, an audio mixing engine (e.g., FFmpeg or a self-developed mixing module) may be used to adjust the volume and sampling rate to be consistent with the target video clip.
If there are externally set special effects (such as AR filters, specific watermarks, etc.), they can be added according to preset rules during the synthesis stage.
The present embodiment may use various media processing tools (e.g., FFmpeg, GStreamer, etc.) or self-developed multimedia mixing engines to perform the splicing operation in a streaming or file manner and generate the final composite output stream or composite file.
Further, the spliced multimedia data is re-encoded or format-encapsulated. For example, the synthesized video stream is encoded with H.264 and packaged into MP4 or MPEG-TS format, and the text of the superimposed subtitles is rendered, generating resource links that are convenient for web pages or APP clients to call.
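By way of illustration, two of the splicing strategies listed above (subtitle burning and image overlay) can be sketched with ffmpeg as follows, assuming an ffmpeg build with the subtitles and overlay filters; the paths and positions are placeholders:

```python
# Minimal sketch of the splicing stage: subtitle burning and image overlay,
# re-encoded to H.264/MP4. Requires ffmpeg with libx264 and subtitle support.
import subprocess

def burn_subtitles(video: str, srt: str, out: str) -> None:
    """Render retrieved subtitle text onto the target video segment."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}",
         "-c:v", "libx264", "-c:a", "copy", out],
        check=True,
    )

def overlay_image(video: str, image: str, out: str, x: int = 20, y: int = 20) -> None:
    """Overlay a matched poster/illustration onto the video (picture-in-picture style)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-i", image,
         "-filter_complex", f"overlay={x}:{y}",
         "-c:v", "libx264", "-c:a", "copy", out],
        check=True,
    )
```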
If the video stream and the text content are rendered by split screens, the video stream and the text content can be used as independent windows, and the front end can be laid out and played through an HTML5/JS or mobile terminal SDK.
In an implementation, the synthesized multimedia file or stream address is returned to the requesting end (e.g., client browser, APP). The client performs self-adaptive playing according to the network conditions (such as bandwidth and delay) and the equipment performance, or performs automatic playing according to the system preset.
In the live broadcast scene, the fusion request can be executed in a real-time processing pipeline, the synthesized live broadcast stream is pushed to a CDN or a media server by protocols such as RTMP/HLS/WebRTC, and the client side realizes synchronous watching with the anchor/audience through a play address.
In the on-demand scene, the spliced file can be stored in a media server or an object storage, and the access URL is returned for the user to click and play in a front-end browser or an APP interface.
In addition, the embodiment can record the click rate, the stay time or the satisfaction evaluation of the user on the fusion content at the front end and transmit the interaction information back to the server so as to dynamically adjust in the follow-up fusion strategy. If the preference of the user to a certain type of fusion mode is detected to be high subsequently, the system can properly improve the priority of the fusion strategy in the next splicing.
Therefore, the invention aims at the cross-media display process of the target media unit, not only can efficiently search the candidate media unit, but also realizes a plurality of splicing modes through the set fusion strategy, and finally presents the processing result on the user terminal in a visual and diversified mode, thereby meeting the cross-media fusion requirements in the scenes of online education, live broadcast E-commerce, video conference, entertainment content aggregation and the like. The whole flow is from request to output, the complementarity and visual effect of the content can be obviously improved on the premise of not losing the quality of the multimedia, and the requirement of users for synchronous browsing of the multi-mode information can be met by quicker response.
As an alternative embodiment, the cross-media presentation further comprises:
slicing live stream data according to time periods, and generating the basic feature vector and the pseudo-query vector for each time period;
And incrementally writing the fused media representation vectors of the live stream data into the vector database, and matching by using the latest media representation vectors during online retrieval.
The application is suitable for large-scale and real-time data stream in live broadcast scene, and further performs time slice, incremental vector writing and online retrieval matching on live broadcast stream data.
In particular implementations, after live stream data is received, live content is continuously captured by a pre-deployed real-time acquisition module (e.g., FFmpeg stream ingestion or WebRTC reception), and the live stream is sliced into a plurality of discrete "live slices" at a set period (e.g., every 5 seconds or every 10 seconds). After slicing is completed, the system immediately extracts a basic feature vector F_base from the live slice on the server side or on the GPU cluster side.
For example, if the live stream is video content, the system performs key-frame extraction or sparse sampling on the 5-second or 10-second video segment and performs convolution/self-attention analysis on the key frames or sampled frames using a video feature extraction model (e.g., a 3D CNN or Video Transformer) to generate a basic feature vector corresponding to the video slice.
If the live stream contains audio, the audio signal in the time period can further undergo sampling-rate unification and Mel-spectrogram conversion, and audio features are extracted using an audio coding network (such as CNN+RNN or an Audio Transformer) and combined into the feature vector representation of the same live slice.
After the basic feature vector of the live slice is obtained, the pseudo-query module is invoked to generate a pseudo-query vector Q_pseudo for the live slice based on a multi-head self-attention or decoding operation on the feature vector. Because live data is highly time-sensitive and has no fixed file division, this embodiment treats each time-sliced video or audio segment as a "media unit", so that the subsequent pseudo-query vector generation and cross-media implicit interaction module can follow the same logic and model structure as used previously for offline media.
The basic feature vector and the pseudo-query vector are then input to the implicit interaction module for fusion, and the output fused media representation vector F_fused represents the position of the live slice in the cross-media semantic space. Because the live stream continuously generates new segments in real time, this embodiment adopts an incremental writing mechanism to continuously insert the newly generated fused media representation vectors into the vector database.
In an implementation, a long connection to the vector database (e.g., Milvus, Faiss, or HNSW) may be maintained in advance, and after each slice is processed, the fused media representation vector F_fused and the corresponding metadata (live-room ID, start and end seconds of the time period, frame information) are written to the database to generate a new data entry.
If the database supports online index updating, the index construction module can be triggered after writing to perform an ANN index update on the incremental vector samples; if the database adopts micro-batch indexing, the index can be updated in batches within a short time interval to balance real-time performance and throughput.
When an external request needs to perform cross-media presentation or association retrieval on a live stream, a newly inserted live slice representation vector can be directly queried in a vector database. For example, in an educational live scene, a fused media representation vector corresponding to a certain period of the teacher's current live content may be used to retrieve a courseware picture or presentation page with the highest similarity.
For example, the most recently inserted live slice vector (falling within the time period [T0 − 5s, T0]) may be looked up based on the current time T0, or conditional filtering may be performed on the timestamp field so that only the most recently generated portion is retrieved, to ensure that the results remain synchronized with the live progress.
If other media resources with similarity higher than the threshold value are retrieved, the live broadcast slice and the matched video, audio or text resources can be spliced or overlapped according to the fusion strategy, and finally the multi-mode interaction experience of live broadcast and supplementary content is realized at the front end.
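For illustration, the live-stream path can be sketched as follows: ffmpeg slices the incoming stream into fixed-length segments, and a watcher loop extracts features, generates the fused representation, and writes it incrementally to the index. The helper callables (video_feature, pseudo_query, implicit_interaction, insert) stand for the components sketched in earlier examples and are assumptions of this sketch:

```python
# Minimal sketch of live-stream slicing and incremental indexing.
import glob
import subprocess
import time

def start_slicer(stream_url: str, out_dir: str, seconds: int = 10) -> subprocess.Popen:
    """Slice the live stream into fixed-length MP4 segments via ffmpeg's segment muxer."""
    return subprocess.Popen(
        ["ffmpeg", "-i", stream_url, "-c", "copy", "-f", "segment",
         "-segment_time", str(seconds), "-reset_timestamps", "1",
         f"{out_dir}/slice_%05d.mp4"]
    )

def watch_and_index(out_dir: str, video_feature, pseudo_query,
                    implicit_interaction, insert, poll: float = 1.0) -> None:
    """Incrementally index each live slice as it appears on disk."""
    seen = set()
    while True:
        for path in sorted(glob.glob(f"{out_dir}/slice_*.mp4")):
            if path in seen:
                continue
            f_base = video_feature(path)                       # basic feature vector
            q_pseudo = pseudo_query(f_base)                    # pseudo-query vector
            f_fused = implicit_interaction(f_base, q_pseudo)   # fused representation
            insert(f_fused, path)                              # incremental write + index update
            seen.add(path)
        time.sleep(poll)
```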
Thus, through the above process of slicing, incremental writing, and online retrieval of live stream data, the present embodiment can achieve real-time synchronization of cross-media fusion in live scenes. After each new slice is collected, segmented, its features extracted, and its fused media representation vector generated, the vector is immediately written into the vector database and the index is updated, so that subsequent retrieval can obtain the latest live state or semantic information. This greatly improves user experience in terms of latency and ensures that the full media fusion scheme adapts to the continuously changing content demands of live scenes.
Based on the same inventive concept, the embodiment of the disclosure further provides a full media fusion system based on complementary fusion corresponding to the full media fusion method based on complementary fusion, and since the principle of solving the problem by the system in the embodiment of the disclosure is similar to that of the full media fusion method based on complementary fusion described in the embodiment of the disclosure, the implementation of the system can refer to the implementation of the method, and the repetition is omitted.
Referring to Fig. 3, which is a schematic diagram of a full media fusion system based on complementary fusion according to an embodiment of the disclosure, the system includes an acquisition module 10, a first processing module 20, a feature extraction module 30, a second processing module 40, a third processing module 50, and a display module 60;
The acquisition module 10 is configured to acquire different types of media data from a plurality of media sources, and perform preliminary formatting processing on the media data, where the media data includes text, image, audio, video and live stream data;
The first processing module 20 is configured to divide the formatted media data into a plurality of processable media units according to a preset segmentation strategy, where the media units refer to relatively independent media segments in a time or space range, and each media unit includes at least one of a text, a frame of a video, a picture or a group of pictures, and a piece of audio;
The feature extraction module 30 is configured to perform feature extraction on different types of media data, and generate a basic feature vector corresponding to each media unit;
The second processing module 40 is configured to generate, using a pseudo-query module, a pseudo-query vector associated with each media unit based on the basic feature vector;
The third processing module 50 is configured to input the basic feature vector and the pseudo-query vector to an implicit interaction module, and output a fused media representation vector; storing the media representation vector and the corresponding media metadata thereof into a vector database, and establishing an index for the media representation vector in the vector database;
The display module 60 is configured to, in response to receiving a fusion request for a target media unit, obtain media resources matched with the target media unit by retrieving the vector database, and synchronously synthesize the target media unit with the matched media resources according to timestamp information of the matched media resources, so as to perform cross-media display of the target media unit.
It will be appreciated by those skilled in the art that, in the above-described method of the specific embodiments, the written order of the steps does not imply a strict order of execution; the actual order should be determined by the functions and possible internal logic of the steps. It should also be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.