Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention.
The invention provides a full media fusion method based on complementary fusion. Referring to Fig. 1, which is a flowchart of the full media fusion method based on complementary fusion provided by an embodiment of the disclosure, the method comprises steps S101 to S106, wherein:
S101, collecting media data of different types from a plurality of media sources, and performing preliminary formatting processing on the media data, wherein the media data comprises text, images, audio, video and live stream data;
S102, dividing the formatted media data into a plurality of processable media units according to a preset segmentation strategy, wherein the media units refer to relatively independent media fragments in a time or space range, and each media unit comprises at least one of a text, a frame of a video, a picture or a group of pictures, and a section of audio;
S103, respectively extracting features of different types of media data to generate basic feature vectors corresponding to the media units;
S104, generating a pseudo-query vector associated with each media unit based on the basic feature vector by using a pseudo-query module;
S105, inputting the basic feature vector and the pseudo-query vector into an implicit interaction module and outputting a fused media representation vector; storing the media representation vector and the corresponding media metadata into a vector database, and establishing an index for the media representation vector in the vector database;
And S106, in response to a fusion request for a target media unit, acquiring media resources matched with the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matched media resources according to the timestamp information of the matched media resources, so as to perform cross-media display of the target media unit.
For S101 described above:
In a full media fusion scenario, media data from different sources vary in type and are non-uniform in format, and directly performing subsequent analysis on them is prone to recognition or processing difficulties. Collecting the media data from different sources in a unified manner and performing preliminary formatting lays the foundation for subsequent feature extraction and cross-media matching.
In implementations, different types of media data may be obtained from multiple media sources including text databases, image storage services, audio recording and storage servers, video streaming platforms, live broadcast servers, and the like.
For example, text data can be obtained from an existing document library, a news source or a document uploaded by a user through an API interface, image data can be read from materials uploaded by an image storage or photographing device, audio and video data can be loaded from a streaming platform or a local file, and streaming fragments can be captured through a real-time acquisition port for live streaming data.
In a specific implementation, the preliminary formatting processing of the collected original media data according to the respective media types may include:
for text, removing redundant whitespace, unifying the encoding format (e.g., UTF-8), and performing necessary cleaning;
for images, performing proportional scaling or unifying the width-height resolution, and unifying the image coding format (e.g., JPEG/PNG);
for audio, unifying the sampling rate, bit rate, or audio coding format (e.g., MP3/WAV);
for video, performing preliminary transcoding on frame rate, resolution, or encoding format (e.g., H.264/H.265);
for live streaming, slicing the real-time stream into segments of a predetermined duration (e.g., 5 seconds or 10 seconds) and storing them as temporary buffer files for subsequent processing.
Thus, through the above-described processing, a base media file or data stream is generated that is available for subsequent analysis and feature extraction.
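By way of non-limiting illustration, the following minimal sketch shows one possible form of this preliminary formatting, assuming the ffmpeg command-line tool is available and using illustrative target parameters (UTF-8 text, 16 kHz mono WAV audio, 720p/25 fps H.264 video) that are not mandated by the method:

```python
# Minimal sketch of the preliminary formatting step (S101); file paths and
# target parameters are illustrative only, and ffmpeg is assumed to be installed.
import subprocess
import unicodedata

def format_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode to UTF-8, normalize Unicode, and collapse redundant whitespace per line."""
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

def format_audio(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Unify the sampling rate and re-encode audio to mono WAV via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst],
        check=True,
    )

def format_video(src: str, dst: str, height: int = 720, fps: int = 25) -> None:
    """Scale video to a target resolution/frame rate and transcode to H.264."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"scale=-2:{height}", "-r", str(fps),
         "-c:v", "libx264", "-c:a", "aac", dst],
        check=True,
    )
```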
For S102 described above:
After the unified formatting is completed, the whole pieces or segments of media data still need to be further divided into smaller "media units", so that in subsequent steps the system can perform finer-grained feature extraction and matching on segments of different types and different temporal/spatial ranges.
In a specific implementation, the formatted media data is divided into a plurality of processable media units according to a preset segmentation strategy.
Text data may be divided at paragraph or sentence granularity, with each paragraph corresponding to one media unit. For image data, each image or group of images is regarded as one media unit. For audio or video data, a temporal segmentation strategy may be adopted, for example, every 10 seconds or every key-frame interval forming one media unit. For live stream data, the live stream may likewise be segmented into a plurality of relatively independent media units according to the same temporal segmentation strategy.
It should be noted that, in the present disclosure, a media unit refers to a relatively independent media segment in a temporal or spatial range, and may include one or more of a text, a frame of a video, a picture or a group of pictures, and a piece of audio. Through this division, the media units can serve as the processing objects in subsequent steps, simplifying the management and analysis of multi-modal data.
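For illustration only, a minimal sketch of such a segmentation strategy is given below; the MediaUnit structure, the paragraph delimiter, and the 10-second window are assumptions introduced for the example rather than requirements of the method:

```python
# Minimal sketch of the segmentation step (S102): paragraph-level units for
# text, fixed time windows for audio/video/live streams.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MediaUnit:
    media_type: str                # "text", "image", "audio", "video", "live"
    source_id: str                 # identifier of the original media file or stream
    start: Optional[float] = None  # start time in seconds (temporal media only)
    end: Optional[float] = None    # end time in seconds (temporal media only)
    payload: Optional[str] = None  # text content or a file reference

def split_text(source_id: str, text: str) -> List[MediaUnit]:
    """One media unit per non-empty paragraph (blank-line delimited, an assumption)."""
    return [MediaUnit("text", source_id, payload=p.strip())
            for p in text.split("\n\n") if p.strip()]

def split_temporal(source_id: str, media_type: str,
                   duration: float, window: float = 10.0) -> List[MediaUnit]:
    """Fixed time-window segmentation for audio, video, or live slices."""
    units, t = [], 0.0
    while t < duration:
        units.append(MediaUnit(media_type, source_id, start=t,
                               end=min(t + window, duration)))
        t += window
    return units
```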
For S103 described above:
The key to multi-modal fusion is to convert different types of media into a comparable, retrievable vector representation. Without unified feature vector representation, it is difficult to achieve efficient alignment and complementation across text, image, audio-video, etc. media.
In a specific implementation, for each media unit obtained by division, a feature extraction algorithm adapted to the media type of each media unit needs to be adopted to obtain a basic feature vector. For example:
For text feature extraction, a vectorization operation may be performed on text units using a pre-trained language model (e.g., BERT, RoBERTa) or a Transformer-based semantic encoder to generate semantic expression vectors.
For image feature extraction, convolution/self-attention analysis may be performed on the images using a CNN (e.g., ResNet, VGG) or a Vision Transformer to extract visual feature vectors.
For audio feature extraction, Mel-spectrogram or MFCC feature analysis may be performed on the audio media units (including live stream segments), and the audio feature vectors may be obtained in combination with RNN or Transformer structures.
For video feature extraction, the spatio-temporal feature vectors of the video media units can be extracted through key-frame extraction or models such as a 3D convolutional network or Video Transformer.
It should be emphasized that if a video unit is long, average pooling may be performed over multiple frames to obtain the overall representation.
Thus, through the multi-modal feature extraction described above, the present disclosure ultimately generates a corresponding base feature vector for each media unit and stores it in memory or temporary storage for subsequent processing.
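As a non-limiting example, the following sketch extracts basic feature vectors with off-the-shelf encoders (a BERT checkpoint for text and ResNet-50 for images); the particular checkpoints, libraries, and pooling choices are illustrative assumptions:

```python
# Minimal sketch of S103: basic feature vectors F_base for text and image units.
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

weights = ResNet50_Weights.DEFAULT
image_encoder = resnet50(weights=weights).eval()
image_encoder.fc = torch.nn.Identity()   # keep the 2048-d pooled visual feature
preprocess = weights.transforms()

@torch.no_grad()
def text_feature(text: str) -> torch.Tensor:
    """Basic feature vector for a text media unit ([CLS] embedding)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return text_encoder(**inputs).last_hidden_state[:, 0].squeeze(0)

@torch.no_grad()
def image_feature(path: str) -> torch.Tensor:
    """Basic feature vector for an image media unit."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return image_encoder(img).squeeze(0)
```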
For S104 described above:
A basic feature vector alone can only reflect the content of the media unit itself; it cannot reflect the complementary information that the media unit may need from other media resources. If complex cross comparisons were performed directly between all media units in the online stage, the computational cost would be high and real-time performance insufficient.
In implementations, the present disclosure inputs the basic feature vector of each media unit into a pseudo-query module (e.g., a lightweight Transformer decoder or a self-attention-based sub-network) to generate a pseudo-query vector associated with that media unit. The pseudo-query vector may be understood as a latent expression of "the needs or points of association the media unit may have with other media".
In implementations, contrast learning or reconstruction loss may be employed during the training phase such that the pseudo-query vector captures the core semantics and potentially interaction information of the media unit.
In a specific implementation, the generation manner of the pseudo query vector may be:
inputting the basic feature vector;
in the intermediate process, the pseudo-query module performs multi-head self-attention calculation or a decoding operation on the basic feature vector according to parameters obtained by training;
outputting a pseudo-query vector associated with the media unit.
For example, if the media unit is a piece of text describing a cooking process, the pseudo-query vector may include potential needs for related ingredients, cooking temperatures, and durations; for an image of a particular scene, the pseudo-query vector may express an association to information such as the location, the subject, or a video that can be linked to it.
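The following minimal sketch illustrates one possible realization of such a pseudo-query module as a lightweight attention sub-network built on a learned query token; the dimensions, the learned-query design, and the class name PseudoQueryModule are assumptions of the example:

```python
# Minimal sketch of the pseudo-query module of S104: maps a basic feature
# vector (or frame sequence) F_base to a pseudo-query vector Q_pseudo.
import torch
import torch.nn as nn

class PseudoQueryModule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.learned_query = nn.Parameter(torch.randn(1, 1, dim))  # "demand" seed token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, f_base: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); a single vector can be passed with seq_len = 1
        query = self.learned_query.expand(f_base.size(0), -1, -1)
        out, _ = self.attn(query, f_base, f_base)   # attend over the unit's own features
        return self.proj(out.squeeze(1))            # Q_pseudo: (batch, dim)

# Usage (illustrative): q_pseudo = PseudoQueryModule()(f_base.unsqueeze(0).unsqueeze(0))
```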
As an alternative implementation, the generating the pseudo-query vector associated with each media unit includes:
performing attention mechanism operation on the basic feature vector to generate potential demand characterization;
Generating a pseudo-query vector based on the potential demand characterization, and optimizing the pseudo-query vector with a reconstruction loss or a contrast loss so that the pseudo-query vector can characterize the association features between the media unit and other media resources.
After the multi-modal feature extraction is completed, each media unit has a basic feature vector, denoted as F_base. To further generate pseudo-query vectors that characterize the association features between the media unit and other media resources, the present disclosure introduces two steps: an attention mechanism operation and a reconstruction/contrast loss optimization.
In a specific implementation, F_base is input into a lightweight "attention generation sub-module", which can be implemented using a multi-head self-attention structure and comprises at least the following key components:
1. Linear mapping layer:
F_base is mapped to three vector spaces: query (Q), key (K), and value (V). Specifically, a trainable weight matrix W_Q may be set to generate the query vector Q = F_base·W_Q, and K and V may be generated similarly.
Because the feature vector of a media unit is often high-dimensional, the linear mapping layer can reduce parameters while ensuring that information is not lost; it may be implemented with dimension reduction or with the dimension kept unchanged.
2. Attention calculation layer:
The attention calculation layer performs scaled dot-product attention on Q, K, and V:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
where the attention operation distributes learnable weights among the input features to highlight the information most relevant to the current task; softmax is a normalization function that maps the input vector to the (0, 1) interval with components summing to 1, yielding attention weights in the form of a probability distribution; and d_k is the dimension of the key vector.
It is emphasized that, for the case of only a single vector input (i.e., the F_base of a single media unit), the present embodiment may treat F_base as a batch, introduce a virtual sequence length, or introduce fixed position coding to ensure that the attention computation remains well-defined. For example, if F_base corresponds to a sequence of video key frames, multi-vector attention operations can also be performed at the frame level.
3. Multi-head merging layer:
If multi-head attention is adopted, the outputs generated by the heads are concatenated along the channel dimension and then linearly mapped to obtain an output vector A. The output vector A can be regarded as a "potential demand representation", an abstract representation of the media unit and its external association elements after the attention mechanism.
Through the attention mechanism operation, the obtained potential demand representation A can reflect the potential matching or complementary demand of the media unit to other media resources, and initially characterizes the 'which external information the media unit is possibly associated with'.
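For clarity, the three components above (linear mapping, scaled dot-product attention, multi-head merging) can be sketched from scratch as follows; the head count and dimensions are illustrative assumptions:

```python
# Minimal sketch of the attention generation sub-module producing the
# potential demand representation A from F_base.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGenerationSubmodule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)   # W_Q (linear mapping layer)
        self.w_k = nn.Linear(dim, dim)   # W_K
        self.w_v = nn.Linear(dim, dim)   # W_V
        self.w_o = nn.Linear(dim, dim)   # multi-head merging layer

    def forward(self, f_base: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); a single F_base can be fed with seq_len = 1
        b, n, _ = f_base.shape
        def split(x):  # (b, n, dim) -> (b, heads, n, d_k)
            return x.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(f_base)), split(self.w_k(f_base)), split(self.w_v(f_base))
        # Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = F.softmax(scores, dim=-1) @ v
        merged = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(merged)          # potential demand representation A
```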
Furthermore, studies have found that in practical system implementations, a mere self-attention operation does not guarantee that A reflects cross-media correlation features accurately enough. Accordingly, the present disclosure further optimizes A by either a reconstruction loss (Reconstruction Loss) or a contrast loss (Contrastive Loss) to yield the final pseudo-query vector Q_pseudo.
For the reconstruction loss:
In implementations, if the system has additional labeling or context information about the media units (e.g., key labels, summaries, corresponding subtitles, etc.), A can be input into a "mini-decoder" or classifier, which forms constraints by predicting or reconstructing the additional information.
For example, when the media unit is a text with a manually marked subject tag, the system can require A to predict the subject tag; if the prediction succeeds, A captures the core semantics of the text more accurately.
The loss function can be designed as cross entropy, mean squared error, or the like, and the parameters of the attention generation sub-module and the merging layer are optimized through back propagation, so that A has higher semantic fitness.
For the contrast loss:
In a specific implementation, if the system has positive and negative sample pairs (for example, two video segments of the same scene, or a video segment and the text that describes it at the same timestamp), a contrastive learning approach may be adopted so that the vector distance between A and a matched sample becomes closer and the vector distance between A and an unmatched sample becomes farther.
For example, assuming that media unit A and media unit B are complementary resources, they can be labeled as a positive sample pair so that the Q_pseudo generated for each is drawn closer; a random unrelated media unit C forms a negative sample pair with them, and its Q_pseudo is pushed farther apart. In this way, the representation becomes stronger at distinguishing associated from unassociated media.
Common contrast loss formulations such as InfoNCE or Triplet Loss enable A, after a number of iterations, to learn the discriminative ability required for cross-media retrieval or matching.
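A minimal sketch of an InfoNCE-style contrast loss consistent with the above description is given below; the in-batch-negative scheme and the temperature value are illustrative assumptions:

```python
# Minimal sketch of the contrastive optimization used to turn A into Q_pseudo:
# matched cross-media pairs are pulled together, in-batch negatives pushed apart.
import torch
import torch.nn.functional as F

def info_nce_loss(q_pseudo: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """q_pseudo, positives: (batch, dim); row i of each tensor forms a matched pair."""
    q = F.normalize(q_pseudo, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.t() / temperature                    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```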
Further, in a specific implementation, the pseudo-query vector Q_pseudo can be obtained stably after the A generated by the self-attention mechanism is further iteratively trained through back propagation.
If the reconstruction loss is used, the system can treat the output A as Q_pseudo after training is completed; if the contrast loss is used, a linear transformation layer (MLP) or a normalization operation can be added after A to obtain the final Q_pseudo.
It should be noted that, at the implementation level, the attention module and the decoder/contrast network may be trained end-to-end or in stages in the training phase, so as to ensure that each parameter update reduces the prediction error or increases the discrimination, thereby enabling the pseudo-query vector to retain the original feature information of the media unit while having the ability to "sense demand" for other media.
It is emphasized that, in embodiments of the present invention, the use of the reconstruction loss (Reconstruction Loss) or the contrast loss (Contrastive Loss) is not merely a conventional algorithmic choice for model training; rather, it is deeply coupled with the multimedia fusion application scenario and plays a key technical role in cross-modal data processing and online real-time presentation.
For example, relying solely on "basic feature vectors" to index or match media content makes it difficult to adequately express the complementary needs that one media unit may have for information in other modalities. As a result, large-scale cross-modal interactions or complex alignments are often required in the online phase, leading to a significant increase in delay. In the present invention, the concept of "pseudo-query vectors" is introduced so that each media unit learns its "potential needs for external media content" during an offline or small-batch training phase. In order for the pseudo-query vector to represent cross-media association or complementation requirements more accurately and stably, the invention uses the reconstruction loss or the contrast loss together with the attention mechanism, overcoming the limitation that a plain perceptron or a simple classification loss cannot fully capture the "demand-supply" semantic interaction among multiple modalities.
For example, in an actual multi-modal service, subtitles, keyword tags, meta-event descriptions, etc. are often configured for video segments or audio segments, or there are cross-labels (e.g., text descriptions of a segment of video segment corresponding thereto) on different modalities for the same content.
The present invention exploits the reconstruction loss so that the pseudo-query vector must be able to "reconstruct" or "predict" these multi-modal additional labels. Once the model learns, during the training phase, to reconstruct the auxiliary information from the pseudo-query vector, it indicates that the pseudo-query vector does capture the inherent semantic connections between the media units and the external complementary resources.
In addition, in the application level, the use of reconstruction loss greatly improves the accuracy of injecting complementary demand perception for each media unit in the off-line stage, and reduces redundant inter-mode comparison and expense in on-line inquiry, thereby being beneficial to resource scheduling and fusion presentation in high concurrency scenes such as live broadcast, on-line on-demand and the like in real time.
Further, in order to accelerate online retrieval and ensure fusion accuracy, the invention can distinguish cross-modal positive and negative sample pairs in the training stage (such as an audio segment and the text narrating it, or video frames and matched image illustrations).
As described above, when it is determined that "media units A and B have complementary relationship in a certain application scenario", they are labeled as positive sample pairs, so that the corresponding pseudo-query vector distances are closer, otherwise, if they are not related, they are labeled as negative sample pairs, and the vector distances are pulled apart.
Thus, contrast loss is no longer merely a generic machine learning method, but in combination with the specific requirements of the multimedia fusion application, a pre-learning of the "cross-modality association" is achieved during offline training. When the system is online, the system can quickly locate other media resources with highest cross-media supplement degree only by executing vector retrieval on the pseudo query vector of the target media unit. Compared with the similar scheme without using contrast loss, the method can greatly improve both the cross-modal matching degree and the online query speed.
In the invention, the reconstruction loss or the comparison loss is used for optimizing the 'pseudo query vector', so that the system can enable a media unit to learn 'how to carry out complementary matching with other media resources' in an offline stage or in a small batch training update. At this time, the optimal fusion object and presentation mode can be determined only by simple vector similarity operation in the online stage.
This technical contribution directly serves online multimedia fusion needs. For example, in a live e-commerce scenario, the commodity video clip currently shown by a host can be fused in real time with the most relevant commodity poster or a matched usage instruction in the form of a popup window; in an educational live scenario, courseware, exercises, or supplementary audio for the corresponding chapter can be recommended in time, significantly improving the user's acquisition efficiency and interactive experience.
Unlike simply using reconstruction loss or contrastive learning to optimize a neural network, the present method emphasizes that this training strategy is integrated into the whole flow of multi-modal feature extraction, implicit interaction, and vector database retrieval, so as to reduce the online computation load and improve the complementary efficiency of multi-modal content. In other words, the use of the reconstruction/contrast loss is not confined to the pure deep-learning algorithm level; it is combined with the system architecture requirements of offline-online fusion of multimedia data, producing specific technical effects for cross-media complementation such as reduced latency, bandwidth savings, improved retrieval accuracy, and higher user satisfaction.
Further, the present embodiment may train the pseudo-query generation subsystem in an offline environment (e.g., GPU/TPU cluster), process existing media data on a large scale, and store the generated pseudo-query vectors together in a vector database;
When a new media unit is collected by the system and basic feature extraction is completed, the attention module and the contrast/reconstruction network are immediately called to generate a pseudo-query vector, which is stored into the database for real-time retrieval;
In addition, if the system finds, according to user feedback, that certain pseudo-query vectors do not match actual demands, it can periodically (or in real time) retrain and update the parameters of the attention mechanism in small batches, so that the system adapts to the environment and new media.
In this way, the pseudo-query vector can more accurately represent the association characteristics between each media unit and other media resources and serve as the basis for subsequent cross-media complementary display. Compared with the traditional single feature vector method, the method has the advantages that the attribute of sensing the external media requirement is injected into each media unit in the off-line stage, so that the large-scale calculation amount in the on-line retrieval process is reduced, and the robustness and the response speed of the system in the multi-mode fusion scene are improved.
For S105 described above:
In the offline stage, the "basic feature vector" and the "pseudo-query vector" are further fused, so that the final representation vector of each media unit (i.e., the "fused media representation vector") contains both its own features and a representation of its requirements for, or associations with, other media. Consequently, real-time interaction among all media is not needed during subsequent online retrieval, and the amount of computation is greatly reduced.
In implementations, the present disclosure inputs the "base feature vector" and the "pseudo-query vector" for each media unit together into an implicit interaction module. For example, implicit interactions may be implemented using a multi-layer transducer or multi-headed self-attention network.
Wherein the operation of implicit interaction comprises:
merging, via a self-attention or cross-attention mechanism, the "self-characteristics of the media unit" with the "pseudo-query information the media unit expects or requires", to generate a fused media representation vector with stronger "cross-media complementation awareness";
The purpose of implicit interactions is to inject a perception of other media modalities or content requirements for each media unit during the offline phase, thus eliminating the need for extensive cross-modal complex computations during online retrieval.
Finally, the fused media representation vector output by the implicit interaction module can better represent the semantic features of the media unit and the potential complementary relation between the semantic features and other modalities, and provides a basis for subsequent retrieval and synthesis.
Further, the present disclosure stores the fused media representation vector with media metadata information (e.g., media type, timestamp, frame index, text paragraph ID, live stream slice ID, etc.) for the media unit into a vector database.
Illustratively, the vector database may employ structures such as Faiss, Milvus, or HNSW to support large-scale vector similarity searches.
Further, after storage is complete, an Approximate Nearest Neighbor (ANN) or hash index may be constructed from the fused media representation vector to quickly perform subsequent search matching operations. At this point, each media unit has unique identification and index entries in the vector database for online phase lookup.
As an alternative embodiment, the outputting the fused media representation vector includes:
splicing the basic feature vector and the pseudo-query vector to form an input sequence;
Processing the input sequence through a multi-head self-attention network to generate a fused media representation vector;
Outputting the fused media representation vector for indexing and retrieval.
In order to acquire the fused media representation vector and facilitate subsequent indexing and retrieval, the following operations may be further performed in addition to the foregoing generation and optimization of the pseudo-query vector:
After the generation of the pseudo-query vector Q_pseudo and the basic feature vector F_base is completed, the system concatenates the two to form the input sequence.
By way of example, the following operational steps may be employed:
The first step is serialization: when both F_base and Q_pseudo are single vectors, the concatenation can be performed directly along the vector dimension; if either includes timing information (e.g., a sequence of video key frames or audio frames), the frame vectors can be combined in temporal or spatial order, and the pseudo-query vector is then appended to the head or tail of the sequence to form the input sequence S_in.
By doing so, it can be ensured that the multimodal information and the 'demand/association' feature are presented in the same input tensor, and the subsequent processing layer does not need to execute complex cross-tensor operation.
Next, in the multi-head self-attention network processing, the system inputs the input sequence S_in into a multi-head self-attention network (a Transformer encoder or similar structure). Each attention head separately calculates the attention weights between the vectors in the sequence to capture temporal or semantic dependencies.
If the underlying feature vector itself has a position code, such as a video frame index or text paragraph ID, this information can be retained along with the pseudo-query vector when spliced.
After the multi-head self-attention network is executed, a set of context-enhanced vectors H_ctx is generated. In this embodiment, the vector that best represents the overall sequence information (e.g., the [CLS] position vector or the average-pooling result over all vectors) may be selected as the fused media representation vector F_fused.
Finally, the fused media representation vector F_fused is output to a storage or retrieval module for subsequent index retrieval.
In particular, in order to facilitate subsequent indexing, F_fused may be normalized (e.g., L2-normalized) or dimension-reduced (e.g., via PCA) to generate a representation vector F_final. Compared with the original basic feature vector F_base and the pseudo-query vector Q_pseudo, F_final better represents the fused features of the multi-modal information and the cross-media requirements, and is suitable for the similarity or distance measurement used in subsequent retrieval.
Thus, through the above splicing, multi-head self-attention operation, and vector output processing, the invention can generate the fused media representation vector and provide a feature representation with stronger complementary awareness for the indexing and retrieval flow of the subsequent vector database.
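The following minimal sketch illustrates one possible implementation of the implicit interaction module: Q_pseudo is prepended to the (possibly multi-frame) F_base sequence, processed by a Transformer encoder, and the vector at the pseudo-query position is taken, L2-normalized, as the fused representation; the layer counts and dimensions are assumptions of the example:

```python
# Minimal sketch of the implicit interaction module of S105.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitInteractionModule(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f_base: torch.Tensor, q_pseudo: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); q_pseudo: (batch, dim)
        s_in = torch.cat([q_pseudo.unsqueeze(1), f_base], dim=1)  # input sequence S_in
        h_ctx = self.encoder(s_in)                                # context-enhanced vectors
        f_fused = h_ctx[:, 0]                    # vector at the pseudo-query position
        return F.normalize(f_fused, dim=-1)      # L2-normalized representation for retrieval
```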
Referring to Fig. 2, which is a flowchart of a method for creating an index in a vector database according to an embodiment of the present disclosure, as an optional implementation manner, storing the media representation vector and the corresponding media metadata into the vector database and creating an index for the media representation vector in the vector database includes steps S201 to S203, wherein:
S201, establishing a mapping relation between the media representation vector and the media metadata, wherein the media metadata comprises a media type, a time stamp and a frame index;
s202, storing the media representation vector and media metadata thereof to a vector database;
and S203, indexing the media representation vector based on an approximate nearest neighbor search algorithm or a hash index algorithm.
For S201 described above:
In an implementation, when generating the fused media representation vector F_fused or F_final, the corresponding media metadata is recorded synchronously, including:
media types (such as text, video, audio, image, or live stream slices);
Time stamp (in video or audio scene, identify start and stop time of the clip in original file; in live scene, identify actual playing or recording time);
frame index (if video key frames, frame number or continuous frame range may be recorded).
By creating a "mapping structure" (e.g., key-value or JSON format) containing the above media metadata for each fused media representation vector, refined matching can be achieved by vector similarity plus media metadata conditional screening at a later time of retrieval.
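By way of example, such a mapping structure could take the following key-value form; all field names and values shown are illustrative assumptions:

```python
# Minimal sketch of the key-value mapping between a fused media representation
# vector and its media metadata (S201); field names are hypothetical.
import json

entry = {
    "media_id": "video_000123_seg_0007",          # unique identifier in the vector database
    "media_type": "video",                        # text / image / audio / video / live
    "timestamp": {"start": 70.0, "end": 80.0},    # seconds within the source file
    "frame_index": {"first": 1750, "last": 2000}, # key-frame range, if applicable
    "source": "vod://course_42/lecture_3.mp4",
}
print(json.dumps(entry, indent=2))
```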
For S202 described above:
In implementations, a database or engine supporting large-scale vector search may be selected, such as Faiss, Milvus, or HNSW. F_fused is written into the database together with the mapping structure, and each entry is assigned a unique identifier (e.g., media_id).
When a user query or a system fusion request occurs, the stored vector table is searched in a similarity calculation mode, and a plurality of items which are the most similar are output.
The metadata fields (media type, timestamp, frame index) can then be used to further filter or sort the search results.
For S203 described above:
in a specific implementation, to improve the retrieval efficiency, the index construction process may be performed on all media representation vectors written into the database:
ANN (Approximate Nearest Neighbor) indexing: all vectors are partitioned, or a hierarchical graph structure (e.g., HNSW) is built, in the high-dimensional space to achieve O(log N) or sub-linear approximate search at query time.
Hash indexing: algorithms such as LSH (Locality-Sensitive Hashing) are used to bucket the vectors in the high-dimensional space, and exact comparison is performed only within the same or nearby hash buckets at query time, thereby improving query speed.
The indexing process is typically performed off-line in batches and may also be incrementally updated as new media data is added.
For example, if the live data continuously generates new media units, the system may periodically (or in real time) insert the newly generated fused media representation vectors into the database and update the ANN or hash index structure to ensure timeliness of retrieval.
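For illustration, the following sketch builds an HNSW-based ANN index with Faiss and inserts newly generated vectors incrementally; the side list used here in place of a full metadata store, and the specific HNSW parameter, are simplifying assumptions:

```python
# Minimal sketch of S203: Faiss HNSW index with incremental insertion.
import faiss
import numpy as np

dim = 768
index = faiss.IndexHNSWFlat(dim, 32)      # 32 = HNSW graph connectivity parameter
row_to_media_id = []                      # Faiss row position -> media_id (metadata key)

def insert(f_fused: np.ndarray, media_id: str) -> None:
    """Incrementally add one fused media representation vector to the index."""
    index.add(f_fused.reshape(1, -1).astype("float32"))
    row_to_media_id.append(media_id)

def search(query_vec: np.ndarray, top_k: int = 10):
    """Approximate nearest-neighbour search; returns (media_id, squared L2 distance) pairs."""
    dists, rows = index.search(query_vec.reshape(1, -1).astype("float32"), top_k)
    return [(row_to_media_id[r], float(d))
            for r, d in zip(rows[0], dists[0]) if r != -1]
```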
Therefore, through the mapping relation establishment, storage and index construction operation, the method and the device can realize vector retrieval after fusion in the large multi-mode media library, and combine media metadata to perform accurate matching, so that cross-media fusion or alignment is more convenient. Compared with the traditional mode of only storing the original document or file name, the method of the invention remarkably improves the cross-mode retrieval speed and precision, and can be applied to various application scenes such as online education, video retrieval, intelligent monitoring, social media content recommendation and the like.
For S106 described above:
After receiving the fusion request, a recommendation or a synthesis result of cross media needs to be provided for a user or a downstream application in time. By retrieving the similarity of the "fused media representation vectors" in combination with the media metadata (e.g., time stamps, frame indexes), other media that can complement or enhance its content with the target media unit can be quickly located.
In a specific implementation, when the system receives a fusion request for a certain target media unit, other media units with vector similarity of the fused media representation with the target media unit higher than a preset threshold value are first searched in a vector database. And screening out media resources which contain corresponding time stamp information and have complementary relation with the target media unit from the retrieval result. And synchronizing and synthesizing with the target media unit according to the obtained time stamp or frame information of the media resource.
For example, if the target media unit is the nth minute of the video and the matched asset is a piece of audio or subtitle text at the corresponding time, the audio or subtitle text may be embedded into the video play stream;
for another example, if the target media unit is a segment of live stream, image/text information of the same time segment or similar semantic points is found in the matching resource for real-time superposition display.
In a specific implementation, the present disclosure performs the synthesis as a cross-modal combination or simultaneous rendering of the target media unit and the matched media resources.
For example, the video and the text are displayed in a split screen mode in the same playing page, the audio is overlapped into the video stream to form a new multimedia stream, and corresponding images or text descriptions are popped up at key moments in a live scene.
The finally output cross-media presentation results can be played, browsed or interacted in a client or front-end interface.
In this way, the system fuses the potential complementary demands of different media units before storage by the pseudo query vector generation and implicit interaction module in the off-line stage, thereby reducing the large-scale calculation amount in the on-line stage.
As an alternative embodiment, the cross-media presentation comprises:
receiving a fusion request for the target media unit;
searching candidate media units with the media representation vector similarity with the target media unit higher than a preset threshold value in the vector database;
Splicing and fusing the target media unit and the candidate media unit according to a set fusion strategy;
and outputting the fusion result to the front end for display.
In particular implementations, during the online run phase, the system waits or listens for converged requests from clients, upper layer business modules, or third party applications. The request includes the following information:
The target media unit identification, such as video_segment_id, audio_clip_id, text_paragraph_id, or another unique ID;
Fusion preference, e.g., whether the user specifies that subtitles need to be spliced, images inserted, or multi-camera video clips switched;
Terminal/platform information, such as whether the requester is a mobile terminal, a PC, or an AR/VR device; the fusion strategy can be differentiated according to this information.
In this embodiment, a unified API interface, such as POST /media/fusion_request, is set at the server; when an external call is made, the system parses the request parameters and writes them into a queue or an in-memory cache, so as to trigger the subsequent retrieval and fusion process.
In a specific implementation, after parsing out the target media unit identity, the following steps are performed in the vector database:
The corresponding "fused media representation vector" F_fused is found by the target media unit ID. If the system stores the media representation vectors in a partitioned or sharded manner, the node storing the target entry must be located first.
Based on the representation vector of the target media unit, an approximate nearest neighbor (ANN) search or a hash-bucket search is performed to obtain a set of candidate media units {Candidate_1, Candidate_2, ...} whose similarity is above a preset threshold (e.g., 0.8).
In addition, coarse ranking (ANN) followed by fine ranking (exact dot product or cosine similarity) can be performed to ensure both retrieval efficiency and accuracy.
For the retrieved candidate media units, the system may rank the similarity from high to low and perform a second filtering based on metadata information (e.g., time stamp, frame index, media type). For example, in a video subtitle scene, only text or audio units with the same or similar time stamps as the current video period may be retained.
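A minimal sketch of this threshold-plus-metadata screening is shown below; the 0.8 threshold, the metadata layout, and the conversion from squared L2 distance to cosine similarity (valid only for L2-normalized vectors) are illustrative assumptions:

```python
# Minimal sketch of the second-stage filtering: similarity threshold, then
# time-stamp overlap with the target media unit, then ranking.
from typing import Dict, List, Tuple

def filter_candidates(hits: List[Tuple[str, float]],
                      metadata: Dict[str, dict],
                      target_span: Tuple[float, float],
                      threshold: float = 0.8) -> List[dict]:
    """hits: (media_id, squared L2 distance) pairs returned by the ANN search."""
    results = []
    for media_id, dist in hits:
        similarity = 1.0 - dist / 2.0            # cosine similarity for unit-norm vectors
        if similarity < threshold:
            continue
        meta = metadata.get(media_id, {})
        ts = meta.get("timestamp")
        # keep only resources whose time span overlaps the target media unit
        if ts and (ts["end"] < target_span[0] or ts["start"] > target_span[1]):
            continue
        results.append({"media_id": media_id, "similarity": similarity, **meta})
    return sorted(results, key=lambda r: r["similarity"], reverse=True)
```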
In a specific implementation, the present application maintains a configurable list of fusion policies in this embodiment, including:
Splicing mode: temporal splicing (timestamp merging), image/video overlay, split-screen, and the like;
Priority or weight: for example, subtitles are displayed preferentially in video+text scenes, while audio/video length alignment is prioritized in audio+video scenes;
Device characteristics: if the user side is a mobile device, segmented preloading may be selected, whereas a PC or a high-performance AR/VR device can allow more complex three-dimensional superposition or multi-window rendering.
Aiming at splicing treatment:
if the candidate media unit contains timestamp information, the system may splice the target media unit with the candidate media unit at the same or similar time period. For example, in a live review scene, video and subtitles at the same time are combined.
For multiple segments of images or videos retrieved simultaneously, overlay or picture-in-picture processing is performed according to frame indexes to generate a new composite media stream.
If the candidate media unit is text, the text content can be displayed in real time under the video playing picture or in a side rail in a manner of rolling captions, bubble prompts or barrages.
For audio+video fusion, an audio mixing engine (e.g., FFmpeg or a self-developed mixing module) may be used to adjust the volume and sampling rate to be consistent with the target video clip.
If there are externally set special effects (such as AR filters, specific watermarks, etc.), they can be added according to preset rules during the synthesis stage.
The present embodiment may use various media processing tools (e.g., FFmpeg, GStreamer, etc.) or self-developed multimedia mixing engines to perform the splicing operation in a streaming or file manner and generate the final composite output stream or composite file.
Further, the spliced multimedia data is re-encoded or format-encapsulated. For example, the synthesized video stream is encoded with H.264 and packaged into MP4 or MPEG-TS format, and the text of the superimposed subtitles is rendered, generating resource links that are convenient for web pages or APP clients to call.
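By way of illustration, two of the splicing strategies listed above (subtitle burning and image overlay) can be sketched with ffmpeg as follows, assuming an ffmpeg build with the subtitles and overlay filters; the paths and positions are placeholders:

```python
# Minimal sketch of the splicing stage: subtitle burning and image overlay,
# re-encoded to H.264/MP4. Requires ffmpeg with libx264 and subtitle support.
import subprocess

def burn_subtitles(video: str, srt: str, out: str) -> None:
    """Render retrieved subtitle text onto the target video segment."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}",
         "-c:v", "libx264", "-c:a", "copy", out],
        check=True,
    )

def overlay_image(video: str, image: str, out: str, x: int = 20, y: int = 20) -> None:
    """Overlay a matched poster/illustration onto the video (picture-in-picture style)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-i", image,
         "-filter_complex", f"overlay={x}:{y}",
         "-c:v", "libx264", "-c:a", "copy", out],
        check=True,
    )
```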
If the video stream and the text content are rendered by split screens, the video stream and the text content can be used as independent windows, and the front end can be laid out and played through an HTML5/JS or mobile terminal SDK.
In an implementation, the synthesized multimedia file or stream address is returned to the requesting end (e.g., client browser, APP). The client performs self-adaptive playing according to the network conditions (such as bandwidth and delay) and the equipment performance, or performs automatic playing according to the system preset.
In the live broadcast scene, the fusion request can be executed in a real-time processing pipeline, the synthesized live broadcast stream is pushed to a CDN or a media server by protocols such as RTMP/HLS/WebRTC, and the client side realizes synchronous watching with the anchor/audience through a play address.
In the on-demand scene, the spliced file can be stored in a media server or an object storage, and the access URL is returned for the user to click and play in a front-end browser or an APP interface.
In addition, the embodiment can record the click rate, the stay time or the satisfaction evaluation of the user on the fusion content at the front end and transmit the interaction information back to the server so as to dynamically adjust in the follow-up fusion strategy. If the preference of the user to a certain type of fusion mode is detected to be high subsequently, the system can properly improve the priority of the fusion strategy in the next splicing.
Therefore, the invention aims at the cross-media display process of the target media unit, not only can efficiently search the candidate media unit, but also realizes a plurality of splicing modes through the set fusion strategy, and finally presents the processing result on the user terminal in a visual and diversified mode, thereby meeting the cross-media fusion requirements in the scenes of online education, live broadcast E-commerce, video conference, entertainment content aggregation and the like. The whole flow is from request to output, the complementarity and visual effect of the content can be obviously improved on the premise of not losing the quality of the multimedia, and the requirement of users for synchronous browsing of the multi-mode information can be met by quicker response.
As an alternative embodiment, the cross-media presentation further comprises:
slicing live stream data according to time periods, and generating the basic feature vector and the pseudo-query vector for each time period;
And incrementally writing the fused media representation vectors of the live stream data into the vector database, and matching by using the latest media representation vectors during online retrieval.
The application is suitable for large-scale and real-time data stream in live broadcast scene, and further performs time slice, incremental vector writing and online retrieval matching on live broadcast stream data.
In particular implementations, after live stream data is received, live content is continuously captured by a pre-deployed real-time acquisition module (e.g., FFmpeg stream ingestion or WebRTC reception), and the live stream is sliced into a plurality of discrete "live slices" at a set period (e.g., every 5 seconds or every 10 seconds). After slicing is completed, the system immediately extracts a basic feature vector F_base from the live slice on the server side or on the GPU cluster side.
For example, if the live stream is video content, the system performs key-frame extraction or sparse sampling on the 5-second or 10-second video segment and performs convolution/self-attention analysis on the key frames or sampled frames using a video feature extraction model (e.g., a 3D CNN or Video Transformer) to generate a basic feature vector corresponding to the video slice.
If the live stream contains audio, the audio signal in the time period can further undergo sampling-rate unification and Mel-spectrogram conversion, and audio features are extracted using an audio coding network (such as CNN+RNN or an Audio Transformer) and combined into the feature vector representation of the same live slice.
After the basic feature vector of the live slice is obtained, the pseudo-query module is invoked to generate a pseudo-query vector Q_pseudo for the live slice based on a multi-head self-attention or decoding operation on the feature vector. Because live data is highly time-sensitive and has no fixed file division, this embodiment treats each time-sliced video or audio segment as a "media unit", so that the subsequent pseudo-query vector generation and cross-media implicit interaction module can follow the same logic and model structure as used previously for offline media.
The basic feature vector and the pseudo-query vector are then input to the implicit interaction module for fusion, and the output fused media representation vector F_fused represents the position of the live slice in the cross-media semantic space. Because the live stream continuously generates new segments in real time, this embodiment adopts an incremental writing mechanism to continuously insert the newly generated fused media representation vectors into the vector database.
In an implementation, a long connection to the vector database (e.g., Milvus, Faiss, or HNSW) may be maintained in advance, and after each slice is processed, the fused media representation vector F_fused and the corresponding metadata (live-room ID, start and end seconds of the time period, frame information) are written to the database to generate a new data entry.
If the database supports online index updating, the index construction module can be triggered after writing to perform an ANN index update on the incremental vector samples; if the database adopts micro-batch indexing, the index can be updated in batches within a short time interval to balance real-time performance and throughput.
When an external request needs to perform cross-media presentation or association retrieval on a live stream, a newly inserted live slice representation vector can be directly queried in a vector database. For example, in an educational live scene, a fused media representation vector corresponding to a certain period of the teacher's current live content may be used to retrieve a courseware picture or presentation page with the highest similarity.
For example, the most recently inserted live slice vector (falling within the time period [T0 − 5s, T0]) may be looked up based on the current time T0, or conditional filtering may be performed on the timestamp field so that only the most recently generated portion is retrieved, to ensure that the results remain synchronized with the live progress.
If other media resources with similarity higher than the threshold value are retrieved, the live broadcast slice and the matched video, audio or text resources can be spliced or overlapped according to the fusion strategy, and finally the multi-mode interaction experience of live broadcast and supplementary content is realized at the front end.
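For illustration, the live-stream path can be sketched as follows: ffmpeg slices the incoming stream into fixed-length segments, and a watcher loop extracts features, generates the fused representation, and writes it incrementally to the index. The helper callables (video_feature, pseudo_query, implicit_interaction, insert) stand for the components sketched in earlier examples and are assumptions of this sketch:

```python
# Minimal sketch of live-stream slicing and incremental indexing.
import glob
import subprocess
import time

def start_slicer(stream_url: str, out_dir: str, seconds: int = 10) -> subprocess.Popen:
    """Slice the live stream into fixed-length MP4 segments via ffmpeg's segment muxer."""
    return subprocess.Popen(
        ["ffmpeg", "-i", stream_url, "-c", "copy", "-f", "segment",
         "-segment_time", str(seconds), "-reset_timestamps", "1",
         f"{out_dir}/slice_%05d.mp4"]
    )

def watch_and_index(out_dir: str, video_feature, pseudo_query,
                    implicit_interaction, insert, poll: float = 1.0) -> None:
    """Incrementally index each live slice as it appears on disk."""
    seen = set()
    while True:
        for path in sorted(glob.glob(f"{out_dir}/slice_*.mp4")):
            if path in seen:
                continue
            f_base = video_feature(path)                       # basic feature vector
            q_pseudo = pseudo_query(f_base)                    # pseudo-query vector
            f_fused = implicit_interaction(f_base, q_pseudo)   # fused representation
            insert(f_fused, path)                              # incremental write + index update
            seen.add(path)
        time.sleep(poll)
```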
Thus, through the above process of slicing, incremental writing, and online retrieval of live stream data, the present embodiment can achieve real-time synchronization of cross-media fusion in live scenes. After each new slice is collected, segmented, its features extracted, and its fused media representation vector generated, the vector is immediately written into the vector database and the index is updated, so that subsequent retrieval can obtain the latest live state or semantic information. This greatly improves user experience in terms of latency and ensures that the full media fusion scheme adapts to the continuously changing content demands of live scenes.
Based on the same inventive concept, the embodiment of the disclosure further provides a full media fusion system based on complementary fusion corresponding to the full media fusion method based on complementary fusion, and since the principle of solving the problem by the system in the embodiment of the disclosure is similar to that of the full media fusion method based on complementary fusion described in the embodiment of the disclosure, the implementation of the system can refer to the implementation of the method, and the repetition is omitted.
Referring to Fig. 3, which is a schematic diagram of a full media fusion system based on complementary fusion according to an embodiment of the disclosure, the system includes an acquisition module 10, a first processing module 20, a feature extraction module 30, a second processing module 40, a third processing module 50, and a display module 60;
The acquisition module 10 is configured to acquire different types of media data from a plurality of media sources, and perform preliminary formatting processing on the media data, where the media data includes text, image, audio, video and live stream data;
The first processing module 20 is configured to divide the formatted media data into a plurality of processable media units according to a preset segmentation strategy, where the media units refer to relatively independent media segments in a time or space range, and each media unit includes at least one of a text, a frame of a video, a picture or a group of pictures, and a piece of audio;
The feature extraction module 30 is configured to perform feature extraction on different types of media data, and generate a basic feature vector corresponding to each media unit;
The second processing module 40 is configured to generate, using a pseudo-query module, a pseudo-query vector associated with each media unit based on the basic feature vector;
The third processing module 50 is configured to input the basic feature vector and the pseudo-query vector to an implicit interaction module, and output a fused media representation vector; storing the media representation vector and the corresponding media metadata thereof into a vector database, and establishing an index for the media representation vector in the vector database;
The display module 60 is configured to, in response to receiving a fusion request for a target media unit, obtain media resources matched with the target media unit by retrieving the vector database, and synchronously synthesize the target media unit with the matched media resources according to timestamp information of the matched media resources, so as to perform cross-media display of the target media unit.
It will be appreciated by those skilled in the art that, in the above-described method of the specific embodiments, the written order of the steps does not imply a strict order of execution; the actual order should be determined by the functions and possible internal logic of the steps. It should also be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.