
CN120123970B - Omnimedia fusion method and system based on complementary fusion - Google Patents

Omnimedia fusion method and system based on complementary fusion

Info

Publication number
CN120123970B
CN120123970B
Authority
CN
China
Prior art keywords
media
vector
fusion
unit
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510170015.0A
Other languages
Chinese (zh)
Other versions
CN120123970A (en)
Inventor
孙琪
汤敬华
郑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shengtong Zhiming Technology Co ltd
Original Assignee
Shanghai Shengtong Zhiming Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shengtong Zhiming Technology Co ltd filed Critical Shanghai Shengtong Zhiming Technology Co ltd
Priority to CN202510170015.0A priority Critical patent/CN120123970B/en
Publication of CN120123970A publication Critical patent/CN120123970A/en
Application granted granted Critical
Publication of CN120123970B publication Critical patent/CN120123970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present disclosure provides a full media fusion method and system based on complementary fusion, which relates to the field of converged communications. The method includes collecting different types of media data from multiple media sources, dividing them into several processable media units according to a preset segmentation strategy; performing feature extraction on different types of media data respectively to generate corresponding basic feature vectors; using a pseudo-query module to generate an associated pseudo-query vector; inputting the basic feature vector and the pseudo-query vector into an implicit interaction module, and outputting a fused media representation vector; storing the media representation vector and its corresponding media metadata in a vector database, and indexing the media representation vector in the vector database; in response to receiving a fusion request for a target media unit, obtaining media resources matching the target media unit by searching the vector database to perform cross-media display of the target media unit.

Description

Full-media fusion method and system based on complementary fusion
Technical Field
The disclosure relates to the field of converged communication, in particular to a full media fusion method and system based on complementary fusion.
Background
Multi-modal data such as text, images, audio, video and live streams are increasingly widely used across industries. Data of different modalities differ markedly in structure, semantics and temporal distribution, and traditional single-modality or simple concatenation-based processing methods can hardly achieve deep fusion and efficient matching of cross-modal content. Especially when data of multiple modalities must be combined and displayed (for example, aligning video with text captions, or mixing a live stream with images/audio), the lack of an effective complementary mechanism for multi-modal features easily leads to low retrieval and matching efficiency, high data redundancy and inaccurate cross-modal alignment, making it difficult to meet the requirements of multimedia fusion in real-time, high-concurrency environments.
For example, china patent application with bulletin number of CN113033647A discloses a multi-mode feature fusion method, which mainly has the main ideas that features of each mode of a multimedia resource are extracted respectively, the features of each mode are combined in a mode dimension to form multi-channel features, and then fusion features are generated through multi-channel convolution processing, so that information complementation among the features of different modes is realized. The scheme provides multiple convolution operations in the dimension D direction based on convolution kernels, supports technical means such as feature aggregation and stretching treatment, and is expected to improve the effect of multi-mode feature fusion to a certain extent. The scheme emphasizes that the influence of a single characteristic value is effectively reduced through the convolution processing among channels, improves the expression capability of fusion characteristics, and is suitable for characteristic fusion scenes of multimedia resources such as video image frame data, audio data, text data and the like.
However, the above scheme mainly performs relatively static "same-dimension concatenation" of multi-modal features through convolution and channel convolution. It neither considers how to reduce the computational overhead of large-scale online comparison in cross-media retrieval and real-time interaction scenarios, nor proposes an incremental fusion strategy for media with strong temporal continuity such as live streams. Although its channel-level combination provides some inter-channel complementarity, it lacks a mature implicit interaction and offline-awareness mechanism when the potential needs and associations of different modalities must be dynamically supplemented, synchronized or displayed on multiple terminals in real time under online or high-concurrency conditions, so real-time performance and fusion accuracy are hard to balance. In addition, with channel convolution restricted to same-dimension features, it is difficult to further reduce network load or improve cross-media retrieval efficiency when each media unit requires complementary queries in an online fusion scenario. A full-media fusion scheme is therefore still needed that supports offline preprocessing and fast online matching of multi-modal data and enables low-latency, multi-terminal synchronous display of live streams and other real-time data, overcoming the limitations of existing schemes in cross-modal association accuracy and real-time mixed presentation.
Disclosure of Invention
Aiming at the defects of the prior art, the embodiment of the disclosure provides a full media fusion method and system based on complementary fusion.
In a first aspect, an embodiment of the present disclosure provides a full media fusion method based on complementary fusion, including:
Collecting different types of media data from a plurality of media sources, and performing preliminary formatting processing on the media data, wherein the media data comprises text, images, audio, video and live stream data;
Dividing the formatted media data into media units according to a preset segmentation strategy, wherein a media unit refers to a relatively independent media segment within a temporal or spatial range, and each media unit comprises at least one of a piece of text, a frame or segment of video, one image or a group of images, and a segment of audio;
respectively extracting features of different types of media data to generate basic feature vectors corresponding to the media units;
Generating, with a pseudo-query module, a pseudo-query vector associated with each media unit based on the base feature vector;
Inputting the basic feature vector and the pseudo query vector to an implicit interaction module, and outputting a fused media representation vector; storing the media representation vector and the corresponding media metadata thereof into a vector database, and establishing an index for the media representation vector in the vector database;
and when a fusion request for a target media unit is received, acquiring media resources matched with the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matched media resources according to the timestamp information of the matched media resources so as to perform cross-media display of the target media unit.
As an alternative implementation, the generating the pseudo-query vector associated with each media unit includes:
performing attention mechanism operation on the basic feature vector to generate potential demand characterization;
Generating a pseudo-query vector based on the potential demand characterization and optimizing the pseudo-query vector using reconstruction or contrast loss.
As an alternative embodiment, the outputting the fused media representation vector includes:
splicing the basic feature vector and the pseudo-query vector to form an input sequence;
Processing the input sequence through a multi-head self-attention network to generate a fused media representation vector;
Outputting the fused media representation vector for indexing and retrieval.
As an alternative embodiment, the storing the media representation vector and the corresponding media metadata into a vector database, and indexing the media representation vector in the vector database includes:
establishing a mapping relation between the media representation vector and the media metadata, wherein the media metadata comprises a media type, a timestamp and a frame index;
storing the media representation vector and media metadata thereof to a vector database;
the media representation vector is indexed based on an approximate nearest neighbor search algorithm or a hash index algorithm.
As an alternative embodiment, the cross-media presentation comprises:
receiving a fusion request for the target media unit;
searching candidate media units with the media representation vector similarity with the target media unit higher than a preset threshold value in the vector database;
Splicing and fusing the target media unit and the candidate media unit according to a set fusion strategy;
and outputting the fusion result to the front end for display.
As an alternative embodiment, the cross-media presentation further comprises:
slicing live stream data according to time periods, and generating the basic feature vector and the pseudo-query vector for each time period;
And incrementally writing the fused media representation vectors of the live stream data into the vector database, and matching by using the latest media representation vectors during online retrieval.
As an alternative embodiment, the feature extraction includes:
and extracting the basic feature vectors from the original media data by using a convolutional neural network, a Vision Transformer, a pre-trained language model or a speech recognition model for text, images, audio and video, respectively.
As an alternative implementation mode, the cross-media presentation further comprises embedding the matched media resources into rendering windows corresponding to target media units.
In a second aspect, the embodiment of the disclosure also provides an all-media fusion system based on complementary fusion, which comprises an acquisition module, a first processing module, a feature extraction module, a second processing module, a third processing module and a display module;
The acquisition module is used for acquiring different types of media data from a plurality of media sources and carrying out preliminary formatting processing on the media data, wherein the media data comprises text, images, audio, video and live stream data;
the first processing module is used for dividing the formatted media data into media units according to a preset segmentation strategy, wherein a media unit refers to a relatively independent media segment within a temporal or spatial range, and each media unit comprises at least one of a piece of text, a frame or segment of video, one image or a group of images, and a segment of audio;
The feature extraction module is used for respectively extracting features of different types of media data and generating basic feature vectors corresponding to the media units;
The second processing module is used for generating a pseudo-query vector associated with each media unit based on the basic feature vector by using the pseudo-query module;
The third processing module is used for inputting the basic feature vector and the pseudo query vector into the implicit interaction module and outputting a fused media representation vector, storing the media representation vector and corresponding media metadata thereof into a vector database, and establishing an index for the media representation vector in the vector database;
and the display module is used for responding to the fusion request for the target media unit, acquiring media resources matched with the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matched media resources according to the timestamp information of the matched media resources so as to perform cross-media display of the target media unit.
Compared with the prior art, the method has the beneficial effects that different media (text, images, audio, video and live broadcast streams) can realize cross-mode retrieval and complementation by utilizing the same vector database, and the method is suitable for being applied to scenes such as online teaching, video entertainment, conference live broadcast and telemedicine. By adopting approximate nearest neighbor search and time stamp synchronization of the vector database, cross-media matching and fusion can be completed only by dot product calculation or similarity calculation in an online stage, and the fusion response speed is greatly improved. With the increase of media data, the system only needs to perform feature extraction and implicit interaction on the newly added media units and then write the newly added media units into a vector database, and index updating is performed, so that large-scale distributed expansion is supported.
Drawings
FIG. 1 is a flowchart of a complementary fusion-based full media fusion method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for creating an index in a vector database according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a full media fusion system based on complementary fusion provided in an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
The invention provides a full media fusion method based on complementary fusion, and referring to fig. 1, a flowchart of the full media fusion method based on complementary fusion provided by an embodiment of the disclosure is provided, and the method comprises steps S101 to S106, wherein:
S101, collecting media data of different types from a plurality of media sources, and performing preliminary formatting processing on the media data, wherein the media data comprises text, images, audio, video and live stream data;
s102, dividing the formatted media data into a plurality of processable media units according to a preset segmentation strategy, wherein the media units refer to relatively independent media fragments in a time or space range, and each media unit at least comprises one of a text, a frame or a video, a picture or a group of pictures and a section of audio;
S103, respectively extracting features of different types of media data to generate basic feature vectors corresponding to the media units;
s104, generating a pseudo-query vector associated with each media unit based on the basic feature vector by using a pseudo-query module;
S105, inputting the basic feature vector and the pseudo query vector into an implicit interaction module, and outputting a fused media representation vector, storing the media representation vector and corresponding media metadata into a vector database, and establishing an index for the media representation vector in the vector database;
And S106, responding to the fusion request for the target media unit, acquiring media resources matched with the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matched media resources according to the timestamp information of the matched media resources so as to perform cross-media display of the target media unit.
For S101 described above:
Under the full media fusion scene, the media data of different sources are various in types and non-uniform in format, and if the subsequent analysis is directly carried out, the recognition or the processing is easy to be difficult. The media data of different sources are collected uniformly, and preliminary formatting processing is carried out, so that a foundation can be laid for subsequent feature extraction and cross-media matching.
In implementations, different types of media data may be obtained from multiple media sources including text databases, image storage services, audio recording and storage servers, video streaming platforms, live broadcast servers, and the like.
For example, text data can be obtained from an existing document library, a news source or a document uploaded by a user through an API interface, image data can be read from materials uploaded by an image storage or photographing device, audio and video data can be loaded from a streaming platform or a local file, and streaming fragments can be captured through a real-time acquisition port for live streaming data.
In a specific implementation, the preliminary formatting processing of the collected original media data according to the respective media types may include:
for texts, removing redundant blank, unifying coding formats (such as UTF-8) and performing necessary cleaning;
For images, an image coding format (such as JPEG/PNG) can be scaled equally or unified over the length-width resolution;
Unifying the sampling rate, bit rate or audio coding format (such as MP 3/WAV) for audio;
for video, preliminary transcoding may be based on frame rate, resolution, or encoding format (e.g., h.264/h.265);
for live streaming, the real-time stream is sliced for a predetermined period of time (e.g., 5 seconds or 10 seconds) and stored as a temporary buffer file for subsequent processing.
Thus, through the above-described processing, a base media file or data stream is generated that is available for subsequent analysis and feature extraction.
For S102 described above:
After the unified formatting is completed, the entire segment or pieces of media data still need to be further divided into smaller "media units" for the system to perform finer feature extraction and matching of different types and different temporal/spatial ranges of segments in subsequent steps.
In a specific implementation, the formatted media data is divided into a plurality of processable media units according to a preset segmentation strategy.
Text data may be divided at paragraph or sentence granularity, with each paragraph corresponding to one media unit; for image data, each image or group of images is treated as one media unit; for audio or video data, a time-based segmentation strategy may be adopted, for example every 10 seconds or every key-frame interval forms one media unit; and for live-stream data, the live stream may be segmented into several relatively independent media units according to the same time-based strategy.
It should be noted that, in the present disclosure, a media unit refers to a relatively independent media segment in a temporal or spatial range, and may include one or more of a text, a frame or a video, a picture or a group of pictures, and a piece of audio. By the dividing mode, the media units can be used as processing objects in subsequent steps, and management and analysis of multi-mode data are simplified.
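As an illustrative sketch only (not part of the claimed embodiment), the segmentation strategy described above could be expressed roughly as follows in Python; the MediaUnit structure, the helper names and the 10-second window are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class MediaUnit:
    media_type: str                      # "text", "image", "audio", "video", "live"
    payload: Any                         # paragraph string, image path, (start, end) time span, ...
    metadata: Dict[str, Any] = field(default_factory=dict)

def segment_text(doc: str) -> List[MediaUnit]:
    # Paragraph-level granularity: one paragraph -> one media unit.
    return [MediaUnit("text", p, {"paragraph_id": i})
            for i, p in enumerate(doc.split("\n\n")) if p.strip()]

def segment_timed(media_type: str, duration_s: float, window_s: float = 10.0) -> List[MediaUnit]:
    # Time-based strategy for audio / video / live streams: fixed windows (e.g. 10 s).
    units, start, idx = [], 0.0, 0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        units.append(MediaUnit(media_type, (start, end),
                               {"slice_id": idx, "t_start": start, "t_end": end}))
        start, idx = end, idx + 1
    return units
```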
For S103 described above:
The key to multi-modal fusion is to convert different types of media into a comparable, retrievable vector representation. Without unified feature vector representation, it is difficult to achieve efficient alignment and complementation across text, image, audio-video, etc. media.
In a specific implementation, for each media unit obtained by division, a feature extraction algorithm adapted to the media type of each media unit needs to be adopted to obtain a basic feature vector. For example:
For text feature extraction, a vectorization operation may be performed on text units using a pre-trained language model (e.g., BERT, RoBERTa) or a Transformer-based semantic encoder to generate semantic expression vectors.
For image feature extraction, the images may be analyzed by convolution/self-attention using a CNN (e.g., ResNet, VGG) or a Vision Transformer to extract visual feature vectors.
For audio feature extraction, Mel-spectrogram or MFCC analysis may be performed on the audio media units (including live stream segments), and the audio feature vectors may be obtained in combination with RNN or Transformer structures.
For video feature extraction, the spatio-temporal feature vectors of the video media units can be extracted through key-frame extraction, 3D convolutional networks, Video Transformer models and the like.
It should be emphasized that if a video unit is long, average pooling may be performed over multiple frames to obtain an overall representation.
Thus, through the multi-modal feature extraction described above, the present disclosure ultimately generates a corresponding base feature vector for each media unit and stores it in memory or temporary storage for subsequent processing.
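A minimal sketch of such per-modality feature extraction, assuming off-the-shelf pre-trained encoders (Hugging Face Transformers and torchvision); the specific models and pooling choices are illustrative assumptions, not the claimed implementation:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision import models, transforms

# Text encoder (e.g. BERT): mean-pool token embeddings into one semantic vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = text_encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
    return out.mean(dim=1).squeeze(0)                        # (768,)

# Image encoder (e.g. ResNet-50 with the classification head removed).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def image_feature(pil_image) -> torch.Tensor:
    with torch.no_grad():
        feat = image_encoder(preprocess(pil_image).unsqueeze(0))  # (1, 2048, 1, 1)
    return feat.flatten()                                         # (2048,)

# Video: a simple assumption is to average-pool per-frame image features over key frames.
def video_feature(frames) -> torch.Tensor:
    return torch.stack([image_feature(f) for f in frames]).mean(dim=0)
```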
For S104 described above:
The simple basic feature vector can only reflect the content of the media unit itself, but cannot reflect the complementary information that the media unit may need for other media resources. If complex cross comparison is directly performed between all media units in the online stage, the calculation cost is high and the real-time performance is not enough.
In implementations, the present disclosure inputs the base feature vector of each media unit into a pseudo-query module (e.g., a lightweight Transformer decoder or a self-attention-based sub-network) to generate a pseudo-query vector associated with each media unit. The pseudo-query vector may be understood as a potential expression of "the need or point of association that the media unit may have with other media".
In implementations, contrast learning or reconstruction loss may be employed during the training phase such that the pseudo-query vector captures the core semantics and potentially interaction information of the media unit.
In a specific implementation, the generation manner of the pseudo query vector may be:
Inputting basic feature vectors;
The middle process is that the pseudo-query module carries out multi-head self-attention calculation or decoding operation on the basic feature vector according to the parameters obtained by training;
a pseudo-query vector associated with the media unit is output.
For example, if the media unit is a piece of text describing a cooking process, the pseudo-query vector may include potential needs for related ingredients, cooking temperatures and durations; for an image of a particular scene, the pseudo-query vector may express an association with information such as location, subject, or a video that can be combined with it.
As an alternative implementation, the generating the pseudo-query vector associated with each media unit includes:
performing attention mechanism operation on the basic feature vector to generate potential demand characterization;
Generating a pseudo-query vector based on the potential demand characterization, and optimizing the pseudo-query vector with a reconstruction or contrast penalty to enable the pseudo-query vector to characterize association features between the media unit and other media resources.
After the multi-modal feature extraction is completed, each media unit has a basic feature vector, denoted F_base. To further generate pseudo-query vectors that characterize the association features between the media unit and other media resources, the present disclosure introduces two steps: an attention mechanism operation and reconstruction/contrast loss optimization.
In a specific implementation, F_base is input into a lightweight "attention generation sub-module", which can be implemented with a multi-head self-attention structure and comprises at least the following key components:
1. linear mapping layer:
F_base is mapped into three vector spaces: query (Q), key (K) and value (V). Specifically, a trainable weight matrix W_Q may be set to generate the query vector Q = F_base · W_Q, and K and V are generated similarly.
Because the feature vector of a media unit is often high-dimensional, the linear mapping layer can reduce parameters while preserving information; it can be implemented with dimensionality reduction or by keeping the same dimension.
2. Attention calculation layer:
The attention calculation layer performs scaled dot-product attention on Q, K, V:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
where the attention operation assigns learnable weights among the input features to highlight the information most relevant to the current task, softmax is a normalization function that maps the input vector into the (0, 1) interval with components summing to 1 (yielding attention weights in the form of a probability distribution), and d_k is the dimension of the key vectors.
It is emphasized that for the case of only a single vector input (i.e., the F_base of a single media unit), this embodiment may treat F_base as a batch, introduce a virtual sequence length, or add fixed positional encoding, ensuring that the attention mechanism remains computationally well-defined. For example, if F_base corresponds to a sequence of video key frames, multi-vector attention operations can also be performed at the frame level.
3. Multi-head merging layer:
If the multi-head attention is adopted, the output generated by each head is spliced in the channel dimension, and then the linear mapping is carried out to obtain an output vector A. The output vector a can be regarded as a "potential demand representation" representing an abstract representation of the media unit and external association elements after the attention mechanism.
Through the attention mechanism operation, the obtained potential demand representation A can reflect the potential matching or complementary demand of the media unit to other media resources, and initially characterizes the 'which external information the media unit is possibly associated with'.
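A minimal PyTorch sketch of such an attention generation sub-module (linear Q/K/V mapping, scaled dot-product attention, multi-head merging); the dimensions, layer names and sequence pooling are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class PseudoQueryGenerator(nn.Module):
    """Attention generation sub-module: F_base -> potential-demand representation A."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Linear mapping layer: project F_base into query / key / value spaces.
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        # Scaled dot-product attention with multi-head splitting and output projection.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, f_base: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); a single vector can be passed with seq_len == 1.
        q, k, v = self.w_q(f_base), self.w_k(f_base), self.w_v(f_base)
        a, _ = self.attn(q, k, v)          # per-head attention over the sequence
        a = self.out(a.mean(dim=1))        # merge over the sequence -> (batch, dim)
        return a                           # potential demand representation A
```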
Furthermore, it has been found in practical system implementations that a self-attention operation alone does not guarantee that A reflects cross-media association features accurately enough. Accordingly, the present disclosure further optimizes A through either a reconstruction loss (Reconstruction Loss) or a contrast loss (Contrastive Loss) to yield the final pseudo-query vector Q_pseudo.
For reconstruction losses:
In implementations, if the system has additional labeling or context information about the media units (e.g., key labels, summaries, corresponding subtitles), A can be input into a "mini-decoder" or classifier, which forms a constraint by predicting or reconstructing this additional information.
For example, when the media unit is a text with a manually marked subject tag, the system can cause A to predict the subject tag, and if the prediction is successful, the capturing of the core semantic of the text by A is more accurate.
The loss function can be designed as cross entropy, mean square error and the like, and parameters of the attention generation submodule and the merging layer are optimized through back propagation, so that the A has higher semantic fitness.
For contrast loss:
In a specific implementation, if the system has positive and negative sample pairs (for example, two video segments of the same scene, or a video segment and the text describing it at the same timestamp), contrastive learning may be adopted so that the vector distance between A and a matched sample becomes smaller and the distance between A and an unmatched sample becomes larger.
For example, assuming that media unit A and media unit B are complementary resources, they can be labeled as a positive sample pair, so that the Q_pseudo generated for each is pulled closer together; a random unrelated media unit C forms a negative sample pair, whose Q_pseudo is pushed farther away. In this way, A becomes better at distinguishing associated from unassociated media.
Common contrastive loss formulations such as InfoNCE or triplet loss enable A, after several iterations, to learn the discrimination ability required for cross-media retrieval and matching.
Further, in a specific implementation, the pseudo-query vector Q_pseudo is obtained stably after A, generated by the self-attention mechanism, is further trained iteratively through back-propagation.
If the reconstruction loss is used, the system can treat the output A as Q_pseudo after training is completed; if the contrast loss is used, a linear transformation layer (MLP) or a normalization operation can be added after A to obtain the final Q_pseudo.
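As an illustrative sketch of the contrast-loss option, an InfoNCE-style objective over a batch of pseudo-query vectors could look roughly as follows; the temperature value and batch construction are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss over a batch of pseudo-query vectors.

    anchor[i] and positive[i] are pseudo-query vectors of complementary media
    units (a positive pair); all other items in the batch act as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                   # diagonal entries are the positives
```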
It should be noted that, in the implementation level, the attention module, the decoder/the comparison network may be trained end-to-end or in stages in the training stage, so as to ensure that each step of parameter update can reduce the prediction error or increase the distinction degree, thereby making the pseudo-query vector retain the original media unit feature information and have the ability of "sensing demand" for other media.
It is emphasized that in embodiments of the present invention, the use of the reconstruction loss (Reconstruction Loss) or the contrast loss (Contrastive Loss) is not merely a conventional algorithmic choice for model training; it is deeply coupled with the multimedia fusion application scenario and plays a key technical role in cross-modal data processing and online real-time presentation.
For example, relying solely on basic feature vectors to index or match media content makes it difficult to express the complementary needs a media unit may have for information of other modalities. Large-scale cross-modal interaction or complex alignment is then required in the online phase, which significantly increases latency. In the present invention, the concept of "pseudo-query vectors" is introduced so that each media unit learns its "potential needs for external media content" during an offline or small-batch training phase. To let the pseudo-query vector represent cross-media association or complementarity needs more accurately and stably, the invention uses the reconstruction loss or the contrast loss together with the attention mechanism, overcoming the limitation that a plain perceptron or a simple classification loss cannot fully capture the "demand-supply" semantic interaction among multiple modalities.
For example, in an actual multi-modal service, subtitles, keyword tags, meta-event descriptions, etc. are often configured for video segments or audio segments, or there are cross-labels (e.g., text descriptions of a segment of video segment corresponding thereto) on different modalities for the same content.
The present invention exploits the reconstruction loss so that the pseudo-query vector must be able to reconstruct or predict these multi-modal auxiliary labels. Once the model learns to reconstruct the auxiliary information from the pseudo-query vector during the training phase, it indicates that the pseudo-query vector does capture the inherent semantic connections between the media units and external complementary resources.
In addition, in the application level, the use of reconstruction loss greatly improves the accuracy of injecting complementary demand perception for each media unit in the off-line stage, and reduces redundant inter-mode comparison and expense in on-line inquiry, thereby being beneficial to resource scheduling and fusion presentation in high concurrency scenes such as live broadcast, on-line on-demand and the like in real time.
Further, in order to accelerate online retrieval and ensure fusion accuracy, the invention can distinguish between cross-mode positive and negative sample pairs (such as audio and text for explaining the positive and negative sample pairs, or video frames and matched image illustrations and the like) in a training stage.
As described above, when it is determined that "media units A and B have complementary relationship in a certain application scenario", they are labeled as positive sample pairs, so that the corresponding pseudo-query vector distances are closer, otherwise, if they are not related, they are labeled as negative sample pairs, and the vector distances are pulled apart.
Thus, contrast loss is no longer merely a generic machine learning method, but in combination with the specific requirements of the multimedia fusion application, a pre-learning of the "cross-modality association" is achieved during offline training. When the system is online, the system can quickly locate other media resources with highest cross-media supplement degree only by executing vector retrieval on the pseudo query vector of the target media unit. Compared with the similar scheme without using contrast loss, the method can greatly improve both the cross-modal matching degree and the online query speed.
In the invention, the reconstruction loss or the comparison loss is used for optimizing the 'pseudo query vector', so that the system can enable a media unit to learn 'how to carry out complementary matching with other media resources' in an offline stage or in a small batch training update. At this time, the optimal fusion object and presentation mode can be determined only by simple vector similarity operation in the online stage.
This technical contribution directly serves online multimedia fusion needs. For example, in a live-streaming e-commerce scene, the product video clip currently shown by the host can be fused in real time, via a pop-up window, with the most relevant product poster or matching usage instructions; in an educational live-streaming scene, the courseware, exercises or supplementary audio of the corresponding chapter can be recommended in time, significantly improving the user's acquisition efficiency and interactive experience.
Unlike simple use of reconstruction loss or contrast learning optimization neural network, the method emphasizes that the training strategy is fused in the whole flow of multi-modal feature extraction, implicit interaction and vector database retrieval so as to reduce the load of online calculation and improve the complementary efficiency of multi-modal content. In other words, the reconstruction/contrast loss is utilized, which is not in the pure deep learning algorithm level, but combines the system architecture requirement of the multimedia data offline-online fusion, thereby forming the specific technical effects of cross-media complementation, such as time delay reduction, bandwidth saving, retrieval accuracy improvement, user satisfaction degree and the like.
Further, the present embodiment may train the pseudo-query generation subsystem in an offline environment (e.g., GPU/TPU cluster), process existing media data on a large scale, and store the generated pseudo-query vectors together in a vector database;
When a new media unit is collected by the system and basic feature extraction is completed, immediately calling an attention module and a contrast/reconstruction network to generate a pseudo-query vector, and storing the pseudo-query vector into a database for real-time retrieval;
in addition, if the system finds that certain pseudo query vectors are not matched with the actual demands according to user feedback, the system can regularly (or in real time) train and update the parameters of the attention mechanism in small batches, so that the system adapts to the environment and new media.
In this way, the pseudo-query vector can more accurately represent the association characteristics between each media unit and other media resources and serve as the basis for subsequent cross-media complementary display. Compared with the traditional single feature vector method, the method has the advantages that the attribute of sensing the external media requirement is injected into each media unit in the off-line stage, so that the large-scale calculation amount in the on-line retrieval process is reduced, and the robustness and the response speed of the system in the multi-mode fusion scene are improved.
For S105 described above:
In the offline stage, the "basic feature vector" and the "pseudo-query vector" are further fused, so that the final representation vector (i.e. "fused media representation vector") of each media unit contains both its own features and retains the representation of other media requirements or associations. Therefore, real-time interaction of all media is not needed during subsequent online retrieval, and the operand is greatly reduced.
In implementations, the present disclosure inputs the "base feature vector" and the "pseudo-query vector" of each media unit together into an implicit interaction module. For example, implicit interaction may be implemented using a multi-layer Transformer or a multi-head self-attention network.
Wherein the operation of implicit interaction comprises:
By self-attention or cross-attention mechanism, merging the 'self-characteristics of the media unit' and 'pseudo-query information expected or required by the media unit', and generating a merged media representation vector with more 'cross-media complementation awareness';
The purpose of implicit interactions is to inject a perception of other media modalities or content requirements for each media unit during the offline phase, thus eliminating the need for extensive cross-modal complex computations during online retrieval.
Finally, the fused media representation vector output by the implicit interaction module can better represent the semantic features of the media unit and the potential complementary relation between the semantic features and other modalities, and provides a basis for subsequent retrieval and synthesis.
Further, the present disclosure stores the fused media representation vector with media metadata information (e.g., media type, timestamp, frame index, text paragraph ID, live stream slice ID, etc.) for the media unit into a vector database.
Illustratively, the vector database may employ Faiss, Milvus, or HNSW-based structures to support large-scale vector similarity search.
Further, after storage is complete, an Approximate Nearest Neighbor (ANN) or hash index may be constructed from the fused media representation vector to quickly perform subsequent search matching operations. At this point, each media unit has unique identification and index entries in the vector database for online phase lookup.
As an alternative embodiment, the outputting the fused media representation vector includes:
splicing the basic feature vector and the pseudo-query vector to form an input sequence;
Processing the input sequence through a multi-head self-attention network to generate a fused media representation vector;
Outputting the fused media representation vector for indexing and retrieval.
In order to acquire the fused media representation vector and facilitate subsequent indexing and retrieval, the following operations may be further performed in addition to the foregoing generation and optimization of the pseudo-query vector:
After the generation of the pseudo-query vector Q_pseudo and the base feature vector F_base is completed, the system concatenates the two to form the input sequence.
By way of example, the following operational steps may be employed:
The first step is serialization: when both F_base and Q_pseudo are single vectors, they can be concatenated directly along the vector dimension; if either includes timing information (e.g., a sequence of video key frames or audio frames), the per-frame vectors are combined in temporal or spatial order, and the pseudo-query vector is then appended to the head or tail of the sequence to form the input sequence S_in.
By doing so, it can be ensured that the multimodal information and the 'demand/association' feature are presented in the same input tensor, and the subsequent processing layer does not need to execute complex cross-tensor operation.
Next, the system inputs the input sequence S_in into a multi-head self-attention network (a Transformer encoder or similar structure). Each attention head separately computes attention weights between the vectors in the sequence to capture temporal or semantic dependencies.
If the underlying feature vector itself has a position code, such as a video frame index or text paragraph ID, this information can be retained along with the pseudo-query vector when spliced.
After the multi-head self-attention network is executed, a set of context-enhanced vectors H_ctx is generated. In this embodiment, the vector that best represents the overall sequence information (e.g., the [CLS] position vector or the average pooling of all vectors) may be selected as the fused media representation vector F_fused.
Finally, the fused media representation vector F_fused is output to a storage or retrieval module for subsequent indexing and retrieval.
In particular, to facilitate subsequent indexing, F_fused may be normalized (e.g., L2-normalized) or dimension-reduced (e.g., via PCA) to generate a final representation vector F_final. Compared with the original base feature vector F_base and the pseudo-query vector Q_pseudo, F_final better represents the fused multi-modal information and cross-media requirements, and is suitable for the similarity or distance measurements used in subsequent retrieval.
Thus, through the splicing, multi-head self-attention operation and vector output processing, the invention can generate the fused media representation vector, and provide the feature representation with more complementary consciousness for the index and retrieval flow of the subsequent vector database.
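A minimal sketch of such an implicit interaction module, assuming the pseudo-query vector is prepended to the base-feature sequence and a standard Transformer encoder performs the multi-head self-attention; the hyperparameters and the choice of the first position as the summary vector are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitInteraction(nn.Module):
    """Fuse F_base (possibly a sequence) with Q_pseudo via multi-head self-attention."""
    def __init__(self, dim: int = 768, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f_base: torch.Tensor, q_pseudo: torch.Tensor) -> torch.Tensor:
        # f_base: (batch, seq_len, dim); q_pseudo: (batch, dim)
        s_in = torch.cat([q_pseudo.unsqueeze(1), f_base], dim=1)  # pseudo-query at the head of the sequence
        h_ctx = self.encoder(s_in)                                # context-enhanced vectors H_ctx
        f_fused = h_ctx[:, 0]                                     # first position summarizes the sequence
        return F.normalize(f_fused, dim=-1)                       # L2-normalized representation for indexing
```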
Referring to fig. 2, a flowchart of a method for creating an index in a vector database according to an embodiment of the present disclosure is provided, as an optional implementation manner, the storing the media representation vector and the corresponding media metadata into the vector database, and creating an index for the media representation vector in the vector database includes steps S201 to S203, where:
S201, establishing a mapping relation between the media representation vector and the media metadata, wherein the media metadata comprises a media type, a time stamp and a frame index;
s202, storing the media representation vector and media metadata thereof to a vector database;
and S203, indexing the media representation vector based on an approximate nearest neighbor search algorithm or a hash index algorithm.
For S201 described above:
In an implementation, when generating the fused media representation vector F_fused or F_final, the corresponding media metadata is recorded synchronously, including:
media types (such as text, video, audio, image, or live stream slices);
Time stamp (in video or audio scene, identify start and stop time of the clip in original file; in live scene, identify actual playing or recording time);
frame index (if video key frames, frame number or continuous frame range may be recorded).
By creating a "mapping structure" (e.g., key-value or JSON format) containing the above media metadata for each fused media representation vector, refined matching can be achieved by vector similarity plus media metadata conditional screening at a later time of retrieval.
For 202 above:
In implementations, a database or engine supporting large-scale vector search may be selected, such as Faiss, Milvus or HNSW. F_fused is written into the database together with its mapping structure, and each entry is assigned a unique identifier (e.g., media_id).
When a user query or a system fusion request occurs, the stored vector table is searched in a similarity calculation mode, and a plurality of items which are the most similar are output.
The metadata fields (media type, timestamp, frame index) can then be used to further filter or sort the search results.
For S203 described above:
in a specific implementation, to improve the retrieval efficiency, the index construction process may be performed on all media representation vectors written into the database:
ANN (Approximate Nearest Neighbor) indexing: all vectors are partitioned, or a hierarchical graph structure (e.g., HNSW) is built in the high-dimensional space, to achieve O(log N) or sub-linear approximate search at query time.
Hash indexing: algorithms such as LSH (Locality-Sensitive Hashing) bucket the vectors in the high-dimensional space, and exact comparison is performed only within the same or nearby hash buckets at query time, improving query speed.
The indexing process is typically performed off-line in batches and may also be incrementally updated as new media data is added.
For example, if the live data continuously generates new media units, the system may periodically (or in real time) insert the newly generated fused media representation vectors into the database and update the ANN or hash index structure to ensure timeliness of retrieval.
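A minimal sketch of ANN index construction and incremental updates using Faiss (one possible engine named above); the HNSW parameter, dimensionality and normalization choice are assumptions:

```python
import faiss
import numpy as np

dim = 768                                                  # dimension of the fused representation vectors
index = faiss.IndexIDMap(faiss.IndexHNSWFlat(dim, 32))     # HNSW graph index keyed by external media IDs

def add_vectors(vectors: np.ndarray, ids: np.ndarray) -> None:
    """Incrementally write newly fused vectors (e.g. live-stream slices) into the index."""
    vecs = np.ascontiguousarray(vectors, dtype="float32")
    faiss.normalize_L2(vecs)              # on unit vectors, smaller L2 distance == higher cosine similarity
    index.add_with_ids(vecs, ids.astype("int64"))

def search(query: np.ndarray, k: int = 10):
    q = np.ascontiguousarray(query, dtype="float32")
    faiss.normalize_L2(q)
    distances, media_ids = index.search(q, k)   # approximate nearest neighbours
    return distances, media_ids
```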
Therefore, through the mapping relation establishment, storage and index construction operation, the method and the device can realize vector retrieval after fusion in the large multi-mode media library, and combine media metadata to perform accurate matching, so that cross-media fusion or alignment is more convenient. Compared with the traditional mode of only storing the original document or file name, the method of the invention remarkably improves the cross-mode retrieval speed and precision, and can be applied to various application scenes such as online education, video retrieval, intelligent monitoring, social media content recommendation and the like.
For S106 described above:
After receiving the fusion request, a recommendation or a synthesis result of cross media needs to be provided for a user or a downstream application in time. By retrieving the similarity of the "fused media representation vectors" in combination with the media metadata (e.g., time stamps, frame indexes), other media that can complement or enhance its content with the target media unit can be quickly located.
In a specific implementation, when the system receives a fusion request for a certain target media unit, other media units with vector similarity of the fused media representation with the target media unit higher than a preset threshold value are first searched in a vector database. And screening out media resources which contain corresponding time stamp information and have complementary relation with the target media unit from the retrieval result. And synchronizing and synthesizing with the target media unit according to the obtained time stamp or frame information of the media resource.
For example, if the target media unit is the nth minute of the video and the matched asset is a piece of audio or subtitle text at the corresponding time, the audio or subtitle text may be embedded into the video play stream;
for another example, if the target media unit is a segment of live stream, image/text information of the same time segment or similar semantic points is found in the matching resource for real-time superposition display.
In a specific implementation, after synthesis the present disclosure performs cross-modal combination or simultaneous rendering of the target media unit and the matched media resources.
For example, the video and the text are displayed in a split screen mode in the same playing page, the audio is overlapped into the video stream to form a new multimedia stream, and corresponding images or text descriptions are popped up at key moments in a live scene.
The finally output cross-media presentation results can be played, browsed or interacted in a client or front-end interface.
In this way, the system fuses the potential complementary demands of different media units before storage by the pseudo query vector generation and implicit interaction module in the off-line stage, thereby reducing the large-scale calculation amount in the on-line stage.
As an alternative embodiment, the cross-media presentation comprises:
receiving a fusion request for the target media unit;
searching candidate media units with the media representation vector similarity with the target media unit higher than a preset threshold value in the vector database;
Splicing and fusing the target media unit and the candidate media unit according to a set fusion strategy;
and outputting the fusion result to the front end for display.
In particular implementations, during the online run phase, the system waits or listens for converged requests from clients, upper layer business modules, or third party applications. The request includes the following information:
The target media unit identification, such as video_segment_id, audio_clip_id, text_parameter_id, or other unique ID;
Fusion preferences: for example, whether the user specifies that subtitles should be spliced in, images inserted, or multi-camera video clips switched;
Terminal/platform information: such as whether the client is a mobile terminal, a PC or an AR/VR device; the fusion strategy can be differentiated according to this information.
In this embodiment, a unified API interface, such as POST/media/fusion_request, is set at the server, and when an external call is made, the system parses and writes the request parameters into a queue or a memory cache, so as to trigger the next search and fusion process.
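A hedged sketch of such a fusion-request interface, here expressed with FastAPI; the field names and default values are assumptions for illustration, not the claimed API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class FusionRequest(BaseModel):
    target_media_id: str                 # e.g. the video_segment_id of the target media unit
    fusion_preference: str = "subtitle"  # "subtitle", "image_insert", "multi_camera", ...
    terminal: str = "mobile"             # "mobile", "pc", "ar_vr" -> differentiated fusion strategy

@app.post("/media/fusion_request")
async def fusion_request(req: FusionRequest):
    # In the real system the parsed parameters would be queued or cached and then
    # trigger the retrieval-and-fusion pipeline; here we simply acknowledge them.
    return {"status": "queued", "target": req.target_media_id, "terminal": req.terminal}
```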
In a specific implementation, after parsing out the target media unit identity, the following steps are performed in the vector database:
Its corresponding "fused media representation vector" F_fused is looked up by the target media unit ID. If the media representation vectors are stored in a partitioned (sharded) manner, the node storing the target entry must be located first.
Based on the representation vector of the target media unit, an Approximate Nearest Neighbor (ANN) search or a hash-bucket search is performed to obtain a set of candidate media units {Candidate_1, Candidate_2, ...} whose similarity is above a preset threshold (e.g., 0.8).
In addition, a coarse retrieval (ANN) followed by fine ranking (exact dot product or cosine similarity) can be performed to balance retrieval efficiency and accuracy.
For the retrieved candidate media units, the system may rank the similarity from high to low and perform a second filtering based on metadata information (e.g., time stamp, frame index, media type). For example, in a video subtitle scene, only text or audio units with the same or similar time stamps as the current video period may be retained.
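As an illustrative sketch, coarse ANN recall followed by metadata filtering could look roughly as follows; fetch_vector and the metadata dictionary are hypothetical helpers, and the similarity conversion assumes unit vectors with squared L2 distances (as in the Faiss sketch above):

```python
from typing import Dict, List

def retrieve_candidates(target_id: int, metadata: Dict[int, dict],
                        k: int = 50, threshold: float = 0.8,
                        wanted_types=("text", "audio")) -> List[int]:
    """Coarse ANN recall around the target's fused vector, then metadata filtering."""
    target_vec = fetch_vector(target_id)            # hypothetical lookup of F_fused, shape (1, dim)
    distances, ids = index.search(target_vec, k)    # ANN search (see the Faiss sketch above)
    target_meta = metadata[target_id]
    candidates = []
    for dist, mid in zip(distances[0], ids[0]):
        similarity = 1.0 - dist / 2.0               # cosine similarity from squared L2 on unit vectors
        meta = metadata.get(int(mid))
        if meta is None or similarity < threshold:
            continue
        if meta["media_type"] not in wanted_types:
            continue
        # Keep only candidates whose time span overlaps the target media unit.
        if meta["timestamp"]["end"] >= target_meta["timestamp"]["start"] and \
           meta["timestamp"]["start"] <= target_meta["timestamp"]["end"]:
            candidates.append(int(mid))
    return candidates
```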
In a specific implementation, the present application maintains a configurable list of fusion policies in this embodiment, including:
Splicing mode: temporal splicing (timestamp merging), image/video overlay, split-screen display and the like;
Priority or weight, for example, caption is preferentially displayed in video+text scenes, and audio/video length alignment is preferentially ensured in audio+video scenes;
Device characteristics: if the client is a mobile device, segmented preloading can be selected; a PC or high-performance AR/VR client can allow more complex three-dimensional overlays or multi-window rendering.
For the splicing processing:
if the candidate media unit carries timestamp information, the system may splice the target media unit with candidate media units from the same or an adjacent time period; for example, in a live-replay scene, the video and subtitles of the same moment are combined;
for multiple image or video segments retrieved at the same time, overlay or picture-in-picture (PiP) processing is performed according to frame indexes to generate a new composite media stream;
if the candidate media unit is text, the text content can be displayed in real time below the video picture or in a side rail, as rolling captions, bubble prompts, or bullet comments;
for audio+video fusion, an audio mixing engine (e.g., FFmpeg or an in-house mixing module) may be used to adjust the volume and sampling rate to match the target video clip;
if externally configured special effects (such as AR filters or specific watermarks) exist, they can be added according to preset rules during the synthesis stage; a command-line sketch of such a mixing step follows this list.
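The sketch below shows one way a subtitle burn-in plus audio mix could be driven through FFmpeg from Python; the filters used (subtitles, aresample, volume, amix) are standard FFmpeg filters, but the file names and parameter values are illustrative, the choice of FFmpeg is only one of the tools mentioned above, and the subtitles filter requires an FFmpeg build with libass.

```python
# Sketch of one splice-and-mix step driven through FFmpeg from Python.
import subprocess

def splice_with_ffmpeg(video_in: str, aux_audio: str, srt_file: str, out_file: str):
    filter_graph = (
        f"[0:v]subtitles={srt_file}[v];"          # burn captions into the video
        "[1:a]aresample=48000,volume=0.8[aux];"   # align sample rate, lower volume
        "[0:a][aux]amix=inputs=2[a]"              # mix original and auxiliary audio
    )
    cmd = [
        "ffmpeg", "-y", "-i", video_in, "-i", aux_audio,
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "[a]",
        "-c:v", "libx264", "-c:a", "aac",
        out_file,
    ]
    subprocess.run(cmd, check=True)

splice_with_ffmpeg("target_clip.mp4", "aux_audio.wav", "captions.srt", "composite.mp4")
```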
This embodiment may use various media processing tools (e.g., FFmpeg, GStreamer) or a self-developed multimedia mixing engine to perform the splicing operation in a streaming or file-based manner and generate the final composite output stream or composite file.
Further, the spliced multimedia data is re-encoded or re-packaged. For example, the composite video stream is encoded with H.264 and packaged into MP4 or MPEG-TS format, and the text of the superimposed subtitles is rendered to generate resource links that a web page or APP client can conveniently call.
If the video stream and the text content are rendered in split screens, they can be treated as independent windows, and the front end can lay them out and play them back through HTML5/JS or a mobile SDK.
In an implementation, the synthesized multimedia file or stream address is returned to the requesting end (e.g., a client browser or APP). The client plays it adaptively according to network conditions (such as bandwidth and latency) and device performance, or plays it automatically according to system presets.
In a live broadcast scene, the fusion request can be executed in a real-time processing pipeline; the composite live stream is pushed to a CDN or media server over protocols such as RTMP/HLS/WebRTC, and the client watches in sync with the streamer and other viewers through the playback address, for example as sketched below.
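A minimal push step might look like the following, assuming FFmpeg is available and that the RTMP ingest URL is a placeholder for the actual CDN or media-server address; preset and codec choices are illustrative.

```python
# Sketch of pushing the composite stream to an RTMP ingest point with FFmpeg.
# "-re" paces reading at native frame rate; FLV is the container RTMP expects.
import subprocess

def push_rtmp(composite_file: str,
              ingest_url: str = "rtmp://example.com/live/stream_key"):
    subprocess.run([
        "ffmpeg", "-re", "-i", composite_file,
        "-c:v", "libx264", "-preset", "veryfast",
        "-c:a", "aac",
        "-f", "flv", ingest_url,
    ], check=True)

push_rtmp("composite.mp4")
```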
In an on-demand scene, the spliced file can be stored on a media server or in object storage, and the access URL is returned for the user to click and play in a front-end browser or APP interface.
In addition, this embodiment can record the user's click-through rate, dwell time, or satisfaction ratings for the fused content at the front end and send this interaction information back to the server, so that the fusion strategy can be adjusted dynamically later on. If a user is subsequently found to strongly prefer a certain fusion mode, the system can raise the priority of that fusion strategy in the next splicing.
Therefore, for the cross-media display of a target media unit, the invention not only retrieves candidate media units efficiently but also realizes multiple splicing modes through the configured fusion strategy, and finally presents the result on the user terminal in an intuitive and diversified way, meeting cross-media fusion requirements in scenes such as online education, live-streaming e-commerce, video conferencing, and entertainment content aggregation. Across the whole flow, from request to output, the complementarity and visual effect of the content are significantly improved without any loss of multimedia quality, and the user's need to browse multi-modal information synchronously is met with faster response.
As an alternative embodiment, the cross-media presentation further comprises:
slicing live stream data according to time periods, and generating the basic feature vector and the pseudo-query vector for each time period;
and incrementally writing the fused media representation vectors of the live stream data into the vector database, using the latest media representation vectors for matching during online retrieval.
This embodiment is suited to the large-scale, real-time data streams of live broadcast scenes, and further performs time slicing, incremental vector writing, and online retrieval matching on the live stream data.
In particular implementations, after live stream data is received, the live content is continuously captured by a pre-deployed real-time acquisition module (e.g., FFmpeg ingest of a pushed stream or WebRTC reception), and the live stream is sliced into multiple discrete "live slices" at a set time interval (e.g., every 5 or 10 seconds). Once slicing is complete, the system immediately extracts a basic feature vector (F_base) from each live slice on the server side or on a GPU cluster.
For example, if the live stream is video content, the system performs keyframe extraction or sparse sampling on the 5- or 10-second video segment and analyzes the keyframes or sampled frames with a video feature extraction model (e.g., a 3D CNN or Video Transformer) via convolution or self-attention, to generate the basic feature vector of the video slice.
If the live stream contains audio, the audio signal in that time period can further undergo sampling-rate unification and mel-spectrogram conversion; audio features are extracted with an audio encoding network (such as CNN+RNN or Audio Transformer) and merged into the feature representation of the same live slice, for example as sketched below.
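The sketch below illustrates the per-slice feature extraction described above; the video and audio encoders are placeholder functions standing in for a 3D CNN / Video Transformer and a CNN+RNN / Audio Transformer, and only the sparse sampling, resampling, and mel-spectrogram steps use concrete library calls (numpy, librosa). Dimensions and the slice length are assumptions.

```python
# Per-slice feature extraction for a live slice; encoders are placeholders.
import numpy as np
import librosa

SLICE_SECONDS = 5

def video_encoder(frames: np.ndarray) -> np.ndarray:   # stand-in for 3D CNN / ViT
    return np.random.rand(256).astype("float32")

def audio_encoder(mel: np.ndarray) -> np.ndarray:       # stand-in for CNN+RNN / AT
    return np.random.rand(256).astype("float32")

def extract_slice_features(frames: np.ndarray, audio: np.ndarray, sr: int) -> np.ndarray:
    keyframes = frames[:: max(1, len(frames) // 8)]      # sparse frame sampling
    f_video = video_encoder(keyframes)
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # unify rate
    mel = librosa.feature.melspectrogram(y=audio_16k, sr=16000, n_mels=64)
    f_audio = audio_encoder(mel)
    return np.concatenate([f_video, f_audio])            # F_base for this slice
```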
After the basic feature vector of a live slice is obtained, the pseudo-query module is invoked to generate a pseudo-query vector (Q_pseudo) for the slice through multi-head self-attention or a decoding operation on the feature vector. Because live data is highly time-sensitive and has no fixed file boundaries, this embodiment treats each time-sliced video or audio segment as a "media unit", so that the subsequent pseudo-query vector generation and the cross-media implicit interaction module can follow the same logic and model structure used previously for offline media.
The basic feature vector and the pseudo-query vector are then input to the implicit interaction module for fusion, and the output fused media representation vector (F_fused) represents the position of the live slice in the cross-media semantic space; a minimal sketch of this fusion step follows. Because the live stream continuously produces new segments in real time, this embodiment adopts an incremental writing mechanism to keep inserting newly generated fused media representation vectors into the vector database.
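A minimal sketch of the implicit-interaction fusion, assuming PyTorch: F_base and Q_pseudo are stacked into a two-token sequence, multi-head self-attention is run over it, and the result is pooled into F_fused. The embedding dimension, head count, residual connection, and mean pooling are illustrative choices, not details fixed by the description.

```python
# Two-token self-attention fusion producing F_fused from (F_base, Q_pseudo).
import torch
import torch.nn as nn

class ImplicitInteraction(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_base: torch.Tensor, q_pseudo: torch.Tensor) -> torch.Tensor:
        seq = torch.stack([f_base, q_pseudo], dim=1)   # (B, 2, dim) input sequence
        out, _ = self.attn(seq, seq, seq)              # self-attention over the pair
        fused = self.norm(out + seq).mean(dim=1)       # residual + mean pooling
        return fused                                   # F_fused: (B, dim)

module = ImplicitInteraction()
f_fused = module(torch.randn(1, 512), torch.randn(1, 512))
```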
In an implementation, a persistent connection to the vector database (e.g., Milvus, Faiss, or an HNSW index) may be maintained in advance; after each slice is processed, the fused media representation vector (F_fused) together with its metadata (live-room ID, start and end seconds of the time period, frame information) is written to the database as a new data entry, for example as sketched below.
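One possible incremental write path is sketched below, assuming a Faiss index wrapped in IndexIDMap so that external slice IDs can be used, with a plain dictionary standing in for whatever metadata store the deployment actually uses; the field names are illustrative.

```python
# Incremental per-slice write of F_fused plus metadata into a Faiss index.
import time
import faiss
import numpy as np

DIM = 512
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))   # inner product on normalized vectors
meta_store: dict[int, dict] = {}

def write_slice(slice_id: int, f_fused: np.ndarray,
                room_id: str, t_start: float, t_end: float):
    vec = (f_fused / np.linalg.norm(f_fused)).astype("float32").reshape(1, -1)
    index.add_with_ids(vec, np.array([slice_id], dtype="int64"))
    meta_store[slice_id] = {"room": room_id, "t_start": t_start, "t_end": t_end,
                            "written_at": time.time()}

write_slice(10001, np.random.rand(DIM), room_id="live_42", t_start=0.0, t_end=5.0)
```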
If the database supports online index updates, the index construction module can be triggered after writing to update the ANN index with the incremental vector samples; if the database uses micro-batch indexing, the index can instead be updated in batches within a short time window to balance real-time performance and throughput.
When an external request needs to perform cross-media display or associated retrieval against the live stream, the newly inserted live slice representation vectors can be queried directly in the vector database. For example, in an educational live scene, the fused media representation vector of a given period of the teacher's current live content can be used to retrieve the courseware image or presentation page with the highest similarity.
For example, the most recently inserted live slice vector (falling within the time period [T0 - 5s, T0]) may be looked up based on the current time (e.g., T0), or conditional filtering may be applied on the timestamp field so that only the most recently generated entries are retrieved, ensuring that the results stay synchronized with the live progress; a small filtering sketch follows.
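Continuing the assumptions of the write sketch above (a meta_store dictionary with a written_at field), picking the slice that falls inside the window [T0 - 5s, T0] could look like the following; the function and field names are illustrative.

```python
# Select the newest live slice whose write time falls inside [T0 - window, T0].
import time

def latest_slice_id(meta_store: dict, window: float = 5.0, now=None):
    t0 = now if now is not None else time.time()
    in_window = {sid: m for sid, m in meta_store.items()
                 if t0 - window <= m["written_at"] <= t0}
    return max(in_window, key=lambda sid: in_window[sid]["written_at"], default=None)
```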
If other media resources with a similarity above the threshold are retrieved, the live slice and the matched video, audio, or text resources can be spliced or overlaid according to the fusion strategy, finally giving the front end a multi-modal interactive experience of live content plus supplementary content.
Thus, through the above process of slicing, incremental writing, and online retrieval of live stream data, this embodiment achieves real-time synchronization of cross-media fusion in live scenes. After each new slice is captured and segmented, its features extracted, and its fused media representation vector generated, the vector is immediately written into the vector database and indexed, so that subsequent retrieval always obtains the latest live state and semantic information. This greatly reduces user-perceived latency and ensures that the all-media fusion scheme adapts to the continuously changing content of a live scene.
Based on the same inventive concept, the embodiments of the disclosure further provide a complementary-fusion-based all-media fusion system corresponding to the complementary-fusion-based all-media fusion method. Since the principle by which the system solves the problem is similar to that of the method described above, the implementation of the system can refer to the implementation of the method, and repeated description is omitted.
Referring to fig. 3, a schematic diagram of a full media fusion system based on complementary fusion according to an embodiment of the disclosure is provided, where the system includes an acquisition module 10, a first processing module 20, a feature extraction module 30, a second processing module 40, a third processing module 50, and a display module 60;
The acquisition module 10 is configured to acquire different types of media data from a plurality of media sources, and perform preliminary formatting processing on the media data, where the media data includes text, image, audio, video and live stream data;
the first processing module 20 is configured to divide the formatted media data into a plurality of processable media units according to a preset segmentation strategy, where a media unit refers to a relatively independent media segment within a time or space range, and each media unit includes at least one of a piece of text, a frame or a segment of video, a picture or a group of pictures, and a piece of audio;
The feature extraction module 30 is configured to perform feature extraction on different types of media data, and generate a basic feature vector corresponding to each media unit;
The second processing module 40 is configured to generate, using a pseudo-query module, a pseudo-query vector associated with each media unit based on the basic feature vector;
The third processing module 50 is configured to input the basic feature vector and the pseudo-query vector to an implicit interaction module, and output a fused media representation vector; storing the media representation vector and the corresponding media metadata thereof into a vector database, and establishing an index for the media representation vector in the vector database;
The display module 60 is configured to, in response to receiving a fusion request for a target media unit, obtain media resources matched with the target media unit by retrieving the vector database, and synchronously synthesize the target media unit with the matched media resources according to timestamp information of the matched media resources, so as to perform cross-media display of the target media unit.
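For orientation only, the following skeleton shows one way the six modules above could be wired together in code; every class and method name is an illustrative assumption, and each stage is assumed to delegate to components like those sketched earlier in this description.

```python
# Illustrative wiring of the acquisition, segmentation, feature-extraction,
# pseudo-query, implicit-interaction, and display modules.
class OmnimediaFusionSystem:
    def __init__(self, acquisition, segmenter, feature_extractor,
                 pseudo_query, interaction, display):
        self.acquisition = acquisition              # module 10
        self.segmenter = segmenter                  # module 20
        self.feature_extractor = feature_extractor  # module 30
        self.pseudo_query = pseudo_query            # module 40
        self.interaction = interaction              # module 50 (also writes the vector DB)
        self.display = display                      # module 60

    def ingest(self, source):
        media = self.acquisition.collect(source)
        for unit in self.segmenter.split(media):
            f_base = self.feature_extractor.encode(unit)
            q_pseudo = self.pseudo_query.generate(f_base)
            self.interaction.fuse_and_store(unit, f_base, q_pseudo)

    def handle_fusion_request(self, target_id, preferences):
        matches = self.interaction.retrieve(target_id)
        return self.display.compose(target_id, matches, preferences)
```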
It will be appreciated by those skilled in the art that, in the methods of the above specific embodiments, the written order of the steps does not imply a strict order of execution; the actual execution order should be determined by the functions and possible internal logic of the steps. It should also be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims (9)

1. A full-media fusion method based on complementary fusion, characterized by comprising:
collecting different types of media data from multiple media sources and performing preliminary formatting on the media data, wherein the media data comprises text, images, audio, video and live stream data;
dividing the formatted media data into media units according to a preset segmentation strategy, wherein a media unit refers to an independent media segment within a time or space range, and each media unit comprises at least one of the following: a piece of text, a frame or a segment of video, an image or a group of images, and a segment of audio;
performing feature extraction on the different types of media data respectively to generate a basic feature vector corresponding to each media unit;
generating, by a pseudo-query module, a pseudo-query vector associated with each media unit based on the basic feature vector;
inputting the basic feature vector and the pseudo-query vector into an implicit interaction module and outputting a fused media representation vector; storing the media representation vector and its corresponding media metadata in a vector database, and creating an index for the media representation vector in the vector database;
in response to receiving a fusion request for a target media unit, obtaining media resources matching the target media unit by retrieving the vector database, and synchronously synthesizing the target media unit and the matching media resources according to timestamp information of the matching media resources, so as to perform cross-media display of the target media unit.

2. The full-media fusion method based on complementary fusion according to claim 1, wherein generating the pseudo-query vector associated with each media unit comprises:
performing an attention-mechanism operation on the basic feature vector to generate a latent demand representation;
generating the pseudo-query vector based on the latent demand representation, and optimizing the pseudo-query vector using a reconstruction loss or a contrastive loss.

3. The full-media fusion method based on complementary fusion according to claim 2, wherein outputting the fused media representation vector comprises:
concatenating the basic feature vector and the pseudo-query vector to form an input sequence;
processing the input sequence through a multi-head self-attention network to generate the fused media representation vector;
outputting the fused media representation vector for indexing and retrieval.

4. The full-media fusion method based on complementary fusion according to claim 3, wherein storing the media representation vector and its corresponding media metadata in the vector database and creating the index for the media representation vector in the vector database comprises:
establishing a mapping relationship between the media representation vector and the media metadata, wherein the media metadata comprises a media type, a timestamp and a frame index;
storing the media representation vector and its media metadata in the vector database;
indexing the media representation vector based on an approximate nearest neighbor search algorithm or a hash index algorithm.

5. The full-media fusion method based on complementary fusion according to claim 4, wherein the cross-media display comprises:
receiving a fusion request for the target media unit;
retrieving, from the vector database, candidate media units whose media representation vectors have a similarity to that of the target media unit above a preset threshold;
splicing and fusing the target media unit with the candidate media units according to a set fusion strategy;
outputting the fusion result to the front end for display.

6. The full-media fusion method based on complementary fusion according to claim 5, wherein the cross-media display further comprises:
slicing live stream data according to time periods, and generating the basic feature vector and the pseudo-query vector for each time period;
incrementally writing the fused media representation vectors of the live stream data into the vector database, and using the latest media representation vectors for matching during online retrieval.

7. The full-media fusion method based on complementary fusion according to claim 6, wherein the feature extraction comprises:
extracting the basic feature vectors from the different types of media data using a convolutional neural network, a vision Transformer, a pre-trained language model or a speech recognition model for text, images, audio and video respectively.

8. The full-media fusion method based on complementary fusion according to claim 7, wherein the cross-media display further comprises: embedding the matching media resources into a rendering window corresponding to the target media unit.

9. A full-media fusion system based on complementary fusion, characterized by comprising: an acquisition module, a first processing module, a feature extraction module, a second processing module, a third processing module, and a display module;
the acquisition module is configured to collect different types of media data from multiple media sources and perform preliminary formatting on the media data, wherein the media data comprises text, images, audio, video and live stream data;
the first processing module is configured to divide the formatted media data into media units according to a preset segmentation strategy, wherein a media unit refers to an independent media segment within a time or space range, and each media unit comprises at least one of the following: a piece of text, a frame or a segment of video, an image or a group of images, and a segment of audio;
the feature extraction module is configured to perform feature extraction on the different types of media data respectively to generate a basic feature vector corresponding to each media unit;
the second processing module is configured to generate, by a pseudo-query module, a pseudo-query vector associated with each media unit based on the basic feature vector;
the third processing module is configured to input the basic feature vector and the pseudo-query vector into an implicit interaction module and output a fused media representation vector, store the media representation vector and its corresponding media metadata in a vector database, and create an index for the media representation vector in the vector database;
the display module is configured to, in response to receiving a fusion request for a target media unit, obtain media resources matching the target media unit by retrieving the vector database, and synchronously synthesize the target media unit and the matching media resources according to timestamp information of the matching media resources, so as to perform cross-media display of the target media unit.
CN202510170015.0A 2025-02-17 2025-02-17 Omnimedia fusion method and system based on complementary fusion Active CN120123970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510170015.0A CN120123970B (en) 2025-02-17 2025-02-17 Omnimedia fusion method and system based on complementary fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510170015.0A CN120123970B (en) 2025-02-17 2025-02-17 Omnimedia fusion method and system based on complementary fusion

Publications (2)

Publication Number Publication Date
CN120123970A CN120123970A (en) 2025-06-10
CN120123970B true CN120123970B (en) 2025-09-02

Family

ID=95919203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510170015.0A Active CN120123970B (en) 2025-02-17 2025-02-17 Omnimedia fusion method and system based on complementary fusion

Country Status (1)

Country Link
CN (1) CN120123970B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120950707A (en) * 2025-10-15 2025-11-14 人工智能与数字经济广东省实验室(深圳) A method, system, terminal, and storage medium for intelligent media asset management and content production.

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118916529A (en) * 2024-10-10 2024-11-08 传播大脑科技(浙江)股份有限公司 Media information cross-modal retrieval method, system and medium based on semantic alignment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102444712B1 (en) * 2016-01-12 2022-09-20 한국전자통신연구원 System for automatically re-creating a personal media with Multi-modality feature and method thereof
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Adversarial cross-media retrieval method based on restricted text space
CN109783657B (en) * 2019-01-07 2022-12-30 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method and system based on limited text space
CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 Intelligent cataloging method of all-media news based on multi-mode information fusion understanding


Also Published As

Publication number Publication date
CN120123970A (en) 2025-06-10

Similar Documents

Publication Publication Date Title
CN111191078B (en) Video information processing method and device based on video information processing model
US8972840B2 (en) Time ordered indexing of an information stream
Jayanthiladevi et al. AI in video analysis, production and streaming delivery
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN113806588B (en) Method and device for searching videos
WO2020103674A1 (en) Method and device for generating natural language description information
WO2007046708A1 (en) Intelligent video summaries in information access
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
US20240371164A1 (en) Video localization using artificial intelligence
CN109474843A (en) The method of speech control terminal, client, server
CN116977701A (en) Video classification model training method, video classification method and device
CN114662002A (en) Object recommendation method, medium, device and computing equipment
CN118916519B (en) Data processing method, device, equipment and readable storage medium
CN120123970B (en) Omnimedia fusion method and system based on complementary fusion
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN116977992A (en) Text information recognition method, device, computer equipment and storage medium
CN117710845A (en) Video labeling method, video labeling device, computer equipment and computer readable storage medium
Orlandi et al. Leveraging knowledge graphs of movies and their content for web-scale analysis
Nixon et al. Data-driven personalisation of television content: a survey
CN113919446B (en) Model training and similarity determining method and device for multimedia resources
CN115115984A (en) Video data processing method, apparatus, program product, computer device, and medium
Qi et al. An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Sudhan et al. Learning to Summarize YouTube Videos with Transformers: A Multi-Task Approach
CN117648504A (en) Method, device, computer equipment and storage medium for generating media resource sequence
Chandran et al. An AI-Powered Framework for Real-Time YouTube Video Transcript Extraction and Summarization using Google Gemini

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant