CN120218091A - Real-time video translation and audio-visual synchronization method and system based on multimodal large model - Google Patents
Real-time video translation and audio-visual synchronization method and system based on multimodal large model
Info
- Publication number
- CN120218091A (application CN202510380695.9A)
- Authority
- CN
- China
- Prior art keywords
- translation
- video
- modal
- source
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention provides a real-time video translation and audio-visual synchronization method and system based on a multi-modal large model, relating to the technical field of video translation. The method comprises: obtaining a source video; extracting multi-modal features from the source video based on the multi-modal large model; fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector; translating into target-language text in real time based on the context semantic vector; processing the translated-language text based on the multi-modal features to obtain a translated-language audio source; performing mouth-shape adjustment on the source video based on the translated-language audio source; and combining the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture. The invention breaks through the limitation of traditional single-modality translation, dynamically aligns context information by combining multi-modal features with a cross-modal attention mechanism, and significantly improves the semantic accuracy of translation.
Description
Technical Field
The invention relates to the technical field of video translation, and in particular to a real-time video translation and audio-visual synchronization method and system based on a multi-modal large model.
Background
In recent years, with the rapid development of artificial intelligence technology, video translation technology has been widely applied to the fields of film and television localization, live broadcast across the country, online education and the like.
Traditional video translation methods rely mainly on speech recognition (ASR) or subtitle text for single-modality translation, such as RNN-based speech translation systems (e.g. Google Translate) or plain-text translation engines (e.g. Transformer). Such methods entirely ignore the visual context information in the video (e.g., the speaker's expression, gestures, and scene objects), so the translation result becomes disjointed from the scene context. In the prior art, when handling the translated speech, the original audio is simply replaced or subtitles are superimposed, and the problems of timbre consistency and mouth-shape synchronization are not solved.
Therefore, a real-time video translation and audio-visual synchronization method and system based on a multi-modal large model are provided.
Disclosure of Invention
This specification provides a real-time video translation and audio-visual synchronization method and system based on a multi-modal large model, which breaks through the limitation of traditional single-modality translation, dynamically aligns context information by combining multi-modal features with a cross-modal attention mechanism, and significantly improves the semantic accuracy of translation.
This specification provides a real-time video translation and audio-visual synchronization method based on a multi-modal large model, which comprises the following steps:
Acquiring a source video;
Extracting features from the source video based on the multi-modal large model to obtain multi-modal features;
Fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
Translating into target-language text in real time based on the context semantic vector, and processing the translated-language text based on the multi-modal features to obtain a translated-language audio source;
and performing mouth-shape adjustment on the source video based on the translated-language audio source, and combining the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture.
Optionally, the fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector includes:
the multi-modal features include speech duration features and mouth-shape key frame features;
determining a cross-modal time similarity matrix based on the speech duration features and the mouth-shape key frame features;
superimposing the cross-modal time similarity matrix with a mask matrix, and generating attention weights through Softmax;
and performing weighted fusion on the speech features to generate the context semantic vector.
Optionally, the translating into target-language text in real time based on the context semantic vector includes:
based on the context semantic vector, dynamically loading a domain knowledge base through a sliding-window mechanism to generate translated text in the target language.
Optionally, the processing the translated-language text based on the multi-modal features to obtain a translated-language audio source includes:
the multi-modal features include source-speaker voiceprint features;
and driving a text-to-speech model with the source-speaker voiceprint features and the translated-language text to clone the source speaker's timbre and obtain the translated-language audio source.
Optionally, the processing the translated-language text based on the multi-modal features to obtain a translated-language audio source further includes:
the multi-modal features include speech duration features;
and driving the text-to-speech model with the speech duration features and the translated-language text to adjust the speaking rate of the translated speech and obtain the translated-language audio source.
Optionally, the performing mouth-shape adjustment on the source video based on the translated-language audio source, and combining the translated-language audio source with the mouth-shape animation video to obtain the real-time translated video with synchronized audio and picture, includes:
inputting the translated-language audio source into a diffusion model, and outputting frame-by-frame lip key-point offsets;
and superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation.
Optionally, the superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation includes:
performing timestamp alignment on the translated-language audio source, and compensating the timing deviation between the translated-language audio source and the mouth-shape animation video through a dynamic time warping algorithm according to the frame-level alignment relation output by the cross-modal attention mechanism;
performing motion compensation on non-mouth regions of the source video through optical-flow estimation;
mapping the mouth-shape animation onto the source video through bilinear interpolation to generate a mouth-shape mask;
and synthesizing the audio-visually synchronized real-time translated video through a residual fusion model.
The present specification provides a video translation and audio-visual synchronization system based on a multi-modal large model, comprising:
the acquisition module is used for acquiring a source video;
the extraction module is used for extracting features from the source video based on the multi-modal large model to obtain multi-modal features;
the fusion module is used for fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
the translation module is used for translating into target-language text in real time based on the context semantic vector, and for processing the translated-language text based on the multi-modal features to obtain a translated-language audio source;
and the adaptation module is used for performing mouth-shape adjustment on the source video based on the translated-language audio source, and for combining the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture.
Optionally, the fusion module includes:
The multi-modal features include speech duration features and mouth-shape key frame features;
determining a cross-modal time similarity matrix based on the speech duration features and the mouth-shape key frame features;
superimposing the cross-modal time similarity matrix with a mask matrix, and generating attention weights through Softmax;
and performing weighted fusion on the speech features to generate the context semantic vector.
Optionally, the translation module includes:
based on the context semantic vector, dynamically loading a domain knowledge base through a sliding-window mechanism to generate translated text in the target language.
Optionally, the translation module includes:
the multi-modal features include source-speaker voiceprint features;
and driving a text-to-speech model with the source-speaker voiceprint features and the translated-language text to clone the source speaker's timbre and obtain the translated-language audio source.
Optionally, the translation module further includes:
the multi-modal features include speech duration features;
and driving the text-to-speech model with the speech duration features and the translated-language text to adjust the speaking rate of the translated speech and obtain the translated-language audio source.
Optionally, the adapting module includes:
inputting the translated-language audio source into a diffusion model, and outputting frame-by-frame lip key-point offsets;
and superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation.
Optionally, the superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation includes:
performing timestamp alignment on the translated-language audio source, and compensating the timing deviation between the translated-language audio source and the mouth-shape animation video through a dynamic time warping algorithm according to the frame-level alignment relation output by the cross-modal attention mechanism;
performing motion compensation on non-mouth regions of the source video through optical-flow estimation;
mapping the mouth-shape animation onto the source video through bilinear interpolation to generate a mouth-shape mask;
and synthesizing the audio-visually synchronized real-time translated video through a residual fusion model.
In the invention, the limitation of traditional single-modality translation is broken through: the context information is dynamically aligned by combining multi-modal features with a cross-modal attention mechanism, significantly improving the semantic accuracy of translation. Through the deep fusion of the multi-modal large model and cross-modal timing alignment technology, four-in-one "semantics-timbre-mouth shape-scene" video translation is achieved, overcoming the core pain points of the traditional technology such as audio-video mismatch, high latency, and disconnection from the scene.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and a person skilled in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a real-time video translation and audio-visual synchronization method based on a multi-modal large model according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a video translation and audio-visual synchronization system based on a multi-modal large model according to an embodiment of the present disclosure.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art. The basic principles of the invention defined in the following description may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
Exemplary embodiments of the present invention are described more fully below in connection with fig. 1-2. However, the exemplary embodiments can be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals in the drawings denote the same or similar elements, components or portions, and thus a repetitive description thereof will be omitted.
The features, structures, characteristics, or other details described in a particular embodiment are not excluded from being combined in one or more other embodiments in a suitable manner, without departing from the technical idea of the invention.
In the description of specific embodiments, features, structures, characteristics, or other details described in the present invention are provided to enable one skilled in the art to fully understand the embodiments. It is not excluded that one skilled in the art may practice the present invention without one or more of the specific features, structures, characteristics, or other details.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The term "and/or" and/or "includes all combinations of any one or more of the associated listed items.
Fig. 1 is a schematic diagram of a real-time video translation and audio-visual synchronization method based on a multi-modal large model according to an embodiment of the present disclosure, where the method may include:
s110, acquiring a source video;
In a specific embodiment of this specification, the source video is obtained from a local file, a live stream, or cloud storage, where the source video includes an audio track, a video track, and an optional subtitle track.
Format unification processing is performed on the input video as follows:
if the video is in a compressed format, a hardware decoder is used for real-time decoding; if the video contains multiple audio tracks, the main speaker's audio is extracted through sound-source separation (e.g., the Spleeter model).
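A minimal preprocessing sketch is given below, assuming the ffmpeg command-line tool and the open-source Spleeter package are available; the file paths, sample rate, and two-stem configuration are illustrative assumptions rather than the patent's prescribed implementation.

```python
# Hedged sketch: decode/normalize the source video with ffmpeg and isolate the
# main speaker's audio with Spleeter (paths and parameters are placeholders).
import subprocess
from spleeter.separator import Separator

def unify_source(video_path: str, wav_path: str = "source_audio.wav") -> str:
    # Decode the (possibly compressed) video and extract a mono 16 kHz audio track.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def isolate_main_speaker(wav_path: str, out_dir: str = "separated") -> None:
    # Separate vocals from background with Spleeter's 2-stem model, approximating
    # the "main speaker audio" extraction described above.
    separator = Separator("spleeter:2stems")
    separator.separate_to_file(wav_path, out_dir)
```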
S120, extracting features from the source video based on the multi-modal large model to obtain multi-modal features;
S130, fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
In a specific embodiment of this specification, a pre-trained HRNet model is used to extract mouth-shape key frame features from the video frames; micro-expression features of consecutive frames (such as eyebrow movements and blink frequency) are captured through a 3D-CNN model; scene context features are extracted through a CLIP-ViT model to identify objects, background, and body-movement semantics in the video; acoustic features of the speech are extracted through a Wav2Vec 2.0 model, and a VQ-VAE encoder compresses them into duration codes. If embedded subtitles exist, text semantic vectors are generated directly through a BERT model; if no subtitles exist, an OCR-TransNet model scans the text embedded in the video (such as signs and bullet-screen comments in the scene) to generate visual text features.
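As a rough illustration of the per-modality feature extraction, the sketch below uses two publicly available encoders from the Hugging Face transformers library; the checkpoint names are stand-ins, and the patent's HRNet, 3D-CNN, VQ-VAE, and OCR components are not reproduced here.

```python
# Hedged sketch: extract acoustic features (Wav2Vec 2.0) and scene-context
# features (CLIP-ViT) per sampled frame; checkpoints are illustrative choices.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor, CLIPModel, CLIPProcessor

wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
wav_proc = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def extract_features(waveform_16k, video_frames):
    # Acoustic feature sequence A_t with shape (T, d_a).
    audio_in = wav_proc(waveform_16k, sampling_rate=16000, return_tensors="pt")
    acoustic = wav2vec(**audio_in).last_hidden_state.squeeze(0)
    # One CLIP-ViT scene embedding per sampled video frame.
    image_in = clip_proc(images=video_frames, return_tensors="pt")
    scene = clip.get_image_features(**image_in)
    return acoustic, scene
```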
Optionally, S130 includes:
the multi-modal features include speech duration features and mouth-shape key frame features;
determining a cross-modal time similarity matrix based on the speech duration features and the mouth-shape key frame features;
superimposing the cross-modal time similarity matrix with a mask matrix, and generating attention weights through Softmax;
and performing weighted fusion on the speech features to generate the context semantic vector.
In a specific embodiment of this specification, the speech duration feature A_t ∈ R^(T×d_a) of the speech signal in the source video stream is extracted (T is the number of time steps and d_a is the feature dimension), and the mouth-shape key frame feature V_k of the visual mouth movements in the source video stream is extracted, where the key frames are generated through lip key-point detection (a 20-point model) and optical-flow tracking. A cross-modal time similarity matrix S ∈ R^(T×T) between the speech feature A_t and the mouth-shape feature V_k is then computed, where entry S_ij measures the similarity between the speech feature at time step i and the mouth-shape feature at frame j.
A timing mask matrix M ∈ R^(T×T) is introduced to constrain the maximum timing offset between speech and mouth shape to ±τ frames (τ = 10), with the mask rule:
M_ij = 0 if |i - j| ≤ τ, otherwise M_ij = -∞
The similarity matrix S is superimposed with the mask matrix M, and the attention weights W ∈ R^(T×T) are generated through Softmax:
W = Softmax(S + M)
The speech features A_t are weighted and fused to generate the context semantic vector:
C = W·V_k
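A compact NumPy sketch of this masked cross-modal fusion follows. The choice of cosine similarity for S and the assumption that both feature streams are projected to a common dimension are illustrative; the patent only specifies that S measures speech/mouth-shape similarity.

```python
# Hedged sketch of the masked cross-modal attention fusion described above.
import numpy as np

def cross_modal_fusion(A, V, tau=10):
    # A: (T, d) speech-duration features; V: (T, d) mouth-shape key-frame features,
    # both assumed projected to a common dimension d.
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-8)
    V_n = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-8)
    S = A_n @ V_n.T                           # cross-modal time similarity, (T, T)
    T = S.shape[0]
    idx = np.arange(T)
    M = np.where(np.abs(idx[:, None] - idx[None, :]) <= tau, 0.0, -np.inf)
    logits = S + M                            # constrain offset to +/- tau frames
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)         # row-wise Softmax -> attention weights
    return W @ V                              # context semantic vectors C = W . V_k
```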
S140, translating into target-language text in real time based on the context semantic vector, and processing the translated-language text based on the multi-modal features to obtain a translated-language audio source;
Optionally, S140 includes:
based on the context semantic vector, dynamically loading a domain knowledge base through a sliding-window mechanism to generate translated text in the target language.
In a specific embodiment of this specification, the context semantic vector is input into a pre-trained domain classification model (BERT-based), which outputs scene class probabilities P ∈ R^K for the current video clip (K is the number of domains, such as medical, legal, and film/television). Domains whose probability exceeds a preset value are selected, triggering the loading of the corresponding domain knowledge base, which includes a glossary, a translation style module, and a domain entity base. The sliding window size is set to W = 10 seconds with a window step size of S = 5 seconds; the historical semantic vectors {C_{t-W}, ..., C_t} within the window are cached, the association weights between the current semantics C_t and the historical context are calculated through a self-attention mechanism, and an enhanced semantic vector is generated.
According to the loaded domain knowledge base, a dynamic prompt template is constructed, for example: "[domain mode] Translate the following {source language} text into {target language}, using {glossary}. Style requirements: {style description}. Text:
{current window text}"
The enhanced semantic vector and the prompt template are input into a multi-modal large model (such as GPT-4) to generate the target-language text T_out, while proper nouns from the domain entity base are injected (e.g., "CT" → "computed tomography"). The translated text T_out is compared with the visual context (e.g., objects and person actions in the scene); if a conflict is detected (e.g., the video shows a handshake but the translation says "refuse"), a confidence-weighted correction is triggered:
T_final = α·T_out + (1 - α)·T_backup (α = visual matching degree)
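The sketch below illustrates the sliding-window prompt construction and the visual-consistency fallback; the model call (call_llm), the 0.5 fallback threshold, and the language/domain strings are hypothetical placeholders rather than the patent's prescribed values.

```python
# Hedged sketch of sliding-window, domain-aware prompt construction.
from collections import deque

WINDOW_SECONDS, STEP_SECONDS = 10, 5
history = deque(maxlen=WINDOW_SECONDS // STEP_SECONDS)   # cached window texts

PROMPT = ("[{domain} mode] Translate the following {src} text into {tgt}, "
          "using {glossary}. Style requirements: {style}.\nText:\n{text}")

def translate_window(window_text, domain, glossary, style, src="Chinese", tgt="English"):
    history.append(window_text)
    prompt = PROMPT.format(domain=domain, src=src, tgt=tgt, glossary=glossary,
                           style=style, text="\n".join(history))
    return call_llm(prompt)        # placeholder for the multi-modal large-model call

def visual_correction(t_out, t_backup, alpha):
    # alpha = visual matching degree in [0, 1]; treating the patent's weighted
    # correction as a fallback below 0.5 is an interpretation, not the source rule.
    return t_out if alpha >= 0.5 else t_backup
```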
Optionally, S140 includes:
the multi-modal features include source-speaker voiceprint features;
and driving a text-to-speech model with the source-speaker voiceprint features and the translated-language text to clone the source speaker's timbre and obtain the translated-language audio source.
In a specific embodiment of this specification, a multi-task-learning voiceprint encoder (e.g., an improved d-vector model) is used to extract the speaker's voiceprint feature vector from the original speech. By jointly training voiceprint recognition and background-noise classification tasks, the model improves robustness to complex environments (such as live background sound and multi-speaker dialogue) and ensures that the voiceprint features are not disturbed by the environment.
The translated text is converted into a phoneme sequence and semantic content features are extracted; the source-speaker voiceprint features are injected to control the timbre, intonation, and rhythm of the synthesized speech. A generator (such as a modified Tacotron 2) produces a Mel spectrogram from the content and voiceprint features, and a discriminator compares spectral details of real and synthesized speech (such as fundamental frequency and formants), forcing the generated speech to be highly consistent with the source speaker in timbre.
For the problem of phoneme differences in cross-language timbre cloning (such as differences between Chinese and English articulation), a phoneme-to-phoneme mapping table is constructed, and the phoneme durations and intensities of the target speech are dynamically adjusted. For example, the English fricative "th" is mapped to the Chinese "s" sound, and natural prosodic transitions between languages are predicted through an LSTM network.
A coarse-grained Mel spectrogram is generated first to keep the overall timbre consistent, and details (such as plosives and liaison) are then refined by a local attention mechanism to improve naturalness. A low-latency vocoder such as Parallel WaveGAN converts the Mel spectrogram into waveforms and supports streaming output in real-time scenarios (latency ≤ 50 ms). According to the phonetic characteristics of the target language (e.g., more high-frequency content in Japanese), the frequency-response curve of the synthesized speech is automatically adjusted to avoid timbre distortion. Finally, the duration of the synthesized speech is aligned a second time with the mouth-shape animation sequence, and the length of silent segments is fine-tuned so that the audio-visual synchronization error does not exceed 3 frames.
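A small sketch of that final duration check follows; the 25 fps frame rate and the pad/trim strategy are assumptions used only to make the 3-frame tolerance concrete.

```python
# Hedged sketch: pad or trim trailing silence so the synthesized audio matches
# the mouth-shape animation length to within 3 video frames.
import numpy as np

def align_duration(audio, sr, n_anim_frames, fps=25, tol_frames=3):
    target_len = int(round(n_anim_frames / fps * sr))   # samples the animation covers
    tol = int(round(tol_frames / fps * sr))             # allowed mismatch in samples
    diff = target_len - len(audio)
    if abs(diff) <= tol:
        return audio                                    # already within +/- 3 frames
    if diff > 0:
        return np.concatenate([audio, np.zeros(diff, dtype=audio.dtype)])  # pad silence
    return audio[:target_len]                           # trim (assumes trailing silence)
```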
Optionally, S140 further includes:
the multi-modal features include speech duration features;
and driving the text-to-speech model with the speech duration features and the translated-language text to adjust the speaking rate of the translated speech and obtain the translated-language audio source.
In a specific embodiment of this specification, phoneme-level duration features are extracted from the source speech, including the pronunciation duration of each phoneme, inter-syllable pause intervals, and the overall sentence rhythm pattern; these are encoded into a structured vector through a bidirectional LSTM network to capture the speaker's personal speaking-rate habits (such as rapid catchphrases or emphatic lengthening). Phoneme boundary prediction is performed on the translated target text, and phoneme segmentation marks are generated in combination with the pronunciation rules of the target language (such as English liaison and Japanese geminate consonants) to provide an alignment reference for cross-language speaking-rate adaptation. The duration features of the source speech are aligned with the phoneme segmentation of the target text on a cross-language time axis, the duration ratios at the phoneme, word, and sentence levels are calculated, and a speaking-rate scaling factor for the target speech is generated dynamically. For example, if the sentence-final phoneme of a question in the source speech is lengthened by 20%, the phoneme at the corresponding position in the target speech is lengthened synchronously. For cross-language pronunciation differences (such as Chinese monosyllables vs. English polysyllabic words), a phoneme merging and splitting strategy is used to adjust the local speaking rate, ensuring that the target speech remains natural and fluent when the number of syllables changes. The reference rate of the target speech is set based on the source speaker's average speaking rate (e.g., 4.5 syllables/second) to prevent the whole from being too fast or too slow. At emotion-expressing points such as questions and exclamations, the phoneme durations are dynamically adjusted according to the emphasis pattern of the source speech to preserve the source speaker's expressive habits. An adversarial training mechanism is introduced, in which a discriminator distinguishes the rhythmic naturalness of real and synthesized speech, forcing the generator to learn speaking-rate curves that conform to the habits of the target language. Using sliding-window incremental processing, the speaking-rate parameters are updated every 500 ms, and the output rate is fine-tuned dynamically in combination with the context semantics (e.g., urgent announcements need to be accelerated). Smooth transitions are achieved through a lightweight speech buffer pool (capacity = 2 seconds), eliminating the mechanical feel caused by abrupt speaking-rate changes and ensuring the continuity of synthesized speech in real-time scenarios. Millisecond-level alignment verification is performed between the synthesized speech and the mouth-shape animation sequence; if the deviation between the speech duration and the mouth movements exceeds a threshold (e.g., ±50 ms), speaking-rate fine-tuning is triggered automatically and the speech segment is regenerated. A speaking-rate adaptive learning mechanism is established based on user feedback data (such as real-time ratings in live-streaming scenarios) to continuously optimize the long-tail performance of the speaking-rate control model.
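The sketch below shows one way the duration-ratio-based speaking-rate factor could be computed and refreshed per segment; the 0.7 to 1.4 clipping range, the chunk interface, and the duration helpers are assumptions, while the 500 ms update interval follows the description above.

```python
# Hedged sketch of phoneme-duration-based speaking-rate scaling.
def speech_rate_factor(src_phoneme_durs, tgt_phoneme_durs):
    # Ratio of total source-speech duration to the draft TTS duration for the
    # same segment; >1 slows the target speech down, <1 speeds it up.
    ratio = sum(src_phoneme_durs) / max(sum(tgt_phoneme_durs), 1e-6)
    return min(max(ratio, 0.7), 1.4)          # clip to keep prosody natural (assumption)

def stream_rate_updates(chunks, src_durations, tgt_durations):
    # chunks: iterable of ~500 ms speech segments; *_durations map a chunk to
    # its phoneme-level durations (hypothetical helpers).
    for chunk in chunks:
        factor = speech_rate_factor(src_durations(chunk), tgt_durations(chunk))
        yield chunk, factor                   # factor drives the TTS duration predictor
```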
S150, performing mouth-shape adjustment on the source video based on the translated-language audio source, and combining the translated-language audio source with the mouth-shape animation video to obtain the real-time translated video with synchronized audio and picture.
Optionally, S150 includes:
inputting the translated-language audio source into a diffusion model, and outputting frame-by-frame lip key-point offsets;
and superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation.
Optionally, the superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation includes:
performing timestamp alignment on the translated-language audio source, and compensating the timing deviation between the translated-language audio source and the mouth-shape animation video through a dynamic time warping algorithm according to the frame-level alignment relation output by the cross-modal attention mechanism;
performing motion compensation on non-mouth regions of the source video through optical-flow estimation;
mapping the mouth-shape animation onto the source video through bilinear interpolation to generate a mouth-shape mask;
and synthesizing the audio-visually synchronized real-time translated video through a residual fusion model.
In a specific embodiment of this specification, based on the frame-level alignment relation output by the cross-modal attention mechanism (e.g., speech frame i corresponds to mouth-shape frame j), a dynamic time warping (DTW) algorithm is used to perform millisecond-level alignment compensation between the translated-speech audio stream and the mouth-shape animation sequence, with a precision unit of 10 ms. For local deviations (e.g., the speech tail is lengthened while the mouth has already closed), silent segments or smooth transition frames are inserted to keep the audio-visual deviation small. The correlation between speech-energy peaks and mouth-opening degree is monitored synchronously; if a continuous deviation is detected (e.g., the speech has finished playing but the mouth is still moving), a real-time regeneration process is triggered to partially replace the abnormal segment. A lightweight optical-flow model (such as PWC-Net) predicts pixel motion vectors of the non-mouth regions (background and body movements) in the original video and performs motion-compensated warping of adjacent frames to eliminate background jitter or tearing caused by superimposing the mouth-shape animation. For occluded regions (e.g., a hand passing across the lips), motion-estimation filling is performed based on the contents of the preceding and following frames. The optical-flow computation region is dynamically cropped to process only a 200 px range around the mouth-shape animation, reducing the computation load while preserving background integrity. The generated mouth-shape animation key points (a 20-point lip model) are mapped from a low-resolution feature space (e.g., 256×256) to the original video resolution (e.g., 1080p), and a smooth mouth-region mask that accurately covers the lip-movement range is generated through bilinear interpolation. Gaussian blur and an Alpha-channel gradient are applied to the mask edges to eliminate aliasing caused by resolution scaling, so that the synthesized mouth shape blends naturally with the skin tone of the original video. The mouth-shape mask is decomposed into a lip region and a background region; using a residual fusion formula, the background region directly reuses the original video pixels, the lip region is overlaid with the generated mouth-shape animation frame, and a color-correction module matches the brightness and tone of the original video (e.g., cold/warm tone adaptation). A GPU-accelerated ring buffer (capacity = 500 ms) temporarily stores the synthesized frames, and adaptive bitrate control (ABR) dynamically adjusts the output resolution and bitrate to ensure real-time rendering of 4K video streams (latency ≤ 200 ms).
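As an illustration of the mask-based residual fusion step, the sketch below upsamples the generated lip patch with bilinear interpolation, feathers the mask with a Gaussian blur, and composites it onto the source frame; the OpenCV calls are standard, but the 15×15 blur kernel and the box interface are assumptions.

```python
# Hedged sketch: bilinear upsampling of the lip region plus feathered
# alpha compositing onto the source frame (residual-style fusion).
import cv2
import numpy as np

def fuse_mouth(src_frame, lip_patch, lip_mask, box):
    # box = (x, y, w, h): destination rectangle of the mouth region in the source frame.
    # lip_mask is assumed to be a float mask in [0, 1] at the patch resolution.
    x, y, w, h = box
    lip = cv2.resize(lip_patch, (w, h), interpolation=cv2.INTER_LINEAR)        # bilinear upsample
    mask = cv2.resize(lip_mask, (w, h), interpolation=cv2.INTER_LINEAR).astype(np.float32)
    mask = cv2.GaussianBlur(mask, (15, 15), 0)[..., None]                      # feathered alpha edge
    out = src_frame.astype(np.float32).copy()
    roi = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = mask * lip.astype(np.float32) + (1.0 - mask) * roi # residual fusion
    return np.clip(out, 0, 255).astype(np.uint8)
```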
In the invention, the limitation of traditional single-modality (speech/text) translation is broken through: by integrating the multi-modal features of speech, text, and vision (mouth shape, expression, scene) and dynamically aligning context information through a cross-modal attention mechanism, the semantic accuracy of translation is significantly improved. Through the deep fusion of the multi-modal large model and cross-modal timing alignment technology, four-in-one "semantics-timbre-mouth shape-scene" video translation is achieved, overcoming the core pain points of the traditional technology such as audio-video mismatch, high latency, and disconnection from the scene.
Fig. 2 is a schematic diagram of a video translation and audio-visual synchronization system based on a multi-modal large model according to an embodiment of the present disclosure, where the system may include:
an acquisition module 10, configured to acquire a source video;
an extraction module 20, configured to extract features from the source video based on the multi-modal large model to obtain multi-modal features;
a fusion module 30, configured to fuse the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
a translation module 40, configured to translate into target-language text in real time based on the context semantic vector, and to process the translated-language text based on the multi-modal features to obtain a translated-language audio source;
and an adaptation module 50, configured to perform mouth-shape adjustment on the source video based on the translated-language audio source, and to combine the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture.
Optionally, the fusion module 30 includes:
The multi-modal features include speech duration features and mouth-shape key frame features;
determining a cross-modal time similarity matrix based on the speech duration features and the mouth-shape key frame features;
superimposing the cross-modal time similarity matrix with a mask matrix, and generating attention weights through Softmax;
and performing weighted fusion on the speech features to generate the context semantic vector.
Optionally, the translation module 40 includes:
based on the context semantic vector, dynamically loading a domain knowledge base through a sliding-window mechanism to generate translated text in the target language.
Optionally, the translation module 40 includes:
the multi-modal features include source-speaker voiceprint features;
and driving a text-to-speech model with the source-speaker voiceprint features and the translated-language text to clone the source speaker's timbre and obtain the translated-language audio source.
Optionally, the translation module 40 further includes:
the multi-modal features include speech duration features;
and driving the text-to-speech model with the speech duration features and the translated-language text to adjust the speaking rate of the translated speech and obtain the translated-language audio source.
Optionally, the adapting module 50 includes:
inputting the translated-language audio source into a diffusion model, and outputting frame-by-frame lip key-point offsets;
and superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation.
Optionally, the superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation includes:
performing timestamp alignment on the translated-language audio source, and compensating the timing deviation between the translated-language audio source and the mouth-shape animation video through a dynamic time warping algorithm according to the frame-level alignment relation output by the cross-modal attention mechanism;
performing motion compensation on non-mouth regions of the source video through optical-flow estimation;
mapping the mouth-shape animation onto the source video through bilinear interpolation to generate a mouth-shape mask;
and synthesizing the audio-visually synchronized real-time translated video through a residual fusion model.
The functions of the system according to the embodiments of the present invention have been described in the foregoing method embodiments; therefore, parts of this embodiment that are not described in detail can be found in the related descriptions of the foregoing embodiments and are not repeated here.
The invention may be implemented in hardware or in software modules running on one or more processors or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in accordance with embodiments of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
The above-described specific embodiments further describe the objects, technical solutions and advantageous effects of the present invention in detail, and it should be understood that the present invention is not inherently related to any particular computer, virtual device or electronic apparatus, and various general-purpose devices may also implement the present invention. The foregoing description of the embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.
Claims (8)
1. A real-time video translation and audio-visual synchronization method based on a multi-modal large model, characterized by comprising the following steps:
acquiring a source video;
extracting features from the source video based on the multi-modal large model to obtain multi-modal features;
fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
translating into target-language text in real time based on the context semantic vector, and processing the translated-language text based on the multi-modal features to obtain a translated-language audio source;
and performing mouth-shape adjustment on the source video based on the translated-language audio source, and combining the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture.
2. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 1, wherein the fusing the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector comprises:
the multi-modal features include speech duration features and mouth-shape key frame features;
determining a cross-modal time similarity matrix based on the speech duration features and the mouth-shape key frame features;
superimposing the cross-modal time similarity matrix with a mask matrix, and generating attention weights through Softmax;
and performing weighted fusion on the speech features to generate the context semantic vector.
3. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 2, wherein the translating into target-language text in real time based on the context semantic vector comprises: based on the context semantic vector, dynamically loading a domain knowledge base through a sliding-window mechanism to generate translated text in the target language.
4. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 3, wherein the processing the translated-language text based on the multi-modal features to obtain a translated-language audio source comprises:
the multi-modal features include source-speaker voiceprint features;
and driving a text-to-speech model with the source-speaker voiceprint features and the translated-language text to clone the source speaker's timbre and obtain the translated-language audio source.
5. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 4, wherein the processing the translated-language text based on the multi-modal features to obtain a translated-language audio source further comprises:
the multi-modal features include speech duration features;
and driving the text-to-speech model with the speech duration features and the translated-language text to adjust the speaking rate of the translated speech and obtain the translated-language audio source.
6. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 5, wherein the performing mouth-shape adjustment on the source video based on the translated-language audio source, and combining the translated-language audio source with the mouth-shape animation video to obtain the real-time translated video with synchronized audio and picture, comprises:
inputting the translated-language audio source into a diffusion model, and outputting frame-by-frame lip key-point offsets;
and superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation.
7. The real-time video translation and audio-visual synchronization method based on a multi-modal large model according to claim 6, wherein the superimposing the frame-by-frame lip key-point offsets onto the source video through bilinear interpolation comprises:
performing timestamp alignment on the translated-language audio source, and compensating the timing deviation between the translated-language audio source and the mouth-shape animation video through a dynamic time warping algorithm according to the frame-level alignment relation output by the cross-modal attention mechanism;
performing motion compensation on non-mouth regions of the source video through optical-flow estimation;
mapping the mouth-shape animation onto the source video through bilinear interpolation to generate a mouth-shape mask;
and synthesizing the audio-visually synchronized real-time translated video through a residual fusion model.
8. A video translation and audio-visual synchronization system based on a multi-modal large model, characterized by comprising:
an acquisition module, configured to acquire a source video;
an extraction module, configured to extract features from the source video based on the multi-modal large model to obtain multi-modal features;
a fusion module, configured to fuse the multi-modal features through a cross-modal attention mechanism to generate a context semantic vector;
a translation module, configured to translate into target-language text in real time based on the context semantic vector, and to process the translated-language text based on the multi-modal features to obtain a translated-language audio source;
and an adaptation module, configured to perform mouth-shape adjustment on the source video based on the translated-language audio source, and to combine the translated-language audio source with the mouth-shape animation video to obtain a real-time translated video with synchronized audio and picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510380695.9A CN120218091A (en) | 2025-03-28 | 2025-03-28 | Real-time video translation and audio-visual synchronization method and system based on multimodal large model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510380695.9A CN120218091A (en) | 2025-03-28 | 2025-03-28 | Real-time video translation and audio-visual synchronization method and system based on multimodal large model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN120218091A true CN120218091A (en) | 2025-06-27 |
Family
ID=96114880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510380695.9A Pending CN120218091A (en) | 2025-03-28 | 2025-03-28 | Real-time video translation and audio-visual synchronization method and system based on multimodal large model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN120218091A (en) |
- 2025-03-28: CN CN202510380695.9A patent/CN120218091A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112562721B (en) | Video translation method, system, device and storage medium | |
US11551664B2 (en) | Audio and video translator | |
Dupont et al. | Audio-visual speech modeling for continuous speech recognition | |
US8170878B2 (en) | Method and apparatus for automatically converting voice | |
JP3215823B2 (en) | Method and apparatus for audio signal driven animation of synthetic model of human face | |
CN114401438A (en) | Video generation method and device for virtual digital person, storage medium and terminal | |
KR20010072936A (en) | Post-Synchronizing an information stream | |
CN116828129B (en) | Ultra-clear 2D digital person generation method and system | |
EP4404574A1 (en) | Video processing method and apparatus, and medium and program product | |
US20250118336A1 (en) | Automatic Dubbing: Methods and Apparatuses | |
CN120218091A (en) | Real-time video translation and audio-visual synchronization method and system based on multimodal large model | |
Mattheyses et al. | On the importance of audiovisual coherence for the perceived quality of synthesized visual speech | |
CN113053364A (en) | Voice recognition method and device for voice recognition | |
CN120298559B (en) | Multi-mode-driven virtual digital human face animation generation method and system | |
Kuriakose et al. | Dip Into: A Novel Method for Visual Speech Recognition using Deep Learning | |
Venkataraghavan et al. | Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages | |
Deena | Visual speech synthesis by learning joint probabilistic models of audio and video | |
CN119274574B (en) | Audio and video mouth shape translation method, device, equipment and storage medium | |
US20250273194A1 (en) | Multi-lingual text-to-speech controlling | |
Weiss | A Framework for Data-driven Video-realistic Audio-visual Speech-synthesis. | |
Rao | Audio-visual interaction in multimedia | |
CN119967226A (en) | Video conversion method and device | |
WO2025133682A1 (en) | System and method for processing a video clip | |
CN119729036A (en) | Role dubbing method, device, equipment and storage medium of video content | |
Chung et al. | Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |