Zhang et al., 2022 - Google Patents
Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary
- Document ID
- 4839771785100713042
- Author
- Zhang S
- Yuan J
- Liao M
- Zhang L
- Publication year
- 2022
- Publication venue
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Snippet
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary …
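To make the dictionary idea in the snippet concrete, the sketch below shows what a phoneme-pose lookup of this kind might look like: each phoneme maps to a few key poses recorded for the target speaker, and per-frame poses are obtained by interpolating between them according to phoneme timing. The function name, data layout, and interpolation scheme are assumptions for illustration only and do not reproduce the paper's implementation.

```python
# Minimal, hypothetical sketch of a phoneme-pose dictionary lookup.
# The data layout (phoneme -> list of key poses, each pose a list of
# 2D landmark coordinates) and the linear interpolation between key
# poses are illustrative assumptions, not the paper's method.
from typing import Dict, List, Tuple

Pose = List[Tuple[float, float]]  # 2D facial/mouth landmarks for one frame

def build_pose_track(
    phonemes: List[Tuple[str, float]],   # (phoneme, duration in seconds)
    dictionary: Dict[str, List[Pose]],   # personalized phoneme-pose dictionary
    fps: int = 25,
) -> List[Pose]:
    """Assemble a per-frame pose track by looking up each phoneme's key
    poses and linearly interpolating across the phoneme's duration."""
    track: List[Pose] = []
    for phoneme, duration in phonemes:
        key_poses = dictionary.get(phoneme)
        if not key_poses:
            continue  # unseen phoneme: skip (a real system would back off)
        n_frames = max(1, round(duration * fps))
        for f in range(n_frames):
            # Fractional position of this frame along the key-pose sequence.
            t = f / max(1, n_frames - 1) * (len(key_poses) - 1)
            i, frac = int(t), t - int(t)
            a = key_poses[i]
            b = key_poses[min(i + 1, len(key_poses) - 1)]
            track.append([
                (ax + frac * (bx - ax), ay + frac * (by - ay))
                for (ax, ay), (bx, by) in zip(a, b)
            ])
    return track
```

In a full pipeline of this kind, the resulting pose track would then condition an image-generation network that renders the final talking-head frames.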
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING; COUNTING
    - G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
      - G06T13/00—Animation
        - G06T13/20—3D [Three Dimensional] animation
          - G06T13/205—3D [Three Dimensional] animation driven by audio data
          - G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L13/00—Speech synthesis; Text to speech systems
      - G10L15/00—Speech recognition
        - G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
          - G10L15/065—Adaptation
            - G10L15/07—Adaptation to the speaker
        - G10L15/08—Speech classification or search
          - G10L15/14—Speech classification or search using statistical models, e.g. hidden Markov models [HMMs]
            - G10L15/142—Hidden Markov Models [HMMs]
              - G10L15/144—Training of HMMs
          - G10L15/18—Speech classification or search using natural language modelling
          - G10L2015/088—Word spotting
        - G10L15/24—Speech recognition using non-acoustical features
          - G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
      - G10L17/00—Speaker identification or verification
        - G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
      - G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
        - G10L21/003—Changing voice quality, e.g. pitch or formants
          - G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
            - G10L21/013—Adapting to target pitch
        - G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
          - G10L21/10—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids transforming into visible information
            - G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
      - G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
        - G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
          - G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
            - G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | | Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary |
US11587548B2 (en) | | Text-driven video synthesis with phonetic dictionary |
Karras et al. | | Audio-driven facial animation by joint end-to-end learning of pose and emotion |
US7636662B2 (en) | | System and method for audio-visual content synthesis |
Brand | | Voice puppetry |
US7133535B2 (en) | | System and method for real time lip synchronization |
Xie et al. | | Realistic mouth-synching for speech-driven talking face using articulatory modelling |
US6735566B1 (en) | | Generating realistic facial animation from speech |
WO2021023869A1 (en) | | Audio-driven speech animation using recurrent neutral network |
JP2000508845A (en) | | Automatic synchronization of video image sequences to new soundtracks |
Hassid et al. | | More than words: In-the-wild visually-driven prosody for text-to-speech |
Wang et al. | | Anyonenet: Synchronized speech and talking head generation for arbitrary persons |
Wang et al. | | Synthesizing photo-real talking head via trajectory-guided sample selection |
Sadoughi et al. | | Expressive speech-driven lip movements with multitask learning |
Pan et al. | | Vocal: Vowel and consonant layering for expressive animator-centric singing animation |
Deena et al. | | Visual speech synthesis using a variable-order switching shared Gaussian process dynamical model |
CN114255307A (en) | | Virtual face control method, device, equipment and storage medium |
Hussen Abdelaziz et al. | | Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models |
Liu et al. | | Real-time speech-driven animation of expressive talking faces |
Fang et al. | | Audio-to-Deep-Lip: Speaking lip synthesis based on 3D landmarks |
Asadiabadi et al. | | Multimodal speech driven facial shape animation using deep neural networks |
Sadiq et al. | | Emotion dependent domain adaptation for speech driven affective facial feature synthesis |
Mahavidyalaya | | Phoneme and viseme based approach for lip synchronization |
Zhang et al. | | Realistic Speech-Driven Talking Video Generation with Personalized Pose |
Abdelaziz | | Improving acoustic modeling using audio-visual speech |