CN108538283B - A conversion method from lip image features to speech coding parameters
- Publication number
- CN108538283B (application CN201810215220.4A)
- Authority
- CN
- China
- Prior art keywords
- lip
- speech
- time
- predictor
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
Technical Field
The invention relates to the technical fields of computer vision, digital image processing, and microelectronics, and in particular to a method for converting lip image features into speech coding parameters.
Background Art
Lip-language recognition generates the corresponding text from a lip video. The existing related technical solutions are as follows:
(1) CN107122646A, title of invention: a method for realizing lip-language unlocking. Its principle is to compare lip features captured in real time with pre-stored lip features in order to determine identity, but it only obtains lip features.
(2) CN107437019A, title of invention: identity verification method and device based on lip-language recognition. Its principle is similar to (1); the difference is that 3D images are used.
(3) CN106504751A, title of invention: adaptive lip-language interaction method and interaction device. Its principle is still to recognize the lip movements as text and then carry out command interaction based on the text, so the conversion steps are complicated.
(4) LipNet, a deep-learning lip-reading algorithm released by the University of Oxford together with DeepMind, whose purpose is also to recognize lip movements as text. Compared with the earlier techniques its recognition rate is somewhat higher, but the conversion process is also complicated.
(5) CN107610703A, title of invention: a multilingual translator based on lip-language acquisition and voice pickup. It uses an existing speech recognition module to recognize text, and then uses an existing speech synthesis module to convert the text into speech.
Summary of the Invention
The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a method for converting lip image features into speech coding parameters.
The purpose of the present invention can be achieved through the following technical solutions:
A method for converting lip image features into speech coding parameters, comprising the following steps:
1) Construct a speech coding parameter converter comprising an input buffer and a trained predictor; receive lip feature vectors in chronological order and store them in the input buffer of the converter;
2) At fixed time intervals, send the k most recent lip feature vectors buffered at the current moment to the predictor as a short-term vector sequence and obtain a prediction result, the prediction result being the coding parameter vector of one speech frame;
3) The speech coding parameter converter outputs the prediction result.
The training method of the predictor specifically comprises the following steps:
21) Synchronously capture video and speech: with video and audio capture devices, synchronously capture video and the corresponding speech data, and extract lip images from the video, each covering the whole mouth within a rectangular region centered on the mouth, to obtain a lip video composed of a series of lip images I_1, I_2, ..., I_n; the speech data is a sequence of speech samples S_1, S_2, ..., S_M; the lip images and the speech data keep their temporal correspondence;
22) Obtain the short-term lip feature vector sequence FIS_t at any time t: compute the image feature vector FI for each lip image frame I in the lip video to obtain a series of lip feature vectors FI_1, FI_2, ..., FI_n; for any given time t, extract k consecutive lip feature vectors as the short-term lip feature vector sequence at time t, FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t), where FI_t is the lip feature vector closest to t in time and k is a specified parameter;
23) Obtain the speech frame coding parameter vector FA_t at any time t: for any time t, extract L consecutive speech samples as one speech frame A_t = (S_{t-L+1}, ..., S_{t-2}, S_{t-1}, S_t), where S_t is the speech sample closest to t in time; compute the coding parameters of this speech frame with a speech analysis algorithm, which gives the speech frame coding parameter vector FA_t at time t, where L is a fixed parameter;
24) Train the predictor with samples: for any time t, the training sample pair {FIS_t, FA_t} obtained from steps 22) and 23) serves as the input and expected output of the predictor; randomly select multiple values of t within the valid range and train the predictor according to its type.
In step 22), the frame rate of the lip feature vectors is increased either by temporally interpolating the lip feature vectors to double their frame rate, or by capturing with a high-speed image acquisition device.
The predictor is an artificial neural network composed of 3 LSTM layers and 2 fully connected layers connected in sequence.
In step 22), obtaining a lip feature vector specifically comprises the following steps:
For each lip image frame, extract a total of 20 feature points along the inner and outer edges of the lips, compute the center coordinates of these 20 feature points, subtract the center coordinates from the coordinates of every point to obtain 40 coordinate values, and normalize the 40 coordinate values to finally obtain one lip feature vector.
In step 23), the speech analysis algorithm is the LPC10e algorithm, and the coding parameter vector consists of the LPC parameters, comprising 1 first-half-frame voiced/unvoiced flag, 1 second-half-frame voiced/unvoiced flag, 1 pitch period, 1 gain, and 10 reflection coefficients.
Compared with the prior art, the present invention has the following features:
1. Direct conversion: the present invention uses machine learning technology to construct a special converter that performs the conversion from lip image feature vectors to speech frame coding parameter vectors. The predictor inside it can be implemented with an artificial neural network, but is not limited to an artificial neural network.
2. No text conversion required: the converter takes a sequence of lip image feature vectors as input and produces speech frame coding parameter vectors as output. The output speech frame coding parameter vectors can be synthesized directly into speech sample frames by speech synthesis technology, without going through the intermediate step of "text".
3. Easy construction of training data: the present invention also provides a training method for the designed predictor as well as a method for constructing the training samples.
Brief Description of the Drawings
Figure 1 shows the composition and interface structure of the converter.
Figure 2 shows the training flow of the predictor.
Figure 3 shows the artificial neural network structure of the predictor.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The present invention designs a converter from lip image features to speech coding parameters; it receives a sequence of lip image feature vectors, converts it into a sequence of speech frame coding parameter vectors, and outputs that sequence.
The converter mainly comprises an input buffer, a predictor, and configuration parameters. Its core is the predictor, a machine learning model that can be trained with training samples. Once trained, the predictor maps a short-term sequence of lip feature vectors to a corresponding speech coding parameter vector.
As shown in Figure 1, the converter mainly comprises the input buffer, the predictor, and the configuration parameters, together with the input and output interfaces. The converter receives lip feature vectors one by one and stores them in the input buffer. At every time interval Δt, the k most recent lip feature vectors in the buffer are sent to the predictor, the predictor produces a prediction result, and the result is output through the output port. The prediction result is the coding parameters of one speech frame. The configuration parameters mainly store the configuration of the predictor.
The working process of the converter is described as follows:
(1) The converter receives a series of lip feature vectors FI_1, FI_2, ..., FI_n and stores them in the input buffer. These lip feature vectors are input one after another in chronological order.
(2) Every time interval Δt, the converter sends the k most recent lip feature vectors buffered at the current moment to the predictor as a short-term vector sequence FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t) and obtains a prediction result FA_t. The prediction result is the coding parameter vector of one speech frame. Here Δt equals the duration of one speech frame, and k is a fixed parameter.
(3) As soon as a prediction result FA_t is obtained, it is output through the output interface.
The above steps run in a continuous loop, converting the sequence of lip image feature vectors FI_1, FI_2, ..., FI_n into the sequence of speech frame coding parameter vectors FA_1, FA_2, ..., FA_m. Since the speech frame rate and the video frame rate are not necessarily equal, the number n of input image feature vectors FI and the number m of output speech frame parameter vectors FA are not necessarily equal either.
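As a rough illustration (not part of the original disclosure), the following Python sketch shows how such a loop could be organized, assuming a trained predictor object with a Keras-style predict method and a feature stream that has already been resampled so that one lip feature vector arrives per interval Δt; the names lip_feature_stream and emit_speech_frame_params and the example values K = 25, LIP_DIM = 40 are illustrative assumptions.

```python
from collections import deque
import numpy as np

K = 25          # example length of the short-term sequence FIS_t (k = 25 in the embodiment below)
LIP_DIM = 40    # dimension of one lip feature vector FI

def run_converter(lip_feature_stream, predictor, emit_speech_frame_params):
    """Convert incoming lip feature vectors FI into speech frame parameter vectors FA.

    Assumes the stream yields one FI vector (shape (LIP_DIM,)) per interval Δt,
    where Δt is the duration of one speech frame, so each new vector triggers
    one prediction over the K most recent vectors.
    """
    buffer = deque(maxlen=K)                 # input buffer: the K most recent FI vectors
    for fi in lip_feature_stream:            # FI vectors arrive in chronological order
        buffer.append(np.asarray(fi, dtype=np.float32))
        if len(buffer) < K:                  # wait until K vectors have been buffered
            continue
        fis_t = np.stack(buffer)                        # short-term sequence FIS_t, shape (K, LIP_DIM)
        fa_t = predictor.predict(fis_t[np.newaxis])[0]  # FA_t, one 14-dimensional parameter vector
        emit_speech_frame_params(fa_t)       # output FA_t through the output interface
```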
The converter described in this patent involves a predictor, which is implemented with a machine learning model capable of data prediction, for example an artificial neural network, though it is not limited to an artificial neural network. Before application it must be trained (i.e. the predictor must learn). The training method is as follows, and its principle is shown in Figure 2: lip image feature vectors are extracted from the lip video, and speech coding parameter vectors are extracted from the corresponding speech. A short-term sequence of lip image feature vectors FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t) serves as a training input sample; the coding parameter vector FA_t of the speech frame corresponding to FIS_t serves as the expected output, i.e. the label. In this way a large number of training sample and label pairs {FIS_t, FA_t} are obtained to train the predictor, where t is any randomly chosen valid time.
Training the predictor specifically comprises the following steps:
(1) Synchronously capture video and speech. With video and audio capture devices, synchronously capture video and the corresponding speech data. The video must contain the lips. Extract the lip region from the video, i.e. a rectangular region containing the whole mouth and centered on the mouth. The final lip video is composed of a series of lip images I_1, I_2, ..., I_n. The speech data is represented as a sequence of speech samples S_1, S_2, ..., S_M (here M is uppercase and denotes the number of samples; the number of speech frames is denoted by lowercase m). The images and the speech keep their temporal correspondence.
(2) Short-term lip feature vector sequence FIS_t at any time t. Compute the image feature vector FI for each lip image frame I in the lip video, obtaining a series of lip feature vectors FI_1, FI_2, ..., FI_n. For any given time t, extract k consecutive lip feature vectors as the short-term lip feature vector sequence at time t, FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t), where FI_t is the lip feature vector closest to t in time and k is a specified parameter. To increase the frame rate of the lip feature vectors, they can be temporally interpolated to double the frame rate, or a high-speed image acquisition device can be used directly.
(3) Speech frame coding parameter vector FA_t at any time t. For any time t, extract L consecutive speech samples as one speech frame A_t = (S_{t-L+1}, ..., S_{t-2}, S_{t-1}, S_t), where S_t is the speech sample closest to t in time. Apply a speech analysis algorithm to this speech frame to compute its coding parameters, which form the speech frame coding parameter vector FA_t at time t. Here L is a fixed parameter.
(4) Train the predictor with samples. For any time t, a training sample pair {FIS_t, FA_t} is obtained from (2) and (3), where FIS_t is the input of the predictor and FA_t is its expected output, i.e. the label. Randomly selecting a large number of t values within the valid range yields a large number of samples. With these samples, the predictor is trained using the method appropriate to its type.
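A minimal sketch of how the sample pairs of step (4) could be assembled, assuming the lip feature vectors FI (an array of shape (n, 40)) and the speech samples S are already available and that feature and sample indices can be mapped to time through fixed rates fps and fs; compute_fa stands in for the speech analysis of step (3), and all function and parameter names here are illustrative rather than taken from the patent.

```python
import numpy as np

def build_training_pairs(FI, S, compute_fa, k=25, L=180, fps=50, fs=8000,
                         num_pairs=100000, seed=0):
    """Randomly draw valid times t and assemble sample/label pairs {FIS_t, FA_t}."""
    rng = np.random.default_rng(seed)
    X, Y = [], []
    # earliest/latest usable times: need k past feature vectors and L past speech samples
    t_min = max(k / fps, L / fs)
    t_max = min(len(FI) / fps, len(S) / fs)
    for t in rng.uniform(t_min, t_max, size=num_pairs):
        i = int(round(t * fps))          # index of the FI vector closest to t (assumed mapping)
        j = int(round(t * fs))           # index of the speech sample closest to t
        fis_t = FI[i - k:i]              # short-term sequence FIS_t, shape (k, 40)
        a_t = S[j - L:j]                 # speech frame A_t, shape (L,)
        fa_t = compute_fa(a_t)           # 14-dimensional coding parameter vector FA_t
        X.append(fis_t)
        Y.append(fa_t)
    return np.stack(X), np.stack(Y)      # shapes (num_pairs, k, 40) and (num_pairs, 14)
```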
(5) The trained predictor is then used as a component to construct the lip-to-speech converter.
Embodiment 1:
The following is a specific implementation, but the method and principles of the present invention are not limited to the specific numbers given here.
(1) The predictor can be implemented with an artificial neural network; other machine learning techniques can also be used to construct it. In the following, the predictor uses an artificial neural network, i.e. the predictor is equivalent to an artificial neural network.
In this embodiment, the neural network is composed of 3 LSTM layers followed by 2 fully connected (Dense) layers, connected in sequence. Dropout layers are added between every two layers and within the internal feedback connections of the LSTMs; for clarity of the architecture these are not drawn in the figure. The structure is shown in Figure 3:
Each of the three LSTM layers has 80 neurons, and the first two use the "return_sequences" mode. The two Dense layers have 100 and 14 neurons, respectively.
The first LSTM layer receives the input lip feature sequence; the input format is a 3-dimensional array (BATCHES, STEPS, LIP_DIM). The last fully connected layer is the output layer of the network; the output format is a 2-dimensional array (BATCHES, LPC_DIM). In these formats, BATCHES specifies the number of samples fed into the network at a time (commonly called the batch size); during training BATCHES is usually greater than 1, while in application BATCHES = 1. The shape of one input sample is given by STEPS and LIP_DIM: STEPS specifies the length of a short-term lip feature sequence (commonly called the number of time steps), i.e. the value of k in FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t), so STEPS = k; LIP_DIM specifies the dimension of one lip feature vector FI, and for a lip feature vector composed of 40 coordinate values LIP_DIM = 40. In the output format, LPC_DIM is the dimension of one speech coding parameter vector; for LPC10e, LPC_DIM = 14.
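The architecture described above could be written in Keras roughly as follows; the layer sizes and return_sequences settings follow the text, while the dropout rate of 0.2, the ReLU activation of the 100-unit Dense layer, and the linear output layer are assumed details that the patent does not specify.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

STEPS = 25     # k, length of the short-term lip feature sequence
LIP_DIM = 40   # dimension of one lip feature vector
LPC_DIM = 14   # dimension of one LPC10e coding parameter vector

model = Sequential([
    LSTM(80, return_sequences=True, dropout=0.2, recurrent_dropout=0.2,
         input_shape=(STEPS, LIP_DIM)),
    Dropout(0.2),
    LSTM(80, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    Dropout(0.2),
    LSTM(80, dropout=0.2, recurrent_dropout=0.2),   # last LSTM returns only its final state
    Dropout(0.2),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(LPC_DIM)                                   # linear output: one FA_t vector per sample
])
```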
The number of neurons and the number of layers are adjusted according to the application scenario. In application scenarios with a large vocabulary, the number of neurons and layers can be set somewhat larger.
(2) Determination of the value of k. The value of k must be determined according to the application scenario. For simple application scenarios it may only be necessary to recognize Chinese characters one at a time; since the pronunciation of one Chinese character lasts about 0.5 s, if the video runs at 50 frames per second then k is the number of video frames contained in 0.5 s, i.e. k = 50 × 0.5 = 25. For scenarios that use more words, whole words or even short phrases must be recognized as a unit, and the value of k is multiplied accordingly. For example, for the two words "大小" (size) and "卡车" (truck), because the mouth shapes of "大" and "卡" are similar and the single characters are hard to distinguish, "大小" and "卡车" must be recognized as whole words, and k must be at least about 2 × 25 = 50.
(3) Computation of the lip feature vector. For each image frame, extract a total of 20 feature points along the inner and outer edges of the lips to describe the current shape of the lips. Compute the center coordinates of these 20 points and subtract the center coordinates from the coordinates of every point. Each point has two coordinate values, x and y, so the 20 points give 40 coordinate values in total. After normalization, the 40 coordinate values form one lip feature vector FI. A series of lip feature vectors FI_1, FI_2, ..., FI_n is obtained from the consecutive video images. Since the image frame rate is usually not high, the lip feature vectors can be interpolated to increase the frame rate. k consecutive lip feature vectors form a short-term sequence FIS_t = (FI_{t-k+1}, ..., FI_{t-2}, FI_{t-1}, FI_t) that serves as an input sample of the predictor, where FI_t is the lip feature vector closest to t in time.
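A sketch of this computation, assuming the 20 lip landmarks per frame are already provided by some external landmark detector; the normalization by the largest absolute offset and the simple midpoint interpolation for doubling the frame rate are assumed choices, since the text does not fix the exact normalization or interpolation scheme.

```python
import numpy as np

def lip_feature_vector(landmarks):
    """landmarks: array of shape (20, 2) with the (x, y) of the 20 inner/outer lip edge points."""
    pts = np.asarray(landmarks, dtype=np.float32)
    center = pts.mean(axis=0)                   # center coordinates of the 20 points
    offsets = pts - center                      # subtract the center from every point
    scale = float(np.max(np.abs(offsets))) or 1.0   # assumed normalization; not specified in the text
    return (offsets / scale).reshape(-1)        # 40-dimensional lip feature vector FI

def double_frame_rate(FI):
    """Optional temporal interpolation: insert the midpoint between consecutive FI vectors."""
    FI = np.asarray(FI)                         # shape (n, 40)
    mids = (FI[:-1] + FI[1:]) / 2.0
    out = np.empty((2 * len(FI) - 1, FI.shape[1]), dtype=FI.dtype)
    out[0::2] = FI
    out[1::2] = mids
    return out
```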
(4) Computation of the speech frame coding parameter vector. The speech frame at time t is A_t = (S_{t-L+1}, ..., S_{t-2}, S_{t-1}, S_t), where S_t is the speech sample closest to t in time and t is any valid time. Here the speech can be sampled at 8000 Hz and L set to 180, i.e. every 180 samples form one audio frame, occupying 22.5 ms. The LPC10e algorithm can be used for speech coding. Analyzing a speech frame A_t with this algorithm yields the coding parameter vector FA_t of that frame, i.e. 14 LPC parameter values, comprising 1 first-half-frame voiced/unvoiced flag, 1 second-half-frame voiced/unvoiced flag, 1 pitch period, 1 gain, and 10 reflection coefficients. In this way, the speech frame coding parameter vector FA_t can be computed at any valid time. Different speech frames may overlap. This speech frame coding parameter vector serves as the expected output format when training the predictor and as the actual output format of the predictor in application.
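LPC10e is a specific vocoder standard and is not reproduced here. The sketch below only illustrates, under that caveat, how one frame A_t could be cut from the sample sequence and how its 10 reflection coefficients could be derived from the frame autocorrelation via the Levinson-Durbin recursion; the voiced/unvoiced flags and pitch period are left as placeholders, and the RMS value used as a gain is an assumption, so a real implementation would substitute a full LPC10e analysis.

```python
import numpy as np

FS = 8000     # sampling rate (Hz)
L = 180       # samples per speech frame (22.5 ms)
ORDER = 10    # number of reflection coefficients

def speech_frame_at(S, j):
    """Speech frame A_t = (S[j-L+1], ..., S[j]), where j indexes the sample closest to t."""
    return np.asarray(S[j - L + 1:j + 1], dtype=np.float64)

def reflection_coefficients(frame, order=ORDER):
    """Levinson-Durbin recursion on the frame autocorrelation; returns k_1 .. k_order."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    k = np.zeros(order)
    err = r[0] if r[0] > 0 else 1e-12            # guard against silent frames
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = acc / err
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k[i - 1] ** 2)
    return k

def encode_frame(frame):
    """Placeholder assembly of a 14-dimensional parameter vector in the spirit of LPC10e."""
    gain = np.sqrt(np.mean(frame ** 2))                   # simple RMS energy as a stand-in gain
    voiced_first, voiced_second, pitch = 0.0, 0.0, 0.0    # real LPC10e analysis needed here
    return np.concatenate(([voiced_first, voiced_second, pitch, gain],
                           reflection_coefficients(frame)))
```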
(5) Training of the predictor. A short-term sequence FIS_t of the above lip feature vectors is used as an input sample, and the speech frame coding parameter vector FA_t at the corresponding time is used as its label (i.e. the prediction target), forming a sample pair {FIS_t, FA_t}. Since t can take any value within the valid time range, a large number of training samples can be obtained for training the predictor. During training, the mean squared error (MSE) is used as the prediction error, and the network weights are adjusted step by step by error backpropagation. The result is a usable predictor.
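Continuing the Keras sketch above, the training described here could look roughly like this; the optimizer, batch size, number of epochs, and validation split are assumed values not stated in the text.

```python
# X: inputs of shape (num_pairs, STEPS, LIP_DIM); Y: labels of shape (num_pairs, LPC_DIM),
# e.g. as produced by build_training_pairs() in the earlier sketch.
model.compile(optimizer='adam', loss='mse')   # mean squared error, minimized by backpropagation
model.fit(X, Y, batch_size=64, epochs=50, validation_split=0.1)
model.save('lip_to_lpc_predictor.h5')         # later stored in the converter's "configuration parameters"
```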
(6) After training, the predictor is used as a module in the converter. The data describing the predictor structure, the weight data, and other parameters are all stored in the "configuration parameters"; when the converter starts, these configuration parameters are read out and the predictor is rebuilt from them.
(7) The method described here can be implemented in software, or partly or entirely in hardware.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810215220.4A CN108538283B (en) | 2018-03-15 | 2018-03-15 | A conversion method from lip image features to speech coding parameters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108538283A CN108538283A (en) | 2018-09-14 |
CN108538283B true CN108538283B (en) | 2020-06-26 |
Family
ID=63484002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810215220.4A Active CN108538283B (en) | 2018-03-15 | 2018-03-15 | A conversion method from lip image features to speech coding parameters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108538283B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10891969B2 (en) * | 2018-10-19 | 2021-01-12 | Microsoft Technology Licensing, Llc | Transforming audio content into images |
CN111023470A (en) * | 2019-12-06 | 2020-04-17 | 厦门快商通科技股份有限公司 | Air conditioner temperature adjusting method, medium, equipment and device |
CN111508509A (en) * | 2020-04-02 | 2020-08-07 | 广东九联科技股份有限公司 | Sound quality processing system and method based on deep learning |
CN113869212B (en) * | 2021-09-28 | 2024-06-21 | 平安科技(深圳)有限公司 | Multi-mode living body detection method, device, computer equipment and storage medium |
CN116013354B (en) * | 2023-03-24 | 2023-06-09 | 北京百度网讯科技有限公司 | Training method of deep learning model and method for controlling mouth shape change of virtual image |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217218A (en) * | 2014-09-11 | 2014-12-17 | 广州市香港科大霍英东研究院 | Lip language recognition method and system |
CN105321519A (en) * | 2014-07-28 | 2016-02-10 | 刘璟锋 | Speech recognition system and unit |
CN105632497A (en) * | 2016-01-06 | 2016-06-01 | 昆山龙腾光电有限公司 | Voice output method, voice output system |
CN107799125A (en) * | 2017-11-09 | 2018-03-13 | 维沃移动通信有限公司 | A kind of audio recognition method, mobile terminal and computer-readable recording medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7133535B2 (en) * | 2002-12-21 | 2006-11-07 | Microsoft Corp. | System and method for real time lip synchronization |
- 2018-03-15: Application CN201810215220.4A filed in CN (patent CN108538283B, status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN108538283A (en) | 2018-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108538283B (en) | A conversion method from lip image features to speech coding parameters | |
CN112735373B (en) | Speech synthesis method, device, equipment and storage medium | |
CN112151030B (en) | Multi-mode-based complex scene voice recognition method and device | |
CN111933110B (en) | Video generation method, generation model training method, device, medium and equipment | |
CN108648745B (en) | Method for converting lip image sequence into voice coding parameter | |
CN113936637B (en) | Speech adaptive completion system based on multimodal knowledge graph | |
CN114360491B (en) | Speech synthesis method, device, electronic equipment and computer readable storage medium | |
JP4236815B2 (en) | Face synthesis device and face synthesis method | |
CN111488807A (en) | Video description generation system based on graph convolution network | |
JP3485508B2 (en) | Facial image transmitting method and system, and facial image transmitting device and facial image reproducing device used in the system | |
US12165634B2 (en) | Speech recognition method and apparatus, device, storage medium, and program product | |
CN109147763A (en) | A kind of audio-video keyword recognition method and device based on neural network and inverse entropy weighting | |
KR102319753B1 (en) | Method and apparatus for producing video contents based on deep learning | |
Cao et al. | Nonparallel Emotional Speech Conversion Using VAE-GAN. | |
CN118897887B (en) | An efficient digital human interaction system integrating multimodal information | |
CN112381040A (en) | Transmembrane state generation method based on voice and face image | |
KR20250048367A (en) | Data processing system and method for speech recognition model, speech recognition method | |
CN117409121A (en) | Fine-grained emotion control speaking face video generation method, system, equipment and media based on audio and single image driving | |
CN118471250B (en) | A method for automatically generating lip shape and expression by inputting speech | |
Chen et al. | Transformer-s2a: Robust and efficient speech-to-animation | |
CN115294962A (en) | Training method, device, equipment and storage medium for speech synthesis model | |
CN116825083A (en) | Speech synthesis system based on face grid | |
CN117528135A (en) | Speech-driven face video generation method and device, electronic equipment and medium | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN113053356A (en) | Voice waveform generation method, device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |