CN109215680B - A speech restoration method based on convolutional neural network - Google Patents
- Publication number
- CN109215680B · CN201810937126A · CN201810937126.XA
- Authority
- CN
- China
- Prior art keywords
- formant
- voice
- sequence
- data
- speech
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention relates to the technical field of speech processing, and in particular to a speech restoration method based on a convolutional neural network, comprising: step S1, collecting electronically disguised speech; step S2, preprocessing the electronically disguised speech with a preprocessing model so as to convert it into a standard speech sequence of preset dimensions; and step S3, restoring the standard speech sequence to the original speech sequence with a restoration model. In step S2, the preprocessing comprises formant-breakage cleaning, formant-merging optimization and formant-sequence adjustment. Because a convolutional neural network is used to restore the electronically disguised speech, the restoration quality is high enough to satisfy demanding restoration scenarios for electronically disguised speech.
Description
Technical Field
The invention relates to the technical field of speech processing, and in particular to a speech restoration method based on a convolutional neural network.
Background Art
With the continuous development of information technology, speaker recognition has made great progress, and the analysis of the speaker-specific characteristics of speech has received extensive attention. The appearance of disguised speech, however, poses an unprecedented challenge to speaker recognition research. In the broad sense, any change in, distortion of, or deviation from normal speech, whatever the cause, may be called disguised speech. In the narrow sense, disguise means deliberate disguise, that is, the intentional distortion of normal speech for the purpose of concealing the speaker's identity.
In criminal cases, criminals often disguise their voices by means of electronically disguised speech in order to conceal their real identities and evade prosecution. Electronically disguised speech is the distorted speech generated by processing a speaker's original speech with voice-changing electronic equipment or speech-processing software. It disguises the speaker's identity to a high degree: it alters the acoustic characteristics that individualize the speaker's original speech, so that the speaker can neither be recognized by ear nor easily confirmed with electro-acoustic instruments. Since voice is an important biometric feature for personal identification, the emergence of electronically disguised speech makes forensic voice examination considerably harder. It is therefore important to study the characteristics of the various kinds of electronically disguised speech in depth, to extract the parameters that change most markedly, to summarize the rules governing those changes, and to design a restoration method that can restore such varied disguised speech; this is of great significance for identifying the speaker and for the evidential value of the recording.
Restoration of electronically disguised speech is the process of weakening or eliminating the disguising characteristics of the disguised speech through an algorithm or model and generating restored speech that is closer to the original speech than the disguised speech is. Because electronically disguised speech is generally produced by an algorithm that alters the acoustic characteristics of the original speech, the conversion from original speech to disguised speech follows certain regularities; and since speech is short-time stationary, statistically comparing the voiceprint deviations between the original speech and the electronically disguised speech provides a basis for restoring the disguised speech.
However, the restoration effect of existing methods for electronically disguised speech is not ideal and cannot satisfy scenarios with high restoration-quality requirements.
Summary of the Invention
In view of the above problems, the present invention proposes a speech restoration method based on a convolutional neural network, in which a preprocessing model and a restoration model are preset;
the restoration model comprises a convolutional neural network and an initial restoration factor, and the restoration model trains the initial restoration factor through the convolutional neural network to generate a restoration factor for controlling the restoration process of the restoration model;
the convolutional neural network comprises a dilated causal convolution layer connected to a sub-control layer, the sub-control layer being used to convert the initial restoration factor into a sequence of preset dimensions;
the speech restoration method further comprises:
step S1, collecting the electronically disguised speech and extracting its acoustic parameters;
step S2, preprocessing the acoustic parameters with the preprocessing model so as to convert them into a standard sequence of the preset dimensions;
step S3, restoring the electronically disguised speech to a restored speech sequence with the restoration model, the restoration model completing the restoration of the standard speech sequence according to the standard sequence;
wherein, in step S2, the preprocessing comprises formant-breakage cleaning, formant-merging optimization and formant-sequence adjustment.
In the above speech restoration method, the dilated causal convolution layer uses a gated activation unit to apply a non-linear transformation to the input coming from the sub-control layer.
In the above speech restoration method, the gated activation unit performs the non-linear transformation with the following function:
$z_k = \tanh(W_{f,k} * x + V_{f,k}^{T} * h) \odot \sigma(W_{g,k} * x + V_{g,k}^{T} * h)$, where $x$ is the input electronically disguised speech sequence, $*$ denotes convolution, $\odot$ denotes element-wise multiplication, $\sigma(\cdot)$ is the sigmoid function, $W_{f,k}$ and $W_{g,k}$ are the learned filter and gate convolution filters, $V_{f,k}$ and $V_{g,k}$ are the corresponding transposed filters applied to the restoration factor, $z_k$ is the output of the gate activation, $h$ is the restoration factor, $k$ is the layer index, and the subscripts $f$ and $g$ denote the filter and gate branches.
In the above speech restoration method, the convolutional neural network further comprises a skip-layer residual structure connected to the output of the causal convolution layers;
the skip-layer residual structure identity-maps the input of any earlier convolution layer to a later convolution layer a preset number of layers away, adds it to the residual computed by that later layer, and then outputs the sum.
The above speech restoration method further comprises: step S4, classifying and outputting the data of the restored speech sequence.
In the above speech restoration method, step S4 specifically comprises:
step S41, discretizing and classifying the restored data with a Softmax function so as to normalize the data sequence of the restored speech sequence;
step S42, applying μ-law companding to the restored speech sequence after data-sequence normalization so as to reduce the amount of output computation.
In the above speech restoration method, the Softmax function is:
$\sigma(\mathbf{a})_i = \dfrac{e^{a_i}}{\sum_{j=1}^{N} e^{a_j}}, \quad i = 1, \dots, N$, where $\mathbf{a}$ is the input restored speech sequence and $N$ is the vector dimension.
In the above speech restoration method, the μ-law companding is performed with the following function:
$f(x_t) = \operatorname{sign}(x_t)\,\dfrac{\ln(1 + \mu \lvert x_t \rvert)}{\ln(1 + \mu)}$, where $x_t$ is the input restored speech sequence after data-sequence normalization and $\mu$ is the companding parameter.
In the above speech restoration method, the formant-breakage cleaning comprises:
step A1, extracting the formant data pairs of the non-homogeneous finals in the electronically disguised speech and, after voiceprint comparison, identifying the formant data pairs in which breakage occurs;
step A2, labelling the data in each formant data pair according to the number of non-zero formants each member contains;
step A3, cross-subtracting the corresponding formant centre-frequency parameters of the labelled data to form a matrix of the absolute values of the differences;
step A4, taking, from the absolute-value matrix, the formant data pair corresponding to the selection of elements from distinct rows and distinct columns whose sum is minimal;
step A5, performing steps A1 to A4 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after breakage cleaning.
In the above speech restoration method, the formant-merging optimization comprises:
step B1, extracting the formant data pairs of the non-homogeneous finals in the electronically disguised speech according to a preset formant extraction rule;
step B2, labelling the data in each formant data pair according to the number of non-zero formants each member contains;
step B3, subtracting each formant centre-frequency parameter of the member with fewer non-zero formants from its adjacent centre-frequency parameter, extracting the group of centre-frequency parameters with the smallest absolute difference, denoted $(f_{v,1}, f_{v+1,1})$, and applying the following transformation:
step B4, performing steps B1 to B3 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after merging optimization.
In the above speech restoration method, the formant-sequence adjustment comprises:
step C1, extracting the formant centre-frequency parameters of the non-homogeneous finals in the electronically disguised speech, dividing them by numeric range into subsets $A_1, A_2, A_3, A_4$, and determining the number of elements in each subset $A_j$ (j = 1, 2, 3, 4);
when every subset $A_j$ (j = 1, 2, 3, 4) contains only one element or none, if $A_j = A_{j+1} \neq \varphi$, the $A_j$ corresponding to the j that minimizes $\lvert j - i \rvert$ is retained and $A_{j+1}$ is set to the empty set; the formants of the syllable finals of $S_{tm}$, $S_{pn}$ or $S_{rr}$ are then assimilated to the formants of the corresponding syllable finals in $S_0$, so that the number and actual positions of the formants of a given syllable final remain consistent;
otherwise, let $M_{12} = A_1 \cap A_2$, $M_{23} = A_2 \cap A_3$, $M_{34} = A_3 \cap A_4$, where $M_{j,j+1}$ (j = 1, 2, 3) denotes the set of formants that lie in the overlap region of two centre-frequency ranges and therefore appear in both $A_j$ and $A_{j+1}$; at most two formants lie in the same overlap region, i.e. $\operatorname{card}(M_{j,j+1}) \le 2$ (j = 1, 2, 3); write $B_1 = A_1 - M_{12}$, $B_2 = A_2 - M_{12} - M_{23}$, $B_3 = A_3 - M_{23} - M_{34}$, $B_4 = A_4 - M_{34}$; denote the formant with the largest centre frequency in $M_{j-1,j}$ and the formant with the smallest centre frequency in $M_{j,j+1}$; then for each j = 1, 2, 3, 4 the following calculation is performed:
where $[x]$ denotes the largest integer not greater than $x$, and $x$ refers to the result of the expression inside the brackets; the sets $A_1, A_2, A_3, A_4$ are then emptied and each formant is placed into its corresponding set $A_j$, so that every $A_j$ (j = 1, 2, 3, 4) contains at most one non-zero formant;
step C2, performing step C1 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after sequence adjustment.
Advantageous effects: the speech restoration method based on a convolutional neural network proposed by the invention uses a convolutional neural network to restore electronically disguised speech; the restoration quality is high enough to satisfy demanding restoration scenarios for electronically disguised speech.
Description of the Drawings
FIG. 1 is a flowchart of the steps of the speech restoration method based on a convolutional neural network in an embodiment of the invention;
FIG. 2 is a schematic diagram of how the receptive field of a dilated convolution grows with the dilation factor in an embodiment of the invention;
FIG. 3 is a schematic diagram of the training effect obtained by combining dilated convolution with causal convolution in an embodiment of the invention;
FIG. 4 is a schematic diagram of the basic structure of speech restoration based on a convolutional neural network in an embodiment of the invention;
FIG. 5 is a structural diagram of the skip-layer connections and residual blocks in the deep-transfer optimization process in an embodiment of the invention.
Detailed Description of the Embodiments
The invention is further described below with reference to the accompanying drawings and embodiments.
In a preferred embodiment, as shown in FIG. 1, a speech restoration method based on a convolutional neural network is proposed, in which a preprocessing model and a restoration model are preset;
the restoration model comprises a convolutional neural network and an initial restoration factor, and the restoration model trains the initial restoration factor through the convolutional neural network to generate a restoration factor for controlling the restoration process of the restoration model;
the convolutional neural network comprises a dilated causal convolution layer connected to a sub-control layer, the sub-control layer being used to convert the initial restoration factor into a sequence of preset dimensions;
the speech restoration method may further comprise:
step S1, collecting the electronically disguised speech and extracting its acoustic parameters;
step S2, preprocessing the acoustic parameters with the preprocessing model so as to convert them into a standard sequence of the preset dimensions;
step S3, restoring the electronically disguised speech to a restored speech sequence with the restoration model, the restoration model completing the restoration of the standard speech sequence according to the standard sequence;
wherein, in step S2, the preprocessing comprises formant-breakage cleaning, formant-merging optimization and formant-sequence adjustment.
In the above technical solution, the acoustic parameters may include the centre frequency, bandwidth and intensity of the formants.
In a preferred embodiment, the dilated causal convolution layer uses a gated activation unit to apply a non-linear transformation to the input coming from the sub-control layer.
In the above embodiment, preferably, the gated activation unit performs the non-linear transformation with the following function:
$z_k = \tanh(W_{f,k} * x + V_{f,k}^{T} * h) \odot \sigma(W_{g,k} * x + V_{g,k}^{T} * h)$, where $x$ is the input electronically disguised speech sequence, $*$ denotes convolution, $\odot$ denotes element-wise multiplication, $\sigma(\cdot)$ is the sigmoid function, $W_{f,k}$ and $W_{g,k}$ are the learned filter and gate convolution filters, $V_{f,k}$ and $V_{g,k}$ are the corresponding transposed filters applied to the restoration factor, $z_k$ is the output of the gate activation, $h$ is the restoration factor, $k$ is the layer index, and the subscripts $f$ and $g$ denote the filter and gate branches.
In a preferred embodiment, the convolutional neural network further comprises a skip-layer residual structure connected to the output of the causal convolution layers;
the skip-layer residual structure identity-maps the input of any earlier convolution layer to a later convolution layer a preset number of layers away, adds it to the residual computed by that later layer, and then outputs the sum.
In the above technical solution, the skip-layer residual structure may comprise a skip-layer connection structure and a residual structure; the earlier and later convolution layers form the connected causal convolution layers.
In a preferred embodiment, the method further comprises: step S4, classifying and outputting the data of the restored speech sequence.
In the above embodiment, preferably, step S4 specifically comprises:
step S41, discretizing and classifying the restored data with a Softmax function so as to normalize the data sequence of the restored speech sequence;
step S42, applying μ-law companding to the restored speech sequence after data-sequence normalization so as to reduce the amount of output computation.
In the above embodiment, preferably, the Softmax function is:
$\sigma(\mathbf{a})_i = \dfrac{e^{a_i}}{\sum_{j=1}^{N} e^{a_j}}, \quad i = 1, \dots, N$, where $\mathbf{a}$ is the input restored speech sequence and $N$ is the vector dimension.
In the above embodiment, preferably, the μ-law companding is performed with the following function:
$f(x_t) = \operatorname{sign}(x_t)\,\dfrac{\ln(1 + \mu \lvert x_t \rvert)}{\ln(1 + \mu)}$, where $x_t$ is the input restored speech sequence after data-sequence normalization and $\mu$ is the companding parameter.
In a preferred embodiment, the formant-breakage cleaning may comprise:
step A1, extracting the formant data pairs of the non-homogeneous finals in the electronically disguised speech and, after voiceprint comparison, identifying the formant data pairs in which breakage occurs;
step A2, labelling the data in each formant data pair according to the number of non-zero formants each member contains;
step A3, cross-subtracting the corresponding formant centre-frequency parameters of the labelled data to form a matrix of the absolute values of the differences;
step A4, taking, from the absolute-value matrix, the formant data pair corresponding to the selection of elements from distinct rows and distinct columns whose sum is minimal;
step A5, performing steps A1 to A4 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after breakage cleaning.
In a preferred embodiment, the formant-merging optimization may comprise:
step B1, extracting the formant data pairs of the non-homogeneous finals in the electronically disguised speech according to a preset formant extraction rule;
step B2, labelling the data in each formant data pair according to the number of non-zero formants each member contains;
step B3, subtracting each formant centre-frequency parameter of the member with fewer non-zero formants from its adjacent centre-frequency parameter, extracting the group of centre-frequency parameters with the smallest absolute difference, denoted $(f_{v,1}, f_{v+1,1})$, and applying the following transformation:
step B4, performing steps B1 to B3 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after merging optimization.
In a preferred embodiment, the formant-sequence adjustment may comprise:
step C1, extracting the formant centre-frequency parameters of the non-homogeneous finals in the electronically disguised speech, dividing them by numeric range into subsets $A_1, A_2, A_3, A_4$, and determining the number of elements in each subset $A_j$ (j = 1, 2, 3, 4);
when every subset $A_j$ (j = 1, 2, 3, 4) contains only one element or none, if $A_j = A_{j+1} \neq \varphi$, the $A_j$ corresponding to the j that minimizes $\lvert j - i \rvert$ is retained and $A_{j+1}$ is set to the empty set; the formants of the syllable finals of $S_{tm}$, $S_{pn}$ or $S_{rr}$ are then assimilated to the formants of the corresponding syllable finals in $S_0$, so that the number and actual positions of the formants of a given syllable final remain consistent;
otherwise, let $M_{12} = A_1 \cap A_2$, $M_{23} = A_2 \cap A_3$, $M_{34} = A_3 \cap A_4$, where $M_{j,j+1}$ (j = 1, 2, 3) denotes the set of formants that lie in the overlap region of two centre-frequency ranges and therefore appear in both $A_j$ and $A_{j+1}$; at most two formants lie in the same overlap region, i.e. $\operatorname{card}(M_{j,j+1}) \le 2$ (j = 1, 2, 3); write $B_1 = A_1 - M_{12}$, $B_2 = A_2 - M_{12} - M_{23}$, $B_3 = A_3 - M_{23} - M_{34}$, $B_4 = A_4 - M_{34}$; denote the formant with the largest centre frequency in $M_{j-1,j}$ and the formant with the smallest centre frequency in $M_{j,j+1}$; then for each j = 1, 2, 3, 4 the following calculation is performed:
where $[x]$ denotes the largest integer not greater than $x$, and $x$ refers to the result of the expression inside the brackets; the sets $A_1, A_2, A_3, A_4$ are then emptied and each formant is placed into its corresponding set $A_j$, so that every $A_j$ (j = 1, 2, 3, 4) contains at most one non-zero formant;
step C2, performing step C1 on the formant data pairs of all non-homogeneous finals to form the set of formant data pairs after sequence adjustment.
The following more detailed embodiment is provided to clarify the technical solution of the invention.
The formant parameters are assumed to be as follows:
The formant-breakage cleaning proceeds as follows:
(1) $W_{Nh}$ is extracted from each group $S_c$; by inspecting the formants and voiceprint display of the corresponding wide-band spectrograms, the formant data pairs of non-homogeneous finals are selected according to the causes of formant breakage described above and placed into $W_l$;
(2) for a given non-homogeneous-final formant data pair in $W_l$, the member with the smaller number of non-zero formants and the member with the larger number are labelled separately;
(3) each centre frequency $f_{v,1}$ of the member with fewer formants is subtracted from each of the centre frequencies $f_{1,1}, \dots, f_{u,1}$ of the member with more formants; the absolute value of each difference, $d_{u,v} = \lvert f_{u,1} - f_{v,1} \rvert$, is placed into a $u \times v$ matrix $D$;
(4) the selection of elements of $D$ taken from distinct rows and distinct columns whose sum is minimal is found; the formant data pair corresponding to that selection is the cleaned data set obtained for the word whose formants were damaged;
(5) steps (2) to (4) are performed on every non-homogeneous-final formant data pair in $W_l$, finally yielding the set $W_l'$ of data pairs after formant-breakage cleaning.
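Purely as an illustrative sketch of steps (3) and (4), the following Python code builds the difference matrix $D$ and searches for the row/column selection with the minimal sum; the function name, the brute-force search and the example frequencies are assumptions made for demonstration and are not part of the claimed method.

```python
import numpy as np
from itertools import permutations

def clean_broken_formants(f_small, f_large):
    """Match each centre frequency in the shorter formant list (from the member with
    fewer non-zero formants) to a distinct centre frequency in the longer list,
    minimising the total absolute difference; return the retained frequencies."""
    f_small = np.asarray(f_small, dtype=float)
    f_large = np.asarray(f_large, dtype=float)
    # D[u, v] = |f_large[u] - f_small[v]|, the cross-difference matrix of step (3)
    D = np.abs(f_large[:, None] - f_small[None, :])
    best_rows, best_cost = None, np.inf
    # at most four formants per final, so exhaustive search over assignments is cheap
    for rows in permutations(range(len(f_large)), len(f_small)):
        cost = sum(D[r, c] for c, r in enumerate(rows))
        if cost < best_cost:
            best_cost, best_rows = cost, rows
    return f_large[list(best_rows)], best_cost

# toy example: a 3-formant final matched against a 4-formant final (values in Hz)
kept, cost = clean_broken_formants([700, 1200, 2600], [690, 1150, 1900, 2580])
print(kept, cost)
```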
The formant-merging optimization proceeds as follows:
(1) $W_{Nh}$ is extracted from each group $S_c$; since $W_l \cap W_m \neq \varphi$, the formants and voiceprint display of the corresponding wide-band spectrograms are inspected, and the formant data pairs of non-homogeneous finals are selected according to the causes of formant merging described above and placed into $W_m$;
(2) for a given non-homogeneous-final formant data pair in $W_m$, the member with the smaller number of non-zero formants and the member with the larger number are labelled separately;
(3) each centre frequency $f_{v,1}$ (v = 1, 2, ... and v < 4) of the member with fewer formants is compared with its adjacent centre frequency $f_{v+1,1}$; the pair of centre frequencies $(f_{v,1}, f_{v+1,1})$ with the smallest absolute difference is taken, and the corresponding formant data are transformed as follows:
(4) steps (2) and (3) are performed on every non-homogeneous-final formant data pair in $W_m$, finally yielding the set $W_m'$ of data pairs after formant-merging optimization.
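The concrete merging transformation is given by the formula referenced in step (3) and is not reproduced here. Purely as an illustration of locating the closest adjacent pair, the sketch below finds that pair and merges it with an intensity-weighted average; this merge rule is an assumption for demonstration, not the transform defined by the invention.

```python
import numpy as np

def merge_closest_formants(freqs, bandwidths, intensities):
    """Find the adjacent pair of centre frequencies with the smallest absolute
    difference (step (3)) and merge it into one formant.  The merge rule used here
    (intensity-weighted mean frequency, summed bandwidth, summed intensity) is an
    illustrative assumption."""
    freqs = np.asarray(freqs, dtype=float)
    v = int(np.argmin(np.abs(np.diff(freqs))))        # index of (f_v, f_{v+1})
    w = np.asarray(intensities[v:v + 2], dtype=float)
    f_new = float(np.average(freqs[v:v + 2], weights=w))
    b_new = float(bandwidths[v] + bandwidths[v + 1])
    i_new = float(w.sum())
    out_f = np.delete(freqs, v + 1)
    out_b = np.delete(np.asarray(bandwidths, float), v + 1)
    out_i = np.delete(np.asarray(intensities, float), v + 1)
    out_f[v], out_b[v], out_i[v] = f_new, b_new, i_new
    return out_f, out_b, out_i

print(merge_closest_formants([650, 720, 2400], [90, 110, 160], [40, 35, 30]))
```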
The formant-sequence adjustment proceeds as follows:
(1) the sets $A_1, A_2, A_3, A_4$ are initialized empty; for a syllable final of the original-speech set $S_0$ of a group $S_c$ that has fewer than four formants, its formants are placed into $A_1, A_2, A_3, A_4$ according to the preset ranges of formant centre frequency.
When every $A_j$ (j = 1, 2, 3, 4) contains only one element or none, that is,
$\operatorname{card}(A_j) \le 1, \quad j = 1, 2, 3, 4,$
then, if $A_j = A_{j+1} \neq \varphi$, the $A_j$ corresponding to the j that minimizes $\lvert j - i \rvert$ is retained (i being the original position) and the other set is emptied; the value of j is the sequence-adjusted position, and the procedure jumps to step (3). Otherwise, when at least one $A_j$ (j = 1, 2, 3, 4) contains at least two elements, step (2) must also be performed;
(2) let $M_{12} = A_1 \cap A_2$, $M_{23} = A_2 \cap A_3$, $M_{34} = A_3 \cap A_4$; the set $M_{j,j+1}$ (j = 1, 2, 3) then contains the formants that appear in both $A_j$ and $A_{j+1}$ because they lie in the overlap region of two centre-frequency ranges. Since the forensic identification intelligent speech workstation used for measurement enforces a minimum spacing of 80 Hz between formant centre frequencies, no more than two formants can lie in the same overlap region, that is,
$\operatorname{card}(M_{j,j+1}) \le 2, \quad j = 1, 2, 3.$
Write $B_1 = A_1 - M_{12}$, $B_2 = A_2 - M_{12} - M_{23}$, $B_3 = A_3 - M_{23} - M_{34}$, $B_4 = A_4 - M_{34}$; denote the formant with the largest centre frequency in the set $M_{j-1,j}$ and the formant with the smallest centre frequency in the set $M_{j,j+1}$ (when $M_{j-1,j}$ or $M_{j,j+1}$ is the empty set $\varphi$, the corresponding term is taken as the zero vector); then for each j = 1, 2, 3, 4 the following calculation can be performed:
where $[x]$ denotes the largest integer not greater than $x$ and $x$ refers to the result of the expression inside the brackets. The sets $A_1, A_2, A_3, A_4$ are then emptied and each formant is placed into its corresponding set $A_j$, so that every $A_j$ (j = 1, 2, 3, 4) contains at most one non-zero formant; the value of j is the sequence-adjusted position;
(3) correspondingly, the formants of the syllable finals of $S_{tm}$, $S_{pn}$ or $S_{rr}$ are subjected to the same operation as the formants of the corresponding syllable finals in $S_0$, so that the number and actual positions of the formants of the same syllable final remain consistent;
(4) steps (1) to (3) are performed on every homogeneous-final formant data pair in $W_h$, finally yielding the set $W_h'$ of data pairs after adjustment of all formant sequences.
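A minimal sketch of the binning in step (1) is given below; the centre-frequency ranges for $A_1$ to $A_4$ are assumed values chosen only for demonstration, since the preset ranges are specified elsewhere in the invention.

```python
import numpy as np

# Illustrative centre-frequency ranges (Hz) for the first four formants; the concrete
# boundaries used by the method are assumptions here.
BANDS = [(200, 1000), (800, 2500), (2000, 3500), (3000, 4800)]

def bin_formants(freqs):
    """Distribute formant centre frequencies into the candidate sets A1..A4 by numeric
    range; a frequency in an overlap region appears in both neighbouring sets,
    mirroring M_{j,j+1} = A_j ∩ A_{j+1}."""
    A = [[] for _ in BANDS]
    for f in freqs:
        for j, (lo, hi) in enumerate(BANDS):
            if lo <= f <= hi:
                A[j].append(f)
    return A

for j, a in enumerate(bin_formants([850, 2200, 3400]), start=1):
    print(f"A{j}: {a}")
```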
Using the speech-parameter preprocessing method for original speech and electronically disguised speech, the electronically disguised speech at different degrees of disguise is converted into the standard format of 8000 Hz sampling rate, 16 bits, mono, and imported into the forensic identification intelligent speech workstation. The audio of multi-speaker dialogues is first separated by speaker; the wide-band spectrogram of each word in each recording is then collected, and the centre frequency, bandwidth and intensity of the formants of the final of each word are generated by long-term-average LPC (linear predictive coding) analysis. The differences between the formant centre parameters of the words in the original speech and in the electronically disguised speech at each degree of disguise are computed and recorded as the initial restoration feature $h_0$.
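As an illustrative sketch of the last operation above, the initial restoration feature $h_0$ can be formed as the element-wise difference between the formant centre parameters of the original speech and those of the electronically disguised speech; the array shapes and example values below are assumptions for demonstration only.

```python
import numpy as np

def initial_restoration_feature(orig_formants, disguised_formants):
    """Initial restoration feature h0: per-syllable differences between the formant
    centre frequencies of the original speech and of the electronically disguised
    speech (shapes: [num_syllables, num_formants])."""
    return np.asarray(orig_formants, dtype=float) - np.asarray(disguised_formants, dtype=float)

h0 = initial_restoration_feature([[700, 1220, 2600, 3400]],
                                 [[640, 1110, 2380, 3150]])
print(h0)   # [[ 60. 110. 220. 250.]]
```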
The reasons why the invention uses a multi-layer dilated causal convolutional neural network for machine learning are as follows: (1) the convolutional neural network (CNN), first proposed by Hubel and Wiesel, is a feed-forward neural network; each CNN neuron is typically composed of a feature-extraction layer, responsible for extracting local features from the previous neuron, and a feature-mapping layer formed by the feature-mapping planes needed in that neuron's computation; (2) let the convolution input be an m×n matrix A and the convolution kernel (also called the filter) a p×q matrix B:
The essence of convolution is weighted stacking; the new matrix formed by the operation is the output (feature map) generated by the convolution, denoted C. Any element of C can be computed as $C_{x,y} = \sum_{i=1}^{p} \sum_{j=1}^{q} A_{x+i-1,\,y+j-1}\, B_{i,j}$.
(3) Here $1 \le x \le m-p+1$ and $1 \le y \le n-q+1$, from which it follows that the output dimension O of the convolution is related to the input dimension I and the kernel dimension K by
$O = I - K + 1$
In general, the data dimension of the convolution output C is smaller than that of the input A; if the two dimensions are to be kept equal, this can be achieved by padding (a code sketch of this computation is given after item (5) below);
(4) the dilated causal convolutional network (DC-CNN) combines two kinds of convolution, causal convolution and dilated convolution. Causal convolution is mostly used for data with an inherent ordering and models long serialized data well. Dilated convolution uses a sparsified kernel: it enlarges the receptive field by skipping part of the input data, i.e. zeros are inserted into the original kernel according to a fixed rule to produce a "dilated" kernel. As shown in FIG. 2, the grey rectangles represent the receptive field at a given dilation factor and the dots represent the actual kernel taps; clearly, the receptive field grows exponentially as the dilation factor increases;
(5) if causal convolution alone were used, an extremely deep network or an extremely large kernel would be needed to obtain good training results, but an overly deep network and overly large convolutions not only reduce computational efficiency greatly but also make training prone to poor convergence or degradation. Combining dilated convolution with causal convolution enlarges the receptive field without increasing the number of layers or the kernel size, so the model performs very well on serialized signal data; the training effect is shown in FIG. 3.
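Purely as an illustrative sketch of the computations described in items (2), (3) and (5), the following Python code implements the weighted-stacking (valid) convolution and a one-dimensional dilated causal convolution; the function names, kernel values and the dilation schedule 1, 2, 4, 8 are assumptions made for demonstration.

```python
import numpy as np

def conv2d_valid(A, B):
    """Weighted stacking of a p x q kernel B over an m x n input A.  The output C has
    shape (m-p+1, n-q+1), matching O = I - K + 1; zero padding would be needed to keep
    the input and output dimensions equal."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    m, n = A.shape
    p, q = B.shape
    C = np.zeros((m - p + 1, n - q + 1))
    for x in range(C.shape[0]):
        for y in range(C.shape[1]):
            C[x, y] = np.sum(A[x:x + p, y:y + q] * B)
    return C

def dilated_causal_conv1d(x, w, dilation):
    """1-D causal convolution with a dilated kernel: the output at time t depends only
    on x[t], x[t-d], x[t-2d], ... (d = dilation), so no future samples are used."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(len(w)):
            j = t - i * dilation
            if j >= 0:
                y[t] += w[i] * x[j]
    return y

print(conv2d_valid(np.arange(16).reshape(4, 4), np.ones((2, 2))).shape)  # (3, 3)

# Stacking kernel-size-2 layers with dilations 1, 2, 4, 8 gives a receptive field of
# 1 + (2 - 1) * (1 + 2 + 4 + 8) = 16 samples, i.e. it grows exponentially with depth.
y = np.random.randn(64)
for d in (1, 2, 4, 8):
    y = dilated_causal_conv1d(y, np.array([0.5, 0.5]), d)
print(y.shape)  # (64,)
```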
The function and principle of the restoration-control sub-network designed in the invention are as follows: to ensure that the restoration factor h keeps the same dimensions as the input speech signal, a restoration-control sub-network is built into the model. The sub-network first applies a 1×1 convolution to the initial restoration feature $h_0$ fed into the model, reducing its dimensionality and fusing features across channels; it then applies a tanh non-linearity; finally, it up-samples the data by deconvolution to the sampling rate of the input speech signal, which yields the restoration factor h.
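A simplified, non-limiting sketch of this sub-network is given below; it replaces the learned 1×1 convolution and deconvolution with a random linear map and nearest-neighbour up-sampling, so it only illustrates the data flow (channel fusion, tanh non-linearity, up-sampling to the sample rate), and all names and shapes are assumptions.

```python
import numpy as np

def reduction_control_subnetwork(h0, W, target_len):
    """Sketch of the restoration-control sub-network: per-frame channel mixing with
    weight matrix W (standing in for a learned 1x1 convolution), tanh non-linearity,
    then nearest-neighbour up-sampling in place of a learned transposed convolution,
    so that the restoration factor h has one value per speech sample."""
    h0 = np.asarray(h0, dtype=float)            # shape: (frames, in_channels)
    z = np.tanh(h0 @ W)                         # 1x1 conv == per-frame linear map
    reps = int(np.ceil(target_len / z.shape[0]))
    return np.repeat(z, reps, axis=0)[:target_len]

h0 = np.random.randn(10, 4)                     # 10 frames, 4 formant-difference channels
W = np.random.randn(4, 1) * 0.1
h = reduction_control_subnetwork(h0, W, target_len=8000)
print(h.shape)                                  # (8000, 1)
```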
The role of the restoration factor h in the gate activation function is as follows: in speech-signal modelling, non-linear models perform better than ordinary linear models, so gated activation units (GAU) are used to give the neural network layered non-linear mapping capability and to make the DC-CNN disguised-speech restoration model better at speech generation. The restoration factor h generated by the restoration-control sub-network is processed by the transposed filter convolution filter $V_{f,k}^{T}$ and the transposed gate convolution filter $V_{g,k}^{T}$ and integrated with the convolved electronically disguised speech as $z_k = \tanh(W_{f,k} * x + V_{f,k}^{T} * h) \odot \sigma(W_{g,k} * x + V_{g,k}^{T} * h)$.
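The gated activation with conditioning can be sketched in Python as follows; all weight matrices stand in for the learned filters, and their shapes are assumptions made for demonstration.

```python
import numpy as np

def gated_activation(x_conv, h, Wf, Wg, Vf, Vg):
    """Gated activation with a conditioning term: the filter path (tanh) and the gate
    path (sigmoid) each combine the convolved speech signal with a projection of the
    restoration factor h, and their outputs are multiplied element-wise."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    filt = np.tanh(x_conv @ Wf + h @ Vf)
    gate = sigmoid(x_conv @ Wg + h @ Vg)
    return filt * gate

T, C = 100, 8
x_conv = np.random.randn(T, C)                  # output of the dilated causal convolution
h = np.random.randn(T, 1)                       # restoration factor, same time length
Wf = Wg = np.random.randn(C, C) * 0.1
Vf = Vg = np.random.randn(1, C) * 0.1
print(gated_activation(x_conv, h, Wf, Wg, Vf, Vg).shape)   # (100, 8)
```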
The basic structure of the invention is shown in FIG. 4. The speech sample $x_t$ is predicted and restored from the input speech samples before time t; $x_t$ depends only on $x_1, x_2, \dots, x_{t-1}$ and on the restoration factor h. The restoration factor h is obtained from the initial restoration feature $h_0$ through the restoration-control sub-network. The multidimensional joint distribution of a speech-signal sequence $\mathbf{x} = (x_1, \dots, x_T)$ over a period of time can be expressed as $p(\mathbf{x} \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, h)$.
To generate the restored speech sequence according to the conditional probabilities above, the main body of the neural network in the DC-CNN disguised-speech restoration model is built by stacking multiple dilated causal convolution blocks and by modelling with the gate activation function.
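Generation according to this factorization proceeds sample by sample; the sketch below shows such a loop with a uniform placeholder predictor standing in for the trained DC-CNN, so it illustrates the sampling procedure rather than the model itself.

```python
import numpy as np

def generate_autoregressive(predict_next, h, length, context=16):
    """Autoregressive generation sketch: each restored sample x_t is drawn from a
    categorical distribution over 256 mu-law levels conditioned on the preceding
    samples and the restoration factor h.  predict_next is a placeholder for the
    trained network; any function returning 256 probabilities will do."""
    x = []
    for _ in range(length):
        probs = predict_next(np.array(x[-context:]), h)
        x.append(int(np.random.choice(256, p=probs)))
    return np.array(x)

uniform = lambda past, h: np.full(256, 1.0 / 256)   # stand-in model
print(generate_autoregressive(uniform, h=None, length=10))
```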
The invention uses a skip-layer residual structure to promote model convergence and to propagate gradients to deeper layers, alleviating the performance degradation caused by deepening the neural network. In a residual block, the input of the layer is identity-mapped to the output through a skip-layer connection and added to the residual output of the convolution; the desired output is obtained after the optimization computation.
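The residual computation of a skip-layer block, y = x + F(x), can be sketched as follows; the residual branch used here is a toy function.

```python
import numpy as np

def residual_block(x, layer_fn):
    """Skip-layer residual block: the block input is identity-mapped to the output and
    added to the residual computed by the convolutional layer(s), y = x + F(x), which
    keeps gradients flowing through deep stacks."""
    return x + layer_fn(x)

x = np.random.randn(64)
y = residual_block(x, lambda v: 0.1 * np.tanh(v))   # toy residual branch
print(np.allclose(y, x + 0.1 * np.tanh(x)))         # True
```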
The output layer of the invention uses a Softmax function to discretize and classify the data obtained after the synthesis operation, as follows:
(1) the input of the Softmax function is an N-dimensional real vector $\mathbf{a} = (a_1, a_2, \dots, a_N)$, and the function is $\sigma(\mathbf{a})_i = \dfrac{e^{a_i}}{\sum_{j=1}^{N} e^{a_j}}, \ i = 1, \dots, N$;
(2) in essence, the Softmax function maps an arbitrary N-dimensional real vector to an N-dimensional vector $\mathbf{a}' = (a'_1, a'_2, \dots, a'_N)$ whose elements all lie in (0, 1), thereby normalizing the vector:
$a'_1, a'_2, \dots, a'_N \in (0, 1)$;
(3) the elements $a'_1, a'_2, \dots, a'_N$ produced by the mapping represent the classification output probabilities of $a_1, a_2, \dots, a_N$ in the original vector, so for a given N-dimensional input the probability of each class is known;
(4) to reduce the computational load of the model, μ-law companding is applied so that the amount of output data falls to $2^8$ levels, which improves the prediction efficiency of the model.
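A sketch of the Softmax discretization and of μ-law companding to $2^8$ levels follows; μ = 255 is the conventional companding parameter and is assumed here, since the text specifies only the number of output levels.

```python
import numpy as np

def softmax(a):
    """Normalise an N-dimensional real vector to values in (0, 1) that sum to 1."""
    e = np.exp(a - a.max())              # subtract the max for numerical stability
    return e / e.sum()

def mu_law_encode(x, mu=255):
    """mu-law companding of samples in [-1, 1], then quantisation to 2^8 = 256 levels."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())                            # elements in (0, 1), summing to 1
print(mu_law_encode(np.array([-1.0, 0.0, 0.5, 1.0])))  # [  0 128 239 255]
```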
In summary, the speech restoration method based on a convolutional neural network proposed by the invention comprises: step S1, collecting the electronically disguised speech and extracting its acoustic parameters; step S2, preprocessing the acoustic parameters with the preprocessing model so as to convert them into a standard sequence of preset dimensions; and step S3, restoring the electronically disguised speech to a restored speech sequence with the restoration model, the restoration model completing the restoration of the standard speech sequence according to the standard sequence; wherein, in step S2, the preprocessing comprises formant-breakage cleaning, formant-merging optimization and formant-sequence adjustment. A convolutional neural network is used for the first time to restore electronically disguised speech; the restoration quality is high enough to satisfy demanding restoration scenarios for electronically disguised speech.
The description and drawings give typical examples of the specific structures of particular embodiments; other variations may be made within the spirit of the invention. Although the foregoing presents preferred embodiments, they are not intended to be limiting.
Various changes and modifications will no doubt become apparent to those skilled in the art upon reading the above description. The appended claims are therefore to be construed as covering all changes and modifications within the true intent and scope of the invention; any and all equivalents within the scope of the claims are to be regarded as falling within the intent and scope of the invention.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810937126.XA CN109215680B (en) | 2018-08-16 | 2018-08-16 | A speech restoration method based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109215680A CN109215680A (en) | 2019-01-15 |
CN109215680B true CN109215680B (en) | 2020-06-30 |
Family
ID=64988551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810937126.XA Active CN109215680B (en) | 2018-08-16 | 2018-08-16 | A speech restoration method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109215680B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110648B (en) * | 2019-04-30 | 2020-03-17 | 北京航空航天大学 | Action nomination method based on visual perception and artificial intelligence |
CN110600042B (en) * | 2019-10-10 | 2020-10-23 | 公安部第三研究所 | A method and system for gender recognition of disguised voice speakers |
CN110728993A (en) * | 2019-10-29 | 2020-01-24 | 维沃移动通信有限公司 | Voice change identification method and electronic equipment |
DE102020202878A1 (en) * | 2020-03-06 | 2021-09-09 | Robert Bosch Gesellschaft mit beschränkter Haftung | Method for monitoring the operation of at least one fuel cell device and fuel cell device |
DE102020202881A1 (en) * | 2020-03-06 | 2021-09-09 | Robert Bosch Gesellschaft mit beschränkter Haftung | Method for monitoring the operation of at least one fuel cell device and fuel cell device |
CN111009258A (en) * | 2020-03-11 | 2020-04-14 | 浙江百应科技有限公司 | Single sound channel speaker separation model, training method and separation method |
CN111739546A (en) * | 2020-07-24 | 2020-10-02 | 深圳市声扬科技有限公司 | Sound-changing voice reduction method and device, computer equipment and storage medium |
CN111739547B (en) * | 2020-07-24 | 2020-11-24 | 深圳市声扬科技有限公司 | Voice matching method and device, computer equipment and storage medium |
CN114648974B (en) * | 2020-12-17 | 2025-02-18 | 南京理工大学 | Speech synthesis method and system based on speech radar and deep learning |
TWI768676B (en) * | 2021-01-25 | 2022-06-21 | 瑞昱半導體股份有限公司 | Audio processing method and audio processing device, and associated non-transitory computer-readable medium |
CN114299961B (en) * | 2021-09-27 | 2025-03-14 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, equipment, storage medium and program product |
CN115831127B (en) * | 2023-01-09 | 2023-05-05 | 浙江大学 | Method, device and storage medium for constructing voiceprint reconstruction model based on speech conversion |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010066269A1 (en) * | 2008-12-10 | 2010-06-17 | Agnitio, S.L. | Method for verifying the identify of a speaker and related computer readable medium and computer |
CN103730121A (en) * | 2013-12-24 | 2014-04-16 | 中山大学 | Method and device for recognizing disguised sounds |
CN104464724A (en) * | 2014-12-08 | 2015-03-25 | 南京邮电大学 | Speaker recognition method for deliberately pretended voices |
CN107563758A (en) * | 2017-07-18 | 2018-01-09 | 厦门快商通科技股份有限公司 | A kind of finance letter that solves examines the detection method and system that habitual offender swindles in business |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6427137B2 (en) * | 1999-08-31 | 2002-07-30 | Accenture Llp | System, method and article of manufacture for a voice analysis system that detects nervousness for preventing fraud |
US9767806B2 (en) * | 2013-09-24 | 2017-09-19 | Cirrus Logic International Semiconductor Ltd. | Anti-spoofing |
US9472195B2 (en) * | 2014-03-26 | 2016-10-18 | Educational Testing Service | Systems and methods for detecting fraud in spoken tests using voice biometrics |
Also Published As
Publication number | Publication date |
---|---|
CN109215680A (en) | 2019-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109215680B (en) | A speech restoration method based on convolutional neural network | |
CN112818861B (en) | A sentiment classification method and system based on multimodal contextual semantic features | |
CN110751208B (en) | An emotion recognition method for prisoners based on multimodal feature fusion based on self-weight differential encoder | |
Kim et al. | Person authentication using face, teeth and voice modalities for mobile device security | |
Kishore et al. | A video based Indian sign language recognition system (INSLR) using wavelet transform and fuzzy logic | |
Zhang et al. | Study on CNN in the recognition of emotion in audio and images | |
CN109063666A (en) | The lightweight face identification method and system of convolution are separated based on depth | |
CN109460737A (en) | A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network | |
CN106952649A (en) | Speaker Recognition Method Based on Convolutional Neural Network and Spectrogram | |
CN111738058B (en) | A reconstruction attack method for biological template protection based on generative adversarial networks | |
Khdier et al. | Deep learning algorithms based voiceprint recognition system in noisy environment | |
CN108664911A (en) | A kind of robust human face recognition methods indicated based on image sparse | |
CN112101096A (en) | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression | |
Suja et al. | Analysis of emotion recognition from facial expressions using spatial and transform domain methods | |
Zhiyan et al. | Speech emotion recognition based on deep learning and kernel nonlinear PSVM | |
Ali et al. | Face recognition system based on four state hidden Markov model | |
Hizlisoy et al. | Text independent speaker recognition based on MFCC and machine learning | |
Nefian et al. | A Bayesian approach to audio-visual speaker identification | |
Kadyrov et al. | Speaker recognition from spectrogram images | |
CN110738985A (en) | Cross-modal biometric feature recognition method and system based on voice signals | |
Cheng et al. | Fractal dimension pattern-based multiresolution analysis for rough estimator of speaker-dependent audio emotion recognition | |
Al-Thahab | Speech recognition based radon-discrete cosine transforms by Delta Neural Network learning rule | |
JP2016162437A (en) | Pattern classification device, pattern classification method and pattern classification program | |
Kumari et al. | Experimental Analysis of face and Iris biometric traits based on the fusion approach | |
Sasikumar | A neural network based facial expression analysis using Gabor wavelets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||