
CN111816166A - Voice recognition method, apparatus, and computer-readable storage medium storing instructions - Google Patents


Info

Publication number: CN111816166A
Application number: CN202010694750.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: input audio, features, layer, domain, transformation
Legal status: Pending (the status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Inventors: 黎吉国, 许继征, 张莉, 王悦, 马思伟
Current assignee: Peking University; ByteDance Inc (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Peking University; ByteDance Inc
Application filed by Peking University and ByteDance Inc


Classifications

    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G06F 18/253 — Fusion techniques of extracted features
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 25/24 — Speech or voice analysis techniques in which the extracted parameters are cepstral


Abstract

A voice recognition method, an apparatus, and a computer-readable storage medium storing instructions are provided. The voice recognition method includes: acquiring time-domain features of input audio; acquiring frequency-domain features of the input audio; fusing the time-domain features and the frequency-domain features of the input audio; and performing voice recognition based on the fused features.

Description

Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Technical Field

The present disclosure relates to the technical field of voice recognition, and in particular to a voice recognition method and a voice recognition apparatus.

Background

Voice recognition is a technology that analyzes the sound emitted by an object and compares it with sounds in a database to determine which object produced it. Voice recognition has many applications, for example speaker recognition, biometrics, and gender/age recognition. Speaker recognition, also known as voiceprint recognition, is a biometric technology that plays an important role in speech processing because it is widely used in biometric verification, forensics, and security. At present, the recognition performance of traditional voice recognition schemes is rather limited, and the effectiveness of voice recognition needs to be further improved.

Summary of the Invention

Embodiments of the present disclosure disclose a voice recognition method to improve the effectiveness of voice recognition.

According to an aspect of the present disclosure, a voice recognition method is provided, including: acquiring time-domain features of input audio; acquiring frequency-domain features of the input audio; fusing the time-domain features and the frequency-domain features of the input audio; and performing voice recognition based on the fused features.

Optionally, fusing the time-domain features of the input audio and the frequency-domain features of the input audio may include: concatenating and transforming the time-domain features and the frequency-domain features of the input audio to obtain the fused features.

Optionally, concatenating and transforming the time-domain features and the frequency-domain features of the input audio to obtain the fused features may include: concatenating the time-domain features and the frequency-domain features of the input audio to obtain concatenated features; and performing a two-layer fully connected transformation on the concatenated features to obtain the fused features.

Optionally, concatenating and transforming the time-domain features and the frequency-domain features of the input audio to obtain the fused features may include: performing a one-layer fully connected transformation on the time-domain features of the input audio to obtain a first transformed feature; performing a one-layer fully connected transformation on the frequency-domain features of the input audio to obtain a second transformed feature; concatenating the first transformed feature and the second transformed feature to obtain concatenated features; and performing a one-layer fully connected transformation on the concatenated features to obtain the fused features.

Optionally, concatenating and transforming the time-domain features and the frequency-domain features of the input audio to obtain the fused features may include: performing a two-layer fully connected transformation on the time-domain features of the input audio to obtain a third transformed feature; performing a two-layer fully connected transformation on the frequency-domain features of the input audio to obtain a fourth transformed feature; and concatenating the third transformed feature and the fourth transformed feature to obtain the fused features.

According to another aspect of the present disclosure, a voice recognition apparatus is provided, including: a time-domain feature acquisition module configured to acquire time-domain features of input audio; a frequency-domain feature acquisition module configured to acquire frequency-domain features of the input audio; and a voice recognition module configured to fuse the time-domain features and the frequency-domain features of the input audio and perform voice recognition based on the fused features.

Optionally, the voice recognition module may be configured to concatenate and transform the time-domain features and the frequency-domain features of the input audio to obtain the fused features.

Optionally, the voice recognition module may be configured to: concatenate the time-domain features and the frequency-domain features of the input audio to obtain concatenated features; and perform a two-layer fully connected transformation on the concatenated features to obtain the fused features.

Optionally, the voice recognition module may be configured to: perform a one-layer fully connected transformation on the time-domain features of the input audio to obtain a first transformed feature; perform a one-layer fully connected transformation on the frequency-domain features of the input audio to obtain a second transformed feature; concatenate the first transformed feature and the second transformed feature to obtain concatenated features; and perform a one-layer fully connected transformation on the concatenated features to obtain the fused features.

Optionally, the voice recognition module may be configured to: perform a two-layer fully connected transformation on the time-domain features of the input audio to obtain a third transformed feature; perform a two-layer fully connected transformation on the frequency-domain features of the input audio to obtain a fourth transformed feature; and concatenate the third transformed feature and the fourth transformed feature to obtain the fused features.

According to another aspect of the present disclosure, a voice recognition apparatus is provided, including at least one computing device and at least one storage device storing computer instructions, wherein the computer instructions, when run on the at least one computing device, cause the at least one computing device to execute the voice recognition method according to the present disclosure.

According to another aspect of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein the instructions, when run on at least one computing device, cause the at least one computing device to execute the voice recognition method according to the present disclosure.

According to the voice recognition method and apparatus of the exemplary embodiments of the present disclosure, time-domain information and frequency-domain information of the sound signal are used jointly, in a fused manner, to perform voice recognition; the temporal and spectral information of the sound signal is thus fully exploited, which improves recognition performance. For example, when the voice recognition method and apparatus according to the exemplary embodiments of the present disclosure are applied to speaker recognition, voiceprint recognition is performed jointly using the time-domain and frequency-domain information of the speech signal, which fully exploits the temporal and spectral information of the speech signal and improves speaker recognition performance.

In addition, according to the voice recognition method and apparatus of the exemplary embodiments of the present disclosure, the time-domain features and the frequency-domain features are transformed together into the classification feature space through early fusion, so the transformation is performed jointly on the time-domain and frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, and a good voice recognition result can therefore be achieved.

Brief Description of the Drawings

These and/or other aspects and advantages of the present disclosure will become apparent and more readily understood from the following description of embodiments, taken in conjunction with the accompanying drawings, in which:

Figures 1a and 1b are schematic diagrams of existing speaker recognition models.

Figure 2 is a schematic diagram illustrating a Time-Frequency Network (TFN) model according to an exemplary embodiment of the present disclosure.

Figures 3a, 3b, and 3c are schematic diagrams of fusion schemes according to exemplary embodiments of the present disclosure.

Figure 4 is a flowchart illustrating a voice recognition method according to an exemplary embodiment of the present disclosure.

Figure 5 is a block diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.

Figures 6a, 6b, 6c, and 6d are schematic diagrams of speaker recognition systems.

Detailed Description

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the present disclosure as defined by the claims and their equivalents. It includes various specific details to aid that understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

It should be noted that, in the present disclosure, "at least one of several items" covers three parallel cases: any one of the several items, a combination of any plurality of the several items, and all of the several items. For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step 1 and step 2" covers the following three parallel cases: (1) performing step 1; (2) performing step 2; (3) performing step 1 and step 2.

In the field of voice recognition, for example in speaker recognition applications, current speaker recognition models can be divided into time-domain models and frequency-domain models according to the type of input data. Figures 1a and 1b show schematic diagrams of existing speaker recognition models. As shown in Figure 1a, a time-domain model uses the raw speech waveform (a time-domain representation of the speech signal) as input; that is, the time-domain model uses only the time-domain information of the raw speech to perform speaker recognition. As shown in Figure 1b, a frequency-domain model uses the spectrum (a frequency-domain representation of the speech signal) as input; that is, the frequency-domain model uses only frequency-domain information to perform speaker recognition. As a result, neither type of current speaker recognition model achieves the best possible performance. The frequency-domain model and the time-domain model are introduced in detail below.

Frequency-domain models

Before deep neural networks (DNNs) came into use, most speaker recognition methods used frequency-domain features to classify speech signals, for example Gaussian Mixture Models (GMMs) and i-vector representations of speech segments. These methods operate on hand-crafted frequency-domain features such as filter banks (FBANK) or Mel-frequency cepstral coefficients (MFCC). With the widespread adoption of DNNs, DNNs have also been designed to automatically extract frequency-domain features for speaker recognition. However, all of these methods process only the frequency-domain signal and ignore the time-domain information.

For example, traditional frequency-domain speaker recognition methods typically take one of the following approaches: (1) combining GMM supervectors with a Support Vector Machine (SVM) and deriving a linear kernel based on a KL-distance approximation between two GMM models; (2) modeling speaker and channel variability and proposing a new low-dimensional global representation of speech, called the identity vector (i-vector), which is the basis of most frequency-domain speaker recognition methods; (3) building on i-vectors, using a pre-trained DNN to produce frame alignments, improving the equal error rate by 30% compared with traditional systems; (4) introducing the x-vector, which extracts a fixed-length global vector by training a DNN on frequency features (such as FBANK) with data augmentation. All of these methods use a spectral representation of the speech signal as input, for example Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) analysis, or linear prediction cepstral coefficients (LPCC). Although such spectra also contain temporal information, the time-frequency transform (such as the short-time Fourier transform, STFT) significantly reduces their temporal resolution compared with that of the original speech signal. Specifically, a window-based time-frequency transform (such as the STFT) slides a window with a hop size over the signal and transforms each segment into the frequency domain to produce time-frequency features, so the temporal resolution drops from N to N//hop (where // denotes integer division, i.e., rounding down). Therefore, existing speaker recognition methods that use the spectrum as input cannot learn temporal features well and thus cannot fully exploit the time-domain information.
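For concreteness, the following minimal sketch (not part of the original disclosure; the sampling rate, window length, and hop size are illustrative assumptions) shows how a windowed time-frequency transform reduces the temporal resolution from N samples to roughly N//hop frames:

```python
# Illustration of the temporal-resolution loss caused by a framed STFT-like transform.
import numpy as np

sample_rate = 16000          # assumed sampling rate
N = sample_rate * 3          # a 3-second waveform: 48000 samples
win_length = 400             # assumed 25 ms analysis window
hop_length = 160             # assumed 10 ms hop

signal = np.random.randn(N).astype(np.float32)

# Frame the signal and apply an FFT per frame (a bare-bones STFT).
num_frames = 1 + (N - win_length) // hop_length
window = np.hanning(win_length)
frames = np.stack([
    signal[i * hop_length: i * hop_length + win_length] * window
    for i in range(num_frames)
])
spectrum = np.abs(np.fft.rfft(frames, axis=1))   # shape: (num_frames, win_length//2 + 1)

print(N, "time samples ->", num_frames, "spectral frames")   # 48000 -> 298
```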

Time-domain models

After convolutional neural networks (CNNs) successfully solved large-scale image classification problems and demonstrated strong capabilities in modeling high-dimensional data, CNNs were applied to speaker recognition directly in the time domain. In recent years, end-to-end models designed to extract features directly from the raw speech waveform have shown better performance than traditional methods that use only frequency-domain features. For example, the SincNet model, which designs its first-layer filters as learnable band-pass filters, has shown good performance in speaker recognition.

However, recent deep-learning-based methods that use the raw speech signal as input cannot learn frequency features well, because no frequency optimization or frequency transform is applied in their frameworks. Specifically, the deep neural network learns many small convolutional filters along the time axis, and the model is used to classify the speech signal. In other words, in existing neural-network-based speaker recognition models, filters are learned only from the raw speech signal, which has only a time axis; the frequency-domain signal is therefore ignored. Such methods include: (1) using an end-to-end CNN that extracts only the time-domain information of the raw speech to perform speaker recognition; (2) combining a CNN with a Long Short-Term Memory network (LSTM) to extract a global vector from the raw speech signal for speaker recognition; (3) the SincNet model, which replaces the first layer of the CNN with learnable band-pass filters to obtain better interpretability and improve performance. Here, although the SincNet model exploits the frequency information of the speech signal through learnable band-pass filters, it still takes only the raw speech waveform as input and therefore cannot fully exploit the frequency-domain information.

For sound signal analysis, however, both frequency-domain information and time-domain information are important, and the best results cannot be achieved if information from either domain is missing. To make full use of time-domain and frequency-domain information, and to learn time-domain and frequency-domain feature representations with shared or non-shared branches, the present disclosure proposes a new Time-Frequency Network (TFN) model that takes both the raw audio waveform and the spectrum as input. Specifically, the TFN model designed in the present disclosure may include a time branch model for extracting time-domain information from the raw audio waveform, a frequency-domain branch model for extracting frequency-domain information from the spectral signal, and a fusion model that fuses the time-domain information and the frequency-domain information to perform voice recognition. The output of the TFN model may be a predicted distribution over the persons producing the sound (for example, when applied to speaker recognition, the output of the TFN model may be a predicted distribution over speakers). Hereinafter, a voice recognition method and a voice recognition apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to Figures 2 to 6d.

Figure 2 is a schematic diagram illustrating a TFN model according to an exemplary embodiment of the present disclosure.

Referring to Figure 2, the TFN model 200 may include three sub-models: a time branch model 201, a frequency-domain branch model 202, and a fusion model 203.

The time branch model 201 may be designed to extract time-domain features from the raw audio waveform of the input audio. The time branch model 201 may be implemented with any available model that extracts time-domain features of the input audio, and the present disclosure places no restriction on this. For example, a multi-layer CNN or a recurrent neural network (RNN) may be used to implement the time branch model 201 to extract local time-domain features from the raw audio signal.

According to an exemplary embodiment of the present disclosure, the time branch model 201 may be designed as a SincNet model (or as another ordinary CNN or RNN model). In this case, the first layer of the time branch model 201 is designed as a band-pass filter to model frequency characteristics. Here, the band-pass filter can be expressed as formula (1) below.

g[n, f1, f2] = 2f2·sinc(2πf2n) − 2f1·sinc(2πf1n)    (1)

where g[·] denotes the output of the band-pass filter, n denotes the kernel index (the kernel size of the band-pass filter determines the range of n), f1 denotes the lower cutoff frequency, f2 denotes the upper cutoff frequency, and sinc(x) = sin(x)/x.
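As an illustration only (not the patent's implementation; the kernel size and cutoff frequencies below are assumed values), formula (1) can be realized as the difference of two sinc low-pass kernels:

```python
# Minimal sketch of the band-pass kernel in formula (1), with illustrative parameters.
import numpy as np

def sinc(x):
    # The text defines sinc(x) = sin(x)/x (note: numpy's np.sinc is the normalized variant).
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

def bandpass_kernel(kernel_size, f1, f2):
    """g[n, f1, f2] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n),
    with f1 < f2 given as fractions of the sampling rate (assumed convention)."""
    n = np.arange(-(kernel_size // 2), kernel_size // 2 + 1).astype(np.float64)
    return 2 * f2 * sinc(2 * np.pi * f2 * n) - 2 * f1 * sinc(2 * np.pi * f1 * n)

# Example: a 300 Hz - 3400 Hz band at a 16 kHz sampling rate, 251-tap kernel (assumed sizes).
kernel = bandpass_kernel(kernel_size=251, f1=300 / 16000, f2=3400 / 16000)
print(kernel.shape)   # (251,)
```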

By designing the first-layer filters of the time branch model 201 as band-pass filters, the time branch model 201 can have fewer parameters and better interpretability.

In addition, the other layers of the time branch model 201 may be typical one-dimensional convolutional layers (Conv), batch normalization layers (BN), and activation layers (ReLU). After several convolutional, batch normalization, and activation layers, the time branch model 201 outputs the time-domain features.
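A minimal sketch of such a time branch is given below. It is not the patent's exact architecture: the channel counts, kernel sizes, strides, and embedding dimension are assumptions chosen for illustration, and the first convolution could be replaced by a sinc band-pass layer as in formula (1).

```python
# Illustrative time branch: 1-D Conv + BatchNorm + ReLU stack over the raw waveform.
import torch
import torch.nn as nn

class TimeBranch(nn.Module):
    def __init__(self, embedding_dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=251, stride=5), nn.BatchNorm1d(80), nn.ReLU(),
            nn.Conv1d(80, 60, kernel_size=5, stride=2), nn.BatchNorm1d(60), nn.ReLU(),
            nn.Conv1d(60, 60, kernel_size=5, stride=2), nn.BatchNorm1d(60), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)          # collapse the time axis
        self.fc = nn.Linear(60, embedding_dim)        # time-domain feature embedding

    def forward(self, waveform):                      # waveform: (batch, 1, num_samples)
        x = self.layers(waveform)
        x = self.pool(x).squeeze(-1)                  # (batch, 60)
        return self.fc(x)                             # (batch, embedding_dim)

time_features = TimeBranch()(torch.randn(2, 1, 16000))
print(time_features.shape)                            # torch.Size([2, 256])
```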

The frequency-domain branch model 202 may be designed to extract frequency-domain features from the spectrum of the input audio. According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or similar representations may be used as the spectrum from which frequency-domain features are extracted. The frequency-domain branch model 202 may be implemented with any available model that extracts frequency-domain features from a spectral signal, and the present disclosure places no restriction on this. For example, a one-dimensional or two-dimensional multi-layer CNN may be used to implement the frequency-domain branch model 202 to extract frequency-domain features from the spectral signal, or any GMM, DNN, or RNN may be used for this purpose.
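The following sketch illustrates one possible frequency branch under assumed shapes (40 MFCC coefficients, a small 2-D CNN, and a 256-dimensional embedding); the actual feature type and network depth are left open by the disclosure.

```python
# Illustrative frequency branch: a small 2-D CNN over an MFCC-like spectrogram.
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    def __init__(self, embedding_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # pool over frequency and time
        )
        self.fc = nn.Linear(64, embedding_dim)          # frequency-domain feature embedding

    def forward(self, spectrum):                        # spectrum: (batch, 1, n_mfcc, num_frames)
        x = self.conv(spectrum).flatten(1)               # (batch, 64)
        return self.fc(x)                                 # (batch, embedding_dim)

freq_features = FrequencyBranch()(torch.randn(2, 1, 40, 300))
print(freq_features.shape)                                # torch.Size([2, 256])
```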

The fusion model 203 may be designed to fuse the time-domain features and the frequency-domain features and to perform voice recognition based on the fused features, that is, to output a predicted distribution. Here, fusion means transforming the time-domain features and the frequency-domain features together into the classification feature space to obtain classification features (i.e., the fused features); the fusion processing therefore includes a feature concatenation operation and a transformation operation. Specifically, the fusion model 203 may perform concatenation and transformation on the time-domain features and the frequency-domain features of the input audio to obtain the fused features. For example, the fusion model 203 may include one feature concatenation layer and multiple fully connected (FC) layers. The feature concatenation layer concatenates two feature vectors into one feature vector, and the FC layers transform feature vectors. The present disclosure places no restriction on the number and arrangement of the feature concatenation layer and the FC layers in the fusion model 203.

According to an exemplary embodiment of the present disclosure, the fusion model 203 may include one feature concatenation layer and two FC layers. Depending on the type of transformation, the fusion model 203 may have three different implementations: early fusion, mid fusion, and late fusion.

Figures 3a, 3b, and 3c are schematic diagrams of fusion schemes according to exemplary embodiments of the present disclosure.

As shown in Figure 3a, the fusion model 203 may use early fusion. In early fusion, the feature concatenation layer is placed first, and the two FC layers are placed second and third. Specifically, the fusion model 203 first concatenates the two local feature embeddings (i.e., the time-domain features output by the time branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) in the first layer to obtain a concatenated feature (i.e., a global feature); it then passes the concatenated feature through the two FC layers in the second and third layers to project (transform) it into the classification feature space, obtaining the classification feature of the input audio (i.e., the fused feature), based on which voice recognition is performed. For example, the classification feature of the input audio is passed through a softmax to obtain the predicted classification result (i.e., a probability distribution). Early fusion concatenates the time-domain and frequency-domain features first and then transforms the concatenated feature, so the transformation is performed jointly on the time-domain and frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, and a good voice recognition result can therefore be achieved.
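A minimal sketch of the early-fusion variant follows. The embedding dimension, hidden size, and number of classes are illustrative assumptions, not values taken from the disclosure.

```python
# Illustrative early fusion: concatenate first, then two FC layers into the classification space.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, embedding_dim=256, hidden_dim=512, num_classes=1000):
        super().__init__()
        self.fc1 = nn.Linear(2 * embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, time_feat, freq_feat):
        fused = torch.cat([time_feat, freq_feat], dim=-1)   # feature concatenation layer
        fused = torch.relu(self.fc1(fused))                  # first FC transformation
        logits = self.fc2(fused)                             # second FC transformation
        return torch.softmax(logits, dim=-1)                 # predicted distribution

probs = EarlyFusion()(torch.randn(2, 256), torch.randn(2, 256))
print(probs.shape)   # torch.Size([2, 1000])
```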

As shown in Figure 3b, the fusion model 203 may use mid fusion. In mid fusion, one FC layer is placed first, the feature concatenation layer is placed second, and the other FC layer is placed third. Specifically, the fusion model 203 first passes each of the two local feature embeddings (i.e., the time-domain features output by the time branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) through its own FC layer in the first layer; it then concatenates the two FC outputs in the second layer, and finally passes the concatenated global feature through an FC layer in the third layer to project it into the classification feature space, obtaining the classification feature of the input audio (i.e., the fused feature), based on which voice recognition is performed. For example, the classification feature of the input audio is passed through a softmax to obtain the predicted classification result (i.e., a probability distribution).
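A corresponding sketch of the mid-fusion variant, again with assumed dimensions:

```python
# Illustrative mid fusion: one FC per branch, concatenate, then a final FC projection.
import torch
import torch.nn as nn

class MidFusion(nn.Module):
    def __init__(self, embedding_dim=256, hidden_dim=256, num_classes=1000):
        super().__init__()
        self.fc_time = nn.Linear(embedding_dim, hidden_dim)   # yields the first transformed feature
        self.fc_freq = nn.Linear(embedding_dim, hidden_dim)   # yields the second transformed feature
        self.fc_out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, time_feat, freq_feat):
        t = torch.relu(self.fc_time(time_feat))
        f = torch.relu(self.fc_freq(freq_feat))
        fused = torch.cat([t, f], dim=-1)                      # feature concatenation layer
        return torch.softmax(self.fc_out(fused), dim=-1)       # predicted distribution

probs = MidFusion()(torch.randn(2, 256), torch.randn(2, 256))
print(probs.shape)   # torch.Size([2, 1000])
```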

As shown in Figure 3c, the fusion model 203 may use late fusion. In late fusion, the two FC layers are placed first and second, and the feature concatenation layer is placed third. Specifically, the fusion model 203 first passes each of the two local feature embeddings (i.e., the time-domain features output by the time branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) through two FC layers in the first and second layers, projecting each local feature into the classification feature space to obtain the classification feature of the time-domain features and the classification feature of the frequency-domain features; it then concatenates these two classification features in the third layer, stitching the classification features from the two low-dimensional classification feature spaces into a classification feature in a higher-dimensional classification feature space, thereby obtaining the global classification feature, based on which voice recognition is performed. For example, the global classification feature is passed through a softmax to obtain the predicted classification result (i.e., a probability distribution).
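A corresponding sketch of the late-fusion variant. As in the previous two sketches, all dimensions are assumptions; the softmax over the concatenated high-dimensional feature follows the description above.

```python
# Illustrative late fusion: two FC layers per branch, then concatenation at the end.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, embedding_dim=256, hidden_dim=256, num_classes=1000):
        super().__init__()
        self.time_head = nn.Sequential(                       # yields the third transformed feature
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes))
        self.freq_head = nn.Sequential(                       # yields the fourth transformed feature
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes))

    def forward(self, time_feat, freq_feat):
        fused = torch.cat([self.time_head(time_feat),
                           self.freq_head(freq_feat)], dim=-1)  # global classification feature
        return torch.softmax(fused, dim=-1)                      # predicted distribution

probs = LateFusion()(torch.randn(2, 256), torch.randn(2, 256))
print(probs.shape)   # torch.Size([2, 2000])
```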

The fusion model 203 may use any of the above fusion schemes to fuse the time-domain features and the frequency-domain features of the input audio, and may also use any other feasible fusion scheme. For example, the number of FC layers is not necessarily two; it may be one, or three or more. The feature concatenation layer may be placed at any position among the layers, or the fused features may be obtained directly by simply concatenating the time-domain features and the frequency-domain features of the input audio to perform voice recognition.

Figure 4 is a flowchart illustrating a voice recognition method according to an exemplary embodiment of the present disclosure.

Referring to Figure 4, in step 401, time-domain features of the input audio are acquired. Specifically, the time-domain features of the input audio can be acquired by extracting them from the raw audio waveform of the input audio. According to an exemplary embodiment of the present disclosure, the step of extracting time-domain features from the raw audio waveform may be performed by the time branch model 201. According to another exemplary embodiment of the present disclosure, the time-domain features of the input audio may be acquired from local storage, a server, or the like. The time-domain features may also be acquired in any other feasible manner; the present disclosure places no restriction on the way or source of acquisition.

According to an exemplary embodiment of the present disclosure, the time branch model 201 may be implemented with a multi-layer CNN, an RNN, or the like. According to an exemplary embodiment of the present disclosure, the time branch model 201 may be implemented with a SincNet model. Of course, the way of extracting time-domain features is not limited to this; any method may be used, for example any multi-layer CNN or RNN.

In step 402, frequency-domain features of the input audio are acquired. Specifically, the frequency-domain features can be acquired by performing a time-frequency transform (such as the short-time Fourier transform, STFT) on the raw audio signal of the input audio and extracting frequency-domain features from the resulting spectral signal. According to another exemplary embodiment of the present disclosure, the frequency-domain features of the input audio may be acquired from local storage, a server, or the like. The frequency-domain features may also be acquired in any other feasible manner; the present disclosure places no restriction on the way or source of acquisition.

According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or similar representations may be used as the spectrum from which frequency-domain features are extracted.

According to an exemplary embodiment of the present disclosure, the step of extracting frequency-domain features from the spectral signal may be performed by the frequency-domain branch model 202. According to an exemplary embodiment of the present disclosure, the frequency-domain branch model 202 may be implemented with a one-dimensional or two-dimensional multi-layer CNN. Of course, the way of extracting frequency-domain features is not limited to this; any method may be used, for example any GMM or DNN may be used to extract frequency-domain features from the spectral signal.

In addition, step 401 and step 402 may be performed sequentially, in reverse order, or in parallel; the present disclosure places no restriction on their execution order.

In step 403, the time-domain features of the input audio and the frequency-domain features of the input audio are fused, and voice recognition is performed based on the fused features. According to an exemplary embodiment of the present disclosure, the step of fusing the time-domain features and the frequency-domain features of the input audio and performing voice recognition based on the fused features may be performed by the fusion model 203.

According to an exemplary embodiment of the present disclosure, the time-domain features of the input audio and the frequency-domain features of the input audio may be concatenated and transformed to obtain the fused features. Here, after the concatenation and transformation, the time-domain and frequency-domain features are projected into the classification feature space, that is, transformed into classification features (i.e., the fused features). A softmax is applied to the classification features to obtain the predicted classification result (i.e., a probability distribution), thereby performing voice recognition.

According to an exemplary embodiment of the present disclosure, the time-domain features and the frequency-domain features of the input audio may be fused through early fusion. In early fusion, the feature concatenation layer is placed first, and the two FC layers are placed second and third. Specifically, the time-domain features and the frequency-domain features of the input audio may be concatenated (e.g., in the first layer) to obtain concatenated features, and a two-layer FC transformation may be performed on the concatenated features (e.g., in the second and third layers) to obtain the fused features. Early fusion concatenates the time-domain and frequency-domain features first and then transforms the concatenated features, so the transformation is performed jointly on both; that is, the features of both domains of the audio signal are considered more comprehensively, and a good voice recognition result can therefore be achieved.

According to an exemplary embodiment of the present disclosure, the time-domain features and the frequency-domain features of the input audio may be fused through mid fusion. In mid fusion, one FC layer is placed first, the feature concatenation layer is placed second, and the other FC layer is placed third. Specifically, a one-layer FC transformation may be performed on the time-domain features of the input audio (e.g., in the first layer) to obtain a first transformed feature, and a one-layer FC transformation may be performed on the frequency-domain features of the input audio (e.g., in the first layer) to obtain a second transformed feature (the order of the two transformations is not limited); the first transformed feature and the second transformed feature are concatenated (e.g., in the second layer) to obtain concatenated features, and a one-layer FC transformation is performed on the concatenated features (e.g., in the third layer) to obtain the fused features.

According to an exemplary embodiment of the present disclosure, the time-domain features and the frequency-domain features of the input audio may be fused through late fusion. In late fusion, the two FC layers are placed first and second, and the feature concatenation layer is placed third. Specifically, a two-layer fully connected transformation may be performed on the time-domain features of the input audio (e.g., in the first and second layers) to obtain a third transformed feature, and a two-layer fully connected transformation may be performed on the frequency-domain features of the input audio (e.g., in the first and second layers) to obtain a fourth transformed feature (the order of the two transformations is not limited); the third transformed feature and the fourth transformed feature are concatenated (e.g., in the third layer) to obtain the fused features.

Of course, the fusion method is not limited to the above; any other feasible fusion scheme may be used to transform the time-domain features and the frequency-domain features of the input audio together into the classification feature space. For example, the number of FC layers is not necessarily two; it may be one, or three or more. The feature concatenation layer may be placed at any position among the layers, or the fused features may be obtained directly by simply concatenating the time-domain features and the frequency-domain features of the input audio to perform voice recognition.

Figure 5 is a block diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.

Referring to Figure 5, a voice recognition apparatus 500 according to an exemplary embodiment of the present disclosure may include a time-domain feature acquisition module 501, a frequency-domain feature acquisition module 502, and a voice recognition module 503.

The time-domain feature acquisition module 501 may acquire time-domain features of the input audio. Specifically, the time-domain feature acquisition module 501 may acquire the time-domain features by extracting them from the raw audio waveform of the input audio. Alternatively, the time-domain feature acquisition module 501 may acquire the time-domain features from local storage, a server, or the like. The time-domain feature acquisition module 501 may also acquire the time-domain features in any other feasible manner; the present disclosure places no restriction on the way or source of acquisition.

According to an exemplary embodiment of the present disclosure, the time-domain feature acquisition module 501 may extract time-domain features from the raw audio waveform of the input audio through the time branch model 201. According to an exemplary embodiment of the present disclosure, the time branch model 201 may be implemented with a multi-layer CNN, an RNN, or the like. According to an exemplary embodiment of the present disclosure, the time branch model 201 may be implemented with a SincNet model. Of course, the way of extracting time-domain features is not limited to this; any method may be used, for example any multi-layer CNN or RNN.

The frequency-domain feature acquisition module 502 may acquire frequency-domain features of the input audio. Specifically, the frequency-domain feature acquisition module 502 may acquire the frequency-domain features by performing a time-frequency transform (such as the short-time Fourier transform, STFT) on the raw audio signal of the input audio and extracting frequency-domain features from the resulting spectral signal. Alternatively, the frequency-domain feature acquisition module 502 may acquire the frequency-domain features from local storage, a server, or the like. The frequency-domain feature acquisition module 502 may also acquire the frequency-domain features in any other feasible manner; the present disclosure places no restriction on the way or source of acquisition.

According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or similar representations may be used as the spectrum from which frequency-domain features are extracted.

According to an exemplary embodiment of the present disclosure, the frequency-domain feature acquisition module 502 may extract frequency-domain features from the spectral signal through the frequency-domain branch model 202. According to an exemplary embodiment of the present disclosure, the frequency-domain branch model 202 may be implemented with a one-dimensional or two-dimensional multi-layer CNN. Of course, the way of extracting frequency-domain features is not limited to this; any method may be used.

The voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio and perform voice recognition based on the fused features. According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may perform voice recognition through the fusion model 203.

According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may concatenate and transform the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features. Here, after the concatenation and transformation, the time-domain and frequency-domain features are projected into the classification feature space, that is, transformed into classification features (i.e., the fused features). The voice recognition module 503 may apply a softmax to the classification features to obtain the predicted classification result (i.e., a probability distribution), thereby performing voice recognition.

According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features and the frequency-domain features of the input audio through early fusion. In early fusion, the feature concatenation layer is placed first, and the two FC layers are placed second and third. Specifically, the voice recognition module 503 may concatenate the time-domain features and the frequency-domain features of the input audio (e.g., in the first layer) to obtain concatenated features, and perform a two-layer FC transformation on the concatenated features (e.g., in the second and third layers) to obtain the fused features. Early fusion concatenates the time-domain and frequency-domain features first and then transforms the concatenated features, so the transformation is performed jointly on both; that is, the features of both domains of the audio signal are considered more comprehensively, and a good voice recognition result can therefore be achieved.

According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features and the frequency-domain features of the input audio through mid fusion. In mid fusion, one FC layer is placed first, the feature concatenation layer is placed second, and the other FC layer is placed third. Specifically, the voice recognition module 503 may perform a one-layer FC transformation on the time-domain features of the input audio (e.g., in the first layer) to obtain a first transformed feature, and perform a one-layer FC transformation on the frequency-domain features of the input audio (e.g., in the first layer) to obtain a second transformed feature (the order of the two transformations is not limited); concatenate the first transformed feature and the second transformed feature (e.g., in the second layer) to obtain concatenated features; and perform a one-layer FC transformation on the concatenated features (e.g., in the third layer) to obtain the fused features.

According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio through a late fusion scheme. In the late fusion scheme, two FC layers are placed at the first layer and the second layer, respectively, and the feature concatenation layer is placed at the third layer. Specifically, the voice recognition module 503 may perform a two-layer FC transformation on the time-domain features of the input audio (e.g., at the first layer and the second layer, respectively) to obtain third transformed features, perform a two-layer FC transformation on the frequency-domain features of the input audio (e.g., at the first layer and the second layer, respectively) to obtain fourth transformed features (the order in which the time-domain features and the frequency-domain features are transformed is not limited), and concatenate the third transformed features and the fourth transformed features (e.g., at the third layer) to obtain the fused features.
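Again under the same assumptions, a sketch of the late fusion topology applies two FC layers per domain and concatenates last:

```python
# Late fusion sketch: layers 1-2 transform each domain, layer 3 concatenates.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, time_dim: int, freq_dim: int, hidden_dim: int):
        super().__init__()
        self.time_branch = nn.Sequential(          # first and second layers, time branch
            nn.Linear(time_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.freq_branch = nn.Sequential(          # first and second layers, frequency branch
            nn.Linear(freq_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, t_feat: torch.Tensor, f_feat: torch.Tensor) -> torch.Tensor:
        t = self.time_branch(t_feat)               # third transformed features
        f = self.freq_branch(f_feat)               # fourth transformed features
        return torch.cat([t, f], dim=-1)            # third layer: fused features
```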

Of course, the fusion method is not limited to the above. Any other feasible fusion scheme may be used to transform the time-domain features and the frequency-domain features of the input audio together into the classification feature space. For example, the number of FC layers is not necessarily two; a single layer or three or more layers may be used, the feature concatenation layer may be placed at any position among the layers, or the fused features may be obtained directly by concatenating the time-domain features and the frequency-domain features of the input audio, and voice recognition may then be performed on that basis.

According to an exemplary embodiment of the present disclosure, the TFN model proposed in the present disclosure, as well as the voice recognition method and voice recognition apparatus according to the present disclosure, may be applied to speaker recognition. When they are applied to speaker recognition, the input audio may be a speaker's speech.

Specifically, depending on the type of output, speaker recognition may include speaker identification and speaker verification. Speaker identification determines which person in an enrolled population an input utterance belongs to and outputs the index of the predicted person. Speaker verification confirms whether an input utterance was spoken by a claimed person and outputs true or false. Speaker identification is a multi-class classification problem, while speaker verification is a binary classification problem; a multi-class problem can be decomposed into multiple binary classification problems. The voice recognition method and voice recognition apparatus according to the present disclosure may be applied to both speaker identification and speaker verification.
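As a hedged illustration of the two output types, the decision rules could be sketched as follows; the 0.5 threshold and the one-vs-rest reading of verification are assumptions for the sketch rather than values given by the disclosure.

```python
# Sketch of the two output types: identification returns the index of the most
# likely enrolled speaker; verification turns one class probability into true/false.
import torch

def identify(probs: torch.Tensor) -> torch.Tensor:
    """probs: (batch, num_speakers) softmax output -> predicted speaker indices."""
    return probs.argmax(dim=-1)

def verify(probs: torch.Tensor, claimed_idx: int, threshold: float = 0.5) -> torch.Tensor:
    """Accept the claimed identity if its predicted probability exceeds the threshold."""
    return probs[:, claimed_idx] > threshold

probs = torch.softmax(torch.randn(4, 462), dim=-1)   # 4 utterances, 462 enrolled speakers
print(identify(probs))                                # speaker identification (multi-class)
print(verify(probs, claimed_idx=7))                   # speaker verification (binary)
```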

Depending on whether the user needs to cooperate with the system, speaker recognition may include text-dependent speaker recognition and text-independent speaker recognition. A text-dependent speaker recognition system requires the user to utter specific content based on interaction with the system, so that the system can resist attacks that replay recorded speech and can provide better robustness. However, it requires the user's cooperation, which limits it in some applications, such as scenarios without interaction. A text-independent speaker recognition system does not specify the content of the input speech; it must recognize the speaker from speech of unknown content, which is more difficult. At the same time, text-independent speaker recognition systems are more widely used because they place fewer demands on interaction. The voice recognition method and voice recognition apparatus according to the present disclosure may be applied to both text-dependent speaker recognition and text-independent speaker recognition.

FIGS. 6a, 6b, 6c and 6d show schematic diagrams of speaker recognition systems. FIG. 6a shows a schematic diagram of a text-dependent speaker identification system, FIG. 6b shows a schematic diagram of a text-dependent speaker verification system, FIG. 6c shows a schematic diagram of a text-independent speaker identification system, and FIG. 6d shows a schematic diagram of a text-independent speaker verification system.

Table 1 below shows a comparison of experimental results between the TFN model proposed in the present disclosure and the conventional SincNet model on the TIMIT dataset and the LibriSpeech dataset.

[Table 1]

Model            TIMIT (CER)    LibriSpeech (CER)
SincNet          0.85%          0.96%
TFN (proposed)   0.65%          0.32%

The TIMIT dataset includes 462 speakers and the LibriSpeech dataset includes 2484 speakers. Models of specific sizes were used in the experiments by controlling the dimensionality of the classification feature space: for the TIMIT dataset, the dimensionality of the model's classification feature space is 1024; for the LibriSpeech dataset, it is 2048. In both the TFN model proposed in the present disclosure and the SincNet model, the number of band-pass filters is set to 512 for the small model and to 1024 for the large model. In addition, in the TFN model proposed in the present disclosure, MFCC is used as the spectral representation from which the frequency-domain features are extracted. The classification error rate (CER) is used to evaluate model performance; a lower CER indicates better performance. As can be seen from Table 1, on the TIMIT dataset the CER of the conventional SincNet model is 0.85%, while the CER of the TFN model proposed in the present disclosure is 0.65%; on the LibriSpeech dataset the CER of the conventional SincNet model is 0.96%, while the CER of the TFN model proposed in the present disclosure is 0.32%. It can be seen that the TFN model proposed in the present disclosure exhibits better performance.
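For clarity, the CER used in Table 1 can be read as the fraction of misclassified utterances; the toy tensors in the sketch below are illustrative and unrelated to the actual TIMIT and LibriSpeech evaluations.

```python
# Sketch of the classification error rate (CER): share of utterances whose
# predicted speaker differs from the true speaker.
import torch

def classification_error_rate(pred: torch.Tensor, target: torch.Tensor) -> float:
    return (pred != target).float().mean().item()

target = torch.tensor([0, 1, 2, 3, 4])
pred = torch.tensor([0, 1, 2, 3, 1])              # one of five utterances misclassified
print(classification_error_rate(pred, target))    # 0.2, i.e. a CER of 20%
```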

According to the voice recognition method and voice recognition apparatus of the exemplary embodiments of the present disclosure, the time-domain information and the frequency-domain information of the audio signal are used jointly, through fusion, to perform voice recognition; the temporal and frequency information of the audio signal is thus fully utilized, and the performance of voice recognition is improved. For example, when the voice recognition method and voice recognition apparatus according to the exemplary embodiments of the present disclosure are applied to speaker recognition, the time-domain information and the frequency-domain information of the speech signal are used jointly, through fusion, to perform voiceprint recognition, which fully utilizes the temporal and frequency information of the speech signal and improves the performance of speaker recognition.

In addition, according to the voice recognition method and voice recognition apparatus of the exemplary embodiments of the present disclosure, the time-domain features and the frequency-domain features are transformed together into the classification feature space through the early fusion scheme, so the transformation is performed jointly over the time-domain features and the frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, and a good voice recognition effect can therefore be achieved.

The voice recognition method and voice recognition apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to FIGS. 2 to 6d.

Each module shown in FIG. 5 may be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, each module may correspond to a dedicated integrated circuit, to pure software code, or to a module combining software and hardware. In addition, one or more of the functions implemented by these modules may also be performed collectively by components of a physical entity device (e.g., a processor, a client, or a server).

Furthermore, the method described with reference to FIG. 4 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the voice recognition method according to the present disclosure.

The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, an agent device, or a server. It should be noted that the computer program may also be used to perform additional steps beyond those described above, or to perform more specific processing when performing those steps; these additional steps and further processing have already been mentioned in the description of the related method with reference to FIG. 4 and are therefore not repeated here.

It should be noted that each module according to the exemplary embodiments of the present disclosure may rely entirely on the running of a computer program to realize its corresponding function; that is, each module corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (e.g., a lib library) to realize the corresponding functions.

On the other hand, each module shown in FIG. 5 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments.

For example, an exemplary embodiment of the present disclosure may also be implemented as a computing device that includes a storage component and a processor. The storage component stores a set of computer-executable instructions which, when executed by the processor, perform the voice recognition method according to the exemplary embodiments of the present disclosure.

Specifically, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. Furthermore, the computing device may be a PC, a tablet device, a personal digital assistant, a smartphone, a web application, or any other device capable of executing the above set of instructions.

Here, the computing device need not be a single computing device; it may be any assembly of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).

In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

Some of the operations described in the voice recognition method according to the exemplary embodiments of the present disclosure may be implemented in software, some in hardware, and others in a combination of software and hardware.

The processor may run instructions or code stored in one of the storage components, and the storage components may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.

The storage component may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated-circuit microprocessor. In addition, the storage component may include an independent device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through I/O ports or network connections, so that the processor can read files stored in the storage component.

In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the computing device may be connected to one another via a bus and/or a network.

The voice recognition method according to the exemplary embodiments of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logical device or operated along non-exact boundaries.

Therefore, the voice recognition method described with reference to FIG. 4 may be implemented by a voice recognition apparatus including at least one computing device and at least one storage device storing computer instructions.

According to an exemplary embodiment of the present disclosure, the at least one computing device is a computing device for performing the voice recognition method according to the exemplary embodiments of the present disclosure, and the storage device stores a set of computer-executable instructions which, when executed by the at least one computing device, perform the voice recognition method described with reference to FIG. 4.

Various exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is merely exemplary and not exhaustive, and that the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the scope of the claims.

Claims (12)

1. A voice recognition method, comprising:
acquiring time-domain features of input audio;
acquiring frequency-domain features of the input audio; and
fusing the time-domain features of the input audio and the frequency-domain features of the input audio, and performing voice recognition based on the fused features.
2. The voice recognition method of claim 1, wherein the fusing of the time-domain features of the input audio and the frequency-domain features of the input audio comprises:
concatenating and transforming the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features.
3. The voice recognition method of claim 2, wherein the concatenating and transforming of the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features comprises:
concatenating the time-domain features of the input audio and the frequency-domain features of the input audio to obtain concatenated features; and
performing a two-layer fully connected layer transformation on the concatenated features to obtain the fused features.
4. The voice recognition method of claim 2, wherein the concatenating and transforming of the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features comprises:
performing a one-layer fully connected layer transformation on the time-domain features of the input audio to obtain first transformed features;
performing a one-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain second transformed features;
concatenating the first transformed features and the second transformed features to obtain concatenated features; and
performing a one-layer fully connected layer transformation on the concatenated features to obtain the fused features.
5. The voice recognition method of claim 2, wherein the concatenating and transforming of the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features comprises:
performing a two-layer fully connected layer transformation on the time-domain features of the input audio to obtain third transformed features;
performing a two-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain fourth transformed features; and
concatenating the third transformed features and the fourth transformed features to obtain the fused features.
6. A voice recognition apparatus, comprising:
a time-domain feature acquisition module configured to acquire time-domain features of input audio;
a frequency-domain feature acquisition module configured to acquire frequency-domain features of the input audio; and
a voice recognition module configured to fuse the time-domain features of the input audio and the frequency-domain features of the input audio and to perform voice recognition based on the fused features.
7. The voice recognition apparatus of claim 6, wherein the voice recognition module is configured to:
concatenate and transform the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features.
8. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
concatenate the time-domain features of the input audio and the frequency-domain features of the input audio to obtain concatenated features; and
perform a two-layer fully connected layer transformation on the concatenated features to obtain the fused features.
9. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
perform a one-layer fully connected layer transformation on the time-domain features of the input audio to obtain first transformed features;
perform a one-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain second transformed features;
concatenate the first transformed features and the second transformed features to obtain concatenated features; and
perform a one-layer fully connected layer transformation on the concatenated features to obtain the fused features.
10. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
perform a two-layer fully connected layer transformation on the time-domain features of the input audio to obtain third transformed features;
perform a two-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain fourth transformed features; and
concatenate the third transformed features and the fourth transformed features to obtain the fused features.
11. A voice recognition apparatus comprising at least one computing device and at least one memory device having computer instructions stored thereon, wherein the computer instructions, when executed by the at least one computing device, cause the at least one computing device to perform a voice recognition method as claimed in any one of claims 1 to 5.
12. A computer-readable storage medium having instructions stored thereon, which when executed on at least one computing device, cause the at least one computing device to perform a voice recognition method as claimed in any one of claims 1 to 5.
CN202010694750.9A 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions Pending CN111816166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010694750.9A CN111816166A (en) 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Publications (1)

Publication Number Publication Date
CN111816166A true CN111816166A (en) 2020-10-23

Family

ID=72865537

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108009481A (en) * 2017-11-22 2018-05-08 浙江大华技术股份有限公司 A kind of training method and device of CNN models, face identification method and device
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Decoding method, decoder and storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN108899037A (en) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal vocal print feature extracting method, device and electronic equipment
US10482334B1 (en) * 2018-09-17 2019-11-19 Honda Motor Co., Ltd. Driver behavior recognition
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109584887A (en) * 2018-12-24 2019-04-05 科大讯飞股份有限公司 A kind of method and apparatus that voiceprint extracts model generation, voiceprint extraction
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech emotion recognition method based on VTLP data enhancement and multi-scale time-frequency domain hole convolution model
CN110047468A (en) * 2019-05-20 2019-07-23 北京达佳互联信息技术有限公司 Audio recognition method, device and storage medium
CN110502981A (en) * 2019-07-11 2019-11-26 武汉科技大学 A Gesture Recognition Method Based on Fusion of Color Information and Depth Information
CN110459241A (en) * 2019-08-30 2019-11-15 厦门亿联网络技术股份有限公司 A kind of extracting method and system for phonetic feature
CN110675860A (en) * 2019-09-24 2020-01-10 山东大学 Speech information recognition method and system based on improved attention mechanism combined with semantics
CN111104987A (en) * 2019-12-25 2020-05-05 三一重工股份有限公司 Face recognition method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIAO, Yuqing; ZOU, Wei; LIU, Tonglai; ZHOU, Ming; CAI, Guoyong: "Speech emotion recognition based on parameter transfer and convolutional recurrent neural networks", no. 10 *
TAN, Tieniu: "Artificial Intelligence: Building an Intelligent Future with AI Technology", Popular Science Press, pages: 104 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560673A (en) * 2020-12-15 2021-03-26 北京天泽智云科技有限公司 Thunder detection method and system based on image recognition
CN112767952A (en) * 2020-12-31 2021-05-07 苏州思必驰信息科技有限公司 Voice wake-up method and device
CN112951242A (en) * 2021-02-02 2021-06-11 华南理工大学 Phrase voice speaker matching method based on twin neural network
CN113257283A (en) * 2021-03-29 2021-08-13 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113257283B (en) * 2021-03-29 2023-09-26 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN116075885A (en) * 2021-07-19 2023-05-05 谷歌有限责任公司 Bit-vector-based content matching for third-party digital assistant actions
US12380885B2 (en) 2021-07-19 2025-08-05 Google Llc Bit vector-based content matching for third-party digital assistant actions
CN113793614A (en) * 2021-08-24 2021-12-14 南昌大学 A Speaker Recognition Method Based on Speech Feature Fusion Based on Independent Vector Analysis
CN113793614B (en) * 2021-08-24 2024-02-09 南昌大学 Speech feature fusion speaker recognition method based on independent vector analysis
CN114186094A (en) * 2021-11-01 2022-03-15 深圳市豪恩声学股份有限公司 Audio scene classification method and device, terminal equipment and storage medium
CN114464196A (en) * 2022-01-17 2022-05-10 厦门快商通科技股份有限公司 A voiceprint recognition model training method, device, equipment and readable medium
CN114974267A (en) * 2022-04-15 2022-08-30 昆山杜克大学 Bird language classification model training method and bird language recognition method

Similar Documents

Publication Publication Date Title
CN111816166A (en) Voice recognition method, apparatus, and computer-readable storage medium storing instructions
Kabir et al. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities
Ashar et al. Speaker identification using a hybrid cnn-mfcc approach
JP6621536B2 (en) Electronic device, identity authentication method, system, and computer-readable storage medium
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
US11562735B1 (en) Multi-modal spoken language understanding systems
WO2021012734A1 (en) Audio separation method and apparatus, electronic device and computer-readable storage medium
WO2019154107A1 (en) Voiceprint recognition method and device based on memorability bottleneck feature
Chakravarty et al. A lightweight feature extraction technique for deepfake audio detection
WO2020073509A1 (en) Neural network-based speech recognition method, terminal device, and medium
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN116959421A (en) Method and device for processing audio data, audio data processing equipment and media
KR20200023893A (en) Speaker authentication method, learning method for speaker authentication and devices thereof
CN114664325A (en) Abnormal sound identification method, system, terminal equipment and computer readable storage medium
CN114822560A (en) Method, system, device and medium for training voiceprint recognition model and voiceprint recognition
CN117037772A (en) Voice audio segmentation method, device, computer equipment and storage medium
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment
WO2023193394A1 (en) Voice wake-up model training method and apparatus, voice wake-up method and apparatus, device and storage medium
CN111508524A (en) Method and system for identifying voice source device
Chauhan et al. Text-independent speaker recognition system using feature-level fusion for audio databases of various sizes
Jha et al. Analysis of human voice for speaker recognition: Concepts and advancement
CN116631450A (en) Multi-mode voice emotion recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201023)