
WO2018137426A1 - Method and apparatus for recognizing voice information of user - Google Patents

Method and apparatus for recognizing voice information of user Download PDF

Info

Publication number
WO2018137426A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
voice
language
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/115677
Other languages
French (fr)
Chinese (zh)
Inventor
袁文华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Publication of WO2018137426A1 publication Critical patent/WO2018137426A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present disclosure relates to the field of communications, and in particular to a method and apparatus for identifying user voice information.
  • Identity recognition systems include face recognition, fingerprint recognition, voiceprint recognition, password recognition, and passphrase recognition.
  • Passphrase identification means that the user inputs or speaks one or more words, phrases, or sentences as the key for verification; however, this identification technique can only be used in simple identification scenarios, and the key is easily stolen or misappropriated.
  • Voiceprint is a sound wave spectrum that carries speech information displayed by electroacoustic instruments.
  • The generation of human language is a complex physiological and physical process between the human language center and the vocal organs. The vocal organs used when speaking, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly among individuals in size and shape, so the voiceprints of any two people always differ.
  • voiceprint recognition technology is more secure than password recognition technology.
  • However, using voiceprint recognition technology alone for identity recognition still carries a security risk, because phonetic features can be imitated with certain techniques: for example, an attacker may illegally record the voice of the user's speech and use the pirated voice information to mimic the user's voice, so that the user's identity information is stolen and the user suffers economic losses.
  • Embodiments of the present disclosure provide a method and apparatus for identifying user voice information.
  • A method for identifying user voice information is provided, including: acquiring the sound information of the user; extracting a first language feature and a voice feature of the sound information; searching for a preset second language feature corresponding to the voice feature; and determining whether the sound information is legal according to a first comparison result of the first language feature and the second language feature.
  • Determining whether the sound information is legal according to the first comparison result of the first language feature and the second language feature comprises: determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and determining whether the sound information is legal according to a comparison result of the vector similarity and a preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • the preset threshold is an average of vector similarities between a plurality of the second language features.
  • Searching for the preset second language feature corresponding to the voice feature comprises: searching for user identification information corresponding to the voice feature; and acquiring the second language feature according to the found user identification information.
  • The correspondence between the voice feature and the user identification information is obtained by taking voice features and user identification information entered in advance as the input of a neural network model and performing training in that model.
  • The first language feature and the second language feature are Mel Frequency Cepstral Coefficients (MFCC), and the voice feature is the Linear Prediction Cepstrum Coefficient (LPCC).
  • The first language feature and the second language feature comprise the linguistic content of a user password; the voice feature comprises a voiceprint feature.
  • An apparatus for identifying user voice information is also provided, comprising: an obtaining module configured to acquire the sound information of a user; an extracting module configured to extract a first language feature and a voice feature of the sound information; a searching module configured to find a preset second language feature corresponding to the voice feature; and a determining module configured to determine whether the sound information is legal according to the first comparison result of the first language feature and the second language feature.
  • The determining module is further configured to determine a vector similarity between the first language feature and the second language feature according to the first comparison result, and to determine whether the sound information is legal according to a comparison result of the vector similarity and a preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • the lookup module is further configured to look up user identification information corresponding to the voice feature; and acquire the second language feature according to the found user identification information.
  • a storage medium is also provided.
  • The storage medium is configured to store program code for performing the following steps: acquiring the sound information of the user; extracting the first language feature and the voice feature of the sound information; searching for the preset second language feature corresponding to the voice feature; and determining whether the sound information is legal according to the first comparison result of the first language feature and the second language feature.
  • The storage medium is further configured to store program code for performing the following steps: determining a vector similarity between the first language feature and the second language feature based on the first comparison result; and determining whether the sound information is legal according to the comparison result of the vector similarity and the preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • The storage medium is further configured to store program code for performing the following steps: searching for user identification information corresponding to the voice feature; and acquiring the second language feature based on the found user identification information.
  • The storage medium is further configured to store program code for performing training in the neural network model with the pre-entered voice features and user identification information as its inputs, so as to obtain the corresponding relationship.
  • Through the above steps, the language feature and the voice feature of the sound information are acquired and extracted, the corresponding preset language feature is looked up according to the voice feature, and whether the sound information is legal is determined from the comparison result between the language features, thereby implementing the identification process.
  • In this process, the language features and voice features of the user's sound information are comprehensively considered.
  • FIG. 1 is a block diagram showing the hardware structure of a computer terminal for identifying a user voice information according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method of identifying user sound information according to an embodiment of the present disclosure
  • FIG. 3 is a structural block diagram of an apparatus for identifying user sound information according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a flowchart (1) of a method of identifying user sound information according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a flowchart (2) of a method of identifying user sound information according to an exemplary embodiment of the present disclosure
  • FIG. 6 is a structural block diagram of an apparatus for identifying user sound information according to an embodiment of the present disclosure.
  • Vector similarity is the similarity between linguistic feature vectors, usually expressed by the distance between vectors, such as Euclidean distance, cosine distance, and so on.
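The similarity measures named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's implementation; the function names and the choice of cosine similarity as the default measure for the legality check are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_similarity(a, b):
    """Turn Euclidean distance into a similarity in (0, 1]; larger means more similar."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def is_legal(first_feature, second_feature, threshold):
    """The sound information is judged legal when the similarity reaches the preset threshold."""
    return cosine_similarity(first_feature, second_feature) >= threshold
```

Either distance can back the comparison; cosine similarity ignores overall vector magnitude, which is often convenient when feature vectors are extracted from recordings of differing loudness.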
  • FIG. 1 is a hardware structural block diagram of a computer terminal for identifying a user voice information according to an embodiment of the present disclosure.
  • The computer terminal 10 may include one or more processors 102 (only one is shown; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)),
  • a memory 104 configured to store data
  • and a transmission device 106 configured for communication functions. It will be understood by those skilled in the art that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the above electronic device.
  • computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a different configuration than that shown in FIG.
  • The memory 104 may be configured to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for identifying user sound information in the embodiments of the present disclosure; the processor 102 runs the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above method.
  • Memory 104 may include high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 104 may further include memory remotely located relative to processor 102, which may be coupled to computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission device 106 is configured to receive or transmit data via a network.
  • Specific examples of the above network may include a wireless network provided by the communication provider of the computer terminal 10.
  • the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 can be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly.
  • FIG. 2 is a flowchart of a method for identifying user voice information according to an embodiment of the present disclosure. As shown in FIG. 2, the process includes the following steps:
  • Step S202: acquiring the sound information of the user;
  • Step S204: extracting the first language feature and the voice feature of the sound information;
  • Step S206: searching for a preset second language feature corresponding to the voice feature;
  • Step S208: determining whether the sound information is legal according to the first comparison result of the first language feature and the second language feature.
  • The sound information includes a language feature and a voice feature, wherein the language feature is the specific content of the user's password and the voice feature is a voiceprint feature extracted from the sound information. For example, when the user speaks the password "turn on the light", the words "turn on the light" are the language feature of the user's sound information, and the characteristics of the sound produced by the user's vocal organs are the voice feature.
  • Through the above steps, the language feature and the voice feature of the sound information are acquired and extracted, the corresponding preset language feature is looked up according to the voice feature, and whether the sound information is legal is determined from the comparison result between the language features, thereby implementing the identification process.
  • In this process, the language features and voice features of the user's sound information are comprehensively considered.
  • The above steps may be executed by a data processing device such as a single-chip microcomputer, but are not limited thereto.
  • Step S208 may be performed by determining a vector similarity between the first language feature and the second language feature according to the first comparison result, and then determining whether the sound information is legal according to the comparison result of the vector similarity and the preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • There may be a plurality of second language features; when the first language feature is compared with them, the vector similarity between the first language feature and each second language feature may be determined in turn and the average of these vector similarities taken, but the comparison is not limited to this. Because this example judges whether the sound information is legal by comparing a vector similarity with a threshold, it identifies the user quickly and accurately and improves the accuracy of identifying the user's identity according to the language feature.
  • the predetermined threshold is an average of vector similarities between the plurality of second language features.
  • The language feature of the user may be collected repeatedly, and the average of the vector similarities between the collected language features is then taken; this average objectively and accurately reflects the language feature of the user's sound information and prevents inaccurate results caused by interference during any single collection.
  • step S206 may be performed by searching for user identification information corresponding to the voice feature, and acquiring the second language feature according to the found user identification information.
  • The user identification information may be an integer string or a name string, and is typically the ID of the user, which is assigned by the system and is unique.
  • The correspondence between the voice feature and the user identification information is determined by taking the voice features and user identification information entered in advance as the input of a neural network model and performing training in the neural network model to obtain the corresponding relationship.
  • the neural network model may be a deep neural network recognition model, such as a Convolutional Neural Network (CNN).
  • The deep neural network model is pre-trained; pre-training requires a large training set whose content is the pre-entered voice features and user identification information.
  • The pre-training process is a standard deep neural network training procedure: the content of the training set is fed into the deep neural network one by one, the parameters of the network are optimized with the Error Back Propagation (BP) algorithm according to the comparison between the network output and the user identification information, and this loop continues until the accuracy of the model meets the requirement.
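As a rough illustration of this training loop (not the patent's actual model), the sketch below trains a one-hidden-layer network with error back-propagation to map feature vectors to user IDs. The toy data, network size, learning rate, and accuracy requirement are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 12-dimensional "voice feature" vectors (LPCC-like) for three
# hypothetical users, 40 samples each, clustered tightly around per-user centers.
n_users, dim = 3, 12
centers = rng.normal(size=(n_users, dim))
X = np.vstack([c + 0.05 * rng.normal(size=(40, dim)) for c in centers])
y = np.repeat(np.arange(n_users), 40)

# One hidden layer; parameters are optimized with error back-propagation (BP).
W1 = 0.1 * rng.normal(size=(dim, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.normal(size=(16, n_users)); b2 = np.zeros(n_users)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)   # hidden activations, softmax output

onehot = np.eye(n_users)[y]
acc = 0.0
for epoch in range(300):                   # loop until model accuracy meets the requirement
    h, p = forward(X)
    g_logits = (p - onehot) / len(X)       # BP: error at the output layer
    gW2, gb2 = h.T @ g_logits, g_logits.sum(0)
    g_h = (g_logits @ W2.T) * (1.0 - h ** 2)   # BP: error propagated to the hidden layer
    gW1, gb1 = X.T @ g_h, g_h.sum(0)
    for param, grad in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        param -= 0.5 * grad                # gradient-descent update
    acc = float((p.argmax(1) == y).mean())
    if acc >= 0.99:                        # accuracy requirement reached
        break

def predict_user_id(feature):
    """Map a single voice feature vector to the most likely user ID."""
    return int(forward(feature[None, :])[1].argmax())
```

The patent names a deep model such as a CNN; the two-layer network here only stands in for it to show the train-until-accurate loop in runnable form.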
  • the first linguistic feature and the second linguistic feature are MFCC; the speech feature is LPCC.
  • The voice feature can use the LPCC as its feature parameter. The LPCC has the advantages of high computational efficiency and of completely removing the excitation information of the speech production process, so it effectively reflects the vocal tract response; moreover, only a dozen or so LPCCs are needed to describe the formant characteristics of a speech signal.
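A minimal sketch of how LPCC parameters can be computed: the Levinson-Durbin recursion solves for the LPC coefficients, and a common LPC-to-cepstrum recursion converts them. This is an illustration under simplifying assumptions (no pre-emphasis or windowing, one sign convention among several), not the patent's implementation.

```python
import numpy as np

def lpcc(frame, order=12):
    """LPC coefficients via the Levinson-Durbin recursion, then cepstral
    coefficients via a common LPC-to-cepstrum recursion."""
    x = np.asarray(frame, dtype=float)
    # Autocorrelation sequence r[0..order] of the frame.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Levinson-Durbin: solve the Toeplitz normal equations for A(z) = 1 + a1*z^-1 + ...
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)                  # prediction error shrinks each step
    lp = -a[1:]                               # predictor coefficients a_1..a_p
    # Cepstral recursion: c_n = a_n + sum_{m=1}^{n-1} (m/n) * c_m * a_{n-m}
    c = np.zeros(order)
    for n in range(1, order + 1):
        c[n - 1] = lp[n - 1] + sum((m / n) * c[m - 1] * lp[n - 1 - m]
                                   for m in range(1, n))
    return c
```

The recursion converts the twelve-odd prediction coefficients directly into cepstral coefficients without any FFT, which is where the computational-efficiency advantage mentioned above comes from.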
  • The first language feature and the second language feature comprise the linguistic content of a user password; the voice feature comprises a voiceprint feature.
  • The apparatus includes: a sound entry module 32 configured to collect sound information; a language feature recognition module 34 configured to extract the language feature from the sound information; a voice feature recognition module 36 configured to extract the voice feature from the sound information; a feature comparison module 38 configured to compare language features and compute the vector similarity; a feature storage module 310 configured to store the preset language features, the user identification information corresponding to the language features, the preset threshold, and the correspondence between the voice features and the user identification information; and a recognition result output module 312 configured to output indication information on whether the sound information is legal.
  • FIG. 4 is a flowchart (1) of a method for identifying user voice information according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, the flow includes:
  • Step S402: the sound entry module 32 collects the sound information of the user's password;
  • Step S404: the language feature recognition module 34 performs a Fourier transform on the sound information to obtain its spectrum, takes the logarithm of the spectrum, and then performs an inverse Fourier transform to obtain the MFCC;
  • Step S406: the sound entry module 32 and the language feature recognition module 34 repeat the above collection and extraction operations until n MFCCs are acquired, where n ≥ 2;
  • Step S408: the feature comparison module 38 compares the n MFCCs in pairs, obtains multiple vector similarities, and averages them to obtain the preset threshold;
  • Step S410: the voice feature recognition module 36 extracts the LPCC from the sound information, and inputs the LPCC and the user's ID into the pre-trained deep neural network model to continue training it;
  • Step S412: the sound entry module 32 and the voice feature recognition module 36 repeat the above collection and training operations until the test performance of the deep neural network model reaches the specified requirement;
  • Step S414: the feature storage module 310 stores the n MFCCs, the IDs of the users corresponding to the MFCCs, the preset threshold, and the correspondence between the LPCCs and the users' IDs.
  • This example is an operation in which the device configures relevant parameters when the user first uses the identification device of the user's voice information.
  • the sound entry module 32 can perform sound information collection through the microphone.
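Steps S404 to S408 of the enrollment flow above can be sketched as follows. The cepstral extraction follows the flow's wording literally (FFT, log, inverse FFT); a true MFCC front end would additionally apply a mel filter bank, and all function names here are hypothetical.

```python
import numpy as np
from itertools import combinations

def cepstral_feature(signal, n_coeff=13):
    """Step S404 sketch: FFT -> log magnitude -> inverse FFT (a real cepstrum).
    Note: genuine MFCCs also pass through a mel filter bank; omitted here."""
    spectrum = np.fft.rfft(np.asarray(signal, dtype=float))
    log_mag = np.log(np.abs(spectrum) + 1e-10)     # small floor avoids log(0)
    return np.fft.irfft(log_mag)[:n_coeff]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def preset_threshold(features):
    """Step S408 sketch: average of the pairwise similarities of n enrolled
    features (n >= 2) becomes the preset threshold."""
    sims = [cosine_similarity(a, b) for a, b in combinations(features, 2)]
    return sum(sims) / len(sims)
```

For example, three enrollments of the same spoken password yield three feature vectors, and the average of their three pairwise similarities is stored as the threshold.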
  • FIG. 5 is a flowchart (2) of a method for identifying user voice information according to an exemplary embodiment of the present disclosure. As shown in FIG. 5, the flow includes:
  • Step S502: the sound entry module 32 collects the sound information of the user's password;
  • Step S504: the language feature recognition module 34 performs a Fourier transform on the sound information to obtain its spectrum, takes the logarithm of the spectrum, and then performs an inverse Fourier transform to obtain the MFCC;
  • Step S506: the voice feature recognition module 36 extracts the LPCC from the sound information and inputs the LPCC into the pre-trained deep neural network model to obtain the ID of the user;
  • Step S508: the feature comparison module 38 searches the feature storage module 310 for the MFCC corresponding to the obtained user ID, and compares the found MFCC with the MFCC extracted from the sound information of the spoken password to obtain a vector similarity; the vector similarity is then compared with the preset threshold: when the vector similarity is greater than or equal to the preset threshold, the sound information of the password is legal and user identification succeeds; when the vector similarity is less than the preset threshold, the sound information of the password is illegal and user identification fails;
  • Step S510: the recognition result output module 312 outputs the recognition result (success or failure) determined by the feature comparison module 38.
  • an "open door” command is issued at the door of the door;
  • the identification device of the user's voice information prompts for a password, and the user speaks the preset password password "I am the homeowner";
  • the identification device enters the password and the voice according to the password
  • the feature determines a preset language feature;
  • the identifying device compares the language feature extracted by the password and the password with the preset language feature to obtain a vector similarity; compares the vector similarity with the preset threshold, and when the vector similarity is greater than or equal to the preset When the threshold is reached, the door is automatically opened; When the vector similarity is less than the preset threshold, the user is prompted to re-enter the password.
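The verification flow (steps S506 to S510) can be sketched as below. The feature store, the dictionary standing in for the trained deep-network ID lookup, and every name and value are hypothetical illustrations, not the patent's data structures.

```python
import numpy as np

# Hypothetical store populated during enrollment: user ID -> enrolled language
# features plus the preset threshold. The trained network that maps a voice
# feature to a user ID is stubbed with a plain dictionary.
FEATURE_STORE = {
    "user_42": {"mfcc": [np.array([1.0, 0.2, 0.1])], "threshold": 0.95},
}
VOICEPRINT_TO_ID = {("lpcc", 7): "user_42"}   # stand-in for the trained model

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(voice_feature_key, extracted_mfcc):
    """Look up the user ID from the voice feature, fetch the enrolled language
    features, compare, and return the recognition result."""
    user_id = VOICEPRINT_TO_ID.get(voice_feature_key)
    if user_id is None:
        return "recognition failed"
    entry = FEATURE_STORE[user_id]
    sim = max(cosine_similarity(extracted_mfcc, m) for m in entry["mfcc"])
    return "recognition succeeded" if sim >= entry["threshold"] else "recognition failed"
```

The two-stage check mirrors the door example: the voiceprint selects whose enrolled password to compare against, and the language-feature similarity then decides whether to open the door.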
  • The method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware alone, but in many cases the former is the preferred implementation.
  • The solution of the present disclosure may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a plurality of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present disclosure.
  • A module may implement a predetermined function through software, hardware, or a combination of the two.
  • The apparatuses described in the following embodiments are typically implemented in software, but implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • The apparatus includes: an obtaining module 62 configured to acquire the sound information of a user; an extracting module 64 configured to extract a first language feature and a voice feature of the sound information; a search module 66 configured to find a preset second language feature corresponding to the voice feature; and a determining module 68 configured to determine whether the sound information is legal according to the first comparison result of the first language feature and the second language feature.
  • The determining module 68 is further configured to determine a vector similarity between the first language feature and the second language feature according to the first comparison result, and to determine whether the sound information is legal according to the comparison result of the vector similarity and the preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • the lookup module 66 is further configured to search for user identification information corresponding to the voice feature; and acquire the second language feature according to the found user identification information.
  • This example differs from Example 1 in the division of modules.
  • The obtaining module 62 is similar to the sound entry module 32 in Example 1, but with some new functional features added; the extracting module 64 is similar to the language feature recognition module 34 and the voice feature recognition module 36 in Example 1, but with some new functional features added; and the functions of the feature comparison module 38 in Example 1 are implemented by the search module 66 and the determining module 68 in this example.
  • the above modules may be implemented by software or hardware.
  • The above modules may be implemented in, but are not limited to, the following forms: the modules are all located in the same processor, or the modules are located, in any combination, in different processors.
  • Embodiments of the present disclosure also provide a storage medium.
  • The above storage medium may be configured to store program code for performing the following steps: S11, acquiring the sound information of the user; S12, extracting the first language feature and the voice feature of the sound information; S13, searching for the preset second language feature corresponding to the voice feature; and S14, determining whether the sound information is legal according to the first comparison result of the first language feature and the second language feature.
  • The storage medium may be further configured to store program code for performing the following steps: S21, determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and S22, determining whether the sound information is legal according to the comparison result of the vector similarity and the preset threshold, wherein when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold, the sound information is determined to be legal.
  • the storage medium may be further configured to store program code for performing the following steps: S31, searching for user identification information corresponding to the voice feature; S32, acquiring the second language feature based on the found user identification information.
  • The foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, or various other media that can store program code.
  • the functional modules/units in the system, device, and device can be implemented as software, firmware, hardware, and suitable combinations thereof.
  • The division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components cooperating.
  • Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
  • Such software may be distributed on a computer readable medium, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • Through the present disclosure, the sound information of the user is acquired, its language feature and voice feature are extracted, the corresponding preset language feature is looked up according to the voice feature, and whether the sound information is legal is determined from the comparison result between the language features, thereby implementing the identification process.
  • The embodiments effectively combine password recognition technology and voiceprint recognition technology: by comprehensively considering the language features and voice features of the user's sound information, they improve the security of identity recognition, reduce the probability that the key is stolen or imitated by others, and improve the user experience, thereby avoiding the economic losses and poor user experience caused by theft of identity information.
  • the present disclosure therefore has industrial applicability.


Abstract

Provided are a method and apparatus for recognizing voice information of a user. The method comprises: obtaining voice information of a user; extracting first language features and voice features of the voice information; searching for preset second language features corresponding to the voice features; and determining whether the voice information is legitimate according to a first comparison result of the first language features and the second language features.

Description

User voice information identification method and device

Technical Field

The present disclosure relates to the field of communications, and in particular to a method and apparatus for identifying user voice information.

背景技术Background technique

身份识别系统包括人脸识别、指纹识别、声纹识别、密码识别和口令识别。The identity recognition system includes face recognition, fingerprint recognition, voiceprint recognition, password recognition and password recognition.

Passphrase recognition means that the user inputs or speaks one or more words, phrases, or sentences as a verification key. However, this technique is suitable only for simple identification scenarios, and the key is easily stolen or misappropriated.

A voiceprint is the spectrum of a sound wave, carrying speech information, that can be displayed by electroacoustic instruments. The production of human speech is a complex physiological and physical process between the language centers of the brain and the vocal organs. The vocal organs used in speaking, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly among individuals in size and shape; therefore, the voiceprints of any two people differ. Although each person's phonetic characteristics are not fixed and exhibit variability, being affected by physiological, pathological, psychological, imitative, disguise, and environmental factors, voiceprint recognition is still more secure than passphrase recognition.

However, relying on voiceprint recognition alone for user identification still carries security risks, because phonetic features can be imitated with certain techniques. For example, the voice information of a user's speech can be recorded illegally, and the recording can be used to imitate the user's voice, so that the user's identity information is misappropriated and the user suffers economic losses.

Summary of the Invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.

Embodiments of the present disclosure provide a method and apparatus for identifying user voice information.

According to an embodiment of the present disclosure, a method for identifying user voice information is provided, including: acquiring voice information of a user; extracting a first language feature and a voice feature of the voice information; searching for a preset second language feature corresponding to the voice feature; and determining, according to a first comparison result between the first language feature and the second language feature, whether the voice information is legitimate.

In an exemplary embodiment, determining whether the voice information is legitimate according to the first comparison result between the first language feature and the second language feature includes: determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and determining whether the voice information is legitimate according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legitimate when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

In an exemplary embodiment, the preset threshold is the average of the vector similarities between multiple second language features.

In an exemplary embodiment, searching for the preset second language feature corresponding to the voice feature includes: searching for user identification information corresponding to the voice feature; and acquiring the second language feature according to the found user identification information.

In an exemplary embodiment, the correspondence between the voice feature and the user identification information is determined as follows: pre-entered voice features and user identification information are used as inputs to a neural network model, and the correspondence is obtained through training and learning in the neural network model.

In an exemplary embodiment, the first language feature and the second language feature are Mel-Frequency Cepstral Coefficients (MFCCs), and the voice feature is a set of Linear Prediction Cepstrum Coefficients (LPCCs).

In an exemplary embodiment, the first language feature and the second language feature include a voiceprint feature, and the voice feature includes the linguistic content of the user's passphrase.

According to another embodiment of the present disclosure, an apparatus for identifying user voice information is provided, including: an acquisition module configured to acquire voice information of a user; an extraction module configured to extract a first language feature and a voice feature of the voice information; a search module configured to search for a preset second language feature corresponding to the voice feature; and a determination module configured to determine, according to a first comparison result between the first language feature and the second language feature, whether the voice information is legitimate.

In an exemplary embodiment, the determination module is further configured to determine a vector similarity between the first language feature and the second language feature according to the first comparison result, and to determine whether the voice information is legitimate according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legitimate when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

In an exemplary embodiment, the search module is further configured to search for user identification information corresponding to the voice feature, and to acquire the second language feature according to the found user identification information.

According to another embodiment of the present disclosure, a storage medium is further provided. The storage medium is configured to store program code for performing the following steps: acquiring voice information of a user; extracting a first language feature and a voice feature of the voice information; searching for a preset second language feature corresponding to the voice feature; and determining, according to a first comparison result between the first language feature and the second language feature, whether the voice information is legitimate.

In an exemplary embodiment, the storage medium is further configured to store program code for performing the following steps: determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and determining whether the voice information is legitimate according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legitimate when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

In an exemplary embodiment, the storage medium is further configured to store program code for performing the following steps: searching for user identification information corresponding to the voice feature; and acquiring the second language feature according to the found user identification information.

In an exemplary embodiment, the storage medium is further configured to store program code for performing the following step: using pre-entered voice features and user identification information as inputs to a neural network model, and obtaining the correspondence through training and learning in the neural network model.

Through the present disclosure, the language feature and the voice feature of the voice information are acquired and extracted, the corresponding language feature is looked up according to the voice feature, and whether the voice information is legitimate is determined according to the comparison result between the language features, so that the language feature and the voice feature of the user's voice information are considered together during identification. The solution of the present disclosure addresses the fact that no recognition method has existed that combines passphrase recognition and voiceprint recognition to jointly consider the user's language features and voice features, while the keys used by passphrase recognition or voiceprint recognition alone are relatively easy for others to steal or misappropriate, causing economic losses to users and a poor user experience. By effectively combining passphrase recognition and voiceprint recognition and jointly considering the language features and voice features of the user's voice information, the solution improves the security of identity recognition, greatly reduces the probability that a key is stolen or misappropriated by others, and improves the user experience.

Other aspects will become apparent upon reading and understanding the drawings and the detailed description.

Brief Description of the Drawings

FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method of identifying user voice information according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a method of identifying user voice information according to an embodiment of the present disclosure;

FIG. 3 is a structural block diagram of an apparatus for identifying user voice information according to an exemplary embodiment of the present disclosure;

FIG. 4 is a flowchart (1) of a method of identifying user voice information according to an exemplary embodiment of the present disclosure;

FIG. 5 is a flowchart (2) of a method of identifying user voice information according to an exemplary embodiment of the present disclosure;

FIG. 6 is a structural block diagram of an apparatus for identifying user voice information according to an embodiment of the present disclosure.

Detailed Description

At present, no recognition method exists that combines passphrase recognition and voiceprint recognition to jointly consider the user's language features and voice features, and the keys used by existing passphrase recognition or voiceprint recognition are relatively easy for others to steal or misappropriate, which causes economic losses to users and a poor user experience.

The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.

The terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence.

To facilitate understanding of the embodiments of the present disclosure, the technical terms involved in the embodiments are explained as follows:

Vector similarity: the similarity between language feature vectors, usually expressed as a distance between the vectors, such as the Euclidean distance or the cosine distance.
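As an illustrative sketch only (not part of the patent text; NumPy and the function names are assumptions), the two distance measures mentioned above can be computed as follows:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two language-feature vectors:
    # 1.0 for identical direction, smaller values for less similar vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Euclidean distance: 0.0 for identical vectors, larger when dissimilar.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))
```

Either measure can serve as the "vector similarity" compared against the preset threshold; cosine similarity is convenient because it is bounded and insensitive to vector magnitude.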

Example 1

The method embodiment provided in Example 1 of the present disclosure may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for the method of identifying user voice information according to an embodiment of the present disclosure. As shown in FIG. 1, the computer terminal 10 may include one or more processors 102 (only one is shown; the processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device such as an FPGA), a memory 104 configured to store data, and a transmission device 106 for communication functions. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.

The memory 104 may be configured to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for identifying user voice information in the embodiment of the present disclosure. The processor 102 executes various functional applications and data processing, i.e., implements the above method, by running the software programs and modules stored in the memory 104. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is configured to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly.

This example provides a method for identifying user voice information that runs on the computer terminal shown in FIG. 1. FIG. 2 is a flowchart of a method of identifying user voice information according to an embodiment of the present disclosure. As shown in FIG. 2, the flow includes the following steps:

Step S202: acquire voice information of a user;

Step S204: extract a first language feature and a voice feature of the voice information;

Step S206: search for a preset second language feature corresponding to the voice feature;

Step S208: determine, according to a first comparison result between the first language feature and the second language feature, whether the voice information is legitimate.

In this step, the voice information includes a language feature and a voice feature, where the language feature is the specific content of the user's passphrase, and the voice feature is the voiceprint feature extracted from the voice information. For example, when the user speaks the passphrase "turn on the light", the words "turn on the light" are the language feature of the user's voice information, while the characteristics of the sound produced by the user's vocal organs are the voice feature.

Through the above steps, the language feature and the voice feature of the voice information are acquired and extracted, the corresponding language feature is looked up according to the voice feature, and whether the voice information is legitimate is determined according to the comparison result between the language features, so that the language feature and the voice feature of the user's voice information are considered together during identification. The solution of the present disclosure addresses the fact that no recognition method has existed that combines passphrase recognition and voiceprint recognition to jointly consider the user's language features and voice features, while the keys used by passphrase recognition or voiceprint recognition alone are relatively easy for others to steal or misappropriate, causing economic losses to users and a poor user experience. By effectively combining passphrase recognition and voiceprint recognition and jointly considering the language features and voice features of the user's voice information, the solution improves the security of identity recognition, greatly reduces the probability that a key is stolen or misappropriated by others, and improves the user experience.

In an exemplary embodiment, the above steps may be executed by a data processing device such as a microcontroller, but are not limited thereto.

In an exemplary embodiment, step S208 may be performed as follows: determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and determining whether the voice information is legitimate according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legitimate when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

In this example, the second language feature may comprise multiple language features. When the first language feature is compared with the second language features, the vector similarity between the first language feature and each second language feature may be determined in turn, and the average of these vector similarities may then be taken, although the method is not limited to this. Because this example judges whether the voice information is legitimate by comparing the vector similarity against a threshold, the user's identity can be judged quickly and accurately, improving the accuracy of identifying the user by language features.

In an exemplary embodiment, the preset threshold is the average of the vector similarities between multiple second language features. In this example, when setting the threshold, the user's language feature may be collected repeatedly, and the average of the vector similarities between the collected language features is then obtained. This average reflects the language feature of the user's voice information relatively objectively and accurately, preventing inaccurate results caused by interference during any single collection.
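A minimal sketch of this thresholding rule follows (illustrative only; NumPy, cosine similarity as the similarity measure, and the helper names are assumptions, not part of the patent):

```python
import itertools
import numpy as np

def cosine_similarity(a, b):
    # Vector similarity between two language-feature vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def preset_threshold(enrolled_features):
    # Average vector similarity over all pairs of the user's enrolled
    # language-feature vectors (n >= 2), used as the acceptance threshold.
    sims = [cosine_similarity(a, b)
            for a, b in itertools.combinations(enrolled_features, 2)]
    return sum(sims) / len(sims)
```

Averaging over all pairs means a single noisy enrollment recording shifts the threshold only slightly, matching the robustness argument above.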

In an exemplary embodiment, step S206 may be performed as follows: searching for user identification information corresponding to the voice feature; and acquiring the second language feature according to the found user identification information.

In this example, the user identification information may consist of an integer string or a name string, typically the user's ID, which is assigned by the system and is unique.

In an exemplary embodiment, the correspondence between the voice feature and the user identification information is determined as follows: pre-entered voice features and user identification information are used as inputs to a neural network model, and the correspondence is obtained through training and learning in the neural network model.

In this example, the neural network model may be a deep neural network recognition model, such as a Convolutional Neural Network (CNN). During initialization, the deep neural network model is pre-trained. Pre-training requires a relatively large training set whose content consists of pre-entered voice features and user identification information. The pre-training process is a standard deep neural network training process: the content of the training set is input into the deep network item by item, and the parameters of the network are optimized with the Error Back Propagation (BP) algorithm according to the comparison between the output and the user identification information, iterating until the accuracy of the model meets the requirement. By using a neural network model and training, the correspondence between voice features and user identification information can be obtained effectively, so that the user identification information corresponding to the voice feature of the user's voice information can be found quickly and accurately.
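The patent describes a deep network (e.g., a CNN) trained with back-propagation; as a deliberately simplified stand-in (not the patent's model — the synthetic data, dimensions, and learning rate below are all assumptions), the sketch trains a linear softmax classifier with plain gradient descent to map LPCC-like feature vectors to user IDs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (synthetic, illustrative only): two "users", each with
# 4-dimensional LPCC-like feature vectors clustered around a per-user mean.
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
               rng.normal(1.0, 0.1, size=(20, 4))])
y = np.array([0] * 20 + [1] * 20)  # user IDs 0 and 1

W = np.zeros((4, 2))
b = np.zeros(2)
for _ in range(300):  # plain gradient descent: a simplified "BP" update
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)       # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0       # cross-entropy gradient
    grad /= len(y)
    W -= 0.5 * (X.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

def predict_user(feature):
    # Return the user ID whose score is highest for this feature vector.
    return int(np.argmax(np.asarray(feature, dtype=float) @ W + b))
```

A real deployment would replace this linear model with the deep network the patent names, but the loop above shows the same output-versus-label comparison driving the parameter updates.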

In an exemplary embodiment, the first language feature and the second language feature are MFCCs, and the voice feature is a set of LPCCs.

In this example, the voice feature may use LPCCs as feature parameters. LPCC parameters are computationally efficient and rather thoroughly remove the excitation information of the speech production process, effectively reflecting the vocal tract response; relatively few LPCCs are needed, and a dozen or so can describe the formant characteristics of a speech signal.

In an exemplary embodiment, the first language feature and the second language feature include a voiceprint feature, and the voice feature includes the linguistic content of the user's passphrase.

FIG. 3 is a structural block diagram of an apparatus for identifying user voice information according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes: a sound entry module 32 configured to collect voice information; a language feature recognition module 34 configured to extract a language feature from the voice information; a voice feature recognition module 36 configured to extract a voice feature from the voice information; a feature comparison module 38 configured to compare language features and vector similarities; a feature storage module 310 configured to store preset language features, user identification information corresponding to the language features, the preset threshold, and the correspondence between voice features and user identification information; and a recognition result output module 312 configured to output indication information on whether the voice information is legitimate.

FIG. 4 is a flowchart (1) of a method of identifying user voice information according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, the flow includes:

Step S402: the sound entry module 32 collects the voice information of the user's passphrase;

Step S404: the language feature recognition module 34 applies a Fourier transform to the voice information to obtain its spectrum, takes the logarithm of the spectrum, and applies an inverse Fourier transform to obtain an MFCC;

Step S406: the sound entry module 32 and the language feature recognition module 34 repeat the above collection and extraction operations until n MFCCs are obtained, where n ≥ 2;

Step S408: the feature comparison module 38 compares the n MFCCs pairwise to obtain multiple vector similarities, and averages these similarities to obtain the preset threshold;

Step S410: the voice feature recognition module 36 extracts an LPCC from the voice information, and inputs the LPCC and the user's ID into the pre-trained deep neural network model, where training and learning are performed;

Step S412: the sound entry module 32 and the voice feature recognition module 36 repeat the above collection and training operations until the test performance of the deep neural network model meets the specified requirement;

Step S414: the storage module 310 stores the n MFCCs, the ID of the user corresponding to the MFCCs, the preset threshold, and the correspondence between the LPCC and the user's ID.
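The cepstral computation in step S404 (Fourier transform → logarithm → inverse Fourier transform) can be sketched as follows. This is a real-cepstrum illustration only; a full MFCC pipeline would additionally apply a mel filter bank and a DCT, and NumPy plus the toy frame parameters are assumptions:

```python
import numpy as np

def cepstral_feature(frame):
    # Step S404, simplified: FFT of the (windowed) signal frame, log of the
    # magnitude spectrum, then inverse FFT back to the cepstral domain.
    spectrum = np.fft.rfft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # floor avoids log(0)
    return np.fft.irfft(log_magnitude, n=len(frame))

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # toy 440 Hz frame
feature = cepstral_feature(frame)
```

The resulting vector has one value per sample of the input frame; in practice only the first dozen or so coefficients would be kept as the feature.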

This example describes how the apparatus configures the relevant parameters when a user uses the apparatus for identifying user voice information for the first time. In this example, the sound entry module 32 may collect the voice information through a microphone.

FIG. 5 is a flowchart (2) of a method of identifying user voice information according to an exemplary embodiment of the present disclosure. As shown in FIG. 5, the flow includes:

Step S502: the sound entry module 32 collects the voice information of the user's passphrase;

Step S504: the language feature recognition module 34 applies a Fourier transform to the voice information to obtain its spectrum, takes the logarithm of the spectrum, and applies an inverse Fourier transform to obtain an MFCC;

Step S506: the voice feature recognition module 36 extracts an LPCC from the voice information and inputs the LPCC into the pre-trained deep neural network model to obtain the user's ID;

Step S508: the feature comparison module 38 looks up, in the storage module 310, the MFCC corresponding to the obtained user ID, compares the found MFCC with the MFCC extracted from the passphrase voice information to obtain a vector similarity, and compares this vector similarity with the preset threshold. When the vector similarity is greater than or equal to the preset threshold, the passphrase voice information is legitimate and user identification succeeds; when the vector similarity is less than the preset threshold, the passphrase voice information is illegitimate and user identification fails;

Step S510: the recognition result output module 312 outputs the recognition result, i.e., success or failure of user identification, determined by the feature comparison module 38.
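The decision in steps S506–S510 can be sketched as below (illustrative; the dictionary lookup and cosine similarity are assumptions standing in for the storage module and the patent's comparison):

```python
import numpy as np

def verify(candidate_mfcc, enrolled_mfcc_by_id, user_id, threshold):
    # Look up the enrolled MFCC for the user ID recovered from the voice
    # feature, then accept iff the vector similarity reaches the threshold.
    enrolled = enrolled_mfcc_by_id.get(user_id)
    if enrolled is None:
        return False  # unknown user ID: identification fails
    a = np.asarray(candidate_mfcc, dtype=float)
    b = np.asarray(enrolled, dtype=float)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold
```

Because acceptance requires both the right user ID (from the voice feature) and a sufficiently similar language feature, a stolen passphrase alone or an imitated voice alone is not enough.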

The above method steps of this example may be applied in the following scenario, among others, but are not limited thereto.

When the user arrives home, the user issues an "open the door" command at the door. The apparatus for identifying user voice information prompts for a passphrase, and the user speaks the preset passphrase "I am the homeowner". The apparatus records the passphrase and determines the preset language feature according to the voice feature of the passphrase; it then compares the language feature extracted from the passphrase with the preset language feature to obtain a vector similarity, and compares the vector similarity with the preset threshold. When the vector similarity is greater than or equal to the preset threshold, the door opens automatically; when the vector similarity is less than the preset threshold, the user is prompted to enter the passphrase again.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiment may be implemented by means of software plus the necessary general-purpose hardware platform, and certainly also by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the solution of the present disclosure may in essence be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present disclosure.

Example 2

This example further provides an apparatus for recognizing user voice information. The apparatus is used to implement the foregoing examples and exemplary embodiments; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is typically implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 6 is a structural block diagram of an apparatus for recognizing user voice information according to an embodiment of the present disclosure. As shown in FIG. 6, the apparatus includes: an acquisition module 62 configured to acquire voice information of a user; an extraction module 64 configured to extract a first language feature and a voice feature of the voice information; a lookup module 66 configured to look up a preset second language feature corresponding to the voice feature; and a determination module 68 configured to determine, according to a first comparison result of the first language feature and the second language feature, whether the voice information is legal.
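A minimal sketch of how the four modules of FIG. 6 could cooperate is given below. The class, its method names, and the stand-in feature comparison are all hypothetical, intended only to show the acquire/extract/look-up/determine data flow.

```python
class VoiceRecognizer:
    """Mirrors the apparatus of FIG. 6: acquire -> extract -> look up -> determine."""

    def __init__(self, enrolled, threshold):
        self.enrolled = enrolled      # voice-feature key -> second language feature
        self.threshold = threshold

    def extract(self, voice_info):
        """Extraction module 64: split the sample into (language, voice) features.
        Here the 'sample' is already a pair, standing in for real signal processing."""
        return voice_info

    def lookup(self, voice_feature):
        """Lookup module 66: fetch the preset second language feature, if enrolled."""
        return self.enrolled.get(voice_feature)

    def determine(self, first, second):
        """Determination module 68: compare the two language features."""
        if second is None:
            return False
        similarity = 1.0 if first == second else 0.0  # stand-in comparison
        return similarity >= self.threshold

    def recognize(self, voice_info):
        first_lang, voice_feat = self.extract(voice_info)
        return self.determine(first_lang, self.lookup(voice_feat))

recognizer = VoiceRecognizer(enrolled={"alice-voiceprint": "open sesame"},
                             threshold=0.9)
print(recognizer.recognize(("open sesame", "alice-voiceprint")))  # True
```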

In an exemplary embodiment, the determination module 68 is further configured to determine a vector similarity between the first language feature and the second language feature according to the first comparison result, and to determine whether the voice information is legal according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legal when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.
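The claims below further allow the preset threshold to be the average of the vector similarities among a plurality of second language features. That computation can be sketched as follows, again assuming cosine similarity and toy 2-dimensional feature vectors.

```python
import itertools
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def preset_threshold(second_language_features):
    """Average pairwise vector similarity over the enrolled features."""
    pairs = list(itertools.combinations(second_language_features, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

enrolled = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.6, 0.8],
]
threshold = preset_threshold(enrolled)
print(round(threshold, 3))  # 0.787
```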

In an exemplary embodiment, the lookup module 66 is further configured to look up user identification information corresponding to the voice feature, and to acquire the second language feature according to the found user identification information.
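One plausible realization of this two-step lookup (user identification first, then the stored language feature) is to pick the enrolled user whose voice feature is nearest to the extracted one and then fetch that user's preset second language feature. The Euclidean nearest-neighbor rule and the toy feature vectors are assumptions for illustration; the disclosure itself leaves the voice-feature-to-user mapping to, e.g., a trained neural network model.

```python
import math

def find_user_id(voice_feature, enrolled_voices):
    """S31-style step: return the ID of the enrolled user whose voice feature
    is nearest (Euclidean distance) to the extracted one."""
    return min(enrolled_voices,
               key=lambda uid: math.dist(voice_feature, enrolled_voices[uid]))

def get_second_language_feature(user_id, language_features):
    """S32-style step: fetch the preset second language feature for that user."""
    return language_features[user_id]

enrolled_voices = {                # user ID -> enrolled voice feature (toy values)
    "user-001": [1.0, 0.5, -0.2],
    "user-002": [-0.8, 1.2, 0.4],
}
language_features = {              # user ID -> stored language feature (toy values)
    "user-001": [9.1, -2.2, 4.4],
    "user-002": [7.6, -1.0, 3.9],
}

uid = find_user_id([0.9, 0.6, -0.1], enrolled_voices)
print(uid, get_second_language_feature(uid, language_features))
```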

This example differs from Example 1 in the division of modules. In this example, the acquisition module 62 is similar to the sound recording module 32 in Example 1 but adds some new functional features; the extraction module 64 is similar to the language feature recognition module 34 and the voice feature recognition module 36 in Example 1 but adds some new functional features; and the function of the feature comparison module 38 in Example 1 is implemented in this example by the lookup module 66 and the determination module 68.

Each of the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following manner: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.

Example 3

An embodiment of the present disclosure further provides a storage medium. In this example, the storage medium may be configured to store program code for performing the following steps: S11, acquiring voice information of a user; S12, extracting a first language feature and a voice feature of the voice information; S13, looking up a preset second language feature corresponding to the voice feature; and S14, determining, according to a first comparison result of the first language feature and the second language feature, whether the voice information is legal.

In this example, the storage medium may further be configured to store program code for performing the following steps: S21, determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and S22, determining whether the voice information is legal according to a result of comparing the vector similarity with a preset threshold, where the voice information is determined to be legal when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

In this example, the storage medium may further be configured to store program code for performing the following steps: S31, looking up user identification information corresponding to the voice feature; and S32, acquiring the second language feature according to the found user identification information.

In this example, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

For specific implementations in this example, reference may be made to the examples described in the foregoing examples and exemplary embodiments, and details are not repeated here.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The above description sets forth merely exemplary embodiments of the present disclosure and is not intended to limit the present disclosure; for those skilled in the art, the present disclosure may have various changes and variations. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Industrial Applicability

Through the present disclosure, the language feature and the voice feature of voice information are acquired and extracted, the corresponding language feature is looked up according to the voice feature, and whether the voice information is legal is determined according to the comparison result between the language features, so that both the language feature and the voice feature of the user's voice information are comprehensively considered in the identification process. The solution of the present disclosure addresses the problem that there is currently no recognition method that combines password recognition and voiceprint recognition technologies to comprehensively consider the language feature and the voice feature of a user, while a key used by password recognition or voiceprint recognition alone is relatively easy for others to steal or misuse, thereby causing economic loss to the user and a poor user experience. By effectively combining password recognition technology with voiceprint recognition technology, and by comprehensively considering the language feature and the voice feature of the user's voice information, the security of identification is improved, the probability that the key is stolen or misused by others is greatly reduced, and the user experience is improved. The present disclosure therefore has industrial applicability.

Claims (11)

1. A method for recognizing user voice information, comprising:
acquiring voice information of a user;
extracting a first language feature and a voice feature of the voice information;
looking up a preset second language feature corresponding to the voice feature; and
determining, according to a first comparison result of the first language feature and the second language feature, whether the voice information is legal.

2. The method according to claim 1, wherein determining whether the voice information is legal according to the first comparison result of the first language feature and the second language feature comprises:
determining a vector similarity between the first language feature and the second language feature according to the first comparison result; and
determining whether the voice information is legal according to a result of comparing the vector similarity with a preset threshold, wherein the voice information is determined to be legal when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

3. The method according to claim 2, wherein the preset threshold is an average of vector similarities among a plurality of the second language features.

4. The method according to claim 1, wherein looking up the preset second language feature corresponding to the voice feature comprises:
looking up user identification information corresponding to the voice feature; and
acquiring the second language feature according to the found user identification information.

5. The method according to claim 4, wherein the correspondence between the voice feature and the user identification information is determined in the following manner: pre-input voice features and user identification information are used as inputs of a neural network model, and training and learning are performed in the neural network model to obtain the correspondence.

6. The method according to any one of claims 1 to 5, wherein the first language feature and the second language feature are Mel-frequency cepstral coefficients (MFCC), and the voice feature is linear prediction cepstral coefficients (LPCC).

7. The method according to any one of claims 1 to 5, wherein the first language feature and the second language feature comprise a voiceprint feature, and the voice feature comprises the linguistic content of a user password.

8. An apparatus for recognizing user voice information, comprising:
an acquisition module (62) configured to acquire voice information of a user;
an extraction module (64) configured to extract a first language feature and a voice feature of the voice information;
a lookup module (66) configured to look up a preset second language feature corresponding to the voice feature; and
a determination module (68) configured to determine, according to a first comparison result of the first language feature and the second language feature, whether the voice information is legal.

9. The apparatus according to claim 8, wherein the determination module (68) is further configured to determine a vector similarity between the first language feature and the second language feature according to the first comparison result, and to determine whether the voice information is legal according to a result of comparing the vector similarity with a preset threshold, wherein the voice information is determined to be legal when the comparison result indicates that the vector similarity is greater than or equal to the preset threshold.

10. The apparatus according to claim 8, wherein the lookup module (66) is further configured to look up user identification information corresponding to the voice feature, and to acquire the second language feature according to the found user identification information.

11. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 7.
PCT/CN2017/115677 2017-01-24 2017-12-12 Method and apparatus for recognizing voice information of user Ceased WO2018137426A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710054959.7A CN108345777A (en) 2017-01-24 2017-01-24 The recognition methods of user voice information and device
CN201710054959.7 2017-01-24

Publications (1)

Publication Number Publication Date
WO2018137426A1 true WO2018137426A1 (en) 2018-08-02

Family

ID=62962910

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115677 Ceased WO2018137426A1 (en) 2017-01-24 2017-12-12 Method and apparatus for recognizing voice information of user

Country Status (2)

Country Link
CN (1) CN108345777A (en)
WO (1) WO2018137426A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128129B (en) * 2019-12-31 2022-06-03 中国银行股份有限公司 Authority management method and device based on voice recognition
CN113976478A (en) * 2021-11-15 2022-01-28 中国联合网络通信集团有限公司 Ore detection method, server, terminal and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN104219195A (en) * 2013-05-29 2014-12-17 腾讯科技(深圳)有限公司 Identity verifying method, device and system
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real person living body identity verification method based on sound-type image feature
CN104834847A (en) * 2014-02-11 2015-08-12 腾讯科技(深圳)有限公司 Identity verification method and device
US20150302855A1 (en) * 2014-04-21 2015-10-22 Qualcomm Incorporated Method and apparatus for activating application by speech input
CN105635087A (en) * 2014-11-20 2016-06-01 阿里巴巴集团控股有限公司 Method and apparatus for verifying user identity through voiceprint

Also Published As

Publication number Publication date
CN108345777A (en) 2018-07-31

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17894027; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17894027; Country of ref document: EP; Kind code of ref document: A1)