WO2019218515A1 - Server, voiceprint-based identity verification method, and storage medium - Google Patents
Server, voiceprint-based identity verification method, and storage medium
- Publication number
- WO2019218515A1 (PCT/CN2018/102118)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice data
- voiceprint
- verification
- duration
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/08—Network architectures or network communication protocols for network security for authentication of entities
- H04L63/0861—Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
Definitions
- the present application relates to the field of communications technologies, and in particular, to a server, a voiceprint-based identity verification method, and a storage medium.
- the voiceprint collection method is usually as follows: voice collection starts once the call is established, the entire call is recorded continuously, and voiceprint features are then extracted and verified.
- this method does not take into account the impact that the low quality of the early recording has on voiceprint feature extraction and verification: the first few seconds to more than ten seconds after the call is connected are also part of the communication-establishment process, and the voice quality during this period is lower than that of the middle and later stages of the call, for example due to noisy background sound and low volume.
- as the call duration increases, if this early portion of the recording continues to be treated as voice data for voiceprint verification, it will degrade the overall quality of the collected voice and thus the accuracy of voiceprint verification.
- the purpose of the present application is to provide a server, a voiceprint-based authentication method, and a storage medium, aiming at improving the accuracy of identity verification based on voiceprint.
- the present application provides a server including a memory and a processor connected to the memory, the memory storing a processing system operable on the processor, the processing system implementing the following steps when executed by the processor:
- after receiving an identity verification request carrying an identity identifier sent by a client, receiving voice data of a first preset duration sent by the client;
- after receiving the voice data of the first preset duration sent by the client, if the voice data received this time is the Nth batch received, splicing the voice data received from the 1st to the Nth time in the chronological order of voice collection to form pending voiceprint verification voice data, where N is a positive integer greater than 1;
- if the duration of the pending voiceprint verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint verification voice data according to a preset culling rule, to obtain current voiceprint verification voice data of the second preset duration after culling;
- the present application further provides a voiceprint-based identity verification method, where the voiceprint-based identity verification method includes:
- after receiving the voice data of the first preset duration sent by the client, if the voice data received this time is the Nth batch received, splicing the voice data received from the 1st to the Nth time in the chronological order of voice collection to form pending voiceprint verification voice data, where N is a positive integer greater than 1;
- if the duration of the pending voiceprint verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint verification voice data according to a preset culling rule, to obtain current voiceprint verification voice data of the second preset duration after culling;
- the present application also provides a computer readable storage medium having stored thereon a processing system that, when executed by a processor, implements the steps of the voiceprint based authentication method described above.
- the beneficial effects of the present application are: in the process of receiving the voice data sent by the client, if voice data collected by the client is received multiple times, the voice data is spliced in the order of collection time; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data is culled, thereby removing the voice data that degrades the overall voice quality and improving the accuracy of voiceprint-based identity verification.
- FIG. 1 is a schematic diagram of an optional application environment of each embodiment of the present application.
- FIG. 2 is a schematic flowchart of a voiceprint-based identity verification method according to a first embodiment of the present application.
- FIG. 3 is a schematic flowchart of a second embodiment of a voiceprint-based identity verification method according to the present application.
- FIG. 4 is a schematic flow chart of a third embodiment of a voiceprint based identity verification method according to the present application.
- FIG. 1 is a schematic diagram of an application environment of a preferred embodiment of the voiceprint-based identity verification method of the present application.
- the application environment diagram includes the server 1 and the terminal device 2.
- the server 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.
- the terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, for example: mobile devices such as personal computers, tablet computers, smart phones, personal digital assistants (PDA), game consoles, Internet Protocol Television (IPTV) devices, smart wearable devices, and navigation devices; and fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
- the server 1 is a device capable of automatically performing numerical calculation and/or information processing according to preset or pre-stored instructions.
- the server 1 may be a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
- the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 communicably connected to each other through a system bus, the memory 11 storing a processing system operable on the processor 12. It should be pointed out that FIG. 1 shows only the server 1 with components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
- the memory 11 includes a memory and at least one type of readable storage medium.
- the memory provides a cache for the operation of the server 1;
- the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, or an optical disk.
- in some embodiments, the readable storage medium may be an internal storage unit of the server 1, such as a hard disk of the server 1; in other embodiments, it may also be an external storage device of the server 1, such as a plug-in hard drive, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the server 1.
- the readable storage medium of the memory 11 is generally used to store an operating system installed on the server 1 and various types of application software, such as program code for storing the processing system in an embodiment of the present application. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
- the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
- the processor 12 is typically used to control the overall operation of the server 1, for example performing control and processing related to data interaction or communication with the terminal device 2.
- the processor 12 is configured to run program code or process data stored in the memory 11, such as running a processing system or the like.
- the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server 1 and other electronic devices.
- the network interface 13 is mainly used to connect the server 1 to the terminal device 2, and establish a data transmission channel and a communication connection between the server 1 and the terminal device 2.
- the processing system is stored in the memory 11 and includes at least one computer readable instruction stored in the memory 11, the at least one computer readable instruction being executable by the processor 12 to implement the methods of various embodiments of the present application;
- the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
- the client is installed in a terminal device such as a mobile phone, a tablet computer, or a personal computer, and requests authentication from the server based on the voiceprint.
- the client collects the user's voice data according to a predetermined time interval, for example, the user's voice data is collected every 2 seconds.
- the terminal device collects the user's voice data in real time through a voice collection device such as a microphone.
- when recording, the power supply preferably uses mains electricity and keeps the current stable, and a suitable sensor should be used for the recording.
- after the voice data of the first preset duration is collected, the client sends the voice data of the first preset duration to the server.
- the first preset duration is, for example, 6 seconds.
- if the voice data received this time is the Nth batch received, the voice data collected from the 1st to the Nth time is spliced in the chronological order of voice collection to form pending voiceprint verification voice data, where N is a positive integer greater than 1;
- after receiving the voice data of the first preset duration sent by the client, if the user's voice data is received multiple times, for example two or more batches are received because the user talks more and the client therefore collects more voice data, the voice data received from the first to the Nth time is spliced in the chronological order of voice collection to obtain the pending voiceprint verification voice data. When the client collects voice data, it marks each batch of voice data with the start time and end time of its collection.
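The splicing step above can be sketched as follows. This is a minimal illustration, assuming each batch of voice data arrives as PCM samples tagged with its collection start and end times; the `Segment` structure and field names are hypothetical, not from the application:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start_time: float        # collection start time (seconds), marked by the client
    end_time: float          # collection end time (seconds)
    samples: List[float]     # raw PCM samples for this batch

def splice_segments(segments: List[Segment]) -> List[float]:
    """Splice the 1st..Nth received batches in chronological order of collection."""
    ordered = sorted(segments, key=lambda s: s.start_time)
    spliced: List[float] = []
    for seg in ordered:
        spliced.extend(seg.samples)
    return spliced

# Batches may arrive out of order; splicing restores collection order.
batches = [
    Segment(2.0, 4.0, [0.3, 0.4]),
    Segment(0.0, 2.0, [0.1, 0.2]),
]
print(splice_segments(batches))  # [0.1, 0.2, 0.3, 0.4]
```

Sorting by the client-marked start time is what makes the result independent of network arrival order.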
- if only one batch of voice data has been received, the voice data received this time can be directly used as the subsequent current voiceprint verification voice data, and identity verification is performed based on the current voiceprint verification voice data.
- if the duration of the pending voiceprint verification voice data is greater than the second preset duration, voice data is culled from it according to the preset culling rule, to obtain the current voiceprint verification voice data of the second preset duration after culling.
- the second preset duration is, for example, 12 seconds.
- with voice data of the second preset duration available, the voice data can be analyzed more accurately, so that the identity of the user is verified accurately.
- the pending voiceprint verification voice data may be culled to remove the portion of the voice data that degrades the overall voice quality.
- the preset culling rule comprises: subtracting the second preset duration from the duration of the pending voiceprint verification voice data to obtain a culling duration; and culling, from the pending voiceprint verification voice data, the earliest-collected voice data amounting to the culling duration, to obtain the current voiceprint verification voice data of the second preset duration after culling.
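A sketch of this culling rule, assuming the spliced audio is a flat sample array at a known sampling rate (the function and variable names are illustrative):

```python
def cull_to_duration(samples, sample_rate, second_preset_duration):
    """Drop the earliest-collected audio so that exactly
    `second_preset_duration` seconds remain (the preset culling rule)."""
    total_duration = len(samples) / sample_rate
    cull_duration = total_duration - second_preset_duration  # duration to remove
    if cull_duration <= 0:
        return samples  # already within the second preset duration, keep as-is
    cull_samples = int(cull_duration * sample_rate)
    return samples[cull_samples:]  # keep only the later-collected audio

# 16 s of (dummy) spliced audio at 8 kHz, trimmed to a 12 s second preset duration:
rate = 8000
audio = [0.0] * (16 * rate)
current = cull_to_duration(audio, rate, 12)
print(len(current) / rate)  # 12.0
```

Because the spliced array is in collection order, slicing off the front is exactly "culling the voice data with the earliest collection time".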
- if the duration of the pending voiceprint verification voice data is less than or equal to the second preset duration, the pending voiceprint verification voice data is still used as the subsequent current voiceprint verification voice data, based on which the user's identity is verified.
- the step of constructing the current voiceprint discrimination vector of the current voiceprint verification voice data includes: processing the current voiceprint verification voice data to extract a preset type of voiceprint feature, and constructing a corresponding voiceprint feature vector based on the preset type of voiceprint feature; and inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint verification voice data.
- voiceprint features include multiple types, such as wideband voiceprints, narrowband voiceprints, and amplitude voiceprints; the preset type of voiceprint feature in this embodiment is preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the current voiceprint verification voice data, and the default filter is a Mel filter bank.
- the voiceprint features of the current voiceprint verification voice data are composed into a feature data matrix, and this feature data matrix is the corresponding voiceprint feature vector.
- pre-emphasis is in effect a high-pass filtering process: it filters out low-frequency data so that the high-frequency characteristics of the current voiceprint verification voice data become more prominent.
- cepstrum analysis on the Mel spectrum consists, for example, of taking the logarithm and applying an inverse transform.
- the inverse transform is generally realized by the discrete cosine transform (DCT).
- the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstrum coefficients.
- the MFCC of each frame is the voiceprint feature of that frame of speech data, and the MFCCs of all frames are composed into a feature data matrix, which is the voiceprint feature vector.
- because the Mel frequency scale is closer to the human auditory system than the linearly spaced frequency bands used in the normal cepstrum, composing the voiceprint feature vector from the MFCCs of the speech data can improve the accuracy of the identity verification.
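The MFCC pipeline described above can be sketched in NumPy as follows. The frame size, hop, FFT length, filter count, and the 8 kHz sampling rate are illustrative assumptions, not values from the application; only the overall flow (pre-emphasis, framing, power spectrum, Mel filter bank, log, DCT, coefficients 2 to 13) follows the text:

```python
import numpy as np

def mfcc(signal, sample_rate=8000, frame_len=200, hop=80,
         n_filters=26, n_coeffs=12):
    """Sketch of the MFCC pipeline: pre-emphasis, framing, power spectrum,
    Mel filter bank, log, DCT, keep the 2nd..13th coefficients."""
    # Pre-emphasis: high-pass filtering that makes high frequencies more prominent.
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with a Hamming window.
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    nfft = 256
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Mel filter bank: triangular filters spaced evenly on the Mel scale.
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)
    # Log Mel energies, then cepstral analysis via DCT-II.
    feat = np.log(power @ fbank.T + 1e-10)
    n = feat.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
    cepstra = feat @ dct
    return cepstra[:, 1:1 + n_coeffs]  # 2nd..13th coefficients per frame

# Feature data matrix for 1 s of dummy audio: one row of 12 MFCCs per frame.
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
matrix = mfcc(sig)
print(matrix.shape)  # (98, 12)
```

The returned matrix is the "feature data matrix" of the text: one row of coefficients per frame, rows stacked over the whole utterance.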
- the voiceprint feature vector is input into the pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint verification voice data; for example, the pre-trained background channel model processes the feature matrix corresponding to the current voiceprint verification voice data to determine the current voiceprint discrimination vector.
- the background channel model is a set of Gaussian mixture models, and the training process of the background channel model includes the following steps: 1. acquiring a preset number of voice data samples, each voice data sample corresponding to a standard voiceprint discrimination vector; 2. processing each voice data sample separately to extract the preset type of voiceprint feature corresponding to each sample, and constructing the voiceprint feature vector of each sample based on that feature; 3. training the Gaussian mixture model on the feature vectors, and using the verification set to verify the accuracy of the trained Gaussian mixture model after training is completed. If the accuracy is greater than a preset threshold (for example, 98.5%), training ends and the trained Gaussian mixture model is used as the background channel model; if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and the model is retrained until the accuracy of the Gaussian mixture model is greater than the preset threshold.
- the background channel model pre-trained in this embodiment is obtained by mining and comparing a large amount of voice data.
- this model can accurately depict the user's background voiceprint characteristics while maximally retaining the user's own voiceprint features. These background characteristics can be removed at recognition time so that the inherent characteristics of the user's voice are extracted, which can greatly improve the accuracy and efficiency of user identity verification.
- calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating the identity verification result based on the calculated distance, comprises:
- calculating the cosine distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, where one operand is the standard voiceprint discrimination vector and the other is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
- when storing a user's standard voiceprint discrimination vector, the user's identity identifier may be carried with it.
- at verification time, the corresponding standard voiceprint discrimination vector is obtained by matching against the identity identifier carried with the current verification request, and the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated; verifying the identity of the target user with this cosine distance improves the accuracy of the identity verification.
- in the process of receiving the voice data sent by the client, if voice data collected by the client is received multiple times, the voice data is spliced in the order of collection time; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data is culled, thereby removing the voice data that degrades the overall voice quality and improving the accuracy of voiceprint-based identity verification.
- FIG. 2 is a schematic flowchart of an embodiment of a voiceprint-based identity verification method according to an embodiment of the present disclosure.
- the voiceprint-based identity verification method includes the following steps:
- Step S1: after receiving the identity verification request sent by the client, receiving the voice data of the first preset duration sent by the client;
- the client is installed in a terminal device such as a mobile phone, a tablet computer, or a personal computer, and requests authentication from the server based on the voiceprint.
- the client collects the user's voice data according to a predetermined time interval, for example, the user's voice data is collected every 2 seconds.
- the terminal device collects the user's voice data in real time through a voice collection device such as a microphone.
- when recording, the power supply preferably uses mains electricity and keeps the current stable, and a suitable sensor should be used for the recording.
- after the voice data of the first preset duration is collected, the client sends the voice data of the first preset duration to the server.
- the first preset duration is, for example, 6 seconds.
- Step S2: after receiving the voice data of the first preset duration sent by the client, if the voice data received this time is the Nth batch received, splicing the voice data received from the first time to the Nth time in the chronological order of voice collection to form pending voiceprint verification voice data, where N is a positive integer greater than 1;
- after receiving the voice data of the first preset duration sent by the client, if the user's voice data is received multiple times, for example two or more batches are received because the user talks more and the client therefore collects more voice data, the voice data received from the first to the Nth time is spliced in the chronological order of voice collection to obtain the pending voiceprint verification voice data. When the client collects voice data, it marks each batch of voice data with the start time and end time of its collection.
- after receiving the voice data of the first preset duration sent by the client, if only one batch of voice data has been received, the user speaks less and the client can only collect voice data of a short duration, with no further voice data collected later; in this case, the voice data received this time can be directly used as the subsequent current voiceprint verification voice data, and identity verification is performed based on the current voiceprint verification voice data.
- Step S3: if the duration of the pending voiceprint verification voice data is greater than the second preset duration, culling voice data from it according to the preset culling rule, to obtain the current voiceprint verification voice data of the second preset duration after culling;
- the second preset duration is, for example, 12 seconds.
- with voice data of the second preset duration available, the voice data can be analyzed more accurately, so that the identity of the user is verified accurately.
- the pending voiceprint verification voice data may be culled to remove the portion of the voice data that degrades the overall voice quality.
- the preset culling rule comprises: subtracting the second preset duration from the duration of the pending voiceprint verification voice data to obtain a culling duration; and culling, from the pending voiceprint verification voice data, the earliest-collected voice data amounting to the culling duration, to obtain the current voiceprint verification voice data of the second preset duration after culling.
- if the duration of the pending voiceprint verification voice data is less than or equal to the second preset duration, the pending voiceprint verification voice data is still used as the subsequent current voiceprint verification voice data, based on which the user's identity is verified.
- Step S4: constructing the current voiceprint discrimination vector of the current voiceprint verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to the predetermined mapping relationship between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
- the step of constructing the current voiceprint discrimination vector of the current voiceprint verification voice data includes: processing the current voiceprint verification voice data to extract a preset type of voiceprint feature, and constructing a corresponding voiceprint feature vector based on the preset type of voiceprint feature; and inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint verification voice data.
- voiceprint features include multiple types, such as wideband voiceprints, narrowband voiceprints, and amplitude voiceprints; the preset type of voiceprint feature in this embodiment is preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the current voiceprint verification voice data, and the default filter is a Mel filter bank.
- the voiceprint features of the current voiceprint verification voice data are composed into a feature data matrix, and this feature data matrix is the corresponding voiceprint feature vector.
- pre-emphasis is in effect a high-pass filtering process: it filters out low-frequency data so that the high-frequency characteristics of the current voiceprint verification voice data become more prominent.
- cepstrum analysis on the Mel spectrum consists, for example, of taking the logarithm and applying an inverse transform.
- the inverse transform is generally realized by the discrete cosine transform (DCT).
- the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstrum coefficients.
- the MFCC of each frame is the voiceprint feature of that frame of speech data, and the MFCCs of all frames are composed into a feature data matrix, which is the voiceprint feature vector.
- because the Mel frequency scale is closer to the human auditory system than the linearly spaced frequency bands used in the normal cepstrum, composing the voiceprint feature vector from the MFCCs of the speech data can improve the accuracy of the identity verification.
- the voiceprint feature vector is input into the pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint verification voice data; for example, the pre-trained background channel model processes the feature matrix corresponding to the current voiceprint verification voice data to determine the current voiceprint discrimination vector.
- the background channel model is a set of Gaussian mixture models, and the training process of the background channel model includes the following steps: 1. acquiring a preset number of voice data samples, each voice data sample corresponding to a standard voiceprint discrimination vector; 2. processing each voice data sample separately to extract the preset type of voiceprint feature corresponding to each sample, and constructing the voiceprint feature vector of each sample based on that feature; 3. training the Gaussian mixture model on the feature vectors, and using the verification set to verify the accuracy of the trained Gaussian mixture model after training is completed. If the accuracy is greater than a preset threshold (for example, 98.5%), training ends and the trained Gaussian mixture model is used as the background channel model; if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and the model is retrained until the accuracy of the Gaussian mixture model is greater than the preset threshold.
- the background channel model pre-trained in this embodiment is obtained by mining and comparing a large amount of voice data.
- this model can accurately depict the user's background voiceprint characteristics while maximally retaining the user's own voiceprint features. These background characteristics can be removed at recognition time so that the inherent characteristics of the user's voice are extracted, which can greatly improve the accuracy and efficiency of user identity verification.
- calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating the identity verification result based on the calculated distance, comprises:
- calculating the cosine distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, where one operand is the standard voiceprint discrimination vector and the other is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
- when storing a user's standard voiceprint discrimination vector, the user's identity identifier may be carried with it.
- at verification time, the corresponding standard voiceprint discrimination vector is obtained by matching against the identity identifier carried with the current verification request, and the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated; verifying the identity of the target user with this cosine distance improves the accuracy of the identity verification.
- in the process of receiving the voice data sent by the client, if voice data collected by the client is received multiple times, the voice data is spliced in the order of collection time; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data is culled, thereby removing the voice data that degrades the overall voice quality and improving the accuracy of voiceprint-based identity verification.
- the present application also provides a computer readable storage medium having stored thereon a processing system that, when executed by a processor, implements the steps of the voiceprint based authentication method described above.
- the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also purely in hardware, but in many cases the former is the better implementation.
- based on such understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), which includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.
Abstract
A server, a voiceprint-based identity verification method, and a storage medium. The method includes: after an identity verification request is received, receiving voice data sent by a client; after the voice data is received, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in chronological order; if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling the pending voiceprint-verification voice data according to a preset culling rule to obtain current voiceprint-verification voice data of the second preset duration; and constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the corresponding standard voiceprint discrimination vector, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance. The method improves the accuracy of voiceprint-based identity verification.
Description
Priority Claim
Under the Paris Convention, this application claims priority to Chinese patent application No. CN2018104566454, filed on May 14, 2018 and entitled "Server, Voiceprint-Based Identity Verification Method, and Storage Medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of communication technology, and in particular to a server, a voiceprint-based identity verification method, and a storage medium.
At present, in remote voiceprint verification schemes, voiceprints are typically collected as follows: voice collection starts once the call is established, the entire speech is collected continuously, and voiceprint features are then extracted and verified. This approach does not account for the impact of the low quality of the early-stage recording on voiceprint feature extraction and verification. Moreover, the first few seconds to more than ten seconds after a call connects are part of the communication setup process, during which the speech quality is lower than in the middle and later stages of the call, for example because of noisy background sound and low volume. As the call lengthens, continuing to include this early recording in the voiceprint-verification voice data degrades the overall quality of the collected speech and thus the accuracy of voiceprint verification.
Summary of the Invention
The purpose of this application is to provide a server, a voiceprint-based identity verification method, and a storage medium, aiming to improve the accuracy of voiceprint-based identity verification.
To achieve the above purpose, this application provides a server. The server includes a memory and a processor connected to the memory; the memory stores a processing system runnable on the processor, and the processing system, when executed by the processor, implements the following steps:
after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client;
after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1;
if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration;
constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
To achieve the above purpose, this application further provides a voiceprint-based identity verification method, which includes:
S1, after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client;
S2, after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1;
S3, if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration;
S4, constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
This application further provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the voiceprint-based identity verification method described above.
The beneficial effects of this application are as follows: while receiving voice data sent by the client, if voice data collected by the client has been received multiple times, the pieces of voice data can be spliced in the chronological order of their collection; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data can be culled, so as to remove the voice data that degrades the overall speech quality and improve the accuracy of voiceprint-based identity verification.
FIG. 1 is a schematic diagram of an optional application environment for the embodiments of this application;
FIG. 2 is a schematic flowchart of a first embodiment of the voiceprint-based identity verification method of this application;
FIG. 3 is a schematic flowchart of a second embodiment of the voiceprint-based identity verification method of this application;
FIG. 4 is a schematic flowchart of a third embodiment of the voiceprint-based identity verification method of this application.
To make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that a person of ordinary skill in the art can realize the combination; where a combination of technical solutions is contradictory or unrealizable, the combination should be deemed not to exist and falls outside the scope of protection claimed by this application.
Referring to FIG. 1, a schematic diagram of the application environment of a preferred embodiment of the voiceprint-based identity verification method of this application is shown. The application environment includes a server 1 and a terminal device 2. The server 1 can exchange data with the terminal device 2 via a suitable technology such as a network or near-field communication.
The terminal device 2 includes, but is not limited to, any electronic product capable of human-computer interaction with a user via a keyboard, mouse, remote control, touchpad, voice-control device, or the like, for example mobile devices such as personal computers, tablet computers, smartphones, personal digital assistants (PDAs), game consoles, Internet Protocol Television (IPTV), smart wearable devices, and navigation devices, or fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
The server 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions. The server 1 may be a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another via a system bus; the memory 11 stores a processing system runnable on the processor 12. It should be noted that FIG. 1 only shows the server 1 with components 11-13, but it should be understood that not all of the illustrated components are required and that more or fewer components may be implemented instead.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the server 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc. In some embodiments, the readable storage medium may be an internal storage unit of the server 1, for example the hard disk of the server 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the server 1, for example a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the server 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various kinds of application software installed on the server 1, for example the program code of the processing system in an embodiment of this application. In addition, the memory 11 may also be used to temporarily store various kinds of data that have been output or are to be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1, for example performing control and processing related to data exchange or communication with the terminal device 2. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the processing system.
The network interface 13 may include a wireless network interface or a wired network interface and is generally used to establish communication connections between the server 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the server 1 with the terminal device 2 and to establish a data transmission channel and a communication connection between the server 1 and the terminal device 2.
The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction is executable by the processor 12 to implement the methods of the embodiments of this application, and it can be divided into different logical modules according to the functions implemented by its parts.
In an embodiment, the processing system, when executed by the processor 12, implements the following steps:
after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client;
In this embodiment, the client is installed in a terminal device such as a mobile phone, tablet computer, or personal computer, and requests identity verification from the server based on voiceprints. The client collects the user's voice data at predetermined time intervals, for example once every 2 seconds. The terminal device collects the user's voice data in real time through a voice collection device such as a microphone. When collecting voice data, interference from environmental noise and from the terminal device itself should be avoided as far as possible. The terminal device should be kept at an appropriate distance from the user, terminal devices with large distortion should be avoided, mains power is preferred as the power supply with the current kept stable, and a sensor should be used when recording.
Each time the client has collected voice data of the first preset duration, it sends that voice data to the server. Preferably, the first preset duration is 6 seconds.
after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1;
In one embodiment, after the voice data of the first preset duration sent by the client has been received, if the user's voice data has been received multiple times, for example twice or more, this indicates that the user has spoken a lot and the client has been able to collect a large amount of voice data. In this case, the first through Nth received voice data are spliced in the chronological order of voice collection to obtain the pending voiceprint-verification voice data. Each time the client collects voice data, the start time and end time of the collection are marked in that voice data.
In another embodiment, after the voice data of the first preset duration sent by the client has been received, if only the first received voice data has been received so far, this indicates that the user has spoken little: the client can only collect voice data of a short duration and no further voice data of the user can be collected. In this case, in order to still verify the user's identity and improve the flexibility of identity verification, the voice data received this time can be used directly as the subsequent current voiceprint-verification voice data, and identity verification is performed on that basis.
if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration;
The second preset duration is, for example, 12 seconds. Providing voice data of the second preset duration allows the voice data to be analyzed relatively accurately and the user's identity to be verified accurately.
In one embodiment, if the duration of the pending voiceprint-verification voice data is greater than the second preset duration, voice data can be culled from the pending voiceprint-verification voice data so as to remove the portion that degrades the overall speech quality.
Preferably, the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
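The splice-and-cull rule above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `(start, end, samples)` chunk representation and the 12-second default are assumptions chosen to mirror the text.

```python
def build_verification_audio(chunks, second_preset=12.0):
    """Splice client chunks in collection order, then cull the earliest
    audio so that at most `second_preset` seconds remain.

    chunks: list of (start_time, end_time, samples) tuples; each chunk
    carries the collection start/end times marked by the client.
    """
    chunks = sorted(chunks, key=lambda c: c[0])           # splice by collection time
    total = sum(end - start for start, end, _ in chunks)
    if total <= second_preset:
        return chunks                                     # nothing to cull
    cull = total - second_preset                          # culling duration
    kept = []
    for start, end, samples in chunks:
        dur = end - start
        if cull >= dur:                                   # drop this whole early chunk
            cull -= dur
            continue
        if cull > 0:                                      # drop only the first `cull` seconds
            n = int(len(samples) * cull / dur)
            samples, start, cull = samples[n:], start + cull, 0.0
        kept.append((start, end, samples))
    return kept
```

With three 6-second chunks (18 s total) and a 12-second target, the culling duration is 6 s, so the earliest chunk is dropped in full.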
In another embodiment, if the duration of the pending voiceprint-verification voice data is greater than the second preset duration, in order to improve the flexibility of identity verification, the pending voiceprint-verification voice data may still be used to verify the user's identity: it is taken as the subsequent current voiceprint-verification voice data, and identity verification is performed on that basis.
constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
To effectively reduce the computational load of voiceprint recognition and increase its speed, in one embodiment the step of constructing the current voiceprint discrimination vector of the current voiceprint-verification voice data specifically includes: processing the current voiceprint-verification voice data to extract voiceprint features of a preset type and constructing a corresponding voiceprint feature vector based on those preset-type voiceprint features; and inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data.
Voiceprint features come in several types, for example wideband voiceprints, narrowband voiceprints, and amplitude voiceprints. In this embodiment, the preset-type voiceprint features are preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the current voiceprint-verification voice data, and the preset filter is a mel filter. When constructing the corresponding voiceprint feature vector, the voiceprint features of the current voiceprint-verification voice data are assembled into a feature data matrix, which is the corresponding voiceprint feature vector.
Specifically, pre-emphasis and windowing are applied to the current voiceprint-verification voice data; a Fourier transform is applied to each window to obtain the corresponding spectrum; the spectrum is fed into a mel filter to output a mel spectrum; cepstral analysis is performed on the mel spectrum to obtain the Mel frequency cepstrum coefficients (MFCC); and the corresponding voiceprint feature vector is assembled from the MFCC.
Pre-emphasis is in fact high-pass filtering: it removes low-frequency data so that the high-frequency characteristics of the current voiceprint-verification voice data stand out. Specifically, the transfer function of the high-pass filter is H(Z) = 1 - αZ^(-1), where Z is the voice data and α is a constant coefficient; preferably, α is 0.97. Because voice data deviates from the original speech to some extent after framing, windowing must be applied to the voice data. Cepstral analysis on the mel spectrum consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCC. The MFCC are the voiceprint features of each frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector.
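The MFCC pipeline described above (pre-emphasis with α = 0.97, windowing, Fourier transform, mel filtering, logarithm, and a DCT keeping the 2nd through 13th coefficients) can be sketched in NumPy as below. The frame size, hop length, and filter count are common defaults chosen for illustration; the application does not specify them.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=12, alpha=0.97):
    """Return a (frames x 12) MFCC feature matrix (the 'voiceprint feature vector')."""
    # Pre-emphasis: high-pass filter H(z) = 1 - alpha * z^-1, alpha = 0.97
    emph = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing + Hamming window
    n_frames = 1 + (len(emph) - n_fft) // hop
    frames = np.stack([emph[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)          # log mel spectrum
    # Cepstral analysis: DCT-II, keep coefficients 2..13 as the MFCC
    k, n = np.arange(n_mels), np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(k, (2 * n + 1) / (2.0 * n_mels)))
    return (logmel @ dct.T)[:, 1:1 + n_ceps]
```

Stacking the per-frame rows yields the feature data matrix that the description calls the voiceprint feature vector.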
This embodiment assembles the voiceprint feature vector from the MFCC of the voice data; because mel-spaced frequency bands approximate the human auditory system better than the linearly spaced bands used in a normal log cepstrum, this improves the accuracy of identity verification.
The voiceprint feature vector is then input into the pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data; for example, the pre-trained background channel model is used to compute the feature matrix corresponding to the current voiceprint-verification voice data so as to determine the current voiceprint discrimination vector.
To construct the current voiceprint discrimination vector efficiently and with high quality, in a preferred embodiment the background channel model is a set of Gaussian mixture models, and its training process includes the following steps: 1. obtain a preset number of voice data samples, each of which corresponds to a standard voiceprint discrimination vector; 2. process each voice data sample to extract its preset-type voiceprint features, and construct each sample's voiceprint feature vector based on those features; 3. divide all extracted preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, where the sum of the first and second percentages is less than or equal to 100%; 4. train the set of Gaussian mixture models with the preset-type voiceprint feature vectors in the training set, and after training use the validation set to verify the accuracy of the trained models; if the accuracy is greater than a preset threshold (for example, 98.5%), training ends and the trained set of Gaussian mixture models is used as the background channel model; if the accuracy is less than or equal to the preset threshold, increase the number of voice data samples and retrain until the accuracy of the set of Gaussian mixture models exceeds the preset threshold.
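The train-validate-retrain control loop of steps 1-4 can be sketched as follows. The application does not fix a particular GMM implementation, so the model is stubbed with a simple nearest-mean scorer; the synthetic sample generator, the 70/30 split, and everything but the 98.5% threshold and the doubling of the sample count are illustrative assumptions.

```python
import numpy as np

def make_samples(n, rng):
    """Hypothetical stand-in for collecting labelled voiceprint feature vectors."""
    labels = rng.integers(0, 2, n)
    feats = rng.normal(loc=labels[:, None] * 5.0, scale=1.0, size=(n, 8))
    return feats, labels

def train_background_model(threshold=0.985, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    while True:
        feats, labels = make_samples(n_samples, rng)      # steps 1-2: samples -> feature vectors
        n_train = int(0.7 * len(feats))                   # step 3: 70% train / 30% validation
        tr_f, va_f = feats[:n_train], feats[n_train:]
        tr_l, va_l = labels[:n_train], labels[n_train:]
        # step 4: "train" the model (nearest-mean scorer standing in for the GMM set)
        means = np.stack([tr_f[tr_l == k].mean(axis=0) for k in (0, 1)])
        pred = np.argmin(((va_f[:, None, :] - means) ** 2).sum(-1), axis=1)
        acc = (pred == va_l).mean()                       # validate accuracy on held-out set
        if acc > threshold:
            return means, acc                             # accurate enough: use as background model
        n_samples *= 2                                    # otherwise add samples and retrain
```

The loop terminates only once the validation accuracy exceeds the preset threshold, mirroring the retraining condition in step 4.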
The pre-trained background channel model of this embodiment is obtained by mining and comparative training over a large amount of voice data. It can precisely characterize the background voiceprint features of the user's speech while preserving the user's own voiceprint features to the greatest extent, and it can remove that background component at recognition time to extract the intrinsic features of the user's voice, greatly improving the accuracy and efficiency of user identity verification.
In one embodiment, the step of calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector and generating the identity verification result based on the calculated distance includes:
calculating the cosine distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, where one of the two vectors in the cosine-distance computation is the standard voiceprint discrimination vector and the other is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generating verification-pass information; if the cosine distance is greater than the preset distance threshold, generating verification-fail information.
When the user's standard voiceprint discrimination vector is stored, it can carry the user's identity identifier. When verifying the user's identity, the corresponding standard voiceprint discrimination vector is obtained by matching against the identification information of the current voiceprint discrimination vector, and the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated; the cosine distance is used to verify the target user's identity, improving the accuracy of identity verification.
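The lookup-and-compare step can be sketched as below. The exact cosine-distance formula is not reproduced in this text (it appears as an elided formula), so the standard 1 - cosine-similarity form and the 0.3 threshold are assumptions for illustration; the text only specifies that a distance at or below the preset threshold passes.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means more alike (assumed reading
    of the application's 'cosine distance'; the exact formula is elided there)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify_identity(current_vec, identity_id, standard_vectors, threshold=0.3):
    """Match the standard voiceprint discrimination vector by identity identifier,
    then compare by cosine distance (the threshold value is illustrative)."""
    standard_vec = standard_vectors[identity_id]   # predetermined id -> vector mapping
    if cosine_distance(current_vec, standard_vec) <= threshold:
        return "verification passed"
    return "verification failed"
```

A vector pointing in the same direction as the stored one yields distance 0 and passes; an orthogonal vector yields distance 1 and fails.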
Compared with the prior art, in this application, while voice data sent by the client is being received, if voice data collected by the client has been received multiple times, the pieces of voice data can be spliced in the chronological order of their collection; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data can be culled, so as to remove the voice data that degrades the overall speech quality and improve the accuracy of voiceprint-based identity verification.
As shown in FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the voiceprint-based identity verification method of this application. The voiceprint-based identity verification method includes the following steps:
Step S1, after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client;
In this embodiment, the client is installed in a terminal device such as a mobile phone, tablet computer, or personal computer, and requests identity verification from the server based on voiceprints. The client collects the user's voice data at predetermined time intervals, for example once every 2 seconds. The terminal device collects the user's voice data in real time through a voice collection device such as a microphone. When collecting voice data, interference from environmental noise and from the terminal device itself should be avoided as far as possible. The terminal device should be kept at an appropriate distance from the user, terminal devices with large distortion should be avoided, mains power is preferred as the power supply with the current kept stable, and a sensor should be used when recording.
Each time the client has collected voice data of the first preset duration, it sends that voice data to the server. Preferably, the first preset duration is 6 seconds.
Step S2, after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1;
In one embodiment, after the voice data of the first preset duration sent by the client has been received, if the user's voice data has been received multiple times, for example twice or more, this indicates that the user has spoken a lot and the client has been able to collect a large amount of voice data. In this case, the first through Nth received voice data are spliced in the chronological order of voice collection to obtain the pending voiceprint-verification voice data. Each time the client collects voice data, the start time and end time of the collection are marked in that voice data.
In other embodiments, as shown in FIG. 3, after the voice data of the first preset duration sent by the client has been received, if only the first received voice data has been received so far, this indicates that the user has spoken little: the client can only collect voice data of a short duration and no further voice data of the user can be collected. In this case, in order to still verify the user's identity and improve the flexibility of identity verification, the voice data received this time can be used directly as the subsequent current voiceprint-verification voice data, and identity verification is performed on that basis.
Step S3, if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration;
The second preset duration is, for example, 12 seconds. Providing voice data of the second preset duration allows the voice data to be analyzed relatively accurately and the user's identity to be verified accurately.
In one embodiment, if the duration of the pending voiceprint-verification voice data is greater than the second preset duration, voice data can be culled from the pending voiceprint-verification voice data so as to remove the portion that degrades the overall speech quality.
Preferably, the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
In other embodiments, as shown in FIG. 4, if the duration of the pending voiceprint-verification voice data is greater than the second preset duration, in order to improve the flexibility of identity verification, the pending voiceprint-verification voice data may still be used to verify the user's identity: it is taken as the subsequent current voiceprint-verification voice data, and identity verification is performed on that basis.
Step S4, constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
To effectively reduce the computational load of voiceprint recognition and increase its speed, in one embodiment the step of constructing the current voiceprint discrimination vector of the current voiceprint-verification voice data specifically includes: processing the current voiceprint-verification voice data to extract voiceprint features of a preset type and constructing a corresponding voiceprint feature vector based on those preset-type voiceprint features; and inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data.
Voiceprint features come in several types, for example wideband voiceprints, narrowband voiceprints, and amplitude voiceprints. In this embodiment, the preset-type voiceprint features are preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the current voiceprint-verification voice data, and the preset filter is a mel filter. When constructing the corresponding voiceprint feature vector, the voiceprint features of the current voiceprint-verification voice data are assembled into a feature data matrix, which is the corresponding voiceprint feature vector.
Specifically, pre-emphasis and windowing are applied to the current voiceprint-verification voice data; a Fourier transform is applied to each window to obtain the corresponding spectrum; the spectrum is fed into a mel filter to output a mel spectrum; cepstral analysis is performed on the mel spectrum to obtain the Mel frequency cepstrum coefficients (MFCC); and the corresponding voiceprint feature vector is assembled from the MFCC.
Pre-emphasis is in fact high-pass filtering: it removes low-frequency data so that the high-frequency characteristics of the current voiceprint-verification voice data stand out. Specifically, the transfer function of the high-pass filter is H(Z) = 1 - αZ^(-1), where Z is the voice data and α is a constant coefficient; preferably, α is 0.97. Because voice data deviates from the original speech to some extent after framing, windowing must be applied to the voice data. Cepstral analysis on the mel spectrum consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCC. The MFCC are the voiceprint features of each frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector.
This embodiment assembles the voiceprint feature vector from the MFCC of the voice data; because mel-spaced frequency bands approximate the human auditory system better than the linearly spaced bands used in a normal log cepstrum, this improves the accuracy of identity verification.
The voiceprint feature vector is then input into the pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data; for example, the pre-trained background channel model is used to compute the feature matrix corresponding to the current voiceprint-verification voice data so as to determine the current voiceprint discrimination vector.
To construct the current voiceprint discrimination vector efficiently and with high quality, in a preferred embodiment the background channel model is a set of Gaussian mixture models, and its training process includes the following steps: 1. obtain a preset number of voice data samples, each of which corresponds to a standard voiceprint discrimination vector; 2. process each voice data sample to extract its preset-type voiceprint features, and construct each sample's voiceprint feature vector based on those features; 3. divide all extracted preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, where the sum of the first and second percentages is less than or equal to 100%; 4. train the set of Gaussian mixture models with the preset-type voiceprint feature vectors in the training set, and after training use the validation set to verify the accuracy of the trained models; if the accuracy is greater than a preset threshold (for example, 98.5%), training ends and the trained set of Gaussian mixture models is used as the background channel model; if the accuracy is less than or equal to the preset threshold, increase the number of voice data samples and retrain until the accuracy of the set of Gaussian mixture models exceeds the preset threshold.
The pre-trained background channel model of this embodiment is obtained by mining and comparative training over a large amount of voice data. It can precisely characterize the background voiceprint features of the user's speech while preserving the user's own voiceprint features to the greatest extent, and it can remove that background component at recognition time to extract the intrinsic features of the user's voice, greatly improving the accuracy and efficiency of user identity verification.
In one embodiment, the step of calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector and generating the identity verification result based on the calculated distance includes:
calculating the cosine distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, where one of the two vectors in the cosine-distance computation is the standard voiceprint discrimination vector and the other is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generating verification-pass information; if the cosine distance is greater than the preset distance threshold, generating verification-fail information.
When the user's standard voiceprint discrimination vector is stored, it can carry the user's identity identifier. When verifying the user's identity, the corresponding standard voiceprint discrimination vector is obtained by matching against the identification information of the current voiceprint discrimination vector, and the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated; the cosine distance is used to verify the target user's identity, improving the accuracy of identity verification.
Compared with the prior art, in this application, while voice data sent by the client is being received, if voice data collected by the client has been received multiple times, the pieces of voice data can be spliced in the chronological order of their collection; if the duration of the spliced voice data is greater than the second preset duration, the earliest-collected portion of the spliced voice data can be culled, so as to remove the voice data that degrades the overall speech quality and improve the accuracy of voiceprint-based identity verification.
This application further provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the voiceprint-based identity verification method described above.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments.
From the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to perform the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structural or flow transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (20)
- A server, characterized in that the server includes a memory and a processor connected to the memory, the memory stores a processing system runnable on the processor, and the processing system, when executed by the processor, implements the following steps: after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client; after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1; if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration; constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
- The server according to claim 1, characterized in that the processing system, when executed by the processor, further implements the following step: after receiving the voice data of the first preset duration sent by the client, if only the first received voice data has been received so far, taking the voice data received this time as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The server according to claim 1, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The server according to claim 2, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The server according to claim 1, characterized in that the processing system, when executed by the processor, further implements the following step: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The server according to claim 2, characterized in that the processing system, when executed by the processor, further implements the following step: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The server according to claim 1 or 2, characterized in that the step of constructing the current voiceprint discrimination vector of the current voiceprint-verification voice data includes: processing the current voiceprint-verification voice data to extract voiceprint features of a preset type, and constructing a corresponding voiceprint feature vector based on the preset-type voiceprint features; inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data; and in that the step of calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector and generating an identity verification result based on the calculated distance includes: if the cosine distance is less than or equal to a preset distance threshold, generating identity-verification-pass information; if the cosine distance is greater than the preset distance threshold, generating identity-verification-fail information.
- A voiceprint-based identity verification method, characterized in that the voiceprint-based identity verification method includes: S1, after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client; S2, after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1; S3, if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration; S4, constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
- The voiceprint-based identity verification method according to claim 8, characterized in that, after step S1, the method further includes: after receiving the voice data of the first preset duration sent by the client, if only the first received voice data has been received so far, taking the voice data received this time as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The voiceprint-based identity verification method according to claim 8, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The voiceprint-based identity verification method according to claim 9, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The voiceprint-based identity verification method according to claim 8, characterized in that, after step S2, the method further includes: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The voiceprint-based identity verification method according to claim 9, characterized in that, after step S2, the method further includes: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The voiceprint-based identity verification method according to claim 8 or 9, characterized in that the step of constructing the current voiceprint discrimination vector of the current voiceprint-verification voice data includes: processing the current voiceprint-verification voice data to extract voiceprint features of a preset type, and constructing a corresponding voiceprint feature vector based on the preset-type voiceprint features; inputting the voiceprint feature vector into a pre-trained background channel model to construct the current voiceprint discrimination vector corresponding to the current voiceprint-verification voice data; and in that the step of calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector and generating an identity verification result based on the calculated distance includes: if the cosine distance is less than or equal to a preset distance threshold, generating identity-verification-pass information; if the cosine distance is greater than the preset distance threshold, generating identity-verification-fail information.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a processing system which, when executed by a processor, implements the steps of: after receiving an identity verification request carrying an identity identifier from a client, receiving voice data of a first preset duration sent by the client; after receiving the voice data of the first preset duration sent by the client, if the voice data currently received is the Nth received voice data, splicing the first through Nth received voice data in the chronological order of voice collection to form pending voiceprint-verification voice data, where N is a positive integer greater than 1; if the duration of the pending voiceprint-verification voice data is greater than a second preset duration, culling voice data from the pending voiceprint-verification voice data according to a preset culling rule, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration; constructing a current voiceprint discrimination vector for the current voiceprint-verification voice data, determining the standard voiceprint discrimination vector corresponding to the identity identifier according to a predetermined mapping between identity identifiers and standard voiceprint discrimination vectors, calculating the distance between the current voiceprint discrimination vector and the standard voiceprint discrimination vector, and generating an identity verification result based on the calculated distance.
- The computer-readable storage medium according to claim 15, characterized in that the processing system, when executed by the processor, further implements the following step: after receiving the voice data of the first preset duration sent by the client, if only the first received voice data has been received so far, taking the voice data received this time as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The computer-readable storage medium according to claim 15, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The computer-readable storage medium according to claim 16, characterized in that the preset culling rule includes: subtracting the second preset duration from the duration of the pending voiceprint-verification voice data to obtain a culling duration; and, within the pending voiceprint-verification voice data, culling the earliest-collected voice data by the amount of the culling duration, so as to obtain, after the culling, current voiceprint-verification voice data of the second preset duration.
- The computer-readable storage medium according to claim 15, characterized in that the processing system, when executed by the processor, further implements the following step: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
- The computer-readable storage medium according to claim 16, characterized in that the processing system, when executed by the processor, further implements the following step: if the duration of the pending voiceprint-verification voice data is less than or equal to the second preset duration, taking the pending voiceprint-verification voice data as the current voiceprint-verification voice data, so as to perform identity verification based on that current voiceprint-verification voice data.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810456645.4A CN108630208B (zh) | 2018-05-14 | 2018-05-14 | Server, voiceprint-based identity verification method, and storage medium |
| CN201810456645.4 | 2018-05-14 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019218515A1 true WO2019218515A1 (zh) | 2019-11-21 |
Family
ID=63693020
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/102118 Ceased WO2019218515A1 (zh) | 2018-05-14 | 2018-08-24 | Server, voiceprint-based identity verification method, and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN108630208B (zh) |
| WO (1) | WO2019218515A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4002900A1 (en) * | 2020-11-13 | 2022-05-25 | Deutsche Telekom AG | Method and device for multi-factor authentication with voice based authentication |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110491389B (zh) * | 2019-08-19 | 2021-12-14 | 效生软件科技(上海)有限公司 | Voiceprint recognition method for a telephone traffic system |
| CN114512134B (zh) * | 2020-11-17 | 2025-10-24 | 阿里巴巴集团控股有限公司 | Method and apparatus for voiceprint information extraction, model training, and voiceprint recognition |
| CN114242076B (zh) * | 2021-12-16 | 2025-11-04 | 携程旅游信息技术(上海)有限公司 | Voiceprint recognition method and apparatus, electronic device, and storage medium |
| CN114547568A (zh) * | 2022-02-09 | 2022-05-27 | 支付宝(杭州)信息技术有限公司 | Voice-based identity verification method, apparatus, and device |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105679310A (zh) * | 2015-11-17 | 2016-06-15 | 乐视致新电子科技(天津)有限公司 | Method and system for speech recognition |
| CN105989836A (zh) * | 2015-03-06 | 2016-10-05 | 腾讯科技(深圳)有限公司 | Voice collection method, apparatus, and terminal device |
| CN106027762A (zh) * | 2016-04-29 | 2016-10-12 | 乐视控股(北京)有限公司 | Mobile phone retrieval method and apparatus |
| CN107517207A (zh) * | 2017-03-13 | 2017-12-26 | 平安科技(深圳)有限公司 | Server, identity verification method, and computer-readable storage medium |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100547655C (zh) * | 2004-09-09 | 2009-10-07 | 上海优浪信息科技股份有限公司 | Voice lock |
| CN1941080A (zh) * | 2005-09-26 | 2007-04-04 | 吴田平 | Voiceprint-recognition unlocking module for a building intercom door station and recognition unlocking method |
| US9691392B1 (en) * | 2015-12-09 | 2017-06-27 | Uniphore Software Systems | System and method for improved audio consistency |
| CN105975568B (zh) * | 2016-04-29 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Audio processing method and apparatus |
| US10045110B2 (en) * | 2016-07-06 | 2018-08-07 | Bragi GmbH | Selective sound field environment processing system and method |
- 2018
- 2018-05-14: CN application CN201810456645.4A filed, issued as patent CN108630208B (status: Active)
- 2018-08-24: PCT application PCT/CN2018/102118 filed, published as WO2019218515A1 (status: Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| CN108630208A (zh) | 2018-10-09 |
| CN108630208B (zh) | 2020-10-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107527620B | Electronic device, identity verification method, and computer-readable storage medium | |
| WO2018166187A1 | Server, identity verification method and system, and computer-readable storage medium | |
| WO2019100606A1 | Electronic device, voiceprint-based identity verification method and system, and storage medium | |
| JP7152514B2 | Voiceprint identification method, model training method, server, and computer program | |
| WO2021051572A1 | Speech recognition method and apparatus, and computer device | |
| CN108564954B | Deep neural network model, electronic device, identity verification method, and storage medium | |
| CN106683680B | Speaker recognition method and apparatus, computer device, and computer-readable medium | |
| WO2019218515A1 | Server, voiceprint-based identity verification method, and storage medium | |
| JP6429945B2 | Method and apparatus for processing voice data | |
| WO2018149077A1 | Voiceprint recognition method, apparatus, storage medium, and background server | |
| CN108564955B | Electronic device, identity verification method, and computer-readable storage medium | |
| WO2020181824A1 | Voiceprint recognition method, apparatus, device, and computer-readable storage medium | |
| WO2019136912A1 | Electronic device, identity verification method and system, and storage medium | |
| CN108650266B | Server, voiceprint verification method, and storage medium | |
| WO2019196305A1 | Electronic device, identity verification method, and storage medium | |
| WO2019136911A1 | Speech recognition method for updating voiceprint data, terminal device, and storage medium | |
| CN113223536A | Voiceprint recognition method and apparatus, and terminal device | |
| WO2021042537A1 | Speech recognition authentication method and system | |
| CN113436633B | Speaker recognition method and apparatus, computer device, and storage medium | |
| CN105224844B | Verification method, system, and apparatus | |
| CN110797033A | Artificial-intelligence-based voice recognition method and related device | |
| CN117852007A | Identity authentication method, apparatus, device, and storage medium fusing face and voiceprint | |
| WO2019179033A1 | Speaker authentication method, server, and computer-readable storage medium | |
| CN115834108A | Method and apparatus for verifying client identity | |
| CN113035230A | Authentication model training method and apparatus, and electronic device | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18918716; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18918716; Country of ref document: EP; Kind code of ref document: A1 |