
CN111199742A - Identity verification method and device and computing equipment

Identity verification method and device and computing equipment

Info

Publication number
CN111199742A
CN111199742A
Authority
CN
China
Prior art keywords
voiceprint
user
verification
information
voice
Prior art date
Legal status
Pending
Application number
CN201811382011.5A
Other languages
Chinese (zh)
Inventor
卓著
邢君
方硕
赵情恩
雷赟
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811382011.5A
Publication of CN111199742A

Classifications

    • G10L17/22 Speaker identification or verification techniques: interactive procedures; man-machine interfaces
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/60 Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
    • H04L63/0861 Network architectures or protocols for network security, for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • G10L17/02 Speaker identification or verification techniques: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04 Speaker identification or verification techniques: training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an identity verification method, an identity verification apparatus and a computing device. The method comprises the following steps: if the user in the current voice call has registered voiceprint information, verifying the user's basic information and collecting a first voice signal of the user during the basic-information verification; if the basic information passes verification, performing voiceprint verification on the first voice signal against the user's registered voiceprint information; and if the voiceprint verification passes, determining that the user's identity verification is successful.

Description

Identity verification method and device and computing equipment
Technical Field
The present invention relates to the field of voice signal processing, and in particular to an identity verification method, an identity verification apparatus, and a computing device.
Background
In scenarios where a user's identity must be verified accurately, such as recovering access to a fund account or bank account, changing a password, or reporting and releasing a loss, the user's identity can be checked over a customer-service call. If a voiceprint-verification scheme is adopted, customer service can simplify the identity-verification process, reducing handling time, improving working efficiency, and giving the user a better experience.
Voiceprint verification, also known as speaker verification, is a biometric technique that extracts voice features from the speech signals uttered by a speaker and verifies the speaker's identity accordingly. A voiceprint is the spectrum of the sound waves carrying speech information in a human voice. Like a fingerprint, it is a unique biological characteristic that can identify a person; it is both distinctive and relatively stable.
Most existing voiceprint-based identity-verification schemes implement text-independent voiceprint recognition, so the user does not need to utter a specific text. However, they usually either lack a manual authentication step, and therefore cannot guarantee that everyone who passes voiceprint verification is the genuine user, or they require the user to enroll a voiceprint in person, which is inconvenient and makes real-time verification hard to ensure. A new identity-verification scheme is therefore needed to optimize this process.
Disclosure of Invention
To this end, the present invention provides an identity verification scheme in an attempt to solve, or at least alleviate, the problems identified above.
According to an aspect of the present invention, there is provided an identity verification method comprising the following steps: first, if the user in the current voice call has registered voiceprint information, verifying the user's basic information and collecting a first voice signal of the user during the basic-information verification; if the basic information passes verification, performing voiceprint verification on the first voice signal against the user's registered voiceprint information; and if the voiceprint verification passes, determining that the user's identity verification is successful.
Optionally, the identity verification method according to the present invention further comprises: if the voiceprint verification fails, verifying additional information of the user; if the additional information passes verification, determining that the user's identity verification is successful; and if the additional-information verification fails, determining that the user's identity verification has failed.
Optionally, the identity verification method according to the present invention further comprises: if the user has not registered voiceprint information, performing personal-information verification on the user and collecting a second voice signal of the user during the personal-information verification; if the personal information passes verification, performing voice-quality detection on the second voice signal; and if the voice-quality detection passes, generating voiceprint information corresponding to the user based on the second voice signal and registering the voiceprint information.
Optionally, in the identity verification method according to the present invention, verifying the user's personal information comprises: verifying the user's personal information through a human customer-service agent.
Optionally, in the identity verification method according to the present invention, the personal information comprises the basic information and the additional information.
Optionally, the identity verification method according to the present invention further comprises: if the personal-information verification fails, determining that the user's identity verification has failed.
Optionally, in the identity verification method according to the present invention, performing voiceprint verification on the first voice signal through the user's registered voiceprint information comprises: determining a voiceprint feature of the first voice signal based on streaming computation; and performing voiceprint verification on the voiceprint feature against the user's registered voiceprint information.
Optionally, in the identity verification method according to the present invention, performing voiceprint verification on the first voice signal through the user's registered voiceprint information comprises: performing endpoint detection on the first voice signal to obtain one or more non-silent voice signals; for each non-silent voice signal, extracting its voice feature parameters and determining the voiceprint feature of the first voice signal based on those parameters; calculating the matching degree between the voiceprint feature and the user's registered voiceprint information; and if the matching degree exceeds a preset matching-degree threshold, determining that the first voice signal passes verification.
Optionally, in the identity verification method according to the present invention, the voice feature parameters include mel-frequency cepstral coefficients.
Optionally, in the identity verification method according to the present invention, extracting the voice feature parameters of a non-silent voice signal comprises: performing framing and windowing on the non-silent voice signal to generate a plurality of corresponding voice frames; calculating the discrete power spectrum of each voice frame and filtering the discrete power spectrum through a preset triangular band-pass filter bank to obtain a corresponding coefficient set; and processing the coefficient set with a discrete cosine transform to generate the mel-frequency cepstral coefficients of the voice frame.
Optionally, in the identity verification method according to the present invention, the basic information includes a name, an ID card number, a registered account and/or a bound mobile-phone number.
Optionally, in the identity verification method according to the present invention, the additional information includes gender, date of birth, native place, home address, postal code, landline telephone, bound email, password-prompt question and/or password-prompt answer.
According to still another aspect of the present invention, there is provided an identity verification apparatus comprising an information verification module, a voiceprint verification module and a determination module. The information verification module is adapted to verify the basic information of a user when the user in the current voice call has registered voiceprint information, and to collect a first voice signal of the user during the basic-information verification; the voiceprint verification module is adapted to perform voiceprint verification on the first voice signal through the user's registered voiceprint information when the basic information passes verification; and the determination module is adapted to determine that the user's identity verification is successful when the voiceprint verification passes.
According to yet another aspect of the invention, there is also provided a computing device comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the identity verification method according to the invention.
According to the identity verification scheme of the present invention, manual verification is primary and voiceprint verification is auxiliary. While the user is verified manually, the user's voiceprint information is computed and accumulated in real time in a streaming manner, so that a sufficiently long voice sample is obtained, the voiceprint-recognition result is highly accurate, and the result can be returned in real time whenever it is needed. This avoids both the risk of relying on voiceprint verification alone and the long handling time, poor user experience and low efficiency of relying entirely on manual verification.
When a user who needs identity verification has not registered voiceprint information, the voice used for the user's first registration is vetted both by manual review and by a voice-quality detection module, which ensures the quality of the first-registered voiceprint: the authenticity of the voice is checked manually, and its quality is evaluated by a model. The whole process is imperceptible to the user, who does not need to cooperate deliberately with the registration, so there is no risk of a voiceprint attack.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of an identity verification system 100 according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of an identity verification system 200 according to yet another embodiment of the invention;
FIG. 3 shows a block diagram of a computing device 300 according to one embodiment of the invention;
FIG. 4 shows a schematic diagram of an identity verification process according to one embodiment of the invention;
FIG. 5 shows a flow diagram of an identity verification method 500 according to one embodiment of the invention;
FIG. 6 shows a schematic diagram of a voiceprint verification process according to one embodiment of the invention; and
FIG. 7 shows a schematic diagram of an identity verification apparatus 700 according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of an identity verification system 100 according to an embodiment of the invention. It should be noted that the identity verification system 100 in FIG. 1 is only an example; in practice the system may contain different numbers of signal-sending terminals and servers, and a signal-sending terminal may be a mobile terminal, such as a smartphone or tablet computer, or a computing device such as a PC. The present invention is not limited in this respect.
As shown in FIG. 1, the identity verification system 100 includes a signal-sending terminal 110 and a server 120, in which an identity verification apparatus (not shown) resides. The identity verification system 100 is described below with a specific application scenario. In this scenario the signal-sending terminal 110 is a smartphone: a user dials a customer-service number and wishes to handle some business (such as changing a password or recovering an account). If voiceprint information has been registered for the user in the current voice call, the identity verification apparatus in the server 120 verifies the basic information the user gives during the call and collects the user's voice during the basic-information verification as a first voice signal. When the basic information passes verification, voiceprint verification is performed on the first voice signal through the user's registered voiceprint information; if the voiceprint verification passes, the user's identity verification is determined to be successful.
During voiceprint verification, the user's voiceprint features are determined by streaming computation, so that a first voice signal of sufficient length (generally 30 seconds) can be obtained; this improves the accuracy of voiceprint recognition while keeping the feedback of the verification result real-time. Specifically, a series of processing steps (endpoint detection, voice-feature-parameter extraction, voiceprint-feature determination and matching-degree calculation) determines the matching degree between the voiceprint feature of the user's first voice signal and the registered voiceprint information. When the matching degree exceeds a preset matching-degree threshold, the voiceprint verification is deemed passed and the user's identity verification succeeds, after which the server 120 may perform the corresponding follow-up operations; when it does not, the voiceprint verification is deemed failed, the user's identity verification fails, and the server 120 refuses to perform the business operation the user requested.
Further, in this embodiment, the user's basic information may be verified by a human customer-service agent, i.e., during the current voice call the agent communicates with the user to check the basic information. It may also be done without a human, for example by a robot agent or other means that announces the information to be verified by voice, records the user's answers, and analyzes the recording to determine whether the answers are correct. The invention does not limit which way is used to verify the basic information, nor which way is used to verify the other information mentioned above; the choice can be made according to the actual situation.
Of course, if user information is checked by human customer service during information verification, a human-agent module should be added to the identity verification system 100 and connected to the server 120 through a local area network. FIG. 2 shows a schematic diagram of an identity verification system 200 according to yet another embodiment of the invention.
As shown in FIG. 2, the identity verification system 200 includes a mobile terminal 210, an IVR (Interactive Voice Response) server 220, a human-agent module 230 and a voiceprint verification server 240, in which an identity verification apparatus (not shown) resides. The mobile terminal 210 and the IVR server 220 are connected via a communication network such as the Internet, GSM (Global System for Mobile Communications) or CDMA (Code Division Multiple Access), while the IVR server 220, the human-agent module 230 and the voiceprint verification server 240 are located in the same local area network.
In this embodiment, after the user dials in through the mobile terminal 210, the IVR server 220 may decide, according to the specifics of the current voice call, whether to route the call over the local area network to the human-agent module 230, and connects the user when it determines that handling by the human-agent module 230 is needed. At this point, if the user has registered voiceprint information, the human agent verifies the basic information the user gives during the call, and the identity verification apparatus in the voiceprint verification server 240 collects the user's voice during the basic-information verification as the first voice signal. When the basic information passes verification, voiceprint verification is performed on the first voice signal through the user's registered voiceprint information; if the voiceprint verification passes, the user's identity verification is determined to be successful.
During voiceprint verification, the user's voiceprint features are computed and accumulated in real time in a streaming manner, and the matching degree between the voiceprint feature of the user's first voice signal and the user's registered voiceprint information is determined through endpoint detection, voice-feature-parameter extraction, voiceprint-feature determination and matching-degree calculation. When the matching degree exceeds the preset matching-degree threshold, the voiceprint verification is deemed passed and the user's identity verification succeeds, after which the voiceprint verification server 240 may perform the corresponding follow-up operations; when it does not, the voiceprint verification is deemed failed, the user's identity verification fails, the voiceprint verification server 240 refuses to perform the requested business operation, and the human agent can inform the user of the identity anomaly and proceed according to the actual situation.
It should be noted that, when there are enough human agents to handle the current volume of callers, the IVR server 220 may be replaced by a communication switching unit that switches the call request from the mobile terminal 210 into the local area network and connects it to the human-agent module 230, reducing the construction cost of the system and its network-performance requirements.
According to an embodiment of the present invention, the server 120 in the identity verification system 100 and the voiceprint verification server 240 in the identity verification system 200 can each be implemented by the computing device 300 described below. FIG. 3 shows a block diagram of a computing device 300 according to one embodiment of the invention.
As shown in FIG. 3, in a basic configuration 302, a computing device 300 typically includes a system memory 306 and one or more processors 304. A memory bus 308 may be used for communication between the processor 304 and the system memory 306.
Depending on the desired configuration, the processor 304 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level-one cache 310 and a level-two cache 312, a processor core 314, and registers 316. The example processor core 314 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital-signal-processing core (DSP core), or any combination thereof. An example memory controller 318 may be used with the processor 304, or in some implementations the memory controller 318 may be an internal part of the processor 304.
Depending on the desired configuration, the system memory 306 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 306 may include an operating system 320, one or more programs 322, and data 324. In some implementations, the programs 322 may be arranged to be executed by the one or more processors 304 on the operating system using the data 324.
The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (e.g., output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via a bus/interface controller 330. The example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 352. Example peripheral interfaces 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner) via one or more I/O ports 358. An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired or direct-wired network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
Computing device 300 may be implemented as a server, such as a file server, database server, application server or web server, or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 300 may also be implemented as a personal computer, including both desktop and notebook configurations.
In some embodiments, the computing device 300 is implemented as the server 120 or the voiceprint verification server 240 and is configured to perform the identity verification method 500 according to the invention. The program 322 of the computing device 300 includes a plurality of program instructions for executing the identity verification method 500, and the data 324 may also store configuration information of the identity verification system 100 or 200, etc.
FIG. 4 shows a schematic diagram of an identity verification process according to an embodiment of the invention. As shown in FIG. 4, when a user calls in claiming to be the owner of an account, it is first determined whether the account has registered voiceprint information; if so, basic-information verification is performed on the user. If the basic information passes verification, voiceprint verification is performed on the first voice signal collected from the user during the basic-information verification, and when the voiceprint verification passes, the user's identity verification can be determined to be successful.
In this embodiment, when the voiceprint verification fails, additional-information verification is performed on the user; if it passes, the user's identity verification is determined to be successful, otherwise it is determined to have failed. If the earlier basic-information verification fails, the user's identity verification can be determined to have failed directly.
Further, if the user has not registered voiceprint information, personal-information verification must be performed on the user, where the personal information comprises the basic information and the additional information. If the personal information passes verification, voice-quality detection is performed on a second voice signal collected from the user during the personal-information verification; when the detection passes, voiceprint information corresponding to the user is generated from the second voice signal and registered, and when it fails, the registration process ends and the user's account is deemed abnormal. If the personal-information verification fails, the registration process likewise ends and the user's account is deemed abnormal.
It should be noted that the above verification of the user's basic, additional and personal information is preferably completed by a human customer-service agent, yielding an identity verification system in which manual verification is primary and voiceprint verification auxiliary. To save labor cost, it may instead be carried out by a robot agent or an answering machine that talks with the user, with the conversation analyzed by other speech-processing techniques to complete the verification; the invention is not limited in this respect.
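The decision flow of FIG. 4 can be condensed into a short control-flow sketch. This is an illustrative outline only, not the patented implementation: the boolean arguments stand in for the outcomes of the manual, voiceprint and quality checks described above.

```python
def authenticate(has_voiceprint: bool, basic_ok: bool, voiceprint_ok: bool,
                 additional_ok: bool, voice_quality_ok: bool = True) -> str:
    """Outcome-level mirror of the FIG. 4 decision flow (no I/O)."""
    if has_voiceprint:
        if not basic_ok:
            return "identity verification failed"
        if voiceprint_ok:
            return "identity verification succeeded"
        # Voiceprint rejected: fall back to the additional-information check.
        return ("identity verification succeeded" if additional_ok
                else "identity verification failed")
    # Unregistered user: personal info = basic + additional information.
    if not (basic_ok and additional_ok):
        return "account abnormal, registration aborted"
    if voice_quality_ok:
        return "voiceprint registered"
    return "account abnormal, registration aborted"

# A registered user passes basic-information verification, fails the
# voiceprint check, then passes the additional-information check:
print(authenticate(True, True, False, True))  # identity verification succeeded
```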
FIG. 5 shows a flow diagram of an identity verification method 500 according to an embodiment of the invention. As shown in FIG. 5, the method 500 begins at step S510, in which, if the user in the current voice call has registered voiceprint information, the user's basic information is verified and a first voice signal of the user during the basic-information verification is collected. According to one embodiment of the invention, the basic-information verification may be performed by a human customer-service agent. In this embodiment, the basic information includes a name, an ID card number, a registered account and/or a bound mobile-phone number, and the user is determined to pass the basic-information verification only if every item of the basic information matches the real basic information associated with the registered account.
For example, the user currently in the voice call is user1, who has previously registered voiceprint information and provides the following basic information:
Name: Jinyang
ID card number: 110102197810272321
Registered account: jinyang_1027
The real basic information associated with the registered account jinyang_1027 is:
Name: Jinyang
ID card number: 110102197810272321
Registered account: jinyang_1027
Comparing each item of the provided basic information with the real basic information shows that they are identical, so user1 is determined to pass the basic-information verification. Meanwhile, the first voice signal of user1 during the basic-information verification is collected for further processing.
If the basic information passes verification, the process proceeds to step S520, in which voiceprint verification is performed on the first voice signal through the user's registered voiceprint information; if the basic-information verification fails, the user's identity verification is determined to have failed. According to an embodiment of the invention, the voiceprint feature of the first voice signal may be determined by streaming computation, and voiceprint verification is then performed on that feature against the user's registered voiceprint information. In this embodiment a streaming-computation framework computes the voiceprint feature of the first voice signal, so the user's voiceprint features are computed and accumulated in real time while the basic information is being verified. Because the first voice signal is collected during the basic-information verification, the streaming approach ensures both that a first voice signal of sufficient length (generally 30 seconds) is obtained and that the voiceprint verification result can be returned in real time when required. The specific technique used for voiceprint verification is described below with reference to FIG. 6.
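Before turning to FIG. 6, the streaming accumulation just described can be sketched minimally: buffer call audio chunk by chunk while the basic information is being checked, and report readiness once roughly 30 seconds of speech (the figure given above) has accumulated. Class and parameter names below are illustrative assumptions, not part of the patent.

```python
import numpy as np

class StreamingVoiceCollector:
    """Accumulate call-audio chunks during basic-information verification;
    report readiness once enough speech for reliable voiceprint scoring
    (about 30 s per the description) has been gathered."""

    def __init__(self, sample_rate: int = 8000, target_seconds: float = 30.0):
        self.sample_rate = sample_rate
        self.target_samples = int(sample_rate * target_seconds)
        self.chunks = []
        self.total = 0

    def push(self, chunk: np.ndarray) -> bool:
        """Add one audio chunk; True once enough audio is buffered."""
        self.chunks.append(chunk)
        self.total += len(chunk)
        return self.total >= self.target_samples

    def audio(self) -> np.ndarray:
        return np.concatenate(self.chunks) if self.chunks else np.empty(0)

# Feed 1-second chunks as the call proceeds; voiceprint verification can
# start as soon as push() returns True.
collector = StreamingVoiceCollector()
ready = False
for _ in range(31):
    ready = collector.push(np.zeros(8000))
print(ready, collector.audio().shape)  # True (248000,)
```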
FIG. 6 shows a schematic diagram of a voiceprint verification process according to one embodiment of the invention. As shown in FIG. 6, the first voice signal first passes through endpoint detection to obtain one or more non-silent voice signals; voice feature parameters are then extracted from each non-silent voice signal, the voiceprint feature of the first voice signal is determined from those parameters, and the matching degree between the voiceprint feature and the user's registered voiceprint information is calculated. If the matching degree exceeds a preset matching-degree threshold, the first voice signal is determined to pass verification. The processing of step S520 is described in detail below with reference to FIG. 6.
According to one embodiment of the invention, voiceprint verification of the first voice signal may proceed as follows. First, endpoint detection is performed on the first voice signal to obtain one or more non-silent voice signals. Endpoint detection, also called voice activity detection (VAD), locates the start and end points of speech within a signal that contains it, and then extracts the corresponding non-silent voice signal. Effective endpoint detection not only minimizes processing time but also eliminates the noise interference of silent segments (i.e., silent voice signals), ensuring processing quality. However, noise such as breathing before and after a silent segment or an utterance blurs the endpoints, and detection becomes difficult when an endpoint is a weak fricative, a weak plosive, or a nasal. To mitigate these problems, endpoint detection generally combines short-time energy and the short-time zero-crossing rate with other methods.
Because of the physics of human speech production, a raw speech signal is time-varying. To apply processing techniques that assume stationarity, the signal is first divided into frames with windowing, splitting the whole signal into speech frames of fixed duration (usually 10-30 ms); within each frame the signal can be considered approximately stationary.
To preserve continuity between frames, framing usually leaves a certain overlap between consecutive frames. The length of the window function is called the frame length, and the offset between two consecutive frames is called the frame shift. Let x(n) denote the voice signal to be verified; the signal after framing and windowing is:
x_seq(n) = w(n) · x(seq·Q + n)    (1)
where n = 0, 1, …, N−1, seq denotes the frame index, w(n) denotes the window function, Q denotes the frame shift, and N denotes the frame length. The ratio of frame shift to frame length is the frame-overlap ratio and usually takes the value 1/2 or 3/4, i.e. Q = N/2 or Q = 3N/4.
Commonly used window functions include the rectangular, Hamming and Hanning windows. Windowing always affects the original signal, and the side lobes of the rectangular window are so large that they complicate frequency-domain processing, so window functions with low side-lobe components, such as the Hamming or Hanning window, are generally adopted to reduce the side-lobe effect. In the following description, the Hamming window is used as the predetermined window function unless otherwise specified.
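A minimal numpy sketch of equation (1): slice the signal into overlapping frames and apply a Hamming window. The frame length and shift values are illustrative.

```python
import numpy as np

def frame_and_window(x: np.ndarray, frame_len: int, frame_shift: int):
    """Split x into overlapping frames and apply a Hamming window, i.e.
    x_seq(n) = w(n) * x(seq*Q + n) from equation (1)."""
    w = np.hamming(frame_len)                           # window function w(n)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[s * frame_shift : s * frame_shift + frame_len]
                       for s in range(n_frames)])
    return frames * w                                   # one frame per row

# 25 ms frames with 12.5 ms shift at 16 kHz (Q = N/2, overlap ratio 1/2)
signal = np.random.randn(16000)                         # 1 s of dummy audio
print(frame_and_window(signal, frame_len=400, frame_shift=200).shape)  # (79, 400)
```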
The short-time energy of the seq-th speech frame is defined as:
E_seq = Σ_{n=0}^{N−1} x_seq(n)²    (2)
Analysis of short-time energy shows that voiced speech, unvoiced speech and silence have successively lower short-time energy. On this basis, thresholds on the short-time energy can be set to segment the silent and non-silent sections of a voice signal, thereby detecting one or more non-silent voice signals.
The short-time zero-crossing rate is the number of sign changes in each speech frame; after the signal is normalized between its maximum and minimum values, the peaks and troughs of the waveform translate into crossings of the horizontal axis. The short-time zero-crossing rate of the seq-th speech frame is defined as:
Z_seq = (1/2) Σ_{n=1}^{N−1} |sgn(x_seq(n)) − sgn(x_seq(n−1))|    (3)
where sgn is the sign function:
sgn(x) = 1 if x ≥ 0, and −1 if x < 0    (4)
The short-time zero-crossing rate can segment speech automatically by exploiting the frequency difference between speech and noise (including silence): unvoiced speech has higher frequency content, and higher frequency means a higher zero-crossing rate. This contrasts sharply with noise signals and silent-segment signals, whose zero-crossing rates are low, so the short-time zero-crossing rate can distinguish unvoiced speech from noise and silence where short-time energy alone struggles. A dual-threshold endpoint-detection algorithm combining short-time energy and the short-time zero-crossing rate therefore achieves higher accuracy. The dual-threshold method based on short-time energy and short-time zero-crossing rate is mature prior art and is not described further here.
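A minimal numpy sketch of the dual-threshold idea just described: compute the short-time energy of equation (2) and the short-time zero-crossing rate of equation (3) per frame, then flag frames that clear either threshold. The thresholds and the single-pass decision are illustrative simplifications; a full implementation also applies hysteresis across adjacent frames.

```python
import numpy as np

def short_time_features(frames: np.ndarray):
    """Per-frame short-time energy (eq. 2) and zero-crossing rate (eq. 3);
    frames has shape (n_frames, frame_len)."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # sgn(x) = 1 for x >= 0 (eq. 4)
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    return energy, zcr

def dual_threshold_vad(frames: np.ndarray, e_thresh: float, z_thresh: float):
    """Toy dual-threshold decision: a frame is non-silent if its energy
    clears the energy threshold or its zero-crossing rate clears the ZCR
    threshold (the latter catches low-energy unvoiced sounds)."""
    energy, zcr = short_time_features(frames)
    return (energy > e_thresh) | (zcr > z_thresh)

# Demo: 50 silent frames followed by 50 noisy frames
frames = np.random.randn(100, 400) * np.r_[np.zeros(50), np.ones(50)][:, None]
print(dual_threshold_vad(frames, e_thresh=1.0, z_thresh=150.0).sum())  # 50
```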
Of course, the specific endpoint-detection algorithm applied to the voice signal may be adjusted to suit the application scenario, system configuration and performance requirements; such adjustments are readily conceivable to those skilled in the art who understand the present scheme, fall within the scope of the invention, and are not described further here.
According to an embodiment of the invention, endpoint detection of the first voice signal yields three non-silent voice signals, denoted non-silent voice signals A, B and C. Next, voice feature parameters are extracted from each acquired non-silent voice signal. In this embodiment the voice feature parameters include mel-frequency cepstral coefficients, which can be extracted as follows.
First, the non-silent voice signal is framed and windowed to generate a number of corresponding speech frames; the framing and windowing procedure is as described for endpoint detection above. For example, the non-silent voice signal A yields 200 speech frames. The discrete power spectrum of each speech frame is then computed and filtered through a preset triangular band-pass filter bank (a mel filter bank) to obtain a corresponding coefficient set: denote the filter outputs X(k), k = 1, 2, …, K, where K is the number of filters; these K coefficients form the coefficient set.
Finally, the coefficient set is processed with a discrete cosine transform to generate the mel-frequency cepstral coefficients of the speech frame. Denoting the r-th mel-frequency cepstral coefficient by MFCC_r:
MFCC_r = Σ_{k=1}^{K} X(k) · cos(πr(k − 0.5)/K)    (5)
where the first r mel-frequency cepstral coefficients are taken as the result, with r usually no greater than K.
After the static mel-frequency cepstral parameters are extracted by equation (5), they may be differenced to obtain dynamic parameters ΔMFCC_r from the static parameters MFCC_r, which further enhances noise suppression. Commonly used mel-frequency cepstral features are 12-dimensional, 13-dimensional or 39-dimensional (13 mel-frequency cepstral coefficients plus their first- and second-order differences). The voice feature parameters are not limited to mel-frequency cepstral coefficients; they may also include linear-prediction cepstral coefficients, FBank features, and so on. The FBank features are in fact the coefficient set obtained in the above procedure; in other words, the mel-frequency cepstral coefficients are obtained by applying the discrete cosine transform to the FBank features. Which parameter, or combination of parameters, is used as the voice feature can be chosen according to the actual situation, and the invention is not limited thereto.
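In practice these steps are available off the shelf; the sketch below uses librosa (an assumption about the environment, not part of the patent) to produce the 39-dimensional feature just mentioned.

```python
import numpy as np
import librosa  # assumed available; it implements the steps described above

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # 1 s of dummy audio standing in
                                             # for a real non-silent segment

# 13 static MFCCs per frame: power spectrum -> mel filter bank -> log -> DCT
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First- and second-order differences are the dynamic parameters, giving
# the 39-dimensional feature mentioned in the text.
delta1 = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.concatenate([mfcc, delta1, delta2], axis=0)
print(features.shape)                        # (39, n_frames)
```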
After the voice feature parameters of the non-silent voice signals have been extracted, the voiceprint feature of the first voice signal is determined from them. According to one embodiment of the invention, candidate voiceprint features include the I-Vector, D-Vector and X-Vector. Note that when the I-Vector is used as the voiceprint feature, the voice feature parameters used in its computation are mel-frequency cepstral coefficients, whereas the D-Vector and X-Vector are computed from FBank features.
The I-Vector is an improvement on JFA (Joint Factor Analysis). The idea of JFA is to model speaker differences and channel differences separately in subspaces of the GMM (Gaussian Mixture Model) supervector space, so that channel interference can be removed conveniently. The JFA model is built mainly on the speaker space defined by the eigenvoice matrix V and the channel space defined by the eigenchannel matrix U. The I-Vector model instead uses a single global difference (total variability) space T that contains both speaker differences and channel differences, so I-Vector modeling does not strictly separate speaker effects from channel effects in the GMM mean supervector.
Given a speech segment h of speaker s, the speaker- and channel-dependent GMM mean supervector is defined as:
M_{s,h} = m_u + T·ω_{s,h}    (6)
where m_u is the speaker- and channel-independent mean supervector, i.e., the mean supervector of the UBM (Universal Background Model, which can be understood as a large GMM), T is the global difference space matrix, and ω_{s,h} is the global difference factor; M_{s,h} follows a normal distribution with mean m_u and covariance matrix TTᵀ.
Equation (6) says that, for a specific speaker s and a specific speech segment h, M_{s,h} is determined by the UBM mean plus the product of the global difference space matrix and the global difference factor. In I-Vector verification, the global difference space matrix T is estimated first, and the I-Vector is then estimated.
For the estimation of the global difference space matrix T, each speech segment is regarded as coming from a different speaker, and the T matrix can be estimated as follows:
1. Compute the Baum-Welch statistics for each speaker;
2. Randomly initialize T, then iteratively estimate it with the EM (Expectation-Maximization) algorithm:
E-step: compute the expectation of ω_{s,h} and of the posterior correlation matrix.
M-step: re-estimate T by maximum likelihood.
After several iterations, the global difference space matrix T is obtained.
Finally, using the trained global difference space matrix T and the Baum-Welch statistics of each target speaker, the posterior mean of ω_{s,h} is computed; that posterior mean is the I-Vector. Each target speaker thus has an associated I-Vector.
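Once T is trained, the I-Vector has a closed-form posterior mean. A compact numpy sketch under the standard formulation follows; the dimensions and statistics here are toy values, and in a real system the Baum-Welch statistics would come from aligning the utterance to the UBM.

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N_c, F_centered):
    """Posterior mean of the total-variability factor omega:
        omega = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F~
    T          : (C*D, R) global difference space matrix
    Sigma_inv  : (C*D,)   inverse diagonal UBM covariances, flattened
    N_c        : (C,)     zeroth-order Baum-Welch statistics per mixture
    F_centered : (C*D,)   centered first-order statistics, flattened
    """
    C, R = len(N_c), T.shape[1]
    D = T.shape[0] // C
    N = np.repeat(N_c, D)                   # expand occupancies over dims
    TtS = T.T * Sigma_inv                   # T' Sigma^-1
    precision = np.eye(R) + TtS @ (N[:, None] * T)
    return np.linalg.solve(precision, TtS @ F_centered)

# Toy sizes: C=8 mixtures, D=4 feature dims, R=5 total-variability factors
rng = np.random.default_rng(0)
iv = extract_ivector(rng.standard_normal((32, 5)), np.ones(32),
                     10 * rng.random(8), rng.standard_normal(32))
print(iv.shape)  # (5,)
```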
The D-Vector is the vector formed in DNN (Deep Neural Network)-based speaker recognition, where a DNN replaces the GMM for computing posterior statistics. After the DNN is trained, the voice feature parameters of each speech frame are fed to the DNN, the activations of the hidden layer closest to the output are extracted and L2-normalized, and the normalized activations are accumulated; the resulting vector is called the D-Vector. If a speaker has several utterances, the mean of the D-Vectors of all utterances is that speaker's voiceprint feature.
Moreover, because the D-Vector is extracted from the hidden layer closest to the output, the classification layer can be removed to shrink the model; more speaker data can then be used during training without enlarging the model, since the number of nodes in the removed classification layer no longer matters.
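As a sketch of the accumulation step just described (the network itself is omitted; array shapes are illustrative):

```python
import numpy as np

def d_vector(last_hidden: np.ndarray) -> np.ndarray:
    """last_hidden: (n_frames, dim) activations of the hidden layer closest
    to the output. L2-normalize each frame, then accumulate into one vector."""
    norms = np.linalg.norm(last_hidden, axis=1, keepdims=True)
    return (last_hidden / np.maximum(norms, 1e-12)).sum(axis=0)

# A speaker with several utterances: the voiceprint is the mean D-Vector.
utterances = [np.random.randn(200, 256) for _ in range(3)]
voiceprint = np.mean([d_vector(u) for u in utterances], axis=0)
print(voiceprint.shape)  # (256,)
```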
The X-Vector denotes the embedding-layer features extracted from a TDNN (Time-Delay Neural Network). The TDNN network structure contains a statistics pooling layer responsible for mapping frame-level layers to segment-level layers by computing the mean and standard deviation of the frame-level outputs. Because the TDNN is a time-delay structure whose output can learn long-term characteristics, the X-Vector can capture a user's voiceprint information from short speech of about 10 seconds and is more robust on short utterances.
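The statistics pooling step itself is simple enough to sketch directly (the TDNN layers before and after it are omitted; shapes are illustrative):

```python
import numpy as np

def statistics_pooling(frame_feats: np.ndarray) -> np.ndarray:
    """Map frame-level features (n_frames, dim) to one segment-level vector
    by concatenating the per-dimension mean and standard deviation, as the
    statistics pooling layer of an X-Vector TDNN does."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

frames = np.random.randn(300, 512)        # e.g. 3 s of frame-level TDNN outputs
print(statistics_pooling(frames).shape)   # (1024,)
```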
After the voiceprint feature of the first voice signal is determined, the matching degree between it and the user's registered voiceprint information is calculated. In this embodiment the I-Vector is used as the voiceprint feature, and the PLDA (Probabilistic Linear Discriminant Analysis) algorithm is used to calculate the matching degree between the voiceprint feature and the user's registered voiceprint information.
PLDA is a generative model that can model and classify I-Vectors. It is a channel-compensation algorithm: the I-Vector contains both speaker information and channel information, and since only the speaker information is of interest, channel compensation is required. In voiceprint-recognition training, the training speech is assumed to come from I speakers, each with J different speech segments, and the j-th segment of the i-th speaker is denoted Y_ij. The generative model of Y_ij is then defined as:
Y_ij = μ + F·h_i + G·w_ij + ε_ij    (7)
where μ is the data mean and F and G are spatial feature matrices containing the basis factors of their respective assumed variable spaces, which can be viewed as bases of those spaces: each column of F is a basis vector of the between-class (speaker) feature space, and each column of G is a basis vector of the within-class feature space. The vectors h_i and w_ij are the representations of the speech in the respective spaces, and ε_ij is the residual noise term. The more similar the h_i features of two utterances, i.e., the higher their matching degree, the more likely the utterances come from the same speaker.
The PLDA model has four parameters, μ, F, G and ε_ij, which are trained iteratively with the EM algorithm. In practice a simplified PLDA model is often used that omits training of the within-class space matrix G and trains only the between-class space matrix F, i.e.:
Y_ij = μ + F·h_i + ε_ij    (8)
According to an embodiment of the invention, after the matching degree between the I-Vector of the first voice signal and the voiceprint information registered by user1 is computed with the PLDA algorithm, the resulting matching degree is score1. In this embodiment the preset matching-degree threshold score2 is smaller than score1, so the first voice signal is determined to pass verification.
Note that the voiceprint features are not limited to the I-Vector, D-Vector and X-Vector above, and a matching-degree calculation method appropriate to the chosen feature may be adopted; the invention is not limited in this respect. The concrete implementations of the I-Vector, D-Vector, X-Vector and PLDA algorithms are mature prior art and are not described further here.
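To make the thresholding step concrete, the sketch below scores two embeddings and compares the score against a preset threshold. Plain cosine similarity is used here as a simplified stand-in for PLDA scoring (which would need the trained μ and F); the threshold value is illustrative.

```python
import numpy as np

def match_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity as a simplified stand-in for a PLDA score."""
    return float(enrolled @ test /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def voiceprint_passes(enrolled, test, threshold: float = 0.7) -> bool:
    """Verified iff score1 (computed) exceeds score2 (preset threshold)."""
    return match_score(enrolled, test) > threshold

rng = np.random.default_rng(1)
enrolled = rng.standard_normal(400)               # registered voiceprint
test = enrolled + 0.1 * rng.standard_normal(400)  # same speaker, new call
print(voiceprint_passes(enrolled, test))          # True
```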
Finally, in step S530, if the voiceprint verification passes, the user's identity verification is determined to be successful. According to an embodiment of the invention, user1 passed the voiceprint verification in step S520, so the identity verification of user1 is determined to be successful.
In addition, since voiceprint verification is not 100% accurate and errors are hard to avoid entirely, according to another embodiment of the invention, if the voiceprint verification fails, the user's additional information is verified; if the additional information passes verification, the user's identity verification is determined to be successful, otherwise it is determined to have failed. The additional-information verification is preferably performed by a human customer-service agent.
In this embodiment, the additional information includes gender, date of birth, native place, home address, postal code, landline telephone, bound email, password-prompt question and/or password-prompt answer. The user is determined to pass the additional-information verification only if every item of the additional information matches the real additional information associated with the registered account.
For example, if user1 fails the voiceprint verification in step S520, the additional information provided by user1 must be verified. The provided additional information is:
Gender: female
Date of birth: October 27, 1978
Native place: Beijing
Home address: Room 502, Unit 3, Building 11, Yard 999, Fuxingmenwai, Xicheng District, Beijing
Postal code: 100000
Password-prompt question: favorite animal
Password-prompt answer: cat
The real additional information associated with the registered account jinyang_1027 of user1 is:
Gender: female
Date of birth: October 27, 1978
Native place: Beijing
Home address: Room 502, Unit 3, Building 11, Yard 999, Fuxingmenwai, Xicheng District, Beijing
Postal code: 100000
Password-prompt question: favorite animal
Password-prompt answer: dog
Comparing each item of the additional information with the real additional information reveals a mismatch in the password-prompt answer, so user1 is determined to fail the additional-information verification, and the user's identity verification is determined to have failed.
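The item-by-item comparison behind this judgment is trivial to state in code; a hedged sketch (field names are illustrative):

```python
def info_matches(provided: dict, real: dict) -> bool:
    """Pass only if every provided item equals the stored real value."""
    return all(real.get(field) == value for field, value in provided.items())

# The user1 example above: everything matches except the password-prompt answer.
provided = {"postal code": "100000", "password-prompt answer": "cat"}
real = {"postal code": "100000", "password-prompt answer": "dog"}
print(info_matches(provided, real))  # False -> additional-info check fails
```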
Further, for a user who has not registered voiceprint information, after the user's personal information is verified, voiceprint information can be extracted from higher-quality speech of the user to complete registration. According to another embodiment of the invention, if the user has not registered voiceprint information, personal-information verification is performed on the user and a second voice signal of the user during the personal-information verification is collected, preferably by a human customer-service agent. The personal information comprises the basic information and the additional information.
In this embodiment, user1 has not registered voiceprint information, so double verification of the basic information and the additional information is required to ensure the authenticity and reliability of the information as far as possible. For the specific way of verifying the personal information, reference may be made to the above description of verifying the basic information and the additional information, which is not repeated here.
If the personal information verification passes, voice quality detection is performed on the second voice signal; if the voice quality detection passes, voiceprint information corresponding to the user is generated based on the second voice signal and registered. If the personal information verification fails, the user identity verification is determined to have failed. For the specific way of voice quality detection, reference may be made to related voice-quality testing and inspection schemes in mature prior art; the present invention is not limited in this respect and does not repeat them here.
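By way of illustration, a minimal sketch of this enrollment gate is given below. The disclosure does not specify the quality metric, so the duration and loudness checks are placeholder heuristics, and extract_voiceprint stands in for whatever embedding extractor a deployment uses.

    import numpy as np

    MIN_SECONDS = 5.0   # assumed minimum amount of speech for a usable enrollment
    MIN_RMS = 0.01      # assumed loudness floor; real systems model quality more richly

    def passes_quality_check(signal: np.ndarray, sample_rate: int) -> bool:
        """Placeholder voice-quality gate: enough speech, and not near-silent."""
        duration = len(signal) / sample_rate
        rms = float(np.sqrt(np.mean(signal ** 2))) if len(signal) else 0.0
        return duration >= MIN_SECONDS and rms >= MIN_RMS

    def enroll_if_eligible(signal, sample_rate, user_id, extract_voiceprint, voiceprint_db):
        """Register the voiceprint only if the second voice signal passes quality detection."""
        if not passes_quality_check(signal, sample_rate):
            return False                                  # no registration this call
        voiceprint_db[user_id] = extract_voiceprint(signal)
        return True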
Fig. 7 shows a schematic view of an identity verification apparatus 700 according to an embodiment of the invention. As shown in Fig. 7, the identity verification apparatus 700 includes an information verification module 710, a voiceprint verification module 720, and a determination module 730.
The information verification module 710 is adapted to verify basic information of a user when the user in a current voice call has registered voiceprint information, and collect a first voice signal of the user in a basic information verification process. The basic information comprises a name, an identity card number, a registered account number and/or a bound mobile phone number.
The information verification module 710 is further adapted to verify the additional information of the user when the voiceprint verification fails. The additional information includes gender, date of birth, native place, home address, postal code, landline telephone, bound email, password prompt question and/or password prompt answer.
The information verification module 710 is further adapted to perform personal information verification on the user when the user has not registered voiceprint information, and to collect a second voice signal of the user during the personal information verification. According to an embodiment of the present invention, the information verification module 710 is further adapted to perform the personal information verification through human customer service. In this embodiment, the personal information includes the basic information and the additional information.
The voiceprint verification module 720 is adapted to perform voiceprint verification on the first voice signal through the voiceprint information registered by the user when the basic information verification passes. According to an embodiment of the present invention, the voiceprint verification module 720 is further adapted to determine a voiceprint feature of the first voice signal based on streaming computation, and to perform voiceprint verification on that voiceprint feature through the user's registered voiceprint information.
According to an embodiment of the present invention, the voiceprint verification module 720 is further adapted to perform endpoint detection on the first voice signal to obtain one or more non-silent voice signals, extract the voice feature parameters of each non-silent voice signal, determine the voiceprint feature of the first voice signal based on the voice feature parameters, calculate the matching degree between the voiceprint feature and the user's registered voiceprint information, and determine that the first voice signal passes verification if the matching degree exceeds a preset matching-degree threshold.
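A minimal sketch of one way to implement the endpoint detection step is shown below: a short-time energy gate that keeps segments louder than a fraction of the peak frame energy. The disclosure does not fix the detection algorithm, so the frame sizes and energy ratio here are assumptions.

    import numpy as np

    def endpoint_detect(signal: np.ndarray, sample_rate: int,
                        frame_ms: int = 25, hop_ms: int = 10,
                        energy_ratio: float = 0.1) -> list:
        """Split a voice signal into non-silent segments with a short-time energy gate."""
        frame = int(sample_rate * frame_ms / 1000)
        hop = int(sample_rate * hop_ms / 1000)
        if len(signal) <= frame:
            return []
        energies = np.array([np.sum(signal[i:i + frame] ** 2)
                             for i in range(0, len(signal) - frame, hop)])
        threshold = energy_ratio * energies.max()    # assumed relative threshold
        segments, start = [], None
        for idx, voiced in enumerate(energies > threshold):
            if voiced and start is None:
                start = idx * hop                    # segment begins
            elif not voiced and start is not None:
                segments.append(signal[start: idx * hop + frame])
                start = None
        if start is not None:                        # trailing voiced region
            segments.append(signal[start:])
        return segments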
In this embodiment, the voice feature parameters include Mel-frequency cepstral coefficients (MFCCs), and the voiceprint verification module 720 is further adapted to frame and window the non-silent voice signal to generate a plurality of corresponding voice frames, calculate the discrete power spectrum of each voice frame, filter the discrete power spectrum through a preset triangular band-pass filter bank to obtain a corresponding coefficient set, and process the coefficient set with a discrete cosine transform to generate the Mel-frequency cepstral coefficients of the voice frames.
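For concreteness, the following is a compact sketch of the MFCC pipeline just described: framing and windowing, per-frame power spectrum, triangular mel filterbank, then DCT. The frame length, FFT size, and filter count are conventional assumptions rather than values from the disclosure.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(segment: np.ndarray, sample_rate: int, frame_len: int = 400,
             hop: int = 160, nfft: int = 512, n_mels: int = 26,
             n_ceps: int = 13) -> np.ndarray:
        """Mel-frequency cepstral coefficients for one non-silent voice segment."""
        # 1. Framing and Hamming windowing.
        frames = np.array([segment[i:i + frame_len] * np.hamming(frame_len)
                           for i in range(0, len(segment) - frame_len, hop)])
        # 2. Discrete power spectrum of each voice frame.
        power = np.abs(np.fft.rfft(frames, n=nfft)) ** 2 / nfft
        # 3. Triangular band-pass (mel) filterbank.
        mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
        hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
        bins = np.floor((nfft + 1) * hz_pts / sample_rate).astype(int)
        fbank = np.zeros((n_mels, nfft // 2 + 1))
        for m in range(1, n_mels + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        log_energies = np.log(power @ fbank.T + 1e-10)
        # 4. DCT decorrelates the filterbank energies; keep the first n_ceps.
        return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_ceps]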
The voiceprint verification module 720 is further adapted to perform voice quality detection on the second voice signal when the personal information verification passes, and to generate voiceprint information corresponding to the user based on the second voice signal and register it when the voice quality detection passes.
The determination module 730 is adapted to determine that the user identity verification is successful when the voiceprint verification passes. It is further adapted to determine that the user identity verification fails when the basic information verification fails; to determine that the user identity verification is successful when the additional information verification passes, and that it fails when the additional information verification fails; and to determine that the user identity verification fails when the personal information verification fails.
The specific steps and embodiments of the identity verification are disclosed in detail in the description of Figs. 4 to 6 and are not repeated here.
Existing identity verification methods usually lack a manual verification link and cannot guarantee that every voiceprint that passes belongs to the user, or they require the user to collect and register a voiceprint on site, which is inconvenient; moreover, they are not built on a streaming-computation framework, so the real-time performance of verification is hard to guarantee. In the identity verification scheme provided by the embodiments of the present invention, manual verification is primary and voiceprint verification is auxiliary: while the human agent performs the check, the user's voiceprint information is computed and accumulated in a real-time streaming manner. Sufficiently long user speech can therefore be obtained, the voiceprint recognition result has high accuracy, and the result can be returned in real time whenever it is requested. This avoids both the risk of relying on voiceprint verification alone and the long handling time, poor user experience and low efficiency of relying entirely on manual verification.
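One way the real-time streaming accumulation described here could be organized is sketched below, under the assumption of a pluggable embedding extractor and scorer (neither is fixed by the disclosure): audio chunks are buffered while the human agent works, and a decision can be returned whenever it is requested.

    import numpy as np

    class StreamingVoiceprintVerifier:
        """Accumulate call audio chunk by chunk; score on demand against enrollment."""

        def __init__(self, extract_embedding, score, enrolled_voiceprint, threshold=0.7):
            self.extract_embedding = extract_embedding  # e.g. an X-Vector extractor
            self.score = score                          # e.g. cosine or PLDA scoring
            self.enrolled = enrolled_voiceprint
            self.threshold = threshold                  # assumed decision threshold
            self.chunks = []

        def feed(self, chunk: np.ndarray) -> None:
            """Called for each audio chunk as the voice call proceeds."""
            self.chunks.append(chunk)

        def verify_now(self) -> bool:
            """Score everything accumulated so far; more speech, more reliable."""
            if not self.chunks:
                return False
            audio = np.concatenate(self.chunks)
            return self.score(self.extract_embedding(audio), self.enrolled) > self.threshold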
When a user who needs identity verification has not registered voiceprint information, the speech used for the user's first registration is obtained through manual review together with the voice quality detection module, which ensures the quality of the user's first registered voiceprint information: the authenticity of the speech is checked manually while its quality is evaluated by a model. The whole process is imperceptible to the user, who does not need to cooperate deliberately with registration, so there is no risk of a voiceprint attack.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. Modules or units or groups in embodiments may be combined into one module or unit or group and may furthermore be divided into sub-modules or sub-units or sub-groups. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the authentication method of the present invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (14)

1. An identity verification method comprising:
if the user in the current voice call has registered voiceprint information, verifying the basic information of the user, and collecting a first voice signal of the user in the verification process of the basic information;
if the basic information passes the verification, the voiceprint verification is carried out on the first voice signal through the voiceprint information registered by the user;
and if the voiceprint verification is passed, judging that the user identity verification is successful.
2. The method of claim 1, further comprising:
if the voiceprint verification is not passed, verifying the additional information of the user;
if the additional information passes the verification, the user identity verification is judged to be successful;
and if the additional information verification fails, judging that the user identity verification has failed.
3. The method of claim 1, further comprising:
if the user does not register the voiceprint information, carrying out personal information verification on the user and collecting a second voice signal of the user in the process of personal information verification;
if the personal information passes the verification, carrying out voice quality detection on the second voice signal;
and if the voice quality detection is passed, generating voiceprint information corresponding to the user based on the second voice signal, and registering the voiceprint information.
4. The method of claim 3, wherein the verifying the personal information of the user comprises:
performing personal information verification on the user through human customer service.
5. The method of claim 3 or 4, wherein the personal information includes basic information and additional information.
6. The method of claim 3, further comprising:
and if the personal information verification fails, judging that the user identity verification fails.
7. The method of claim 1, wherein performing voiceprint verification on the first voice signal through the voiceprint information registered by the user comprises:
determining a voiceprint feature of the first voice signal based on a streaming computing manner;
and carrying out voiceprint verification on the voiceprint characteristics through the voiceprint information registered by the user.
8. The method of claim 1 or 7, wherein performing voiceprint verification on the first voice signal through the voiceprint information registered by the user comprises:
performing endpoint detection on the first voice signal to obtain one or more non-silent voice signals;
extracting, for each non-silent voice signal, the voice feature parameters of the non-silent voice signal, and determining the voiceprint feature of the first voice signal based on the voice feature parameters;
calculating the matching degree of the voiceprint features and the voiceprint information registered by the user;
and if the matching degree exceeds a preset matching degree threshold value, judging that the first voice signal passes the verification.
9. The method of claim 8, wherein the voice feature parameters comprise Mel-frequency cepstral coefficients.
10. The method of claim 9, wherein the extracting of the voice feature parameters of the non-silent voice signal comprises:
performing framing and windowing processing on the non-silent voice signal to generate a plurality of corresponding voice frames;
calculating a discrete power spectrum of each voice frame, and filtering the discrete power spectrum through a preset triangular band-pass filter to obtain a corresponding coefficient set;
and processing the coefficient set by using discrete cosine transform to generate Mel frequency cepstrum coefficients of the voice frame.
11. The method of claim 1, wherein the basic information comprises a name, an identification number, a registered account number, and/or a bound mobile phone number.
12. The method of claim 2, wherein the additional information comprises gender, date of birth, native place, home address, postal code, landline telephone, bound email, password prompt question and/or password prompt answer.
13. An authentication apparatus comprising:
the information verification module is suitable for verifying the basic information of a user when the user in the current voice call registers voiceprint information and collecting a first voice signal of the user in the verification process of the basic information;
the voiceprint verification module is suitable for carrying out voiceprint verification on the first voice signal through the voiceprint information registered by the user when the basic information verification is passed;
and the judging module is suitable for judging that the user identity authentication is successful when the voiceprint authentication passes.
14. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-12.
CN201811382011.5A 2018-11-20 2018-11-20 Identity verification method and device and computing equipment Pending CN111199742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811382011.5A CN111199742A (en) 2018-11-20 2018-11-20 Identity verification method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN111199742A true CN111199742A (en) 2020-05-26

Family

ID=70745896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811382011.5A Pending CN111199742A (en) 2018-11-20 2018-11-20 Identity verification method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN111199742A (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6748356B1 (en) * 2000-06-07 2004-06-08 International Business Machines Corporation Methods and apparatus for identifying unknown speakers using a hierarchical tree structure
JP2005316003A (en) * 2004-04-27 2005-11-10 Sony Corp Audio processing device
US20060277043A1 (en) * 2005-06-06 2006-12-07 Edward Tomes Voice authentication system and methods therefor
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof
CN102393943A (en) * 2011-06-27 2012-03-28 中国建设银行股份有限公司 Safe handling device and method for telephone bank system
US9558749B1 (en) * 2013-08-01 2017-01-31 Amazon Technologies, Inc. Automatic speaker identification using speech recognition features
CN107077848A (en) * 2014-09-18 2017-08-18 纽昂斯通讯公司 Method and apparatus for performing speaker recognition
JP2017054356A (en) * 2015-09-10 2017-03-16 フォースタイル株式会社 Contact history management system
CN105141619A (en) * 2015-09-15 2015-12-09 北京云知声信息技术有限公司 Account login method and device
CN107346568A (en) * 2016-05-05 2017-11-14 阿里巴巴集团控股有限公司 The authentication method and device of a kind of gate control system
CN206908625U (en) * 2017-05-22 2018-01-19 湖南中科优信科技有限公司 Voice-based long-range live body identification authentication system
CN107610706A (en) * 2017-09-13 2018-01-19 百度在线网络技术(北京)有限公司 The processing method and processing unit of phonetic search result
CN108040032A (en) * 2017-11-02 2018-05-15 阿里巴巴集团控股有限公司 A kind of voiceprint authentication method, account register method and device
CN107993662A (en) * 2017-12-20 2018-05-04 广州势必可赢网络科技有限公司 User identity identification method and device applied to telephone customer service
CN108391020A (en) * 2018-02-26 2018-08-10 出门问问信息科技有限公司 A kind of call control method, device, equipment and storage medium
CN108766444A (en) * 2018-04-09 2018-11-06 平安科技(深圳)有限公司 User ID authentication method, server and storage medium
CN108768654A (en) * 2018-04-09 2018-11-06 平安科技(深圳)有限公司 Auth method, server based on Application on Voiceprint Recognition and storage medium
CN108447489A (en) * 2018-04-17 2018-08-24 清华大学 A continuous voiceprint authentication method and system with feedback

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951247A (en) * 2021-03-23 2021-06-11 上海掌数科技有限公司 Method for quickly verifying voiceprint based on application scene and application thereof
CN115273859A (en) * 2021-04-30 2022-11-01 清华大学 Safety testing method and device for voice verification device
CN115273859B (en) * 2021-04-30 2024-05-28 清华大学 Security testing method and device for voice verification device

Similar Documents

Publication Publication Date Title
JP7619983B2 (en) End-to-end speaker recognition using deep neural networks
US12354608B2 (en) Channel-compensated low-level features for speaker recognition
US9940935B2 (en) Method and device for voiceprint recognition
Reynolds An overview of automatic speaker recognition technology
CN111199741A (en) Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
JP6677796B2 (en) Speaker verification method, apparatus, and system
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
US6205424B1 (en) Two-staged cohort selection for speaker verification system
WO2014114116A1 (en) Method and system for voiceprint recognition
JP2007279743A (en) Speaker authentication registration and confirmation method and apparatus
US9646613B2 (en) Methods and systems for splitting a digital signal
US10909991B2 (en) System for text-dependent speaker recognition and method thereof
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
US7050973B2 (en) Speaker recognition using dynamic time warp template spotting
US12412177B2 (en) Methods and systems for training a machine learning model and authenticating a user with the model
US20230153408A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
CN108447489A (en) A continuous voiceprint authentication method and system with feedback
CN111199742A (en) Identity verification method and device and computing equipment
HK40030619A (en) Identity verification method and device, and computing device
Pinheiro et al. Type-2 fuzzy GMM-UBM for text-independent speaker verification
KR20230103667A (en) Method and apparatus for determining speaker similarity
US20250046317A1 (en) Methods and systems for authenticating users
EP4506838A1 (en) Methods and systems for authenticating users
HK40030618A (en) Voiceprint recognition method and device, voiceprint verification method and device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40030619)
RJ01 Rejection of invention patent application after publication (application publication date: 20200526)