HK1199137B - Voice authentication and speech recognition system and method - Google Patents
Abstract
A method for configuring a speech recognition system comprises obtaining a speech sample utilised by a voice authentication system in a voice authentication process. The speech sample is processed to generate acoustic models for units of speech associated with the speech sample. The acoustic models are stored for subsequent use by the speech recognition system as part of a speech recognition process.
Description
Technical Field
The present invention relates to the automatic tuning and configuration of a speech recognition system operating as part of a voice authentication system. The result is a system that recognizes both who is speaking and what is being said.
Background
The key to building an effective speech recognition system is to create acoustic models, grammars and language models that enable the underlying speech recognition technology to reliably recognize what is being spoken within an application and to disambiguate or understand the speech given its context. The process of creating acoustic models, grammars and language models involves collecting a database of speech samples (also commonly referred to as voice samples) that represents the way speakers interact with a speech recognition system. To create these acoustic models, grammars and language models, each speech sample in the database needs to be segmented and labeled according to its word or phoneme components. All common components across all speakers (for example, every speaker who says the word "two") are then compiled and processed to create a word (or phoneme) acoustic model for that component. In large vocabulary phoneme based systems, the process also needs to be repeated to create language and accent specific models and grammars for each linguistic market. In general, generating an acoustic model that can accurately recognize speech requires about 1,000 to 2,000 examples of each word or phoneme (from each gender).
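As a rough illustration of the data-collection bookkeeping described above, the following sketch pools labeled segments across speakers and reports which units have accumulated enough examples (per gender) to train a model. The function names and data layout are illustrative assumptions, not taken from the patent:

```python
from collections import defaultdict

# Minimum examples per (unit, gender) before a model is considered trainable,
# per the 1,000-2,000 figure quoted above. Illustrative value only.
MIN_EXAMPLES = 1000

def pool_labeled_segments(samples):
    """Group labeled speech segments by (unit, gender) across all speakers.

    `samples` is an iterable of dicts like
    {"speaker": "s1", "gender": "f", "segments": [("two", <waveform>), ...]}.
    """
    pools = defaultdict(list)
    for sample in samples:
        for unit, waveform in sample["segments"]:
            pools[(unit, sample["gender"])].append(waveform)
    return pools

def trainable_units(pools, minimum=MIN_EXAMPLES):
    """Return the (unit, gender) pairs with enough pooled examples to train."""
    return sorted(k for k, v in pools.items() if len(v) >= minimum)
```

A model-training step would then run only over the pairs returned by `trainable_units`, which is the sufficiency condition the background section describes.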
Developing speech recognition systems for any linguistic market is a data-driven process. In the absence of speech data representing this market-specific language and accent, appropriate acoustic, grammatical, and language models cannot be generated. Thus, obtaining the necessary speech data (assuming it is available) and creating appropriate language and accent specific models for the new linguistic market can be particularly time consuming and expensive.
It would be advantageous if a speech recognition system could be provided that could be automatically configured in a cost-effective manner for any linguistic market.
Disclosure of Invention
According to a first aspect of the present invention, there is provided a method for configuring a speech recognition system, the method comprising:
obtaining a speech sample utilized by a voice authentication system in a voice authentication process;
processing the speech sample to generate a plurality of acoustic models of a plurality of speech units associated with the speech sample; and
storing the acoustic models for subsequent use by the speech recognition system as part of a speech recognition process.
In one embodiment, the speech units include triphones, diphones, cluster states, phonemes, words or phrases.
In one embodiment, the method further comprises evaluating speech content data associated with the speech sample to determine an audible identifier for each of the speech units, and classifying the acoustic models based on the determined audible identifiers.
In one embodiment, the method further comprises updating the stored acoustic models based on acoustic models generated from further obtained and processed speech samples.
In one embodiment, the method further comprises determining a quality of each of the stored acoustic models, and continuing to update the acoustic models until the quality reaches a predefined threshold.
In one embodiment, the voice samples are provided by different users of the authentication system during enrollment therewith.
In one embodiment, the method further includes storing the acoustic models in a generic speech recognition database.
In one embodiment, the method further comprises obtaining only speech samples associated with one or more predefined speech profiles selected from the group consisting of: language, gender, channel medium and grammar.
In one embodiment, the voice samples are provided by the same user either during enrollment with the authentication system or as part of a subsequent authentication session.
In one embodiment, the acoustic models are stored in a database specific to the user, and wherein the database is automatically accessed to perform the speech recognition process in response to the user authenticating himself to the authentication system.
According to a second aspect of the present invention there is provided a combined speech recognition and voice authentication method comprising setting a parameter of a speech recognition function using an output determined by a voice authentication of a user, for subsequent recognition of an utterance by the user.
In one embodiment, the output is utilized to select one of a plurality of acoustic model databases for use by the speech recognition function in recognizing speech of the user, each acoustic model database containing a set of acoustic models trained in a different manner.
In one embodiment, the database includes acoustic models of speech units that have been trained using voice data derived from speech provided by the user either during registration with the authentication system or during a subsequent authentication session.
In one embodiment, the database includes acoustic models of speech units that have been trained using speech samples provided by one or more other users having a voice profile shared with the user.
According to a third aspect of the present invention there is provided a computer readable medium embodying a computer program, the computer program comprising one or more instructions for controlling a computer system to implement a method as described above in accordance with the first aspect.
According to a fourth aspect of the present invention, there is provided a speech recognition system comprising:
a processing module operable to obtain a speech sample utilized by a voice authentication system in a voice authentication process, the processing module further arranged to process the speech sample to generate acoustic models of speech units associated with the speech sample; and
a storage module operable to store the acoustic models for subsequent use by the speech recognition system as part of a speech recognition process implemented by the processing module.
In one embodiment, the speech units include triphones, diphones, cluster states, phonemes, words or phrases.
In one embodiment, the processing module is further operable to evaluate the speech content data associated with the speech sample to determine an audible identifier for each of the speech units, and classify the acoustic models based on the associated identifier.
In an embodiment, the processing module is further arranged to update the stored acoustic models based on acoustic models generated from further obtained and processed speech samples.
In one embodiment, the processing module is further operable to determine a quality of each of the stored acoustic models, and to continue updating the acoustic models until the quality reaches a predefined threshold.
In one embodiment, the voice samples are provided by different users of the authentication system during enrollment therewith.
In one embodiment, the acoustic models are stored in a generic speech recognition database.
In one embodiment, the processing module is further operable to obtain only speech samples associated with one or more predefined profiles selected from the group consisting of: language, gender, channel medium and grammar.
In one embodiment, the voice samples are provided by the same user either during enrollment with the authentication system or as part of a subsequent authentication session.
In one embodiment, the system includes a database operable to store the acoustic models, and wherein the database is automatically accessed to perform the speech recognition process in response to the authentication system successfully authenticating the user.
According to a fifth aspect of the present invention, there is provided a combined speech recognition and voice authentication system, the system comprising:
a voice authentication function operable to authenticate a user utterance;
a voice recognition function operable to evaluate subsequent utterances by the user in response to a positive authentication by the voice authentication function; and
a parameter setting module operable to set a parameter of the speech recognition function based on a user identifier established by the voice authentication function.
In one embodiment, the identifier is utilized to select one of a set of acoustic model databases used by the voice recognition function in recognizing subsequent utterances of the user.
In one embodiment, the selected database includes acoustic models that have been trained using speech samples provided by the user either during enrollment with the authentication system or during a subsequent authentication determination.
In one embodiment, the selected database includes acoustic models that have been trained using speech samples provided by one or more other users having a voice profile shared with the user, the voice profile being determined from the voice authentication determination.
Drawings
The features and advantages of the present invention will become apparent from the following description of embodiments thereof, given by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a system according to one embodiment of the invention;
FIG. 2 is a schematic diagram of individual modules implemented by the voice processing system of FIG. 1;
FIG. 3 is a schematic diagram illustrating a process for creating a voiceprint;
FIG. 4 is a schematic diagram illustrating a process for providing speech recognition capabilities for the system of FIG. 1, according to one embodiment of the present invention;
FIG. 5 is a diagram illustrating a process for building a speech recognition model and grammar, according to one embodiment; and
FIG. 6 is a diagram illustrating a flow for providing user-specific speech recognition capabilities for the system of FIG. 1, according to one embodiment.
Detailed Description
Embodiments automatically create speech recognition models using speech samples processed by a voice authentication system (also commonly referred to as a voice biometric recognition system), which may be advantageously utilized to provide additional speech recognition capabilities. Since the generated model is based on samples provided by the actual users of the system, the system is tuned for these users and the system is therefore able to provide a high level of speech recognition accuracy for this user population. This technique also avoids the need to purchase "additional" speech recognition schemes that are not only expensive but may be difficult to obtain, especially for markets where speech databases suitable for creating acoustic models, grammars and language models used by speech recognition techniques are not available. Embodiments also relate to personalized speech recognition models for providing an even higher level of speech recognition accuracy for individual users of the system.
For purposes of illustration, and with reference to the figures, embodiments of the invention will be described below in the context of a voice processing system 102 that provides both voice authentication and speech recognition functionality for a security service 104, such as an interactive voice response ("IVR") telephone banking service. In the embodiment shown, the voice processing system 102 is implemented independently of the security service 104 (e.g., by a third party provider). In this embodiment, users of the security service 104 communicate with the security service 104 using an input device in the form of a telephone 106 (e.g., a standard telephone, mobile phone, or an Internet Protocol (IP) based telephony service such as Skype™).
FIG. 1 illustrates an example system configuration 100 for implementing one embodiment of the invention. As described above, the user communicates with the telephone banking service 104 using the telephone 106. The security service 104, in turn, connects to the voice processing system 102 to initially authenticate these users and thereafter provide voice recognition capabilities for the user's voice commands during the telephone banking session. In accordance with the illustrated embodiment, the voice processing system 102 is connected to the security service 104 through a communication network in the form of a public switched telephone network 108.
Further details of system configuration
Referring to fig. 2, the voice processing system 102 includes a server computer 105 that includes typical server hardware including a processor, a motherboard, random access memory, a hard disk, and a power supply. The server 105 also includes an operating system that cooperates with the hardware to provide an environment in which software applications may be executed. In this regard, the hard disk of the server 105 is loaded with a processing module 114 that is operable under the control of the processor to implement various voice authentication and speech recognition functions. As shown, the processing module 114 is made up of various individual modules/components for implementing the aforementioned functions, namely, a voice biometric trainer 115, a voice biometric engine 116, an automatic speech recognition trainer 117, and an automatic speech recognition engine 118.
The processing module 114 is communicatively coupled to a number of databases, including an identity management database 120, a voice file database 122, a voiceprint database 124, and a speech recognition model and grammar database 126. A number of personalized speech recognition model databases 128a through 128n may also be provided for storing models and grammars each tailored to a particular user's voice. A rules memory 130 is provided for storing various rules implemented by the processing module 114, as will be described in more detail in subsequent paragraphs.
The server 105 includes appropriate software and hardware for communicating with the secure service provider system 104. The communication may be over any suitable communication link, such as an internet connection, a wireless data connection, or a public network connection. In one embodiment, user voice data (i.e., data representing a voice sample provided by a user during enrollment, authentication, and subsequent interaction with the secure service provider system 104) is routed through the secure service provider 104. Alternatively, the voice data may be provided directly to the server 105 (in which case the server 105 would also implement an appropriate call answering service).
As discussed, the communication system 108 of the illustrated embodiment is in the form of a public switched telephone network. However, in an alternative embodiment, the communication network may be a data network, such as the Internet. In such an embodiment, the user may use a networked computing device to exchange data (in one embodiment, XML code and packetized voice messages) with the server 105 using a network protocol, such as the TCP/IP protocol. Further details of such an embodiment are outlined in international patent application PCT/AU 2008/000070, the contents of which are incorporated herein by reference. In another alternative embodiment, the communication system may additionally include a third or fourth generation (e.g., "3G", CDMA or GPRS enabled) mobile telephone network connected to the packet switched network, which may be utilized to access the server 105. In such an embodiment, the user input device 106 includes wireless capability for transmitting the voice samples as data. The wireless computing device may be, for example, a mobile telephone, a personal computer with a wireless card, or any other mobile communication device that facilitates voice communication. In another embodiment, the present invention may employ an 802.11-based wireless network or some other private network.
In accordance with the illustrated embodiment, the secure service provider system 104 is in the form of a telephone banking server. The secure service provider system 104 includes a transceiver that includes a network card for communicating with the processing system 102. The server also includes suitable hardware and/or software for providing answering services. In the illustrated embodiment, the security service provider 104 communicates with the user through the public switched telephone network 108 using a transceiver module.
Voiceprint enrollment
Before describing the technique for creating a speech recognition model in any detail, a basic flow for registering a speech sample and generating a voiceprint will first be described with reference to fig. 3. At step 302, a speech sample is received by the voice processing system 102 and stored in the voice file database 122 in a suitable file storage format (e.g., wav file format). The voice biometric trainer 115 processes the stored voice file at step 304 for generating a voiceprint associated with the identifier of the user providing the speech sample. The system 102 may request additional speech samples from the user until a sufficient number of samples have been received for creating an accurate voiceprint. Typically, for text-related implementations (i.e., where the text spoken by the user must be the same for enrollment and verification), three repetitions of the same word or phrase are requested and processed, thereby generating an accurate voice print. In the case of a text-independent implementation (i.e., where the user may provide any speech for verification purposes), more than 30 seconds of speech is requested for generating an accurate voice print. The quality of the voiceprint can be measured, for example, using the procedure described in australian patent 2009290150, issued to the same applicant, the contents of which are incorporated herein by reference. At step 306, the voiceprint is loaded into the voiceprint database 124 for subsequent use by the voice biometric engine 116 in a user authentication process (step 308). Verification samples provided by the user during the authentication process (which may be, for example, a passphrase, an account number, etc.) are also stored in the voice file database 122 for use in updating or "tuning" the stored voice print associated with the user using techniques well known to those skilled in the art.
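The sufficiency test described above (three repetitions of the same phrase for text-dependent enrollment, more than 30 seconds of speech for text-independent enrollment) can be sketched as a simple decision function. The function and mode names are illustrative assumptions, not part of the patented system:

```python
def enrollment_complete(mode, samples):
    """Decide whether enough speech has been collected for an accurate voiceprint.

    Thresholds follow the enrollment text above: three repetitions of the same
    word or phrase for text-dependent enrollment, or more than 30 seconds of
    speech for text-independent enrollment. `samples` is a list of
    (phrase_text, duration_seconds) tuples.
    """
    if mode == "text-dependent":
        # all repetitions must be of the same phrase, and at least three of them
        phrases = {text for text, _ in samples}
        return len(phrases) == 1 and len(samples) >= 3
    # text-independent: total speech duration matters, content does not
    total = sum(duration for _, duration in samples)
    return total > 30.0
```

The system would keep prompting the user for further samples (step 302) until this predicate holds, then proceed to voiceprint generation (step 304).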
Creating a generic speech recognition model
Referring to fig. 4, an extension of the enrollment process is shown that advantageously allows for the automatic creation of a generic speech recognition model for speech recognition capabilities based on enrolled voice files. At step 402, a stored voice file (which may be either a voice file provided during enrollment or a voice file provided after successful authentication) is passed to the ASR trainer 117, which processes the voice file to generate acoustic models for speech units associated with the voice file, as will be described in more detail in subsequent paragraphs. These acoustic models are then stored in the speech recognition model database 126 at step 404, each model preferably being generated from a plurality of voice files obtained from the voice file database 122. These models may then be used at step 406 to provide automatic speech recognition capabilities for users accessing the security service 104.
In more detail, and with additional reference to FIG. 5, the acoustic model generation step 402 includes using a segmenter module 502 to separate the voice files into speech units (also referred to as components) of the desired speech unit type. According to the illustrated embodiment, the different types of speech units that the segmenter module 502 may process include triphones, diphones, cluster states, phonemes, words and phrases, although it will be understood that any suitable speech unit may be processed depending on the desired implementation. The segmenter module 502 specifies a starting point and an ending point for each speech unit. The segmenter module 502 may be programmed to identify the end point of one speech unit as the starting point of the next speech unit. Alternatively, the segmenter module 502 may be programmed to identify a gap between the end of one speech unit and the start of the next speech unit. The waveform in this gap is referred to herein as "garbage" and may represent silence, background noise, noise introduced by the communication channel, or sounds produced by the speaker that are not associated with speech (e.g., breathing noise, "umm"s, "ah"s, hesitations, etc.). The trainer 506 uses such sounds to generate special models, commonly referred to in the art as "garbage models". The recognition engine 118 then uses these garbage models to recognize sounds heard in the speech samples that are not predefined speech units. The segmented non-garbage speech units are stored in step 504 in association with an audible identifier (hereinafter "classifier") derived from the speech content data associated with the original speech sample. For example, the voice processing system may store metadata containing words or phrases spoken by the user during enrollment (e.g., his account number, etc.).
The segmenter 502 may evaluate a phonetic lookup dictionary to determine the speech units (triphones, diphones, cluster states, or phonemes) that make up the enrolled word/phrase. Generic or prototype acoustic models of the speech units are stored in the segmenter 502, which uses them to segment the user-provided speech into its triphone, diphone, cluster state, or phoneme components. Further voice files are obtained, segmented and stored (step 504) until a sufficient number of samples of each speech unit have been obtained to create a generic speech model for each classified speech unit. In a particular embodiment, between 500 and 2,000 samples of each triphone, diphone, cluster state, or phoneme part are required to generate a generic acoustic model suitable for the classified part. According to the illustrated embodiment, when a new voice file is stored in the database 122, the ASR trainer 117 automatically processes it to create and/or update the acoustic models stored in the model database 126. Typically between 500 and 2,000 voice files are obtained and processed before a model is generated, in order to provide a model that adequately reflects the language and accents of the enrolled users. The speech units are then processed by a trainer module 506. The trainer module 506 processes the segmented speech units spoken by the enrolled speakers to create an acoustic model for each of the speech units required by the speech recognition system, using model generation techniques known in the art. Similarly, the trainer module 506 also compiles grammars and language models from the voice files associated with the speech units used for speech recognition.
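The segmenter behavior described above — marking the waveform between consecutive speech units as "garbage" — can be sketched as follows, assuming time-stamped unit intervals have already been located within the utterance. This is an illustrative sketch of the general idea, not the patented segmenter:

```python
def segment_with_garbage(intervals, total_duration):
    """Split an utterance into labeled speech units plus "garbage" gaps.

    `intervals` is a list of (label, start, end) tuples for recognized speech
    units, sorted by start time. Anything between one unit's end and the next
    unit's start (or the utterance boundaries) is labeled "<garbage>", so the
    trainer can later build garbage models from those gap waveforms.
    """
    out = []
    cursor = 0.0
    for label, start, end in intervals:
        if start > cursor:
            out.append(("<garbage>", cursor, start))  # gap before this unit
        out.append((label, start, end))
        cursor = end
    if cursor < total_duration:
        out.append(("<garbage>", cursor, total_duration))  # trailing gap
    return out
```

The non-garbage entries would then be stored against their classifier (step 504), while the garbage entries feed the garbage-model training.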
The grammar and language model is computed from a statistical analysis of the sequence of triphones, diphones, cluster states, phonemes, words and/or phrases in the speech samples, the statistical analysis representing the probability that a particular triphone, diphone, cluster state, phoneme, word and/or phrase is followed by another particular triphone, diphone, cluster state, phoneme, word and/or phrase. In this way, the acoustic, grammatical and language models are implemented specific to the way the speaker is enrolled in the system and thus specific to the accents and language spoken by the enrolled speaker. The generated models and the included grammars are stored in the database 126 for subsequent use in providing automatic speech recognition to users of the security service 104.
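The statistical analysis described above amounts to estimating, for each unit, the probability that it is followed by each other unit. A minimal bigram sketch of that general technique (not the patented implementation) might look like this:

```python
from collections import Counter, defaultdict

def bigram_language_model(unit_sequences):
    """Estimate P(next unit | current unit) from segmented speech samples.

    Counts how often one unit (triphone, phoneme, word, ...) is followed by
    another across all sequences, then normalizes each row into probabilities.
    """
    counts = defaultdict(Counter)
    for seq in unit_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for cur, followers in counts.items()
    }
```

Because the sequences come only from enrolled speakers, the resulting probabilities are specific to their language and accents, which is the property the passage above relies on.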
In one embodiment, certain rules are enforced by the processing module 114 that specify a minimum number of speech unit samples that must be processed before model creation. The rules may also specify the quality that the stored models must reach before the processing module 114 will use them to recognize speech. In particular embodiments, there may be one male and one female model for each classifier. According to such embodiments, the rules may specify that only voice samples from male users are selected to create the male models, and only voice samples from female users are selected to create the female models. Gender may be determined from stored metadata associated with known users, or by evaluating the sample, which involves acoustically processing the sample using both the female and male models and determining the gender based on the resulting authentication scores, i.e., a higher score against the male model indicates a male speaker, while a higher score against the female model indicates a female speaker. Additional or alternative models may equally be created for different languages, channel media (e.g., mobile phone, landline, etc.) and grammar profiles, such that a particular set of models will be selected based on the detected caller profile. The detected profile may be determined, for example, based on data available for the call (such as the phone line number or IP address of the current call, which may indicate the closest matching profile), or by processing the voice in parallel using many different models and selecting the model that produces the best or most appropriate result (e.g., by evaluating the resulting authentication scores).
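The parallel-scoring selection described above — processing a sample against several candidate model sets and keeping the best scorer — can be sketched in a few lines. Here `score_fn` is an assumed interface standing in for the acoustic processing and authentication scoring; the names are illustrative:

```python
def select_profile_models(sample, model_sets, score_fn):
    """Pick the model set whose models score highest against a voice sample.

    `model_sets` maps a profile name (e.g. "male"/"female", or a language or
    channel profile) to its model set; `score_fn(sample, models)` returns a
    numeric score for how well the sample matches those models.
    """
    return max(model_sets, key=lambda name: score_fn(sample, model_sets[name]))
```

The same selector covers the gender case (male vs. female model sets) and the broader profile case (language, channel medium, grammar) mentioned in the passage.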
Creating personalized speech recognition models
Once users have been successfully authenticated, they are considered to be 'known' to the system 102. In particular embodiments, once a user is known, a personalized model set may be created and subsequently accessed to provide greater speech recognition accuracy for that user.
In accordance with such an embodiment, and with additional reference to FIG. 6, a personalized voiceprint and speech recognition database 128 is provided for each user known to the system (see steps 602 through 606). The models may be initially configured from voice samples provided by the user during the enrollment process (e.g., in some instances the user may be required to provide multiple enrollment voice samples, such as stating his account number, name, PIN number, etc., which may be processed to create a limited number of models), from the generic models as previously described, or from a combination of both. When a user provides a new speech sample, a new model may be created and existing models updated, as necessary. It will be appreciated that a new sample may be provided either during, or after, successful authentication of the user (e.g., from voice commands issued by the user during a telephone banking session). The system 102 may also prompt the user from time to time (i.e., at step 602) to utter certain words, phrases, etc. to help build a more complete set of models for the user. Again, the process may be controlled by rules stored in the rules store 130.
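One way to picture the per-user model lifecycle described above — seed from generic models, refine with each new authenticated sample, and track when a model is "good enough" — is the following sketch. The running-average adaptation is a deliberate placeholder (a real system would use speaker-adaptation techniques such as MAP or MLLR), and all names are illustrative:

```python
class PersonalizedModelSet:
    """Per-user acoustic model set, seeded from generic models and refined
    with each new authenticated speech sample."""

    def __init__(self, generic_models):
        # unit -> (model_value, user_sample_count); count 0 marks a model
        # still in its generic, unadapted state
        self.models = {unit: (m, 0) for unit, m in generic_models.items()}

    def update(self, unit, feature):
        """Fold one new user sample for `unit` into the stored model."""
        model, n = self.models.get(unit, (feature, 0))
        # running average of feature values stands in for real re-estimation
        self.models[unit] = ((model * n + feature) / (n + 1), n + 1)

    def quality(self, unit, threshold=3):
        """A model counts as adapted once enough user samples have shaped it."""
        return self.models[unit][1] >= threshold
```

The rules in the rules store 130 would play the role of `threshold` here, governing how many samples must arrive before the personalized model is trusted for recognition.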
Although the embodiments described in the preceding paragraphs present the processing system 102 in the form of a "third party" or centralized system, it will be understood that the described functionality could equally be implemented by the secure service provider system 104 itself.
Alternative configurations and methods may include collecting a speaker's voice samples using a third party voice recognition function, such as the "Siri" personal assistant (as described in published U.S. patent application No. 20120016678, assigned to Apple Inc.) or the "Dragon" voice recognition software integrated into a mobile phone or other computing device (available from Nuance Communications, Inc. of Burlington, MA), with the mobile phone or other computing device used in conjunction with the voice authentication system described herein. In this case, speech samples from "known" speakers may be stored in the voice file database 122 and then used by the segmenter module 502 and the trainer module 506 to create a speech recognition model for the speaker using the process described above.
Alternatively, speech samples collected by a host service or a cloud service (such as a hosted IVR service or a cloud-based voice processing system used in conjunction with a voice authentication system) may also be used to create a speech recognition model using the methods described herein.
While the invention has been described with reference to the present embodiments, it will be understood by those skilled in the art that changes, variations and improvements may be made and equivalents may be substituted for elements thereof and steps thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the central scope thereof. However, such alterations, changes, modifications, and improvements (although not specifically described above) are intended to and are implied to be within the scope and spirit of the invention. Therefore, it is intended that the invention not be limited to the particular embodiments described herein, but that the invention will include all embodiments falling within the scope of the independent claims.
In the claims which follow and in the preceding description of the invention, except where the context requires otherwise due to express language or necessary implication, the word "comprise" or variations such as "comprises" or "comprising" is used in an inclusive sense, i.e. to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments of the invention.
Claims (23)
1. A method for configuring a speech recognition system, the method comprising:
obtaining a voice sample from a user, the voice sample being used to authenticate the user as part of an authentication process;
processing the speech sample to train one or more generic acoustic models of a plurality of speech units associated with the speech sample to create respective one or more personalized acoustic models;
storing the one or more personalized acoustic models in a set of personalized acoustic models for the user;
selectively retraining the one or more personalized acoustic models of the set of personalized acoustic models based on additional speech samples provided by the user that contain respective speech units; and
in response to determining that the user has accessed a speech recognition function, directing a speech recognition process to access the set of personalized acoustic models to recognize subsequent utterances of the user.
2. The method according to claim 1, wherein the speech units comprise triphones, diphones, cluster states, phonemes, words or phrases.
3. The method according to claim 2, further comprising evaluating voice content data associated with the voice sample to determine an audible identifier for each of the voice units, and classifying the personalized acoustic model based on the determined audible identifier.
4. The method according to any of the preceding claims, wherein the personalized acoustic models comprise a plurality of language and/or grammar models of the speech units.
5. A method according to any of claims 1 to 3, further comprising determining a quality measure for each of the one or more personalized acoustic models, and wherein the one or more personalized acoustic models are retrained based on additional speech samples until the respective quality measure meets a predefined threshold.
6. The method according to claim 4, further comprising determining a quality measure for each of the one or more personalized acoustic models, and wherein the one or more personalized acoustic models are retrained based on additional speech samples until the respective quality measure meets a predefined threshold.
7. The method of any of claims 1-3, wherein the speech recognition process is automatically directed to access the set of personalized acoustic models in response to successfully authenticating the user.
8. The method of claim 4, wherein the speech recognition process is automatically directed to access the set of personalized acoustic models in response to successfully authenticating the user.
9. The method of claim 5, wherein the speech recognition process is automatically directed to access the set of personalized acoustic models in response to successfully authenticating the user.
10. The method of claim 6, wherein the speech recognition process is automatically directed to access the set of personalized acoustic models in response to successfully authenticating the user.
11. A combined speech recognition and voice authentication method, comprising:
in response to a voice authentication function successfully authenticating a user, accessing a set of personalized acoustic language and/or grammar models for use by a speech recognition function in recognizing one or more utterances, the set of acoustic language and/or grammar models comprising a plurality of acoustic language and/or grammar models that have been trained using voice data derived from a plurality of utterances provided by the user either during enrollment with the authentication function or during one or more subsequent authentication processes.
12. The method of claim 11, wherein the set of personalized acoustic language and/or grammar models includes acoustic models of speech units that have been trained using voice samples provided by one or more other users having a shared voice profile.
13. A speech recognition system comprising:
a processing module operable to:
obtain a voice sample for use by an authentication system in authenticating a user as part of an authentication process;
process the voice sample so as to train one or more generic acoustic models of speech units associated with the voice sample, thereby creating corresponding one or more personalized acoustic models, and store the one or more personalized acoustic models in a set of personalized acoustic models for the user;
selectively retrain the one or more personalized acoustic models based on a plurality of additional voice samples provided by the user that contain a corresponding plurality of speech units; and
direct, in response to determining that the user has accessed a speech recognition function, a speech recognition process to access the set of personalized acoustic models for recognizing subsequent utterances of the user.
14. The system according to claim 13, wherein the phonetic units comprise triphones, diphones, cluster states, phonemes, words or phrases.
15. The system of claim 13, wherein the processing module is further operable to evaluate voice content data associated with the voice sample to determine an audible identifier for each of the speech units, and classify the personalized acoustic models based on the determined audible identifiers.
16. The system of claim 14, wherein the processing module is further operable to evaluate voice content data associated with the voice sample to determine an audible identifier for each of the speech units, and classify the personalized acoustic models based on the determined audible identifiers.
17. The system of any of claims 13 to 16, wherein the processing module is further operable to determine a quality measure for each of the one or more personalized acoustic models, and to continue to retrain the one or more personalized acoustic models until the respective quality measure meets a predefined threshold.
18. A system according to any one of claims 13 to 16, wherein the additional voice samples are provided by the user either during enrollment with the authentication system or as part of a subsequent authentication session performed by the authentication system.
19. A system according to claim 17, wherein the additional voice samples are provided by the user either during enrollment with the authentication system or as part of a subsequent authentication session performed by the authentication system.
20. The system of any of claims 13 to 16, wherein the set of personalized acoustic models is automatically accessed to perform the speech recognition process in response to the authentication system successfully authenticating the user.
21. The system of claim 17, wherein the set of personalized acoustic models is automatically accessed to perform the speech recognition process in response to the authentication system successfully authenticating the user.
22. The system of claim 18, wherein the set of personalized acoustic models is automatically accessed to perform the speech recognition process in response to the authentication system successfully authenticating the user.
23. The system of claim 19, wherein the set of personalized acoustic models is automatically accessed to perform the speech recognition process in response to the authentication system successfully authenticating the user.
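The flow set out in claims 1, 5 and 7 to 10 can be illustrated with a toy Python sketch. Every name here (`CombinedSystem`, `QUALITY_THRESHOLD`, the scalar `retrain` update) is a hypothetical stand-in invented for illustration: a real implementation would adapt actual acoustic models (e.g. HMM/GMM or neural models) rather than track a single quality number, and the quality measure and threshold are not specified by the claims.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.9  # hypothetical "predefined threshold" from claim 5


@dataclass
class PersonalizedModel:
    unit: str            # speech unit label, e.g. a word or phoneme
    sample_count: int = 0
    quality: float = 0.0  # toy quality measure in [0, 1]

    def retrain(self, sample_quality: float) -> None:
        # Stand-in for acoustic retraining: each additional sample
        # nudges the model's quality measure toward 1.0.
        self.sample_count += 1
        self.quality = 1.0 - (1.0 - self.quality) * (1.0 - sample_quality)


class CombinedSystem:
    """Sketch of the claimed flow: voice samples gathered for
    authentication double as training data for per-user
    personalized acoustic models."""

    def __init__(self) -> None:
        # One set of personalized models per user (claim 1),
        # plus a shared generic fallback set.
        self.model_sets: dict[str, dict[str, PersonalizedModel]] = {}
        self.generic_models: dict[str, PersonalizedModel] = {}

    def on_authentication_sample(self, user: str, units: list[str],
                                 sample_quality: float) -> None:
        # Claim 1: process the voice sample used for authentication
        # to train per-unit personalized models in the user's set.
        model_set = self.model_sets.setdefault(user, {})
        for unit in units:
            model = model_set.setdefault(unit, PersonalizedModel(unit))
            # Claim 5: selectively retrain only while the quality
            # measure is still below the predefined threshold.
            if model.quality < QUALITY_THRESHOLD:
                model.retrain(sample_quality)

    def models_for_recognition(self, user: str, authenticated: bool):
        # Claims 7-10: on successful authentication, recognition is
        # automatically directed at the personalized set; otherwise
        # it falls back to the generic models.
        if authenticated and user in self.model_sets:
            return self.model_sets[user]
        return self.generic_models
```

In this sketch the same `on_authentication_sample` call serves both enrollment and subsequent authentication sessions, which mirrors how the claims let the authentication process itself supply the training data the recognizer needs.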
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2012900256A AU2012900256A0 (en) | 2012-01-24 | Voice Authentication and Speech Recognition System | |
| AU2012900256 | 2012-01-24 | ||
| PCT/AU2013/000050 WO2013110125A1 (en) | 2012-01-24 | 2013-01-23 | Voice authentication and speech recognition system and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1199137A1 HK1199137A1 (en) | 2015-06-19 |
| HK1199137B true HK1199137B (en) | 2018-04-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104185868B (en) | Authentication voice and speech recognition system and method | |
| US20160372116A1 (en) | Voice authentication and speech recognition system and method | |
| US11900948B1 (en) | Automatic speaker identification using speech recognition features | |
| AU2013203139A1 (en) | Voice authentication and speech recognition system and method | |
| US10339290B2 (en) | Spoken pass-phrase suitability determination | |
| JP6394709B2 (en) | SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH | |
| US6161090A (en) | Apparatus and methods for speaker verification/identification/classification employing non-acoustic and/or acoustic models and databases | |
| US10032451B1 (en) | User recognition for speech processing systems | |
| US7533023B2 (en) | Intermediary speech processor in network environments transforming customized speech parameters | |
| US8010367B2 (en) | Spoken free-form passwords for light-weight speaker verification using standard speech recognition engines | |
| US20130110511A1 (en) | System, Method and Program for Customized Voice Communication | |
| KR102097710B1 (en) | Apparatus and method for separating of dialogue | |
| US9865249B2 (en) | Realtime assessment of TTS quality using single ended audio quality measurement | |
| Li et al. | Automatic verbal information verification for user authentication | |
| US20140188468A1 (en) | Apparatus, system and method for calculating passphrase variability | |
| US10866948B2 (en) | Address book management apparatus using speech recognition, vehicle, system and method thereof | |
| EP2541544A1 (en) | Voice sample tagging | |
| HK1199137B (en) | Voice authentication and speech recognition system and method | |
| Ali et al. | Voice reminder assistant based on speech recognition and speaker identification using kaldi | |
| KR20200114606A (en) | Methode and aparatus of providing voice |