
HK1132831B - Method and system for providing speech recognition


Info

Publication number
HK1132831B
HK1132831B (application HK09110147.2A)
Authority
HK
Hong Kong
Prior art keywords
name
user
database
speech recognition
grammar
Prior art date
Application number
HK09110147.2A
Other languages
Chinese (zh)
Other versions
HK1132831A1 (en)
Inventor
David Sannerud
Original Assignee
Verizon Business Network Services Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 11/526,395 (published as US 8190431 B2)
Application filed by Verizon Business Network Services Inc.
Publication of HK1132831A1
Publication of HK1132831B


Description

Method and system for providing speech recognition
RELATED APPLICATIONS
This application claims priority from U.S. patent application serial No. 11/526,395 (attorney docket COS06005), filed on September 25, 2006, the contents of which are incorporated herein by reference.
Background
Speech recognition plays an important role in communication systems, both for collecting information from users and for providing information to them. Traditionally, Interactive Voice Response (IVR) systems have relied on a combination of dual tone multi-frequency (DTMF) and voice input to obtain and process information. However, for complex transactions that require the input of many numbers, letters, and words, the concept of IVR systems has been more appealing than the practice thereof. That is, for complex data entry, the typical DTMF interface has proven impracticably slow. As a result, organizations have increasingly relied on voice-based systems to augment DTMF input. Unfortunately, voice-based systems have introduced new, more challenging problems related to the intricacies of the endless variations in human speech and utterances. Namely, IVR systems implementing speech recognition technology have proven unacceptably inaccurate at converting a spoken utterance into a corresponding text string or other equivalent symbolic representation.
Accordingly, there is a need for an improved method for providing speech recognition.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements:
FIG. 1 is a diagram illustrating a communication system capable of providing speech recognition to obtain a name in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of an exemplary Interactive Voice Response (IVR) unit in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a speech recognition system according to an embodiment of the present invention;
FIGS. 4A and 4B are flow diagrams of a speech recognition process according to an embodiment of the present invention;
FIG. 5 is a diagram of a computer system that can be used to implement various embodiments of the present invention.
Detailed Description
An apparatus, method and software for providing speech recognition are described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It is apparent, however, to one skilled in the art that the present invention may be practiced without these specific details or with an equivalent arrangement. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Although various embodiments of the present invention are described with respect to speech recognition of proper nouns (e.g., names), these embodiments are considered applicable to generalized speech recognition using equivalent interfaces and operations.
FIG. 1 is a diagram illustrating a communication system capable of providing speech recognition to obtain a name, according to an embodiment of the present invention. The communication system 100 includes a speech recognition system (or logic) 101 that utilizes a name grammar database 103 and a confidence database 105. The speech recognition system 101 operates with an Interactive Voice Response (IVR) unit (or system) 107 that receives voice calls from a station 109 over a telephone network 111. The telephone network 111 can be a circuit-switched system or a packet voice network (e.g., a voice over Internet Protocol (VoIP) network). The packet voice network can be accessed by a suitable station 109, e.g., a computer, workstation, or other device (e.g., a Personal Digital Assistant (PDA), etc.) that supports microphone and speaker functionality. Among other functions, the IVR system 107 collects data from and provides data to the user; it is described more fully with respect to FIG. 2. Data collection is supported by the data store 113.
For illustrative purposes, the speech recognition system 101 is described with respect to the recognition of an audio signal representing a name. The user's name is arguably the most routinely collected and commonly used piece of information. Unfortunately, obtaining the user's name is a difficult task for conventional systems that utilize dual tone multi-frequency (DTMF) input interfaces. For example, DTMF interfaces become increasingly impractical as the number of letters in an individual's name increases. Moreover, many phone designs (particularly cellular phones) collocate the speaker and the dial pad, making it inconvenient for the user to operate the dial pad while listening to voice prompts. Therefore, speech recognition has been introduced to supplement the DTMF interface.
Conventional speech recognition interfaces depend heavily on grammatical content and common pronunciation rules to achieve accurate conversion results. With user names (or any proper noun), however, these techniques prove inadequate, because such words generally lack the grammatical content that can be used to distinguish among possible conversion choices. In addition, common pronunciation rules provide little, if any, benefit, since proper nouns contain a disproportionately large number of non-standard pronunciation variations. Thus, the variability of speech is compounded not only by the lack of content but also by the acoustic differences among the phonemes themselves.
In addition, a set of complexities independent of the type of speech being converted hampers voice recognition techniques. For example, the variability of sound introduced by ambient background noise, microphone position, and transducer quality compounds the loss of conversion accuracy. Speaker variability, arising from physical and emotional states, speech rate, voice quality and intensity, sociolinguistic background, dialect, and vocal tract size and shape, also contributes to the loss of recognition accuracy.
Returning to FIG. 1, the speech recognition system 101, described more fully below with respect to FIG. 3, can support a variety of applications including interaction with human users, such as call flow processing, directory assistance, business transactions (e.g., airline ticketing, stock brokering, banking, ordering, etc.), browsing/gathering information, and so forth.
Although not shown, the IVR system 107 can access the data store 113 via a data network, which can include a Local Area Network (LAN), a Wide Area Network (WAN), a cellular or satellite network, the Internet, or the like. Additionally, those of ordinary skill in the art will appreciate that the data store 113 can be directly linked to or included within the IVR system 107. The data store 113 can be any type of information store (e.g., a database, server, computer, etc.) that associates personalized information with a user name. The personalized information can include any one or combination of a date of birth, an account number (e.g., bank, charge card, billing code, etc.), a Social Security Number (SSN), an address (e.g., work, home, Internet Protocol (IP), Media Access Control (MAC), etc.), phone numbers (home, work, cell, etc.), and any other form of uniquely identifiable data, such as biometric data, voice prints, etc.
In one embodiment of the present invention, the data store 113 is configured to allow reverse retrieval of a user's name using one or more of the above-listed forms of personalized information. Further, the data store 113 can be automatically updated and maintained by any resource, including third-party vendors.
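As a non-authoritative illustration of such reverse retrieval, the following Python sketch indexes a small record set by several forms of personalized information and resolves a name from any of them; the record fields and sample data are hypothetical, not the data store's actual schema.

# Minimal sketch of reverse name retrieval from personalized information.
# All record fields and sample data are hypothetical illustrations.
records = [
    {"name": "George Smith", "ssn": "555-00-5555", "account": "A-1001"},
    {"name": "Tomas Smith", "ssn": "777-00-7777", "account": "A-1002"},
]

# Build one index per identifying field so any of them can key a lookup.
indexes = {
    field: {rec[field]: rec["name"] for rec in records}
    for field in ("ssn", "account")
}

def lookup_name(field, value):
    """Return the name associated with the given identifier, if any."""
    return indexes.get(field, {}).get(value)

print(lookup_name("ssn", "555-00-5555"))  # -> George Smith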
Although the speech recognition system 101 is shown as a separate component, it is contemplated that the speech recognition system 101 can be integrated with the IVR system 107.
FIG. 2 is a diagram of an exemplary Interactive Voice Response (IVR) system, according to an embodiment of the present invention. In this example, the IVR system 107 includes a telephony interface 201, a resource manager 203, and a voice browser 205. The IVR system 107 utilizes the telephony interface 201 to communicate with users over the telephone network 111. In alternative embodiments, other interfaces are utilized depending on the user's access method. Further, while the IVR system 107 is shown as a separate, distributed entity, it can incorporate some or all of this functionality into a single network element.
As shown, the resource manager 203 provides various speech resources, such as a verification system 207, an Automatic Speech Recognizer (ASR) 209, and a text-to-speech (TTS) engine 211. The TTS engine 211 converts text information (digital signals) from the voice browser 205 to speech (analog signals) for playback to the user. The TTS engine 211 accomplishes this conversion through a front end and a back end. The front end converts raw text into its written-out word equivalent through text normalization, preprocessing, and/or tokenization. Words are then assigned phonetic transcriptions and divided into prosodic units, e.g., phrases, clauses, and/or sentences. Using this combination of phonetic transcription and prosodic arrangement, the front end conveys a symbolic linguistic representation to the back end for synthesis. Based on the desired level of naturalness or intelligibility, the back end can generate speech waveforms by any of the following synthesis processes: concatenative, unit selection, diphone, domain-specific, formant, articulatory, Hidden Markov Model (HMM) based, and other similar methods, as well as any hybrid combination thereof. Through the synthesis process, the back end generates the actual sound output that is transmitted to the user.
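The front-end/back-end division of the TTS engine can be illustrated with a brief Python sketch; the abbreviation table and phoneme lexicon below are invented stand-ins, not the engine's actual data.

# Toy sketch of a TTS front end: text normalization followed by phonetic
# transcription. The rules and lexicon are illustrative only; a real back
# end would synthesize waveforms from this symbolic representation.
import re

ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}         # hypothetical table
LEXICON = {"doctor": "D AA K T ER", "smith": "S M IH TH"}  # toy phoneme dict

def normalize(text):
    """Expand abbreviations and tokenize into written-out words."""
    words = re.findall(r"[a-zA-Z.]+", text.lower())
    return [ABBREVIATIONS.get(w, w).strip(".") for w in words]

def transcribe(words):
    """Assign phonetic transcriptions (letter names as a fallback)."""
    return [LEXICON.get(w, " ".join(w.upper())) for w in words]

# The symbolic linguistic representation handed to the back end:
print(transcribe(normalize("Dr. Smith")))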
The ASR 209 can effectively act as the speech recognition system 101 or, alternatively, as an interface to the speech recognition system 101; the specific arrangement depends on the application. The ASR 209 converts the user's spoken language (represented by analog signals) into text or an equivalent symbolic form (digital signals) for processing by the voice browser 205 and/or the verification system 207.
Instead of or in addition to the TTS engine 211, the voice browser 205 can play pre-recorded sound files to the user. According to one embodiment of the invention, the resource manager 203 can include analog-to-digital and digital-to-analog converters (not shown) for transferring signals, for example, between the station 109 and the voice browser 205. Additionally, in an alternative embodiment, the voice browser 205 may contain speech recognition and synthesis logic (not shown) that implements the functions described above, extracting meaning from the user's spoken utterances and directly producing audio renditions of text.
The verification system 207 can be linked to the telephony interface 201, the ASR 209, or both components, according to the desired authentication method. For example, the verification system 207 can require a username, password, code, or other unique identification to restrict access to the voice browser 205. In this manner, the user is required to provide this information using spoken utterances transmitted through the ASR 209 or DTMF signals transmitted via the telephony interface 201. Alternatively, the verification system 207 can provide a level of non-intrusive security by positively identifying and screening users based on their voice prints transmitted from the telephony interface 201. In either embodiment, the verification system 207 can keep sensitive transactions secure.
The voice browser 205 functions as a gateway between a call and various network applications. In place of the keyboard, mouse, and monitor of a conventional network-based system, the voice browser 205 can use a microphone, keypad, and speaker. The voice browser 205 processes markup language pages residing on a server (not shown), such as Voice eXtensible Markup Language (VoiceXML), Speech Application Language Tags (SALT), Hypertext Markup Language (HTML), Wireless Markup Language (WML) for Wireless Application Protocol (WAP)-based cell phone applications, and the World Wide Web (W3) platform for handheld devices. Because a wide range of markup languages is supported, the voice browser 205 can be configured to include a VoiceXML-compatible browser, a SALT-compatible browser, an HTML-compatible browser, a WML-compatible browser, or any other markup-language-compatible browser for communicating with a user. Just as with standard web services and applications, the voice browser 205 can utilize standardized network infrastructure, i.e., Hypertext Transfer Protocol (HTTP), cookies, web caching, Uniform Resource Locators (URLs), secure HTTP, etc., to establish and maintain connections.
FIG. 3 is a diagram of a speech recognition system, according to an embodiment of the present invention. The speech recognition system 101 can provide automatic speech recognition of speaker-dependent and/or speaker-independent voice utterances from the user. Thus, the speech recognition system 101 processes voice communications transmitted over the telephone network 111 to determine whether a word or speech pattern matches any grammar or vocabulary stored within a database (e.g., the name grammar database 103 or the confidence database 105). The name grammar database 103 is made up of possible combinations of user names and spellings of those names. According to an embodiment of the invention, the name grammar database 103 can be created according to the NUANCE™ say-and-spell name grammar.
In alternative embodiments, database 103 can include any grammar database containing names and spellings of those names, as well as a thesaurus database, another grammar database, an acoustic model database, and/or a natural language definition database. The thesaurus database contains phonetic pronunciations for the words used in the grammar database. In addition, the acoustic model database defines the language utilized by the speech application.
Further, while only one name grammar database 103 and one confidence database are shown, it is to be appreciated that there may be multiple databases, controlled by, for example, a database management system (not shown). In a database management system, data is stored in one or more data containers, each container containing records, and the data within each record is organized into one or more fields. In a relational database system, data containers are referred to as tables, records are referred to as rows, and fields are referred to as columns. In object-oriented databases, data containers are referred to as object classes, records are referred to as objects, and fields are referred to as attributes.
As seen in FIG. 3, a supplemental grammar database 105, denoted the "confidence database," is used in conjunction with the name grammar database 103 to produce an accurate identification of the user's name. In an exemplary embodiment, the confidence database 105 can be derived from the primary name grammar database 103, for example as an N-best list (where N is an integer that can be set according to the particular application). The N-best results can include expected name results that are likely to improve recognition. In other words, the N-best result is a list of items returned from a grammar that correlate well with the caller's utterance. The N-best list is sorted by likelihood of matching and includes one or more entries. In this process, the correct name is added to the N-best supplemental grammar. According to one embodiment, no weighting or preference is given to any entry of the supplemental name grammar. This smaller subset of the full-name grammar, which contains both pseudonyms (decoys) and the correct name, allows better recognition of the caller's name. According to one embodiment of the invention, the supplemental grammar database can be created dynamically.
According to an exemplary embodiment, a pseudonym application 311 is utilized to generate variations of the names within the N-best list to improve the likelihood of recognition. These generated names, which may include the correct name, are provided as additional entries in the confidence database 105.
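A minimal Python sketch of this dynamic construction follows; the decoy-generation rules are hypothetical stand-ins for the pseudonym application 311, and, per the embodiment above, no weighting or preference is applied to the entries.

# Sketch of dynamically building the supplemental ("confidence") grammar:
# N-best hypotheses from the primary pass are merged with the name
# retrieved from the data store and generated decoy (pseudonym) variants.
def decoy_variants(name):
    """Generate simple spelling variants of a name (toy pseudonym rules)."""
    variants = {name.replace("i", "y"), name.replace("y", "i"),
                name.replace("th", "t")}
    variants.discard(name)        # keep only genuine variants
    return variants

def build_confidence_grammar(n_best, retrieved_name):
    """Merge N-best results, the retrieved name, and decoys, unweighted."""
    entries = set(n_best) | {retrieved_name}
    for name in list(entries):
        entries |= decoy_variants(name)
    return sorted(entries)

print(build_confidence_grammar(["thomas", "tomasz"], "tomas"))
# -> ['thomas', 'tomas', 'tomasz']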
The speech recognition system 101 is configured to process acoustic utterances to determine whether a word or speech pattern matches any of the names stored in the name grammar database 103 and/or the confidence database 105. When a particular utterance (or set of utterances) of a voice communication is identified as a match, the speech recognition system 101 sends an output signal for processing by the verification system 207 and/or the voice browser 205. It is contemplated that the speech recognition system 101 can include speaker-dependent and/or speaker-independent voice recognition. In addition, the speech recognition system 101 can be implemented with any suitable speech recognition technology capable of detecting and converting voice communications into text or other equivalent symbolic representations.
The speech recognition system 101 includes a digitizer 301 for digitizing audio input (e.g., speech), a parsing module 303, an edge comparison module 305, a confidence value generator 307, and an interpretation generator 309. In addition, the speech recognition system 101 uses the name grammar database 103 and the confidence database 105 to help identify the user's name more accurately; this process is described more fully with respect to FIGS. 4A and 4B.
In operation, the digitizer 301 accepts acoustic or audio signals (i.e., user utterances) from the telephony interface 201 and converts them to digital signals through an analog-to-digital converter. Once digitized, the signal is converted to the frequency domain using known methods, such as the discrete, fast, or short-time forms of the Fourier transform, and assembled into spectral frames for further processing. Since the human ear can only perceive audible sounds in the range from 20 Hz to 20 kHz, and since human voices typically produce speech in the range of only 500 Hz to 2 kHz, the digitizer 301 can be optimized to operate in these ranges. Note that the digitizer 301 can include a host of signal processing components, i.e., filters, amplifiers, modulators, compressors, error detectors/checkers, etc., for conditioning the signal, e.g., removing signal noise such as ambient noise, canceling transmission echoes, etc.
After the analog signal is processed by the digitizer 301, the corresponding digital signal is passed to the parsing module 303, which extracts acoustic parameters using known methods, e.g., linear predictive coding. For example, the parsing module 303 can identify acoustic feature vectors, including cepstral coefficients, that identify the phonetic classifications and word boundaries of the user utterance. It will be appreciated that other conventional modeling techniques can be used to extract one or more characteristics and/or patterns that classify the distinct acoustic portions of the digital signal.
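The digitizer and parsing stages can be sketched in Python as follows; the frame sizes are arbitrary, and coarse log band energies stand in for true cepstral coefficients.

# Sketch of the digitizer/parsing stages: frame the digitized signal,
# convert each frame to the frequency domain, and reduce each spectrum to
# a small acoustic feature vector. Parameters are illustrative only.
import numpy as np

def spectral_frames(signal, rate, frame_ms=25, step_ms=10):
    """Split the signal into overlapping frames; return magnitude spectra."""
    flen, step = rate * frame_ms // 1000, rate * step_ms // 1000
    n = 1 + max(0, (len(signal) - flen) // step)
    frames = np.stack([signal[i * step:i * step + flen] for i in range(n)])
    frames = frames * np.hamming(flen)        # taper the edges of each frame
    return np.abs(np.fft.rfft(frames, axis=1))

def feature_vectors(spectra, n_bands=12):
    """Collapse each spectrum into log-energy features over coarse bands."""
    bands = np.array_split(spectra, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

rate = 8000                                    # telephone-band sampling rate
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)             # stand-in for an utterance
print(feature_vectors(spectral_frames(tone, rate)).shape)  # (frames, bands)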
Once parsed, the various sound characteristics defined by the parsing module 303 are input to the edge comparison module 305 for comparison with, and identification as, recognized words, i.e., the user's first, middle, and/or last name. The edge comparison module 305 can use any known speech recognition method and/or algorithm, such as Hidden Markov Models (HMMs), together with the name grammar database 103 and the confidence database 105, to recognize user utterances as words. After the words are recognized, the interpretation generator 309 passes the associated equivalent text or symbolic representation (hereinafter collectively referred to as "values") to the voice browser 205 and/or the verification system 207 for appropriate processing.
In general, a grammar database stores all possible combinations of user utterances, and their associated values, that are accepted by a particular speech application. By way of example, a simple grammar, denoted "YESNOGRAMMAR", can be defined as follows:
YESNOGRAMMAR
[
(yes){true}
(no){false}
]
In this example, the content of the grammar is contained within the [ ] brackets. The edge comparison module 305 uses the terms within the ( ) brackets for comparison with the acoustic features extracted from the user utterance. When the acoustic features compare favorably with an item within the ( ) brackets, the value contained within the corresponding { } brackets is passed to the interpretation generator 309.
The edge comparison module 305 utilizes the confidence value generator 307 to determine a confidence level that measures the correlation of the recognized utterance with the term values within the grammar database. A high confidence value means that there is a greater degree of similarity between the recognized utterance and a term within the grammar database; conversely, a low confidence value means a weaker similarity. In the case where the utterance is not recognized, i.e., the confidence value generator 307 perceives no similarity to any item within the grammar, the edge comparison module 305 produces an "out-of-grammar" state and requires the user to re-enter the utterance.
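A toy Python sketch of this matching and confidence logic follows; string similarity stands in for the acoustic comparison (which, as described above, can use methods such as HMMs), and the threshold value is an arbitrary assumption.

# Sketch of the edge comparison / confidence value logic for a grammar
# such as YESNOGRAMMAR. String similarity stands in for the comparison of
# acoustic features against grammar terms.
from difflib import SequenceMatcher

YESNOGRAMMAR = {"yes": True, "no": False}      # (term) -> {value}

def recognize(utterance, grammar, threshold=0.6):
    """Return (value, confidence), or raise an out-of-grammar state."""
    scored = [(SequenceMatcher(None, utterance, term).ratio(), term)
              for term in grammar]
    confidence, best = max(scored)
    if confidence < threshold:
        raise LookupError("out-of-grammar: please re-enter your utterance")
    return grammar[best], confidence

print(recognize("yes", YESNOGRAMMAR))   # high confidence -> (True, 1.0)
print(recognize("yep", YESNOGRAMMAR))   # weaker similarity, still matches
# recognize("maybe", YESNOGRAMMAR)      # would raise the out-of-grammar state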
Using the simple YESNOGRAMMAR defined above, an exemplary speech recognition process is explained as follows. First, the IVR system 107 asks the user a question: "Are you going to Colorado?" If the user answers "yes," the speech recognition system 101 recognizes the utterance and passes the "true" result to the interpretation generator 309 for output to an appropriate device, such as the voice browser 205, for system processing. If, however, the user answers "maybe," the utterance cannot be compared to the "yes" or "no" values within the grammar YESNOGRAMMAR. In this case, a no-recognition condition occurs, and the edge comparison module 305 generates an "out-of-grammar" state and requires the user to re-enter the utterance.
In this regard, grammars are used to limit users to the values defined within the grammar, i.e., the desired utterances. For example, if the user is asked to speak a numeric identifier, such as a Social Security Number (SSN), the grammar would limit the first digit to the numbers 0 through 7, since no SSN starts with 8 or 9. Thus, if the user speaks an SSN starting with 8, when the speech recognition system 101 analyzes the utterance and compares it to the restricted grammar, the result will inevitably be an "out-of-grammar" state.
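A minimal sketch of such a restricted grammar check follows, here expressed in Python as a regular expression over the spoken digit string rather than as an acoustic grammar.

# Sketch of a restricted numeric grammar: the first SSN digit is limited
# to 0-7, mirroring the constraint described above.
import re

SSN_PATTERN = re.compile(r"[0-7]\d{2}-\d{2}-\d{4}")

def match_ssn(digits):
    """Return the digits if they fit the grammar; else signal out-of-grammar."""
    if SSN_PATTERN.fullmatch(digits):
        return digits
    raise LookupError("out-of-grammar")

print(match_ssn("555-00-5555"))          # accepted
# match_ssn("888-00-8888")               # first digit 8 -> out-of-grammar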
Unfortunately, user utterances cannot always be confined to the desired utterances. For example, the speech recognition system 101, utilizing the YESNOGRAMMAR grammar above, would not recognize a user utterance of "affirmative" in place of "yes," or of "negative" in place of "no." However, it is impractical to attempt to provide every possible alternative for a desired utterance, especially as the complexity of the desired utterance increases.
With speech recognition of proper nouns, or more specifically user names, an acute subset of such impracticalities arises. A simple name grammar, entitled SURNAMES, can be defined as follows:
SURNAMES
[
(white w h i t e) {white}
(brimm b r i m m) {brimm}
(cage c a g e) {cage}
(langford l a n g f o r d) {langford}
(whyte w h y t e) {whyte}
]
In this example, each grammar entry includes the name and the spelling of the name, and the grammar value is the name itself.
Since there is an almost endless array of user names, typical name grammars contain only a fraction of the possible names. In addition, the names stored within the name grammar are typically arranged or otherwise "tuned" to account for name commonality. While these features minimize the overloading of system resources and provide good coverage for common names, users who speak unique names that are not in the grammar will inevitably produce an "out-of-grammar" state. Furthermore, due to the similarity of speech and the "tuned" nature of the name grammar, users with uncommon spellings of common-sounding names, e.g., "Whyte" instead of "White," will be presented with the wrong name. These are precisely the impracticalities that the speech recognition system 101 seeks to address. The operation of the speech recognition system 101 is described next.
FIGS. 4A and 4B are flow diagrams of a speech recognition process, according to an embodiment of the present invention. In step 401, data (e.g., account information, social security number, or other personalized information) is received from the user as part of an application or call flow of the IVR system 107, for example. Through the use of this more readily identifiable data, such as an account number or social security number, a name associated with the account can be obtained, per step 403. Next, the user is queried for the name, as in step 405; the user is requested to speak and spell the name.
In step 407, audio input generated by the user in response to the name query is received. The process then applies speech recognition to the audio input using a primary name grammar database, such as the name grammar database 103, as in step 409. Per step 411, it is determined whether an out-of-grammar state exists. If so, the user is re-queried for the name, as in step 413. At this point, the process applies a high-confidence database to output the recognized name (step 415). That is, the process utilizes a second, high-confidence name grammar database (e.g., the confidence database 105) to output the last recognized name. In one embodiment, the names from the N-best list are combined with the name associated with the account number or social security number to generate a supplemental name grammar; this can be performed dynamically. Pseudonym names similar to the actual name can also be added to the supplemental name grammar. The confidence level (i.e., "high") can be predefined or preset depending on the application.
Thereafter, per step 417, the process determines whether the recognized name matches the retrieved name (as obtained in step 403). If there is a match, the last recognized name is confirmed with the user, per step 421. To confirm, for example, the process can provide a simple query as follows: "I heard <name>. Is that correct?"
If there is no match, as determined per step 419, the speech recognition process confirms the last recognized name with the user using more direct wording (step 423). For example, the process can ask: "I heard <name>. Are you sure that is the name on the account?"
According to one embodiment, the desired result is not revealed to the caller for security purposes; the caller must speak the desired result and confirm it. If the name is incorrect, as determined in step 425, the process returns to step 413 to re-query the user. This can be repeated any number of times (e.g., 3 times); that is, the number of repetitions is configurable. If the user exceeds the maximum number of retries, the call can end with a failure event. When the name is confirmed as correct, the process ends.
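The control logic of FIGS. 4A and 4B can be condensed into the following Python sketch; the recognizer, re-query, and confirmation callables are hypothetical interfaces, and only the branching mirrors the steps described above.

# Condensed sketch of the name-capture call flow. recognize(grammar)
# returns a name or None (out-of-grammar); requery_user() re-prompts;
# confirm_with_user(name, direct) plays the appropriate confirmation.
MAX_RETRIES = 3   # configurable number of repetitions

def capture_name(retrieved_name, recognize, requery_user, confirm_with_user):
    grammar = "primary_name_grammar"                 # steps 407-409
    for _ in range(MAX_RETRIES + 1):
        name = recognize(grammar)
        if name is None:                             # out-of-grammar, step 411
            requery_user()                           # step 413
            grammar = "confidence_grammar"           # supplemental, step 415
            continue
        # Steps 417-425: confirm, using more direct wording on a mismatch.
        if confirm_with_user(name, direct=(name != retrieved_name)):
            return name
        requery_user()
        grammar = "confidence_grammar"
    return None                                      # retries exceeded: failure

# Toy demo: the second attempt, against the supplemental grammar, succeeds.
answers = iter([None, "smith"])
print(capture_name("smith",
                   recognize=lambda g: next(answers),
                   requery_user=lambda: None,
                   confirm_with_user=lambda n, direct: True))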
For illustrative purposes, the speech recognition process is now explained with respect to three scenarios for a wage-reporting application that uses the SSN as personalized information. The first scenario involves using only the primary name grammar database 103, without utilizing the confidence database 105 (Table 1). The second scenario describes a case where a supplemental grammar database, e.g., the confidence database 105, is needed (Table 2). The last scenario, shown in Table 3, illustrates a failure condition.
Prompt | User response
First, say or key in your social security number. | 555-00-5555
Now, tell me your birthday. | July 4, 1976
Thank you. Now say and spell your first name as it appears on your social security card. | George, G-E-O-R-G-E
I got your name as <name and spelling recognized from the full-name grammar> George, G-E-O-R-G-E. Is that right? | Yes
Next, say and spell your last name as it appears on your social security card. | Smith, S-M-I-T-H
I got your name as <name and spelling recognized from the full-name grammar> Smith, S-M-I-T-H. Is that right? | Yes
Some people have another last name, e.g., a professional or married name, that may be listed under their social security number. Do you have another last name? Please say yes or no. | No
Please don't hang up while I check our database. This may take several seconds. |
Next, I need the wages earned in <past month> <year of past month>. Please tell me the total wages in dollars and cents. | $279.30
Please don't hang up while I send your information to the Social Security Administration. |
OK, those wages have been reported. Thank you for calling the SSA monthly wage reporting hotline. |
TABLE 1
Prompt | User response
First, say or key in your social security number. | 777-00-7777
Now, tell me your birthday. | July 4, 1976
Thank you. Now say and spell your first name as it appears on your social security card. | Tomas, T-O-M-A-S
I heard the name <name and spelling recognized from the full-name grammar> Thomas, T-H-O-M-A-S. Is that the name shown on your social security card? | No
Let's try that again. Say your first name, and then spell it right after, like this: "John, J-O-H-N". | Tomas, T-O-M-A-S
I heard the name <name and spelling recognized from the dynamically constructed grammar> Tomas, T-O-M-A-S. Is that right? | Yes
Next, say and spell your last name as it appears on your social security card. | Smith, S-M-I-T-H
I got your name as Smith, S-M-I-T-H. Is that right? | Yes
Some people have another last name, e.g., a professional or married name, that may be listed under their social security number. Do you have another last name? Please say yes or no. | No
Please don't hang up while I check our database. This may take several seconds. |
Next, I need the wages earned in <past month> <year of past month>. Please tell me the total wages in dollars and cents. | $1207.30
Please don't hang up while I send your information to the Social Security Administration. |
OK, those wages have been reported. Thank you for calling the SSA monthly wage reporting hotline. |
TABLE 2
Prompt | User response
First, say or key in your social security number. | 888-00-8888
Now, tell me your birthday. | July 4, 1977
Thank you. Now say and spell your first name as it appears on your social security card. | Kelly, K-E-L-L-Y
I heard the name <name and spelling recognized from the full-name grammar> Kelly, K-E-L-Y. Is that the name shown on your social security card? | No
Let's try that again. Say your first name, and then spell it right after, like this: "John, J-O-H-N". | Kellie, K-E-L-L-I-E
I heard the name <name and spelling recognized from the dynamically constructed grammar> Kellie, K-E-L-I-E. Is that the name shown on your social security card? | Yes
Next, say and spell your last name as it appears on your social security card. | Smith, S-M-I-T-H
I got your name as Smith, S-M-I-T-H. Is that right? | Yes
Some people have another last name, e.g., a professional or married name, that may be listed under their social security number. Do you have another last name? Please say yes or no. | No
Please don't hang up while I check our database. This may take several seconds. |
I'm sorry, we cannot process your request. Please check your information and try again later. |
TABLE 3
Thus, the speech recognition process of FIGS. 4A and 4B can be used to improve conventional say-and-spell name capture. The approach allows other information, or a combination of data such as a date of birth and an account number or social security number, to be used to obtain the user's (or caller's) name. The actual name can then be used in a supplemental name grammar to help recognize the caller's name.
The processes described herein for providing speech recognition may be implemented via software, hardware (e.g., a general-purpose processor, a Digital Signal Processing (DSP) chip, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), etc.), firmware, or a combination thereof. Such exemplary hardware for performing the described functions is described below.
FIG. 5 illustrates a computer system 500 upon which an embodiment in accordance with the invention can be implemented. For example, the processes described herein can be implemented using computer system 500. Computer system 500 includes a bus 501 or other communication mechanism for communicating information, and a processor 503 coupled with bus 501 for processing information. Computer system 500 also includes a main memory 505, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 501 for storing information and instructions to be executed by processor 503. Main memory 505 can also be used for storing temporary variables or other intermediate information during execution of instructions by the processor 503. Computer system 500 may further include a Read Only Memory (ROM)507 or other static storage device coupled to bus 501 for storing static information and instructions for processor 503. A storage device 509, such as a magnetic disk or optical disk, is coupled to bus 501 for persistently storing information and instructions.
Computer system 500 may be coupled via bus 501 to a display 511, such as a Cathode Ray Tube (CRT), liquid crystal display, active matrix display, or plasma display, to display information to a computer user. An input device 513, such as a keyboard including alphanumeric and other keys, is coupled to the bus 501 for communicating information and command selections to the processor 503. Another type of user input device is cursor control 515, such as a mouse, a trackball, or cursor direction keys to communicate direction information and command selections to processor 503 and to control cursor movement on display 511.
According to one embodiment of the invention, the processes described herein are performed by the computer system 500 in response to the processor 503 executing an arrangement of instructions contained in main memory 505. Such instructions can be read into main memory 505 from another computer-readable medium, such as the storage device 509. Execution of the arrangement of instructions contained in main memory 505 causes the processor 503 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the instructions contained in main memory 505. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement embodiments of the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
Computer system 500 also includes a communications interface 517 coupled to bus 501. The communication interface 517 is coupled to a network link 519 that provides a two-way data communication, wherein the network link 519 is connected to a local network 521. For example, communication interface 517 may be a Digital Subscriber Line (DSL) card or modem, an Integrated Services Digital Network (ISDN) card, a cable modem, a telephone modem, or any other communication interface to provide a data communication connection to a corresponding type of communications line. As another example, communication interface 517 may be a Local Area Network (LAN) card (e.g., for an Ethernet (TM) or Asynchronous Transfer Mode (ATM) network) to provide a data communication connection to a compatible LAN. Wireless links can also be implemented. In any such implementation, communication interface 517 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information. In addition, the communication interface 517 can include peripheral interface devices such as a Universal Serial Bus (USB) interface, a PCMCIA (personal computer memory card International Association) interface, or the like. Although a single communication interface 517 is depicted in fig. 5, multiple communication interfaces can be used.
The network link 519 typically provides data communication through one or more networks to other data devices. For example, the network link 519 may provide a connection through local network 521 to a host computer 523, which has connectivity to a network 525 (e.g., a Wide Area Network (WAN) or the global packet data communication network now commonly referred to as the "Internet"), or to data devices operated by a service provider. The local network 521 and network 525 both use electrical, electromagnetic, or optical signals to convey information and instructions. The signals through the various networks, and the signals on the network link 519 and through the communication interface 517 that communicate digital data with the computer system 500, are exemplary forms of carrier waves bearing the information and instructions.
The computer system 500 can send messages and receive data, including program code, through the network(s), the network link 519 and the communication interface 517. In the Internet example, a server (not shown) might transmit requested code belonging to an application program for implementing an embodiment of the present invention through the network 525, the local network 521 and the communication interface 517. The processor 503 may execute code transmitted while being received and/or code stored in the storage device 509, or other non-volatile storage for later execution. In this way, computer system 500 may obtain application code in the form of a carrier wave.
The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 503 for execution. Such a medium may be represented in many forms, including but not limited to, non-volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as the storage device 509. Volatile media includes dynamic memory, such as main memory 505. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 501. Transmission media can also take the form of acoustic, light, or electromagnetic waves, such as those generated during radio frequency and infrared data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CDRW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in providing instructions to a processor for execution. For example, the instructions for carrying out at least part of the invention may initially be borne on a magnetic disk of a remote computer. In such a case, the remote computer loads the instructions into main memory and sends them over a telephone line using a modem. A modem of a local computer system receives the data on the telephone line and uses an infrared transmitter to convert the data to an infrared signal, which it transmits to a portable computing device, such as a Personal Digital Assistant (PDA) or a laptop. An infrared detector on the portable computing device receives the information and instructions carried by the infrared signal and places the data on a bus. The bus conveys the data to main memory, from which the processor retrieves and executes the instructions. The instructions received by main memory can optionally be stored on the storage device either before or after execution by the processor.
In the foregoing specification, various preferred embodiments have been described with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the invention as set forth in the claims that follow. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (14)

1. A method for providing speech recognition, comprising:
obtaining a name from a user based on data provided by the user;
querying the user for the user's name;
receiving a first audio input from the user in response to the query;
applying speech recognition to the first audio input using a name grammar database to output a recognized name;
determining whether the recognized name matches the retrieved name;
re-querying the user for the user's name if no match is determined;
receiving a second audio input from the user in response to the re-query; and
applying speech recognition to the second audio input using a confidence database having fewer entries than the name grammar database, wherein the confidence database has entries derived from the name grammar database, the entries being ranked by confidence level.
2. The method of claim 1, further comprising:
querying the user for the data, wherein the data includes one of business information or personal information.
3. The method of claim 1, further comprising:
confirming the recognized name with the user.
4. The method of claim 3, wherein the confirming is performed by audibly providing the recognized name to the user.
5. The method of claim 1, further comprising:
determining a failed state if no match is found with the retrieved name after a predetermined number of repeated re-queries of the user name.
6. The method of claim 1, further comprising:
determining additional entries for the confidence database using a pseudonym application.
7. The method of claim 1, further comprising:
determining a confidence level of a comparison between the retrieved name and the recognized name associated with the first audio input or the second audio input.
8. A system for providing speech recognition, comprising:
means for obtaining a name from a user based on data provided by the user;
means for querying the user for the user's name;
means for receiving a first audio input from the user in response to the query;
means for applying speech recognition to the first audio input using a name grammar database to output a recognized name;
means for determining whether the recognized name matches the retrieved name;
means for re-querying the user for the user's name if no match is determined;
means for receiving a second audio input from the user in response to the re-query; and
means for applying speech recognition to the second audio input using a confidence database having fewer entries than the name grammar database, wherein the confidence database has entries derived from the name grammar database, the entries being ranked by confidence level.
9. The system of claim 8, further comprising means for querying the user for the data, wherein the data comprises one of business information or personal information.
10. The system of claim 8, further comprising means for confirming the recognized name with the user.
11. The system of claim 10, wherein the confirming is performed by audibly providing the recognized name to the user.
12. The system of claim 8, further comprising means for determining a failed state if no match is found with the retrieved name after a predetermined number of repeated re-queries of the user's name.
13. The system of claim 8, further comprising means for determining additional entries for the confidence database using a pseudonym application.
14. The system of claim 8, further comprising means for determining a confidence level of a comparison between the retrieved name and the recognized name associated with the first audio input or the second audio input.
HK09110147.2A 2006-09-25 2007-09-25 Method and system for providing speech recognition HK1132831B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11/526,395 US8190431B2 (en) 2006-09-25 2006-09-25 Method and system for providing speech recognition
US11/526,395 2006-09-25
PCT/US2007/079413 WO2009064281A1 (en) 2006-09-25 2007-09-25 Method and system for providing speech recognition

Publications (2)

Publication Number Publication Date
HK1132831A1 HK1132831A1 (en) 2010-03-05
HK1132831B true HK1132831B (en) 2013-07-12

