MXPA98002752A - Method and apparatus for voice interaction in a network, using definitions of interaction with parameters - Google Patents
Method and apparatus for voice interaction in a network, using definitions of interaction with parameters
Info
- Publication number
- MXPA98002752A MXPA/A/1998/002752A MX9802752A
- Authority
- MX
- Mexico
- Prior art keywords
- interaction
- audio
- definition
- parameters
- document
- Prior art date
Abstract
The present invention relates to an audio display aid that executes a voice markup language (VML) browser. The audio display aid receives an interactive voice request. Based on the request, the network node obtains a document. The document includes voice markup and, when user interaction is required, a parameterized interaction definition or at least a link to a parameterized interaction definition. The audio display aid interprets the document according to the parameterized interaction definition. By using the interaction definition, user-supplied data are verified in the audio display aid instead of in a network server. In addition, the parameterized interaction definition can define a finite state machine. When it does, the parameterized interaction definition can be analyzed in a way that minimizes performance problems of the audio display aid.
Description
METHOD AND APPARATUS FOR VOICE INTERACTION IN A NETWORK, USING DEFINITIONS OF INTERACTION WITH PARAMETERS
FIELD OF THE INVENTION The present invention is directed to voice interaction over a network. More particularly, the present invention is directed to voice interaction over a network using parameterized interaction definitions. BACKGROUND OF THE INVENTION The amount of information available on communication networks is large and growing at a rapid rate.
The most popular of these networks is the Internet, which is a network of computers linked around the world. Much of the Internet's popularity can be attributed to the Internet's World Wide Web (WWW) portion. The
WWW is a portion of the Internet where information is typically passed between server computers and client computers using the Hypertext Transfer Protocol (HTTP).
A server stores information and serves (i.e., sends) the information to a client in response to a request from the client. Clients run computer software programs, often called browsers, which help request and display information. Examples of WWW browsers are Netscape Navigator, available from Netscape Communications, Inc., and Internet Explorer, available from Microsoft Corp. Servers, and the information stored on them, are identified by Uniform Resource Locators (URLs). URLs are described in detail in Berners-Lee, T., et al., Uniform Resource Locators, RFC 1738, Network Working Group, 1994, which is incorporated herein by reference. For example, the URL http://www.hostname.com/document1.html identifies the document "document1.html" on the host server "www.hostname.com". Thus, a client's request for information from a host server generally includes a URL. Information that is passed from a server to a client is usually called a document. These documents are generally defined in terms of a document language, such as the Hypertext Markup Language (HTML). Upon request from a client, a server sends an HTML document to the client. HTML documents contain information that is interpreted by the browser so that a representation can be shown to the user on a computer display screen. An HTML document can contain information such as text, logical structure commands, hypertext links, and user input commands. If the user chooses a hypertext link from the display (for example, by operating a mouse button), the browser will request another document from a server. Currently, most WWW browsers rely on text and graphics user interfaces. Thus, documents are presented as images on a computer screen. These images include, for example, text, graphics, hypertext links, and dialog boxes for user input. The majority of user interaction with the WWW is through a graphical user interface. Although audio data can be received and played back on a user's computer (for example, a file with a .wav or .au extension), the reception of this audio data is secondary to the graphical interface of the WWW. Thus, with most WWW browsers, audio data can be sent as a result of a user request, but there is no way for a user to interact with the WWW using an audio interface. An audio browser system is described in U.S. Patent Application No. 08/635,601, assigned to AT&T Corp. and titled "Method and Apparatus for Information Retrieval Using Audio Interface", filed on April 22, 1996, and incorporated herein by reference (hereinafter referred to as the "AT&T audio browser patent"). The described audio browsing system allows a user to access documents on a server computer connected to the Internet using an audio interface device. In one embodiment described in the AT&T audio browser patent, an audio interface device accesses a centralized audio browser that runs on an audio display aid. The audio browser receives documents from server computers that can be coupled to the Internet. The documents can include specialized instructions that allow them to be used with the audio interface device. The specialized instructions are typically similar to HTML.
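As a rough illustration of the client-server exchange described above, the sketch below issues an HTTP GET for a document identified by a URL. It is a minimal sketch only; the host name and document path are the illustrative ones from the example URL, not real endpoints.

```python
# Minimal sketch of an HTTP document request, as described above.
# The host and path come from the illustrative URL in the text; they
# are not real endpoints.
import http.client

def fetch_document(host: str, path: str) -> str:
    """Request a document from a server and return its body as text."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path)      # client sends a request that includes the URL path
        response = conn.getresponse()  # server serves (sends) the requested document
        return response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()

# Example use (hypothetical host from the text):
# html = fetch_document("www.hostname.com", "/document1.html")
```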
The specialized instructions may cause the browser to generate audio output from written text, or to accept user input through DTMF tones or automated speech recognition. A problem that arises with an audio browsing system that includes a centralized browser is that collecting user data often requires a complex sequence of events involving the user and the browser. These events include, for example: a) prompting the user for input; b) listing the input choices; c) asking the user for additional information; and d) informing the user that previously provided input was inconsistent or in error. We have found it convenient to program and customize the centralized browser in order to define the allowed sequences of events that can occur when the user interacts with the browser. However, when the browser is programmed and customized, it is important to minimize certain performance problems that result from both accidental and malicious programming. One problem is that a customized browser may become unresponsive if the customization contains, for example, an infinite loop. In addition to reducing the performance of the browser, to the detriment of other activity it performs, such a loop could allow a telephone call to be extended over time, disadvantageously adding to the cost of the call while potentially denying other calling subscribers access to the browser. Another problem, known as a "denial of service" attack, is easier for an attacker to carry out if the browser has been customized in a way that allows a calling party to keep the call connected without offering any input. Some of these performance problems are less important in the context of non-centralized browsers, because a poorly customized non-centralized browser typically affects only the computer running the browser and that computer's telephone lines, so programming errors are effectively quarantined. However, in the centralized-browser embodiment of the audio browsing system described in the AT&T audio browser patent, and in any centralized browser, when the audio display aid executing the centralized browser incurs performance problems, the negative effects of those problems are exacerbated. In an audio browsing system, multiple users access the same audio display aid through multiple audio interface devices, so many users are negatively affected when the audio display aid incurs performance problems. Therefore, it is desirable in an audio browsing system to minimize performance problems. Another problem with most known browsers is that data provided to the browser on the client computer is typically sent to the server, where it is verified and validated. For example, if a user enters data into a fill-in form in a browser, the data is typically sent to the Internet server, where it is verified that the form was properly completed (i.e., all required information has been provided, the required number of digits has been provided, etc.). If the form is not filled in properly, the server typically sends an error message to the client, and the user tries again to correct the errors. However, in an audio browsing system the data provided by the user is frequently in the form of speech. Speech is converted to voice data or voice files using speech recognition. However, using speech recognition to obtain voice data is not as accurate as obtaining data through keyboard input.
Therefore, even more verification and validation of data is required when the data is supplied using speech recognition. In addition, voice files converted from speech are typically large relative to data provided from a keyboard, which makes it impractical to send voice files frequently from the audio display aid to the Internet server. Therefore, it is desirable in an audio browsing system to perform as much verification and validation of the supplied data as possible in the browser, so that the number of times voice data is sent to the Internet server is minimized. Based on the foregoing, there is a need for an audio browsing system in which the performance problems of the audio display aid executing the browser are minimized, and in which supplied data is typically verified and validated in the browser instead of on the Internet server. SUMMARY OF THE INVENTION According to one embodiment of the present invention, an audio display aid executes a voice markup language (VML) browser. The audio display aid receives an interactive voice request. Based on the request, the network node obtains a document. The document includes voice markup and, when user interaction is required, a parameterized interaction definition, or at least a link to a parameterized interaction definition. The audio display aid interprets the document according to the parameterized interaction definition. By using the parameterized interaction definition, supplied data is typically verified in the audio display aid instead of in the network server. In addition, in one embodiment, the parameterized interaction definition defines a finite state machine. In this embodiment, the parameterized interaction definition can be analyzed in such a way that performance problems of the audio display aid are minimized. BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 shows a diagram of a telecommunications system suitable for practicing an embodiment of the present invention. Figure 2 illustrates the general form of a parameterized interaction definition. Figures 3A, 3B and 3C are an example of a parameterized interaction definition. DETAILED DESCRIPTION Figure 1 shows a diagram of a telecommunications system suitable for practicing one embodiment of the present invention. An audio interface device such as a telephone 110 is connected to a local exchange carrier (LEC) 120. Audio interface devices other than a telephone can also be used. For example, the audio interface device may be a multimedia computer having telephony capabilities. In one embodiment, a user of telephone 110 requests information by directing a telephone call to a telephone number associated with information provided by a document server, such as the document server 160. A user may also request information using any device that functions as an audio interface device, such as a computer. In the embodiment shown in Figure 1, the document server 160 is part of the communication network 162. In an advantageous embodiment, the network 162 is the Internet. Telephone numbers associated with information accessible through a document server, such as the document server 160, are configured so that calls to them are directed to special telecommunications network nodes, such as an audio display aid 150.
In the embodiment shown in Figure 1, the audio display aid 150 is a node in the telecommunications network 102, which is a long-distance telephone network. Accordingly, the call is routed to the LEC 120, which in turn directs the call to a long-distance carrier switch 130 via the trunk 125. The long-distance network 102 will generally have other switches similar to the switch 130 for directing calls; however, only one switch is illustrated in Figure 1 for clarity. It is noted that the switch 130 in the telecommunications network 102 is an "intelligent" switch, in that it contains (or is connected to) a processing unit 131 that can be programmed to perform various functions. The use of processing units in telecommunications network switches, and their programming, is well known in the art. Upon receiving the call at the switch 130, the call is directed to the audio display aid 150. In this way, an audio channel is established between the telephone 110 and the audio display aid 150. The routing of calls through a telecommunications network is well known in the art and will not be described further here. Upon receiving the call and request from the telephone 110, the audio display aid 150 establishes a communications channel with the document server 160 associated with the called telephone number via the link 164. In a WWW embodiment, the link 164 is a socket protocol connection over TCP/IP, the establishment of which is well known in the art. For additional information on TCP/IP, see Comer, Douglas, Internetworking with TCP/IP: Principles, Protocols, and Architecture, Englewood Cliffs, NJ, Prentice Hall, 1988, which is incorporated herein by reference. The audio display aid 150 and the document server 160 communicate with each other using a document-serving protocol. As used herein, a document-serving protocol is a communications protocol for the transfer of information between a client and a server. According to this protocol, a client requests information from a server by sending a request to the server, and the server responds to the request by sending a document containing the requested information to the client. In this way, a document-serving protocol channel is established between the audio display aid 150 and the document server 160 via the link 164. In an advantageous WWW embodiment, the document-serving protocol is the Hypertext Transfer Protocol (HTTP).
This protocol is well known in the WWW communications art and is described in detail in Berners-Lee, T. and Connolly, D., Hypertext Transfer Protocol (HTTP), Working Draft of the Internet Engineering Task Force, 1993, which is incorporated herein by reference. Thus, the audio display aid 150 communicates with the document server 160 using the HTTP protocol. As far as the document server 160 is concerned, it behaves as if it were communicating with any conventional WWW client running a conventional graphical browser. In this way, the document server 160 serves documents to the audio display aid 150 in response to requests it receives over the link 164. A document, as used herein, is a collection of information. The document can be a static document, in which case the document is predefined on the server 160 and every request for that document results in the same information being served. Alternatively, the document can be a dynamic document, in which case the information served in response to a request is generated dynamically at the time the request is made. Typically, dynamic documents are generated by scripts, which are programs executed by the server 160 in response to a request for information. For example, a URL can be associated with a script. When the server 160 receives a request including that URL, the server 160 will execute the script to generate a dynamic document, and the dynamically generated document will be served to the client requesting the information. Dynamic scripts are typically executed using the Common Gateway Interface (CGI). The use of scripts to dynamically generate documents is well known in the art. As will be further described below, according to the present invention, the documents served by the server 160 include voice markup instructions, which are instructions interpreted by the audio display aid 150. In order to facilitate interaction between the user of the telephone 110 and the audio display aid 150, in one embodiment the voice markup includes links to parameterized interaction definitions. Details of the parameterized interaction definitions will be described below. When the links are interpreted by the audio display aid 150, the interaction definitions are invoked with appropriate parameters. In another embodiment, the parameterized interaction definitions are included within the document. In one embodiment, the voice markup and the parameterized interaction definitions are written in an HTML-based language, but one customized especially for the audio display aid 150. An example of HTML-based voice markup instructions is "audio-HTML", described in the AT&T audio browser patent. When an HTML document is received by a client running a conventional WWW browser, the browser interprets the HTML document into an image and displays the image on a computer display screen. However, in the audio browsing system illustrated in Figure 1, upon receiving a document from the document server 160, the audio display aid 150 converts some of the voice markup instructions located in the document into audio data in a known manner, such as by using text-to-speech. Additional details of this conversion are described in the AT&T audio browser patent. The audio data is then sent to the telephone 110 via the switch 130 and the LEC 120.
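As a rough illustration of the dynamic-document mechanism described above, the sketch below shows a CGI-style script that generates a document at request time. The markup elements and the query parameter name are hypothetical illustrations, not syntax defined by the patent.

```python
#!/usr/bin/env python3
# Minimal sketch of a CGI-style script that generates a dynamic document,
# as described above. The markup and the "city" query parameter are
# hypothetical; the patent does not prescribe this content.
import os
import urllib.parse

def build_document(city: str) -> str:
    """Generate document content at request time instead of serving a static file."""
    return (
        "<HTML><BODY>\n"
        f"Weather report for {city}, generated on demand.\n"
        "</BODY></HTML>\n"
    )

if __name__ == "__main__":
    # CGI passes the query string to the script in an environment variable.
    query = urllib.parse.parse_qs(os.environ.get("QUERY_STRING", ""))
    city = query.get("city", ["unknown"])[0]
    body = build_document(city)
    print("Content-Type: text/html")
    print(f"Content-Length: {len(body.encode('utf-8'))}")
    print()
    print(body, end="")
```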
In this way, the user of the telephone 110 can access information from the document server 160 by means of an audio interface. Further, the user can send audio user input from the telephone 110 back to the audio display aid 150. This audio user input may, for example, be speech signals or DTMF tones. The audio display aid 150 converts the audio user input into user data or instructions that are appropriate to transmit to the document server 160 via the link 164 according to the HTTP protocol, in a known manner. Additional details of this conversion are described in the AT&T audio browser patent. The user data or instructions are then sent to the document server 160 via the document-serving protocol channel. In this way, the user's interaction with the document server is through an audio user interface. Parameterized interaction definitions are pre-defined routines that specify how input is collected from the user of the audio interface device 110 through prompts, feedback, and time intervals. Parameterized interaction definitions are invoked by specific voice markup instructions in documents when the documents are interpreted by the audio browser (referred to as the "voice markup language" or VML browser) executing in the audio display aid 150. In one embodiment, the instructions define links to parameterized interaction definitions. The parameterized interaction definitions may be located within the document or elsewhere within the audio browsing system illustrated in Figure 1 (for example, on the document server 160, in the audio display aid 150, or in any other storage device coupled to the audio display aid 150). In one embodiment, the parameterized interaction definitions are stored in a database coupled to an interaction definition server. The interaction definition server is coupled to the VML browser, so that the parameterized interaction definitions are available to the VML browser when requested. In addition, the parameterized interaction definitions can be part of the voice markup instructions themselves, in which case a link is not required. For example, a parameterized interaction definition may exist that allows a user to make a selection from a list of menu options. This parameterized interaction definition might be titled "MENU_INTERACT". If a document includes a section where such an interaction is required, a voice markup instruction may be written that invokes this interaction, such as "Call MENU_INTERACT, parameter 1, parameter 2". This voice markup, when interpreted by the VML browser, will invoke the parameterized interaction definition titled "MENU_INTERACT" and pass it parameters 1 and 2.
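A minimal sketch of how a VML browser might resolve such a "Call MENU_INTERACT, parameter 1, parameter 2" instruction and invoke the named definition is shown below. The registry-based dispatch, the parameter meanings, and the option encoding are assumptions for illustration; the patent does not fix this mechanism.

```python
# Sketch of resolving and invoking a parameterized interaction definition
# named in a voice markup instruction such as
# "Call MENU_INTERACT, parameter 1, parameter 2". The registry and the
# parameter meanings are illustrative assumptions only.
from typing import Callable, Dict, List

# Registry mapping interaction definition titles to their implementations.
INTERACTION_DEFINITIONS: Dict[str, Callable[..., str]] = {}

def register(name: str):
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        INTERACTION_DEFINITIONS[name] = fn
        return fn
    return wrap

@register("MENU_INTERACT")
def menu_interact(prompt: str, options: List[str]) -> str:
    """Hypothetical menu interaction: prompt the caller, return the chosen option."""
    # A real implementation would synthesize the prompt and collect DTMF/ASR input.
    print(f"(speak) {prompt}")
    for i, option in enumerate(options, start=1):
        print(f"(speak) Press {i} for {option}")
    return options[0]  # placeholder: pretend the caller pressed 1

def invoke_from_markup(instruction: str) -> str:
    """Parse 'Call NAME, param1, param2' and invoke the named definition."""
    name, *params = [part.strip() for part in instruction.removeprefix("Call ").split(",")]
    return INTERACTION_DEFINITIONS[name](params[0], params[1].split("|"))

# Example: choice = invoke_from_markup("Call MENU_INTERACT, Choose a fruit, apples|pears")
```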
Parameterized interaction definitions are what allow the present invention to achieve the previously described benefits (that is, to minimize the performance problems of the audio display aid and to verify and validate supplied data in the audio display aid instead of on the Internet server). Parameterized interaction definitions customize and modify the behavior of the centralized audio browser to achieve these benefits. Specifically, in one embodiment, the parameterized interaction definitions define finite state machines. It is well known that finite state machines can be fully analyzed before being executed, using known techniques. The analysis can determine, for example, whether the parameterized interaction definition will terminate if the user does not hang up and does not offer any input. This prevents a user from tying up the VML browser indefinitely while doing nothing. In addition, the analysis can determine whether all sections or states of the parameterized interaction definition can be reached by the user. Additionally, the analysis can determine whether the parameterized interaction definition includes sections or states that do not lead to an exit point, which would cause an infinite loop. These states can be revised or deleted before the parameterized interaction definition is interpreted or executed by the VML browser or the audio display aid 150. Because of the availability of these analysis tools, a developer of an audio browsing document that uses parameterized interaction definitions can ensure that disruptions to the browser are minimized, by using analyzed interaction definitions wherever the document requires user interaction. In addition, the parameterized interaction definitions provide verification of the user's input. Therefore, because the parameterized interaction definitions are interpreted in the audio display aid 150, there is minimal need for the user's input to be sent to the Internet server for verification. This saves time and telecommunications costs, because user input often consists of relatively large voice files. Examples of some of the possible types of parameterized interaction definitions include: a) menu, where the user makes a choice from a list of menu options; b) multimenu, where the user chooses a subset of options; c) text, where the user must provide a string of characters; d) digits, where the user must provide a sequence of digits whose length is not determined a priori; e) digitslimited, where the user must enter a predetermined number of digits; and f) recording, where the user's voice is recorded to an audio file. Figure 2 illustrates the general form of a parameterized interaction definition. Line 200 defines an interaction named "interaction_name" of the interaction type "interaction_type". In addition, line 200 declares all the media that can be used in the interaction. The media declared on line 200 include automatic speech recognition (ASR), keypad tones or DTMF (TT), and recording (REC). Line 202 defines a number of attribute parameters. The attribute parameters are used to pass parameters to the interaction and are included in the voice markup instruction that invokes the interaction. If parameters are not included in the voice markup instruction, a predefined value "default_value" is used as the parameter. Line 204 defines a number of message parameters.
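A minimal sketch of the kind of static analysis described above (state reachability and guaranteed termination when the user offers no input) is shown below, applied to a hypothetical state-graph representation of an interaction definition. The dict-based representation is an assumption for illustration, not the patent's format.

```python
# Sketch of the static analysis described above, on a hypothetical
# state-graph representation of an interaction definition. "timeout"
# edges model what happens when the user offers no input at all.
from typing import Dict, Set

# state -> {event: next_state}; "exit" is the terminal pseudo-state.
StateGraph = Dict[str, Dict[str, str]]

def reachable_states(graph: StateGraph, start: str) -> Set[str]:
    """All states the user can reach from the start state."""
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        if state in seen or state == "exit":
            continue
        seen.add(state)
        stack.extend(graph.get(state, {}).values())
    return seen

def terminates_without_input(graph: StateGraph, start: str, max_steps: int = 100) -> bool:
    """Check that following only timeout transitions eventually reaches 'exit'."""
    state, steps = start, 0
    while state != "exit":
        state = graph.get(state, {}).get("timeout")
        if state is None or steps > max_steps:
            return False  # no timeout path, or a timeout loop: the call could hang
        steps += 1
    return True

example: StateGraph = {
    "initial":    {"collect": "echochoice", "timeout": "inactivity"},
    "inactivity": {"collect": "echochoice", "timeout": "exit"},
    "echochoice": {"confirm": "exit", "timeout": "inactivity"},
    "orphan":     {"timeout": "orphan"},   # unreachable and non-terminating
}

if __name__ == "__main__":
    print(reachable_states(example, "initial"))          # 'orphan' is missing
    print(terminates_without_input(example, "initial"))  # True
```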
The message parameters can be used as formal placeholders within the state machine, allowing the prompts and messages to be specified when the interaction is used. Message parameters are also used to pass parameters to the interaction and are included in the voice markup instruction that invokes the interaction. Line 206 defines a number of counter variable declarations. Each counter is declared with an initial value. Operations allow such a variable to be decremented from a fixed initial value (typically less than 10) and tested for 0. Line 208 defines a number of Boolean variable declarations. Each Boolean variable is declared with an initial value. Line 210 defines a number of state declarations. Each state contains one of the following constructs: 1) an action, consisting of a message to be synthesized as speech and code to change the state, either immediately or as a result of triggered events. The input modes that are activated are also specified. For example, the ttmenu input mode, which is defined for menu-type interactions, specifies that events designating the selection of an option may occur as a result of the user providing a digit. Each event is named in an event transition, which specifies the side effects that occur when the event happens; or 2) a conditional expression, which allows the action to depend on the settings of the variables. Thus, a conditional expression consists of actions embedded in if-then-else constructs. An interaction defined in the language described above can be considered a finite state machine whose total state space is the product of the current state and the values of the various variables. Figures 3A, 3B and 3C are an example of a parameterized interaction definition. With reference to Figure 3A, line 300 defines the interaction type as a menu and names the interaction, with its parameters. Line 302 defines the attribute parameters. Lines 304 and 306 define counter variables. Lines 308, 310, 312, 314, 316 and 318 indicate the start of message parameters. With reference to Figure 3B, lines 320, 322 and 324 indicate the start of various states. With reference to Figure 3C, lines 326, 328 and 330 indicate the start of various states. Finally, line 332 indicates the end of the interaction definition. The "initial" state beginning at line 320 of Figure 3B will be described in more detail. The other states illustrated in Figures 3B and 3C work similarly. Initially, the state machine associated with the interaction is in the "initial" state, and the two counter variables TTERRCOUNT and TOCOUNT are initialized to MAXTTERROR and MAXTO, respectively. These values, if not explicitly overridden by parameters when the interaction definition is used, are 3 and 2, respectively. The "initial" state specifies that the PROMPT message (which is typically a parameter whose current value is the text in the voice markup document preceding the use of the interaction) is to be synthesized while the keypad tone (TT) command mode and the keypad tone menu selection mode (TTMENU) are activated. These activations allow the TTMENU COLLECT and TT INPUT = "HELPTT" events, respectively, to occur. The first type of event denotes a digit input that specifies a menu option selection. The second type of event refers specifically to the "HELPTT" input (whose default value is "##"). If an event of the first type occurs, then the next state of the finite state machine will be "echochoice".
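The general form described for Figure 2 can be sketched as a set of data structures, as below. The field names mirror the elements named in the text (attribute parameters with defaults, message parameters, counters, Booleans, states); the concrete structure is an assumption for illustration, not the patent's syntax.

```python
# Sketch of the general form of a parameterized interaction definition
# (Figure 2) expressed as Python data classes. Illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class EventTransition:
    event: str                       # e.g. "TTMENU COLLECT", "TIMEOUT", "TTFAIL"
    next_state: str                  # state entered when the event occurs
    decrement: Optional[str] = None  # counter decremented as a side effect, if any

@dataclass
class State:
    name: str
    prompt: str                      # message parameter to synthesize as speech
    input_modes: List[str]           # activated input modes, e.g. ["TT", "TTMENU"]
    transitions: List[EventTransition] = field(default_factory=list)

@dataclass
class InteractionDefinition:
    name: str                        # "interaction_name" on line 200
    interaction_type: str            # "menu", "digits", "recording", ...
    media: List[str]                 # declared media: ASR, TT, REC
    attributes: Dict[str, str]       # attribute parameters with default values
    messages: Dict[str, str]         # message parameters (prompts, error texts)
    counters: Dict[str, int]         # counter variables with initial values
    booleans: Dict[str, bool]        # Boolean variables with initial values
    states: Dict[str, State]         # state declarations, keyed by name
```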
If the second event occurs first, then the next state will be "help". If a meaningless keypad tone occurs, then the event transition involving the TTFAIL event specifies that TTERRCOUNT shall be decremented and that the next state is "notvalid". If none of these three events occurs within a period of time designated by "INACTIVITYTIME", then the TIMEOUT event occurs, TTERRCOUNT is decremented, and the next state is "inactivity". As described, the VML browser of the present invention interprets documents according to the parameterized interaction definitions. The parameterized interaction definitions allow an audio browsing system to minimize performance problems of the audio display aid and to verify data supplied in the audio display aid, instead of on an Internet server. In addition, the parameterized interaction definitions establish a dialogue for the input of data in a field (i.e., the field "HELPTT") in which sequences of system responses and user input can be specified and controlled. Each user-generated event, such as a key press or an utterance by the user, is handled and answered by the parameterized interaction definitions. The above detailed description is to be understood as illustrative and exemplary in every respect, but not restrictive, and the scope of the invention described herein is not to be determined from the detailed description, but rather from the claims as interpreted according to the fullest breadth permitted by the patent laws. It will be understood that the embodiments illustrated and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. For example, the audio browsing system illustrated in Figure 1 executes the VML browser as a centralized browser in the audio display aid 150. However, the present invention can also be implemented with other embodiments of an audio browsing system, including all the embodiments described in the AT&T audio browser patent. It is noted that, as of this date, the best method known to the applicant for carrying out the aforementioned invention is that which is clear from the present description of the invention. Having described the invention as above, property is claimed as contained in the following:
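The behavior of the "initial" state described above can be sketched as a small event handler, as below. The event encoding and the function interface are illustrative assumptions; the patent defines these states in its own markup language, not in Python.

```python
# Sketch of the "initial" state behavior from the example above, with the
# counter handling and transitions described in the text. Illustrative only.
MAXTTERROR = 3   # default error-count limit from the example
MAXTO = 2        # default timeout-count limit from the example
HELPTT = "##"    # default help key sequence from the example

def initial_state(event: str, counters: dict) -> str:
    """Return the next state for an event received while in 'initial'.

    event is one of:
      "collect:<digit>"  -- a digit selecting a menu option (TTMENU COLLECT)
      "help"             -- the HELPTT sequence was entered (TT INPUT = HELPTT)
      "ttfail"           -- a meaningless keypad tone
      "timeout"          -- no input within INACTIVITYTIME
    """
    if event.startswith("collect:"):
        return "echochoice"
    if event == "help":
        return "help"
    if event == "ttfail":
        counters["TTERRCOUNT"] -= 1
        return "notvalid"
    if event == "timeout":
        counters["TTERRCOUNT"] -= 1   # as stated in the text above
        return "inactivity"
    raise ValueError(f"unknown event: {event}")

if __name__ == "__main__":
    counters = {"TTERRCOUNT": MAXTTERROR, "TOCOUNT": MAXTO}
    print(initial_state("collect:2", counters))           # -> echochoice
    print(initial_state("timeout", counters), counters)   # -> inactivity, counter decremented
```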
Claims (21)
- CLAIMS 1. A method for operating an audio display aid, characterized in that it comprises the steps of: (a) receiving a request; (b) obtaining a document based on the request, wherein the document includes voice markup; and (c) interpreting the document according to a parameterized interaction definition. 2. The method according to claim 1, characterized in that the request is received over a public switched telephone network. 3. The method according to claim 1, characterized in that the request is received over a data network. 4. The method according to claim 1, characterized in that the document is obtained from a server connected to a data network. 5. The method according to claim 1, characterized in that the parameterized interaction definition is located in the document. 6. The method according to claim 1, characterized in that the parameterized interaction definition is located on a server coupled to a data network. 7. The method according to claim 6, characterized in that the document is interpreted in a voice markup language (VML) browser further coupled to the data network and to a public switched telephone network, the method further comprising the steps of: (d) receiving the interaction definition in the VML browser from the server; and (e) interpreting the document in the VML browser based on the interaction definition. 8. The method according to claim 1, characterized in that the parameterized interaction definition defines a finite state machine. 9. The method according to claim 8, characterized in that it further comprises the step of analyzing the parameterized interaction definition to determine whether it includes any infinite loops. 10. The method according to claim 8, characterized in that it further comprises the step of analyzing the parameterized interaction definition to determine whether all states cause the audio display aid to terminate due to lack of activity. 11. The method according to claim 8, characterized in that it further comprises the step of determining how long an interaction takes to terminate due to lack of user input. 12. The method according to claim 1, characterized in that the request is received from a telephone coupled to a public switched telephone network. 13. The method according to claim 1, characterized in that the document is stored on a server coupled to a data network, and the document is obtained from the server by a voice markup language (VML) browser coupled to the data network and to a public switched telephone network, based on the request received by the VML browser. 14. An audio browsing system in a network, comprising an audio display aid coupled to the network and executing a voice markup language (VML) browser, the VML browser being adapted to receive a request, obtain a VML document over the network, and interpret the VML document according to a parameterized interaction definition. 15. The audio browsing system according to claim 14, characterized in that it further comprises an interaction definition server coupled to the browser, the server being adapted to receive a request for the parameterized interaction definition from the browser and to send the requested interaction definition to the browser. 16. The audio browsing system according to claim 15, characterized in that it further comprises a database coupled to the server, the database storing the interaction definitions, and the server obtaining the interaction definitions from the database. 17.
The audio browsing system according to claim 14, characterized in that the request is received over a public switched telephone network. 18. The audio browsing system according to claim 14, characterized in that the request is received over a data network. 19. The audio browsing system according to claim 14, characterized in that the parameterized interaction definition is located in the VML document. 20. The audio browsing system according to claim 14, characterized in that the parameterized interaction definition is located on a server coupled to the network. 21. The audio browsing system according to claim 14, characterized in that the parameterized interaction definition defines a finite state machine.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US08835911 | 1997-04-10 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| MXPA98002752A | 1999-07-06 |