METHOD AND DEVICE FOR MULTIMODAL INTERACTIVE BROWSING
The present invention relates to multimodal interactive browsing in communication networks having portable terminals. It also relates to multimodal interactive browsing in a communication network with a minimized use of network resources. More specifically, the invention relates to a simple multi-modal user interface concept, where a conventional network browser user interface is augmented for simple multi-modal interaction by offering a possibility for data entry by voice as an alternative to using manual input.
A fully fledged multimodal system accepts input and produces output in multiple modalities to the user. The available input modalities in a mobile terminal are usually considered to consist of some form of keyboard input, a speech recognition user interface, and potentially a pen user interface. The output modalities may consist of the standard general user interface of a regular browser, plus some form of audio output (spoken, music, etc.).
The most straightforward way of providing speech input capabilities to existing GUI-oriented markup would be to directly analyze the content for links and other items, and to build speech grammars for these. This approach has the problem that ambiguous grammar entries are produced when multiple elements share the same text, such as "press here". Due to this problem, on-the-fly grammar generation is not a universal solution for voice-enabling visual content.
The publication WO99/48088 describes a method in which a wearable computer extracts and compiles a speech grammar from a web document. Alternatively, this extraction and compilation takes place at a server. This grammar is used by the browser to enable voice browsing. WO99/48088 describes three similar mechanisms for allowing a user to employ voice commands to navigate pages. In one mechanism a "speech hint" or index value corresponding to each hyperlink in a Web page is determined and displayed on the Web page. When a voice command is received, a determination is made of whether the voice command corresponds to an index value. If the voice command corresponds to an index value, the hyperlink corresponding to the index value is activated to retrieve additional data. In the second mechanism, when a voice command is received, a determination is made of whether the voice command corresponds to the text associated with a hyperlink on the current page. If the voice command corresponds to the text associated with a hyperlink, the associated hyperlink is activated to retrieve additional data.
In a third mechanism, a voice command causes a list of hyperlinks to be displayed. Each
hyperlink is displayed with a corresponding index value. In response to receiving a voice command, a determination is made of whether the voice command corresponds to either hyperlink text or an index value corresponding to a hyperlink. If a match is found, the corresponding hyperlink is activated to retrieve additional data.
The document US 6101473 describes a method where voice browsing is realized by synchronous operation of a telephone network service and an internet service. This is prohibitive due to the waste of network resources, as it requires two different communication links. Further, this service requires an interconnection between the telephone service and the internet service. This architecture requires both speech and data to be transmitted over the air, thereby increasing the cost of the connection for the browsing session. This high cost of connection is difficult to justify for a would-be user of a multimodal browser. As another hurdle for user satisfaction, the over-the-air co-browser synchronization required in a distributed browser architecture may cause latencies in browser operation which degrade the user experience. The latencies will of course depend on the available bandwidth for the over-the-air synchronization connection.
The document US 6188985 describes a method in which a wireless control unit provides voice browsing capabilities to a host computer. For this purpose, a number of multimodal browser architectures have been proposed where these operations are placed on a network server. The proposed architectures differ in whether just the speech engines are placed on the network side (with a multimodal browser located in the wireless client terminal), or whether the multimodal browsing system is split so that the visual browser (residing in a wireless client terminal) and the voice browser reside in different parts of the network.
The known systems suffer from the fact that much legacy content is not especially designed for voice browsing, and therefore may include many different but phonetically similar voice input options, making the speech recognition task difficult. Further, placing the computationally expensive speech recognition and synthesis operations in a wireless client may be prohibitive from the perspective of implementation cost, especially if multimodal browsing is the only reason for implementing voice recognition, and especially if low-end devices are targeted.
All the above approaches to a multimodal browsing architecture have in common that they are not suitable for use in mobile terminal devices, such as mobile phones or handheld computers, due to low computing power or low battery capacity. Therefore, a computationally light multimodal interactive browser architecture is needed to overcome these problems.
One of the problems associated with a multimodal browser architecture is the equivocal design of the user interfaces for browsing the information pages.
Another problem associated with a multimodal browser architecture is the waste of network resources caused by transmitting large amounts of data as speech waveforms, or as feature vectors derived from speech, even though the information content of the recognition result is often merely a few bytes.
According to a first aspect of the present invention, there is provided a method for multimodal interactive browsing, comprising the steps of recalling a visual content and a vocal content at an electronic device, separating said contents, displaying said visual content on said electronic device, using an acoustic unit-based recognition algorithm for selecting an element from said vocal content, and displaying visual content according to said selection. The visual content typically includes manual input elements such as buttons, links, and so on. The vocal content comprises vocal equivalents to said manual input elements. Both contents can be recalled from an internal memory in the electronic device. The contents are separated and the visual content is displayed. Then an acoustic unit-based recognition algorithm is used to recognize a voice input from a user. The algorithm compares said recognized input with the vocal content, searching for similarities.
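For illustration only, a minimal Python sketch of this overall flow is given below. The page representation, the function names and the recognize placeholder are invented for this example and do not form part of the method itself; the actual acoustic unit-based recognition is described further below.

    # Illustrative sketch only: a multimodal page is modelled as a dict holding visual
    # markup and a vocal grammar that maps spoken phrases to input element selections.
    def browse(page, voice_input=None, manual_input=None):
        visual, vocal = page["visual"], page["vocal"]     # separate the contents
        display(visual)                                   # show the visual content
        if voice_input is not None:
            element = recognize(voice_input, vocal)       # acoustic unit-based matching
            if element is not None:
                return activate(element)                  # same effect as a manual selection
        if manual_input is not None:
            return activate(manual_input)
        return None

    def display(markup):
        print(markup)

    def recognize(spoken, vocal):
        # placeholder for the acoustic unit-based recognizer described later
        return vocal.get(spoken.strip().lower())

    def activate(element):
        print("activating", element)
        return element

    page = {"visual": "<select name='tocity'> London | Paris </select>",
            "vocal": {"london": "tocity=london", "paris": "tocity=paris"}}
    browse(page, voice_input="Paris")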
As in the case of a computer, in which a mouse and a keyboard represent different input modes, the voice equivalents and the manual input elements can optionally be used to start the same operations. Voice equivalents are words or text that can be transformed, e.g., into a string of acoustic units, e.g., phonemes. The voice equivalents can be related to a text or a shape used for the manual input elements. If the voice input element is related to a text, key words may be emphasized by a different color, script type or the like, as conventionally used on internet sites to indicate that a text is 'clickable' to start a link or the like. For example, WML (wireless markup language) content typically contains lists of items displayed on the screen, and the user may select one of these by voice, rather than having to key in the selection. The system is not limited to WML, but can be applied to every ML (markup language) used in networks, e.g. XML (Extensible ML), HTML (HyperText ML), XHTML (Extensible HTML), and the like. Obviously, the user still has the choice of using the manual user interface as well.
Preferably the method further comprises the steps of assembling a visual content and a vocal content into a multimodal information page, and transferring said information page to an electronic device, e.g., a terminal device like a mobile phone.
The vocal equivalents to be recognized, i.e. the textual words or phrases in the grammar, are converted into strings representing acoustic units. The speech grammars specify the vocabularies for data entry by speech. The acoustic units are preferably phonemes, but they may also be parts of phonemes or multiple phonemes. For example, it is sometimes useful to split each phoneme into three parts, the beginning, the mid-part and the end. The acoustic units can be represented by phonetic characters, normal characters used in the western languages, or any other type of characters, like the graphic characters used in many oriental languages.
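As an illustration of this conversion step, a small Python sketch is given below. A real system would use a pronunciation lexicon or letter-to-phoneme rules; the tiny lexicon and the optional three-part split are invented for this example.

    # Hypothetical sketch: converting vocal equivalents (words in the grammar) into
    # strings of acoustic units.
    LEXICON = {
        "london": ["l", "ah", "n", "d", "ah", "n"],
        "paris":  ["p", "ae", "r", "ih", "s"],
        "back":   ["b", "ae", "k"],
    }

    def to_acoustic_units(word, split_in_three=False):
        units = LEXICON.get(word.lower(), [])
        if split_in_three:
            # optionally split each phoneme into beginning, mid-part and end, as noted above
            units = [f"{u}_{part}" for u in units for part in ("b", "m", "e")]
        return units

    print(to_acoustic_units("Paris"))
    print(to_acoustic_units("back", split_in_three=True))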
When a string of acoustic units has been formed, it can be used to make a statistical model of each acoustic unit of the word it represents. In fact, there is a single statistical model for each single acoustic unit. The statistical model is typically a hidden Markov model (HMM), but other models can also be used. These models are then stored in the memory of the voice recognizer. In addition, the speech grammars define the words and sequences of words that are expected as user input. In the speech recognizer, the speech grammar is converted into the relevant sequences of (symbols of) acoustic units. The speech recognizer has a statistical model for each acoustic unit. Because the speech recognizer now has the speech grammar described as strings of symbols representing acoustic units, and it has the relevant models for the acoustic units, it can receive incoming speech and determine whether the speech (as detected by the statistical models for the acoustic units) contains strings of acoustic units matching one of the strings of acoustic units described by the speech grammar.
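The following sketch illustrates only the matching idea under strong simplifying assumptions: each acoustic unit is "modelled" by its symbol and an edit-distance comparison stands in for the statistical (e.g. HMM-based) decoding. The grammar entries and the threshold value are invented for this example.

    from difflib import SequenceMatcher

    GRAMMAR = {  # grammar rule -> string of acoustic units (from the conversion step above)
        "tocity=london": "l ah n d ah n",
        "tocity=paris":  "p ae r ih s",
    }

    def best_match(decoded_units, grammar, threshold=0.7):
        # score every grammar entry against the decoded unit string and keep the best one
        scored = [(SequenceMatcher(None, decoded_units, units).ratio(), rule)
                  for rule, units in grammar.items()]
        score, rule = max(scored)
        return rule if score >= threshold else None

    print(best_match("p ae r ih s", GRAMMAR))   # -> tocity=paris
    print(best_match("z z z", GRAMMAR))         # -> None (no acceptable match)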
Conveniently, the method further comprises the step of retrieving new content. The retrieved new content can be a multimodal information page, or a monomodal information page. The multimodal information page comprises vocal and visual contents. Monomodal information pages comprise only visual content. The method is capable of browsing monomodal information pages conventionally. The origin of the contents or the information pages is not important and can be an internal memory, an external memory, a server or computer connected to a network, and the like. The transmission medium is not relevant for the invention and can be radio, optical, wireless or wired, and so on. The transmission method is not relevant for the invention either and can be analog or digital, use time or frequency hopping algorithms, be coded or not, and so on.
Advantageously, using an acoustic unit-based algorithm comprises the steps of recognizing a voice input from the user, processing said voice input, comparing said processed voice input with said vocal content, and executing the input.
In the case of a simultaneous voice and manual input, it is necessary to determine which input
has the higher priority and is to be executed. The invention is not limited to recognizing conventional voice elements; it can use an extended model set of acoustic units including even unarticulated sounds used by humans, e.g., whistles, humming sounds, screams, or the like.
As in the case of conventional browsing, data such as information pages can be polled by terminal devices. The terms information page and multimodal information page are used to describe the data content, which includes visual and vocal content of information pages as well as visual and vocal data stored in the terminal device to support multimodal browsing. The main difference is that the contents of the information pages contain both visual and vocal content. As in the case of monomodal browsing, a request for data transmission is transferred from a terminal device to a server in the network. The server may have received a "multimodal" indication with the request, and therefore knows that the visual content of the requested page has to be delivered with additional vocal content. The server recalls or generates the vocal content, assembles it with the visual content and transmits it as a multipart document to the terminal device. The visual content is substantially the monomodal information page. The vocal content is substantially the grammar, which is represented by vocal input equivalents. The vocal content can also be generated automatically by an intermediate server before being sent to the requesting terminal device.
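By way of illustration only, the sketch below assembles the two parts into a generic MIME multipart container using Python's standard library; the exact multipart format and content types used between server and terminal are assumptions of this example and are not specified here.

    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    def assemble_multimodal_page(wml_markup, speech_grammar):
        page = MIMEMultipart()
        page.attach(MIMEText(wml_markup, "vnd.wap.wml"))            # visual content
        page.attach(MIMEText(speech_grammar, "x-speech-grammar"))   # vocal content
        return page.as_string()

    wml = '<select title="city" name="tocity"><option value="lon">London</option></select>'
    grammar = '$tocity = London | Paris ;'
    print(assemble_multimodal_page(wml, grammar)[:200])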
At the terminal device the multimodal information page is split into the visual and the vocal content. The visual content is displayed in a conventional manner. The vocal content is transferred to a voice recognizing means. It is now described how to bind visual content to vocal content. The grammars and the markup, i.e. the vocal and the visual content, have to be authored according to agreed authoring rules which define the bindings between the input elements of the markup language and the corresponding speech grammar rules. The approach is to use common naming between speech grammar rules (e.g. "$tocity") and elements in WML (e.g. <select title="city" name="tocity">). The rule names in native user interface (UI) grammars can be assigned to common user input operations (e.g. "$back" bound to a "back key", etc.). Multiple sets of vocal contents can be activated for a given context. However, it is sufficient to use just the retrieved vocal content, and not rely on the native vocal content at all.
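A minimal sketch of this common-naming convention is given below. The regular expressions are not a complete WML or grammar parser; they are only meant to show how rule names and element names can be matched up.

    import re

    wml = '<select title="city" name="tocity"> ... </select> <input name="fromcity"/>'
    grammar = '$tocity = London | Paris ; $back = back | previous ;'

    wml_names = set(re.findall(r'name="(\w+)"', wml))     # names of input elements
    rule_names = set(re.findall(r'\$(\w+)\s*=', grammar)) # names of grammar rules

    bindings = rule_names & wml_names       # rules bound to visible input elements
    native_rules = rule_names - wml_names   # e.g. "$back", bound to a native UI operation
    print(bindings, native_rules)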
A first grammar is the "native UI (User Interface) grammar" which resides in the phone. It includes definitions of vocabulary for common UI operations like navigation, control, options, submit, etc. The native UI grammar can be stored in the device in a read only memory, or in a random access memory. A random access memory has the advantage that different versions of the grammar can be downloaded, e.g., from a network. Therefore, only one type of terminal device needs to be produced, e.g., for Europe, and the respective native language can be installed, e.g., by software download.
A second grammar comes with the content, and it contains context-specific vocabulary for the UI, applying to fields in forms, radio buttons, and checkboxes. This grammar is for the complete information page and it contains parts that are specific to each manually oriented input element in the information page. When the context changes, the native grammar remains activated, but the context-specific grammar is changed. The grammar rules specific to, e.g., individual WML cards in the deck may be enabled or disabled depending on the current view of the browser.
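The sketch below illustrates this activation scheme; the rule names and card identifiers are invented for the example, and the native grammar is assumed to stay active while only the rules of the currently visible card are enabled.

    NATIVE_GRAMMAR = {"back", "options", "submit"}
    CARD_GRAMMARS = {
        "card_from": {"fromcity"},
        "card_to":   {"tocity"},
    }

    def active_rules(current_card):
        # native UI rules are always on; context-specific rules follow the visible card
        return NATIVE_GRAMMAR | CARD_GRAMMARS.get(current_card, set())

    print(active_rules("card_to"))   # native rules plus the rules of the visible card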
An auxiliary "politeness grammar" can be used to recognize pre-attached or post-attached additional phrases. Typical pre-attachrnents are, e.g., "please", "now", "go", "move", "kindly", "back to", and the like. Typical post-attachments are, e.g., "now", "please", "thank you". Human politeness used in communication with machines often include a zero information content. The politeness grammar enables the voice recognition system to recognize invalid speech input. The system therefore should be able to recognize every speech input without errors, which can prevent unnecessary "speech input could not be recognized, please try again" output. The system can be extended with other frequently used expression of colloquial speech, e.g., "well", "ah", "oh" or even slang expressions. To avoid the problem of low information content input recognition, a keyword-seeking algorithm can be used to pick out a certain keyword from multi- word speech input.
A page designer designing a multimodal web page can extend the visual content of a monomodal page with vocal content. As there are fewer obvious vocal equivalents than manual input elements, the designer may run out of vocal equivalents. This is easy to see: just try to describe the graphical task menu bar of a computer in words. By 'inverting' the order, i.e., first preparing the vocal content and then the visual content, it can be prevented that a page designer runs out of vocal equivalents.
It is possible to supply the content, e.g., with pre-recorded audio files for speech and music output. Additional audio files can be used as voice prompts. Another alternative is to synthesize the prompts from phoneme strings. The prompts can also be synthesized from a text using text-to-speech synthesis. Yet alternatively, voice prompts can be separately requested from the network. Alternatively, the mobile terminal device can be combined with a digital music player, e.g., an MP3 player, which receives digital music files, e.g., via WAP, and the received music files can be paid for via the telephone bill.
It is possible to refer to these files, e.g., from the WML content by using a similar naming convention approach as proposed above for cross-referencing, e.g., the WML content and the speech grammar. It is also possible to extend the content with speech synthesis markup elements. The voice equivalents may be processed in a standardized speech grammar format like the Speech Recognition Grammar Format (SRGF) or the Java Speech Grammar Format (JSGF) or the like. It is to be understood that the visual content may comprise parts that may be presented optically, acoustically, sensually, or the like.
Preferably, the method for multimodal interactive browsing can further comprise the steps of recalling and displaying a visual content, recognizing a voice input from the user, processing the voice input of the user, comparing said processed voice input with the voice input equivalents, and executing the input.
The voice input recognition can be triggered most easily by the user pressing a PTVI ('press to voice input') key or the like. A PTVI key prevents the system from accidentally 'auto-browsing' by eavesdropping. The visual content is displayed as in conventional browsing systems on a display, screen, or the like. In the case of a manual user input the browsing is executed in a conventional manner. To execute voice browsing the user has to make a voice input, and the device has to recognize that the user is making an input. The best way is to use a press-release-talk PTVI key, where the user just briefly presses the key and then talks. A digital signal processing algorithm can be used to find the end of the user's input. The PTVI key can even be combined with a timing circuit, so that a user needs to activate the PTVI key only once for, e.g., half a second of voice input, and, e.g., a digital signal processor is used to reset the timer if a longer input is detected. More conveniently, this can be executed by a specific voice input like spelling the word 'input', or by recognizing a certain voice activation keyword, whereby the device recognizes that the user is actually talking to the device. The use of a multimodal browser, e.g., in a mobile terminal device may result in a widespread use of voice activated devices. In a voice activated environment a user may need to address the device he is actually talking to, to prevent auto-activating other devices in the environment of the user.
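A minimal sketch of such a press-release-talk capture window is given below, assuming a frame-based timer: a brief PTVI key press opens the window, and a simple speech/no-speech check (standing in for the DSP end-of-input detector) resets the timer while the user is still speaking. The callbacks and frame values are placeholders for microphone input.

    def capture_after_ptvi(read_audio_frame, frame_has_speech, window_frames=5):
        # Capture frames after a brief PTVI key press until `window_frames`
        # consecutive frames without speech have been seen (timer analogue).
        frames, silent = [], 0
        while silent < window_frames:
            frame = read_audio_frame()
            frames.append(frame)
            silent = 0 if frame_has_speech(frame) else silent + 1
        return frames

    # Usage sketch with dummy callbacks standing in for the microphone and the detector.
    dummy = iter([b"speech", b"speech", b"silence"])
    frames = capture_after_ptvi(lambda: next(dummy, b"silence"),
                                lambda f: f == b"speech",
                                window_frames=3)
    print(len(frames), "frames captured")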
The actual method used for the voice recognition process is not decisive for the method according to the invention. The voice recognition can be executed, e.g., by transforming the user's acoustic speech input into a digital speech waveform. Feature vectors, descriptive of the different characteristics of the spoken input, are then extracted from the speech waveform. The feature vectors can be transformed into data that can be compared with the vocal content, e.g., the vocal equivalents. The comparison can be executed directly, or with the help of statistical models. The use of statistical models can enable the system to use recognition-error-compensating comparison methods and to use trainable input systems. Also, any other voice recognition system capable of transforming the voice input into data comparable with the vocal content of the multimodal page can be used.
The recognizer compares this digitized data against the active grammar and finds the best match. The corresponding grammar entry (rule) is used as input to the browser, as if the user had used some other means of providing the same input to the browser. If no corresponding voice equivalent can be found, the terminal device may start a "please repeat" or "input invalid" output, or alternatively completely ignore the spoken input. The features supported by the approach include selecting by speech one item from a list of items, navigating to a bookmark, shortcutting through multiple menu levels, filling a free-text input field by speech by selecting the filled item from a background speech grammar, etc. This can even include a (language) grammar recognition, e.g., for the input of a free text element, e.g., for generating an SMS (short message), an e-mail, or the like.
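For illustration, the sketch below feeds a matched grammar rule to the browser exactly as a manual selection would be, and falls back to a "please repeat" output when no match is found. The Browser class, its set_field method and the rule encoding are placeholders invented for this example.

    class Browser:
        def __init__(self):
            self.fields = {}
        def set_field(self, name, value):
            self.fields[name] = value
            print(f"browser: {name} set to {value}")

    def apply_recognition_result(browser, rule):
        if rule is None:
            print("please repeat")            # or silently ignore the spoken input
            return
        name, _, value = rule.partition("=")  # e.g. "tocity=paris"
        browser.set_field(name, value)

    apply_recognition_result(Browser(), "tocity=paris")
    apply_recognition_result(Browser(), None)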
Conveniently, the speaker independent voice recognition (SIVR) algorithm is an acoustic element recognition algorithm. The acoustic elements can be, e.g., phonemes. By using a phoneme recognition algorithm, the number of possible and recognizable elements of human speech is kept low, so that the algorithm is applicable to a low-cost, low-energy-consumption mobile communication device like a mobile phone.
Advantageously, the method for multimodal interactive browsing further comprises the steps of receiving a monomodal information page, analyzing the visual content of said monomodal information page, generating a vocal content according to said analyzed visual content, assembling said visual content and said vocal content into a multimodal information page, and transferring said information page to an electronic device. These steps are usually executed at a server or at a service provider. This enables a user to browse multimodally a formerly monomodal information page. This feature can help to establish multimodal browsing in a monomodal environment.
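A sketch of such a server-side transformation is given below, assuming the vocal content is generated from the selectable items of the visual WML content. The regular expressions are simplified and handle only the select/option case; the grammar syntax shown is illustrative.

    import re

    def generate_vocal_content(wml_markup):
        grammar = []
        for name, body in re.findall(r'<select[^>]*name="(\w+)"[^>]*>(.*?)</select>',
                                     wml_markup, re.S):
            options = re.findall(r"<option[^>]*>(.*?)</option>", body, re.S)
            grammar.append(f"${name} = " + " | ".join(o.strip() for o in options) + " ;")
        return "\n".join(grammar)

    wml = ('<select title="city" name="tocity">'
           "<option value='lon'>London</option><option value='par'>Paris</option></select>")
    print(generate_vocal_content(wml))   # -> $tocity = London | Paris ;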
Advantageously, the recalled visual content is recalled from an internal or an external memory. This can be used for browsing, e.g., the internal memory of a mobile terminal device. This can be used for starting all internal operations, or operations which do not access external memories and therefore do not require contact with a service provider or a server in a network. This can also be used to contact a server or a service provider by voice input. Accessing external memory is the standard procedure for accessing or browsing in a network like the internet, a WAP (Wireless Application Protocol) network, or the like.
According to another aspect of the present invention, a computer program for carrying out the method for multimodal interactive browsing is provided, which comprises program code means for performing all of the steps of the preceding description when said program is run on a computer or a network device.
According to yet another aspect of the invention, a computer program product is provided comprising program code means stored on a computer readable medium for carrying out the method for multimodal interactive browsing of the preceding description when said program product is run on a computer. A computer is a device capable of processing information. According to this definition a computer can be a portable computer, a mobile telephone, a communicator, or any device with information processing capability.
Preferably the computer program and the computer program product are distributed over different parts and devices of the network. The computer program and the computer program product run in different devices of the network. Therefore, the computer program and the computer program product may differ in capabilities and source code.
According to another aspect of the invention, an electronic device for the execution of multimodal interactive browsing comprises means for recalling a visual content and a vocal content, means for separating said contents, an acoustic unit-based voice recognition system for recognizing voice input, means for processing said recognized voice input, means for comparing data from said voice recognition system with said vocal content, means for displaying said visual content on said electronic device, and means for carrying out voice input.
The device comprises means for recalling a visual and a vocal content. This means has to be connected to, or incorporated in, a memory device from which to recall the contents. The means for recalling is connected to means for separating the visual and the vocal content. The means for separating the contents is connected to the displaying means, to display the visual content. When the visual content is displayed, the electronic device waits for input. The electronic device can comprise conventional voice-less input elements for conventional browsing. The electronic device comprises an acoustic unit-based voice recognition system for recognizing voice input. The actual design of the voice recognition system is not vital for the invention. It can be a speaker dependent or a speaker independent voice recognition system. The voice recognition system can be based on voice input only, or it can be a combined system, e.g., with a vision system for some kind of lip reading, or a pressure sensor for distinguishing between "B" and "P", or any other combined speech recognition system.
The voice recognition system is connected to the comparing means, to compare the recognized voice input with said vocal content. If the system detects similarities between the input and the content, the system carries out the command represented by the recognized vocal content element.
The means for recalling, separating, voice recognizing, for processing, comparing, and for carrying out the input can be integrated in at least one integrated circuit.
Preferably the electronic device further comprises means for retrieving new content. New content can be located in the electronic device itself, or in memory devices located apart from it. The new content can be stored, e.g., in a phone book in a mobile telephone or the like. In a computer device, the new content can be internet pages, and the retrieving means can therefore be a modem and internet software. In a mobile telephone this can be a WAP browser or the like. It is not vital how the new content is accessed, nor whether the new content is multimodal. The content can be, e.g., a multimodal internet page or a multimodal WML deck or the like.
Conveniently, said processing means further comprises means for recognizing a voice input from the user and means for processing said voice input. The means for recognizing a voice input can be designed to be capable of executing the voice input recognition as described in the preceding description of the method.
Preferably the voice recognition system is a phoneme recognition system. The advantages of a phoneme recognition system for a multimodal interactive browser are described in the preceding description of the method, and have the same effects.
According to yet another aspect of the invention, a network server for supporting multimodal interactive browsing is provided that comprises means for storing multimodal information pages, means for receiving data, and means for transmitting multimodal information pages. These are the minimum requirements for a network server required for multimodal browsing.
Preferably, the network server further comprises means for receiving a monomodal information page, means for analyzing the visual content of said monomodal information page, means for generating a vocal content according to said analyzed visual content, means for assembling said visual content and said vocal content into a multimodal information page, and means for transferring said multimodal information page to an electronic device. The benefits of a network server capable of transforming monomodal information pages into multimodal information pages are described in the preceding description of the method.
According to another aspect of the present invention, a multimodal interactive browsing system is provided that comprises at least a terminal device and a server as described in the preceding description, and uses multimodal information pages as described in the preceding description.
The terminal device can be a PDA (personal digital assistant), a notebook, a laptop, a WAP-enabled mobile phone, an organizer, or the like. The device can be connected via a Bluetooth™, infrared or even wired connection to a computer or data network. The actual form of the device is not limited to a hand-held portable browsing device. The device can be a pair of video goggles, because the voice input system enables the device to be keyless. The video goggles can be fitted with a remote control, or preferably be provided with an integrated eye cam, for tracking the eye movement to move a cursor. Even a visible cursor is unnecessary, because a user knows what he is focusing on with his eyes. The input can be executed by blinking, by voice input, or by manual input.
A key input system comprises any manually oriented input system, e.g., a keyboard, a mouse, a trackball, a joystick, a touch-pad, a touch-screen, and the like. Preferably the mobile terminal device can comprise the optical sensor of an optical mouse on its back, so that a cursor is moved when the whole terminal is moved over a surface. Another cursor-based input system comprises two wheels on the sides of the device to move the cursor on the display of the terminal.
The major advantages over other approaches are that:
- no simultaneous voice and data connections across the wireless channel are required,
- no new multi-modal authoring standards are needed; it is only required to define how two of the existing authoring standards (XHTML, speech grammars) are used together,
- the usage of specially developed speech grammars avoids the problems that emerge with automatic on-the-fly analysis of the content to generate the speech grammars to be used,
- because the speech recognizer runs in the device, no additional latencies occur in interpreting the user's speech input, as they would if a remote speech server had to be visited each time the user speaks, and
- relatively minor changes are required to already existing terminals which have a phoneme-based speech recognizer and an XHTML or WAP browser, minimizing the cost of development and the time of development.
In the following, the invention will be described in detail by referring to the enclosed drawings, in which:
Figure 1 is a flowchart of a multimodal interactive browsing operation according to one aspect of the present invention, and
Figure 2 is an example of a WML card with multimodal content and the corresponding speech grammar.
Figure 1 shows a flow chart of a multimodal interactive browsing operation. In the first step a multimodal information page, e.g., a web page, is received. The multimodal information page contains a visual content, manual input elements, and assigned voice equivalents in the form of a grammar. This can be received from an internal or an external memory, from a server via the internet, via WAP, or in any other way. In the second step the content of the multimodal information page is split into manual input elements and voice equivalents. The visual content and the manual input elements are displayed. After that the device waits for user input. In the case of a manual user input, after the determination that a manual input has been entered, the conventional chain of processes is executed as in the case of standard monomodal browsing. In the case of voice input, this voice input is transformed into a train of feature vectors. The train of feature vectors is then compared to the statistical models comprising strings of phonemes obtained from the voice equivalents in the multimodal information page and, if a matching voice equivalent is found, the respective manual input element is activated. If a respective voice equivalent cannot be found, the device may initiate a 'voice input failure' output.
In Figure 2 a sample WML deck is shown for a special application, an online booking system for air travel, where the user can download the page onto his/her mobile terminal, check the available options, select the To and From cities, and submit user information. For this application, the grammar files depicted on the right-hand side are used. This is a simple example of a WML application in which the user can select an option from a list of available choices. In the example, the user can select only one option, as the multiple attribute of the select element is set (by default) to false. Each of these cities is assigned a value which will be set for the name variable if selected. The default value if none is selected is given by the value attribute.
The WML content and the speech grammar arrive at the user terminal as a single entity (multipart content, not shown). Therefore, a module is needed that splits the content and feeds the appropriate parser with the right content. This module, called a "splitter", has to be aware of where the WML content starts and where it ends. The rest of the content defines the speech grammar.
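A minimal splitter sketch is given below, assuming for illustration that the two parts are separated by an agreed boundary marker; the exact multipart encoding used on the air interface is not specified by this example.

    BOUNDARY = "--grammar-part--"   # illustrative boundary marker

    def split_multipart(content):
        # hand the WML part to the visual parser and the remainder to the grammar parser
        wml_part, _, grammar_part = content.partition(BOUNDARY)
        return wml_part.strip(), grammar_part.strip()

    content = """<wml><card id="booking"> ... </card></wml>
    --grammar-part--
    $tocity = London | Paris ;"""
    wml, grammar = split_multipart(content)
    print("WML:", wml)
    print("Grammar:", grammar)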
The speech grammar depicted on the right-hand side can be divided into the native UI (User Interface) grammar, the politeness grammar, and the grammar defined by the author within the context. The native UI grammar and the politeness grammar can be stored in the device, or can be contained in the multimodal WML deck. The third part of the grammar has to be transferred with the WML deck, because its content is related to input elements of the visual content of the WML deck.
This application contains the description of implementations and embodiments of the present invention with the help of examples. It will be appreciated by a person skilled in the art that the present invention is not restricted to details of the embodiments presented above, and that the invention can also be implemented in another form without deviating from the characteristics of the invention. The embodiments presented above should be considered illustrative, but not restricting. Thus the possibilities of implementing and using the invention are only restricted by the enclosed claims. Consequently various options of implementing the invention as determined by the claims, including equivalent implementations, also belong to the scope of the invention.