HK1160957B - Audio user interface - Google Patents
- Publication number
- HK1160957B
- Authority
- HK
- Hong Kong
- Prior art keywords
- audio
- audio prompt
- media player
- user
- user interface
- Prior art date
Description
Cross Reference to Related Applications
This application is related to U.S. patent application No. 10/981,993, entitled "Audio User Interface for Computing Devices," filed on 4.2004, and U.S. patent application No. 10/623,339, entitled "Voice Menu System," filed on 18.7.2003, which are hereby incorporated by reference in their entirety.
Technical Field
The present invention relates generally to audio user interfaces, and more particularly to techniques for providing an audio user interface to a computing device.
Background
Electronic devices such as portable media players, cellular phones, and Personal Digital Assistants (PDAs) are popular in the marketplace today, as are peripheral electronic devices that support their use, such as docking stations. As competition in the personal electronics market has intensified, consumers have become increasingly demanding about the functionality and usability of these devices.
Users listen to, view, or otherwise receive and consume content in various environments. For example, music is often listened to while driving, riding public transportation, exercising, hiking, doing housework, and the like. Moreover, in addition to playing back content stored on the media player, users now use the media player more often to receive radio, television, satellite broadcasts, global positioning, and other broadcast-based location services for navigation and recreation.
Traditionally, a media player or portable media player may have the ability to play media, such as audio (e.g., songs) or video (e.g., movies), for its user. While playing audio, if the media player includes a display, the display may present the song title, artist, and other information related to the song. In the case of playing a video, the display may be used to present the video.
To achieve portability, many handheld devices use user interfaces that present the user with various display screens for interaction, and these interfaces are predominantly visual. Users interact with these user interfaces by manipulating a scroll wheel and/or a set of buttons to navigate a display screen and access the functionality of these handheld devices. However, these user interfaces are sometimes difficult to use for various reasons. One reason is that the display screens often have a small size and form factor and are therefore difficult to see. Another reason is that the user may have poor reading vision or may otherwise be visually impaired. Even if these display screens are perceptible, it can be difficult for a user to navigate the user interface in situations where the user is unable to divert visual focus from important activities to the user interface. These activities include, for example, driving a car, exercising, and crossing streets.
Accordingly, there is a need for improved methods and apparatus to address some of the above-mentioned problems. Additionally, there is a need for improved methods and apparatus that reduce some of the above-mentioned disadvantages.
Disclosure of Invention
In various embodiments, the experience of user interaction with an electronic device (e.g., a media player or portable media device) can be enhanced by including an audio user interface that provides an intelligent way to determine whether a suitable audio dialog for the audio user interface is available. For example, depending on whether the electronic device has a broadband connection to a communication network (e.g., the internet), a determination may be made to request that an audio file of a first type or category (e.g., high quality voice recording) be streamed from a voice server to the electronic device for output by the audio user interface. In another example, a determination may be made to use only audio files of a second type or category (e.g., low quality voice recordings) that are available on a media storage device accessible to the electronic device. In yet another example, in the event of a lack of availability of pre-recorded voice audio data, a determination may be made to create a third category of audio data for audio prompts of the audio user interface using one or more speech synthesis techniques or text-to-speech techniques.
In some embodiments, a user of an electronic device (e.g., a media player or portable media device) may determine the quality of audio prompts to be presented (e.g., played) for the audio user interface. The user may provide one or more user preferences indicating whether pre-recorded audio data should be used, whether audio prompts synthesized using one or more synthesis techniques should be used, or whether traditional beeps or other non-speech audio data should be used for the audio user interface. Thus, an electronic device (e.g., a media player or portable media device) with or without a display can be enhanced by an audio user interface to facilitate user interaction based on whether services are available or based on other selection criteria.
In one embodiment, an input may be received that represents a user's interaction with a user interface associated with an electronic device (e.g., a media player or portable media device). A user may interact with the media player by pressing a button (e.g., a play/pause button) or selecting/highlighting a menu entry of a graphical user interface. The electronic device may identify an audio prompt that audibly represents the user's interaction with the user interface. The electronic device may determine whether one of a plurality of categories of audio data corresponding to the audio prompt is available to the media player. For example, the electronic device may determine whether a pre-recorded celebrity voice audio file is stored on the internal storage, whether a voice synthesis module or text-to-speech engine is capable of synthesizing numbers, or whether a voice server is capable of streaming voice data to the electronic device for the audio user interface.
A portion of the first category of audio data may then be output or otherwise presented at the electronic device. In some embodiments, playback of a media file may be paused or suspended in response to outputting the portion of audio data from the first source. Alternatively, the playback volume of the media file may be reduced or muted in response to outputting the portion of audio data from the first source.
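The pause-or-duck behavior just described can be sketched as follows. This is an illustrative model only, not an implementation from the application; the class name `PlaybackController`, its methods, and the default duck level are all hypothetical.

```python
class PlaybackController:
    """Illustrative model of media playback that yields to audio prompts.

    When a prompt starts, playback is either paused entirely or its
    volume is temporarily reduced ("ducked"); when the prompt ends,
    the prior state is restored.
    """

    def __init__(self, volume=1.0):
        self.volume = volume
        self.playing = True
        self._saved_volume = None

    def on_prompt_start(self, mode="duck", duck_level=0.2):
        if mode == "pause":
            self.playing = False  # suspend media playback entirely
        else:
            self._saved_volume = self.volume
            self.volume = min(self.volume, duck_level)  # lower media volume

    def on_prompt_end(self):
        if self._saved_volume is not None:
            self.volume = self._saved_volume  # restore prior volume
            self._saved_volume = None
        self.playing = True  # resume if paused
```

Ducking rather than pausing keeps the listening experience continuous while still making the spoken prompt intelligible over the media.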
A further understanding of the nature, advantages and improvements provided by the inventions disclosed in this application can be realized by reference to the remaining portions of this document and the attached drawings.
Drawings
To better illustrate and explain embodiments and/or examples of any invention presented in this document, reference will be made to one or more figures. The additional details or examples used to describe the figures should not be construed as limiting the scope of any of the disclosed inventions, any of the presently described embodiments and/or examples, or any presently considered best mode of any invention presented in this document.
FIG. 1 is a block diagram of a media player that may incorporate an embodiment of the present invention;
FIG. 2 is a block diagram of a media player that may provide pre-recorded or synthesized audio prompts in one embodiment in accordance with the invention;
FIG. 3 is a block diagram of an audio user interface management system that may provide pre-recorded or synthesized audio prompts in accordance with one embodiment of the present invention;
FIG. 4 is a block diagram of a streaming audio prompt system in an embodiment in accordance with the invention;
FIG. 5 illustrates a schematic diagram of a media player and its associated user input controls in one embodiment in accordance with the invention;
FIG. 6 illustrates a schematic diagram of a media player and its associated user input controls in an alternative embodiment in accordance with the invention;
FIG. 7 is a simplified flowchart of a method for providing an audio user interface to a user of an electronic device in one embodiment in accordance with the present invention;
FIGS. 8A and 8B are flow diagrams of a method for providing an audio user interface to an electronic device in one embodiment in accordance with the invention;
FIG. 9 is a flow diagram of a method of streaming audio prompts for an audio user interface in one embodiment in accordance with the invention;
FIG. 10 is a flow diagram of a method for creating audio prompts at a host computer system using one or more speech or text-to-speech synthesis techniques in one embodiment in accordance with the invention;
FIG. 11 is a flow diagram of a method for creating audio prompts using one or more speech or text-to-speech synthesis techniques, in accordance with an alternative embodiment of the present invention;
FIG. 12 is a block diagram of an electronic device that may incorporate embodiments of the present invention.
Detailed Description
Various embodiments may be applicable to electronic devices with audio playback capabilities, such as media devices (e.g., digital media players or portable MP3 players) or other portable multifunction devices (e.g., mobile phones or personal digital assistants). For example, portable devices can often store and play digital media material (media items), such as music (e.g., songs), videos (e.g., movies), audio books, podcasts, conference recordings, and/or other multimedia recordings. Portable devices, such as portable media players or other portable multifunction devices, may also be small and highly portable. In addition, the portable device may be a handheld device, such as a handheld media player or a handheld multifunction device, that can be easily held in one hand of a user. The portable device may also be pocket-sized, miniature, or wearable.
In various embodiments, the experience of user interaction with an electronic device (e.g., a media player or portable media device) can be enhanced by including an audio user interface that provides an intelligent way to determine whether a suitable audio dialog for the audio user interface is available. For example, depending on whether the electronic device has a broadband connection to a communication network (e.g., the internet), a determination may be made to request that an audio file of a high quality voice recording be streamed from a voice server to the electronic device for output by the audio user interface. In another example, a determination may be made to use only low quality voice recorded audio files that are available on a media storage device accessible to the electronic device. In yet another example, in the event of a lack of availability of pre-recorded speech audio data, a determination may be made to create an audio prompt for the audio user interface using one or more speech synthesis techniques or text-to-speech techniques.
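The source-selection logic sketched in the preceding paragraph can be illustrated as a simple preference cascade. This is a minimal sketch under stated assumptions: the function name, argument names, and returned category labels are all hypothetical, and the ordering (streamed high-quality, then local recordings, then synthesis) follows the examples given above.

```python
def select_prompt_source(has_broadband, prerecorded_available, tts_available):
    """Pick which category of audio data to use for an audio prompt.

    Preference order sketched from the description: stream high-quality
    recordings when a broadband connection exists, fall back to locally
    stored recordings, then to synthesized speech, then to a non-speech
    indicator such as a beep.
    """
    if has_broadband:
        return "streamed-high-quality"
    if prerecorded_available:
        return "local-prerecorded"
    if tts_available:
        return "synthesized"
    return "non-speech-beep"
```

A real device would evaluate these conditions dynamically, so the selected category can change as connectivity comes and goes.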
In some embodiments, a user of an electronic device (e.g., a media player or portable media device) may determine the quality of audio prompts to be presented (e.g., played) for the audio user interface. The user may provide one or more user preferences indicating whether pre-recorded audio data should be used, whether audio prompts synthesized using one or more synthesis techniques should be used, or whether traditional beeps or other non-speech audio data should be used for the audio user interface. Thus, an electronic device (e.g., a media player or portable media device) with or without a display can be enhanced by an audio user interface to facilitate user interaction based on whether services are available or based on other selection criteria.
Aspects of the environments in which various examples and/or embodiments of the invention in the present application operate will first be described.
Fig. 1 is a block diagram of a media player 100 that may incorporate an embodiment of the present invention. In general, a media player stores content and/or media assets, such as audio tracks, movies, or photos that can be played or displayed on the media player. One example of the media player 100 is the iPod® media player, which is commercially available from Apple, Inc. Another example of a media player 100 may be a personal computer, such as a laptop or desktop computer.
In this example, media player 100 includes a processor 110, a storage device 120, a user interface 130, and a communication interface 140. The processor 110 may control various functions associated with the media player 100. The media player 100 may output audio content, video content, image content, and the like. The media player 100 may also output metadata or other information associated with the content, such as track information and album artists.
Typically, a user can load or store content onto the media player 100 using the storage device 120. The storage device 120 may include Read Only Memory (ROM), Random Access Memory (RAM), non-volatile memory, flash memory, floppy disks, hard disks, and the like. A user may interact with the user interface 130 of the media player 100 to view or consume content. Some examples of user interface 130 may include buttons, click wheels, touch pads, displays, touch screens, and other input/output devices.
The media player 100 can include one or more connectors or ports that can be used to load content, retrieve content, interact with applications running on the media player 100, interface with external devices, and the like. In this example, media player 100 includes a communication interface 140. Some examples of communication interface 140 may include a Universal Serial Bus (USB) interface, an IEEE 1394 (i.e., FireWire/i.Link) interface, a universal asynchronous receiver/transmitter (UART), wired and wireless network interfaces, transceivers, etc. Communication interface 140 can be used to connect media player 100 to devices, accessories, private and public communication networks (e.g., the internet), and the like.
In one example, media player 100 may be coupled via a wired and/or wireless connector or port to output audio and/or other information to speaker 150. In another example, the media player 100 can be coupled via a wired and/or wireless connector or port to output audio and/or other information to the headphones 160. In yet another example, media player 100 can be coupled via a wired and/or wireless connector or port to interface with accessory 170 or host computer 180. Different connections may be allowed at different times by the same connector or port.
Media player 100 may be physically plugged into docking system 190. The media player 100 may be coupled via a wired and/or wireless connector or port to interface with the docking system 190. Docking system 190 may also enable one or more accessory devices 195 to be coupled, by wire or wirelessly, to interface with media player 100. Accessory devices 170 and 195 of many different types and functions may be interconnected with media player 100. For example, an accessory may allow a remote control to wirelessly control the media player 100. As another example, an automobile can include a connector into which the media player 100 can be inserted, such that an automobile media system can interact with the media player 100, thereby allowing media content stored on the media player 100 to be played in the automobile.
In various embodiments, the media player 100 may receive content or other media assets from a computer system (e.g., host computer 180). The computer system may be used to enable a user to manage media assets stored on the computer system and/or stored on the media player 100. For example, communication interface 140 may allow media player 100 to interface with host computer 180. The host computer 180 may execute a media management application to manage media assets, such as loading songs, movies, photos, etc. onto the media player 100. The media management application may also create playlists, record or capture content, schedule content for playback or recording, and so forth. An example of a media management application is iTunes®, produced by Apple, Inc. of Cupertino, California.
In various embodiments, media player 100 may include an audio user interface. Upon user interaction with the media player 100 (e.g., upon the user pressing a button, touching a touch screen, or selecting an entry on a graphical user interface), embodiments of the audio user interface may present or otherwise output an audio prompt selected from an audio dialog for playback. The audio prompts may include audio indicators that allow the user to focus their visual attention on other tasks (e.g., driving a car, exercising, or crossing a street), while still enabling the user to interact with the user interface 130. As examples, an audio prompt may audibly announce the press of a hardware button, the activation of a virtual button or control, or a spoken version of a user interface element (e.g., a selected (e.g., highlighted) menu entry or menu function). The audio prompts may include pre-recorded voice data, or may be generated by speech synthesis techniques.
In one aspect, embodiments of media player 100 may include techniques for providing an audio user interface to an electronic device that effectively improve the availability of audio prompt sources for the audio user interface. For example, media player 100 may selectively output audio prompts from different audio dialogs based on whether a source of the audio dialog is available, whether a higher quality source is available, and so forth. In one example, prior to connecting to the internet, a user of media player 100 may hear low quality voice audio prompts or audio prompts synthesized by media player 100, while a higher quality, pre-recorded voice audio prompt may be downloaded or streamed to the audio user interface when connecting to the internet. Thus, in various embodiments, media player 100 may determine whether a source of audio prompts for an audio user interface is available and automatically switch from one source to another to selectively provide a best available audio feedback to the user.
Fig. 2 is a block diagram of a media player 200 that may provide pre-recorded or synthesized audio prompts in one embodiment in accordance with the invention. In this example, media player 200 may be implemented in the form of media player 100 and may include a portable computing device dedicated to processing content or other media assets (e.g., audio, video, or images). For example, the media player 200 may be a music player (e.g., an MP3 player), a game player, a video recorder, a camera, an image viewer, a mobile phone (e.g., a cellular phone), a personal handheld device, and so forth. These devices typically operate using batteries and are highly portable to enable a user to listen to music, play games or video, record video or take pictures wherever the user travels.
In one implementation, the media player 200 may comprise a handheld device sized to be placed in a pocket or hand of a user. By being handheld, the media player 200 may be small and easily manipulated and used by its user. With the pocket size, the user need not hold the media player 200 directly, so the device can be taken almost anywhere the user travels (e.g., the user is not limited to carrying a bulky, awkward, and often heavy device, as is the case with a portable computer). Further, the media player 200 may be operated by the user's hands, thereby eliminating the need for a reference surface (e.g., a desktop). In an alternative embodiment, the media player 200 may be a computing device that is not specifically limited to playing media files. For example, the media player 200 may also be a mobile phone or a personal digital assistant.
In such an example, the media player 200 may include a user interface control module 210, an audio prompt database 220, and a text-to-speech engine 230. The user interface control module 210 may include hardware and/or software elements for managing a user interface that allows a user to interact with the media player 200 (e.g., navigate, initiate content playback, etc.). The user interface may, for example, allow a user of the media player 200 to browse, sort, search, play, etc., content or other media assets resident or otherwise accessible on the media player 200. The user interface may also allow a user of the media player 200 to download (add) or delete (remove) media items from the media player 200.
Interaction with the user interface of the media player 200 can cause audio prompts for the audio user interface to be played back (e.g., through headphones or speakers). Audio prompt database 220 may include hardware and/or software elements for storing audio files and audio data for audio prompts. In some embodiments, the audio files may include audio prompts that are pre-recorded and stored on the media player 200. In other embodiments, the audio files may include audio files that are streamed from one or more computers and cached in the audio prompt database 220 for later use. In various embodiments, the audio files may include audio prompts that are generated by the media player 200 or by another device using one or more speech synthesis techniques. Audio prompt database 220 may include other content or media material.
The text-to-speech engine 230 may include hardware and/or software elements for converting data (e.g., text) into audio data or audio files that can be played as audio prompts that audibly render the data (e.g., verbalize a text string in human-like speech). Such text-to-speech (TTS) engines may use various techniques to create audio data or audio files. For example, some algorithms break words down into segments or syllables, each of which is assigned a certain sound; the word is then voiced by combining the individual sounds. Where the media content relates to music, these text strings may correspond to song titles, album names, artist names, contact names, addresses, phone numbers, and playlist names, for example.
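The segment-and-combine approach described above can be sketched in toy form. Real TTS engines perform far more elaborate linguistic analysis; here `sound_bank` is a hypothetical mapping from letter segments to recorded-sound labels, and the greedy longest-match strategy is an assumption for illustration.

```python
def synthesize_word(word, sound_bank):
    """Toy sketch of segment-based synthesis: split a word into known
    segments, look each up in a bank of sounds, and concatenate the
    results. Raises KeyError if no segment matches at some position."""
    segments = []
    i = 0
    while i < len(word):
        # Greedily match the longest known segment at this position.
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in sound_bank:
                segments.append(sound_bank[piece])
                i += length
                break
        else:
            raise KeyError(f"no sound for {word[i]!r}")
    return "+".join(segments)
```

With a bank mapping "pl" and "ay" to sound labels, the word "play" would be voiced as the two-segment combination of those sounds.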
In one example of operation, media player 200 may selectively provide audio prompts for an audio user interface based on whether the audio prompts are available from audio prompt database 220 or TTS engine 230. For example, media player 200 may selectively output audio prompts from audio prompt database 220 when pre-recorded audio prompts are available or otherwise stored in audio prompt database 220. Media player 200 may also select among audio prompts of various qualities, such as presenting higher quality or bitrate audio prompts instead of lower quality or bitrate audio prompts. In another example, media player 200 may present audio prompts or voice prompts synthesized by TTS engine 230 due to a lack of pre-recorded audio prompts stored in audio prompt database 220, or in response to a user's preference for a particular simulated voice profile. In various embodiments, media player 200 may dynamically output audio prompts from audio prompt database 220, or TTS engine 230, or both.
In other embodiments, an electronic device (e.g., a media player or portable media device) may include an audio user interface provided by an audio user interface management system. The audio user interface management system may include a media playback device and include one or more of a host computer or server computer system to provide an audio user interface on the media playback device. For example, the host computer system may comprise a personal computer and the media playback device may comprise an MP3 player. In some embodiments, the media playback device may allow for multi-modal interaction with the user interface. For example, a user may interact with the user interface through audio and visual cues.
FIG. 3 is a block diagram of an audio user interface management system 300 that can provide pre-recorded or synthesized audio prompts in one embodiment in accordance with the invention. In such an example, the management system 300 may include a media player 310 and a personal computer (host computer) 340. The media player 310 may be implemented in the form of the media player 100 described above and may be linked or coupled to a personal computer 340.
Media player 310 may be implemented in the form of media player 100 of fig. 1 and may comprise, for example, a portable battery-operated device. In one embodiment, the media player 310 comprises an MP3 player. In general, media player 310 may store content or other media assets to one of a plurality of data storage devices (e.g., disk drives). Media player 310 may store content or other media assets in media files.
Media player 310 may include a user interface control module 320 and an audio prompt database 330. User interface control module 320 may include hardware and/or software elements for managing a user interface that allows a user to interact with media player 310 (e.g., navigate, initiate content playback, etc.). Interaction with the user interface of media player 310 may cause audio prompts for the audio user interface to be played back (e.g., through headphones or speakers). Audio prompt database 330 may include hardware and/or software elements for storing audio files and audio data for audio prompts.
Personal computer 340 may include a media manager 350, an audio prompt database 360, and a text-to-speech (TTS) engine 370. Personal computer 340 may serve as the host computer system for media player 310. Personal computer 340 may also be any type of computer that acts as a server with respect to media player 310 (as a client).
Media manager 350 may include hardware and/or software elements that enable a user of personal computer 340 to directly manage content or other media assets stored on personal computer 340. Media manager 350 may also be configured to manage, directly or indirectly, content or other media assets stored on media player 310. In one example, media player 310 and personal computer 340 may be coupled by a peripheral cable. Typically, a peripheral cable may couple together data ports provided on media player 310 and personal computer 340. In some embodiments, these data ports may be FireWire ports and the peripheral cable may be a FireWire cable. In another example, the data ports may be Universal Serial Bus (USB) ports and the peripheral cable may be a USB cable. More generally, a peripheral cable may be used as the data link. Media items may be transferred between media player 310 and personal computer 340, and vice versa, via the peripheral cable.
In various embodiments, media manager 350 may also include a user interface that allows a user to browse, sort, search, play, and control content or other media assets residing on personal computer 340, create playlists, burn Compact Discs (CDs), and so forth. The user interface may also allow a user of personal computer 340 to download (add) or delete (remove) media items from personal computer 340. In one embodiment, the media manager 350 and its associated user interface are provided by iTunes™, produced by Apple, Inc. of Cupertino, California.
Audio prompt database 360 of personal computer 340 may include hardware and/or software elements for storing audio files or audio data for audio prompts of an audio user interface associated with media player 310 or personal computer 340. Audio prompt database 360 may include audio prompts for audio dialogs that are downloaded from the internet, ripped from a CD, recorded by a user, or generated by TTS engine 370. TTS engine 370 may include hardware and/or software elements for converting information or data into audio files or voice data that can be played as audio prompts that audibly present the information.
In one example, a synchronization operation may occur between personal computer 340 and media player 310 to upload audio prompts to audio prompt database 330 of media player 310, or to update audio prompts stored in audio prompt database 330 with audio prompts stored in audio prompt database 360 or generated by TTS engine 370. For example, when a comparison between the contents of the two databases indicates that a particular audio prompt resident on personal computer 340 is not resident on media player 310, that audio prompt may be transmitted (downloaded) to media player 310, e.g., over a wireless link or a peripheral cable. Thus, the synchronization operation between personal computer 340 and media player 310 can ensure that media player 310 contains audio data or audio files suitable for presenting an available audio user interface.
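The database comparison at the heart of the synchronization operation can be sketched as a one-way diff. This is a minimal illustration, assuming each prompt database is modeled as a mapping from a prompt identifier to its audio data; the function name `plan_sync` is hypothetical.

```python
def plan_sync(host_prompts, player_prompts):
    """Sketch of the synchronization comparison described above: any
    prompt resident on the host but absent from the player is scheduled
    for download to the player."""
    return [pid for pid in host_prompts if pid not in player_prompts]
```

A fuller sync would also compare versions or checksums so that updated (not merely missing) prompts are refreshed on the player.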
The data of the audio files to be downloaded onto media player 310 may depend on user settings for the audio user interface. For example, the user may wish to download audio files or other audio data stored in audio prompt database 360 to associate with options or features of all or part of the audio user interface on media player 310.
Fig. 4 is a block diagram of a streaming audio prompt system 400 in an embodiment in accordance with the invention. In this example, the media player 410 is linked to a communication network 420. The media player 410 may be implemented in the form of the media player 200 of fig. 2 or the media player 310 of fig. 3. A voice server 430 is also linked to the communication network 420 and is capable of communicating with the media player 410.
In various embodiments, media player 410 may determine the presence of a connection to voice server 430 over communication network 420. In one example of operation, the media player 410 can choose to receive audio prompts from the voice server 430 for presentation by the audio user interface of the media player 410. Media player 410 may generate one or more requests for audio prompts, and voice server 430, upon receiving the requests, may stream corresponding audio prompts to media player 410 for output to the user.
Voice server 430 may include an audio prompt database 440 and a TTS engine 450. Audio prompt database 440 of voice server 430 may include hardware and/or software elements for storing audio data or audio files for audio prompts of an audio user interface associated with media player 410. Audio prompt database 440 may include audio prompts for audio dialogs that were pre-recorded by one or more content producers, provided by content publishers, or generated by TTS engine 450. TTS engine 450 may include hardware and/or software elements for converting information or data into audio files or voice data that can be played as audio prompts that audibly present the information.
Thus, the media player 410 can select among sources of audio prompts for the audio user interface to provide audio voice feedback to the user. The media player 410 may receive audio prompts (e.g., pre-recorded or synthesized) from the voice server 430 until a connection is lost. At that point, the media player 410 may automatically select audio prompts from other sources (e.g., an internal audio prompt database or a speech synthesis module).
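The fall-back-on-connection-loss behavior just described can be sketched as a try-and-degrade chain. This is an illustrative sketch only: `server`, `local_db`, and `tts` are hypothetical stand-ins (a callable, a mapping, and a callable), and connection loss is modeled as a raised `ConnectionError`.

```python
def fetch_prompt(prompt_id, server, local_db, tts):
    """Sketch of the fallback described above: try the voice server
    first; if the connection is lost (modeled as an exception), fall
    back to the local prompt database, then to on-device synthesis."""
    try:
        return server(prompt_id)        # streamed, typically highest quality
    except ConnectionError:
        if prompt_id in local_db:
            return local_db[prompt_id]  # cached or pre-loaded recording
        return tts(prompt_id)           # synthesize as a last resort
```

Because the server is retried on every request, streamed prompts resume automatically once connectivity returns.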
FIG. 5 illustrates a schematic diagram of a media player 500 and its associated user input controls in one embodiment in accordance with the invention. The media player 500 may include any computing device for playing media files (e.g., song files). The media player 500 may include a memory that stores a media database and a playback module for rendering or playing back content or other media assets stored in the media database. A set of nested menus 505 can present at least a portion of a user interface that allows a user to navigate, select, and thereby listen to a desired song file. A media file can be reached in different ways using the set of nested menus 505. The user interface may also allow a user to navigate and select desired functions provided by the media player 500.
Fig. 5 also illustrates user interface controls 510 of media player 500. According to one embodiment, user interface controls 510 include a "menu" button 515, a "next" button 520, a "play/pause" button 525, and a "previous" button 530. The user interface controls 510 may include a scroll wheel implemented as a rotatable wheel device or as a touchpad device that interprets rotational user gestures. The user may press, rub, or otherwise interact with the user interface controls 510 to navigate the nested menus 505.
Fig. 6 illustrates a schematic diagram of a media player 600 and its associated user input controls in an alternative embodiment in accordance with the invention. The media player 600 may include a "previous" button 610, a "play/pause" button 620, and a "next" button 630. LEDs 640 and 650 may be used to communicate information to a user, such as indicating a power status or a media playback status. In such an example, the media player 600 may not include a display configured to form a graphical user interface (e.g., the nested menu 505 of fig. 5). Thus, a user interface that audibly communicates information related to the operation of the media player 600 can greatly enhance the user experience.
FIG. 7 is a simplified flowchart of a method for providing an audio user interface to a user of an electronic device in one embodiment in accordance with the present invention. The processing of method 700 shown in fig. 7 may be performed by software (e.g., instructions or code modules) when executed by a central processing unit (CPU or processor) of a logic machine (e.g., a computer system or an information processing device), by hardware components or application specific integrated circuits of an electronic device, or by a combination of software and hardware elements. Fig. 7 begins at step 710.
At step 720, information is received, the information representing user interaction with the user interface. The information may include signals, messages, interrupts, inputs, and the like. The information may indicate that the user pressed or depressed a button, clicked a click wheel, touched a touch screen, stroked a gesture, highlighted or selected an element on a graphical user interface, and so forth. The information may represent a single action by the user or a combination of actions.
At step 730, an audio prompt corresponding to the user's interaction is identified. The audio prompt may include information identifying audio data that vocalizes or otherwise provides auditory feedback to the user regarding the registered interaction. At step 740, the type or category of audio data is determined for the audio prompt. In various embodiments, the audio prompts may be represented by different types or categories of audio data. The types or categories of audio data may include, for example: audio data of different auditory qualities, speech and non-speech, bit rate, compression, encoding, source, delivery mechanism, and so forth. For example, synthesized audio data generated by the speech synthesis module may be used to provide audio prompts for numbers, dates, and the like. In another example, compressed pre-recorded audio data may be used to provide audio prompts for button interactions (e.g., play, pause, next, previous, fast forward, rewind, etc.). In yet another example, pre-recorded audio data of CD quality may be used to provide the full set of audio prompts for numbers, dates, button presses, menu selections, and any other user interactions that may be involved in a given audio user interface.
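Steps 730 and 740 amount to a mapping from a registered interaction to a prompt identifier and a data category. A minimal sketch, in which the table contents and category names are assumptions for illustration:

```python
# Hypothetical mapping from a registered interaction to
# (audio prompt identifier, type/category of audio data).
PROMPT_TABLE = {
    "press:play":  ("play", "prerecorded-compressed"),
    "press:next":  ("next", "prerecorded-compressed"),
    "select:date": ("date", "synthesized"),
}

def identify_prompt(interaction):
    """Steps 730/740: identify the prompt and its audio data category,
    defaulting to a generic tone for unrecognized interactions."""
    return PROMPT_TABLE.get(interaction, ("generic-beep", "tone"))
```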
At step 750, a determination is made as to whether the type or category of audio data determined for the audio prompt is available. For example, a selection may be made to use a pre-recorded audio dialog (e.g., a set of pre-recorded audio files) for the audio prompts of an audio user interface. The electronic device may check its internal storage to determine whether an audio file exists for the audio prompt. Alternatively, the electronic device may request an audio file for the audio prompt from a host computer or streaming voice server. In another example, if pre-recorded audio prompts are not stored locally at the electronic device, a selection may be made to use pre-recorded audio data for some audio prompts and synthesized audio data for others.
At step 760, a portion of the determined type or category of audio data is output from an available source. Thus, various embodiments may provide dynamic selection among different types or categories of audio data for the audio prompts of the audio user interface. Additionally, as part of the audio user interface, some embodiments may also provide a mechanism for delivering the selected or identified types or categories of audio data to the electronic device for use. Fig. 7 ends at step 770.
Fig. 8A and 8B are a flow diagram of a method 800 for providing an audio user interface to an electronic device in one embodiment in accordance with the invention. The method 800 generally comprises an intelligent decision approach that determines whether a suitable audio dialog for an audio user interface is available and obtains the best available audio dialog for output to the user. Fig. 8A begins at step 805.
At step 810, an input is received indicating a button press. For example, the user may interface with user interface control 510 of media player 500 of fig. 5. The media player 500 may generate one or more analog or digital signals representing button presses, touches, pressures, gestures, motions, etc.
At step 815, a determination is made whether an audio prompt is to be presented for the button press. In some embodiments, a control selection is accompanied by an audio prompt output to the user to confirm the selection. For example, the user may hear "play" as confirmation that the play/pause button 525 was actually depressed. These embodiments may involve repeated user actions to select a user interface control. For example, the user may have to "click" multiple times on a user interface control to make a selection. The first "click" may cause media player 500 to audibilize the selected user interface control. For example, when the user presses the play button, "play" may be audibilized. This first audio prompt provides audio guidance as to which button was depressed, which is helpful when the user is not directing visual attention to the handheld device.
A subsequent "click" may then cause the media player 500 to perform the action corresponding to the user interface control. For example, pressing the play button a second time may cause the media file to be played. Alternatively, the first audio prompt may have informed the user that a selection other than the intended one would be made, in which case the user may attempt to select a different user interface control. For example, the user may thereafter press the "next" button 520 rather than pressing the play button 525 a second time.
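The announce-then-act pattern above can be sketched as a small state machine: the first press only voices the control, and a repeat press performs its action. The class and return values are illustrative assumptions.

```python
class ClickConfirmButton:
    """First activation announces the control; a repeat activation performs
    its action (a sketch of the described two-click confirmation pattern)."""

    def __init__(self, name, action):
        self.name = name
        self.action = action
        self.armed = False   # True after the announcing press

    def press(self):
        if not self.armed:
            self.armed = True
            return f"say:{self.name}"   # output only the audio prompt
        self.armed = False
        return self.action()            # second press: perform the action
```

If the prompt reveals the wrong control, the user simply presses a different button, which starts a fresh announce/act cycle on that control.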
If it is determined at step 815 that an audio prompt is to be presented for the button press, processing proceeds along an intelligent decision path that determines whether a suitable dialog is available for the audio prompt and how to place the suitable audio dialog on the electronic device. The intelligent decision approach may include, for example, finding or identifying the type or category of audio data and determining whether that audio data is available.
At step 820, a determination is made as to whether a high quality source is available. Relative to a low quality source, a high quality source may comprise a digital audio file or audio data that is sampled above a predetermined or recognized frequency, encoded at a given bit rate, has a size exceeding a predetermined threshold or limit, and so on. This determination may be made based on whether there is a wireless or wired connection to a communication network through which a high quality source is accessible. In one implementation, the determination may be made based on selection criteria or user preferences. For example, in one mode of operation, the user may wish to hear audio prompts for each action and menu entry the user selects. In another mode, the user may disable audio prompts for control selections (e.g., the "play" button) and merely listen to audio prompts for highlighted menu items. In yet another mode, audio prompts may be output only for top-level menu items.
If a high quality source is determined to be available, then at step 825, an audio prompt corresponding to the button press is retrieved from the high quality source. One example of a high quality source may include lossless or CD quality pre-recorded audio data or audio files. The pre-recorded audio data or audio files may include professionally produced recordings of celebrity voices, cartoon characters, or excerpts from television programs or feature films.
Alternatively, if it is determined that a high quality source is not available, then at step 830, it is determined whether a low quality source is available. If a low quality source is determined to be available, at step 835, an audio prompt corresponding to the button press is retrieved from the low quality source. One example of a low quality source may include pre-recorded audio data or audio files that are compressed using one or more compression or encoding techniques (e.g., MP3, WMA, OGG, etc.). These pre-recorded audio data or audio files may include an ordinary recording of a human voice, or stored audio files or audio data generated using one or more speech or text synthesis techniques.
Referring now to FIG. 8B, if it is determined that a low quality source is not available, then at step 840, a determination is made as to whether text-to-speech (TTS) or speech synthesis is available. If it is determined that one or more synthesis sources are available, then at step 845, the audio prompt is synthesized or generated using speech synthesis or TTS synthesis.
If no source of audio prompts for the audio user interface can be determined or selected, one or more beeps or other generic sounds may be output corresponding to the button press at step 850. Otherwise, at step 855, an audio prompt corresponding to the button press is output, the audio prompt having been obtained from a high quality source at step 825, obtained from a low quality source at step 835, or synthesized at step 845. In some embodiments, the audio prompt may be played according to a selected audio interface mode. When the media player or portable media device is not playing media files, only the audio files corresponding to the user interface are played and heard by the user.
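The decision path of steps 820 through 850 is a quality cascade. A minimal sketch, under the assumption that each source is represented as a lookup table and TTS availability as a flag:

```python
def choose_audio(prompt_id, high_q, low_q, tts_available):
    """Cascade over prompt sources as in steps 820-850 (a sketch):
    high-quality file -> low-quality file -> TTS synthesis -> generic beep."""
    if prompt_id in high_q:                 # step 820/825
        return ("high", high_q[prompt_id])
    if prompt_id in low_q:                  # step 830/835
        return ("low", low_q[prompt_id])
    if tts_available:                       # step 840/845
        return ("tts", f"synth:{prompt_id}")
    return ("beep", "beep.wav")             # step 850 fallback
```

Each branch corresponds to one decision box in Figs. 8A and 8B; only the final fallback is unconditional, so a button press always produces some audible feedback.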
In various embodiments, when a media file is being played back, the audio interface mode may be set to mix the media file with audio prompt playback in different ways. According to one setting, the volume of media file playback may be dynamically reduced when an audio prompt is to be played. For example, during playback of the audio prompt, the playback volume of a song or movie clip may be lowered. According to another setting, playback of the media file is paused during playback of the audio prompt and then resumed after the audio prompt has played. If the user makes multiple user control selections within a certain time limit, playback of the media file may remain paused for a short time so that playback does not have to be paused and resumed repeatedly, avoiding repeated interruption of the song. For example, if the user makes at least three user control selections within five seconds, playback of the media file may remain paused for five seconds. The length of time and the number of user control selections may vary according to the user's preferences. Some audio interface modes may specify that audio prompts are played through the left, right, or both speaker or headphone channels.
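The two mixing settings above (ducking and hold-pause) can be sketched as follows. The five-second window and three-selection threshold come from the example in the text; the duck factor and class interface are illustrative assumptions.

```python
import time

class PromptMixer:
    """Sketch of the described mixing modes: duck the media volume while a
    prompt plays, or keep playback paused while control selections keep
    arriving within a time window."""

    def __init__(self, window=5.0, min_selections=3):
        self.window = window                  # seconds (from the text's example)
        self.min_selections = min_selections  # selections within the window
        self.selections = []                  # timestamps of recent selections

    def duck_volume(self, volume, factor=0.3):
        # reduced media volume while the audio prompt plays (factor assumed)
        return volume * factor

    def should_hold_pause(self, now=None):
        # True once enough control selections fall inside the window,
        # meaning playback should stay paused rather than resume each time
        now = time.monotonic() if now is None else now
        self.selections.append(now)
        self.selections = [t for t in self.selections if now - t <= self.window]
        return len(self.selections) >= self.min_selections
```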
Thus, the method determines whether a suitable audio dialog is available (e.g., on the electronic device or on a host or server computer connected to the device) and obtains the best available audio dialog for output to the user. Fig. 8B ends at step 860.
Fig. 9 is a flow diagram of a method 900 of streaming audio prompts for an audio user interface in an embodiment in accordance with the invention. The method 900 generally includes streaming audio prompts to a media playback device according to a connection to a voice server. Fig. 9 begins at step 910.
At step 920, a media playback device (e.g., media player 100) detects a broadband connection. For example, the media playback device may successfully associate with the wireless access point. In another example, the media playback device may recognize a wired connection to the internet.
At step 930, the media playback device determines to use the voice server to obtain a voice conversation for the audio user interface. For example, a software program executed by the media playback device may initiate and complete a handshake with one or more applications hosted by the voice server. In another example, the media playback device may periodically poll the voice server to determine the availability of a connection.
At step 940, the media playback device generates a request for an audio prompt. The request may include information identifying the audio prompt, information identifying a user interaction corresponding to the requested audio prompt, and so forth. The request may include one or more of: header, flag, field, check, hash, etc. In one embodiment, the request may include hypertext transfer protocol (HTTP) data or real-time transport protocol (RTP) data.
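Where the request takes the HTTP form mentioned above, it might be assembled as a URL carrying the prompt and interaction identifiers. The endpoint path and query parameter names here are assumptions for illustration, not part of the disclosure:

```python
from urllib.parse import urlencode

def build_prompt_request(base_url, prompt_id, interaction):
    """Form an HTTP request URL for an audio prompt (step 940 sketch).
    The /prompts path and parameter names are hypothetical."""
    query = urlencode({"prompt": prompt_id, "interaction": interaction})
    return f"{base_url}/prompts?{query}"
```

A real implementation would likely add the headers, flags, and integrity checks (hashes) the text mentions; those are omitted here for brevity.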
At step 950, the voice server streams the audio prompt to the media playback device. At step 960, the media playback device outputs the streamed audio prompt. The voice server may use one or more streaming protocols (e.g., real-time or faster than real-time) to cause the media playback device to buffer a portion of the audio prompt prior to playback.
In various embodiments, the voice server may be made accessible on a pay-per-use or subscription basis. The voice server may support streaming uncompressed and compressed (e.g., lossless or lossy) audio data. The voice server may also support the delivery of information associated with content or other media assets with which a user may interact (e.g., navigate), such as title information, album information, artist information, genre information, metadata, and so forth. Fig. 9 ends at step 970.
FIG. 10 is a flow diagram of a method 1000 for creating audio prompts at a host computer system using one or more speech or text-to-speech synthesis techniques in one embodiment in accordance with the invention. Method 1000 generally includes synthesizing audio prompts for an audio user interface and transmitting the synthesized audio prompts to a media playback device. Fig. 10 begins at step 1010.
At step 1020, a media playback device (e.g., media player 100 of FIG. 1) detects a connection to a host computer. For example, the media playback device can detect whether the media playback device is coupled to the host computer with a peripheral device cable. In another example, the media playback device may detect proximity to the host computer and establish a wireless connection, for example using a WiFi or bluetooth module.
At step 1030, the media playback device determines to use the host computer to obtain a voice dialog for the audio user interface. For example, the media playback device may determine to use the host computer when the internal storage of the media playback device does not have sufficient space to store audio prompts in addition to content or other media assets. In another example, when the media playback device does not include a TTS engine, the media playback device may determine to use the host computer.
At step 1040, the host computer synthesizes an audio prompt. The host computer may generate audio prompts using one or more speech synthesis or text-to-speech synthesis techniques. For example, the host computer may determine a profile associated with the media playback device. The profile may include textual descriptions of events registered by button presses, menu selections, or other user interactions that are specific to the electronic device. The host computer may audibilize the textual descriptions in the profile by generating and recording a synthesized voice reading of each. The host computer may generate one audio prompt for each textual description, or may generate a single audio prompt containing audio data for every textual description, along with information identifying the audio data for a given textual description within that one audio prompt.
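The per-description synthesis pass above reduces to iterating over the profile and rendering each textual description. A sketch, assuming the profile is an event-to-text mapping and `tts` is whatever synthesis backend the host provides:

```python
def synthesize_profile(profile, tts):
    """Step 1040 sketch: render every textual description in a device
    profile into an audio prompt, returning an event -> audio mapping.
    The profile layout (event key, description value) is an assumption."""
    return {event: tts(text) for event, text in profile.items()}

# usage with a stand-in synthesizer
profile = {"press:play": "Play", "press:next": "Next song"}
prompts = synthesize_profile(profile, lambda text: f"audio<{text}>")
```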
At step 1050, the host computer transmits the audio prompt to the media playback device. In one implementation, the host computer generates a plurality of audio prompts constituting an audio dialog for the audio user interface. The host computer then transmits the entire audio dialog to the media playback device, such as when managing the content or other media assets on the device. In another example, the host computer may generate and transmit audio prompts to the media playback device in substantially real time. At step 1060, the media playback device outputs the audio prompt. Fig. 10 ends at step 1060.
FIG. 11 is a flow diagram of a method 1100 for creating audio prompts using one or more speech or text-to-speech synthesis techniques, according to an alternative embodiment of the present invention. The method 1100 generally includes creating or synthesizing audio data representing a textual description of an event. Fig. 11 begins at step 1110.
At step 1120, an event is identified. The event may include any user interaction that may be performed with the electronic device. An event may be represented by a user's button press, click, scroll, touch, selection, highlight, and so forth. At step 1130, a textual description of the identified event is determined. The textual description may include words, sentences, etc. that describe the event, the device, the user, a portion of the content, etc. The textual description may be generated by a user, a developer, or another third party.
At step 1140, speech audio is synthesized or otherwise generated from the textual description of the event. In one example, a computer system may retrieve configuration settings for the text-to-speech conversion process. The configuration settings may control various aspects of the speech synthesis or text-to-speech conversion process. For example, the configuration settings may determine which text strings are to be converted to audio files, the quality of the TTS conversion, the gender of the voice that verbalizes the text strings, the speed at which audio prompts are audibilized (e.g., speech speed may be increased as a user becomes more familiar with the audio prompts), and customized voices for different subtasks (e.g., controls and functions may be audibilized with one voice while data, such as song titles and contact names, is audibilized with another voice). In addition, the configuration settings may accommodate skilled manipulation of the user interface controls by playing only a portion of the audio prompts as the user navigates. For example, when browsing contact names in a directory, only the letters (a, b, c, ...) may be spoken until the user reaches contact names that begin with the desired letter (e.g., "j" in the case of Jones). It should therefore be understood that the TTS configuration settings may take various values corresponding to device, configuration, or user preferences.
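The configuration settings enumerated above might be grouped as follows. Every field name and default here is an assumption chosen to mirror the text, not a disclosed data structure:

```python
from dataclasses import dataclass

@dataclass
class TTSConfig:
    """Illustrative grouping of the described TTS conversion settings."""
    voice_gender: str = "female"     # gender of the verbalizing voice
    control_voice: str = "voice-a"   # voice for controls and functions
    data_voice: str = "voice-b"      # voice for data (songs, contact names)
    speech_rate: float = 1.0         # may rise as the user gains familiarity
    partial_prompts: bool = True     # speak only leading letters while scrolling

def rate_for_familiarity(base, uses, step=0.05, cap=1.5):
    # speed up gradually with repeated use, up to a cap (values assumed)
    return min(cap, base + step * uses)
```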
Various speech synthesizer rules and engines may be used to generate the audio file. A general example of a process for converting a word into an audio file may work as follows. The process for converting the word "browse" begins by decomposing the word into segments representing diphone units, such as "b", "r", "ow", "s". Various techniques then generate audio for each segment, and the segments can be combined to form intelligible words or phrases. The audio file is typically assigned an extension corresponding to the type of audio file created. For example, an audio file for "browse" may be identified by a corresponding file name and extension.
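The decomposition step can be illustrated with a toy greedy splitter over an inventory of recorded segments. This is a deliberate simplification: real synthesizers derive diphones from phonetic rules, not spelling, and the inventory below is an assumption.

```python
def segment_word(word, inventory):
    """Greedily split a word into the longest recorded segments available
    (a toy stand-in for the diphone decomposition described above)."""
    i, parts = 0, []
    while i < len(word):
        # try the longest candidate first (up to 3 letters here)
        for length in range(min(3, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if piece in inventory:
                parts.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no recorded segment for {word[i]!r}")
    return parts

# the "browse" example: segments are then concatenated into one audio file
segments = segment_word("browse", {"b", "r", "ow", "s", "e"})
```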
At step 1150, a voice audio prompt is output. The voice audio prompt may be output in response to a user interaction with a media playback device having an audio user interface. In one embodiment, the audio user interface may include an indication that points to the corresponding audio prompt or audio file. For example, a look-up table may be used to keep track of the indications mapping interactions to their audio prompts. Fig. 11 ends at step 1160.
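The look-up table indication mechanism might be kept as a small registry like the following; the class and method names are assumptions for illustration:

```python
class PromptTable:
    """Look-up table from user interactions to audio prompt files, as the
    indication mechanism described above might be kept (a sketch)."""

    def __init__(self):
        self._table = {}

    def register(self, interaction, audio_file):
        # record the indication pointing at the prompt's audio file
        self._table[interaction] = audio_file

    def lookup(self, interaction):
        # None signals that no prompt is registered for this interaction
        return self._table.get(interaction)
```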
FIG. 12 is a simplified block diagram of a computer system 1200 that may incorporate an embodiment of the present invention. Fig. 12 is merely an illustration of an embodiment that incorporates the present invention and should not limit the scope of the invention as recited in the claims. Various alterations, modifications, and substitutions will occur to those skilled in the art.
In one embodiment, computer system 1200 includes processor(s) 1210, Random Access Memory (RAM) 1220, disk drive 1230, input device(s) 1240, output device(s) 1250, display 1260, communication interface(s) 1270, and a system bus 1280 interconnecting the above components. Other components (e.g., file system, storage disk, Read Only Memory (ROM), cache memory, codecs, etc.) are also possible.
RAM 1220 and disk drive 1230 are examples of tangible media configured to store data (e.g., audio, image, and movie files), operating system code, embodiments of the present invention, including executable computer code, human-readable code, and the like. Other types of tangible media include floppy disks, removable hard disks, optical storage media (e.g., CD-ROMs, DVDs, and bar codes), semiconductor memories (e.g., flash memories), read-only memories (ROMs), battery-backed volatile memories, networked storage devices, and the like.
In various embodiments, input device 1240 may be implemented as a computer mouse, trackball, trackpad, joystick, wireless remote control, drawing pad, voice command system, eye tracking system, multi-touch interface, scroll wheel, click wheel, touch screen, FM/TV tuner, audio/video input device, and the like. Input device 1240 may allow a user to select objects, charts, text, etc. via commands (e.g., clicking buttons). In various embodiments, output device 1250 may be implemented as a display, printer, force feedback mechanism, audio output device, video component output, and the like. Display 1260 may include a CRT display, an LCD display, a plasma display, and the like.
Embodiments of communications interface 1270 may include a computer interface such as an Ethernet card, a modem (telephone, cable, ISDN), an (Asymmetric) Digital Subscriber Line (DSL) unit, a FireWire interface, a USB interface, or the like. For example, these computer interfaces may be coupled to a computer network 1290, a FireWire bus, or the like. In other embodiments, these computer interfaces may be physically integrated on a system board or motherboard of computer system 1200, or may be implemented as software programs or the like.
In various embodiments, computer system 1200 may also include software that allows communication over a network, such as HTTP, TCP/IP, RTP/RTSP protocols, and the like. Other communication software and transport protocols, such as IPX, UDP, etc., may also be used in alternative embodiments of the present invention.
In various embodiments, computer system 1200 may also include an operating system, such as Microsoft Windows, Linux, Mac OS X, a real-time operating system (RTOS), or another open source or proprietary OS.
FIG. 12 is a representation of a media player and/or computer system capable of implementing the present invention. Those of ordinary skill in the art will readily recognize that many other hardware and software configurations are suitable for use with the present invention. For example, the media player may be a desktop, portable, rack-mounted, or tablet configuration. In addition, the media player may also be a series of networked computers. Further, the media player may be a mobile device, an embedded device, a personal digital assistant, a smart phone, and the like. In other embodiments, those techniques described above may be implemented on a chip or on an auxiliary processing board.
The present invention may be implemented in the form of control logic in hardware or software or a combination of both. The control logic may be stored in the information storage medium in the form of a plurality of instructions adapted to direct an information processing device to perform a set of steps disclosed in embodiments of the present invention. Other ways and/or methods of implementing the invention will occur to those skilled in the art from the disclosure and teachings herein.
The embodiments described in this application are illustrative of one or more examples of the invention. Having described embodiments of the invention with reference to the accompanying figures, those skilled in the art will appreciate various alterations or modifications to the methods and/or specific structures described. All changes, modifications, and variations that rely upon the teachings of the present invention and through which these teachings have advanced the art are deemed to be within the scope of the present invention. Accordingly, the specification and drawings are not to be regarded in a limiting sense, because it is to be understood that the invention is in no way limited to the illustrated embodiments.
The above description is illustrative and not restrictive. Many variations of the invention will occur to those skilled in the art upon review of this disclosure. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled, and not by reference to the foregoing specification.
Claims (11)
1. A method of providing audio prompts to a user through a media player, the method comprising:
receiving an input representing the user's interaction with a user interface associated with the media player;
determining whether to output an audio prompt that audibilizes the interaction; and
in case an audio prompt is to be output:
determining whether a connection exists from the media player to a voice server via a communications network;
receiving a preliminary audio prompt associated with the interaction from the voice server if the connection exists;
generating a new audio prompt if the connection does not exist; and
outputting at least a portion of the preliminary audio prompt or the new audio prompt.
2. The method of claim 1, wherein generating a new audio prompt comprises: synthesizing, by the media player, the new audio prompt using text-to-speech technology.
3. The method of claim 1, wherein the preliminary audio prompt received from the voice server comprises a high quality voice recording.
4. The method of claim 1, wherein the quality of the preliminary audio prompt is higher than the quality of the new audio prompt.
5. The method of claim 1, wherein receiving the preliminary audio prompt comprises:
receiving a streaming input from the voice server, the streaming input including the preliminary audio prompt.
6. A portable media playback device comprising:
a media playback unit;
a user interface; and
a processor configured to:
receiving an input representing a user interaction with the user interface;
determining whether to output an audio prompt that audibilizes the interaction; and is
In the event that an audio prompt is to be output, the processor is further configured to:
determining whether a connection exists from the portable media playback device to a voice server over a communications network;
receiving a preliminary audio prompt associated with the interaction from the voice server if the connection exists;
generating a new audio prompt if the connection does not exist; and
outputting at least a portion of the preliminary audio prompt or the new audio prompt.
7. The portable media playback device of claim 6, wherein the preliminary audio prompt received from the voice server comprises a high quality voice recording.
8. The portable media playback device of claim 6, wherein the quality of the preliminary audio prompt is higher than the quality of the new audio prompt.
9. The portable media playback device of claim 6, wherein receiving the preliminary audio prompt comprises:
receiving a streaming input from the voice server, the streaming input including the preliminary audio prompt.
10. The portable media playback device of claim 6 wherein the new audio prompt is generated by the processor using text-to-speech synthesis techniques.
11. An apparatus for providing audio prompts to a user through a media player, the apparatus comprising:
means for receiving input representing interaction by the user with a user interface associated with the media player;
means for determining whether to output an audio prompt that audibilizes the interaction; and
means for performing operations including, in a case where an audio prompt is to be output:
determining whether a connection exists from the media player to a voice server via a communications network;
receiving a preliminary audio prompt associated with the interaction from the voice server if the connection exists;
generating a new audio prompt if the connection does not exist; and
outputting at least a portion of the preliminary audio prompt or the new audio prompt.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US12/207,314 | 2008-09-09 | ||
| US12/207,314 US8898568B2 (en) | 2008-09-09 | 2008-09-09 | Audio user interface |
| PCT/US2009/051954 WO2010030440A1 (en) | 2008-09-09 | 2009-07-28 | Audio user interface |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1160957A1 HK1160957A1 (en) | 2012-08-17 |
| HK1160957B true HK1160957B (en) | 2015-11-20 |