WO2009158077A2 - Devices and methods used in the processing of converting audio messages to text messages - Google Patents
Devices and methods used in the processing of converting audio messages to text messages
- Publication number
- WO2009158077A2 (application PCT/US2009/044270)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agent
- audio
- text
- message
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/06—Message adaptation to terminal or network requirements
- H04L51/066—Format adaptation, e.g. format conversion or compression
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/10—Multimedia information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/40—Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/60—Medium conversion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/50—Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
- H04M3/53—Centralised arrangements for recording incoming messages, i.e. mailbox systems
- H04M3/533—Voice mail systems
- H04M3/53333—Message receiving aspects
Definitions
- the audio file is played to the human agent who listens to the audio file, and assisted by computer-based tools, intelligently converts the audio file into text in a text file.
- the current invention described herein i.e. the voice message conversion tool described in this specification, is the interface tool used between the human agent and the computer-based tools in the SpinVox voice to text conversion process.
- This interface known as Tenzing, is the interface tool used by the human agent in the SpinVox process of converting an audio file to a text file.
- this interface tool is used in the process of entering text into the text file in the conversion process.
- this interface tool is used to review, compare, and edit a previously converted/entered text file against its related audio file.
- the review interface tool not only edits and corrects the previously converted text file but also provides input data to the processes and techniques used in the SpinVox process, for the purpose of increasing the overall predictive abilities of the system.
- the entering/converting and review/editing interface tools can be used together by the same agent or separately by different agents or their supervisors. In these processes (i.e. conversion and review/edit) the human agent guides predictive programs and processes to assist the agent in converting the audio file into text file.
- These predictive programs and processes can use different and varied input data, such as speech recognition processes performed on the audio file, contextual information about the time and date of the original voice message, the duration of the audio file, the phone numbers of the caller and recipient, data gathered from previous calls from the caller, and the text input provided by the agent during both the conversion and review/editing processes.
- the quality/speed of the conversion process is the measurement of two target objectives in the SpinVox process: the quality (i.e. accuracy of the converted message) and the speed (i.e. turnaround time) in receiving an audio file and converting it to a text file.
- the Listen, Action, Review process, also called the "Quickflow" process, combines the three processes described above: Listening (to the audio file for a set time); Action (i.e. entering/converting the audio file to a text file); and Review (reviewing/editing the converted text file based on its related audio file).
- Fig. 1 is a depiction of the Graphic User Interface (GUI) login screen used in an embodiment of the current invention.
- Fig. 2 is a depiction of the GUI priority language notification screen used in an embodiment of the current invention.
- Fig. 3 is a depiction of the GUI priority language selection screen used in an embodiment of the current invention.
- Fig. 4 is a depiction of the GUI message receiving notification screen used in an embodiment of the current invention.
- Fig. 5 is a depiction of the GUI of the voice message conversion tool used in an embodiment of the current invention.
- Fig. 6 is a depiction of the GUI of the settings used for the voice message conversion tool used in an embodiment of the current invention.
- FIG. 7 is a depiction of the GUI of the shortcut or hot-keys used for the voice message conversion tool used in an embodiment of the current invention.
- Fig. 8 is a flowchart showing the login process used in an embodiment of the current invention.
- Fig. 9 is a flowchart showing the time out processes used in an embodiment of the current invention.
- Fig. 10 is a flowchart showing an overview of the processes used in the voice message conversion tool used in an embodiment of the current invention.
- Fig. 11 is a flowchart showing an overview of the valid control commands used in the voice message conversion tool in an embodiment of the current invention.
- Fig. 12 is a flowchart showing the post conversion processes used in an embodiment of the current invention.
- FIG. 13 is a flowchart showing the audio control processes used in an embodiment of the current invention.
- Fig. 14 is a depiction of the GUI of the initial display in the review/editing tool used in an embodiment of the current invention.
- Fig. 15 is a depiction of the GUI of the word editing function used in the review/editing tool used in an embodiment of the current invention.
- Fig. 16 is a depiction of the GUI of the completed display used in the review/editing tool used in an embodiment of the current invention.
- GUI Graphical User Interface
- Figure 1 is an embodiment of the login screen of the GUI in accordance with the present invention.
- the login screen 100 includes a login window 102, where the agent is prompted to enter his authorization details.
- the agent enters his username and password in username field 104 and password field 106 respectively.
- the agent presses the OK button 108.
- the authorization details are then verified and the agent is authorized to use the voice message conversion tool.
- a finger print recognition system, an iris recognition system or any other suitable method can be employed to authorize the agent.
- An exit button 110 is provided in case the agent wishes to close the application and exit out of the tool.
- If the authorization details are incorrect, a visual indication is shown by flashing and altering the color of an indication icon 112, and the agent is prompted to re-enter the authorization details.
- any other method of indicating incorrect login details can be employed.
- If the agent fails to enter the login details correctly for the third consecutive time, the application is closed.
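The three-attempt rule described above can be sketched as follows. This is a minimal illustration, not the tool's actual login code: the function name `run_login`, the `verify` callback, and the returned status strings are assumptions made for the example.

```python
MAX_ATTEMPTS = 3  # from the text: the application closes on the third consecutive failure


def run_login(attempts, verify):
    """Process (username, password) attempts in order.

    `verify` is a hypothetical callback that returns True for valid details.
    Returns "authorized", "closed" (third consecutive failure), or "pending"
    (fewer than three failures so far, the agent may try again).
    """
    failures = 0
    for username, password in attempts:
        if verify(username, password):
            return "authorized"
        failures += 1
        if failures >= MAX_ATTEMPTS:
            return "closed"
    return "pending"
```

Other authorization methods mentioned in the text (fingerprint or iris recognition) would only change what `verify` does, not the attempt counting.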
- buttons 114 are provided to minimize, maximize and close the application.
- One of the buttons from the set of buttons 114 can be used in order to display the hot-keys panel.
- another button is used to display a settings panel.
- the settings panel and the hot-keys panel are further explained in conjunction with Figure 6 and Figure 7.
- the voice message conversion tool can support multiple languages.
- a language also includes its regional dialects as well as the languages' linguistic designation.
- For example, English includes the languages US-English, UK-English, Australian-English, South African-English, etc.
- an agent may be versed in several of the supported languages (e.g. US-English and South African-English). In such cases, the agent may be given a choice of prioritizing the languages for which he wishes to receive messages for conversion.
- Figures 2 and 3 are diagrams showing the queue selection screens of the voice message conversion tool GUI.
- the queue selection screen 200 includes a queue selection window 202.
- a high priority language can be selected by using a high priority language selection button 204.
- a low priority language can be selected by using a low priority button 206.
- When the agent wishes to make a selection of his/her priorities, he/she presses one of the two buttons 204 and 206, and is then taken to a second queue selection screen 300 of Figure 3.
- the second queue selection screen 300 includes two language selection options - selection for live message conversion, and selection for training message conversion.
- the voice message conversion tool may be used for both live message conversion and training of the agent.
- In case the agent is converting a voice message to text in real time, referred to as 'live conversion', he or she may choose a high priority language, by using the options of supported languages presented in a live queue selection box 302.
- In case the agent is undergoing training on voice message conversion, referred to as 'training conversion', he or she may choose a high priority language for training, by using the options of supported languages presented in the training queue selection box 304.
- the second queue selection screen here is shown for illustrating high priority language selection only.
- a similar queue selection screen may appear when the agent wishes to select a low priority language by clicking on the button 206.
- the list of supported languages is presented by using the languages for which messages are available for conversion.
- the high priority queue is the one from which the agent receives the messages for conversion.
- the tool automatically switches to the high priority queue and sends messages from it to the agent for conversion. In case there are no messages queued in the high priority queue, the tool will send messages from the low priority queue to the agent.
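The fallback from the high priority queue to the low priority queue can be illustrated with a short sketch. The function name `next_message` and the use of in-memory `deque` objects are assumptions for illustration; the text does not say how the queues are stored.

```python
from collections import deque


def next_message(high_queue, low_queue):
    """Return the next message for the agent, preferring the high priority queue.

    Falls back to the low priority queue only when the high priority queue is
    empty. Returns None when both queues are empty.
    """
    if high_queue:
        return high_queue.popleft()
    if low_queue:
        return low_queue.popleft()
    return None
```

Because the high priority queue is checked on every call, a message that arrives there while the agent is working through low priority messages is served next, which matches the "automatically switches" behavior described above.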
- the high and low priorities as selected by the agent can be stored by the voice message conversion tool; and can be automatically loaded when the agent logs into the tool next time. Further, depending on the languages of the messages the agent is likely to convert, default high and low language priorities can be presented to the agent. In another embodiment of the present invention, the agent may be given an option of not selecting the low priority language queue. A button enabling the agent not to select a low priority language queue can be presented on the second queue selection screen 300. In an embodiment of the present invention, the tabs representing language options in the live queue selection box 302 and training queue selection box 304 may have different colors. A cancel button 306 is provided, using which, the agent can go back to the queue selection screen 200 without making a selection for language priorities.
- the system or agent's supervisors can select the language queue priorities for the agent. In this case, the agent does not need to make the selection of language priorities. In this embodiment the agent is taken to the message download dialogue box 400, bypassing the queue selection screens 200 and 300. Furthermore, the selection of the languages can be remotely made by the system administration or agent's supervisors in real-time. For example, when there is a smaller number of conversion requests in UK-English queue and the demand for US-English conversions is high; the system automatically shifts an agent from UK-English to US-English, when the agent is familiar with conversion of both languages.
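One way the automatic shift between language queues might work is sketched below. The text does not state how demand is measured, so treating the number of waiting messages (`backlog`) as the measure of demand is an assumption, as is the function name `pick_queue`.

```python
def pick_queue(backlog, agent_languages):
    """Return the language with the largest backlog among those the agent knows.

    `backlog` maps a language name to its number of waiting messages
    (an assumed measure of demand). Returns None if the agent knows none
    of the languages that have a queue.
    """
    candidates = {lang: n for lang, n in backlog.items() if lang in agent_languages}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

With a backlog of 2 UK-English and 40 US-English messages, an agent familiar with both would be assigned US-English, as in the example in the text. Queues in languages the agent does not know are never selected, however long they are.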
- the agent can start receiving the messages for conversion by using a 'Start Converting' button 208.
- At the queue selection screen 202, an option of logging out of the system may be presented. This option can be selected by using a 'Log Out' button 210.
- a check-box 212 is presented for selecting only escalated messages for conversion. This option may be made available to managers of the agents only.
- the escalated messages are messages that an agent was not able to convert in part, for various reasons. Escalated messages are explained in detail in conjunction with Figure 12.
- the option of selecting only escalated messages for conversion by using the check-box 212 is presented only after a user authenticates himself as a manager or team leader.
- the agent is given a predetermined amount of time to make the selection of high and low priority queues.
- If the agent does not make the selection within the predetermined amount of time, the system logs the agent out.
- a message download dialogue box 400 can be optionally presented on the GUI as shown in Figure 4.
- the message download dialogue box has a provision of showing which language queue the message is being retrieved from.
- the message download dialogue box is presented to the agent during the retrieval of messages from the message queue. It can also display to the user, the time required for downloading the message from the queue.
- a download status bar 406 displays the progress of message retrieval from the system. Any other similar indication method can also be used to display the progress of message retrieval.
- a 'Cancel' button 402 can be presented, using which, the user can log out of the system at this stage. Further, a visual indication icon such as a moving circle 404 or a sand clock can be shown in order to show that the system is busy retrieving the message. Once the message is downloaded, the agent is taken to the conversion screen for converting the message.
- Figure 5 is a diagram showing the conversion screen of the voice message conversion tool GUI, in accordance with an embodiment of the present invention. Conversion screen 500 includes a main conversion window 502, an audio waveform display 504, a status display box 510, and a speedometer 520.
- the agent is not permitted to start converting the message for a predetermined amount of time.
- a dialogue box that displays an interrupt message is presented on the screen.
- An appropriate message such as 'Please listen to the message. Text entry disabled' can be displayed in the dialogue box.
- indicative visual icons may also be displayed, alerting the agent that the text entry is disabled.
- the listening phase is denoted by a loudspeaker icon until the predetermined time for disabling the keyboard as explained herein is finished, then a sliding icon marked "Action" slides across the top of the main conversion window 502 to denote that the keyboard has become active.
- the listen phase is 0.5 to 30 seconds long; in a better embodiment of the current invention the listen phase is 1.0 to 10 seconds long; in the preferred embodiment of the current invention the listen phase is calculated as ten percent (10%) of the duration of the message, but is no less than three seconds long.
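The preferred-embodiment rule for the listen phase (ten percent of the message duration, with a three-second floor) reduces to a one-line calculation. The function and parameter names below are illustrative assumptions.

```python
def listen_phase_seconds(message_seconds, fraction=0.10, minimum=3.0):
    """Length of the listen phase during which text entry is disabled.

    Preferred embodiment from the text: 10% of the message duration,
    but never less than 3 seconds.
    """
    return max(minimum, fraction * message_seconds)
```

For a 20-second message, ten percent is 2 seconds, so the three-second floor applies. For a 60-second message the listen phase is about 6 seconds.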
- the audio waveform display 504 shows a visual representation of the audio waves of the message to be converted.
- the audio waveform display 504 shows a graph whose X axis represents time, and the Y axis represents the intensity or the gain level of the audio.
- the waveform is symmetrical about the axis 508.
- a moving bar 506 indicates the position on the waveform at the time the message is being played to the agent.
- the moving bar 506 can be represented as a semi-transparent bar which gives an indication of the length of message that has been played.
- the agent can go back and re-play the audio.
- the agent can call on the "Forced Rewind" function and force the audio to rewind to the beginning of the audio by pressing the F5 key at any time when the keyboard is active (i.e. the predetermined disabled time period is over).
- the agent can call on the "Forced Rewind" function by first pressing the space bar when the keyboard becomes active. At this time, the moving bar 506 can be moved back along the axis 508. This is explained in detail in conjunction with Figure 7. If the agent first presses a key other than the space bar when the keyboard becomes active, the space bar will then return to its normal functionality and enter spaces in the text file.
- the playing of the audio file is paused for a period of time before it is restarted.
- this audio pause can be 0.5 to 10 seconds long; in the preferred embodiment of the current invention the audio pause is five seconds long.
- the audio is displayed in form of a simple linear slider which moves along the time axis.
- a cursor on the slider shows the position of the audio being played on a linear timescale.
- the audio waveform is divided into segments based on the pauses taken during the message. These segments are displayed by using audio tab lines 522 on the audio waveform or the linear slider.
- the agent can rewind, forward or replay the parts between the audio tab lines. Since the pauses are likely to be taken between two sentences, it automatically enables the agent to convert the message sentence after sentence.
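The text does not say how pauses are detected. As a hedged sketch, assume the audio has been reduced to one loudness value per fixed-length frame, and that a pause is a run of frames below a silence threshold. The function name `tab_line_positions`, the threshold, and the minimum pause length are all assumptions for illustration.

```python
def tab_line_positions(levels, frame_seconds, silence_level=0.05, min_pause_frames=3):
    """Return the times (in seconds) where a new segment starts after a pause.

    `levels` holds one loudness value per frame (an assumed input format).
    A pause is a run of at least `min_pause_frames` frames whose absolute
    level is below `silence_level`. Leading silence does not create a tab line.
    """
    positions = []
    quiet_run = 0
    for i, level in enumerate(levels):
        if abs(level) < silence_level:
            quiet_run += 1
        else:
            # i > quiet_run is False only when the quiet run began at frame 0
            if quiet_run >= min_pause_frames and i > quiet_run:
                positions.append(i * frame_seconds)
            quiet_run = 0
    return positions
```

Each returned time corresponds to one audio tab line 522. Rewinding or replaying "the part between the audio tab lines" then amounts to seeking between two neighboring positions in the returned list.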
- a feature of slowing down or speeding up the message from its normal speed is made available to the agent.
- the agent slows down or speeds up the message using a certain keystroke combination.
- the agent can slow down or speed up the message by using the foot-pedals and other input devices integrated with the voice message conversion tool.
- an indication that the message is being played slow or fast can be displayed along with the moving bar of the audio waveform or the cursor of the linear slider. Further, by what factor the message is slowed down or speeded up can also be displayed along with this indication. For example, if the speed of the message is halved, an indication of 0.5X is displayed on the moving bar 506.
- If the agent listens to a particular section more than a pre-determined number of times, the immediately subsequent playing of that section of audio is automatically slowed down.
- the slowdown is retained till the agent proceeds to another section.
- the pre-determined number of times is 3, i.e. if the agent listens to a particular section for more than three times, playback of that section of the audio is automatically slowed down.
- the slowdown of the audio is in the range of 20% to 80% of the original speed; in a better embodiment of the current invention the slowdown of the audio is in the range of 33% to 66% of the original speed; in the preferred embodiment of the current invention the slowdown of the audio is fifty percent (50%) of the original speed.
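The automatic slowdown can be sketched as a small counter per section. The class name `SectionPlayback` is hypothetical, and the sketch assumes that the fourth and later plays of the same section are the slowed ones; the text could also be read as slowing the play that follows the fourth.

```python
class SectionPlayback:
    """Tracks repeated plays of one audio section and picks a playback speed."""

    def __init__(self, max_normal_plays=3, slow_factor=0.5):
        self.max_normal_plays = max_normal_plays  # pre-determined number of times (3)
        self.slow_factor = slow_factor            # preferred embodiment: 50% of original speed
        self.section = None
        self.plays = 0

    def speed_for(self, section):
        """Record one more play of `section` and return its speed factor."""
        if section != self.section:
            # Moving to another section ends the slowdown.
            self.section = section
            self.plays = 0
        self.plays += 1
        return self.slow_factor if self.plays > self.max_normal_plays else 1.0
```

The slowdown is retained for as long as the agent stays on the same section and is reset as soon as a different section is requested.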
- the conversion screen includes a properties window 510.
- the properties window 510 displays certain parameters such as message language 512, message ID 514, length of the message audio 516 and recipient 518. These parameters are provided for the agent's reference. It should be noted that any other parameters relevant to the message may also be displayed in the properties window 510.
- the conversion screen also includes a speedometer 520.
- the speedometer 520 displays the conversion speed of the agent.
- the Agent Conversion Ratio is broadly defined as the ratio of the amount of time taken by the agent to convert an audio file into text, to the length of the audio file.
- the speedometer 520 is calibrated in terms of the ACR.
- the pointer of the speedometer 520 can indicate at what ACR the agent is converting the message, at any given point of time.
- the ACR can be an average, mean, mode or other statistical calculation representing an evaluation of the speed at which the agent is converting messages.
- When there is a permitted ACR range for the conversion, the speedometer displays a color indication of whether the agent's ACR measurement is within the permitted ACR range. For example, if the permitted range for the ACR is between 4 and 6 (or less), the speedometer 520 displays green within that range, and an ACR above 6 is shown in red.
- the speedometer 520 is configured to display whether the agent is in a permitted range of ACR or not.
- the actual permitted range of the ACR is kept unknown to the agent; however the agent is shown whether he is in the range of permitted ACR.
- the region of the permitted range of ACR can be displayed in green color; and the region of the ACR range above the maximum permitted value can be displayed in red color.
- the agent has to just convert the message and make sure that his speedometer arrow is maintained in the green region.
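The ACR definition and the green/red indication can be written out as a short sketch. The function names are assumptions, and the maximum permitted value of 6 is taken from the example in the text rather than from any stated operational setting.

```python
def agent_conversion_ratio(conversion_seconds, audio_seconds):
    """ACR: time the agent took to convert the audio, divided by the audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return conversion_seconds / audio_seconds


def speedometer_zone(acr, max_permitted=6.0):
    """Return "green" while the ACR is within the permitted range, else "red".

    Only the color is returned, so the actual permitted range can stay
    hidden from the agent, as one embodiment describes.
    """
    return "green" if acr <= max_permitted else "red"
```

An agent who spends 100 seconds converting a 20-second message has an ACR of 5, which falls in the green region of the example range. A lower ACR means faster conversion.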
- the speedometer 520 is in the form of a linear bar. Movable pointers indicating the current ACR and the target or permitted ACRs are displayed along the linear bar. As in the circular speedometer, the linear meter can have different colors representing the permitted ACR range and the range above the permitted ACR.
- In various other optional embodiments of the present invention, parameters such as the hourly conversion rate for the agent, the total number of messages converted for which the ACR is displayed by the speedometer 520, the average ACR for the day and so forth may be displayed along with the speedometer 520.
- the speedometer 520 is displayed to the agent only after five converted messages.
- the calculation for ACR can be made from the first message the agent converts. Therefore, the average ACR after first five messages gives a more realistic indication of the agent's ACR.
- an appropriate message such as 'The speedometer will be active in another 3 messages' can be displayed in place of the speedometer 520.
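The warm-up behavior (no speedometer until five messages have been converted) might look like the sketch below. The function name `speedometer_display` is hypothetical, and a plain arithmetic mean is assumed for the average; the text allows other statistics.

```python
def speedometer_display(acrs, warmup=5):
    """Return the average ACR once `warmup` messages are converted, else a notice.

    `acrs` is the list of per-message ACR values recorded so far, starting
    from the first message the agent converts.
    """
    remaining = warmup - len(acrs)
    if remaining > 0:
        return f"The speedometer will be active in another {remaining} messages"
    return sum(acrs) / len(acrs)
```

The ACR values are collected from the first message onward, so the average shown after the fifth message already reflects all five conversions.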
- the ACR can be displayed from the time the agent starts conversion.
- the voice message conversion tool works on the principle of working in conjunction with predictive text tools and processes that estimate the word or phrase of the sentence that the agent is going to enter. So when the predicted text matches the audio being played, the agent can simply accept it.
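The predictive text processes themselves are not specified in this section. Purely as an illustration of the accept-a-suggestion idea, the sketch below proposes the most frequent previously seen word that starts with what the agent has typed. The function name `predict` and the frequency-based ranking are assumptions, not the SpinVox method.

```python
from collections import Counter


def predict(prefix, history):
    """Return the most frequent word in `history` starting with `prefix`, or None.

    `history` is a list of previously entered words (an assumed data source).
    """
    if not prefix:
        return None
    counts = Counter(word for word in history if word.startswith(prefix))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

If the suggestion matches the audio being played, the agent accepts it instead of typing the rest of the word. Otherwise the agent keeps typing and the prefix narrows the candidates.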
- An example of the procedure that the agent follows for converting the message is explained in detail in conjunction with Figure 10.
- an embodiment of the current invention contains functionalities to assist the agent in entering text into the text file to increase the quality/speed objectives of the audio file to text file conversion process.
- the review stage window comprises a review indication box that displays an appropriate message such as 'Message Review - Please check the message' or a slider icon marked "Review" moves along on the top of the main review window to inform the agent that they are in the review/edit stage of the SpinVox process.
- the agent, supported by review process tools, can correct the punctuation and spelling, correct the case of the words and so forth.
- the system gives 'in-line' suggestions to the agent in review phase.
- the system provides a drop down list in-line with the text, offering the agent a number of probable alternatives to select from. When the agent clicks on the correct alternative, the word is replaced.
- the words that the system finds inappropriate/misspelled are underlined or displayed in a different color.
- the agent may change these words after listening to the audio.
- In-line suggestions as explained above can also be displayed to aid the agent in quickly selecting an appropriate word.
- a time delay on audio being played during review stage can be applied.
- the agent can read the converted text for a time period before the audio starts playing. This provision lets the agent read through the message even before the audio starts playing. The agent can thus spot mistakes, decide whether the corrections offered by the system could be right and so forth.
- the audio can also be played back as soon as the review window is displayed to the agent.
- Figure 6 is a diagram showing a settings panel of the voice message conversion tool.
- the settings panel 600 includes a text settings window 602, an audio settings window 604, a user settings window 606, a foot-pedal settings window 608, an accept button 610 and a cancel button 612.
- One of the buttons from the standard set of buttons 114 invokes the display of the settings panel, when the agent wants to change his or her settings. Following is a list of the settings options and their functions that can be made available to the agent by using the settings panel 600: Text settings window 602:
- Text size enables the user to change the size of the text being displayed on the screen.
- this option can be presented to the agent in the form of a slider, numeric values of the text size and the like, using which the agent can configure the size of the text displayed.
- Color This option enables the agent to change the color of the text displayed on the screen.
- the text displayed has two components - agent inputted text and predicted text. The color settings, when changed, enable the user to select different colors for the agent inputted text and the predicted text.
- Audio settings window 604:
- Play count This option enables the agent to select how many times the section of the audio should be repeated. For example, if the agent enters 2 in this option, the section of the audio demarcated by vertical lines on the audio waveform is played to the user twice before moving on the next section.
- Skip back size This option enables the agent to select the duration for which the audio goes back when the agent rewinds it. In an embodiment of the invention, this can be made available to the agent in the form of a slider. However, a provision to enter numeric values in seconds can also be made available.
- Audio display type This option enables the agent to select which type of audio display he prefers to be displayed on the conversion screen - audio waveform display or a simple linear slider. These options have already been explained in conjunction with Figure 5.
- Playback volume This additional option enables the agent to set an overall playback volume.
- the option of selecting the playback volume can be enabled using a hot-key. This is explained in detail in conjunction with Figure 9.
- the agent may also change the volume by using the volume control of his headset.
- the agent can select an option of receiving escalated messages only. As explained earlier, this option may be made available to the managers of the agents only. The same option is also available on the queue selection screen 200.
- the settings panel 600 also has options of accepting the settings by using accept settings button 610, or cancel the changed setting by using the cancel settings button 612.
- a 'Reset to default settings' button may be provided, using which, the agent can reset the voice message conversion tool to its default settings.
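The accept, cancel and reset behavior of the settings panel can be sketched with a saved copy and a working draft. The class names, field names and every default value below are assumptions for illustration; the text names the options but gives no defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    # All default values here are hypothetical placeholders.
    text_size: int = 12
    play_count: int = 1              # how many times each section is repeated
    skip_back_seconds: float = 2.0   # how far the audio goes back on rewind
    audio_display: str = "waveform"  # or "slider"
    playback_volume: int = 50


class SettingsPanel:
    """Keeps a saved copy and a draft so that changes can be accepted or cancelled."""

    def __init__(self, saved=Settings()):
        self.saved = saved
        self.draft = saved

    def change(self, **fields):
        self.draft = replace(self.draft, **fields)

    def accept(self):   # accept settings button 610
        self.saved = self.draft

    def cancel(self):   # cancel settings button 612
        self.draft = self.saved

    def reset_to_default(self):  # optional 'Reset to default settings' button
        self.draft = Settings()
```

Making `Settings` immutable means a cancelled edit can never leak into the saved settings, since every change produces a new object.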
- FIG. 7 is a diagram showing a hot-keys panel in accordance with an embodiment of the invention.
- the hot-keys panel 700 includes a message control keys window 702, an audio control keys window 704, and a miscellaneous control keys window 706.
- a hot-key is a combination of keys on the keyboard, which, when pressed, enables the agent to perform an action.
- One of the buttons from the standard set of buttons 114 invokes the display of the hotkeys panel, which enables the agent to display and modify the hot-keys.
- Following is a list of the hot-keys and their functions that can be made available to the agent by using the hot-keys panel 700:
- Re-queue message This option lets the agent re-queue the current message.
- the default hot-key combination for re-queuing the message is 'Ctrl + LLL'.
- Send message When the agent has finished converting the message, he or she can submit it. This has already been explained in conjunction with Figure 5 and Figure 6.
- the default hot-key for sending the message is 'Ctrl + Enter'.
- Hangup In case the agent detects a hangup in the message, i.e. there is no message to be converted; he can submit it as a hangup.
- the default hot-key for submitting the message as hangup is 'Ctrl + H'.
- Wrong Language In an embodiment of the present invention, in case the agent gets a message which is not from his or her language priority, the agent can re-queue the message by pressing this hot-key combination.
- the default hot-key for submitting the message as wrong language is 'Ctrl + W'.
- a pop-up box is presented.
- the pop-up box has a list of supported languages.
- the agent can re-queue the message in one of the languages by making the selection.
- If the agent is not able to identify the probable language of the message, he or she can use an unknown language option.
- If the unknown language option is selected, the message is escalated to a manager for a language check.
- the agent can send the message to manager for escalation.
- the message is thereafter made available to a manager.
- the default hot-key for sending the message to the manager is 'Ctrl + M'.
- Unconvertible In case the message is unclear or a substantial number of words of the message are not clear to the agent, he or she can submit the message as unconvertible.
- the default hot-key for submitting the message as unconvertible is 'Ctrl + U'.
- Send and logout This is an optional hot-key offered to the agent. While converting the last message of a session, the agent can submit it and log out from the system.
- the default hot-key combination for sending and logging out is 'Ctrl + Alt + Enter'.
- Audio control keys window 704:
- Pause The audio can be paused and resumed for playback by using the default hotkey 'Escape'.
- Fast-forward The audio can be fast forwarded by using the default hot-key 'PGUP'.
- the interval for fast forwarding the message can be selected by the agent.
- Rewind The agent can rewind the audio by using the default hot-key 'PGDN'.
- the interval for rewinding the message can be selected by the agent.
- Jump to audio start The agent can jump to the beginning of the audio file.
- the default hot-key for jumping to the beginning of the audio is 'Fl '.
- Skip section forward The agent can skip a section of message demarcated by audio tab lines on audio waveform display.
- the default key for skipping a section forward is 'Ctrl + PGUP'
- Skip section back The agent can skip a section of message demarcated by vertical lines on audio waveform display, backwards.
- the default key for skipping a section backward is 'Ctrl + PGDN'
- Volume Up The agent can increase the volume of the audio playback using a hot-key.
- the default hot-key for increasing the volume is 'Ctrl + Add'.
- Volume Down The agent can reduce the volume of the audio playback using a hot-key.
- the default hot-key for reducing the volume is 'Ctrl +'.
- Slow Down Playback The agent can slow down the audio playback.
- the default hot-key for slowing down the audio playback is 'Ctrl + S'.
- the extent of slowing down the audio playback is specified by default. For example, slowing down reduces the speed of the audio playback by half, i.e. plays it at 0.5X the normal speed.
- the agent can slow down the audio playback step-by-step. For example, hitting this hot key every time allows the agent to slow down the audio playback by an interval of 0.1X.
- Speed Up Playback The agent can speed up the audio playback.
- the default hot-key for speeding up the audio playback is 'Ctrl + F'.
- the extent of speeding up the audio playback is specified by default. For example, speeding up increases the speed of the audio playback by double i.e. plays it at 2X the normal speed.
- the agent can speed up the audio playback step-by-step. For example, hitting this hot key every time allows the agent to speed up the audio playback by an interval of 0.1X.
- Normal Playback When the audio is slowed down or is running at a faster speed and the agent wishes to bring it back to the normal speed, he or she can use this hot-key combination.
- the default hot-key for selecting normal playback is 'Ctrl + M'.
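The step-by-step speed control described above can be illustrated with a minimal Python sketch. The class name `PlaybackSpeed` and its methods are hypothetical. The 0.1X step comes from the text, while using 0.5X and 2X as lower and upper bounds is an assumption borrowed from the single-press examples.

```python
class PlaybackSpeed:
    """Tracks the playback rate as a multiple of normal speed (1.0)."""

    def __init__(self, step=0.1, minimum=0.5, maximum=2.0):
        self.step = step          # change applied per hot-key press
        self.minimum = minimum    # assumed lower bound (0.5X)
        self.maximum = maximum    # assumed upper bound (2X)
        self.rate = 1.0

    def slow_down(self):
        # Rounding avoids accumulating floating point error over many presses.
        self.rate = max(self.minimum, round(self.rate - self.step, 2))
        return self.rate

    def speed_up(self):
        self.rate = min(self.maximum, round(self.rate + self.step, 2))
        return self.rate

    def normal(self):
        # Corresponds to the 'Normal Playback' hot-key.
        self.rate = 1.0
        return self.rate
```

Each press of the slow-down or speed-up hot-key would call the corresponding method once, and the normal playback hot-key would call `normal()`.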
- a number of audio preset equalizers are available for the agent to select from. When one of these audio presets is selected, the audio is pre- processed using the selected preset equalizer, and played back to the agent. This helps in improving quality of the audio that the agent listens to.
- a separate hot-key can be assigned to select an audio preset.
- a hot-key can be assigned to toggle between the available audio presets.
- Such hot-keys and their functions can also be displayed in the audio control keys window 704.
- the default hot key for toggling between presets and selecting a preset is 'F5'.
- Miscellaneous control keys window 706:
- Insert recipient name The agent can insert the recipient name in a message by using this hot-key.
- the default hot-key for inserting the recipient name in a message is 'Ctrl + N'.
- Hotkeys panel As an alternative to the buttons in the standard set of buttons 114, the hotkeys panel can be displayed by using this hot-key.
- the default hot-key for displaying hot-keys panel is 'F2'.
- Settings panel As an alternative to the buttons in the standard set of buttons 114, the settings panel can also be displayed by using a hot-key.
- the default hot-key for displaying settings panel is 'F3'
- Run spell check The agent can run the spell check option by using the default hotkey 'F7'.
- Show first names A list of first names starting with the agent input is displayed on the screen when the agent presses this hot-key combination. In an embodiment of the invention, when the hot-key is pressed without any agent input character being present on the screen, a list of most common first names is displayed. The default hotkey for displaying the list of first names is 'Alt + F'.
- the agent can customize the hot-keys to suit his or her convenience. This can be done by clicking on the respective hot-key and specifying a desired combination. In case the combination of hot-keys is already in use, the GUI displays an appropriate message. The agent is then prompted to select another combination.
- the agent can reset the hot-keys to their default values by pressing the reset to default button 908.
- the agent defined hot-keys can be stored in the agent's profile and can be made available to the agent the next time.
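The hot-key customization behaviour described above (reject a combination that is already in use, allow a reset to defaults) can be sketched as follows. This is an illustration only: the class `HotkeyProfile`, the action names and the choice of a plain dictionary as the stored profile are all assumptions, and only a few of the default combinations listed in the text are included.

```python
# A few default hot-keys taken from the description; action names are hypothetical.
DEFAULT_HOTKEYS = {
    "wrong_language": "Ctrl+W",
    "unconvertible": "Ctrl+U",
    "pause": "Escape",
    "spell_check": "F7",
}


class HotkeyProfile:
    """Per-agent hot-key bindings with conflict detection."""

    def __init__(self, defaults=DEFAULT_HOTKEYS):
        self._defaults = dict(defaults)
        self.bindings = dict(defaults)

    def assign(self, action, combo):
        """Bind combo to action. Return False if another action already uses combo."""
        if action not in self.bindings:
            raise KeyError(action)
        for other, used in self.bindings.items():
            if other != action and used == combo:
                return False  # the GUI would show a message and prompt for another combination
        self.bindings[action] = combo
        return True

    def reset_to_default(self):
        """Restore the default bindings, as the 'reset to default' button would."""
        self.bindings = dict(self._defaults)
```

The `bindings` dictionary is what would be saved in the agent's profile and restored at the next session.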
- in case the voice message conversion tool is not able to retrieve the message from a message queue server, it displays an appropriate message to the user. The user can wait until the tool is able to retrieve the messages.
- the voice message conversion tool can be used to convert the voice messages to post them directly on the internet.
- the messages can be posted in form of blog posts on various blogs.
- a suitable warning is displayed to the agent, stating that the message is going to be converted into a blog post, which may have a number of readers. This warning is typically displayed before the agent starts converting the message.
- the voice message conversion tool is integrated with a foot-pedal system.
- the agent is provided with a left foot pedal, a centre foot pedal and a right foot pedal.
- the foot pedals are pre-configured for their functions. Their functionalities can be as follows:
- the agent can customize the functions of the foot pedals to suit his or her convenience in working.
- An appropriate pop-up window for customizing the functions of the foot pedal can be provided.
- This foot-pedal functions customization window can be invoked by using a hot-key combination.
- FIGs 8 through 13 are flow diagrams showing the detailed steps of processes used in an embodiment of the current invention.
- the invention displays a login window to the agent.
- the login credentials of the user are accepted by the system.
- the login credentials include a username and a password.
- the login credentials include biometric information such as fingerprints or iris recognition.
- the login credentials are matched with those in the database. In other words, the agent is authorized for using the application. In case the agent is not authorized, a flashing icon may be displayed to the agent. In this case, the invention displays the login screen for subsequent attempts.
- the invention displays a queue selection window for the agent to select a high priority language queue and a low priority language queue, at step 808.
- if the agent wishes to make a selection of his or her priorities, he or she is taken to a queue selection screen that includes two language selection options: selection for live message conversion and selection for training message conversion.
- the agent makes a language queue selection at step 810.
- the queues for the agent are selected by default.
- the selection from one agent session is stored in memory and the same queues are selected for the subsequent sessions.
- the queue selection window 202 includes an option of starting the conversion by pressing Start Conversion button 208. It is checked at step 812 whether the agent wants to start conversion. When the agent wishes to start the conversion, he or she is presented with an audio file to be converted and a text file. The procedure for this is explained in conjunction with Figure 9.
- it is checked at step 814 whether the agent wishes to log out of the system. In case of logout, the system logs the agent out at step 816. In case the agent does not wish to log out, the queue selection window is displayed to him or her again, i.e. steps 808 onwards are repeated.
- an inactivity timer T2 is run to count a second predetermined time, while waiting for an input from the agent at the queue selection window. In case the agent provides an input before the inactivity timer T2 counts over the second predetermined time, appropriate action is taken as explained above. If the inactivity timer T2 is over counting the second predetermined time without any input from the agent, the agent is logged out from the application. Thereafter, the steps 802 onwards are repeated.
- Figure 10 shows initiation of conversion process. If the agent chooses to start conversion at step 812, the agent receives an audio file with a text file from the system as in step 902.
- a first predetermined time is set for the agent to listen to the audio file.
- modification to the text file is disabled for the first predetermined time.
- a timer T1 starts to count the first predetermined time.
- the audio file begins to play automatically as in step 904.
- audio controls are always enabled. The purpose of the first predetermined time is to let the agent understand the contents of the audio file before the agent actually starts modifying the text file.
- the text file may contain text, predicted based on the words in the audio.
- the text file may contain partially converted text along with the predicted text.
- the text file may be empty.
- the text file contains a combination of one or more of the above mentioned possibilities.
- Predictive text is a set of one or more words that the system suggests based on the audio to aid the agent in the conversion process.
- Predictive text can be calculated in real time, based on various factors, such as the words in the audio, by using an automatic speech recognition algorithm; agent input and other information extracted from a database.
- the database may contain information related to previously converted audio files.
- the agent may use one or more hot keys to accept the set of one or more words suggested by the system, if the suggested words match the words comprehended by the agent in the audio file. If the agent finds the suggested words to be different, the agent has to type the words based on the audio.
- the system revises the predictive text in real time, based on each character entered by the agent.
- certain hot-keys such as 'Tab' and 'right arrow' can be assigned to aid the agent in accepting the system predicted text. By simply pressing these keys, the agent can accept the predicted text; he or she does not have to type the entire word using the keyboard. Therefore, the time required in converting the audio into text is reduced.
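The way predictive text is revised on each character and accepted with a hot-key can be sketched as below. This is a simplified illustration, not the system's actual prediction method: the candidate list is assumed to come from some upstream source (for example a speech recognizer), already ranked, and the function names `predict` and `handle_key` are hypothetical.

```python
def predict(candidates, typed):
    """Return the first ranked candidate that starts with what the agent has typed so far."""
    prefix = typed.lower()
    for word in candidates:
        if word.lower().startswith(prefix):
            return word
    return None


def handle_key(key, typed, candidates):
    """Return the text of the current word after one key press."""
    if key in ("Tab", "Right"):
        # Accept the current suggestion instead of typing the rest of the word.
        suggestion = predict(candidates, typed)
        return suggestion if suggestion is not None else typed
    # Any other key is treated here as a typed character; the prediction is
    # recomputed from the longer prefix on the next call to predict().
    return typed + key
```

With candidates `["tomorrow", "today", "tonight"]`, typing `to` suggests `tomorrow`, and typing one more character `d` revises the suggestion to `today`.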
- the system submits a text file containing the first limited number of characters of the message.
- the agent can alter the converted text to fit the text file into the limited number of characters.
- the first predetermined time is set to a predetermined time period.
- the agent input is disabled, resulting in the agent focusing on the content of the audio. This results in efficient conversion of audio to text.
- this time can be reduced or increased, based on the length of the audio. For example, for an audio file less than the predetermined time in length, the disabled time can be reduced to a shorter length of time.
- the ranges mentioned here are for better understanding of the concept of the agent listening to a portion of audio without any input. Based on the requirements of the conversion, various other alterations in the time spans can be made, without affecting the scope of the present invention.
- at step 906, the system continuously checks whether the timer T1 is over.
- a polling mechanism can be used for performing this check.
- text file modification is enabled at step 908.
- some keys of the keyboard continue to be disabled. Such keys include some special characters that are not desired to be present in the converted text file.
- use of 'caps lock' can also be disabled when the system automatically corrects the converted text for correct case.
- the system starts an inactivity timer T2, to check an inactivity of the agent for more than a second predetermined period of time.
- the inactivity timer T2 can also be started before the enablement of text file modification, and it does not affect the scope of the invention in any way.
- the system keeps checking if there is an input from the agent, when the inactivity timer T2 is running. If the agent provides an input to the system within the second predetermined period of time, an appropriate action is taken, process for which is explained in conjunction with figure 11.
- a warning timer T3 starts to count a third predetermined period of time.
- a warning message flashes on the screen, indicating to the agent that the message will be re-queued if the agent does not strike any key.
- an audio indication such as a beep is used to alert the agent about inactivity.
- both audio indication and warning message are used.
- the warning message can also be displayed with a visual indication such as a flashing icon.
- while executing steps 918 and 920, the system keeps checking for the agent input while the timer T3 is running. If the agent provides input to the system within the third predetermined period of time, the input is processed by the system, the procedure for which is explained in conjunction with figure 11. If the third predetermined period of time passes without the agent striking any key, the audio file along with the text file is re-queued. Thereafter, steps 808 onwards are repeated as explained in conjunction with fig 8.
- the inactivity timer T2 is in operation during the time the application is running. It is interrupted when there is an agent input.
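The interplay of the three timers (T1 for the listen-only period, T2 for inactivity, T3 for the warning period) can be summarized with a small sketch. It is written as a pure function of timestamps so that it can be checked without real clocks. The state names and the function `conversion_state` are hypothetical, and the durations used in any call are placeholders, since the text does not fix them.

```python
def conversion_state(now, started_at, last_input_at, t1, t2, t3):
    """Return the state of a conversion at time `now` (all values in seconds).

    t1: listen-only period during which text modification is disabled
    t2: allowed inactivity before the warning is shown
    t3: warning period before the message is re-queued
    """
    editing_enabled_at = started_at + t1
    if now < editing_enabled_at:
        return "listening"  # text modification disabled, audio controls enabled

    # T2 starts when editing is enabled and is interrupted by each agent input.
    idle_since = editing_enabled_at
    if last_input_at is not None and last_input_at > idle_since:
        idle_since = last_input_at

    idle = now - idle_since
    if idle < t2:
        return "editing"
    if idle < t2 + t3:
        return "warning"    # warning message and/or beep is presented
    return "requeued"       # audio file and text file go back to the queue
```

A polling loop, as mentioned for the T1 check, could call this function periodically with the current time and act on state changes.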
- the invention accepts the agent input at 1002.
- the agent input can be a control command, an audio command, a miscellaneous command or a text command.
- a check is performed to detect the type of command, based on the agent input.
- a check whether the agent input is a control command is performed at step 1004. In case the agent input is a control command, an appropriate action is taken, the procedure for which is explained in conjunction with Figure 11 and Figure 12.
- similarly, a check whether the agent input is an audio command is performed at step 1006. In case the agent input is an audio command, an appropriate action is taken, the procedure for which is explained in conjunction with Figure 13.
- a check whether the agent input is a system command is performed at step 1008. In case the agent input is a system command, it is executed at step 1010.
- the miscellaneous commands are inputted by using a hot-key combination from the miscellaneous control keys window 706. The appropriate actions that the system takes have already been explained in the description of these hot keys.
- the miscellaneous commands include system commands, such as displaying the hotkeys panel, displaying the settings panel, insert recipient name and the like.
- the miscellaneous commands also include the assist commands, such as running a spell check, show first names, show second names and the like.
- the agent inputs the miscellaneous commands.
- the miscellaneous commands can be made a part of the user interface. In other words, the agent need not always enter some commands such as the assist commands.
- the system may intelligently assist the agent with these functions.
- a check whether the agent input is a text command is performed at step 1012. In case the agent provides a text input, a check for its validity is performed at step 1014. In an embodiment of the present invention, an invalid text input may be 'caps lock' and punctuation marks such as semi-colon, colon, exclamation mark and the like. If the agent input is an invalid text input, the process is repeated from step 910. If the agent input is valid, the system accepts the input at step 1016. In an embodiment of the present invention, the system calculates the predictive text at step 1018 based on the input provided by the agent; and displays the calculated predictive text at step 1020. Thereafter, the process is repeated from step 910 onwards.
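The sequence of checks on the agent input can be sketched as a simple classifier. This is a rough illustration under stated assumptions: the function name `classify_input`, the key-set parameters and the string labels are hypothetical, and the invalid text keys are limited to the examples named in the text.

```python
# Examples of invalid text input named in the description.
INVALID_TEXT_KEYS = {"CapsLock", ";", ":", "!"}


def classify_input(key, control_keys, audio_keys, system_keys):
    """Return the command type for one agent input, checked in the order described."""
    if key in control_keys:
        return "control"       # handled as in Figures 11 and 12
    if key in audio_keys:
        return "audio"         # handled as in Figure 13
    if key in system_keys:
        return "system"        # executed directly
    if key in INVALID_TEXT_KEYS:
        return "invalid_text"  # ignored; the system waits for the next input
    return "text"              # accepted, then predictive text is recalculated
```

Each set would be populated from the agent's current hot-key bindings.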
- the text input provided by the agent may be a special character such as an underscore.
- An underscore is used in case the agent is not able to identify a word by listening to the audio file and/or with the suggestions from the system.
- an agent can put any number of underscores in the converted text and submit the text file. However, if the number of underscores placed by the agent exceeds a pre-determined value, the converted text is automatically sent to a manager, i.e. the conversion is escalated.
- an agent is not allowed to submit the text file with the number of underscores in the message exceeding a pre-determined value.
- the agent is required to either reduce the number of underscores or get the audio file and/or the partially converted text file re-queued or return the audio file as unconvertible.
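The two underscore-handling embodiments above (automatic escalation versus blocked submission) can be sketched in one function. The function name, the return labels and the `block_instead_of_escalate` flag are hypothetical, and the sketch assumes one underscore character is typed per unidentified word.

```python
def submit_outcome(text, max_underscores, block_instead_of_escalate=False):
    """Decide what happens when the agent submits converted text.

    Returns "review" when the underscore count is within the limit,
    otherwise "escalated" (sent to a manager) or "blocked" (submission refused),
    depending on which embodiment is in use.
    """
    count = text.count("_")
    if count <= max_underscores:
        return "review"
    return "blocked" if block_instead_of_escalate else "escalated"
```

In the blocked case, the agent would then reduce the number of underscores, re-queue the message, or return it as unconvertible.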
- the system provides some suggestions to the agent. The agent may accept an appropriate suggestion.
- the agent may use the option 'notify word' to store the word in the system's database.
- after an agent notifies a word it may be sent to a manager for a confirmation and added in the dictionary thereafter.
- the 'notify word' option is provided only to the managers.
- the suggestions provided to the agent also include the word followed by a special symbol, such as a (?). If the agent is not sure of existence of the word in usage but is sure that the word is present in the audio, the agent puts a (?) mark after the word or selects the same from the suggestions.
- in case the agent input is a control command, it is checked at step 1102 whether it is a valid control command. In case the command is not a valid control command, steps 910 onwards are repeated. In other words, the T2 timer is initiated again and the system waits for the next input from the agent.
- in case the control command is one of the valid control commands, the corresponding appropriate action is taken as explained further.
- the valid control commands include, but are not limited to, submitting the audio as a hangup, submitting the audio file as unconvertible, submitting the audio file as a wrong language, re-queuing the audio file, sending the audio file and the text file to a manager, sending the text file for review, and sending the text file for review followed by logging out.
- at step 1104, it is checked whether the control command is to submit the audio file as a hangup.
- a hangup is detected when there is no message in the audio, and the audio is too short.
- Another type of hangup can be a long audio file in case of a message recorded by mistake. In such cases, the agent submits the audio file as a hangup at step 1106.
- the audio file submitted as a hangup is not sent for a review.
- steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion.
- Unconvertible audio files are the audio files that the agent is not able to convert.
- an audio file may contain just music or noise, along with a person's voice, making the person's voice incomprehensible.
- an agent can submit such audio files as 'unconvertible' at step 1110.
- steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion.
- at step 1112, it is checked whether the control command is to submit the audio file as containing a message in a wrong language.
- if an agent identifies that the language of the audio is not the preferred language set by him or her, and the agent is unable to do the conversion, the agent submits the audio file as 'wrong language'.
- the audio file is then re-queued in an appropriate language queue at step 1114.
- the agent is given an option of identifying the language of the message. In case the agent is not able to identify the probable language of the message, he or she can use an unknown language option. When unknown language option is selected, the message is escalated to a manager for language check. In case of message submission as wrong language, steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion after the wrong language message is re-queued in an appropriate language queue.
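The wrong-language handling described above can be sketched as a small routing function. All names here are hypothetical: `requeue_wrong_language`, the `UNKNOWN` marker, and the use of in-memory `deque` objects to stand in for the language queues and the manager queue.

```python
from collections import deque

UNKNOWN = "unknown"  # hypothetical marker for the 'unknown language' option


def requeue_wrong_language(message, selection, language_queues, manager_queue):
    """Route a message that the agent has flagged as being in the wrong language.

    Returns the name of the queue the message was placed in.
    """
    if selection == UNKNOWN:
        # The agent could not identify the language: escalate for a language check.
        manager_queue.append(message)
        return "manager"
    if selection not in language_queues:
        raise ValueError(f"unsupported language: {selection}")
    language_queues[selection].append(message)
    return selection
```

After the call returns, the agent would be presented with a new audio file for conversion.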
- at step 1202, it is checked whether the control command is to re-queue the message.
- if the agent wishes to re-queue the message, he or she can use the defined hot-key.
- the message is re-queued in the language queue at step 1204.
- if the agent wishes to re-queue the message that he or she is working on, the audio file along with any converted text in the text file is re-queued. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
- at step 1206, it is checked whether the control command is to send the message to a manager.
- if the agent believes that the message needs another level of supervision, he or she can escalate it to a manager.
- the message is then sent to a manager at step 1208. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
- at step 1210, it is checked whether the control command is to send the message.
- when the agent is finished with converting the message, he or she can submit the text file for review at step 1212. In an embodiment of the present invention, the message can be submitted at any stage of the conversion. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
- the agent may want to submit the message and log out of the system.
- the agent can use the defined hot-key to submit the message and logout.
- This control command is checked for, at step 1214.
- the text file is submitted for review, and the agent is logged out of the application, at step 1216.
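The handling of valid and invalid control commands can be summarized as a lookup table. This is a sketch under stated assumptions: the command names, the destination labels and the function `handle_control_command` are hypothetical, and a destination of `None` marks cases where the text does not say where the audio file goes.

```python
# (destination, next step) for each valid control command.
CONTROL_COMMANDS = {
    "hangup": (None, "next_message"),            # not sent for review
    "unconvertible": (None, "next_message"),     # destination not stated in the text
    "wrong_language": ("language_queue", "next_message"),
    "requeue": ("language_queue", "next_message"),
    "send_to_manager": ("manager", "next_message"),
    "send": ("review", "next_message"),
    "send_and_logout": ("review", "logout"),
}


def handle_control_command(command):
    """Return (destination, next_step) for a control command.

    An invalid command has no destination; the system restarts the
    inactivity timer and waits for the next agent input.
    """
    if command not in CONTROL_COMMANDS:
        return (None, "wait_for_input")
    return CONTROL_COMMANDS[command]
```

Here "next_message" corresponds to repeating steps 902 onwards, and "wait_for_input" to repeating steps 910 onwards.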
- Fig. 13 shows the actions followed if the result of the check performed at step 1006 is that the agent input is an audio command.
- In case the agent input checked at step 1006 is an audio command, the actions explained in Figure 13 are performed. Examples of audio commands include, but are not limited to, 'pause', 'rewind', 'fast-forward', 'jump to audio start', 'skip section forward', 'skip section back', 'volume up', 'volume down', 'select an audio preset equalizer', 'slow down the speed of audio playback', 'speed up the audio playback' and 'play audio at normal speed'.
- the system checks at steps 1302, 1304 and 1306, whether the audio command input provided by the agent is one of the audio commands.
- corresponding to the selected audio command option, an appropriate action is performed at step 1308. If the agent input is a 'pause' audio command, the action performed depends on the current state of the audio. If the current state of the audio is 'playing', the audio is paused. If the current state of the audio is 'paused', the audio playback is resumed from the point where the audio was paused.
- if the agent input is a 'fast-forward' audio command, the audio is fast-forwarded as long as the 'fast-forward' audio command is in effect.
- if the agent input is a 'jump to audio start' audio command, the audio is played from the beginning. If the agent input is a 'skip section forward' audio command, the currently playing section of the audio is skipped, and the audio starts playing from the beginning of the next section. If the agent input is a 'skip section back' audio command, the currently playing section is skipped, and the audio starts playing from the beginning of the previous section.
- if the agent input is a 'volume up' audio command, the volume of the audio playback increases. If the agent input is a 'volume down' audio command, the volume of the audio playback decreases. It should be noted that the increase/decrease in the volume of the audio playback may have a predetermined magnitude.
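The pause toggle and the section-skipping commands can be illustrated with a small position-tracking sketch. No real audio is played here. The class `AudioPlayer` and its attributes are hypothetical, section boundaries are assumed to be known start times in seconds, and skipping forward from the last section is assumed to leave the position unchanged.

```python
import bisect


class AudioPlayer:
    """Tracks play state and position; section starts mark the demarcated sections."""

    def __init__(self, section_starts):
        # The audio always has a section that begins at 0.
        self.sections = sorted(set([0.0, *section_starts]))
        self.position = 0.0
        self.playing = True

    def pause(self):
        # One hot-key toggles between paused and playing; position is kept.
        self.playing = not self.playing

    def jump_to_start(self):
        self.position = 0.0

    def skip_section_forward(self):
        # Index of the first section start strictly after the current position.
        i = bisect.bisect_right(self.sections, self.position)
        if i < len(self.sections):
            self.position = self.sections[i]

    def skip_section_back(self):
        # Index of the section currently playing, then step one section back.
        current = bisect.bisect_right(self.sections, self.position) - 1
        self.position = self.sections[max(current - 1, 0)]
```

For example, with section starts at 10 and 25 seconds and the position at 12 seconds, a forward skip moves to 25 and a following backward skip moves to 10.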
- an appropriate preset equalizer from the available presets is selected.
- the agent may be given an option of toggling between the available presets.
- the available preset equalizers are displayed on the screen, and agent toggles and selects the preset of his/her choice by using the assigned hot-key.
- once the preset is selected, the pre-processed audio corresponding to the selected preset equalizer is played back to the agent. It should be noted that toggling between available audio preset equalizers can be done in real time, and it does not hamper the quality of audio being played. Only a shift from one preset equalizer to another is made.
- a plurality of audio presets can be provided to the agent. For example, these audio presets are:
- a separate equalization function corresponds to each of these presets. When selected, the presets provide desired effect to the audio. It should be noted that the preset equalizers mentioned here are for illustrative purpose only. A number of other similar preset equalizers can be provided according to the agent requirements, and it does not restrict the scope of the invention in any way.
- the selected preset can be displayed to the agent by using an appropriate symbol on the Graphical User Interface, i.e. on conversion and review screens.
- representative letters that denote different presets can be used to indicate the selected preset to the agent.
- if the agent has a preference for a particular audio preset, i.e. he or she is comfortable with listening to the audio processed in a particular way, the preset can be selected for the agent by default. However, the agent is still given a choice of selecting another audio preset at the time of audio playback.
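Toggling between preset equalizers with a single hot-key, with an optional per-agent default, can be sketched as below. The class `PresetSelector` is hypothetical, and so are the preset names used in the usage note, since the list of presets is not available in this text.

```python
class PresetSelector:
    """Cycles through the available audio preset equalizers."""

    def __init__(self, presets, default=None):
        self.presets = list(presets)
        if not self.presets:
            raise ValueError("at least one preset is required")
        # Start from the agent's preferred preset when one is stored.
        self.index = self.presets.index(default) if default is not None else 0

    @property
    def selected(self):
        return self.presets[self.index]

    def toggle(self):
        """Move to the next preset, wrapping around after the last one."""
        self.index = (self.index + 1) % len(self.presets)
        return self.selected
```

For instance, `PresetSelector(["flat", "voice_boost", "noise_cut"], default="voice_boost")` starts on `voice_boost`, and each press of the toggle hot-key would call `toggle()` once. The value of `selected` is what the GUI would indicate with a symbol or representative letter.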
Abstract
In human-agent-assisted automated voice to text conversion processes, several devices and methods are used to improve the speed of conversion while maintaining quality and accuracy. The current invention is an interface tool used between the human agent and the devices used in the voice to text conversion process. In one aspect of the embodiment of the current invention, this interface tool is used in the process of entering text into the text file in the conversion process. In another aspect of the current invention, this interface tool is used to review, compare, and edit a previously converted/entered text file against its related audio file. The review interface tool not only edits and corrects the previously converted text file but also provides input data to increase the overall predictive capabilities of the system.
Description
DEVICES AND METHODS USED IN THE PROCESSING OF CONVERTING AUDIO MESSAGES TO TEXT MESSAGES
Background Art
[01] As described in earlier patent applications (US Patent Application No. 2007/0054678, Doulton, "Method of generating a sms or mms text message for receipt by a wireless information device", US Patent Application No. 2007/0116204, Doulton, "METHOD OF PROVIDING VOICEMAILS TO A WIRELESS INFORMATION DEVICE", and US Patent Application No. 2007/0127688, Doulton, "Mass-Scale, User-Independent, Device-Independent Voice Messaging System"), the SpinVox voice to text conversion process employs many tools and techniques for converting an audio file, typically in the form of a telephonic voice message, into a text file. The text file is normally sent to the designated recipient by SMS, e-mail or another text messaging system.
Technical Problem
[02] In the SpinVox voice to text conversion process several tools and techniques are used. One important technique in the current SpinVox voice to text conversion process is to use human agents assisted by computer-based tools. The audio file is played to the human agent who listens to the audio file, and assisted by computer-based tools, intelligently converts the audio file into text in a text file. The current invention described herein, i.e. the voice message conversion tool described in this specification, is the interface tool used between the human agent and the computer-based tools in the SpinVox voice to text conversion process. This interface, known as Tenzing, is the interface tool used by the human agent in the SpinVox process of converting an audio file to a text file.
Technical Solution
[03] In one aspect of the embodiment of the current invention, this interface tool is used in the process of entering text into the text file in the conversion process. In another aspect of the current invention, this interface tool is used to review, compare, and edit a previously converted/entered text file against its related audio file. The review interface tool not only edits and corrects the previously converted text file but also provides input data to the processes and techniques used in the SpinVox process for the purpose of increasing the overall predictive abilities of the system. In an alternative embodiment of the current invention, the entering/converting and review/editing interface tools can be used together by the same agent or separately by different agents or their supervisors. In these processes (i.e. conversion and review/edit), the human agent guides predictive programs and processes to assist the agent in converting the audio file into a text file. These predictive programs and processes can use different and varied input data, such as speech recognition processes performed on the audio file, contextual information about the time and date of the original voice message, the duration of the audio file, the phone numbers of the caller and recipient, data gathered from previous calls from the caller, and the text input provided by the agent during both the conversion and review/editing processes.
[04] In another aspect of the embodiment of the current invention it has been found that in the conversion process delaying the entry of text into the text file at the beginning of the audio file listened to by the agent improves the overall quality/speed of the conversion process. The quality/speed of the conversion process is the measurement of two target objectives in the SpinVox process: the quality (i.e. accuracy of the converted message) and the speed (i.e. turnaround time) in receiving an audio file and converting it to a text file. As explained herein during the entering/conversion process a delay in the entry of text into the text file forces the human agent to listen to the beginning of the message and understand the context before beginning to enter text to the text file.
[05] Accordingly, another aspect of the embodiment of the current invention is the Listen, Action, Review ("L.A.R.") process, also called the "Quickflow" process, that combines the three processes described above: Listening (to the audio file for a set time); Action (i.e. entering/converting the audio file to a text file); and Review (reviewing/editing the converted text file based on its related audio file).
Description of Drawings
[06] Fig. 1 is a depiction of the Graphic User Interface (GUI) login screen used in an embodiment of the current invention.
[07] Fig. 2 is a depiction of the GUI priority language notification screen used in an embodiment of the current invention.
[08] Fig. 3 is a depiction of the GUI priority language selection screen used in an embodiment of the current invention.
[09] Fig. 4 is a depiction of the GUI message receiving notification screen used in an embodiment of the current invention.
[10] Fig. 5 is a depiction of the GUI of the voice message conversion tool used in an embodiment of the current invention.
[11] Fig. 6 is a depiction of the GUI of the settings used for the voice message conversion tool used in an embodiment of the current invention.
[12] Fig. 7 is a depiction of the GUI of the shortcut or hot-keys used for the voice message conversion tool used in an embodiment of the current invention.
[13] Fig. 8 is a flowchart showing the login process used in an embodiment of the current invention.
[14] Fig. 9 is a flowchart showing the time out processes used in an embodiment of the current invention.
[15] Fig. 10 is a flowchart showing an overview of the processes used in the voice message conversion tool used in an embodiment of the current invention.
[16] Fig. 11 is a flowchart showing an overview of the valid control commands used in the voice message conversion tool in an embodiment of the current invention.
[17] Fig. 12 is a flowchart showing the post conversion processes used in an embodiment of the current invention.
[18] Fig. 13 is a flowchart showing the audio control processes used in an embodiment of the current invention.
[19] Fig. 14 is a depiction of the GUI of the initial display in the review/editing tool used in an embodiment of the current invention.
[20] Fig. 15 is a depiction of the GUI of the word editing function used in the review/editing tool used in an embodiment of the current invention.
[21] Fig. 16 is a depiction of the GUI of the completed display used in the review/editing tool used in an embodiment of the current invention.
Mode for Invention
[22] In an embodiment of the present invention a Graphical User Interface (GUI) is used to display the various functions of the interface tools to the agent; and enables the agent to convert the audio into text. Figure 1 is an embodiment of the login screen of the GUI in accordance with the present invention. The login screen 100 includes a login window 102, where the agent is prompted to enter his authorization details. The agent enters his username and password in username field 104 and password field 106 respectively. After entering the username and password the agent presses the OK button 108. The authorization details are then verified and the agent is authorized to use the voice message conversion tool. In an alternative embodiment of the present invention, a finger print recognition system, an iris recognition system or any other suitable method can be employed to authorize the agent. An exit button 110 is provided in case the agent wishes to close the application and exit out of the tool.
[23] In an embodiment of the present invention, in case the login details entered by the agent are incorrect, a visual indication is shown by flashing and altering the color of an indication icon 112, and prompting the agent to re-enter the authorization details. However, it should be noted that any other method of indicating incorrect login details can be employed. In an embodiment of the present invention, when the agent fails to enter the login details for the third consecutive time, the application is closed.
[24] Moreover, a standard set of buttons 114 is provided to minimize, maximize and close the application. One of the buttons from the set of buttons 114 can be used in order to display the hot-keys panel. Similarly another button is used to display a settings panel. The settings panel and the hot-keys panel are further explained in conjunction with Figure 6 and Figure 7.
[25] The voice message conversion tool can support multiple languages. For the purposes of the current invention a language also includes its regional dialects as well as the language's linguistic designation. For example, English contains the languages US-English, UK-English, Australian-English, South African-English, etc. Accordingly an agent may be versed in several of the supported languages (e.g. US-English and South African-English). In such cases, the agent may be given a choice of prioritizing the languages for which he or she wishes to receive messages for conversion.
[26] Figure 2 and Figure 3 are diagrams showing the queue selection screens of the voice message conversion tool GUI. The queue selection screen 200 includes a queue selection window 202. A high priority language can be selected by using a high priority language selection button 204. Similarly, a low priority language can be selected by using a low priority button 206. When the agent wishes to make a selection of his/her priorities, he/she presses one of the two buttons 204 and 206; and is then taken to a second queue selection screen 300 of Figure 3. The second queue selection screen 300 includes two language selection options - selection for live message conversion, and selection for training message conversion.
[27] The voice message conversion tool may be used for both live message conversion and training of the agent. In case the agent is converting a voice message to text in real time, referred to as 'live conversion', he or she may choose a high priority language, by using the options of supported languages presented in a live queue selection box 302. Similarly, in case the agent is undergoing training on voice message conversion, referred to as 'training conversion', he or she may choose a high priority language for training, by using the options of supported languages presented in the training queue selection box 304.
[28] It should be noted that the second queue selection screen here is shown for illustrating high priority language selection only. A similar queue selection screen may appear when the agent wishes to select a low priority language by clicking on the button 206. In an embodiment of the present invention, the list of supported languages is presented by using the languages for which messages are available for conversion.
[29] The high priority queue is the one from which the agent receives the messages for conversion. When messages from the high priority queue are available for conversion, the tool automatically switches to the high priority queue and sends messages from it to the agent for conversion. In case there are no messages queued in the high priority queue, the tool will send messages from the low priority queue to the agent.
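The queue behavior described in paragraph [29] can be illustrated with a minimal sketch. The function name `next_message`, the use of in-memory queues and the message identifiers are assumptions made for illustration only; the specification does not describe how the queues are implemented.

```python
from collections import deque
from typing import Deque, Optional


def next_message(high: Deque[str], low: Deque[str]) -> Optional[str]:
    """Return the next message for the agent (hypothetical helper).

    Messages in the high priority queue are always served first; the low
    priority queue is used only while the high priority queue is empty.
    """
    if high:
        return high.popleft()
    if low:
        return low.popleft()
    return None  # nothing is waiting in either queue


# Illustrative message identifiers, not taken from the specification.
high_queue = deque(["uk-1"])
low_queue = deque(["us-1", "us-2"])

served = [next_message(high_queue, low_queue) for _ in range(2)]

# A new high priority message arrives while low priority work remains,
# so the tool switches back to the high priority queue.
high_queue.append("uk-2")
served.append(next_message(high_queue, low_queue))
```

In this sketch the agent receives the high priority message first, then one low priority message, and then the newly arrived high priority message ahead of the remaining low priority one.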
[30] In an embodiment of the present invention, the high and low priorities as selected by the agent can be stored by the voice message conversion tool; and can be automatically loaded when the agent logs into the tool next time. Further, depending on the languages of the messages the agent is likely to convert, default high and low language priorities can be presented to the agent. In another embodiment of the present invention, the agent may be given an option of not selecting the low priority language queue. A button enabling the agent not to select a low priority language queue can be presented on the second queue selection screen 300. In an embodiment of the present invention, the tabs representing language options in the live queue selection box 302 and training queue selection box 304 may have different colors. A cancel button 306 is provided, using which, the agent can go back to the queue selection screen 200 without making a selection for language priorities.
[31] Further, in an embodiment of the present invention, the system or agent's supervisors can select the language queue priorities for the agent. In this case, the agent does not need to make the selection of language priorities. In this embodiment the agent is taken to the message download dialogue box 400, bypassing the queue selection screens 200 and 300. Furthermore, the selection of the languages can be remotely made by the system administration or agent's supervisors in real-time. For example, when there is a smaller number of conversion requests in UK-English queue and the demand for US-English conversions is high; the system automatically shifts an agent from UK-English to US-English, when the agent is familiar with conversion of both languages.
[32] After selecting the high priority and low priority language queues, the agent can start receiving the messages for conversion by using a 'Start Converting' button 208. At the queue selection screen 202, an option of logging out of the system may be presented. This option can be selected by using a 'Log Out' button 210.
[33] In an embodiment of the present invention, a check-box 212 is presented for selecting only escalated messages for conversion. This option may be made available to managers of the agents only. The escalated messages are the messages, a part of which, an agent was not able to convert due to various reasons. Escalated messages are explained in detail in conjunction with Figure 12. In an embodiment of the present invention, the option of selecting only escalated messages for conversion by using the check-box 212 is presented only after a user authenticates himself as a manager or team leader.
[34] In an embodiment of the present invention, the agent is given a predetermined amount of time to make the selection of high and low priority queues. In case of time out, the system logs the agent out.
[35] In an embodiment of the present invention, a message download dialogue box 400 can be optionally presented on the GUI as shown in Figure 4. The message download dialogue box has a provision of showing which language queue the message is being retrieved from. The message download dialogue box is presented to the agent during the retrieval of messages from the message queue. It can also display to the user, the time required for downloading the message from the queue. A download status bar 406 displays the progress of message retrieval from the system. Any other similar indication method can also be used to display the progress of message retrieval.
[36] In an embodiment of the present invention, a 'Cancel' button 402 can be presented, using which, the user can log out of the system at this stage. Further, a visual indication icon such as a moving circle 404 or a sand clock can be shown in order to show that the system is busy retrieving the message. Once the message is downloaded, the agent is taken to the conversion screen for converting the message.
[37] Figure 5 is a diagram showing the conversion screen of the voice message conversion tool GUI, in accordance with an embodiment of the present invention. Conversion screen 500 includes a main conversion window 502, an audio waveform display 504, a status display box 510, and a speedometer 520. In an embodiment of the present invention, the agent is not permitted to start converting the message for a predetermined amount of time. During this phase, referred to as the 'listening phase', a dialogue box that displays an interrupt message is presented on the screen. An appropriate message such as 'Please listen to the message. Text entry disabled' can be displayed in the dialogue box. Alternatively, indicative visual icons may also be displayed, alerting the agent that the text entry is disabled. For example in an alternative embodiment of the current invention the listening phase is denoted by a loudspeaker icon until the predetermined time for disabling the keyboard as explained herein is finished, then a sliding icon marked "Action" slides across the top of the main conversion window 502 to denote that the keyboard has become active.
[38] As stated herein, during the listening phase keystrokes from the agent are disabled by the voice message conversion tool. This helps the agent in understanding the content of the message and converting the message efficiently. In an embodiment of the current invention the listening phase is 0.5 to 30 seconds long; in a better embodiment of the current invention the listening phase is 1.0 to 10 seconds long; in the preferred embodiment of the current invention the length of the listening phase is calculated to be ten percent (10%) of the duration of the message, but no less than three seconds.
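The preferred-embodiment rule in paragraph [38] amounts to a single calculation. The sketch below is illustrative only; the function name and parameter names are assumptions, while the ten percent fraction and the three second minimum come from the paragraph above.

```python
def listening_phase_seconds(message_seconds: float,
                            fraction: float = 0.10,
                            minimum_seconds: float = 3.0) -> float:
    """Length of the listening phase for a message of the given duration.

    Ten percent of the message duration, but never shorter than three
    seconds (values from the preferred embodiment).
    """
    return max(minimum_seconds, fraction * message_seconds)


short_message = listening_phase_seconds(20.0)  # 10% is 2 s, so the 3 s minimum applies
long_message = listening_phase_seconds(60.0)   # 10% is about 6 s
```

Text entry would remain disabled for the returned number of seconds after playback starts.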
[39] The audio waveform display 504 shows a visual representation of the audio waves of the message to be converted. In other words, the audio waveform display 504 shows a graph whose X axis represents time, and the Y axis represents the intensity or the gain level of the audio. The waveform is symmetrical about the axis 508. A moving bar 506 indicates the position on the waveform at the time the message is being played to the agent. In an embodiment of the present invention, the moving bar 506 can be represented as a semi- transparent bar which gives an indication of the length of message that has been played.
[40] At any point of time during the conversion of the message, the agent can go back and re-play the audio. In an embodiment of the current invention the agent can call on the "Forced Rewind" function and force the audio to rewind to the beginning of the audio by pressing the F5 key at any time when the keyboard is active (i.e. the predetermined disabled time period is over). Alternatively the agent can call on the "Forced Rewind" function by first pressing the space bar when the keyboard becomes active. At this time, the moving bar 506 can be moved back along the axis 508. This is explained in detail in conjunction with Figure 7. If the agent first presses a key other than the spacebar when the keyboard becomes active the spacebar key will then return to its normal functionality and enter spaces on the text file.
[41] Further, in an embodiment of the present invention if the first key pressed when the keyboard is active is a modification to the text file the playing of the audio file is paused for a period of time before it is restarted. In an embodiment of the current invention this audio pause can be 0.5 to 10 seconds long; in the preferred embodiment of the current invention the audio pause is five seconds long.
[42] In an embodiment of the present invention, the audio is displayed in form of a simple linear slider which moves along the time axis. In other words, a cursor on the slider shows the position of the audio being played on a linear timescale. Further, the audio waveform is divided into segments based on the pauses taken during the message. These segments are displayed by using audio tab lines 522 on the audio waveform or the linear slider. The agent can rewind, forward or replay the parts between the audio tab lines. Since the pauses are likely to be taken between two sentences, it automatically enables the agent to convert the message sentence after sentence.
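One way the audio tab lines of paragraph [42] could be located is to look for runs of low-level audio. The following is a minimal sketch under stated assumptions: the audio is represented as a sequence of per-frame level values, and the function name, the silence threshold and the minimum pause length are hypothetical, since the specification does not say how pauses are detected.

```python
from typing import List, Sequence


def audio_tab_lines(levels: Sequence[float],
                    silence_threshold: float = 0.05,
                    min_pause_frames: int = 3) -> List[int]:
    """Return frame indices at which a new segment starts after a pause.

    A pause is a run of at least `min_pause_frames` consecutive frames
    whose absolute level is below `silence_threshold` (both assumed values).
    """
    tab_lines: List[int] = []
    quiet_run = 0          # length of the current run of quiet frames
    heard_audio = False    # silence before the first audible frame is ignored
    for index, level in enumerate(levels):
        if abs(level) < silence_threshold:
            quiet_run += 1
            continue
        # Audible frame: if a long enough pause preceded it, a segment starts here.
        if heard_audio and quiet_run >= min_pause_frames:
            tab_lines.append(index)
        heard_audio = True
        quiet_run = 0
    return tab_lines


# Illustrative levels: speech, a three-frame pause, speech, a one-frame gap, speech.
example_levels = [0.0, 0.4, 0.5, 0.0, 0.0, 0.0, 0.6, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0]
example_tabs = audio_tab_lines(example_levels)
```

Here only the three-frame pause is long enough to produce a tab line; the single quiet frame and the trailing silence do not.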
[43] Additionally, in an embodiment of the present invention, a feature of slowing down or speeding up the message from its normal speed is made available to the agent. The agent slows down or speeds up the message using a certain keystroke combination. In an embodiment of the present invention, the agent can slow down or speed up the message by using the foot-pedals and other input devices integrated with the voice message conversion tool. In this case, an indication that the message is being played slow or fast can be displayed along with the moving bar of the audio waveform or the cursor of the linear slider. Further, the factor by which the message is slowed down or sped up can also be displayed along with this indication. For example, if the speed of the message is halved, an indication of 0.5X is displayed on the moving bar 506.
[44] In an embodiment of the current invention, if the agent listens to a particular section for more than a pre-determined number of times, immediate subsequent playing of that section of audio is automatically slowed down. The slowdown is retained till the agent proceeds to another section. In an embodiment of the present invention, the pre-determined number of times is 3, i.e. if the agent listens to a particular section for more than three times, playback of that section of the audio is automatically slowed down. In an embodiment of the current invention the slowdown of the audio is in the range of 20% to 80% of the original speed; in a better embodiment of the current invention the slowdown of the audio is in the range of 33% to 66% of the original speed; in the preferred embodiment of the current invention the slowdown of the audio is fifty percent (50%) of the original speed.
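The automatic slowdown of paragraph [44] can be sketched as a small counter per section. The class and attribute names are hypothetical. The sketch also assumes that the first three plays of a section run at normal speed and that the fourth and later plays are slowed, which is one reading of "more than three times"; the threshold of 3 and the 50% rate are the values given above.

```python
class SectionPlayback:
    """Tracks repeated listening to one section (illustrative sketch only)."""

    def __init__(self, max_normal_plays: int = 3, slow_rate: float = 0.5):
        self.max_normal_plays = max_normal_plays  # pre-determined number of times
        self.slow_rate = slow_rate                # 50% of the original speed
        self.current_section = None
        self.play_count = 0

    def play(self, section: int) -> float:
        """Record one play of `section` and return the playback rate to use."""
        if section != self.current_section:
            # Proceeding to another section ends any slowdown.
            self.current_section = section
            self.play_count = 0
        self.play_count += 1
        if self.play_count > self.max_normal_plays:
            return self.slow_rate
        return 1.0


playback = SectionPlayback()
rates_same_section = [playback.play(0) for _ in range(5)]
rate_next_section = playback.play(1)
```

In this sketch the rate drops to 0.5 on the fourth play of a section and returns to 1.0 as soon as a different section is played.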
[45] In an embodiment of the current invention, the conversion screen includes a properties window 510. The properties window 510 displays certain parameters such as message language 512, message ID 514, length of the message audio 516 and recipient 518. These parameters are provided for the agent's reference. It should be noted that any other parameters relevant to the message may also be displayed in the properties window 510.
[46] The conversion screen also includes a speedometer 520. The speedometer 520 displays the conversion speed of the agent. The Agent Conversion Ratio (ACR) is broadly defined as the ratio of the amount of time taken by the agent to convert an audio file into text, to the length of the audio file. In an embodiment of the present invention, the speedometer 520 is calibrated in terms of the ACR. The pointer of the speedometer 520 can indicate at what ACR the agent is converting the message, at any given point of time. In an alternative embodiment of the current invention the ACR can be an average, mean, mode or other statistical calculation representing an evaluation of the speed at which the agent is converting messages. When there is a permitted ACR range for the conversion, the speedometer displays a color indication that the agent's ACR measurement is within the permitted ACR range. For example, if the permitted range for the ACR is 6 or less (for example, between 4 and 6), the speedometer 520 displays a green color within that range. An ACR above 6 will be represented in red color.
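The ACR definition and the color indication in paragraph [46] can be expressed as a short sketch. The function names are hypothetical, and the sketch assumes that any ACR at or below the maximum permitted value of 6 is shown in green, since the specification only states that values above 6 are shown in red.

```python
def agent_conversion_ratio(conversion_seconds: float, audio_seconds: float) -> float:
    """ACR: time taken to convert the audio file divided by the audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return conversion_seconds / audio_seconds


def speedometer_color(acr: float, max_permitted_acr: float = 6.0) -> str:
    """Green inside the permitted range, red above the maximum permitted ACR."""
    return "green" if acr <= max_permitted_acr else "red"


# A 30 second message converted in 150 seconds, and the same message in 240 seconds.
acr_within_range = agent_conversion_ratio(150.0, 30.0)
acr_too_slow = agent_conversion_ratio(240.0, 30.0)
colors = [speedometer_color(acr_within_range), speedometer_color(acr_too_slow)]
```

A lower ACR means faster conversion, so the red region corresponds to an agent who is taking longer than permitted.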
[47] In previous versions of Tenzing the speedometer was replaced by a numerical display which displayed the actual ACR number to the agent. This previous display adversely affected the quality/speed of the conversion process because the agent would work too quickly and with less accuracy in an effort to continually lower the ACR number displayed. As a result, the overall quality of the conversion could degrade and fall below the permitted quality. The use of the speedometer aims to alleviate this problem and make the agent work within an acceptable ACR range in order to maintain the highest quality/speed in the conversion process. Where there is a permitted range of ACR and the ACR is displayed using the speedometer 520, the agent tries to stay within the range and does not try to convert too quickly, which can produce poor quality conversions. Therefore, at times it is desired that the agent is not shown the numeric ACR during the conversion.
[48] In a preferred embodiment of the present invention, the speedometer 520 is configured to display whether the agent is in a permitted range of ACR or not. The actual permitted range of the ACR is kept unknown to the agent; however the agent is shown whether he is in the range of permitted ACR. In this case, the region of the permitted range of ACR can be displayed in green color; and the region of the ACR range above the maximum permitted value can be displayed in red color. The agent has to just convert the message and make sure that his speedometer arrow is maintained in the green region.
[49] In another embodiment of the present invention, the speedometer 520 is in the form of a linear bar. Movable pointers indicating the current ACR and the target or permitted ACRs are displayed along the linear bar. As in the circular speedometer, the linear meter can have different colors representing the permitted ACR range and the range above permitted ACR.
[50] In various other optional embodiments of the present invention, parameters such as hourly conversion rate for the agent, total number of messages converted for which the ACR is displayed by the speedometer 520, average ACR for the day and so forth may be displayed along with the speedometer 520.
[51] Also, when calculating the ACR of an agent, the first few messages may give a false indication of the ACR. Additionally, the agent might take some time to get into the mode of converting the messages at his or her usual speed. Therefore, in an embodiment of the invention, the speedometer 520 is displayed to the agent only after five converted messages. However, the calculation for ACR can be made from the first message the agent converts. Therefore, the average ACR after the first five messages gives a more realistic indication of the agent's ACR. In this case, an appropriate message such as 'The speedometer will be active in another 3 messages' can be displayed in place of the speedometer 520. In an alternative embodiment of the present invention, the ACR can be displayed from the time the agent starts conversion.
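The warm-up behavior of paragraph [51] can be sketched as follows. The function name and the text used for the active speedometer are assumptions; the five-message threshold and the placeholder wording come from the paragraph above. The ACR is recorded from the first message, but a reading is shown only once five messages have been converted.

```python
from typing import Sequence


def speedometer_display(acr_history: Sequence[float], warmup_messages: int = 5) -> str:
    """Text shown in place of, or alongside, the speedometer (illustrative)."""
    remaining = warmup_messages - len(acr_history)
    if remaining > 0:
        unit = "message" if remaining == 1 else "messages"
        return f"The speedometer will be active in another {remaining} {unit}"
    # The average includes every message converted so far, starting from the first.
    average_acr = sum(acr_history) / len(acr_history)
    return f"Average ACR: {average_acr:.1f}"


during_warmup = speedometer_display([5.0, 6.0])
after_warmup = speedometer_display([4.0, 5.0, 6.0, 5.0, 5.0])
```

After two converted messages the placeholder text is shown; after five, the sketch reports the average of all five readings.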
[52] As explained herein in an embodiment of the current invention when the listening phase is over, either the dialogue box 522 is removed from the screen, or the "Action" slider icon appears on the top of the main conversion window 502 to denote that the agent is allowed to enter text into the text file.
[53] In an embodiment of the present invention, the voice message conversion tool works in conjunction with predictive text tools and processes that estimate the word or phrase of the sentence that the agent is going to enter. When the predicted text matches the audio being played, the agent can simply accept it. An example of the procedure that the agent follows for converting the message is explained in detail in conjunction with Figure 10.
[54] In addition to the predictive tools, an embodiment of the current invention contains functionalities to assist the agent in entering text into the text file to increase the quality/speed objectives of the audio file to text file conversion process.
[55] In the review stage, the window comprises a review indication box that displays an appropriate message such as 'Message Review - Please check the message', or a slider icon marked "Review" moves along the top of the main review window, to inform the agent that they are in the review/edit stage of the SpinVox process. In the review/edit stage, the agent, supported by review process tools, can correct punctuation and spelling, correct the case of the words and so forth.
[56] In an embodiment of the invention, the system gives 'in-line' suggestions to the agent in review phase. In other words, in case one of the words is spelt incorrectly, or is repeated, or if the punctuation is wrong, the system provides a drop down list in-line with the text, offering the agent a number of probable alternatives to select from. When the agent clicks on the correct alternative, the word is replaced.
[57] In an embodiment of the invention, the words that the system finds inappropriate/misspelled are underlined or displayed in a different color. The agent may change these words after listening to the audio. In-line suggestions, as explained above can also be displayed to aid the agent in quickly selecting an appropriate word.
[58] In an optional embodiment of the invention, a time delay on the audio being played during the review stage can be applied. The agent reads the converted text for a time period before the audio starts playing. This provision lets the agent read through the message even before the audio starts playing. The agent can thus spot mistakes, decide whether the corrections offered by the system could be right and so forth. Alternatively, the audio can be played back as soon as the review window is displayed to the agent.
[59] When the agent finishes reviewing the message and making any required changes to it, he can submit it to the system using a keystroke combination.
[60] Figure 6 is a diagram showing a settings panel of the voice message conversion tool. The settings panel 600 includes a text settings window 602, an audio settings window 604, a user settings window 606, a foot-pedal settings window 608, an accept button 610 and a cancel button 612. One of the buttons from the standard set of buttons 114 invokes the display of the settings panel, when the agent wants to change his or her settings. Following is a list of the settings options and their functions that can be made available to the agent by using the settings panel 600:
[61] Text settings window 602:
1. Text size: The text size option enables the user to change the size of the text being displayed on the screen. In accordance with various embodiments of the present invention, this option can be presented to the agent in the form of a slider, numeric values of the text size and the like, using which the agent can configure the size of the text displayed.
2. Color: This option enables the agent to change the color of the text displayed on the screen. In an embodiment of the invention, the text displayed has two components - agent inputted text and predicted text. The color settings enable the agent to select different colors for the agent inputted text and the predicted text.
[62] Audio settings window 604:
1. Play count: This option enables the agent to select how many times the section of the audio should be repeated. For example, if the agent enters 2 in this option, the section of the audio demarcated by vertical lines on the audio waveform is played to the agent twice before moving on to the next section.
2. Skip back size: This option enables the agent to select the duration for which the audio goes back when the agent rewinds it. In an embodiment of the invention, this can be made available to the agent in the form of a slider. However, a provision to enter numeric values in seconds can also be made available.
3. Audio display type: This option enables the agent to select which type of audio display he prefers to be displayed on the conversion screen - audio waveform display or a simple linear slider. These options have already been explained in conjunction with Figure 5.
4. Playback volume: This additional option enables the agent to set an overall playback volume. In an embodiment of the invention, the option of selecting the playback volume can be enabled using a hot-key. This is explained in detail in conjunction with Figure 9. Alternatively, the agent may also change the volume by using the volume control of his headset.
[63] User settings 606:
[64] The agent can select an option of receiving escalated messages only. As explained earlier, this option may be made available to the managers of the agents only. The same option is also available on the queue selection screen 200.
[65] Foot pedal settings 608:
[66] In an embodiment of the invention, when the voice message conversion tool is integrated with a set of foot pedals, this option lets the agent alter the default functions of the left and right foot pedals. The foot pedal functions are explained in detail later in the specification.
[67] The settings panel 600 also has options of accepting the settings by using the accept settings button 610, or cancelling the changed settings by using the cancel settings button 612. In an embodiment of the present invention, a 'Reset to default settings' button may be provided, using which, the agent can reset the voice message conversion tool to its default settings.
[68] Figure 7 is a diagram showing a hot-keys panel in accordance with an embodiment of the invention. The hot-keys panel 700 includes a message control keys window 702, an audio control keys window 704, and a miscellaneous control keys window 706. A hot-key is a combination of keys on the keyboard, which, when pressed, enables the agent to perform an action. One of the buttons from the standard set of buttons 114 invokes the display of the hotkeys panel, which enables the agent to display and modify the hot-keys. Following is a list of the hot-keys and their functions that can be made available to the agent by using the hot-keys panel 700:
[69] Message control keys window 702:
1. Re-queue message: This option lets the agent re-queue the current message. The default hot-key combination for re-queuing the message is 'Ctrl + LLL'.
2. Send message: When the agent has finished converting the message, he or she can submit it. This has already been explained in conjunction with Figure 5 and Figure 6. The default hot-key for sending the message is 'Ctrl + Enter'.
3. Hangup: In case the agent detects a hangup in the message, i.e. there is no message to be converted; he can submit it as a hangup. The default hot-key for submitting the message as hangup is 'Ctrl + H'.
4. Wrong Language: In an embodiment of the present invention, in case the agent gets a message which is not from his or her language priority, the agent can re-queue the message by pressing this hot-key combination. The default hot-key for submitting the message as wrong language is 'Ctrl + W'.
[70] In an embodiment of the invention, when the agent detects a wrong language and presses the wrong language hot-key, a pop-up box is presented. The pop-up box has a list of supported languages. The agent can re-queue the message in one of the languages by making the selection. In case the agent is not able to identify the probable language of the message, he or she can use an unknown language option. When the unknown language option is selected, the message is escalated to a manager for a language check.
[71] 5. Send to Manager: In an embodiment of the present invention, the agent can send the message to manager for escalation. The message is thereafter made available to a manager. The default hot-key for sending the message to the manager is 'Ctrl + M'.
[72] 6. Unconvertible: In case the message is unclear or a substantial number of words of the message are not clear to the agent, he or she can submit the message as unconvertible. The default hot-key for submitting the message as unconvertible is 'Ctrl + U'.
[73] 7. Send and logout: This is an optional hot-key offered to the agent. While converting the last message of a session, the agent can submit it and log out from the system. The default hot-key combination for sending and logging out is 'Ctrl + Alt + Enter'.
[74] Audio control keys window 704:
1. Pause: The audio can be paused and resumed for playback by using the default hotkey 'Escape'.
2. Fast-forward: The audio can be fast forwarded by using the default hot-key 'PGUP'. In an embodiment of the invention, the interval for fast forwarding the message can be selected by the agent.
3. Rewind: The agent can rewind the audio by using the default hot-key 'PGDN'. In an embodiment of the invention, the interval for rewinding the message can be selected by the agent.
4. Jump to audio start: The agent can jump to the beginning of the audio file. The default hot-key for jumping to the beginning of the audio is 'F1'.
5. Skip section forward: The agent can skip a section of message demarcated by audio tab lines on audio waveform display. The default key for skipping a section forward is 'Ctrl + PGUP'.
6. Skip section back: The agent can skip a section of message demarcated by vertical lines on audio waveform display, backwards. The default key for skipping a section backward is 'Ctrl + PGDN'.
7. Volume Up: The agent can increase the volume of the audio playback using a hotkey. The default hot-key for increasing the volume is 'Ctrl + Add'.
8. Volume Down: The agent can reduce the volume of the audio playback using a hotkey. The default hot-key for reducing the volume is 'Ctrl +'.
9. Slow down: As explained earlier, the agent can slow down the audio playback. The default hot-key for slowing down the audio playback is 'Ctrl + S'. In an embodiment of the present invention, the extent of slowing down the audio playback is specified by default. For example, slowing down reduces the speed of the audio playback by half, i.e. plays it at 0.5X the normal speed. In another embodiment of the present invention, the agent can slow down the audio playback step-by-step. For example, hitting this hot key every time allows the agent to slow down the audio playback by an interval of 0.1X.
10. Speed Up Playback: The agent can speed up the audio playback. The default hot-key for speeding up the audio playback is 'Ctrl + F'. In an embodiment of the present invention, the extent of speeding up the audio playback is specified by default. For example, speeding up doubles the speed of the audio playback, i.e. plays it at 2X the normal speed. In another embodiment of the present invention, the agent can speed up the audio playback step-by-step. For example, hitting this hot key every time allows the agent to speed up the audio playback by an interval of 0.1X.
11. Normal Playback: When the audio is slowed down or is running at a faster speed and the agent wishes to bring it back to the normal speed, he can use this hot-key combination. The default hot-key for selecting normal playback is 'Ctrl + M'.
[75] In an embodiment of the invention, a number of audio preset equalizers are available for the agent to select from. When one of these audio presets is selected, the audio is pre-processed using the selected preset equalizer, and played back to the agent. This helps in improving quality of the audio that the agent listens to. A separate hot-key can be assigned to select an audio preset. In addition, a hot-key can be assigned to toggle between the available audio presets. Such hot-keys and their functions can also be displayed in the audio control keys window 704. The default hot key for toggling between presets and selecting a preset is 'F5'.
[76] Miscellaneous control keys window 706:
1. Insert recipient name: The agent can insert the recipient name in a message by using this hot-key. The default hot-key for inserting the recipient name in a message is 'Ctrl + N'.
2. Hotkeys panel: As an alternative to the buttons in the standard set of buttons 114, the hotkeys panel can be displayed by using this hot-key. The default hot-key for displaying hot-keys panel is 'F2'.
3. Settings panel: As an alternative to the buttons in the standard set of buttons 114, the settings panel can also be displayed by using a hot-key. The default hot-key for displaying settings panel is 'F3'.
4. Run spell check: The agent can run the spell check option by using the default hotkey 'F7'.
5. Show first names: A list of first names starting with the agent input is displayed on the screen when the agent presses this hot-key combination. In an embodiment of the invention, when the hot-key is pressed without any agent input character being present on the screen, a list of most common first names is displayed. The default hotkey for displaying the list of first names is 'Alt + F'.
6. Show second names: Similarly, a list of second names starting with the agent input is displayed on the screen when the agent presses this hot-key combination. In an embodiment of the invention, when the hot-key is pressed without any agent input character being present on the screen, a list of most common second names is displayed. The default hot-key for displaying the list of second names is 'Alt + F'.
[77] In an embodiment of the present invention, the agent can customize the hot-keys according to his ease. This can be done by clicking on the respective hot-key and specifying a desired combination. In case the combination of hot-keys is already in use, the GUI displays an appropriate message. The agent is then prompted to select another combination.
[78] Further, an option of resetting the hot-key combinations to default combinations is provided. The agent can use this option by pressing the reset to default button 908. In an embodiment of the invention, the agent defined hot-keys can be stored in the agent's profile and can be made available to the agent the next time.
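The customization behaviour described in paragraphs [77] and [78] can be sketched as a small registry that rejects a combination already in use and can restore the defaults. This is a minimal illustration only: the class name `HotkeyRegistry`, the action names, and the message wording are assumptions, and only a subset of the default hot-keys listed above is included.

```python
# Subset of the default hot-keys named in the description; the action
# identifiers on the left are hypothetical labels chosen for this sketch.
DEFAULT_HOTKEYS = {
    "slow_down": "Ctrl+S",
    "speed_up": "Ctrl+F",
    "normal_playback": "Ctrl+M",
    "insert_recipient_name": "Ctrl+N",
    "hotkeys_panel": "F2",
    "settings_panel": "F3",
    "spell_check": "F7",
}


class HotkeyRegistry:
    """Holds the current action -> key-combination bindings for one agent."""

    def __init__(self, defaults=DEFAULT_HOTKEYS):
        self._defaults = dict(defaults)
        self.bindings = dict(defaults)

    def rebind(self, action, combo):
        """Try to bind `combo` to `action`; return (ok, message)."""
        if action not in self.bindings:
            return False, f"Unknown action: {action}"
        for other, used in self.bindings.items():
            if used == combo and other != action:
                # Combination already in use: the agent must pick another.
                return False, f"{combo} is already used by {other}"
        self.bindings[action] = combo
        return True, "ok"

    def reset_to_default(self):
        """Restore every binding to its default combination."""
        self.bindings = dict(self._defaults)
```

Storing `bindings` in the agent's profile between sessions, as paragraph [78] mentions, would be a matter of serializing this dictionary; that part is not shown.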
[79] In an embodiment of the present invention, in case the voice message conversion tool is not able to retrieve the message from a message queue server, it displays an appropriate message to the user. The user can wait till the time the tool is able to retrieve the messages.
[80] In an embodiment of the present invention, the voice message conversion tool can be used to convert the voice messages to post them directly on the internet. For example, the messages can be posted in form of blog posts on various blogs. In such case, a suitable warning that the message is going to be converted into a blog post, of which there can be a number of potential readers is displayed to the agent. This warning is typically displayed before the agent starts converting the message.
[81] In an embodiment of the invention, the voice message conversion tool is integrated with a foot-pedal system. The agent is provided with a left foot pedal, a centre foot pedal and a right foot pedal. In an embodiment of the invention, the foot pedals are pre-configured for their functions. Their functionalities can be as follows:
1. Left pedal press: Rewind audio
2. Centre pedal press: Start audio
3. Centre pedal de-press: Stop audio
4. Right pedal press: Fast forward audio
[82] In another embodiment of the invention, the agent can customize the functions of the foot pedals to suit his ease in working. An appropriate pop-up window for customizing the functions of the foot pedal can be provided. This foot-pedal functions customization window can be invoked by using a hot-key combination.
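The pre-configured pedal functions, and the customization mentioned in paragraph [82], amount to a lookup from a pedal event to an audio action. A minimal sketch follows; the event and action names are hypothetical labels, and "release" stands for the de-press of the centre pedal.

```python
# Default mapping taken from the list above; key and value names are
# illustrative labels, not identifiers from the source.
DEFAULT_PEDAL_ACTIONS = {
    ("left", "press"): "rewind",
    ("centre", "press"): "start",
    ("centre", "release"): "stop",
    ("right", "press"): "fast_forward",
}


def pedal_action(pedal, event, mapping=DEFAULT_PEDAL_ACTIONS):
    """Return the audio action for a pedal event, or None if unmapped."""
    return mapping.get((pedal, event))


def customize(mapping, pedal, event, action):
    """Return a new mapping with one pedal event reassigned."""
    updated = dict(mapping)
    updated[(pedal, event)] = action
    return updated
```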
[83] Figures 8 through 13 are flow diagrams showing the detailed steps of processes used in an embodiment of the current invention. Referring to Figure 8, at step 802, the invention displays a login window to the agent. At step 804, the login credentials of the user are accepted by the system. In an embodiment of the present invention, the login credentials include a username and a password. In another embodiment of the invention, the login credentials include biometric information such as fingerprints or iris recognition. At step 806, the login credentials are matched with those in the database. In other words, it is checked whether the agent is authorized to use the application. In case the agent is not authorized, a flashing icon may be displayed to the agent. In this case, the invention displays the login screen for subsequent attempts.
[84] When the agent is successfully authorized, the invention displays a queue selection window for the agent to select a high priority language queue and a low priority language queue, at step 808. When the agent wishes to make a selection of his priorities, he is then taken to a queue selection screen that includes two language selection options - selection for live message conversion and selection for training message conversion. The agent makes a language queue selection at step 810. In an embodiment of the invention, the queues for the agent are selected by default. Moreover, the selection from one agent session is stored in memory and the same queues are selected for the subsequent sessions.
[85] The queue selection window 202 includes an option of starting the conversion by pressing Start Conversion button 208. It is checked at step 812 whether the agent wants to start conversion. When the agent wishes to start the conversion, he or she is presented with an audio file to be converted and a text file. The procedure for this is explained in conjunction with Figure 9.
[86] It is checked at step 814 whether the agent wishes to log out of the system. In case of logout, the system logs the agent out at step 816. In case the agent does not wish to log out, the queue selection window is displayed to him or her again, i.e. steps 808 onwards are repeated.
[87] In an embodiment of the present invention, an inactivity timer T2 is run to count a second predetermined time, while waiting for an input from the agent at the queue selection window. In case the agent provides an input before the inactivity timer T2 counts over the second predetermined time, appropriate action is taken as explained above. If the inactivity timer T2 is over counting the second predetermined time without any input from the agent, the agent is logged out from the application. Thereafter, the steps 802 onwards are repeated.
[88] Figure 9 shows the initiation of the conversion process referred to at step 812. If the agent chooses to start conversion at step 812, the agent receives an audio file with a text file from the system at step 902.
[89] When an agent receives an audio file along with a text file, a first predetermined time is set for the agent to listen to the audio file. In an embodiment of the present invention, at step 904, modification to the text file is disabled for the first predetermined time. A timer T1 starts to count the first predetermined time. In an embodiment of the present invention, the audio file begins to play automatically as in step 904. In another embodiment of the present invention, audio controls are always enabled. The purpose of the first predetermined time is to let the agent understand the contents of the audio file before the agent actually starts modifying the text file.
[90] In an embodiment of the present invention, the text file may contain text, predicted based on the words in the audio. In an alternative embodiment of the present invention, the text file may contain partially converted text along with the predicted text. In another embodiment of the present invention, the text file may be empty. In various other embodiments of the present invention, the text file contains a combination of one or more of the above mentioned possibilities.
[91] Predictive text is a set of one or more words that the system suggests based on the audio to aid the agent in the conversion process. Predictive text can be calculated in real time, based on various factors, such as the words in the audio (obtained by using an automatic speech recognition algorithm), agent input, and other information extracted from a database. The database may contain information related to previously converted audio files. The agent may use one or more hot keys to accept the set of one or more words suggested by the system, if the suggested words match the words comprehended by the agent in the audio file. If the agent finds the suggested words to be different, the agent has to type the words based on the audio. The system revises the predictive text in real time, based on each character entered by the agent.
[92] Further, certain hot-keys, such as 'Tab' and 'right arrow', can be assigned to aid the agent in accepting the system predicted text. By simply pressing these keys, the agent can accept the predicted text; he or she does not have to type the entire word using the keyboard. Therefore, the time required in converting the audio into text is reduced.
1. When the agent presses the 'tab' or 'enter' key, the complete word as predicted by the system is accepted in the text file. Similarly, when the agent presses the right arrow key, a single letter of the predicted word is accepted in the text file. The system then recalculates the predictive text and displays it on the screen in real time.
2. In an embodiment of the present invention, when the text file is permitted to contain only a limited number of characters, and the message exceeds this limit, the system submits a text file containing the first limited number of characters of the message. In another embodiment of the present invention, the agent can alter the converted text to fit the text file into the limited number of characters.
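The key handling in paragraph [92] can be illustrated with a toy predictor. In the sketch below the prediction is a plain prefix match against a word list, which is a stand-in assumption: the description bases predictions on speech recognition output, agent input, and a database. The function names and key labels are hypothetical.

```python
def predict(prefix, vocabulary):
    """Return the first vocabulary word that extends `prefix`, else None."""
    if not prefix:
        return None
    for word in vocabulary:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word
    return None


def handle_key(typed, key, vocabulary):
    """Apply one key press to the partially typed word and return it."""
    suggestion = predict(typed, vocabulary)
    if key in ("tab", "enter") and suggestion:
        return suggestion                       # accept the whole word
    if key == "right" and suggestion:
        return typed + suggestion[len(typed)]   # accept a single letter
    if len(key) == 1:
        return typed + key                      # ordinary character
    return typed                                # nothing to accept


def truncate_to_limit(text, limit):
    """Keep only the first `limit` characters, as in list item 2."""
    return text[:limit]
```

After every call to `handle_key`, calling `predict` again on the returned text corresponds to the system recalculating the predictive text in real time.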
[93] In an embodiment of the present invention, the first predetermined time is set to a predetermined time period. During this time period the agent input is disabled, resulting in the agent focusing on the content of the audio. This results in efficient conversion of audio to text. It should be noted that this time can be reduced or increased, based on the length of the audio. For example, for an audio file less than the predetermined time in length, the disabled time can be reduced to a shorter length of time. The ranges mentioned here are for better understanding of the concept of the agent listening to a portion of audio without any input. Based on the requirements of the conversion, various other alterations in the time spans can be made, without affecting the scope of the present invention.
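One way to scale the disabled period with the audio length is the rule recited in the claims of this document: one-tenth of the audio duration, but no less than three seconds. The sketch below applies that rule; capping the result at the audio length itself is an added assumption, made so that a very short audio file is not locked for longer than it plays. The function and parameter names are hypothetical.

```python
def lockout_seconds(audio_seconds, fraction=0.1, floor=3.0):
    """Return how long text modification stays disabled for one audio file."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    wanted = max(floor, audio_seconds * fraction)
    # Assumption: never lock for longer than the audio itself lasts.
    return min(wanted, audio_seconds)
```

For instance, under these defaults a 20 second message is locked for the 3 second floor, while a 2 second message is locked for its full 2 seconds.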
[94] At step 906, the system continuously checks whether the timer T1 is over. A polling mechanism can be used for performing this check. After the first predetermined time is over, text file modification is enabled at step 908. In an embodiment of the present invention, some keys of the keyboard continue to be disabled. Such keys include some special characters that are not desired to be present in the converted text file. In addition, use of 'caps lock' can also be disabled when the system automatically corrects the converted text for correct case.
[95] In an embodiment of the current invention, if the agent starts typing immediately after the first predetermined time gets over, without listening to the audio from the beginning, the audio automatically pauses. This helps in ensuring that a better conversion quality is in place.
[96] Following the enablement of text file modification, at step 910, the system starts an inactivity timer T2 to check for inactivity of the agent for more than a second predetermined period of time. It should be noted that the inactivity timer T2 can also be started before the enablement of text file modification, and it does not affect the scope of the invention in any way. While executing steps 912 and 914, the system keeps checking if there is an input from the agent while the inactivity timer T2 is running. If the agent provides an input to the system within the second predetermined period of time, an appropriate action is taken, the process for which is explained in conjunction with Figure 11. If the inactivity timer T2 is over without the system getting any input from the agent, at step 916, a warning timer T3 starts to count a third predetermined period of time. In an embodiment of the present invention, a warning message flashes on the screen, indicating to the agent that the message will be re-queued if the agent does not strike any key.
[97] In an embodiment of the present invention, an audio indication, such as a beep is used to alert the agent about inactivity. In another embodiment of the invention both audio indication and warning message are used. The warning message can also be displayed with a visual indication such as a flashing icon.
[98] While executing steps 918 and 920, the system keeps checking for the agent input while the timer T3 is running. If the agent provides input to the system within the third predetermined period of time, the input is processed by the system, the procedure for which is explained in conjunction with Figure 11. If the third predetermined period of time passes without the agent striking any key, the audio file along with the text file is re-queued. Thereafter, steps 808 onwards are repeated as explained in conjunction with Figure 8.
[99] If the warning timer T3 counts over the third predetermined time period without any input from the agent, the audio file along with the text file is re-queued at step 922. Thereafter, steps 808 onwards are repeated.
[100] In an embodiment of the present invention, the inactivity timer T2 is in operation during the time the application is running. It is interrupted when there is an agent input.
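The interplay of the inactivity timer T2 and the warning timer T3 can be sketched as a function of the time elapsed since the last agent input. This is an illustrative model under stated assumptions: the class name, the state labels, and the use of caller-supplied timestamps (instead of real timers or polling) are choices made for the sketch, and the durations are placeholders because the description does not fix them.

```python
class InactivityMonitor:
    """Tracks idle time against the T2 (inactivity) and T3 (warning) periods."""

    def __init__(self, t2_seconds, t3_seconds):
        self.t2 = t2_seconds
        self.t3 = t3_seconds
        self.last_input = 0.0

    def on_input(self, now):
        """Any agent input interrupts the timers and restarts T2."""
        self.last_input = now

    def state(self, now):
        """Return 'active', 'warning' (T3 running) or 'requeue'."""
        idle = now - self.last_input
        if idle < self.t2:
            return "active"
        if idle < self.t2 + self.t3:
            return "warning"   # show the warning message and/or beep
        return "requeue"       # re-queue the audio file and text file
```

A caller would poll `state` periodically; on 'requeue' the audio file and the text file go back to the queue and the agent is returned to the queue selection window.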
[101] Referring to Figure 10, the invention accepts the agent input at 1002. In an embodiment of the present invention, the agent input can be a control command, an audio command, a miscellaneous command or a text command. A check is performed to detect the type of command, based on the agent input.
[102] A check whether the agent input is a control command is performed at step 1004. In case the agent input is a control command, an appropriate action is taken, procedure for which is explained in conjunction with Figure 11 and Figure 12.
[103] Similarly, a check whether the agent input is an audio command is performed at step 1006. In case the agent input is an audio command, an appropriate action is taken, the procedure for which is explained in conjunction with Figure 13.
[104] A check whether the agent input is a system command is performed at step 1008. In case the agent input is a system command, it is executed at step 1010. The miscellaneous commands are inputted by using a hot-key combination from the miscellaneous control keys window 706. The appropriate actions that the system takes have already been explained in the description of these hot keys.
[105] In an embodiment of the present invention, the miscellaneous commands include system commands, such as displaying the hotkeys panel, displaying the settings panel, insert recipient name and the like. The miscellaneous commands also include the assist commands, such as running a spell check, show first names, show second names and the like. It should be noted that in the described embodiment of the present invention, the agent inputs the miscellaneous commands. However, in another embodiment of the invention, the miscellaneous commands can be made a part of the user interface. In other words, the agent need not always enter some commands such as the assist commands. The system may intelligently assist the agent with these functions.
[106] A check whether the agent input is a text command is performed at step 1012. In case the agent provides a text input, a check for its validity is performed at step 1014. In an embodiment of the present invention, an invalid text input may be 'caps lock' and punctuation marks such as semi-colon, colon, exclamation mark and the like. If the agent input is an invalid text input, the process is repeated from step 910. If the agent input is valid, the system accepts the input at step 1016. In an embodiment of the present invention, the system calculates the predictive text at step 1018 based on the input provided by the agent; and displays the calculated predictive text at step 1020. Thereafter, the process is repeated from step 910 onwards.
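The sequence of checks in paragraphs [101] through [106] can be pictured as a simple classifier followed by a validity test for text input. The sketch below is illustrative only: the command identifiers are hypothetical labels for the commands named in the description, and the set of invalid text inputs contains just the examples given above.

```python
# Hypothetical identifiers for the commands the description lists in prose.
CONTROL_COMMANDS = {"hangup", "unconvertible", "wrong_language", "requeue",
                    "send_to_manager", "submit", "submit_and_logout"}
AUDIO_COMMANDS = {"pause", "rewind", "fast_forward", "jump_to_start",
                  "skip_forward", "skip_back", "volume_up", "volume_down",
                  "select_preset", "slow_down", "speed_up", "normal_speed"}
MISC_COMMANDS = {"insert_recipient_name", "hotkeys_panel", "settings_panel",
                 "spell_check", "show_first_names", "show_second_names"}
# Examples of invalid text input given in the description.
INVALID_TEXT = {"capslock", ";", ":", "!"}


def classify(agent_input):
    """Return the command type, checked in the order of Figure 10."""
    if agent_input in CONTROL_COMMANDS:
        return "control"
    if agent_input in AUDIO_COMMANDS:
        return "audio"
    if agent_input in MISC_COMMANDS:
        return "miscellaneous"
    return "text"


def next_step(agent_input):
    """Decide what happens after one agent input."""
    kind = classify(agent_input)
    if kind != "text":
        return "handle_" + kind
    if agent_input in INVALID_TEXT:
        return "restart_inactivity_timer"   # back to step 910
    return "accept_and_predict"             # steps 1016 to 1020
```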
[107] In an embodiment of the present invention, the text input provided by the agent may be a special character such as an underscore. An underscore is used in case the agent is not able to identify a word by listening to the audio file and/or with the suggestions from the system. In an embodiment of the present invention, an agent can put any number of underscores in the converted text and submit the text file. However, if the number of underscores placed by the agent exceeds a pre-determined value, the converted text is automatically sent to a manager, i.e. the conversion is escalated. In another embodiment of the present invention, an agent is not allowed to submit the text file with the number of underscores in the message exceeding a pre-determined value. The agent is required to either reduce the number of underscores or get the audio file and/or the partially converted text file re-queued or return the audio file as unconvertible. In an embodiment of the present invention, if an agent provides a text input that is not available in the system's database, the system provides some suggestions to the agent. The agent may accept an appropriate suggestion. In an embodiment of the present invention, if the agent is sure that the typed word exists in usage even though the word is not present in the system's database, then the agent may use the option 'notify word' to store the word in the system's database. In another embodiment of the invention, after an agent notifies a word, it may be sent to a manager for a confirmation and added in the dictionary thereafter. In an embodiment of the invention, the 'notify word' option is provided only to the managers.
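The two underscore embodiments in paragraph [107] differ only in what happens once the pre-determined value is exceeded: escalation to a manager, or refusal of the submission. A minimal sketch follows, in which the function name, the return labels, and the flag selecting the embodiment are assumptions, and each underscore character is counted individually.

```python
def route_submission(text, max_underscores, block_when_exceeded=False):
    """Decide where a submitted text goes based on its underscore count."""
    count = text.count("_")
    if count <= max_underscores:
        return "review"
    if block_when_exceeded:
        # Second embodiment: the agent must reduce the underscores,
        # re-queue the message, or return it as unconvertible.
        return "rejected"
    # First embodiment: the conversion is escalated automatically.
    return "manager"
```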
[108] In an embodiment of the present invention, the suggestions provided to the agent also include the word followed by a special symbol, such as a (?). If the agent is not sure of existence of the word in usage but is sure that the word is present in the audio, the agent puts a (?) mark after the word or selects the same from the suggestions.
[109] Referring to Figure 11, in case the agent input is a control command, it is checked at step 1102 whether it is a valid control command. In case the command is not a valid control command, steps 910 onwards are repeated. In other words, the T2 timer is initiated again and the system waits for the next input from the agent.
[110] In case the control command is one of the valid control commands, corresponding appropriate action is taken as explained further. The valid control commands include, but are not limited to, submitting the audio as a hangup, submitting the audio file as unconvertible, submitting the audio file as a wrong language, re-queuing the audio file, sending the audio file and the text file to a manager, sending the text file for review, and sending the text file for review followed by logging out.
[111] At step 1104, it is checked whether the control command is to submit the audio file as a hangup. Generally, a hangup is detected when there is no message in the audio, and the audio is too short. Another type of hangup can be a long audio file in case of a message recorded by mistake. In such cases, the agent submits the audio file as a hangup at step 1106. In an embodiment of the present invention, the audio file submitted as a hangup is not sent for a review. In case of a hangup submission, steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion.
[112] At step 1108, it is checked whether the control command is to submit the audio file as unconvertible. Unconvertible audio files are the audio files that the agent is not able to convert. For example, an audio file may contain just music or noise, along with a person's voice, making the person's voice incomprehensible. In an embodiment of the present invention, an agent can submit such audio files as 'unconvertible' at step 1110. In case of unconvertible audio file submission, steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion.
[113] At step 1112, it is checked whether the control command is to submit the audio file containing the message in a wrong language. In an embodiment of the present invention, if an agent identifies that the language of the audio is not the preferred language set by him or her, and the agent is unable to do the conversion, the agent submits the audio file as 'wrong language'. The audio file is then re-queued in an appropriate language queue at step 1114. In an embodiment of the present invention, the agent is given an option of identifying the language of the message. In case the agent is not able to identify the probable language of the message, he or she can use an unknown language option. When unknown language option is selected, the message is escalated to a manager for language check. In case of message submission as wrong language, steps 902 onwards are repeated, i.e. the agent is presented a new audio file for conversion after the wrong language message is re-queued in an appropriate language queue.
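The wrong-language handling can be sketched as a routing step between queues. The sketch assumes each language queue is a simple in-memory `deque` keyed by a language code, and that `None` stands for the unknown language option; sending a message to the manager when no queue exists for the identified language is a further assumption not stated in the description.

```python
from collections import deque


def requeue_wrong_language(message, language, language_queues, manager_queue):
    """Move `message` to the queue for `language`, or escalate it."""
    if language is None or language not in language_queues:
        # Unknown language option: escalate to a manager for a language check.
        manager_queue.append(message)
        return "manager"
    language_queues[language].append(message)
    return language
```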
[114] Referring now to Figure 12, at step 1202, it is checked whether the control command is to re-queue the message. In case the agent wishes to re-queue the message, he or she can use the defined hot-key. The message is re-queued in the language queue at step 1204. In an embodiment of the invention, when the agent wishes to re-queue the message that he is working on, the audio file along with any converted text in the text file is re-queued. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
[115] At step 1206, it is checked whether the control command is to send the message to a manager. In an embodiment of the present invention, in case the agent believes that the message needs another level of supervision; he or she can escalate it to a manager. The message is then sent to a manager at step 1208. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
[116] At step 1210, it is checked whether the control command is to send the message. When the agent is finished with converting the message, he can submit the text file for review at step 1212. In an embodiment of the present invention, the message can be submitted at any stage of the conversion. Thereafter, steps 902 onwards are repeated, i.e. the agent is presented with a new audio file for conversion.
[117] The agent may want to submit the message and log out of the system. In this case, the agent can use the defined hot-key to submit the message and log out. This control command is checked for at step 1214. The text file is submitted for review, and the agent is logged out of the application, at step 1216. Figure 13 shows the actions that follow if the result of the check performed at step 1006 is that the agent input is an audio command.
[118] In case the agent input checked at step 1006 is an audio command, actions explained in Figure 13 are performed. Examples of audio commands include, but are not limited to, 'pause', 'rewind', 'fast-forward', 'jump to audio start', 'skip section forward', 'skip section back', 'volume up', 'volume down', 'select an audio preset equalizer', 'slow down the speed of audio playback', 'speed up the audio playback' and 'play audio at normal speed'. The system checks at steps 1302, 1304 and 1306, whether the audio command input provided by the agent is one of the audio commands.
[119] Corresponding to the selected audio command option, appropriate action is performed at step 1308. If the agent input is a 'pause' audio command, the action performed depends on the current state of the audio. If the current state of the audio is 'playing', the audio is paused. If the current state of the audio is 'paused', then the audio playback is resumed from the point where the audio is paused.
[120] If the agent input is a 'rewind' audio command, the audio is rewound as long as the 'rewind' audio command is in effect. If the agent input is a 'fast-forward' audio command, the audio is fast-forwarded as long as the 'fast-forward' audio command is in effect.
[121] If the agent input is 'jump to audio start' audio command, the audio is played from the beginning. If the agent input is 'skip section forward' audio command, the currently playing section of the audio is skipped, and the audio starts playing from the beginning of the next section. If the agent input is 'skip section back' audio command, the currently playing section is skipped, and the audio starts playing from the beginning of the previous section.
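The section skipping described in paragraph [121] can be sketched with a sorted list of section start offsets. How sections are delimited is not stated here, so the list `section_starts` (in seconds, beginning at 0) is an assumption, as is the choice to stay at the current position when there is no next section.

```python
from bisect import bisect_right


def skip_forward(position, section_starts):
    """Return the start of the section after the one containing `position`."""
    i = bisect_right(section_starts, position)
    if i < len(section_starts):
        return section_starts[i]
    return position  # assumption: no next section, so playback stays put


def skip_back(position, section_starts):
    """Return the start of the section before the one containing `position`."""
    current = bisect_right(section_starts, position) - 1
    return section_starts[max(current - 1, 0)]


def jump_to_start(section_starts):
    """Return the position for the 'jump to audio start' command."""
    return section_starts[0]
```

With section starts at 0, 10, 20 and 30 seconds and playback at 12 seconds, `skip_forward` gives 20 and `skip_back` gives 0, the beginning of the previous section.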
[122] If the agent input is 'volume up' audio command, the volume of the audio playback increases. If the agent input is 'volume down' audio command, the volume of the audio playback decreases. It should be noted that the increase/decrease in the volume of the audio playback may have a predetermined magnitude.
[123] If the agent input is 'select an audio preset' command, an appropriate preset equalizer from the available presets is selected. In an embodiment of the invention, the agent may be given an option of toggling between the available presets. In this case, the available preset equalizers are displayed on the screen, and agent toggles and selects the preset of his/her choice by using the assigned hot-key. When the preset is selected, the pre-processed audio corresponding to the selected preset equalizer is played back to the agent. It should be noted that toggling between available audio preset equalizers can be done in real time, and it does not hamper the quality of audio being played. Only a shift from one preset equalizer to other is made.
[124] In an embodiment of the current invention, a plurality of audio presets are provided to the agent. For example, these audio presets are:
1. Off (the audio is played with the same volume/equalization as it is received)
2. Neutral
3. Slim
4. Clarity
5. Shine
6. Bright
[125] A separate equalization function corresponds to each of these presets. When selected, the presets provide desired effect to the audio. It should be noted that the preset equalizers mentioned here are for illustrative purpose only. A number of other similar preset equalizers can be provided according to the agent requirements, and it does not restrict the scope of the invention in any way.
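Toggling between the presets with a single hot-key can be sketched as cycling through the list in order and wrapping around at the end. The preset names come from the list above; the function name and the wrap-around behaviour are assumptions, and the equalization functions themselves are not modelled.

```python
PRESETS = ["Off", "Neutral", "Slim", "Clarity", "Shine", "Bright"]


def next_preset(current, presets=PRESETS):
    """Return the preset that follows `current`, wrapping to the first."""
    i = presets.index(current)
    return presets[(i + 1) % len(presets)]
```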
[126] Moreover, in an embodiment of the present invention, the selected preset can be displayed to the agent by using an appropriate symbol on the Graphical User Interface, i.e. on conversion and review screens. Alternatively, representative letters that denote different presets can be used to indicate the selected preset to the agent.
[127] In case the agent has a preference for a particular audio preset, i.e. he/she is comfortable with listening to the audio processed in a particular way, the preset can be selected for the agent by default. However, the agent is still given a choice of selecting another audio preset at the time of audio playback.
[128] If the agent input is the 'slow down the speed of the audio playback' audio command, the speed of the audio playback decreases. If the agent input is the 'speed up the audio playback' audio command, the speed of the audio playback increases. If the agent input is 'play audio at normal speed' audio command, the audio is played at normal speed.
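The step-by-step speed commands can be sketched as a small controller that moves the playback rate in 0.1X steps and returns to 1.0X on the normal playback command. The 0.1X step is taken from the hot-key description earlier in this document; using 0.5X and 2X as lower and upper bounds is an assumption borrowed from the fixed-speed embodiment, and the class and method names are hypothetical.

```python
class PlaybackSpeed:
    """Tracks the playback rate as a multiple of normal speed."""

    def __init__(self, step=0.1, minimum=0.5, maximum=2.0):
        self.step = step
        self.minimum = minimum
        self.maximum = maximum
        self.rate = 1.0

    def slow_down(self):
        # Rounding keeps repeated 0.1 steps from accumulating float error.
        self.rate = max(self.minimum, round(self.rate - self.step, 1))
        return self.rate

    def speed_up(self):
        self.rate = min(self.maximum, round(self.rate + self.step, 1))
        return self.rate

    def normal(self):
        self.rate = 1.0
        return self.rate
```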
[129] This description includes various embodiments of the current invention as claimed herein. The various embodiments described herein do not limit or reduce the nature and scope of the invention.
[130] Accordingly, we hereby claim:
Claims
1. A method for interfacing with processes for converting an audio message to a text message comprising the steps of: a) displaying an audio file and a text file, b) comparing the content of the audio file with the content of the text file, c) disabling processing on the text file for a predetermined period of time, d) performing at least one process on the text file, and e) transmitting the processed text file to a device for displaying the processed text file.
2. A method as in claim 1 wherein the predetermined period of time is between 0.5 and 30 seconds.
3. A method as in claim 2 wherein the predetermined period of time is between 1.0 and 10 seconds.
4. A method as in claim 1 wherein the predetermined period of time is one-tenth of the duration of the audio file but no less than three seconds.
5. A method as in claim 1 further comprising the step of pausing the audio file after the performing step.
6. A method as in claim 5 wherein the pause is between 0.5 and 10 seconds.
7. A method as in claim 6 wherein the pause is 5 seconds.
8. A method as in claim 1 further comprising the step of reviewing the text file after the performing step.
9. A method as in claim 8 wherein at least one process on the text file is performed during the reviewing step.
10. A method as in claim 1 further comprises selecting at least one of a plurality of audio presets for the displaying of an audio file.
11. A device for interfacing with the processes of converting an audio message to a text message for an end-user said device comprising: a) an audio file playback device, b) a text file display device, c) a text file processing device that is disabled for a predetermined period of time after the initiation of the audio file playback device playing an audio file and performs at least one human-initiated process on the text file displayed on the text file display device, and d) a transmitting device for transmitting the processed text file to a end-user device for displaying the processed text file.
12. A device as in claim 11 wherein the predetermined period of time is between 0.5 and 30 seconds.
13. A device as in claim 12 wherein the predetermined period of time is between 1.0 and 10 seconds.
14. A device as in claim 11 wherein the predetermined period of time is one-tenth of the duration of the audio file but no less than three seconds.
15. A device as in claim 11 further comprising the audio file playback device being paused after the disabling of the text file processing device.
16. A device as in claim 15 wherein the audio file playback device is paused between 0.5 and 10 seconds.
17. A device as in claim 16 wherein audio file playback device is paused for 5 seconds.
18. A device as in claim 11 wherein the audio file playback device has a plurality of audio presets for playing an audio file.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US5364008P | 2008-05-15 | 2008-05-15 | |
| US61/053,640 | 2008-05-15 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2009158077A2 (en) | 2009-12-30 |
| WO2009158077A3 (en) | 2010-04-22 |
Family
ID=41445162
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2009/044270 Ceased WO2009158077A2 (en) | 2008-05-15 | 2009-05-15 | Devices and methods used in the processing of converting audio messages to text messages |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2009158077A2 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20130225128A1 (en) * | 2012-02-24 | 2013-08-29 | Agnitio Sl | System and method for speaker recognition on mobile devices |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5369704A (en) * | 1993-03-24 | 1994-11-29 | Engate Incorporated | Down-line transcription system for manipulating real-time testimony |
| GB2323693B (en) * | 1997-03-27 | 2001-09-26 | Forum Technology Ltd | Speech to text conversion |
| US20050060159A1 (en) * | 2003-09-17 | 2005-03-17 | Richard Jackson | Text transcription with correlated image integration |
- 2009-05-15: WO PCT/US2009/044270, patent WO2009158077A2 (en), not active, Ceased
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20130225128A1 (en) * | 2012-02-24 | 2013-08-29 | Agnitio Sl | System and method for speaker recognition on mobile devices |
| US9042867B2 (en) * | 2012-02-24 | 2015-05-26 | Agnitio S.L. | System and method for speaker recognition on mobile devices |
| US10749864B2 (en) | 2012-02-24 | 2020-08-18 | Cirrus Logic, Inc. | System and method for speaker recognition on mobile devices |
| US11545155B2 (en) | 2012-02-24 | 2023-01-03 | Cirrus Logic, Inc. | System and method for speaker recognition on mobile devices |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2009158077A3 (en) | 2010-04-22 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US8028248B1 (en) | Transcription editing | |
| EP1611504B1 (en) | Method and device for providing speech-enabled input in an electronic device having a user interface | |
| US7698127B2 (en) | Grammar-based automatic data completion and suggestion for user input | |
| US7028265B2 (en) | Window display system and method for a computer system | |
| JP5021802B2 (en) | Language input device | |
| US6370503B1 (en) | Method and apparatus for improving speech recognition accuracy | |
| EP2434482B1 (en) | Real-time transcription correction system | |
| EP1544719A2 (en) | Information processing apparatus and input method | |
| US8560326B2 (en) | Voice prompts for use in speech-to-speech translation system | |
| US8213579B2 (en) | Method for interjecting comments to improve information presentation in spoken user interfaces | |
| US20090326938A1 (en) | Multiword text correction | |
| CN112219214A (en) | System and method with time-matched feedback for interview training | |
| US20120016671A1 (en) | Tool and method for enhanced human machine collaboration for rapid and accurate transcriptions | |
| US6915258B2 (en) | Method and apparatus for displaying and manipulating account information using the human voice | |
| EP4393144B1 (en) | Determination and visual display of spoken menus for calls | |
| US9460718B2 (en) | Text generator, text generating method, and computer program product | |
| AU2005229676A1 (en) | Controlled manipulation of characters | |
| EP0087199A1 (en) | Device for generating audio information of individual characters | |
| JP5063056B2 (en) | Visible markers for voice-enabled links | |
| US20080282204A1 (en) | User Interfaces for Electronic Devices | |
| US20080109220A1 (en) | Input method and device | |
| JPWO2018043138A1 (en) | INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND PROGRAM | |
| WO2009158077A2 (en) | Devices and methods used in the processing of converting audio messages to text messages | |
| US20020072910A1 (en) | Adjustable speech menu interface | |
| Longoria | Designing software for the mobile context: a practitioner’s guide | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09770612; Country of ref document: EP; Kind code of ref document: A2 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 09770612; Country of ref document: EP; Kind code of ref document: A2 |