
HK1232996A1 - In-call translation


Info

Publication number
HK1232996A1
Authority
HK
Hong Kong
Prior art keywords
translation
user
audio
output
notification
Prior art date
Application number
HK17106649.3A
Other languages
Chinese (zh)
Other versions
HK1232996B (en)
Inventor
A. Aue
A. A. Menezes
J. N. Lindblom
F. Furesjö
P. P. N. Greborio
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1232996A1
Publication of HK1232996B


Description

In-call translation
Background
Communication systems allow users to communicate with each other over a communication network (e.g., via a call over the network). The network may be, for example, the internet or a Public Switched Telephone Network (PSTN). During a call, audio and/or video signals can be transmitted between nodes of the network, thereby allowing users to send and receive audio data (e.g., voice) and/or video data (e.g., webcam video) with each other in a communication session over the communication network.
Such communication systems include voice over internet protocol (VoIP) systems. To use a VoIP system, a user installs and runs client software on a user device. The client software establishes the VoIP connection and provides other functions such as registration and user authentication. In addition to voice communication, the client may also establish connections for other communication modes, for example to provide instant messaging ("IM"), SMS messaging, file transfer, and voicemail services to users.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect, a computer-implemented method performed in a communication system is disclosed. The communication system is used to enable a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language. Call audio for a call is received, the call audio including speech of a source user in a source language. A translation process is performed on the call audio to generate an audio translation of the source user's speech in the target language for output to the target user. A change in the behavior of the translation flow is signaled, the change being related to the generation of the translation, thereby causing a notification to be output to the target user to notify the target user of the change.
According to a second aspect, a computer system for use in a communication system is disclosed. The communication system is used to enable a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language. The computer system includes one or more audio output components available to the target user, a translation output component, and a notification output component. The translation output component is configured to output, via the audio output component(s), an audio translation of the source user's speech in the target language to the target user. The translation is generated by performing an automatic translation procedure on call audio of a call that includes speech of the source user in the source language. The notification output component is configured to output a notification to the target user to notify the target user of a change in behavior of the translation flow, the change being related to the generation of the translation.
According to a third aspect, a computer program product is disclosed, comprising computer code stored on a computer readable storage medium, the computer code being configured to, when executed, implement any of the methods or systems disclosed herein.
Drawings
For a better understanding of the present subject matter and to show how the same may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings in which:
FIG. 1 is a schematic illustration of a communication system;
FIG. 2 is a schematic block diagram of a user device;
FIG. 3 is a schematic block diagram of a server;
FIG. 4A is a functional block diagram illustrating the functionality of a communication system;
FIG. 4B is a functional block diagram illustrating some of the components of FIG. 4A;
FIG. 5 is a flow diagram of a method for supporting communication between users as part of a call;
FIG. 6 is a flow diagram for a method of operating a translator avatar to be displayed at a client user interface;
FIGS. 7A-7E schematically illustrate translator avatar behavior in various exemplary scenarios; and
FIG. 8 is a functional block diagram of a notification-based translation system.
Detailed Description
Embodiments will now be described by way of example only.
Reference is made initially to FIG. 1, which illustrates a communication system 100. In this embodiment the communication system 100 is a packet-based communication system, although it may not be packet-based in other embodiments. A first user 102a (user A or "Alice") of the communication system operates a user device 104a, which is shown connected to a communication network 106. For reasons that will become apparent, the first user (Alice) is also referred to below as the "source user". The communication network 106 may be, for example, the internet. The user device 104a is arranged to receive information from, and output information to, the user 102a of the device.
The user device 104a runs a communication client 118a provided by a software provider associated with the communication system 100. The communication client 118a is a software program running on a local processor of the user device 104a that allows the user device 104a to establish communication events, such as audio calls, audio and video calls (equivalently referred to as video calls), instant messaging communication sessions, and so forth, over the network 106.
FIG. 1 also shows a second user 102b (user B or "Bob") who has a user device 104b, the user device 104b running a client 118b to communicate over the network 106 in the same manner as the user device 104a running the client 118a. Thus, user A and user B (102a and 102b) are able to communicate with each other over the communication network 106. For reasons that will also become apparent, the second user (Bob) is also referred to below as the "target user".
More users may be connected to the communication network 106, but for clarity only the two users 102a and 102b are shown connected to the network 106 in FIG. 1.
Note that in alternative embodiments, the user devices 104a and/or 104b can be connected to the communication network 106 via additional intermediate networks not shown in FIG. 1. For example, if one of the user devices is a mobile device, it may be connected to the communication network 106 via a cellular mobile network (not shown in FIG. 1), for example a GSM or UMTS network.
The communication events between Alice and Bob can be established using clients 118a, 118b in various ways. For example, a call can be established by one of Alice and Bob sending (either directly, or indirectly by way of an intermediary network entity such as a server or controller) an invitation to the call to the other, who accepts it, and the call can be terminated by one of Alice and Bob selecting to end the call at their client. Alternatively, as explained in more detail below, a call can be established by requesting that another entity in the system 100 establish a call with Alice and Bob as participants, the call in that event being a multiparty (specifically three-way) call between Alice, Bob, and that entity.
Each communication client instance 118a, 118b has a login/authentication facility that associates the user devices 104a, 104b with their respective users 102a, 102b, for example, by the user entering a username (or other suitable user identifier that conveys the user's identity within the system 100) and password at the client, and verifying against user account data stored at a server (or the like) of the communication system 100 as part of an authentication procedure. A user is thus uniquely identified within communication system 100 by an associated user identifier (e.g., username), where each username is mapped to a corresponding client instance(s) to which data (e.g., call audio/video) for the identified user can be sent.
The user can have communication client instances running on other devices that are associated with the same login/registration details. In the case where the same user with a particular username can be logged into multiple instances of the same client application on different devices at the same time, the server (or similar) is arranged to map the username (user ID) to all of those multiple instances, and to map a separate sub-identifier (sub-ID) to each particular individual instance. Thus, the communication system is able to distinguish between different instances while still maintaining a consistent identity for the user within the communication system.
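By way of illustration only, the following Python sketch shows one way such a username-to-instance mapping could be represented on the server side; the class, methods, and address values are hypothetical and are not taken from the patent.

```python
# A minimal sketch (hypothetical names) of mapping one username to the
# multiple logged-in client instances identified by sub-IDs.
from collections import defaultdict

class ClientRegistry:
    def __init__(self):
        # username -> {sub_id: network address of that client instance}
        self._instances = defaultdict(dict)

    def register(self, username, sub_id, address):
        """Record a client instance logged in under `username`."""
        self._instances[username][sub_id] = address

    def endpoints(self, username):
        """All addresses to which data for `username` can be sent."""
        return list(self._instances[username].values())

registry = ClientRegistry()
registry.register("user1", "desktop-01", "10.0.0.5:5061")
registry.register("user1", "mobile-02", "10.0.0.9:5061")
print(registry.endpoints("user1"))  # both instances receive "user1" data
```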
The user 102a (Alice) is logged in (authenticated) as "user 1" at the client 118a of the device 104a. The user 102b (Bob) is logged in (authenticated) as "user 2" at the client 118b of the device 104b.
FIG. 2 illustrates a detailed view of a user device 104 (e.g., 104a, 104b) on which a communication client instance 118 (e.g., 118a, 118b) is running. The user device 104 includes at least one processor 202 in the form of one or more central processing units ("CPUs"), to which are connected a memory (computer storage) 214 for storing data, an output device in the form of a display 222 (e.g., 222a, 222b) having an available display area (such as a display screen), a keypad (or keyboard) 218, and a camera 216 for capturing video data, the latter being examples of input devices. The display 222 may comprise a touch screen for inputting data to the processor 202 and thus also constitutes an input device of the user device 104. An output audio device 210 (e.g., one or more speakers) and an input audio device 212 (e.g., one or more microphones) are also connected to the CPU 202. The display 222, keypad 218, camera 216, output audio device 210, and input audio device 212 may be integrated into the user device 104, or one or more of them may not be integrated and may instead be connected to the CPU 202 via respective interfaces. One example of such an interface is a USB interface. For example, an audio headset (i.e., a single device containing both output and input audio components) or headphones/earbuds (or the like) may be connected to the user device via an appropriate interface, such as a USB or audio-jack-based interface.
The CPU 202 is connected to a network interface 220 (e.g., a modem) for communicating with the communication network 106 for communicating over the communication system 100. The network interface 220 may or may not be integrated into the user device 104.
The user device 104 may be, for example, a mobile phone (e.g., a smartphone), a personal computer ("PC") capable of connecting to the network 106 (including, for example, Windows™, Mac OS™, and Linux™ PCs), a gaming device, a television (TV) device (e.g., a smart TV), a tablet computing device, or other embedded device.
Some of the above-mentioned components may not be present in some user devices, for example the user devices may take the form of telephone headsets (VoIP or otherwise) or teleconferencing devices (VoIP or otherwise).
FIG. 2 also illustrates an operating system ("OS") 204 running on the CPU 202. The operating system 204 manages the hardware resources of the computer and handles data sent to and from the network via the network interface 220. The client 118 is shown running on top of the OS 204. The client and OS can be stored in memory 214 for execution on processor 202.
The client 118 has a User Interface (UI) for presenting information to a user of the user device 104 and receiving information from the user of the user device 104. The user interface includes a Graphical User Interface (GUI) for displaying information in the available area of the display 222.
Referring to FIG. 1, Alice 102a (the source user) speaks a source language; Bob 102b (the target user) speaks a target language other than (i.e., different from) the source language and does not understand (or has only a limited understanding of) the source language. In a conversation between the two users, Bob will therefore likely be unable to understand, or at least have difficulty understanding, what Alice says. In the following examples, Bob is presented as a Chinese speaker and Alice as an English speaker; as will be appreciated, this is just one example and the users can speak any two languages in any country or region. In addition, "different languages" as used herein also covers different dialects of the same language.
For this purpose, a language translation relay system (translator relay system) 108 is provided in the communication system 100. The purpose of the translator relay is to translate the audio of a voice call or video call between Alice and Bob. That is, the translator relay is used to translate call audio of a voice call or video call between Alice and Bob from the source language to the target language to support in-call communication between Alice and Bob (i.e., to help Bob understand Alice, and vice versa, during the call). The translator relay generates a translation, in the target language, of the call audio received from Alice in the source language. The translation may include an audible translation encoded as an audio signal for output to Bob via the speaker(s) of Bob's device and/or a text-based translation for display on Bob's display.
As explained in more detail below, the translator relay system 108 functions as both a translator and a relay in the sense that it receives untranslated call audio from Alice via the network 106, translates it, and relays a translated version of Alice's call audio to Bob (i.e., sends the translation directly to Bob via the network 106 for output during the call, as opposed to, e.g., a user device such as Alice's or Bob's acting as a requester that requests a translation from a translator service, with the translation being returned to the requester to be passed on by the requester itself to the other device). This represents a fast and efficient path through the network that minimizes the burden placed by the clients on network resources and improves the overall speed at which translations reach Bob.
The translator performs a "live" automatic translation process on the voice or video call between Alice and Bob in the sense that the translation is somewhat synchronized with Alice's and Bob's natural speech. For example, natural speech during a conversation will typically contain intervals of speech activity by Alice (i.e., intervals in which Alice speaks) interspersed with intervals of speech inactivity by Alice (e.g., when Alice pauses to think or listens to Bob speaking). An interval of speech activity may, for example, correspond to a sentence or a small number of sentences bounded by pauses in Alice's speech. Live translation may be performed on such intervals of voice activity, so that translation of the immediately preceding interval of voice activity by Alice is triggered by a sufficient (e.g., predetermined) interval of voice inactivity ("immediately preceding" refers to the most recent interval of voice activity that has not yet been translated). In this case, once the translation is complete, it may be sent to Bob for output so that Bob hears the translation as soon as possible after hearing the most recent period of Alice's natural voice activity; i.e., a period of voice activity by Alice is heard by Bob, followed by a short pause (while the translation of that period is performed and sent), followed by Bob hearing and/or seeing the translation of Alice's speech in that interval. Performing the translation on such an interval basis may result in a higher-quality translation, because the translation flow can take advantage of the context in which words appear in a sentence to achieve a more accurate translation. Because the translator service acts as a relay, the length of this short pause is minimized, resulting in a more natural user experience for Bob.
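A minimal sketch of this interval-based triggering, assuming a frame-based voice activity detector and treating the translation service as an opaque callable; the names and the frame/threshold values are illustrative only, not the patent's implementation.

```python
SILENCE_FRAMES_TO_TRIGGER = 25  # e.g. ~0.5 s of inactivity at 20 ms frames

def run_segmented_translation(frames, is_speech, translate, send_to_target):
    """Buffer an interval of voice activity and translate it once a
    sufficient interval of voice inactivity is detected."""
    buffered, silent = [], 0
    for frame in frames:
        if is_speech(frame):
            buffered.append(frame)
            silent = 0
        else:
            silent += 1
            if silent >= SILENCE_FRAMES_TO_TRIGGER and buffered:
                # Pause detected: translate the immediately preceding
                # (not yet translated) interval of voice activity.
                send_to_target(translate(b"".join(buffered)))
                buffered = []
    if buffered:  # flush any remaining speech when the stream ends
        send_to_target(translate(b"".join(buffered)))
```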
Alternatively, the automatic translation may be performed and output on a word-by-word (or several-word) basis, for example while Alice's speech is still in progress and being heard by Bob, e.g., as subtitles displayed on Bob's device and/or as audio played out on top of Alice's natural speech (e.g., with the volume of Alice's speech reduced relative to the audible translation). This may result in a more responsive user experience for Bob, as the translation is generated in near real-time (e.g., with a response time of less than about 2 seconds). The two can also be combined; for example, intermediate results of the speech recognition system may be displayed on the screen and edited as the best hypothesis changes while the sentence proceeds, with the translation of the final best hypothesis then being rendered as audio (see below).
FIG. 3 is a detailed view of the translator relay system 108. The translator relay system 108 includes at least one processor 304 that runs code 110. Connected to the processor 304 are a computer storage (memory) 302 for storing the code 110 and data for its execution, and a network interface 306 for connecting to the network 106. Although shown as a single computer device, the functionality of the relay system 108 may alternatively be distributed across multiple computer devices (e.g., multiple servers located in the same data center). That is, the functionality of the relay system may be implemented by any computer system that includes one or more computer devices and one or more processors (e.g., one or more processor cores). The computer system may be "localized" in the sense that all of the processing and memory functions are located at substantially the same geographic location (e.g., in the same data center, including one or more local servers running on the same or different server devices of the data center). As will be apparent, this can help to further increase the speed at which the translation is relayed to Bob (in the above example, it even further reduces the length of the short pause between Alice completing an interval of speech and the start of the translation output, resulting in an even better user experience for Bob).
As part of code 110, memory 302 holds code configured to implement the functionality of the translator agent. As explained in more detail below, the translator agent is associated with its own user identifier (username) within the communication system 100 in the same manner as a user is associated with a corresponding username. The translator agent is thus uniquely identified by the associated user identifier and therefore appears, in some embodiments, to be another user of the communication system 100, for example appearing as a permanently online user whom the 'real' users 102a, 102b can add as a contact and to/from whom they can send/receive data using their respective clients 118a, 118b. In other embodiments, the bot having the user identifier may be hidden (or at least camouflaged so as to be substantially hidden) from the users, e.g., the client UI (discussed below) is configured so that the users will not be aware of the bot's identity.
As will be appreciated, multiple robots can share the same identity (i.e., be associated with the same username), and those robots can be distinguished using different identifiers that may not be visible to the end user.
The translator relay system 108 may also perform other functions that are not necessarily directly related to translation, such as mixing of call audio streams as in the example embodiments described below.
FIG. 4A is a functional block diagram illustrating the interaction and signaling between the user devices 104a, 104b and the call management component 400. According to various methods described below, call management component 400 supports interpersonal communication between people who do not share a common language (e.g., Alice and Bob). FIG. 4B is another illustration of some of the components shown in FIG. 4A.
Call management component 400 represents functionality implemented by running code 110 on translator relay system 108. The call management component is shown to include functional blocks (components) 402-412, which represent various functions performed by the code 110 when running. Specifically, call management component 400 includes the following components: an instance of the aforementioned translator agent 402, the functionality of which is described in more detail below; an audio translator 404 configured to translate audio speech in the source language into text in the target language; a text-to-speech converter 410 configured to convert text in the target language into synthesized speech in the target language; and an audio mixer 412 configured to mix a plurality of input audio signals to generate a single mixed audio stream comprising audio from each of those signals. The audio translator includes an automatic speech recognition component 406 configured for the source language. That is, automatic speech recognition component 406 is configured to recognize the source language in the received audio, i.e., to identify that a particular portion of sound corresponds to words in the source language (specifically, in this embodiment, audio speech in the source language is converted to text in the source language; in other embodiments, it need not be text, e.g., a translator can translate a complete set of hypotheses provided by a speech engine, represented as a lattice that can be encoded in various ways). Speech recognition may also be configured to identify in operation which language the source user speaks (and in response be configured for that source language, e.g., entering a 'French to ...' mode in response to detecting French), or it may be preconfigured for the source language (e.g., via a UI or profile setting, or by instant-messaging-based signaling, etc., which provisions the bot in, for example, a 'French to ...' mode). Component 400 also includes a text translator 408 configured to translate text in the source language into text in the target language. The components 406, 408 collectively implement the translation functionality of the audio translator 404. Components 402, 404, and 410 make up the back-end translation subsystem (translation service 401), with components 404 and 410 making up its speech-to-speech translation (S2ST) subsystem; the agent acts as an intermediary between the clients 118a/118b and the subsystem.
As indicated, the components of FIGS. 4A/4B may represent processes running on the same machine, or different processes running on different machines (e.g., speech recognition and text translation may be implemented as two different processes running on different machines).
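Purely as an illustration of how the recognition, translation, and synthesis stages chain together, the following Python sketch uses toy stand-ins for components 406, 408, and 410; real engines (which the text treats as black boxes) would replace each stand-in.

```python
def recognize_source_speech(call_audio: bytes) -> str:
    # Stand-in for speech recognition component 406 (source-language ASR).
    return "hello world"

def translate_text(source_text: str) -> str:
    # Stand-in for text translator 408 (source text -> target text).
    toy_lexicon = {"hello": "你好", "world": "世界"}
    return " ".join(toy_lexicon.get(w, w) for w in source_text.split())

def synthesize_target_speech(target_text: str) -> bytes:
    # Stand-in for text-to-speech converter 410.
    return target_text.encode("utf-8")

def speech_to_speech(call_audio: bytes):
    """One pass through the S2ST chain; the agent forwards the recognized
    text back to Alice, and the translation and audio on to Bob."""
    source_text = recognize_source_speech(call_audio)
    target_text = translate_text(source_text)
    target_audio = synthesize_target_speech(target_text)
    return source_text, target_text, target_audio

source_text, target_text, target_audio = speech_to_speech(b"\x00\x01")
print(source_text, "->", target_text)
```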
The translator agent has a first input connected to receive call audio from Alice's user device 104a via the network 106, a first output connected to the input of the audio translator 404 (specifically the speech recognition component 406), a second input connected to the output of the speech recognition component 406 (which is the first output of the audio translator 404), a third input connected to the output of the text translator 408 (which is the second output of the audio translator 404), a second output connected to the first input of the mixer 412, a third output connected to send the translated text in the target language to Bob's user device 104b, and a fourth output configured to send the recognized text in the source language to both Alice's user device 104a and Bob's user device 104b. The agent 402 also has a fourth input connected to the output of the text-to-speech converter 410 and a fifth output connected to the input of the text-to-speech converter. The mixer 412 has a second input connected to receive call audio from Alice's device 104a and an output connected to send the mixed audio stream to Bob via the network 106. The output of the speech recognition component 406 is also connected to an input of the text translator 408. The agent 402 has a fifth input connected to receive feedback data from Alice's user device 104a via the network 106, conveying source-user feedback regarding the results of the speech recognition procedure (e.g., indicating its accuracy), the feedback information having been selected by Alice via her client user interface, and conveying information related to the recognized text for configuring the speech recognizer 406 to improve its results. Alice is presented with options for providing this feedback information via her client user interface when she receives information about the results of the speech recognition.
Inputs/outputs representing audio signals are shown as bold solid arrows in FIG. 4A; inputs/outputs representing text-based signals are shown as thin arrows.
The translator agent instance 402 serves both as an interface between Alice's and Bob's clients 118 and the translation subsystem 401, and as a separate "software agent". Agent-based computing is known in the art. A software agent is an autonomous computer program that performs tasks on behalf of a user in an agency relationship. In acting as a software agent, translator agent 402 acts as an autonomous software entity that, once initiated (e.g., in response to initiation of a call or related session), runs substantially continuously for the duration of that particular call or session (as opposed to being run on demand, i.e., as opposed to being run only when some particular task needs to be performed), waiting for inputs and, when inputs are detected, performing automated tasks on those inputs.
In particular embodiments, the translator agent instance 402 has an identity within the communication system 100, just as a user of the system 100 has an identity within the system. In this sense, the translator agent can be considered a "bot": an artificial intelligence (AI) software entity that appears as a regular user (member) of communication system 100 by virtue of its associated username and behavior (see above). In some implementations, a different respective instance of the bot may be assigned to each call (i.e., on a per-call basis), e.g., English-Spanish-Translator-1, English-Spanish-Translator-2, and so on. That is, in some implementations the bot is associated with a single session (e.g., a call between two or more users). On the other hand, the translation service to which the bot provides an interface may be shared among multiple bots (and also other clients).
In other implementations, bot instances can straightforwardly be configured to handle multiple conversations simultaneously.
In particular, the human users 102a, 102b of the communication system 100 can include the bot as a participant in a voice call or video call between two or more human users, for example by inviting the bot to join an established call as a participant, or by requesting that the bot initiate a multiparty call between the desired two or more human participants and the bot itself. The request is initiated via the client user interface of one of the clients 118a, 118b, which provides an option for selecting the bot and any desired human users as participants in the call, for example by listing humans and bots as contacts in a contact list displayed via the client user interface.
Bot-based embodiments do not require specialized hardware devices or specialized software to be installed on the user's machine, nor do they require the speakers (i.e., participants) to be physically close to each other, as the bot can be seamlessly integrated into existing communication system architectures without, for example, redistributing updated software clients.
The agent 402 (bot) appears as a regular member of the network in the communication system 100 (alternatively referred to as a chat network). Conversation participants can have their interlocutors' speech translated into their own language by inviting the appropriate bot into a voice or video call (also known as a chat session or conversation); e.g., a Chinese speaker talking to an English speaker can invite an agent named (i.e., with a username) "English-Chinese-Translator" into the conversation. The bot then plays the role of a translator or interpreter for the rest of the conversation, translating any speech in its source language into its target language. The translation can be presented as text (for display at the target device, e.g., via subtitles or in a chat window of the target client user interface) and/or as target-language speech (generated using text-to-speech component 410 for playout at the target device via speaker(s)).
Embodiments thus provide:
seamless integration into multimedia telephony/chat services (no separate installation required);
remote communication (participants do not need to be physically close); and
a device-independent, server-based implementation (such that no separate software is required to serve clients (e.g., 104a, 104b) on new platforms), which enables more seamless deployment of upgrades and new features.
In some embodiments, the robot has access to individual audio streams per speaker, allowing for higher quality speech recognition.
In such embodiments, at the highest level is a "bot," which appears to a user of the chat system as a regular human network member would appear. The bot intercepts the audio stream(s) from all users (e.g., 104a) speaking its source language and passes them to a speech-to-text translation system (audio translator 404). The output of the speech-to-text translation system is target-language text. The bot then communicates the target-language information to the target-language user(s) 104b. The bot may also communicate the speech recognition results for the source audio signal to the source speaker 104a and/or the target listener 104b. The source speaker can then correct the recognition results by feeding correction information back to the bot via the network 106 to obtain a better translation, or attempt to repeat or restate their expression (or portions thereof) to achieve better recognition and translation. Alternatively, the speaker can be presented with a representation of the n-best list or recognition lattice (i.e., a graph that visually represents different possible hypotheses for the recognized source speech), allowing them to clarify or correct an imperfect 1-best recognition by feeding back selection information that identifies the best hypothesis. Recognition information (e.g., the source-language text itself) can also be sent to the target user, which can be useful to listeners with a small degree of proficiency in the source language or whose reading comprehension in that language is better than their listening comprehension. Having access to the source text may also allow the target user to make sense of ambiguous or incorrect translations; named entities, such as names of people or places, may for example be correctly recognized by the speech recognition system but incorrectly translated.
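A minimal sketch of this correction loop, in which the bot offers the n-best hypotheses to the source speaker and uses the selected one; the interaction is reduced to a callable, and all names and example strings are illustrative, not taken from the patent.

```python
def correct_recognition(n_best, ask_source_user):
    """n_best: recognition hypotheses ordered best-first.
    ask_source_user: callable that presents the hypotheses to the source
    speaker and returns the index they select, or None if no correction."""
    choice = ask_source_user(n_best)
    return n_best[0] if choice is None else n_best[choice]

hypotheses = ["wreck a nice beach", "recognize speech"]
# Here the speaker corrects the imperfect 1-best by selecting index 1.
best = correct_recognition(hypotheses, lambda options: 1)
print(best)  # the hypothesis actually passed on to the text translator
```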
The implementation details of the bot depend on the architecture of the chat network or the level of access to the chat network.
The implementation for a system that provides an SDK ("software development kit") will depend on the features provided by the SDK. Typically, these will provide read access to separate video and audio streams for each conversation participant, and write access to the video and audio streams for the bot itself.
Some systems provide a server-side bot SDK that allows full access to all streams and enables scenarios such as applying video subtitles to the source speaker's video signal and/or replacing or mixing the source speaker's audio output signal. Finally, where full control over the system is available, the translation can be integrated in any way, including changes to the client UI, in order to make the interlingual conversation experience easier for the user.
At the weakest level, a "closed" network without a publicly defined protocol and/or SDK can be served by a bot that intercepts and modifies the signals to and from the microphone, camera, and speaker devices on a client computer (e.g., 104a, 104b, rather than at a separate relay). In this case, the bot may perform language detection to find out which portions of the signal are in its source language (e.g., to distinguish speech in other languages in a mixed audio stream).
Communication of the target language text can occur in a variety of ways; the text can be communicated in a public chat channel (commonly visible/audible to all call participants such as Alice and Bob) or a private chat channel (between the robot and the target user only) and/or as a video caption superimposed on the video stream of the robot or of the speaker of the source language. The text can also be passed to a text-to-speech component (text-to-speech converter 410) that renders the target language text as an audio signal that can replace or be mixed with the speaker's original audio signal. In an alternative embodiment, only translated text is sent over the network and text-to-speech synthesis is performed on the client side (thereby saving network resources).
The translation can be retrospective (the bot waits until the user pauses or indicates in some other way that their expression is complete, e.g., clicks a button, and then conveys the target-language information) or simultaneous, i.e., occurring substantially simultaneously with the source language (the bot begins conveying the target-language information at the point where it has enough text to produce semantically and grammatically coherent output). The former uses voice activation detection to determine when to begin translating the preceding portion of speech (translation is performed on intervals of detected speech activity); the latter uses voice activation detection and an automatic segmentation component (translation is performed on each segment of an interval of detected speech activity, which may have one or more segments). As will be appreciated, components for performing such functions are readily available. In the retrospective, turn-taking scenario, the use of a bot acting as a third-party translator in the call helps users by placing them in a familiar real-world scenario involving translators (e.g., a scenario that might occur in a court); simultaneous translation is similar to a human simultaneous interpreter (e.g., as would be encountered in the European Parliament or the UN). Thus, both provide an intuitive translation experience for the target user(s).
It should be noted that reference to "automated translation" (or the like) as used herein encompasses both retrospective and simultaneous translation (among others). That is, "automated translation" (or the like) encompasses automated simulations of both human translators and human simultaneous interpreters.
As will be appreciated, the present subject matter is not limited to any particular speech recognition or translation component, and these can be considered black boxes for all intents and purposes. Techniques for rendering translations from speech signals are known in the art, and there are many components that may be used to perform such functions.
Although FIGS. 4A/4B show only one direction of translation for simplicity, it will be readily appreciated that the bot 402 can perform the equivalent translation function on Bob's call audio for Alice's benefit. Similarly, although the following methods are described with reference to one-way translation for simplicity, it will be appreciated that such methods can be applied to two-way (or multi-way) translation.
A method of supporting communication between users during a voice call or a video call between those users will now be described with reference to FIG. 5, which is a flow chart of the method. For simplicity, FIG. 5 depicts only the in-call translation flow from Alice's language to Bob's language; it will be appreciated that a separate and equivalent process can be performed simultaneously in the same call to translate from Bob's language to Alice's language (from that perspective, Alice can be considered the target and Bob the source).
At step S502, a request for a translator service is received by the translator relay system 108, requesting that the bot perform a translation service during a voice call or video call in which Alice, Bob, and the bot will be participants. The call thus constitutes a multiparty (group) call, in particular a three-way call. At step S504, the call is established. The request may be a request for the agent 402 to establish a multiparty call between the bot 402 and at least Alice and Bob, in which case the bot establishes the call by initiating a call invitation to Alice and Bob (so that S502 precedes S504), or an invitation for the bot 402 to join a call already established at least between Alice and Bob (so that S504 precedes S502), in which case Alice (or Bob) establishes the call by initiating a call invitation to Bob (or Alice) and the bot. The call may be initiated via the client UI, or automatically by the client or some other entity (e.g., a calendar service configured to automatically initiate the call at a pre-specified time).
At step S506, the bot 402 receives Alice's call audio as an audio stream from Alice's client 118a via the network 106. The call audio is audio captured by Alice's microphone and includes Alice's speech in the source language. The bot 402 supplies the call audio to the speech recognition component 406.
At step S508, the speech recognition component 406 performs a speech recognition procedure on the call audio. The speech recognition procedure is configured to recognize the source language. In particular, the speech recognition procedure detects particular patterns in the call audio that match known speech patterns of the source language to generate an alternative representation of the speech. This may be, for example, a textual representation of the speech as a string of characters in the source language, in which case the procedure constitutes a source-speech-to-source-text recognition procedure, or some other representation, such as a feature vector representation. The results of the speech recognition procedure (e.g., character strings/feature vectors) are input to the text translator 408 and also supplied back to the bot 402.
At step S510, the text translator 408 performs a translation procedure on the input results to generate text (or some other similar representation) in the target language. The translation is performed substantially live, e.g., on a sentence-by-sentence (or several-sentence), per-detected-segment, or word-by-word (or several-word) basis as mentioned above. Thus, the translated text is output semi-continuously while call audio is still being received from Alice. The target-language text is supplied back to the bot 402.
At step S512, the target language text is supplied by the robot to a text-to-speech converter that converts the target language text into artificial speech spoken in the target language. The synthesized speech is supplied back to the robot 402.
Because the text and synthesized speech output from audio translator 404 are in the target language, they are understandable to Bob speaking the target language.
At step S514, the synthesized audio is supplied to the mixer 412, where it is mixed with Alice's original audio (including her original natural speech) to generate a mixed audio stream including both the synthesized translated speech in the target language and the original natural speech in the source language, which is sent to Bob via the network 106 (S516) for output via the audio output device(s) of his user device as part of the call. Bob can thus gauge Alice's tone, etc., from the natural speech (even if he cannot understand it) while grasping the meaning from the synthesized speech, resulting in more natural communication. That is, the system is also capable of transmitting Alice's untranslated audio alongside the translated audio. Even when the target user does not understand the source language, there is still information to be gathered from it, for example tone (e.g., the target user may be able to discern whether the source speaker is asking a question).
Alternatively, Alice's original speech signal may not be sent to Bob, so that only the synthesized translated speech is sent to him.
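A minimal sketch of the mixing behavior of step S514, using plain float sample lists purely for illustration; the assumed duck_gain parameter is not from the patent, and setting it to 0 corresponds to the alternative of sending only the synthesized speech.

```python
DUCK_GAIN = 0.3  # relative volume of Alice's natural speech in the mix

def mix_streams(original, translated, duck_gain=DUCK_GAIN):
    """Sum the attenuated ("ducked") original signal with the synthesized
    translation, padding the shorter signal with silence."""
    n = max(len(original), len(translated))
    original = original + [0.0] * (n - len(original))
    translated = translated + [0.0] * (n - len(translated))
    return [duck_gain * o + t for o, t in zip(original, translated)]

mixed = mix_streams([0.2, 0.4, -0.1], [0.0, 0.5, 0.5, -0.2])
print(mixed)  # the single mixed stream sent to Bob (step S516)
```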
As mentioned, the target-language text may also be sent by the bot to Bob (and displayed via his client user interface, e.g., in a chat interface or as subtitles). As also mentioned, the source-language text obtained by the speech recognition procedure (on which the translation is based), and/or other recognition information relating to the speech recognition procedure performed on her speech (e.g., possible alternative recognitions, where an ambiguity is identified in performing the recognition procedure), can also be sent to Alice and displayed via her user interface so that she can gauge the accuracy of the recognition procedure. The client user interface may present various feedback options by which Alice can feed information back to the bot via the network in order to improve and refine the speech recognition procedure as performed on her speech. The source-language text may also be sent to Bob (e.g., if Bob selects an option via his client user interface to receive it), for example if Bob is more skilled at reading the source language spoken by Alice than at understanding it audibly.
In an embodiment, the speech-to-text component 406 can output intermediate speech recognition results as each word (or some other portion of speech) is recognized, e.g., on a per-word basis, a textual version of which can be displayed at Alice's user device as Alice speaks. That is, for at least one interval of speech activity by the source user, the speech recognition procedure may be configured to generate partial "provisional" speech recognition results while the speech activity is ongoing, before generating a final speech recognition result when the speech activity is completed (i.e., when Alice at least temporarily stops speaking). The translation is ultimately generated using the final result (not the partial results, which may undergo changes before the translation is performed; see below), but information relating to the partial results is nonetheless sent and output to Alice before the translation. This invites the source user (Alice) to influence the eventual translation, for example by modifying her voice activity accordingly whenever she observes an inaccuracy in the partial results (e.g., by repeating some portion that she can see has been misinterpreted).
As Alice continues to speak, the recognition is then refined so that component 406 can effectively "change its mind", as appropriate, about word(s) it has previously recognized in view of the context provided by subsequent words. In general, component 406 can generate initial (and effectively provisional) speech recognition results in substantially real time (e.g., with the results updated on a timescale of about 2 seconds), which can be displayed to Alice in substantially real time so that she can get a sense of how accurately her speech is being recognized; even though the provisional results are subject to change before the final results from which the translation audio is actually generated, they can still convey enough meaning to Alice to be useful. For example, if Alice can see that the recognition procedure has interpreted her speech in a highly inaccurate manner (and thus knows that, if she simply continues to speak, the resulting translation subsequently output to Bob will be misinterpreted or meaningless), she can interrupt her current stream of speech and repeat what she has just said, rather than completing the entire portion of speech before the error becomes apparent (which might otherwise happen only after Bob has heard, and failed to understand, the garbled or meaningless translation). As will be appreciated, this helps support a natural flow of conversation between Alice and Bob. Another possibility is to provide a button or other UI mechanism that Alice can use to stop the current recognition and restart.
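A minimal sketch of this provisional/final result routing; the event format is an assumption for illustration, not the patent's interface.

```python
def route_recognition_events(events, show_to_source, translate_and_send):
    for event in events:
        if event["final"]:
            # Final hypothesis: this is what actually gets translated.
            translate_and_send(event["text"])
        else:
            # Provisional hypothesis: shown to Alice so she can spot
            # misrecognitions and restate herself before translation.
            show_to_source(event["text"])

route_recognition_events(
    [{"final": False, "text": "how are"},
     {"final": False, "text": "how are you"},
     {"final": True,  "text": "how are you?"}],
    show_to_source=lambda t: print("UI (Alice):", t),
    translate_and_send=lambda t: print("translate:", t),
)
```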
In this embodiment, the mixer 412 of FIG. 4A is also implemented by the relay system 108 itself. That is, in addition to implementing the translator function, the relay system 108 also implements the call audio mixing function, whereby multiple individual audio streams are mixed into a single respective audio stream to be sent to each human participant. Implementing the mixing function at the relay system 108 itself, rather than elsewhere in the system (e.g., at one of the user devices 104a, 104b), provides the bot with convenient access to the individual audio streams, and, as noted above, access to the individual call audio streams enables better translation quality. Where the relay system 108 is also localized, this further ensures that the bot has direct, quick access to the individual audio streams, which further minimizes any translation delays.
In the case where additional users participate in the call (in addition to Alice, Bob, and the bot itself), the bot 402 may also perform a separate translation for each of those users' call audio streams. Where more than two human users are engaged in a conversation, the audio streams of all those users may be received individually at the relay system 108 for mixing there, thereby also providing the bot with convenient access to all of those individual audio streams. Each user may then receive a mixed audio stream containing all necessary translations (i.e., synthesized translated speech for each user speaking a language different from that user's). A system with three (or more) users may have each user speaking a different language, in which case each user's speech is translated into the two (or more) target languages of the other users, and the speech of those two (or more) other speakers is translated into that user's language. Each user may be presented with the original text and their own translations via their client UI. For example, suppose user A speaks English, user B speaks Italian, and user C speaks French. When user A speaks, user B will see English and Italian, whereas user C will see English and French.
In some existing communication systems, the user initiating a group call is automatically assigned to host the call, with call audio mixed by default at that user's device and the other clients in the call by default automatically sending their audio streams to that user for mixing. The designated host then generates, for each user, a respective mixed audio stream that is a mixture of the audio of all other participants (i.e., all audio except that user's own audio). In such a system, requesting that the bot initiate the call will ensure that the bot is assigned as the host, thereby ensuring that each of the other participants' clients will by default send their individual audio streams to the relay system 108 for mixing there, and thereby by default granting the bot access to the individual audio streams. The bot then provides to each participant a respective mixed audio stream that includes not only the audio of the other human participants but also any audio (e.g., synthesized translated audio) to be conveyed by the bot itself.
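A minimal sketch of this host-side "mix-minus-one" rule, with per-user sample lists standing in for real audio streams; the bot_streams shape is an assumption for illustration only.

```python
def build_mixes(streams, bot_streams):
    """streams: {username: samples} from the human participants;
    bot_streams: {username: synthesized translation samples destined
    for that user} (assumed shape, purely illustrative)."""
    mixes = {}
    for user in streams:
        # Every stream except the user's own, plus the bot's audio for them.
        inputs = [s for u, s in streams.items() if u != user]
        inputs.append(bot_streams.get(user, []))
        n = max(len(s) for s in inputs)
        padded = [s + [0.0] * (n - len(s)) for s in inputs]
        mixes[user] = [sum(vals) for vals in zip(*padded)]
    return mixes

mixes = build_mixes(
    {"alice": [0.1, 0.2], "bob": [0.0, -0.1], "carol": [0.3, 0.3]},
    {"bob": [0.5, 0.5]},  # e.g. synthesized translation of Alice, for Bob
)
print(mixes["bob"])  # Alice + Carol + bot audio; Bob's own voice excluded
```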
In some bot-based implementations, the client software may be modified (specifically, the client graphical user interface may be modified) to disguise the fact that a bot is performing the translation. That is, from the perspective of the underlying architecture of the communication system, the bot appears substantially as if it were another member of the communication system, so that it can be seamlessly integrated into the communication system without modification of the underlying architecture; however, this may be hidden from the users, such that the fact that any in-call translations they receive are conveyed by a bot that is a participant in the call (at least according to the underlying protocol) is essentially not visible at the user interface level.
Although described above with reference to a bot implementation, i.e., with reference to a translator agent that is integrated into communication system 100 by associating the translator agent with its own user identifier so that it appears to be a regular user of communication system 100, other embodiments need not be bot implementations. For example, the translator relay 108 may instead be integrated into the communication system as part of the architecture of the communication system itself, with communications between the system 108 and the various clients implemented by a communication protocol customized for such interactions. For example, the translator agent may be hosted in the cloud as a cloud service (e.g., running on one or more virtual machines implemented by an underlying cloud hardware platform).
That is, the translator system can be, for example, a computer device (or a system of such devices) running a bot with a user identifier, or a translator service running in the cloud, or the like. Either way, call audio is received from the source user, but the translation is sent directly from the translator system to the target user (without being relayed through the source user's client); i.e., in each case the translator system acts as an effective relay between the source user and the target user. The cloud (or similar) service can be accessed directly, e.g., from a web browser (by downloading a plug-in, or using plug-in-free in-browser communication, e.g., based on JavaScript), from a dedicated software client (application or embedded), by dialing in from a regular phone or mobile phone, and so on.
A method of delivering a translation of a source user's speech to a target user will now be described with reference to FIGS. 6, 7A-7E, and 8.
FIG. 8 illustrates a notification-based translation system 800 that includes the following functional blocks (components): a speech-to-speech translator (S2ST) 802 (which may implement similar functionality to the S2ST subsystem constituted by components 404 and 410 in FIGS. 4A/B) that performs a speech-to-speech translation procedure to generate synthesized translated speech in the target language from Alice's call audio, the call audio including Alice's speech in the source language to be translated; and a notification generation component (notification component) 804 configured to generate, when a change in the translation behavior of the translation flow is detected by the notification component, one or more notifications for output to the target user that convey that change separately from the translated audio itself; a change in translation behavior is a change in the nature of the translation-related operations performed in providing the in-call translation service. These components represent functionality that may be implemented, for example, by running code 110 on translator relay 108 (or by running code on some other back-end computer system), by running client 118a on device 104a, by running client 118b on device 104b, or any combination thereof (i.e., with the functionality distributed across multiple devices). In general, system 800 can be implemented by any computer system of one or more computer devices, in a localized or distributed manner.
The translation procedure outputs the audio translation as an audio stream that is output to the target user via the target device speaker(s) as it is produced (e.g., streamed to the target device via the network when the translation is performed remotely, or supplied directly to the speaker(s) when it is performed locally). Thus, the output of the audio translation by the translation flow occurs substantially simultaneously with the output of the translation at the target device (i.e., the only significant delays are those introduced by latency in the network and/or at the target device, etc.).
Additionally, the system 800 includes a notification output component 806 and a translation output component 808, which are implemented at the target user device 104b separately from each other (receiving separate and distinct inputs) and represent functionality implemented by running the client 118b at the target user device 104b. Components 806 and 808 receive (from components 804 and 802, respectively) the generated notification(s) and the translated audio, respectively, and output them to the target user (the translated audio being output via the speaker(s) of the target device). The notification(s) (respectively, the translated audio) can be received via the network 106 where the notification generation component 804 (respectively, the translator 802) is implemented remotely from the target user device (e.g., at a source device and/or server, etc.), or locally if the component is implemented on the target device itself.
The speech-to-speech translator has an input connected to receive Alice's call audio (e.g., via network 106, or locally where component 802 is implemented at Alice's device), a first output connected to the translation output component 808 for delivering the translated audio to Bob (e.g., via network 106, or directly to Bob's speaker(s) when implemented at Bob's device), and a second output connected to a first input of the notification component 804. This second output communicates a signal to the notification component that signals a change in the behavior of the translation flow (e.g., via network 106 when those components are implemented at different devices, or by way of local, e.g., internal, signaling when implemented on the same device). The notification generation component has an output connected to an input of the notification output component 806, which causes the aforementioned notification to be output (by the notification output component) to Bob to notify him when such a change is detected. The notification output component has at least one output connected to a respective at least one output device (display, speaker, and/or other output device) of the target user device 104b for outputting the notification(s). The translation output component 808 has an output connected to the speaker(s) of the target user device 104b for outputting the audio translation.
In addition, notification output component 806 has a second output connected to a second input of the notification component that supplies information related to the output regarding the manner in which the notification(s) are to be output at the target user device for use in generating the notifications. That is, notification output component 806 feeds back information to notification generation component 804 regarding the manner in which the notification(s) are to be output at the target device, which notification generation component uses the information to determine how to generate the notifications. Thus, the manner in which the notification(s) are generated may depend on the manner in which they are actually to be output at the device. This information may be fed back remotely via the network 106 if the notification generation component 804 is implemented remotely, or the feedback may be a localized (internal) process at the target device if the notification generation component 804 is implemented locally at the target device.
In the case where a visual notification is displayed on a display of the target device, the output-related information includes layout information conveying how the output notification is to be positioned in the available area of the target device display.
In the examples described below, the notification generation component 804 generates composite video data of an animated "avatar" for display on Bob's user device (which may be sent over the network 106, or communicated directly to the display when component 804 is implemented at Bob's device). In these examples, the notification(s) of changes are implemented as visual behaviors of the avatar in the composite video. The layout information includes information about where the avatar video is to be displayed in the available display area of the target device relative to the displayed video of the target user (Bob) and/or the source user (Alice) during the video call, for use in determining the visual behavior of the avatar.
Fig. 6 is a flow chart for the method. The method of fig. 6 is performed during, and as part of, an established voice or video call between a source user (e.g., Alice) using a source user device (e.g., 104a) and a target user (e.g., Bob) using a target user device (e.g., 104b), wherein a translation procedure is performed on the call audio of the call, which comprises the source user's speech in the source language, to generate an audio translation of the source user's speech in the target language for output to the target user. The translation procedure may be performed at the translator relay in the manner described above, or it may instead be performed, for example, at one of the user devices or at some other component of the system (e.g., a server that performs the translation procedure but does not act as a relay, for instance returning the translation to the source user device for indirect transmission to the target user device). The method is a computer-implemented method, implemented at runtime by suitably programmed code (e.g., code 110 running on the processor 304 of fig. 3 and/or the client code of clients 118a and/or 118b). That is, the method may be performed in any suitable communication system for implementing a voice or video call between a source user speaking a source language and a target user speaking a target language, to implement some form of in-call speech-to-speech translation procedure generating synthesized translated speech in the target language for output to the target user.
In speech-to-speech translation involving such a flow, the overall translation may proceed as follows: the source user, e.g., Alice, speaks in her own (source) language; the system recognizes the speech, translates it, and sends the synthesized text-to-speech translation to the listener. Even when supported by video, there may be a delay (e.g., up to a few seconds) between the other party stopping speaking and the translated audio being sent. This creates a great deal of confusion, making it difficult for listeners to know when they can safely begin speaking without interrupting their conversation partner.
In other words, Alice's speech typically consists of intervals of speech activity in which Alice speaks the source language interspersed with intervals of speech inactivity in which Alice does not speak, e.g., because she is waiting for Bob to speak or because she is currently listening to what Bob says.
To this end, the method comprises signaling a change in the behavior of the translation flow, the change relating to the generation of the translation, thereby causing a notification to be output to the target user to notify the target user of the change when it is detected. The signaling may be remote, via the network 106, if the translation procedure is not performed at the target device. There may also be some benefit in outputting the same or a similar notification to the source speaker: for example, if they see the translation component busy performing a translation, they may pause, allowing their interlocutor to catch up before continuing with the rest of what they have to say.
In the following example, the possible signaled changes in behavior include the translation flow entering the following states (a brief illustrative code sketch follows this list):
a "listen" ("wait") state in which it is not currently generating or outputting any translations, e.g., because it has nothing to translate (e.g., it is entered when it has completed translation of all voices in the most recent interval of voice activity by Alice, and Alice is still in an interval of voice inactivity (i.e., has not yet resumed speaking), so the flow has nothing to do at that point in time);
an "attention" ("passive translation") state in which Alice is currently speaking and the flow is monitoring (i.e., listening) the speech for translation thereof (e.g., entering from the listening state when Alice resumes speaking), which may also generate a temporal partial translation at that point in time (see above);
a "think" ("active translation") state in which Alice may not currently speak but has spoken sufficiently recently for the process to still process her most recent speech in order to translate it (e.g., enter from an attentive state when Alice stops speaking);
a "speaking" ("output") state in which the generated audio translation is currently being output (e.g., entered upon reaching a point in time at which the "speaking" state becomes feasible, such as when the process has just completed generating a translation of speech that Alice spoken during the most recent interval of speech activity by Alice).
A "confusion" ("error") state in which the flow is currently unable to proceed, for example because it is already unable to perform a translation of speech or some other error has occurred (entered at the point in time such an error was recognized).
In particular embodiments, where the robot has a video stream to Bob's device (not shown in fig. 4A/B), the robot can play the role of a "talking head" avatar, which is animated so that it is apparent when it is speaking, listening (waiting), etc. An avatar is an artificially generated graphical representation of an animated character that can be animated, for example, to convey meaning through visual cues such as facial expressions, body language, and other gestures. Here, the behavior of the avatar is controlled to match the behavior of the translation flow, i.e., the avatar effectively mimics the visual cues of a real human translator (performing turn-by-turn translation) or interpreter (performing consecutive interpretation), thus providing an engaging and intuitive user experience for the target user and enabling the information conveyed by the avatar's appearance to be easily understood. For example, in a conversation mediated by a human translator, the listener will attend to the translator until they have finished and then begin speaking; by means of the above-described signaling, the avatar can be made to mimic this behavior by causing it to adopt a visual gesture indicating that it is listening to Alice's speech when the flow enters the attention state, and by causing its lips to move as the translation flow enters the speaking state, coinciding with the start of the output of the audio translation.
Thus, the avatar behaves like a human translator and provides visual cues; for example, by adopting a listening gesture when entering the listening state, it indicates to the listener when it is safe to begin speaking. Accordingly, the target user's client may output, via the speaker component(s), an audible translation of the source user's speech in the target language during an interval of voice activity (i.e., the portion of the translation corresponding to the source speech in that interval), and output an indication (notification) to the target user, when output of that audible translation has substantially completed, to indicate that the target user is free to respond to the source user. Here, "substantially completed" includes any point in time sufficiently close to completion of the output that Bob can safely begin speaking without interrupting the natural flow of the conversation.
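A minimal sketch of this "substantially completed" timing follows, under the assumption of a single tunable margin; the 300 ms value and all names are illustrative, not taken from this disclosure.

```python
# Illustrative only: a fixed margin decides when output is "substantially
# completed" and the target user may safely begin speaking.
SUBSTANTIALLY_COMPLETE_MARGIN_S = 0.3  # assumed tuning value

def free_to_respond(playback_position_s: float, translation_duration_s: float) -> bool:
    """True once so little translated audio remains that the target user
    can begin speaking without interrupting the conversational flow."""
    remaining_s = translation_duration_s - playback_position_s
    return remaining_s <= SUBSTANTIALLY_COMPLETE_MARGIN_S
```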
As will be apparent, the above-mentioned changes in the state of the translation (or interpretation) flow in fact closely reflect actual changes in the mind of a human translator or interpreter (simultaneous translation) in a real-life live translation or interpretation scenario. That is, a real-life human mind passes through the same listening, waiting, attending, speaking, or confused states as the automated flow. This is exploited by configuring the avatar to approximate the various actions a human translator would be expected to perform when communicating such changes in their mental state in a real-life translation scenario, these changes corresponding to changes in the behavior of the translation flow. This is explained below with specific reference to FIGS. 7A-E, which illustrate the visual behavior of the avatar.
The avatar may be, for example, a representation of a human, animal, or other character having at least one visual characteristic (e.g., facial feature(s), body part(s), and/or an approximation thereof) that may be adapted to convey visual cues in a manner that at least partially mimics the behavior of a human translator.
In a three-party video conversation with robot-based speech-to-speech translation, where the robot is integrated into an existing communication system, there may "by default" be two videos and one picture shown on the screen (since the communication system will simply treat the robot as another user that happens to have no video capability but does have a still picture associated with its username in the communication system): the video of the caller, the video of the person being called, and a still picture representing the translation robot.
For example, in a speech-to-speech translation system (S2ST) that includes video, Bob's client's UI may show the video of the far-end user (Alice), the video of the near-end user (e.g., in a smaller portion of the available display area than Alice's video), and some picture, such as a static robot graphic, associated by default with the robot's username. Bob can see the movement of Alice's lips while Alice speaks in her own language, and waits until Alice finishes speaking. Thereafter, the translator robot processes the audio (recognition and translation) and begins speaking in Bob's language. During this time, Bob has no visual cue as to whether and when the translation process is complete and whether and when it is safe to begin speaking. This tends to confuse Bob.
According to a particular embodiment, the idea is to effectively replace the picture of the translator robot with an avatar, thereby enabling:
use of avatars in a speech-to-speech translation system
Avatars that imitate the gestures a human translator or interpreter would make
That is, to avoid such confusion, the still picture is replaced with an avatar that is visually represented like a human translator. This can be achieved, for example, by sending a synthetically generated video stream (generated in the manner described below) from the robot to the target user as if it were a video stream from another human user on the video call; the video stream will then be automatically displayed via the client user interface (which would not require modification to the client software and would be compatible with legacy clients). Alternatively, the video could be generated at the target device itself, but nonetheless displayed as if it were incoming video from another user (which may require some modification to the client software, but is more efficient in terms of network resources as it does not require the avatar video to be sent via the network 106).
Figs. 7A-E illustrate the display of Bob's user device 104b at various points in time during a video call. As shown, at each of these points in time, Alice's video 702, as captured at Alice's device 104a, is displayed in a first portion of the available display area, alongside the synthetic avatar video 704, which is displayed in a second portion of the available display area (the first and second portions having similar sizes). Bob's video 706, captured at Bob's device 104b (and also sent to Alice), is shown in a third portion of the available display area, below the avatar video 704 (the third portion being smaller than the first and second portions in this example). In this example, for illustration, the avatar has an approximately human male form.
Returning to fig. 6, at step S600 of fig. 6, the in-call translation flow starts. The in-call translation flow causes Alice's speech to be translated from the source language into synthesized speech in the target language for output to Bob during, and as part of, a voice or video call in which at least Alice and Bob are participants.
In this example, the translation flow starts in the "listening" state, which is signaled to the notification generation component 804 (S602). In this case, the avatar is controlled in the composite video by the notification generation component 804 to assume a listening gesture, as illustrated in FIG. 7A.
At step S604, the translator component detects whether Alice has begun to speak, for example by monitoring the call audio received from Alice and performing Voice Activity Detection (VAD) on it. As long as the translation flow remains in the listening state (which is the case until Alice starts speaking), the avatar remains in the listening gesture. Upon detecting that Alice has begun speaking, the translator 802 signals to the notification generation component 804 that the translation flow has entered the "attention" state (S606), e.g., in which it monitors Alice's speech for eventual translation, begins preparing to translate it, or performs a partial translation of the speech that may undergo modification once more speech is received (as later speech may provide context that affects recognition or translation of earlier speech). In response, the notification generation component controls the avatar to adopt visual listening behavior, e.g., so that the avatar attends to Alice while the remote user speaks, e.g., by turning its face toward Alice's video. This is illustrated in fig. 7B.
FIG. 7B illustrates an example in which fed-back layout information relating to the relative positions of Alice's and the avatar's videos in the available display area of the target device can be used to influence the generation of the avatar video itself. In the example of FIG. 7B, the avatar video is displayed to the right of Alice's video, and layout information conveying this relative positioning is fed back from the notification output component 806 to the notification generation component 804. Based on this information, when the translator enters the "attention" state, the notification generation component 804 controls the avatar video to move the avatar's eyes to the left, thereby ensuring that they are directed toward the portion of the target device display where Alice's video is displayed, giving the impression that the avatar is looking at Alice and attending to her. Thus, the layout-related information is used to make the user experience more natural for Bob by making the avatar behave naturally and intuitively.
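As an illustrative sketch of how such fed-back layout information might steer the avatar's gaze, the following assumes rectangular video regions in display coordinates; the Rect type and the centre-to-centre heuristic are assumptions, not part of this disclosure.

```python
# Illustrative sketch: derive a gaze direction from fed-back layout info.
from dataclasses import dataclass


@dataclass
class Rect:
    x: float  # left edge in display coordinates
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def gaze_direction(avatar_rect: Rect, target_rect: Rect):
    """Direction the avatar's eyes should move to appear to look at another
    displayed video, e.g., Alice's video in the attention state or Bob's
    own video when Bob speaks."""
    ax, ay = avatar_rect.center
    tx, ty = target_rect.center
    return (tx - ax, ty - ay)  # negative x: look left; positive y: look down
```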
At step S608, it is determined whether Alice is still speaking (i.e., whether she has paused for a sufficient, e.g., predetermined, amount of time since her most recent interval of voice activity), for example using the VAD. For as long as Alice is still speaking, the translation flow remains in the "attention" state, and the avatar therefore continues to exhibit listening behavior. When Alice does stop speaking, the translation flow enters the "thinking" state, during which it performs the processing for outputting the final audio translation of Alice's most recent interval of speech. This is signaled to the notification generation component (S610), and in response the notification generation component causes the avatar to adopt visual behavior conveying the act of thinking, e.g., a thinking gesture such as placing its hand near its cheek or mimicking a pensive face, as illustrated in fig. 7C.
The avatar remains in this pose while the translation flow performs the processing; when the processing is complete, the translation flow enters the "speaking" state and starts outputting the now-ready audio translation (see S612). This is signaled at step S616, and in response the avatar is controlled to adopt a speaking visual state, e.g., the avatar can attend to (turn its face toward) the near-end user (i.e., look directly out of the display) while speaking the translation, and exhibit speaking lips (i.e., lip movement). This is illustrated in fig. 7D. For as long as the translator remains in the speaking state (i.e., for as long as translated audio is being output), the avatar remains in that state; upon completion of the output, the translator re-enters the listening state (see S620).
If an error occurs during processing, the translator enters the "confusion" state, which is signaled to the notification generation component (S614). In response, the avatar is controlled to enter a confused visual state, for example clutching its head or adopting some other visual expression of confusion. This is illustrated in fig. 7E. Additionally, when an avatar is also displayed at Alice's device, the avatar may "ask" Alice to repeat herself (e.g., "please say that again, I didn't catch it"), i.e., an audio request may be output to Alice in the source language asking her to repeat what she just said.
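The loop of FIG. 6 can be condensed, purely as an illustrative sketch building on the state enum above, as follows. The vad.is_speech() predicate, frames carrying a duration_s attribute, the pause threshold value, and the synchronous translate()/play() calls are all simplifying assumptions; a real implementation would be asynchronous.

```python
# Illustrative rendering of the FIG. 6 control loop.
PAUSE_THRESHOLD_S = 0.8  # assumed "sufficient amount of time" of silence

def run_in_call_translation(flow, vad, frames, translate, play):
    flow.enter(TranslationState.LISTENING)                  # S602
    segment, silence_s = [], 0.0
    for frame in frames:                                    # Alice's call audio
        if vad.is_speech(frame):
            if flow.state is TranslationState.LISTENING:
                flow.enter(TranslationState.ATTENTION)      # S606: Alice speaks
            segment.append(frame)
            silence_s = 0.0
        elif flow.state is TranslationState.ATTENTION:
            silence_s += frame.duration_s
            if silence_s >= PAUSE_THRESHOLD_S:              # S608: Alice stopped
                flow.enter(TranslationState.THINKING)       # S610
                try:
                    audio = translate(segment)
                except Exception:
                    flow.enter(TranslationState.CONFUSION)  # S614
                else:
                    flow.enter(TranslationState.SPEAKING)   # S616
                    play(audio)                             # output translation
                flow.enter(TranslationState.LISTENING)      # S620
                segment = []
```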
One piece of information conveyed visually by the avatar is thus an indication of when the target user is free to start speaking: the point in time at which the avatar's lips stop moving constitutes the visual indication conveying this.
Avatar behavior may also be influenced by other events. For example, the notification generation component 804 can also receive information related to Bob, such as information about Bob's behavior (in addition to receiving information related to Alice, which in this case is received by way of information about the translation procedure performed on Alice's speech). For example, Bob's voice may also be analyzed to detect when Bob begins speaking, at which point the avatar can be controlled to look at Bob's video 706 as displayed on Bob's display. Fed-back layout information relating to the position of Bob's video on his display can also be used to control avatar behavior; e.g., in the example of FIGS. 7A-E, Bob's video is displayed below the avatar's video 704, and on that basis the avatar can be controlled to look down when Bob speaks, thereby appearing to look at Bob.
Although described with reference to a robot, it should be noted that the subject matter described with reference to fig. 6, 7A-E and 8 also applies to non-robot-based systems in which the avatar can be configured to behave in the same manner, but effectively represents some other translation service (e.g., a cloud-based translation service) rather than the robot (which has an assigned user identifier and thus appears as a user of the communication system) itself.
Furthermore, although in the above the notification constitutes a visual notification conveyed by an animated avatar (i.e., implemented in the avatar video), in other embodiments the notification can take any desired form on the display, for example an icon that changes shape, color, etc. (e.g., an animated representation of a light switching from red to green when it is safe for Bob to begin speaking), or an audible indication output via a speaker (e.g., a tone or other sound icon), or a tactile notification achieved by actuating a vibration component and/or other mechanical component of the device, e.g., causing a physical tactile vibration of Bob's user device. Audio and/or tactile notifications may be particularly useful for mobile devices.
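As an illustrative sketch, a notification output component might map a signaled change onto whichever of these modalities the device supports; the device capability flags and output calls below are assumptions, not an API from this disclosure.

```python
# Illustrative sketch: dispatch a notification across available modalities.
def output_notification(device, safe_to_speak: bool):
    if device.has_display:
        # e.g., a light-style icon switching from red to green when it is
        # safe for Bob to begin speaking
        device.show_icon("green" if safe_to_speak else "red")
    if device.has_speaker:
        device.play_tone("chime" if safe_to_speak else "tick")
    if safe_to_speak and device.has_vibrator:
        device.vibrate(duration_ms=150)  # tactile cue, useful on mobile devices
```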
As mentioned, although the above has been described with reference to one-way translation for simplicity, two-way translation may be performed, where separate and independent translations are performed for each individual call audio stream. Additionally, although the above has been described with reference to a call having two human participants, calls between any number (n > 2) of human participants are also contemplated, where up to n translations are performed (e.g., if all n users speak different languages). Separate translations may be performed, separately and independently of each other, on the separate audio streams from the different human participants during the n-way conversation, for the benefit of (e.g., for transmission to) one or more of the other human participants. Additionally, a translation in the target language may be sent to multiple target users who all speak the target language.
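A sketch of how such n-way, per-stream translation might be organized follows: each speaker's stream is translated once per listener language and the result is fanned out. The stream and user interfaces are assumptions for illustration only.

```python
# Illustrative sketch: independent per-speaker translation in an n-party call.
def route_translations(streams, users, translate):
    for speaker, audio in streams.items():
        listener_languages = {u.language for u in users if u is not speaker}
        for lang in listener_languages - {speaker.language}:
            translated = translate(audio, src=speaker.language, dst=lang)
            for listener in users:
                if listener is not speaker and listener.language == lang:
                    listener.send(translated)  # one translation, many targets
```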
Reference to a media (e.g., audio/video) stream (or the like) refers to the transmission of media to a device via a communication network for output at the device as it is received, as opposed to media received in its entirety before output commences. For example, where a composite audio or video stream is generated, the media is sent to the device as it is generated, for output as it is received (and thus, possibly, while it is still being generated).
In accordance with another aspect of the subject matter, the present disclosure contemplates a method performed in a communication system in which users are uniquely identified by associated user identifiers, the communication system being for effecting a voice call or a video call between a source user speaking a source language and a target user speaking a target language, the communication system maintaining computer code configured to implement a translator agent, the translator agent also being uniquely identified by an associated user identifier, thereby enabling communication with the agent substantially as if the agent were another user of the communication system, the method comprising: receiving a translation request requesting that the translator agent participate in a call; and, in response to receiving the request, including an instance of the translator agent as a participant in the call, wherein the translator agent instance, when so included, is configured to cause operations comprising: receiving call audio from the source user, the call audio comprising speech of the source user in the source language, performing an automatic speech recognition procedure on the call audio, the speech recognition procedure being configured to recognize the source language, and using results of the speech recognition procedure to provide a translation of the source user's speech in the target language to the target user.
The agent may appear (by virtue of its associated user identifier) as another member of the communication system (e.g., in the user's contact list), or the nature of the bot may be hidden at the user interface level.
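Purely as an illustrative sketch of this aspect, in which an agent identified by its own user identifier is added to the call like any other participant on request; all class and method names below are assumptions.

```python
# Illustrative sketch: a translator agent addressable by a user identifier.
class TranslatorAgent:
    def __init__(self, user_id, recognizer, translator):
        self.user_id = user_id        # makes the agent look like another user
        self.recognizer = recognizer  # ASR configured for the source language
        self.translator = translator

    def on_call_audio(self, call, audio, target_user):
        text = self.recognizer.recognize(audio)  # automatic speech recognition
        translation = self.translator.translate(text, target_user.language)
        call.deliver(self.user_id, target_user, translation)


def on_translation_request(call, request):
    # Include an instance of the agent as a participant in the call.
    agent = TranslatorAgent(request.agent_user_id, request.recognizer,
                            request.translator)
    call.add_participant(agent)
    return agent
```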
In accordance with yet another aspect of the present subject matter, there is disclosed a computer system for use in a communication system for enabling a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language, the computer system comprising: one or more audio output components available to the target user; a translation output component configured to output, via the audio output component, for at least one interval of source user voice activity, an audible translation in the target language of the source user's voice during the interval; and a notification output component configured to output a notification to the target user when the output of the audible translation has substantially completed to indicate that the target user is free to respond to the source user.
According to yet another aspect of the present subject matter, a user equipment comprises: one or more audio output components; a display component for outputting visual information to a target user of the user device; computer storage holding client software for enabling a voice or video call between the target user and a source user of another user device, the source user speaking a source language and the target user speaking a target language; a network interface configured to receive call audio of the call via a communication network, the call audio comprising speech of the source user in the source language during intervals of source user speech activity; and one or more processors configured to execute the client software, the client software being configured to perform the following operations when executed: outputting, via an audio output component, the received call audio; outputting, via the audio output component, for at least one interval of source user voice activity, an audible translation in the target language of the source user's speech during the interval; and outputting to the target user, when output of the audible translation has substantially completed, an indication that the target user is free to respond to the source user.
Generally, any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), or a combination of these implementations. The terms "module," "functionality," "component," and "logic" as used herein generally represent software, firmware, hardware, or a combination thereof (e.g., the functional blocks of fig. 4A, 4B, and 8). In the case of a software implementation, the module, functionality, or logic represents program code (e.g., the method steps of fig. 5 and 6) that performs specified tasks when executed on a processor (e.g., CPU or CPUs). The program code can be stored in one or more computer readable memory devices. The features of the techniques described below are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.
For example, the user device may also include an entity (e.g., software, such as the client 118) that causes hardware of the user device to perform operations (e.g., processor functional blocks, etc.). For example, the user device may include a computer-readable medium that may be configured to maintain instructions that cause the user device, and more particularly an operating system and associated hardware of the user device, to perform operations. Thus, the instructions serve to configure the operating system and associated hardware to perform the operations, and in this manner cause a transformation of the state of the operating system and associated hardware to perform functions. The instructions may be provided to the user device by the computer-readable medium in a variety of different configurations.
One such configuration of a computer-readable medium is a signal bearing medium and thus is configured to transmit the instructions (e.g., as a carrier wave) to a computing device, e.g., via a network. The computer-readable medium may also be configured as a computer-readable storage medium and thus is not a signal bearing medium. Examples of a computer-readable storage medium include Random Access Memory (RAM), Read Only Memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions and other data.
In an embodiment of the first aspect as set forth in the summary section, the change in behavior may be one of:
the translation flow enters a listening state, where the translation flow is currently waiting for future voice activity by the source user during a current interval of voice inactivity by the source user.
The translation flow enters a passive translation state in response to the source user starting a period of voice activity, where the translation flow is monitoring the current voice activity by the source user in the call audio.
The translation flow enters an active translation state in response to the source user completing an interval of speech activity in which the translation flow is currently generating an audio translation of the source user's speech in that interval to be output when that generation is complete.
The translation flow enters an output state in response to completion of generation of an audio translation of the source user's speech by the translation flow during a previous interval of source user speech activity, wherein the generated audio translation is currently being output by the translation flow for output to the target user.
The translation flow enters an error state in response to the flow encountering an error in generating the translation.
The translated audio may be transmitted, as it is generated, to a target device of a target user via a communication network for output, as it is received, via one or more audio output components of the device.
A composite video may be generated from the signaled change in behavior, the composite video being for display at the target user device of the target user and implementing the notification. The composite video may have an animated avatar that performs visual actions, the notification being implemented by the avatar as a visual action. The actions implemented may approximate the actions that a human translator or interpreter would be expected to perform when communicating a change in their state of mind in a real-life translation or interpretation scenario, the change corresponding to a change in the behavior of the translation flow.
The notification may include a visual notification for display at a target user device of the target user, and/or an audio notification for playout at the target user device, and/or a tactile notification output by actuating a mechanical component of the target user device.
In embodiments of the second aspect, the call audio may comprise speech of the source user in the source language during intervals of source user speech activity interspersed with intervals of speech inactivity in which the source user is not speaking; for an interval of the source user voice activity, the translation output component may be configured to output, via the audio output component, an audio translation of the source user's voice during the interval, and the notification output component may be configured to output a notification to indicate that the target user is free to respond to the source user when the output of the translation has substantially completed.
The computer system may be implemented by a target user device of a target user or by a combination of the target user device and at least one other computer device to which the target user device is connected via a communication network.
The computer system may include: an input configured to receive a signal signaling a change in a behavior of a translation process; and a notification generation component configured to generate a notification in accordance with the received signal.
The notification output component may be configured to generate output-related information defining a manner in which the notification is to be output to the target user; and the notification generation component may be configured to generate the notification in dependence on the output related information.
The computer system may include a display available to the target user, and the notification may include a visual notification to be displayed on the display, and the output related information includes related layout information. The notification generation component can be configured to generate a composite video that implements the notification, the composite video being generated according to the layout information. The composite video may have an animated avatar performing a visual action, the notification being implemented as a visual avatar action controlled according to the layout information.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (15)

1. A computer-implemented method performed in a communication system for enabling a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language, the method comprising:
receiving call audio for the call, the call audio comprising speech of the source user in the source language;
performing a translation procedure on the call audio to generate an audio translation of the source user's speech in the target language for output to the target user; and
signaling a change in behavior of the translation flow, the change being related to the generation of the translation, and thereby causing a notification to be output to the target user to notify the target user of the change.
2. The method of claim 1, wherein the change in the behavior is one of:
the translation process enters a listening state in which the translation process is currently awaiting future voice activity by the source user during a current interval of voice inactivity by the source user;
the translation process entering a passive translation state in response to the source user beginning a period of voice activity, the translation process in the passive translation state monitoring current voice activity by the source user in the call audio;
the translation process entering an active translation state in response to the source user completing an interval of speech activity in which the translation process is currently generating an audio translation of the source user's speech in the interval, the audio translation to be output when the generation is complete;
the translation process entering an output state in response to the translation process completing generation of an audio translation of the source user's speech during a previous interval of source user speech activity, wherein in the output state the generated audio translation is currently being output by the translation process for output to the target user;
the translation process enters an error state in response to the process encountering an error in generating the translation.
3. The method of any preceding claim, wherein the translated audio, as it is generated, is sent to a target device of the target user via a communication network for output via one or more audio output components of the device when the translated audio is received.
4. A method as claimed in any preceding claim, comprising generating a composite video from the change in the signalling of the behaviour, the composite video for display at a target user device of the target user and to implement the notification.
5. The method of claim 4, wherein the composite video has an animated avatar performing a visual action, the notification being implemented by the avatar as a visual action.
6. The method of any of the preceding claims, wherein the notification comprises a visual notification for display at a target user device of the target user, and/or an audio notification for playout at the target user device, and/or a tactile notification output by actuating a mechanical component of the target user device.
7. A computer program product comprising computer code stored on a computer readable storage device, the computer code when executed on a processor causing operations comprising:
establishing a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language;
outputting to the target user an audio translation of the source user's speech in the target language, the translation generated by performing an automatic translation procedure on call audio of the call, the call audio of the call comprising the source user's speech in the source language; and
outputting a notification to the target user to notify the target user of a change in behavior of the translation process, the change being related to the generation of the translation.
8. A computer system for use in a communication system for enabling a voice call or a video call between at least a source user speaking a source language and a target user speaking a target language, the computer system comprising:
one or more audio output components available to the target user;
a translation output component configured to output, via the audio output component, an audio translation of the source user's speech in the target language to the target user, the translation generated by performing an automatic translation procedure on call audio of the call that includes the source user's speech in the source language; and
a notification output component configured to output a notification to the target user to notify the target user of a change in behavior of the translation process, the change being related to the generation of the translation.
9. The computer system of claim 8, wherein the call audio comprises speech of a source user in a source language during an interval of source user speech activity interspersed with intervals of speech inactivity during which the source user does not speak;
wherein for at least one interval of source user voice activity, the translation output component is configured to output, via the audio output component, an audio translation of the source user's voice during the interval, and wherein the notification output component is configured to output the notification to indicate that the target user is free to respond to the source user when the output of the translation has substantially completed.
10. The computer system of claim 8 or 9, wherein the computer system is implemented by a target user device of the target user, or by a combination of the target user device and at least one other computer device, the target user device being connected to the at least one other computer device via a communication network.
11. The computer system of claim 8, 9 or 10, comprising:
an input configured to receive a signal signaling the change in the behavior of a translation process; and
a notification generation component configured to generate the notification in accordance with the received signal.
12. The computer system of claim 11, wherein the notification output component is configured to generate output-related information defining a manner in which the notification is to be output to the target user; and
wherein the notification generation component is configured to generate the notification in accordance with the output-related information.
13. The computer system of claim 12, comprising a display available to the target user, wherein the notification comprises a visual notification to be displayed on the display, and the output-related information comprises related layout information.
14. The computer system of claim 13, wherein the notification generation component is configured to generate a composite video that implements the notification, the composite video being generated according to the layout information.
15. The computer system of claim 14, wherein the composite video has an animated avatar performing a visual action, the notification implemented as a visual avatar action controlled according to the layout information.
HK17106649.3A 2014-05-27 2015-05-19 In-call translation HK1232996B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US62/003,400 2014-05-27
US14/622,311 2015-02-13

Publications (2)

Publication Number Publication Date
HK1232996A1 true HK1232996A1 (en) 2018-01-19
HK1232996B HK1232996B (en) 2020-06-12


Similar Documents

Publication Publication Date Title
CN106462573B (en) In-call translation
US20150347399A1 (en) In-Call Translation
US20160170970A1 (en) Translation Control
US10891952B2 (en) Speech recognition
US11114091B2 (en) Method and system for processing audio communications over a network
US10192550B2 (en) Conversational software agent
US10140988B2 (en) Speech recognition
US20170256259A1 (en) Speech Recognition
US11706332B2 (en) Smart notification system for voice calls
CN120278166A (en) Multi-party cross-language interaction method and system based on large language model and intelligent terminal
HK1232996A1 (en) In-call translation
HK1232996B (en) In-call translation
CN119324968A (en) Subtitle display method and system for video conference, electronic equipment and storage medium
WO2026032369A1 (en) Communication method, communication system and electronic device
EP3697069A1 (en) Method for providing a digital assistant in a communication session and associated communication network
CN121435991A (en) Cross-application simultaneous interpretation method and system supporting stream processing