GB2590470A - Providing an audio object - Google Patents
- Publication number: GB2590470A (application GB1918827.5)
- Authority: GB (United Kingdom)
- Prior art keywords
- audio object
- translation
- audio
- parameter
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06F40/40—Processing or translation of natural language
- G06F40/20—Natural language analysis
- G10L15/005—Language recognition
- H04S7/30—Control circuits for electronic adaptation of the sound field
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- H04S2400/13—Aspects of volume control, not necessarily automatic, in stereophonic sound systems
- H04S2420/03—Application of parametric coding in stereophonic audio systems
- H04S2420/11—Application of ambisonics in stereophonic audio systems
Abstract
An apparatus comprises at least one processor and at least one memory and is configured to receive a first audio object 402, preferably comprising speech in a first language. The apparatus then provides and renders a second audio object 401, preferably speech in a second language, based at least on the first audio object and at least one parameter. The parameter may relate to a translation method, such as literal translation or semantic translation, or may correspond to a period of time allowed for carrying out the translation. The apparatus receives an instruction, which may come from a transmitting user 302 or a receiving user 301, to modify the second audio object; in response to receiving the instruction, the at least one parameter is modified and a modified second audio object is provided based at least on the first audio object and the at least one modified parameter. The apparatus may find use in an automatic language translation system where a user may choose between a quick translation with relatively low accuracy and a higher-quality translation that takes longer to generate.
Description
PROVIDING AN AUDIO OBJECT
TECHNICAL FIELD
The present application relates generally to providing an audio object. More specifically, the present application relates to providing a first audio object based on a second audio object.
BACKGROUND
The amount of multimedia content increases continuously. Users create and consume multimedia content, which plays a significant role in modern society.
SUMMARY
Various aspects of examples of the invention are set out in the claims. The scope of protection sought for various embodiments of the invention is set out by the independent claims. The examples and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
According to a first aspect of the invention, there is provided an apparatus comprising means for performing: receiving a first audio object, providing a second audio object based at least on the first audio object and at least one parameter, causing rendering at least the second audio object, receiving an instruction to modify the second audio object, in response to receiving the instruction, modifying the at least one parameter, and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
According to a second aspect of the invention, there is provided a method comprising: receiving a first audio object, providing a second audio object based at least on the first audio object and at least one parameter, causing rendering at least the second audio object, receiving an instruction to modify the second audio object, in response to receiving the instruction, modifying the at least one parameter, and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
According to a third aspect of the invention, there is provided a computer program comprising instructions for causing an apparatus to perform at least the following: receiving a first audio object, providing a second audio object based at least on the first audio object and at least one parameter, causing rendering at least the second audio object, receiving an instruction to modify the second audio object, in response to receiving the instruction, modifying the at least one parameter, and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
According to a fourth aspect of the invention, there is provided an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to with the at least one processor, cause the apparatus at least to: receive a first audio object, provide a second audio object based at least on the first audio object and at least one parameter, cause rendering at least the second audio object, receive an instruction to modify the second audio object, in response to receiving the instruction, modify the at least one parameter, and provide a modified second audio object based at least on the first audio object and the at least one modified parameter.
According to a fifth aspect of the invention, there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving a first audio object, providing a second audio object based at least on the first audio object and at least one parameter, causing rendering at least the second audio object, receiving an instruction to modify the second audio object, in response to receiving the instruction, modifying the at least one parameter, and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
According to a sixth aspect of the invention, there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving a first audio object, providing a second audio object based at least on the first audio object and at least one parameter, causing rendering at least the second audio object, receiving an instruction to modify the second audio object, in response to receiving the instruction, modifying the at least one parameter, and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
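The sequence recited in each of the aspects above — receive a first audio object, provide and render a second audio object based on at least one parameter, then modify the parameter on instruction and provide a modified second audio object — can be sketched as follows. This is a non-limiting illustration; the class, method and parameter names are hypothetical and not taken from the claims.

```python
# Illustrative sketch of the claimed sequence; all names are hypothetical.
class AudioObjectProvider:
    def __init__(self, parameters):
        # e.g. {"method": "literal", "max_delay_s": 0.5}
        self.parameters = dict(parameters)

    def provide(self, first_audio_object):
        # Derive the second audio object from the first audio object
        # and a snapshot of the current parameter values.
        return {"source": first_audio_object, "params": dict(self.parameters)}

    def on_modify_instruction(self, first_audio_object, changes):
        # In response to an instruction, modify the parameter(s) and
        # provide a modified second audio object.
        self.parameters.update(changes)
        return self.provide(first_audio_object)

provider = AudioObjectProvider({"method": "literal", "max_delay_s": 0.5})
second = provider.provide("utterance-1")
modified = provider.on_modify_instruction("utterance-1", {"method": "semantic"})
```

The snapshot copy in `provide` keeps the already-rendered second audio object unchanged while the modified one reflects the new parameter value.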
BRIEF DESCRIPTION OF THE DRAWINGS
Some example embodiments will now be described with reference to the accompanying drawings:
Figure 1 shows a block diagram of an example apparatus in which examples of the disclosed embodiments may be applied;
Figure 2 shows a block diagram of another example apparatus in which examples of the disclosed embodiments may be applied;
Figure 3 illustrates an example system in which examples of the disclosed embodiments may be applied;
Figure 4 illustrates an example of real-time language translation;
Figure 5 shows an example of a user providing feedback on an audio object;
Figures 6A, 6B and 6C illustrate different examples of user interactions;
Figure 7 illustrates an example relating to an instruction to modify an audio object; and
Figure 8 illustrates an example method.
DETAILED DESCRIPTION
The following embodiments are exemplifying. Although the specification may refer to "an", "one", or "some" embodiment(s) in several locations of the text, this does not necessarily mean that each reference is made to the same embodiment(s), or that a particular feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
Example embodiments relate to an apparatus configured to receive a first audio object, provide a second audio object based at least on the first audio object and at least one parameter, cause rendering at least the second audio object, receive an instruction to modify the second audio object, in response to receiving the instruction, modify the at least one parameter, and provide a modified second audio object based at least on the first audio object and the at least one modified parameter.
Some example embodiments further relate to automatic language translation.
Automatic translation comprises using a computer program to translate a first natural language into a second natural language. A natural language comprises a human language. Automatic language translation may comprise, for example, receiving a spoken utterance in a first language, recognizing one or more words, evaluating the meaning of a sentence and creating a corresponding translation in a second language. The first and/or the second language may be given, or the language to be translated may be recognized automatically by, for example, different kinds of statistical approaches, artificial intelligence (AI) such as deep neural networks (DNNs), or a combination thereof. Automatic language translation may also utilize, for example, speech-to-text (STT) and/or text-to-speech (TTS) techniques.
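The STT, translation and TTS stages described above can be composed into a simple pipeline. The sketch below is purely illustrative: the three functions are placeholders standing in for real speech-to-text, machine-translation and text-to-speech components, and the dictionary-based "translation" is an assumption for demonstration only.

```python
# Hypothetical three-stage pipeline; each function is a stand-in for a
# real STT, machine-translation or TTS component.
def speech_to_text(audio, language):
    # Placeholder STT: assume the audio carries its own transcript.
    return audio["transcript"]

def translate_text(text, source_lang, target_lang):
    # Placeholder MT: a real translation engine would go here.
    lookup = {("fi", "en"): {"hei": "hello"}}
    return lookup.get((source_lang, target_lang), {}).get(text, text)

def text_to_speech(text, language):
    # Placeholder TTS: return a synthetic "audio object" description.
    return {"transcript": text, "language": language}

def translate_utterance(audio, source_lang, target_lang):
    text = speech_to_text(audio, source_lang)
    translated = translate_text(text, source_lang, target_lang)
    return text_to_speech(translated, target_lang)

result = translate_utterance({"transcript": "hei"}, "fi", "en")
```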
Automatic language translation may comprise cloud computing. Cloud computing comprises on-demand availability of computer system resources, such as data storage and computing power, without active management of the system by a user. In an example, cloud computing comprises data centres available to users over the internet.
In cloud based automatic language translation, audio such as a spoken utterance may be transmitted to an application or a service implemented on the cloud. Before transmitting audio to the cloud, the data may be compressed using an audio or speech codec. Compression may comprise, for example, removing redundant information and/or removing irrelevant information such as perceptually insignificant information. A codec may be a device or an apparatus comprising a computer program that compresses data for transmission and decompresses received data.
An audio codec is a codec that is configured to encode and/or decode audio signals. An audio codec may comprise, for example, a speech codec that is configured to encode and/or decode speech signals. In practice, an audio codec comprises a computer program implementing an algorithm that compresses and decompresses digital audio data. For transmission purposes, the aim of the algorithm is to represent a high-fidelity audio signal with a minimum number of bits while retaining quality. In that way, the storage space and bandwidth required for transmission of an audio file may be reduced.
There are different kinds of audio/speech codecs, for example, an enhanced voice services (EVS) codec suitable for improved telephony and teleconferencing, audiovisual conferencing services and streaming audio. Another example codec is an immersive voice and audio services (IVAS) codec. An aim of the IVAS codec is to provide support for real-time conversational spatial voice, multi-stream teleconferencing, virtual reality (VR) conversational communications and/or user-generated live and on-demand content streaming. Conversational communication may comprise, for example, real-time two-way audio between a plurality of users. An IVAS codec provides support for, for example, everything from mono to stereo to fully immersive audio encoding, decoding and/or rendering. An immersive service may comprise, for example, immersive voice and audio for virtual reality (VR) or augmented reality (AR), and a codec may be configured to handle encoding, decoding and rendering of speech, music and generic audio. A codec may also support channel-based audio, object-based audio and/or scene-based audio.
Channel-based audio may, for example, comprise creating a soundtrack by recording a separate audio track (channel) for each loudspeaker or panning and mixing selected audio tracks between at least two loudspeaker channels. Common loudspeaker arrangements for channel-based surround sound systems are 5.1 and 7.1, which utilize five and seven surround channels, respectively, and one low-frequency channel. A drawback of channel-based audio is that each soundtrack is created for a specific loudspeaker configuration such as 2.0 (stereo), 5.1 and 7.1.
Object-based audio addresses this drawback by representing an audio field as a plurality of separate audio objects, each audio object comprising one or more audio signals and associated metadata. An audio object may be associated with metadata that defines a location or trajectory of that object in the audio field. Object-based audio rendering comprises rendering audio objects into loudspeaker signals to reproduce the audio field. As well as specifying the location and/or movement of an object, the metadata may also define the type of object, for example, acoustic characteristics of an object, and/or the class of renderer that is to be used to render the object. For example, an object may be identified as being a diffuse object or a point source object. Object-based renderers may use the positional metadata with a rendering algorithm specific to the particular object type to direct sound objects based on knowledge of loudspeaker positions of a loudspeaker configuration.
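The "signals plus metadata" structure described above can be represented compactly. The sketch below is an assumed, minimal representation — the field names (`azimuth_deg`, `object_type`, and so on) are illustrative and not defined by the patent or by any particular codec.

```python
from dataclasses import dataclass, field

# Hypothetical audio-object representation: one or more audio signals
# together with metadata describing position and object type.
@dataclass
class AudioObject:
    signals: list                          # e.g. lists of PCM sample frames
    metadata: dict = field(default_factory=dict)

obj = AudioObject(
    signals=[[0.0, 0.1, -0.1]],
    metadata={
        "azimuth_deg": 30.0,       # position in the audio field
        "elevation_deg": 0.0,
        "object_type": "point_source",  # hints the renderer class to use
    },
)
```

A renderer could inspect `metadata["object_type"]` to pick a point-source or diffuse rendering algorithm, as described in the paragraph above.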
Scene-based audio combines the advantages of object-based and channel-based audio and is suitable for enabling a truly immersive VR audio experience. Scene-based audio comprises encoding and representing three-dimensional (3D) sound fields for a fixed point in space. Scene-based audio may comprise, for example, ambisonics and parametric immersive audio. Ambisonics comprises a full-sphere surround sound format that, in addition to a horizontal plane, comprises sound sources above and below a listener. Ambisonics may comprise, for example, first-order ambisonics (FOA) comprising four channels or higher-order ambisonics (HOA) comprising more than four channels such as 9, 16, 25, 36, or 49 channels. Parametric immersive audio may comprise, for example, metadata-assisted spatial audio (MASA).
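The channel counts listed above follow directly from the ambisonic order: a full 3D ambisonic representation of order n uses (n + 1)² channels, which gives 4 channels for FOA and 9, 16, 25, 36 and 49 for orders 2 through 6.

```python
# Channel count for full-sphere (3D) ambisonics of a given order.
def ambisonic_channels(order):
    return (order + 1) ** 2

# Orders 1..6 yield the channel counts mentioned in the text.
counts = [ambisonic_channels(n) for n in range(1, 7)]
```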
Object-based audio may be utilized in conversational voice services. Conversational voice services may include, for example, a voice service comprising a plurality of audio objects. The plurality of audio objects may relate to different, independent sound sources, sound sources that are interrelated, or a combination thereof. Conversational voice services may also involve translating a first language into a second language.
Cloud-based automatic language translation together with one or more suitable codecs enables real-time language translation. Real-time language translation may be needed in different situations. For example, assume a car rental customer travelling in a foreign country has an issue with their rental car. The customer may place a voice call via a dedicated customer service number to solve the issue. The customer service number may be, for example, an audio translation service that is routed to a data centre for carrying out a language translation of the received audio. In such a case, the translation may be carried out such that first a first user talks, a second user listens to a translation of what the first user said, and then the second user responds. The response of the second user is then translated and provided to the first user. This procedure is referred to as a sequential voice experience. A problem with this kind of procedure is that things happen sequentially, which takes time. Another problem is that the natural flow of discussion is constantly interrupted by the translations.
Real-time language translation may be provided locally in a user terminal such as a mobile computing device, using a real-time translation service in a network or an external real-time translation service.
For example, speech of a first user may be translated locally in a user terminal and then encoded into a bitstream that is transmitted to a second user. A decoder in the second user's mobile computing device is configured to decode the received bitstream and render it for the second user.
As another example, translation may be provided using network translation such that speech of a first user is encoded in the terminal device and transmitted to a network entity for translation. The network entity decodes the incoming speech signals, translates the speech signals, encodes at least the translation and transmits at least the encoded translation to a second user. As a further example, the network entity may repackage the original encoded speech signal or provide a new encoding of the decoded speech signal, where the encoding may further comprise the translated speech signal.
Some example embodiments relate to spatial conversational voice services.
Spatial conversational voice services comprise spatial audio presented in a conversational context. A spatial conversational voice service may comprise, for example, a conference call and/or translating speech of a user from a first language into a second language. Spatial audio may comprise full-sphere surround sound to mimic the way people perceive audio in real life. Spatial audio may comprise audio that appears, from a user's position, to be assigned to a certain direction and/or distance. Therefore, the perceived audio may change with the movement of the user or with the user turning. Spatial audio may comprise audio created by sound sources, ambient audio or a combination thereof. Ambient audio may comprise audio that might not be identifiable in terms of a sound source, such as traffic humming, wind or waves, for example. The full-sphere surround sound may comprise a spatial audio field, and the position of the user or the position of the capturing device may be considered a reference point in the spatial audio field. According to an example embodiment, a reference point comprises the centre of the audio field.
Utilizing spatial audio in real-time translation enables providing, for example, a plurality of different audio objects for a user in different directions with respect to a reference point in a spatial audio field. For example, using spatial audio may enable providing an original audio object that appears from a user's position to be assigned to the right of the user and a translated audio object that appears from a user's position to be assigned to the left of the user.
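The left/right placement just described can be expressed as azimuth metadata on the two audio objects. In the sketch below the convention (0° = front, positive azimuth = to the listener's left) and the dictionary layout are assumptions chosen for illustration.

```python
# Sketch: render the original speech to the right of the listener and the
# translation to the left by setting azimuth metadata on each object.
# Convention assumed here: 0 deg = front, positive = listener's left.
def place_pair(original, translation):
    original["metadata"]["azimuth_deg"] = -90.0   # right of the listener
    translation["metadata"]["azimuth_deg"] = 90.0  # left of the listener
    return original, translation

orig = {"signals": [], "metadata": {}}
trans = {"signals": [], "metadata": {}}
place_pair(orig, trans)
```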
However, a problem with translations may be that there are different kinds of translation engines. For example, different translation engines may use different sizes of language segments to be translated and, hence, different translation engines may require different amounts of time for providing a translation. A language segment may comprise a speech segment or a text segment. The quality of a translation may depend on the size of the speech segment selected for translation. The quality of a translation may relate to how accurate the translation is or how well a user understands it. For example, in order to provide a high-quality translation, a language segment with a plurality of words or sentences may be needed. A high-quality translation may take into account the context of words and/or sentences rather than mechanically translating word for word. On the other hand, in some cases a lower-quality translation may be enough for a user, and in such a case there is no need to spend too much time on the translation. In current systems, the user's needs are not considered. Further, machine processing causes unnatural pauses, thereby making it challenging to carry on a smooth conversation.
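The segment-size/latency trade-off described above can be sketched as a set of translation profiles from which an engine is selected to fit the time a user is willing to wait. The profile names, segment sizes and delay figures below are illustrative assumptions, not values from the patent.

```python
# Illustrative quality/latency profiles: a larger segment (more words of
# context) is assumed to give a better translation but a longer delay.
PROFILES = {
    "quick":    {"segment_words": 3,  "typical_delay_s": 0.5},
    "balanced": {"segment_words": 10, "typical_delay_s": 2.0},
    "accurate": {"segment_words": 30, "typical_delay_s": 6.0},
}

def pick_profile(max_delay_s):
    # Choose the largest segment whose typical delay fits the user's limit;
    # fall back to the quickest profile if nothing fits.
    fitting = [p for p in PROFILES.values()
               if p["typical_delay_s"] <= max_delay_s]
    return max(fitting, key=lambda p: p["segment_words"]) if fitting \
        else PROFILES["quick"]

choice = pick_profile(3.0)   # user tolerates up to 3 s of delay
```

Modifying the delay limit in response to a user instruction would change the profile, which is one way the "at least one parameter" of the claims could be realized.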
Figure 1 is a block diagram depicting an apparatus 100 operating in accordance with an example embodiment of the invention. The apparatus 100 may be, for example, an electronic device such as a chip or a chipset. The apparatus 100 comprises control circuitry, such as at least one processor 110, and at least one memory 160 including one or more algorithms such as computer program code 120, wherein the at least one memory 160 and the computer program instructions are configured, with the at least one processor 110, to cause the apparatus to carry out any of the example functionalities described below.
In the example of Figure 1, the processor 110 is a control unit operatively connected to read from and write to the memory 160. The processor 110 may also be configured to receive control signals via an input interface and/or to output control signals via an output interface. In an example embodiment, the processor 110 may be configured to convert the received control signals into appropriate commands for controlling functionalities of the apparatus.
The at least one memory 160 stores computer program instructions 120 which, when loaded into the processor 110, control the operation of the apparatus 100 as explained below. In other examples, the apparatus 100 may comprise more than one memory 160 or different kinds of storage devices.
Computer program instructions 120 for enabling implementations of example embodiments of the invention or a part of such computer program instructions may be loaded onto the apparatus 100 by the manufacturer of the apparatus 100, by a user of the apparatus 100, or by the apparatus 100 itself based on a download program, or the instructions can be pushed to the apparatus 100 by an external device. The computer program instructions may arrive at the apparatus 100 via an electromagnetic carrier signal or be copied from a physical entity such as a computer program product, a memory device or a record medium such as a Compact Disc (CD), a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disk (DVD) or a Blu-ray disk.
In the example embodiment of Figure 2, the apparatus 200 is illustrated as comprising the apparatus 100 and a network interface 220 for communicating with one or more devices such as a cloud computer. For example, the apparatus 200 may be configured to receive and/or transmit an audio stream from a device via the network interface. The apparatus 200 of the example of Figure 2 may also be configured to establish radio communication with another device using, for example, a cellular network, Bluetooth or WiFi connection.
According to an example embodiment, the apparatus 200 comprises an audio codec comprising a decoder for decompressing received data such as an audio stream and/or an encoder for compressing data for transmission. Received audio data may comprise, for example, an encoded bitstream comprising binary bits of information that may be transferred from one device to another. According to an example embodiment, the audio codec comprises a speech codec.
Different audio codecs may have different bit rates. A bit rate refers to the number of bits that are processed or transmitted over a unit of time. Typically, a bit rate is expressed as a number of bits or kilobits per second (e.g. kbps or kbits/second). A bit rate may comprise a constant bit rate (CBR) or a variable bit rate (VBR). CBR files allocate a constant amount of data for each time segment, while VBR files allow a higher bit rate, that is, more storage space, to be allocated to the more complex segments of media files and a lower bit rate, that is, less storage space, to be allocated to the less complex segments. Discontinuous transmission (DTX) may be used in combination with CBR or VBR operation. In DTX operation, parameters may be updated selectively to describe, for example, a background noise level and/or spectral noise characteristics during inactive periods such as silence, whereas regular encoding may be used during active periods such as speech.
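Under CBR, transmitted data scales linearly with time, which makes sizing straightforward. The sketch below uses 13 200 bit/s purely as an example figure for a speech-codec mode; it is not a rate mandated by the text above.

```python
# CBR sizing: at a constant bit rate, the number of transmitted bits is
# simply rate x duration. 13 200 bit/s is used here as an example figure.
def cbr_bits(bit_rate_bps, seconds):
    return bit_rate_bps * seconds

bits = cbr_bits(13_200, 60)      # one minute of speech
kilobytes = bits / 8 / 1000      # convert bits -> kilobytes
```

A VBR or DTX stream has no such closed-form size: inactive periods carry only occasional comfort-noise parameter updates, so the total depends on the signal.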
In some example embodiments, the apparatus 200 may further comprise one or more microphones, a plurality of loudspeakers and a user interface for interacting with the apparatus 200 such as a mobile computing device. The apparatus 200 may also comprise a display configured to act as a user interface. For example, the display may be a touch screen display. In an example embodiment, the display and/or the user interface may be external to the apparatus, but in communication with it.
Additionally or alternatively, the user interface may also comprise a manually operable control such as a button, a key, a touch pad, a joystick, a stylus, a pen, a roller, a rocker, a keypad, a keyboard or any suitable input mechanism for inputting and/or accessing information. Further examples include a camera, a speech recognition system, an eye movement recognition system, and acceleration-, tilt- and/or movement-based input systems. Therefore, the apparatus 200 may also comprise different kinds of sensors such as one or more gyro sensors, accelerometers, magnetometers, position sensors and/or tilt sensors.
The apparatus 200 may comprise or be in communication with an augmented reality (AR) device. An AR device comprises a device configured to present AR content for a user. For example, an AR device may comprise a wearable AR device such as AR glasses, a head-mounted AR device such as a visor, or a handheld device such as a smartphone.
The apparatus 200 may comprise a mobile computing device, a cloud computer or a separate entity acting as a broker between a first user and a second user. According to an example embodiment, the apparatus 200 is configured to receive a first audio object. According to an example embodiment, the first audio object comprises a spatial audio object.
According to an example embodiment, the first audio object comprises an audio stream. An audio stream may comprise a live audio stream such as real-time two-way communications or teleconferencing communications, or an on-demand audio stream. An audio stream may comprise real-time audio delivered through a network connection. Audio may be streamed together with other types of media streaming or audio may be streamed as a part of other types of media streaming such as video streaming. The audio stream may comprise, for example, audio data representing speech during a voice or video call.
According to an example embodiment the audio stream comprises a speech segment comprising at least one speech signal.
According to an example embodiment an audio object comprises audio data associated with metadata. Metadata associated with an audio object provides information on the audio data. Information on the audio data may comprise, for example, one or more properties of the audio data, one or more characteristics of the audio data and/or identification information relating to the audio data. For example, metadata may provide information on a position associated with the audio data in a spatial audio field, movement of the audio object in the spatial audio field and/or a function of the audio data. Without limiting the scope of the claims, an advantage of an audio object is that metadata may be associated with audio signals such that the audio signals may be reproduced by defining their position in a spatial audio field.
According to an example embodiment, the first audio object comprises an audio stream in a first language. The first language may comprise a natural language. A natural language comprises a human language such as English, French, German, Chinese or Finnish.
According to an example embodiment, the apparatus 200 is further configured to provide a second audio object based at least on the first audio object and at least one parameter. Providing a second audio object may comprise providing a second audio object based on at least one parameter value or in dependence upon a parameter value.
According to an example embodiment, the second audio object comprises one or more audio signals. According to an example embodiment, the second audio object comprises a spatial audio object. The second audio object may be associated with metadata that defines a location and/or trajectory of the second audio object in a spatial audio field. The metadata may be associated with the second audio object by the apparatus 200 or a device with which the apparatus 200 is configured to communicate.
According to another example embodiment, the second audio object comprises one or more audio signals and associated metadata that defines a location and/or trajectory of the second audio object in a spatial audio field.
According to an example embodiment, the first audio object and/or the second audio object may be modified in response to a user input. Modifying the first audio object and/or the second audio object may comprise modifying the content of the first audio object and/or the second audio object, one or more properties of the first audio object and/or the second audio object, or modifying metadata associated with the first audio object and/or the second audio object.
For example, assuming the first audio object comprises original speech and the second audio object comprises a translation of the original speech, the location of the rendered translation may be moved from the left side of the user to the right side of the user in response to a user input. As another example, the volume of the original speech and/or the volume of the translation may be increased or decreased in response to a user input.
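The two modifications in this example — moving a rendered translation across the audio field and changing its volume — amount to edits of the object's metadata. The sketch below assumes the same illustrative metadata keys as before (`azimuth_deg`, plus a hypothetical `gain_db` field for volume).

```python
# Sketch of metadata modifications in response to user input.
# Keys are illustrative assumptions, not defined by the patent.
def move(audio_object, azimuth_deg):
    # Reposition the object in the spatial audio field.
    audio_object["metadata"]["azimuth_deg"] = azimuth_deg
    return audio_object

def change_gain(audio_object, delta_db):
    # Adjust rendering volume via an accumulated gain field.
    md = audio_object["metadata"]
    md["gain_db"] = md.get("gain_db", 0.0) + delta_db
    return audio_object

translation = {"metadata": {"azimuth_deg": 90.0}}  # rendered on the left
move(translation, -90.0)      # user moves it to their right
change_gain(translation, 6.0) # user turns the translation up
```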
Providing a second audio object may comprise different kinds of operations.
For example, providing a second audio object may comprise generating the second audio object, providing the second audio object by modifying the first audio object, providing the second audio object by translating the contents of the first audio object, associating metadata with the second audio object and/or the like. Providing a second audio object may also comprise encoding and/or decoding audio data.
The apparatus 200 may be configured to receive information on one or more settings relating to the first audio object and the second audio object. For example, the apparatus 200 may be configured to receive information on a language setting relating to the first audio object and/or information on a language setting relating to the second audio object. A language setting relating to an audio object may comprise information on a language setting associated with a device from which the audio object is received, or a language setting associated with a device to which the audio object is provided. According to an example embodiment, the apparatus 200 is configured to provide the second audio object based on a language setting relating to the first audio object and/or a language setting relating to the second audio object.
According to an example embodiment, providing the second audio object comprises receiving a translation of the contents of the first audio object. A translation of the contents of the first audio object may be received, for example, from a network device.
According to an example embodiment, providing the second audio object comprises translating the first audio object from a first language into a second language based on the at least one parameter. The second language may comprise a natural language. A natural language comprises a human language such as English, French, German, Chinese or Finnish.
A translation may comprise a machine translation performed by a computer program. The machine translation may be configured to utilize, for example, one or more corpora, neural networks and/or different kinds of algorithms. Translating the received audio data from a first language into a second language may comprise translating one language segment at a time. A language segment may comprise a translation unit such as one or more words, one or more sentences, one or more clauses or any other suitable translation unit. A translation unit may be a linguistic unit or a user defined unit.
According to an example embodiment, the apparatus 200 is configured to select a translation method based on a size of a language segment. For example, if a language segment comprises a plurality of words, the apparatus 200 may be configured to select a translation method capable of providing a context sensitive translation. As another example, if a language segment comprises a single word, the apparatus 200 may be configured to select a word for word translation method.
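By way of illustration only, the segment-size-based selection described above could be sketched as follows. The function and method names are hypothetical and do not form part of the specification; this is a minimal sketch, assuming that a segment with more than one word warrants a context sensitive method.

```python
def select_translation_method(segment: str) -> str:
    """Select a translation method based on the size of a language segment.

    A single-word segment maps to a word for word translation; a segment
    comprising a plurality of words maps to a context sensitive translation.
    The method identifiers are illustrative placeholders.
    """
    words = segment.split()
    if len(words) <= 1:
        return "word_for_word"
    return "context_sensitive"
```

In practice, the boundary between segment sizes (and the set of available methods) would be a configurable parameter of the apparatus rather than a fixed rule.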
According to an example embodiment, the apparatus 200 is configured to select a translation method based on recipient information. Recipient information may comprise information on a receiving user such as the identity of the receiving user. For example, a translation method may be selected based on language skills of the receiving user, social status of the receiving user or the like.
The apparatus 200 may be configured to provide different kinds of translation methods. The apparatus 200 may be configured to provide different translation methods utilizing artificial intelligence (AI) such as neural networks. A translation method may comprise, for example, a word for word translation, literal translation, faithful translation, semantic translation, communicative translation, idiomatic translation, free translation or adaptation. A word for word translation comprises translating a first language into a second language such that the word order of the first language is preserved, and words are translated by their most common meaning. A literal translation comprises translating a first language into a second language such that grammatical constructions of the first language are converted to their nearest equivalents in the second language. A faithful translation comprises translating a first language into a second language such that the aim is to reproduce the precise contextual meaning of the first language. A semantic translation comprises translating a first language into a second language such that semantic information is used to aid the translation. A communicative translation comprises translating a first language into a second language such that the aim is to render the exact contextual meaning of the first language such that the language and content are acceptable. An idiomatic translation comprises translating a first language into a second language such that the structure of the second language is followed while communicating the exact message of the first language. A free translation comprises translating a first language into a second language such that the message of the first language may be augmented or distorted. An adaptation comprises translating a first language into a second language such that the culture of the first language is converted to the culture of the second language.
The at least one parameter may affect provision of the second object based on the first object in different ways. For example, the at least one parameter may define a method for providing the second audio object based on the first audio object. As another example, the at least one parameter may define at least one criterion for providing the second audio object based on the first audio object. The at least one criterion may comprise, for example, a period of time within which provision of the second object is to be performed. As a further example, the at least one parameter may define the size of a language segment to be translated. A size of a language segment may comprise a number of linguistic units to be translated.
According to another example embodiment the at least one parameter comprises a parameter corresponding to a translation method. For example, different parameter values may be associated with different translation methods. A first parameter value may correspond to a first method and a second parameter value may correspond to a second method.
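The association of parameter values with translation methods could be sketched, purely for illustration, as a lookup table. The numeric values and the fallback behaviour are assumptions, not part of the specification.

```python
# Hypothetical mapping of parameter values to translation methods: a first
# parameter value corresponds to a first method, a second value to a second
# method, and so on. The names follow the methods listed in the description.
TRANSLATION_METHODS = {
    0: "word_for_word",
    1: "literal",
    2: "faithful",
    3: "semantic",
    4: "communicative",
    5: "idiomatic",
    6: "free",
    7: "adaptation",
}

def method_for_parameter(value: int) -> str:
    """Return the translation method associated with a parameter value.

    Unknown values fall back to word for word translation (an assumed
    default, chosen here only to keep the sketch total).
    """
    return TRANSLATION_METHODS.get(value, "word_for_word")
```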
According to an example embodiment, the at least one parameter comprises a parameter corresponding to a default translation method. The default translation method may be different for different users. According to an example embodiment, a default translation method is associated with the identity of a receiving and/or transmitting user. For example, a default translation method may be associated with the identity of a caller and/or a person receiving a voice/video call. For example, assuming the person receiving a voice/video call knows the caller, the default translation method may comprise a translation method of lower quality than when the person is receiving a voice/video call from an unknown party. When receiving a voice/video call from an unknown party, a high-quality translation may be desired by default. According to an example embodiment, the translation method comprises at least one of the following: a word for word translation, literal translation, faithful translation, semantic translation, communicative translation, idiomatic translation, free translation or adaptation.
According to an example embodiment, the at least one parameter comprises a parameter corresponding to a period of time for translating the first audio object from a first language into a second language. According to an example embodiment, a parameter corresponding to a period of time comprises a duration of the period of time. The duration may comprise a duration in seconds, tenths of seconds or the like. For example, a shorter period of time may correspond to a word for word translation and a longer period of time may correspond to a translation that takes into account, for example, context.
A parameter corresponding to a period of time for translating the first audio object from a first language into a second language may comprise a frequency of providing a translation or a period of time allowed for providing a translation. For example, a parameter corresponding to a period of time for translating the first audio object from a first language into a second language may define that a translation is to be provided every 5 seconds or that 5 seconds is allowed for providing the translation.
According to an example embodiment, a period of time for translating the first audio object from a first language into a second language corresponds to a size of a language segment to be translated. A language segment may comprise, for example, one or more words, one or more sentences, one or more clauses or any other suitable translation units. For example, if a duration of a period of time for translating the first audio object from a first language into a second language is below a first threshold value, a word for word translation may be selected. On the other hand, if a duration of a period of time for translating the first audio object from a first language into a second language is above a second threshold value, a translation taking into account the context may be selected.
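The threshold logic described above could be sketched as follows. The threshold values and the "default" middle case are assumptions made only to produce a complete example; the specification does not fix them.

```python
# Assumed threshold values (in seconds) for illustration only.
FIRST_THRESHOLD_S = 1.0   # below this duration: word for word translation
SECOND_THRESHOLD_S = 3.0  # above this duration: context-aware translation

def method_for_duration(duration_s: float) -> str:
    """Select a translation method from the allowed translation period.

    A short period leaves time only for a word for word translation; a long
    period permits a translation that takes context into account. Durations
    between the two thresholds use an assumed default method.
    """
    if duration_s < FIRST_THRESHOLD_S:
        return "word_for_word"
    if duration_s > SECOND_THRESHOLD_S:
        return "context_aware"
    return "default"
```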
According to an example embodiment, the at least one parameter comprises a size of a language segment to be translated. A size of a language segment may comprise, for example, the number of linguistic units to be translated. According to an example embodiment the at least one parameter comprises a parameter corresponding to a number of linguistic units to be translated. A linguistic unit may comprise, for example, a syllable, a word, a phoneme, a letter or a sentence.
The at least one parameter may comprise a default parameter, a user defined parameter or a pre-determined parameter.
According to an example embodiment the at least one parameter comprises a parameter corresponding to a delay between a start of the first audio object and a start of the second audio object. For example, assuming the first audio object comprises a speech segment and the second audio object comprises a translation of the speech segment, the at least one parameter corresponds to the delay between the start of the speech segment and the start of the translation of the speech segment.
According to an example embodiment the apparatus 200 is configured to select a translation method based on the at least one parameter.
According to an example embodiment the apparatus 200 is further configured to cause rendering at least the second audio object. According to an example embodiment, rendering at least the second audio object comprises object-based audio rendering.
Object-based audio rendering comprises rendering one or more audio objects into loudspeaker signals to reproduce a spatial audio field.
According to an example embodiment, the apparatus 200 is configured to cause rendering at least the second audio object by causing output of the second audio object via a plurality of loudspeakers. According to another example embodiment the apparatus 200 is configured to cause rendering at least the second audio object by transmitting the second audio object to another device for rendering.
The apparatus 200 may be configured to cause rendering of a plurality of audio objects. According to an example embodiment, the apparatus 200 is configured to cause rendering of the first audio object and the second audio object.
According to an example embodiment the apparatus 200 is configured to further cause rendering the first audio object. In other words, a person may concurrently hear the first audio object and the second audio object that appear to come, for example, from different directions.
Without limiting the scope of the claims, an advantage of rendering both the first audio object and the second audio object may be that a user, such as a receiving user and/or a transmitting user, may compare the first audio object and the second audio object. Another advantage may be that when a transmitting user hears the translation that is also provided to a receiving user, the transmitting user is aware that the translation is still ongoing, and the receiving user may be listening to it.
The apparatus 200 may be configured to receive feedback on the second audio object. The feedback may relate to, for example, the quality of the second audio object or a desired quality of the second audio object. According to an example embodiment, the apparatus 200 is configured to receive an instruction to modify the second audio object. The instruction may comprise information relating to the modification of the second audio object. For example, the instruction may comprise information on a type of the modification of the second object.
According to an example embodiment, the apparatus 200 is configured to receive information on a characteristic of the instruction and provide the modified second audio object based on the characteristic of the instruction. For example, if the instruction is a gesture such as a touch gesture, one or more characteristics of the gesture may be used for determining the type of the modification of the second object. Assuming the gesture comprises a drag gesture on a touch screen, the apparatus 200 may be configured to receive information on a length of the gesture and determine a desired parameter value for modifying the second object. As another example, assuming the gesture comprises a hold gesture such as a touch and hold gesture, the apparatus 200 may be configured to receive information on a duration of the gesture and determine a desired parameter value for modifying the second object. As another example, assuming the gesture comprises a force touch gesture, the apparatus 200 may be configured to receive information on the force exerted on the touch display, or the device, and determine a desired parameter value for modifying the second object. As another example, the instruction may be a voice input, such as a phrase recognized by the apparatus 200, and the apparatus 200 may be configured to receive the voice input and determine a desired parameter value for modifying the second object. According to an example embodiment, the apparatus is configured to determine a modification of the at least one parameter based on at least one characteristic of the instruction to modify the second audio object. According to an example embodiment, the apparatus 200 is configured to receive the instruction to modify the second audio object concurrently with causing rendering of at least the second audio object.
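The mapping from a characteristic of the instruction to a desired parameter value could be sketched as below. The gesture kinds and per-kind scale factors are hypothetical; the specification only requires that the parameter value be relative to the characteristic (length, duration or force).

```python
def parameter_from_gesture(kind: str, magnitude: float) -> float:
    """Derive a desired parameter value from a gesture characteristic.

    `magnitude` is the measured characteristic: drag length in pixels, hold
    duration in seconds, or exerted force in arbitrary units. The scale
    factors are assumed values chosen only for illustration.
    """
    scales = {
        "drag": 0.01,   # pixels of drag per unit of parameter change
        "hold": 1.0,    # seconds of hold map directly to the parameter
        "force": 2.0,   # stronger press -> proportionally larger change
    }
    if kind not in scales:
        raise ValueError(f"unknown gesture kind: {kind}")
    return magnitude * scales[kind]
```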
The apparatus 200 may be configured to cancel, in response to receiving the instruction, causing rendering at least the second audio object. Alternatively, the apparatus 200 may be configured to continue, despite receiving the instruction, rendering at least the second audio object and concurrently provide a modified second audio object based on the first audio object. Without limiting the scope of the claims, an advantage of receiving an instruction to modify the second audio object concurrently with causing rendering of at least the second audio object may be that real-time feedback on the second audio object may be provided.
According to an example embodiment, the instruction is received from a user. According to an example embodiment, the user comprises a receiving user or a transmitting user. A receiving user may be, for example, a user for whom the provided second audio object is rendered. A receiving user may control the provision of the second audio object by providing feedback on the second audio object. For example, a receiving user may consider that the quality of the second audio object is not sufficient and provide feedback on the quality of the second audio object. Assuming the second audio object comprises a translation of the contents of the first audio object and the receiving user considers that the quality of the translation is not good enough, the receiving user may request a better translation. On the other hand, if the receiving user thinks that it is more important to receive the translation fast than with a good quality, the receiving user may instruct the apparatus 200 to provide a less accurate translation.
A transmitting user may be, for example, a user from whom the first audio object is received. A transmitting user may also control the provision of the second audio object. For example, assume the second audio object comprises a translation of the contents of the first audio object comprising a first portion and a second portion, where the first portion comprises less complex vocabulary and the second portion comprises more complex vocabulary. The transmitting user may control provision of the second audio object such that the transmitting user instructs the apparatus 200 to, for example, provide a less accurate translation for the first portion of the first audio object and a more accurate translation for the second portion of the first audio object.
According to an example embodiment, the apparatus 200 is further configured to modify, in response to receiving the instruction, the at least one parameter. Modifying a parameter may comprise modifying a parameter value, replacing a first parameter with a second parameter, replacing a first parameter value with a second parameter value, increasing or decreasing a parameter value, setting a criterion for applying a parameter, or the like.
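The parameter modifications listed above could be sketched as a small helper. The parameter names and the set of operations are illustrative; the specification does not prescribe a particular representation of the at least one parameter.

```python
def modify_parameter(params: dict, name: str, op: str, value=None) -> dict:
    """Return a copy of `params` with one parameter modified.

    Supported operations mirror the description: replacing a parameter
    value, or increasing/decreasing it. The input dictionary is left
    unchanged so that the unmodified parameters remain available.
    """
    updated = dict(params)
    if op == "replace":
        updated[name] = value
    elif op == "increase":
        updated[name] = updated.get(name, 0) + value
    elif op == "decrease":
        updated[name] = updated.get(name, 0) - value
    else:
        raise ValueError(f"unknown operation: {op}")
    return updated
```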
According to an example embodiment, a modified parameter is stored as a default parameter. For example, assuming a user receives a voice/video call from a known caller such that the speech of the caller is translated for the user and the apparatus 200 receives an instruction from a user to provide a translation of better quality, the modified parameter may be stored as a default parameter for the known caller.
According to an example embodiment the apparatus 200 is further configured to provide a modified second audio object based at least on the first audio object and the at least one modified parameter. The apparatus 200 may be configured to provide the modified second audio object based on a complete first audio object or a portion of the first audio object.
Without limiting the scope of the claims an advantage of modifying, in response to receiving the instruction, the at least one parameter and providing a modified second audio object based on the first audio object and the at least one modified parameter may be that the apparatus 200 may adapt in real-time one or more processes of the apparatus 200 based on the received feedback. For example, a receiving user may provide real-time feedback based on which a more suitable translation may be provided for the receiving user. As another example, a transmitting user may control in real-time the provision of a translation such that a receiving user may concentrate on the most relevant parts of the translation.
According to an example embodiment, the apparatus 200 is configured to provide a plurality of audio objects. For example, the apparatus 200 may be configured to provide a plurality of translations of the first audio object. The apparatus 200 may be configured to provide a plurality of alternative translations or a plurality of translations in different languages.
According to an example embodiment, the apparatus 200 is configured to provide and cause rendering a plurality of independent second audio objects and enable controlling each second audio object independently. For example, an audio/video teleconference participant may receive translations of the speech of different other participants. According to an example embodiment, the apparatus 200 is configured to receive at least one instruction to modify a plurality of second audio objects and, in response to receiving the at least one instruction, modify at least one parameter associated with the plurality of second audio objects. For example, assuming person A, person B and person C participate in a teleconference: person A may receive translations of the speech of persons B and C, and provide an instruction to modify the translation of the speech of person B independently of the translation of the speech of person C. As another example, person A may provide an instruction to modify the translations of the speech of persons B and C in a similar manner.
According to an example embodiment, providing the modified second audio object comprises translating the first audio object from a first language into a second language according to the at least one modified parameter.
According to an example embodiment, the apparatus 200 comprises means for performing the features of the claimed invention, wherein the means for performing comprises at least one processor 110, at least one memory 160 including computer program code 120, the at least one memory 160 and the computer program code 120 configured to, with the at least one processor 110, cause the performance of the apparatus 200. The means for performing the features of the claimed invention may comprise means for receiving a first audio object, means for providing a second audio object based on the first audio object and at least one parameter, means for causing rendering at least the second audio object, means for receiving an instruction to modify the second audio object, means for, in response to receiving the instruction, modifying the at least one parameter, and means for providing a modified second audio object based on the first audio object and the at least one modified parameter.
The apparatus 200 may further comprise means for selecting a translation method based on the at least one parameter.
The apparatus may further comprise means for receiving information on a characteristic of the instruction and providing the modified second audio object based on the characteristic of the instruction. The means for performing the features of the claimed invention may further comprise means for providing a plurality of audio objects. The apparatus may further comprise means for receiving the instruction to modify the second audio object concurrently with causing rendering of at least the second audio object and/or means for causing rendering the first audio object.

Figure 3 illustrates a system according to an example embodiment. The system comprises a first user 301, a second user 302 and a translation service 303.
In the example of Figure 3 it is assumed that the first user 301 and the second user 302 are in a voice call and the first user 301 speaks a first language and the second user 302 speaks a second language. The translation service 303 is configured to translate the first language into the second language and the second language into the first language. The translation service may be comprised by a device used by the first user 301 for the voice call, by a device used by the second user 302 for the voice call, or by a cloud server through which the voice call is carried out.
Figure 4 illustrates an example of real-time language translation. In the example of Figure 4, it is assumed that a first user 301 using a first mobile computing device is in a voice call with a second user 302 using a second mobile computing device. The voice call may be carried out using, for example, an IVAS codec comprised by the first mobile computing device and/or the second mobile computing device. It is also assumed that the first user 301 speaks a first language and the second user 302 speaks a second language. In the example of Figure 4, the first language is translated into the second language and the second language is translated into the first language during the voice call. The voice call may take place, for example, via a translation service or any other suitable instance enabling real-time language translation. The translation may be performed, for example, based on language settings in the mobile computing devices used for the voice call. The translation may be performed, for example, based on language identification at the translation service 303, in the mobile computing device(s), or based on input from the first and/or second user.
In the example of Figure 4, the second user 302 utters an original sentence 402 in the second language during the voice call and the original sentence is translated into the first language. In the example of Figure 4, a translation 401 of the original sentence 402 is provided for the first user 301 in the first language. In addition to the translation, the original sentence 402 in the second language is also provided for the first user 301.
In the example of Figure 4, it is assumed that the translation service comprises a translation engine that is configured to use a word for word translation. However, the first user 301 may be of the opinion that the quality of the translation is not good enough and that it should be more accurate, and may wish to indicate to the translation service that the quality of the translation is not satisfactory.
Figure 5 illustrates an example of a user providing feedback on an audio object.
In the example of Figure 5, the first user 301 indicates to the translation service that the quality of the translation is not satisfactory. In the example of Figure 5 the user is provided with a first audio object 501 and a second audio object 502, for example, by the apparatus 200 during a voice call. In the example of Figure 5, the first audio object 501 comprises an original utterance and the second audio object 502 comprises a translation of the original utterance.
The first user 301 may provide feedback on the translation using a mobile computing device 510. The mobile computing device 510 comprises a user interface such as a touch screen that is configured to receive user inputs. In the example of Figure 5, graphical representations of audio objects are provided on the user interface. A first graphical representation 511 corresponds to the first audio object 501 and a second graphical representation 512 corresponds to the second audio object 502. According to an example embodiment, the positions of the graphical representations correspond to the positions of the audio objects in a spatial audio field. For example, if a first audio object is provided to the right of a user and a second audio object is provided to the left of the user, the first graphical representation of the first audio object may be provided on the right of a user interface and the second graphical representation of the second audio object may be provided on the left of the user interface, respectively.
The first user 301 may provide feedback on the translation using the corresponding graphical representation. For example, assuming that in the example of Figure 5, the second audio object 502 comprises a translation of the contents of the first audio object 501, the user may provide feedback on the translation by modifying the graphical representation 512 of the second audio object on the user interface. Modifying the graphical representation may comprise, for example, providing a touch gesture on the graphical representation. For example, the first user 301 may swipe the representation 512 to the left in order to dismiss the translation. A user may dismiss a translation, for example, if the user does not need the translation. Swiping the representation 512 to the left is illustrated with the arrow 513.
On the other hand, the first user could swipe the representation 511 to the right to dismiss the original utterance. A user may dismiss the original utterance, for example, if the user wishes to concentrate on the translation.
Figures 6A, 6B and 6C illustrate different examples of user interactions that may be used for providing feedback on the translation. In the examples of Figures 6A, 6B and 6C the user interactions comprise touch gestures performed on a touch screen of the mobile computing device 510.
Figure 6A illustrates the original positions of the first graphical representation 511 and the second graphical representation 512. The first graphical representation corresponds to a first audio object in a spatial audio field and the second graphical representation corresponds to a second audio object in the spatial audio field.
Figure 6B illustrates an example of a touch and hold gesture. A touch and hold gesture comprises a combination of a drag gesture and a hold gesture. For example, a user may drag the representation 511 upwards and hold it there. Dragging the representation 511 upwards is illustrated with the arrow 601. According to an example embodiment, the apparatus 200 is configured to receive information on one or more characteristics of the touch gesture and modify at least one parameter for providing a modified audio object.
For example, in the case of a touch and hold gesture, a duration of the hold gesture may be determined as an instruction to modify the at least one parameter and the duration of the hold gesture may be relative to the modification of the at least one parameter.
Figure 6C illustrates an example of a drag or swipe gesture. In the example of Figure 6C, a user may drag or swipe the representation 511 upwards 603 or downwards 602. According to an example embodiment, the apparatus 200 is configured to receive information on one or more characteristics of the touch gesture and modify at least one parameter for providing a modified audio object. For example, in the case of a drag gesture, a length of the drag gesture may be determined as an instruction to modify the at least one parameter and the length of the drag gesture may be relative to the modification of the at least one parameter. Dragging the representation upwards on the touch screen may indicate to the apparatus 200 that a longer period of time is needed for the translation in order to provide a translation with better quality. In response to receiving information that a longer period of time is needed for the translation, the apparatus may, for example, choose another translation method. On the other hand, dragging the representation downwards on the touch screen may indicate to the apparatus 200 that a shorter period of time is appreciated, as a faster translation is needed. In this case, a lower quality is satisfactory for the user.
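The drag interpretation described above could be sketched as follows: an upward drag lengthens the translation period (better quality), a downward drag shortens it (faster translation). The pixel-to-seconds scale factor is an assumed value for illustration.

```python
# Assumed scaling of drag length to translation period; not from the
# specification, chosen only so the example is concrete.
SECONDS_PER_PIXEL = 0.01

def adjust_translation_period(period_s: float, drag_pixels: float) -> float:
    """Adjust the translation period based on a drag gesture.

    Positive `drag_pixels` represents an upward drag (more time, better
    quality); negative represents a downward drag (less time, faster
    translation). The period is clamped so it cannot become negative.
    """
    return max(0.0, period_s + drag_pixels * SECONDS_PER_PIXEL)
```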
In another example, the gesture comprises a force touch gesture. A touch input with force exerted downwards in the Z direction at the representation 511 may be provided. The amount of force of the input exerted may be determined as an instruction to modify the at least one parameter and the amount of force may be relative to the modification of the at least one parameter. Determining that the amount of exerted force is over a first pre-determined threshold value may indicate to the apparatus 200 that a longer period of time is needed for the translation in order to provide a translation with better quality. Determining that the amount of exerted force is below a second predetermined threshold value may indicate to the apparatus 200 that a shorter period of time is appreciated, as a faster translation is needed. In this case, a lower quality is satisfactory for the user. There may be a plurality of threshold values and/or the first and second threshold values may be the same.
Figure 7 illustrates an example of informing a user that an instruction to modify an audio object is received. For example, a transmitting user may be informed that a better translation is requested. In the example of Figure 7, the first user 301 utilizes an augmented reality (AR) device 703. Augmented reality (AR) comprises an interactive experience of a real-world environment where objects that reside in the real world are enhanced by computer-generated perceptual information. An AR device may comprise, for example, AR glasses, a head-up display or the like.
In the example of Figure 7, the first user 301 provides an audio object 701 in a first language. The audio object 701 is translated from the first language into a second language for a second user. The second user may provide feedback that a better translation is needed. The first user 301 using the AR device may receive a notification 702 that a better translation is requested. The notification 702 may be presented on the AR device.
Figure 8 illustrates an example method 800 incorporating aspects of the previously disclosed embodiments. More specifically the example method 800 illustrates providing a second audio object based on a first audio object. The method may be performed by the apparatus 200 such as a network computer or a mobile computing device.
The method starts with receiving 805 a first audio object. The first audio object may comprise, for example, an audio stream in a first language. The first language may comprise a first natural language.
The method continues with providing 810 a second audio object based on the first audio object according to at least one parameter. The at least one parameter may comprise a period of time for providing the second audio object, such as a period of time for translating the first audio object from a first language into a second language, or the at least one parameter may comprise a parameter corresponding to a translation method. The second language may comprise a natural language.
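The two kinds of parameter named above (a translation period and a translation method) might be represented as a small parameter structure. The class and field names below are illustrative assumptions; only the list of translation methods is taken from the description (see claim 6).

```python
from dataclasses import dataclass

# Translation methods as enumerated in the description (claim 6).
TRANSLATION_METHODS = (
    "word_for_word", "literal", "faithful", "semantic",
    "communicative", "idiomatic", "free", "adaptation",
)


@dataclass
class TranslationParameters:
    """Assumed parameter set controlling how the second audio object
    is provided from the first audio object."""
    period_seconds: float = 2.0   # time budget for producing the translation
    method: str = "semantic"      # one of TRANSLATION_METHODS

    def __post_init__(self):
        # Reject methods outside the enumerated set.
        if self.method not in TRANSLATION_METHODS:
            raise ValueError(f"unknown translation method: {self.method}")
```

A shorter `period_seconds` would correspond to the faster, lower-quality translation requested by a light force touch; a longer one to the higher-quality translation requested by a firm press.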
The method further continues with causing 815 rendering at least the second audio object. Causing rendering at least the second audio object may comprise rendering the second audio object or transmitting the second audio object to another device for rendering.
The method further continues with receiving 820 an instruction to modify the second audio object. The instruction may be received from a user such as a receiving user or a transmitting user.
The method further continues with modifying 825, in response to receiving the instruction to modify the second audio object, the at least one parameter. Modifying the at least one parameter may comprise, for example, modifying the parameter value. The method further continues with providing 830 a modified second audio object based on the first audio object and the at least one modified parameter. Providing a modified second audio object based on the first audio object and the at least one modified parameter may comprise, for example, translating the first audio object using a different translation method.
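Steps 810 to 830 of method 800 can be sketched as a single feedback loop. The `translate` callable and the (parameter name, value) instruction format below are illustrative assumptions, not an interface defined by the description.

```python
def method_800_loop(first_audio, translate, params, instructions):
    """Sketch of steps 810-830: provide a second audio object, then
    apply each received instruction and re-provide the object.

    first_audio  -- the received first audio object (step 805)
    translate    -- assumed callable (audio, params) -> second audio object
    params       -- mutable dict of parameter name -> value
    instructions -- iterable of (parameter_name, new_value) pairs (step 820)
    """
    rendered = [translate(first_audio, dict(params))]   # steps 810-815
    for name, value in instructions:                    # step 820
        params[name] = value                            # step 825: modify parameter
        rendered.append(translate(first_audio, dict(params)))  # step 830
    return rendered
```

With a `translate` that switches translation method according to `params`, a single instruction such as `("method", "free")` causes the first audio object to be re-translated with the modified parameter, matching the example given above.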
Without limiting the scope of the claims, an advantage of enabling a user to provide feedback on a provided audio stream object is that the audio stream object may be customized according to the user's needs. Another advantage may be that a provided audio stream may be adapted based on feedback from a user. Another advantage may be that in a conversational context the interaction between the users may feel more natural for the users.
Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect of one or more of the example embodiments disclosed herein is that the amount of time spent on an audio object that does not satisfy a user's requirements may be reduced. Another technical effect is that the system may be used more efficiently, as a user may control the provision of the audio object. As used in this application, the term "circuitry" may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on the apparatus, a separate device or a plurality of devices. If desired, part of the software, application logic and/or hardware may reside on the apparatus, part of the software, application logic and/or hardware may reside on a separate device, and part of the software, application logic and/or hardware may reside on a plurality of devices. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a 'computer-readable medium' may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer, with one example of a computer described and depicted in FIGURE 2. A computer-readable medium may comprise a computer-readable storage medium that may be any media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It will be obvious to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.
Claims (17)
- 1. An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive a first audio object; provide a second audio object based at least on the first audio object and at least one parameter; cause rendering at least the second audio object; receive an instruction to modify the second audio object; in response to receiving the instruction, modify the at least one parameter; and provide a modified second audio object based at least on the first audio object and at least one modified parameter.
- 2. The apparatus according to claim 1, wherein providing the second audio object comprises translating the first audio object from a first language into a second language based on the at least one parameter.
- 3. The apparatus according to claim 2, wherein providing the modified second audio object comprises translating the first audio object from the first language into the second language based on the at least one modified parameter.
- 4. The apparatus according to claim 2 or 3, wherein the at least one parameter comprises a parameter corresponding to a period of time for translating the first audio object from the first language into the second language.
- 5. The apparatus according to any preceding claim, wherein the at least one parameter comprises a parameter corresponding to a translation method and wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to select the translation method based on the at least one parameter.
- 6. The apparatus according to claim 5, wherein the translation method comprises at least one of the following: a word for word translation, literal translation, faithful translation, semantic translation, communicative translation, idiomatic translation, free translation or adaptation.
- 7. The apparatus according to any preceding claim, wherein the first audio object and/or the second audio object comprises a spatial audio object.
- 8. The apparatus according to any preceding claim, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive information on a characteristic of the instruction and provide the modified second audio object based on the characteristic of the instruction.
- 9. The apparatus according to any preceding claim, wherein the instruction is received from a user.
- 10. The apparatus according to claim 9, wherein the user comprises a receiving user or a transmitting user.
- 11. The apparatus according to any preceding claim, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to provide a plurality of audio objects.
- 12. The apparatus according to any preceding claim, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive the instruction to modify the second audio object concurrently with causing rendering of at least the second audio object.
- 13. The apparatus according to any preceding claim, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to cause rendering the first audio object.
- 14. The apparatus according to any preceding claim, wherein the first audio object comprises an audio stream in a first language.
- 15. The apparatus according to claim 14, wherein the audio stream comprises a speech segment comprising at least one speech signal.
- 16. A method comprising: receiving a first audio object; providing a second audio object based at least on the first audio object and at least one parameter; causing rendering at least the second audio object; receiving an instruction to modify the second audio object; in response to receiving the instruction, modifying the at least one parameter; and providing a modified second audio object based at least on the first audio object and at least one modified parameter.
- 17. A computer readable medium comprising instructions for causing an apparatus to perform at least the following: receiving a first audio object; providing a second audio object based at least on the first audio object and at least one parameter; causing rendering at least the second audio object; receiving an instruction to modify the second audio object; in response to receiving the instruction, modifying the at least one parameter; and providing a modified second audio object based at least on the first audio object and the at least one modified parameter.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1918827.5A GB2590470A (en) | 2019-12-19 | 2019-12-19 | Providing an audio object |
| PCT/FI2020/050822 WO2021123495A1 (en) | 2019-12-19 | 2020-12-08 | Providing a translated audio object |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB201918827D0 GB201918827D0 (en) | 2020-02-05 |
| GB2590470A true GB2590470A (en) | 2021-06-30 |
Family
ID=69322828
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB1918827.5A Withdrawn GB2590470A (en) | 2019-12-19 | 2019-12-19 | Providing an audio object |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2590470A (en) |
| WO (1) | WO2021123495A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1928189A1 (en) * | 2006-12-01 | 2008-06-04 | Siemens Networks GmbH & Co. KG | Signalling for push-to-translate-speech (PTTS) service |
| US20110238407A1 (en) * | 2009-08-31 | 2011-09-29 | O3 Technologies, Llc | Systems and methods for speech-to-speech translation |
| US20180011842A1 (en) * | 2006-10-26 | 2018-01-11 | Facebook, Inc. | Lexicon development via shared translation database |
| WO2019040400A1 (en) * | 2017-08-21 | 2019-02-28 | Kudo, Inc. | Systems and methods for changing language during live presentation |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8775156B2 (en) * | 2010-08-05 | 2014-07-08 | Google Inc. | Translating languages in response to device motion |
| US9037458B2 (en) * | 2011-02-23 | 2015-05-19 | Qualcomm Incorporated | Systems, methods, apparatus, and computer-readable media for spatially selective audio augmentation |
| GB2582910A (en) * | 2019-04-02 | 2020-10-14 | Nokia Technologies Oy | Audio codec extension |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2021123495A1 (en) | 2021-06-24 |
| GB201918827D0 (en) | 2020-02-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |