GB2633769A

GB2633769A - Apparatus and methods

Info

Publication number: GB2633769A
Application number: GB2314333.2A
Authority: GB
Inventors: Pajunen Lauros; Shyamsundar Mate Sujeet; Pihlajakuja Tapani; Juhani Laaksonen Lasse
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2023-09-19
Filing date: 2023-09-19
Publication date: 2025-03-26
Also published as: GB202314333D0; WO2025061468A1

Abstract

An Immersive Voice Audio Services (IVAS) codec encodes audio signals in different formats (stereo, multichannel MC, object based audio, scene based audio and Metadata Assisted Spatial Audio MASA) and associates signal information with the data packet (eg. indicating whether the packet is compact or header-full, or IVAS or Enhanced Voice Services format) to assist decoding. Further information may indicate transport channel modification, rotation transforms, empty MC channels, voice or non-voice channels, and the audiostream may be received as a Real-time Transport Protocol (RTP) transmission.

Description

APPARATUS AND METHODS

Field

The present application relates to apparatus and methods for efficient utilization of frame headers, but not exclusively for efficient utilization of frame headers in an immersive voice and audio services (IVAS).

Background

Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR) as well as spatial voice communication including teleconferencing. This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.

The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). The supported input formats for IVAS are stereo, multichannel (MC), object-based audio (ISM), scene-based audio (SBA), and MASA. Mono input is processed with an EVS (Enhanced Voice Services, described in TS 26.445) compliant processing. In addition, the following combinations are supported: Objects with MASA (OMASA) and Objects with SBA (OSBA). The IVAS output formats for IVAS include mono, stereo, multichannel (including custom loudspeaker layouts), first order Ambisonics (FOA), higher (second) order Am bisonics (HOA2), higher (third) order Am bisonics (HOA3), and binaural. Similarly, it is expected that the decoder can output the audio in supported formats. A pass-through mode has been proposed, where the audio could be provided in its original format after transmission (encoding/decoding).

As it was designed as a spatial audio codec, it can support three degrees of rotation freedom (yaw, pitch, roll) and is expected to be used in a variety of scenarios.

Additionally RTP (Real-Time Transport Protocol) is intended for an end-toend, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of the RTP implementations are built on top of the User Datagram Protocol (UDP). Other transport protocols may also be utilized. RTP is used in together with other protocols such as H.323 and Real Time Streaming Protocol (RTSP).

The RTP specification describes two protocols: RTP and RTCP. RTP is used 15 for the transfer of multimedia data, and its companion protocol (RTCP) is used to periodically send control information and QoS (Quality of Service) parameters. RTP sessions are typically initiated between client and server or between client and another client (or a multi-party topology) using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols typically 20 use the Session Description Protocol (SDP), such as defined by RFC 8866 to specify parameters for the sessions.

Summary

There is provided according to a first aspect an apparatus comprising means configured to: obtain at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generate at least one audio signal data packet based on the at least one audio signal; and signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

The means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may further be configured to generate the at least one audio signal data packet with a defined size to signal whether the data packet is a compact or header-full format type audio packet.

The means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may be configured to generate the at least one audio signal data packet with a defined size to signal whether the data packet is encoded as a first or second type audio packet.

The means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may be configured to generate at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with at least one input audio signal format to assist the processing of the at least one audio signal data packet at the decoder.

The means configured to generate at least one identifier header associated with the at least one audio signal data packet may be further configured to generate the at least one identifier header to signal whether the data packet is encoded as a first or second type audio packet.

The first type audio packet may be an immersive voice and audio services 20 encoded audio packet and the second type audio packet may be an enhanced voice services encoded audio packet.

The means configured to generate at least one identifier header associated with the at least one audio signal data packet may be further configured to generate the at least one identifier header to signal a codec mode request change capacity.

The information associated with the at least one input audio signal format to assist a processing of the at least one audio signal data packet at a decoder may comprises one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/nonvoice channel of the at least one audio signal.

According to a second aspect there is provided an apparatus comprising means configured to: receive an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and process the at least one audio signal based on the information.

The means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be configured to determine the at least one audio signal data packet is encoded as a first or second type audio packet based on a defined size.

The means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be configured to determine the at least one audio signal data packet is encoded as a compact or header-full format type audio packet based on a defined size.

The audio packet stream may further comprise at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with the at least one input audio signal format, and the means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be configured to determine the information based on the information associated with the at least one input audio signal format.

The at least one identifier header may further comprise an indication as to whether the data packet is encoded as a first or second type audio packet. The first type audio packet may be an immersive voice and audio services encoded audio packet and the second type audio packet is an enhanced voice services encoded audio packet.

The means configured to process the at least one audio signal based on the information from the at least one identifier header may be configured to decode the audio packet based on whether the at least one audio packet was encoded as a first or second type audio packet as indicated by the at least one identifier header. The at least one identifier header associated with the at least one audio signal data packet may further comprise an indicator to signal a codec mode 5 request change capacity.

The information associated with the at least one input audio signal format may comprise one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/nonvoice channel of the at least one audio signal.

The means may be configured to receive the audio packet stream as a real-time transport protocol transmission.

According to a third aspect there is provided a method for an apparatus comprising: obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

Signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may further comprise generating the at least one audio signal data packet with a defined size to signal whether the data packet is a compact or header-full format type audio packet.

Signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may comprise generating the at least one audio signal data packet with a defined size to signal whether the data packet is encoded as a first or second type audio packet.

Signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may comprise generating at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with at least one input audio signal format to assist the processing of the at least one audio signal data packet at the decoder.. Generating at least one identifier header associated with the at least one audio signal data packet may further comprise generating the at least one identifier header to signal whether the data packet is encoded as a first or second type audio packet.

The first type audio packet may be an immersive voice and audio services encoded audio packet and the second type audio packet may be an enhanced voice services encoded audio packet.

Generating at least one identifier header associated with the at least one audio signal data packet may comprise generating the at least one identifier header to signal a codec mode request change capacity.

The information associated with the at least one input audio signal format to assist a processing of the at least one audio signal data packet at a decoder may comprise one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/non-voice channel of the at least one audio signal.

According to a fourth aspect there is provided a method for an apparatus comprising: receiving an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing the at least one audio signal based on the information from the at least one identifier header.

Determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may comprise determining the at least one audio signal data packet is encoded as a first or second type audio packet based on a defined size.

Determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may comprise determining the at least one audio signal data packet is encoded as a compact or header-full format type audio packet based on a defined size.

The audio packet stream may further comprise at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with the at least one input audio signal format, and determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may comprise determining the information based on the information associated with the at least one input audio signal format.

The at least one identifier header may further comprise an indication as to whether the data packet is encoded as a first or second type audio packet.

Processing the at least one audio signal based on the information from the at least one identifier header may comprise decoding the audio packet based on 25 whether the at least one audio packet was encoded as a first or second type audio packet as indicated by the at least one identifier header.

The at least one identifier header associated with the at least one audio signal data packet may further comprise an indicator to signal a codec mode request change capacity.

The information associated with the at least one input audio signal format may comprise one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/non-voice channel of the at least one audio signal.

The method may further comprise receiving the audio packet stream as a real-time transport protocol transmission.

According to a fifth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

The apparatus caused to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may further be caused to generate the at least one audio signal data packet with a defined size to signal whether the data packet is a compact or header-full format type audio packet.

The apparatus caused to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may be caused to generate the at least one audio signal data packet with a defined size to signal whether the data packet is encoded as a first or second type audio packet.

The apparatus caused to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder may be caused to generate at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with at least one input audio signal format to assist the processing of the at least one audio signal data packet at the decoder.

The apparatus caused to generate at least one identifier header associated with the at least one audio signal data packet may further be caused to generate the at least one identifier header to signal whether the data packet is encoded as a first or second type audio packet.

The apparatus caused to generate at least one identifier header associated with the at least one audio signal data packet may be caused to generate the at least one identifier header to signal a codec mode request change capacity.

According to a sixth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive an audio packet stream, the audio packet stream comprising at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and process the at least one audio signal based on the information.

The apparatus caused to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be caused to determine the at least one audio signal data packet is encoded as a first or second type audio packet based on a defined size.

The apparatus caused to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be caused to determine the at least one audio signal data packet is encoded as a compact or header-full format type audio packet based on a defined size.

The audio packet stream may further comprise at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with the at least one input audio signal format, and the apparatus caused to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet may be caused to determine the information based on the information associated with the at least one input audio signal format.

The apparatus caused to process the at least one audio signal based on the information from the at least one identifier header may be caused to decode the audio packet based on whether the at least one audio packet was encoded as a first or second type audio packet as indicated by the at least one identifier header. The at least one identifier header associated with the at least one audio 25 signal data packet may further comprise an indicator to signal a codec mode request change capacity.

According to a seventh aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generate at least one audio signal data packet based on the at least one audio signal; and generating circuitry configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

According to an eighth aspect there is provided an apparatus comprising receiving circuitry configured to receive an audio packet stream, the audio packet stream comprising at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining circuitry configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing circuitry configured to process the at least one audio signal based on the information from the at least one identifier header.

According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: receiving an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing the at least one audio signal based on the information from the at least one identifier header.

According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing the at least one audio signal based on the information from the at least one identifier header.

According to a thirteenth aspect there is provided an apparatus comprising: means for obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and means for signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder. According to a fourteenth aspect there is provided an apparatus comprising: means for receiving an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; means for determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and means for processing the at least one audio signal based on the information from the at least one identifier header.

According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.

According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving an audio packet stream, the audio packet stream comprising: at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing the at least one audio signal based on the information from the at least one identifier header.

An apparatus comprising means for performing the actions of the method as 20 described above.

An apparatus configured to perform the actions of the method as described above.

A computer program comprising program instructions for causing a computer to perform the method as described above.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the ad.

Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which: Figure 1 shows schematically example server and peer-to-peer teleconferencing systems within which embodiments may be implemented; Figure 2 shows schematically an example packet format with Enhanced Voice Services (EVS) Header-Full payload structure; Figure 3 shows schematically an example encoder configuration as shown in Figure 1 according to some embodiments; Figure 4 shows a flow diagram showing a method of operation for the 10 encoder as shown in Figure 3 according to some embodiments; Figure 5 shows schematically example Header-Full IVAS payload structures according to some embodiments; Figure 6a shows schematically example Table of Content (ToC) byte structures in Header-Full IVAS frame according to some embodiments; Figure 6b shows schematically example signalling of CMR in a IVAS Table of Content (ToC) byte structure; Figures 6c to 6o show tables of example bit configurations for signalling information; Figure 7 shows schematically an example decoder configuration as shown in Figure 1 according to some embodiments; Figure 8 shows a flow diagram showing a method of operation for the decoder as shown in Figure 7 according to some embodiments; and Figure 9 shows an example device suitable for implementing the apparatus shown.

Embodiments of the Application The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient IVAS audio.

An example system within which embodiments may be implemented is shown in Figure 1.

Figure 1, for example, shows an example teleconferencing system within which some embodiments can be implemented. In this example there is shown two sites or rooms, Room A 100 and Room B 102. Room A 100 comprises a 'talker' or user, Talker 1 103. Room B 102 comprises one 'talker' or user, Talker RX 141.

In the following example within room A is a suitable teleconference apparatus (or more generally telecommunications apparatus 110) configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. Within each of the other rooms may be a suitable teleconference apparatus (or more generally telecommunications apparatus such as apparatus 120 within room B) configured to render a spatial audio signal to the room and furthermore is configured to capture and encode at least a mono audio and optionally configured to spatially capture and encode the audio environment.

In the following examples each room is provided with the means to spatially capture, encode spatial audio signals, receive spatial audio signals and render these to a suitable listener. It would be understood that there may be other embodiments where the system comprises some apparatus configured to only capture and encode audio signals (in other words the apparatus is a 'transmit' only apparatus), and other apparatus configured to only receive and render audio signals (in other words the apparatus is a 'receive' only apparatus). In such embodiments the system within which embodiments may be implemented may comprise apparatus with varying abilities to capture/render audio signals.

The teleconference apparatus (for each site or room) 110, 120 can be configured to call into a teleconference controlled by and implemented over a server 111.

In some embodiments the communications or teleconferencing system comprises a (peer-to-peer or point-to-point) communications system (rather than the server based system shown in Figure 1) within which some embodiments can be implemented. Thus, for example, two or more UEs can be configured to interact directly with each other (for example to implement an immersive audio phone call between users). In such a scenario one of the UEs can be configured to deliver spatial ambience as one stream and employ a close-up microphone (for example a Lavalier microphone) to capture the speech as an audio object or audio source. The sender UE can be configured to encode the spatial ambience audio signals in a MASA format stream and the close-up microphone audio signal as an object format stream. The two audio streams can then be delivered as separated IVAS streams. The sender UE, in addition, can be configured to encode processing information during the encoding to deliver the PI (Processing Information) frames as described in GB2313472.9 together with the IVAS frames to the receiver UE.

In a further example, Room A UE uses spatial audio with MASA input format whereas the other UE is capable of decoding IVAS but not rendering to head tracked binaural audio.

Furthermore another scenario could be one in which room A and room B are connected via Server. Another user C joins in via a legacy device and delivers only EVS audio. Consequently, the payload from Server to room A may be a mix of EVS frames and IVAS frames, if the SFU (Selective Forwarding Unit) wishes to avoid performing any transcoding, which would require appropriate disambiguation between EVS and IVAS frames in the same session.

The teleconference apparatus can be configured to spatially capture and encode the audio environment and furthermore can be configured to render a spatial audio signal to the room. In this example only the communications or signalling path from the Room A 100 to the Room B 102 is shown for simplicity but a duplex or multipoint communication system comprising multiple signalling paths can be implemented using the methods as described herein without significant inventive input.

The teleconference apparatus (for each site or room) 110, 120 is further configured to communicate with each other to implement a teleconference function.

As shown in Figure 1, the apparatus 110, 120 and server 111 can comprise suitable encoder and decoder functionality. For example the apparatus 110 is shown comprising an (IVAS) encoder 101, the server 111 is shown comprising a (IVAS) decoder and encoder 121 and the apparatus 120 is shown comprising an (IVAS) decoder 131. In such a manner an object 120 (the audio signals representing the user or talker 1 111) can be encoded by the encoder 101 which generates a bitstream 106 to be passed to a server 111. The server 111 can then decode, (optionally then mix with other objects and otherwise process the audio signals) and encode then to generate the bitstream 108 to be passed to the apparatus 120. The apparatus 120 can then decode the audio signals and present them to the user or talker 'Talker RX' 141.

Although this example shows a teleconference application the encoder/decoder functionality can be applied to the streaming of any suitable media.

The IVAS decoder/renderer for each of the teleconference apparatus 102 can be furthermore configured to handle multiple input streams that may each originate from a different encoder.

As discussed previously RTP is intended for an end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP is furthermore designed to carry a multitude of multimedia formats, which permit the transport of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may therefore require a profile and payload format specifications.

The profile is configured to define the codec used to encode the payload data and the mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header.

For example, the RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format, and a PT value using Session Description Protocol (SDP). The latter mechanism is used for newer video codec such as RTP payload format for H.264 Video defined in RFC 6184 or RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798.

An RTP session can be established for each multimedia stream. Audio and video streams may be implemented which use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification can furthermore be configured to recommend port numbers for RTP, and furthermore to recommend the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.

Each RTP stream can comprise RTP packets, and the RTP packet in turn can comprise a RTP header and payload pair.

Enhanced Voice Services (EVS) is a mono voice codec standardized in 3GPP and described in the TS 26.445 specification document. The codec can have 5 two operating modes: EVS Primary and EVS AMR-WB 10 (Adaptive Multi Rate Wideband Inter-Operable).

The IVAS codec can be considered to be an extension to the EVS codec and as such the IVAS and EVS codecs can have some similarities in terms of design and implementation. The RTP payload structure, while not having been specified for the IVAS codec, is envisioned to have similarities to the RTP payload structure in EVS.

The RTP payload format of EVS is described in 3GPP TS 26.445 Annex A. In EVS, the RTP payload format is divided into two different embodiments: a Compact format and a Header-Full format. In the EVS Compact payload format, a RTP packet includes only a single EVS speech frame for EVS Primary mode. For EVS AMR-WB 10 mode, the compact RTP packet also includes a 3-bit Codec Mode Request (CMR) field in front of the speech frame. In the EVS Compact format, the different modes and bitrates for the speech frames are identified by the size of the RTP payload. For example, an RTP packet of size 328 bits is assigned for EVS Primary mode with 16.4 kbps bitrate, as is shown in Table A.1 in TS 26.445 Annex A. In EVS Header-Full format, the RTP payload consists of the speech frame(s) accompanied by an optional CMR byte and Table of Content (ToC) bytes. The CMR byte is used to request a change in bitrate or coding mode that the receiver wants to receive. The request is sent as a CMR byte as part of the Header-Full EVS packet. In EVS AMR-WB 10 Compact format, the CMR functionality is also present as a 3-bit signalling at the beginning of the packet.

The ToC bytes in EVS Header-Full format describe the mode and bitrate for the accompanied EVS coded speech frames. RTP payload structures for EVS Header-Full format are furthermore shown in Figure 2.

Figure 2 shows, for example, a payload structure of Header-Full format with ToC single frame 201, a payload structure of Header-Full format with ToC multiple frames 203, a payload structure of Header-Full format with CMR and ToC single frame 205 and a payload structure of Header-Full format with CMR and ToC multiple frames 207.

Each of the payload structures comprises ToC/CMR bytes where for each of these bytes there is a first bit 211, 221, 231 at the beginning of the ToC/CMR 5 bytes which can be used to differentiate the bytes between ToC (where the bit value is 0) and CMR (where the bit value is 1).

In the payload structure 201 where there is a single speech frame in the payload, the speech data 215 is preceded by a single ToC byte 213 (with optional zero padding 217 at the end of the payload).

In examples where the single speech frame payload structure can have an optional CMR byte, such as shown in payload structure 205. The payload structure 205 differs from the payload structure 201 in that there is a CMR byte 232 (with associated indicator bit 231 value of 1) positioned before the ToC byte 213.

In some example structures there can be two speech frames in the payload structure 203, 207. In this example an additional ToC byte 223 (with associated indicator bit 221 value of 0) and the ToC byte 213 is positioned at the beginning of the payload followed by the speech data frames 215 and 225 respectively to the order of the ToC bytes.

Furthermore the multiple speech frame payload structure can have an 20 optional CMR byte, such as shown in payload structure 207. The payload structure 207 differs from the payload structure 203 in that there is a CMR byte 232 (with associated indicator bit 231 value of 1) positioned before the ToC bytes 213, 223. The Table of Content (ToC) byte for IVAS RTP packets as will be detailed herein in further detail is sent with every frame in the IVAS Header-Full mode. The 25 concept as discussed in the embodiments herein discusses apparatus and method for enabling further efficiencies on utilizing all of the bits in the IVAS ToC byte.

For example there is a need for input format specific signalling to enable optimal decoding and rendering for IVAS frames. In some situations, for example, with MASA input format, the transport channels can be modified before encoding and transmission based on a recent head orientation in rendering. This improves audio quality as there needs to be less adjustment of the transport signals at the receiver UE during the render processing. This pre-adjustment can therefore perform most of the adjustment (before encoding at the sender UE), and can as discussed herein in further detail be signalled to the renderer to avoid "double adjustment" (at the receiver UE) and possible switching of left and right prototype channels in binaural rendering.

Furthermore IVAS is a stereo and immersive extension of the EVS codec.

Algorithmically, IVAS thus includes the EVS and utilizes it for coding of mono audio with no associated metadata. The following embodiments therefore discuss apparatus and methods for handling EVS and IVAS frames in the same IVAS session without requiring any additional or further headers.

Furthermore is shown apparatus and methods of provision for backward compatibility with EVS when using IVAS.

As such as the concepts which are described in further detail with respect to the following embodiments relate to apparatus and methods for efficient rendering of immersive conversational audio where there is provided a method for signalling transport audio channel modification as part of immersive audio codec format header to effectively utilize the improved audio quality (by avoiding repeating the transport audio channel modification at the receiver in an efficient manner).

Thus in some embodiments the IVAS frame header can be utilized to efficiently signal the disambiguation of EVS and IVAS frames for the overlapping protected frame sizes.

The efficient signalling for transport audio channel modification, EVS and IVAS disambiguation, CMR byte signalling and potentially other input format specific signalling can be achieved as follows: With respect to the encoder/transmitter/sender (e.g., consider IVAS coded conversational audio data): Form an identifier header for a data frame in an immersive voice codec (IVAS) session; Fill the additional bits in the identifier header based on the input format of the data frame (e.g., IVAS frame) in order to signal transport audio channel modification, EVS and IVAS disambiguation, or other input format specific decoding or rendering information.

Attach the identifier header to the IVAS frame and transmit the combination to the receiver With respect to the decoder/receiver/renderer: Receive an immersive voice codec IVAS frame and an associated identifier header; Read the identifier header and the IVAS frame; and Based on the input format of the IVAS frame, interpret the identifier header accordingly and adjust the processing of the data.

In some embodiments, in case of header-less transport of immersive voice frames, the EVS and IVAS disambiguation can be implemented by adding an additional bit to the EVS data. This additional bit can be employed to disambiguate the IVAS and EVS for the overlapping protected frame sizes. The receiver can in some embodiments be configured to recognize the packet as being EVS data by examining the size of the packet and identifies that the payload size (or frame size) is one bit extra as an IVAS frame does not carry the extra disambiguation bit.

The (IVAS) encoder according to some embodiments, such as the example encoder 101, is shown in Figure 3.

The encoder in some embodiment is configured to receive or otherwise obtain the audio information (or audio signals) 300 and the audio information format or type 304. In some embodiments the audio information format -in other words the type or format of the audio signals is determined by the encoder according to

any suitable method.

Additionally the encoder comprises a RTP generator 301 which is configured to receive the audio information 300 and the audio information format or type 304 and generate the RTP packets comprising the audio signal payload.

The RTP generator 301, in some embodiments, comprises a ToC header identifier generator 303. The ToC header identifier generator 303 is configured to form an identifier header for a data frame in an immersive voice codec (IVAS) session as is discussed in further detail herein.

The RTP generator 301, in some embodiments, further comprises a ToC 30 header appender or modifier 305. The ToC header appender or modifier 305 in some embodiments is configured to add or amend the generated ToC header in such a manner that additional bits in the identifier header are inserted or modified based on the input format of the data frame (e.g., IVAS frame) in order to signal information such as: transport audio channel modification; EVS and IVAS disambiguation; or other input format specific decoding or rendering information.

The RTP generator 301, in some embodiments, further comprises a frame appender 307. The frame appender 307 is configured to attach or append the ToC header to the IVAS frame (and transmit or store the combination for later use by the receiver/decoder).

Figure 4 shows the operations of the example RTP generator shown in Figure 3.

The first operation is one of obtaining or receiving audio information and type of audio information as shown in Figure 4 by step 401.

Then for the RTP packet (corresponding to an update frequency) is 15 generated as shown in Figure 4 by step 403.

The packet generation operation of step 403 can for example be divided into a first operation of generating an identifier header (or ToC header identifier) for a data frame in an immersive voice codec (IVAS) session as shown by step 405.

The packet generation operation of step 403 can further comprise filling or modifying bits in the identifier header based on the input format of the data frame (e.g., IVAS frame) in order to signal information such as: transport audio channel modification; EVS and IVAS disambiguation; or other input format specific decoding or rendering information, as shown by step 407.

The packet generation operation of step 403 further comprises attaching the identifier header to the IVAS frame and transmit the combination to the receiver as shown by step 409.

The Header-Full IVAS frames comprise one or more than one Table-ofContent (ToC) indicators and their associated IVAS speech frames. Figure 5 shows an example packet structure of Header-Full IVAS payload format.

A first example packet structure is shown in 501, which represents a payload structure when a single IVAS frame is coded in the Header-Full format. In this example the IVAS speech frame 505 is preceded by a single ToC byte 503.

Furthermore example packet structures are shown 511 and 521 which represent alternative versions of the payload structure when two IVAS frames are coded in the Header-Full format.

Thus a first multiple IVAS frame packet structure 511 comprises one in which all the ToC bytes 513, 514 are positioned at the beginning of the payload and before the IVAS frames 515, 516.

A further multiple IVAS frame packet structure 521 comprises the ToC bytes 523, 527 can be positioned before each respective or associated IVAS frame 525 and 529.

The example packet structures 501 and 511 are similar to the packet structures for the EVS codec described in TS 26.445, where all the EVS ToC bytes precede the EVS speech frames. The ToC bytes describe the content of the audio or speech frames in the packet and have similar higher-level functionality in the RTP packet.

An example of the ToC byte structure in IVAS Header-Full frame ToC byte structure according to some embodiments is shown in Figure 6a.

The bits in the IVAS ToC byte structure 651 can therefore in some embodiments have the following structure: F (1 bit) 653: If set to 1, the bit indicates that the corresponding frame is followed by another speech or PI (processing information) frame in this payload as described in GB2313472.9, implying that another ToC byte follows this entry. If set to 0, the bit indicates that this frame is the last frame in this payload and no further header entry follows this entry.

SSB (Supplemental Signalling Bits) (2-3 bits) 655 only when SSB is 2 bits or 655 and 657 when SSB is 3 bits: These can be bits reserved for future use. If 4 bits are reserved for the BLI indicator, the 3 SSB bits can be partly used to identify other frame content than bitrates/frame-size as demonstrated below. SSB bits can be used for IVAS input format specific information signalling such as audio transport channel modification for MASA, disambiguation between IVAS and EVS for the overlapping protected bitrates and other format specific signalling.

BLI (4-5 bits) 659 only when BLI is 4 bits or 659 and 657 when BLI is 5 bits: Bitrate Level Indication. These bits can be configured to indicate the bitrate or other frame content indication for the frame. From the content indication, the receiver can determine the size of the received frame either directly from the bitrate or from predetermined frame sizes. The BLI bits can indicate, for example, the bitrate of an IVAS speech frame.

In some examples the SSB bits 655 can also be described as extra bits and the BLI bits 659 can also described as FT (format type) bits.

As mentioned above, the IVAS ToC byte has 2-3 SSB bits that are currently unassigned and unused. The whole byte, however, is transmitted as a part of every Header-Full IVAS frame in an RTP packet.

The following examples describe how the ToC header appender 305 can be configured to utilize these 2-3 bits in a meaningful way. In some embodiments the ToC header appender 305 is configured to utilize only 1 bit, and in some embodiments the following information elements can be signalled in a combination of the information elements to utilize the available SSB bits in an efficient manner.

As discussed above an IVAS codec can have multiple different input formats. Some of these input formats can benefit from additional signalling that is not part of the actual IVAS speech frame. This kind of input format specific signalling can be placed or appended in the SSB bits and can be interpreted based on the input format of the coded IVAS frame.

For example, if the IVAS frame is based on MASA format input, the receiver can be configured to interpret the SSB bits in the IVAS ToC in a manner which is further specified below with respect to a determined MASA format IVAS frame. The below examples relate to some example IVAS input formats. It would be appreciated that other input formats could also employ input format specific signalling where the specific signalling is developed for these formats.

One example of input format specific signalling can be employed for the MASA format, when two transport channels are transmitted. For example, if headtracking is not used in rendering, and there is no external orientation being applied, the left and right prototype signals created as part of the parametric binaural rendering are directly the left and right decoded transport channels. On the other hand, when headtracking is used, or when the audio scene is rotated by some external orientation, the left and right prototype signals are created according to the desired (combined) rotation based on the decoded transport channels.

However, in some scenarios, such as with extreme head rotations, it can be beneficial to apply the rotation transform to the transport channels prior to encoding or transmitting to obtain better audio quality at the renderer in the renderer output. In particular, this can be the case when the MASA encoder input is a product of a downmix operation, e.g., downmix of MASA and one or more audio objects.

The prime example for this is when orientation is close to 90° to the left or right. In this case, the transport signals cannot provide good left-right separation to the renderer output as they now correspond to front and back directions. This reduces quality and also may result in an incorrect perception. It is known to perform pre-rotation at the sender side to attempt to produce an improved output.

This pre-rotation may in practice be a different construction or selection of transport channels. Where such pre-rotation has been performed, in other words, transport channels have been transformed at the sender or encoder, then this pre-rotation operation should be indicated so that it can be taken into account in the IVAS renderer (or any other renderer on the receiver side) and thus prevent the renderer from performing the same transformation to the transport channels. This for example is shown in the example tables in Figures 6f, 6g, and 6h. Figure 6f, for example, shows a table with example SSB bit values for a MASA format where Pre-rotation is defined in 90° steps. In this example the value 111 is reserved for indicating a IVAS CMR byte, and SSB combinations 100-110 are reserved for future use. Figure 6g, for example, shows a table with example SSB bit values for a MASA format where Pre-rotation is defined in 45° steps and where CMR is not supported. Figure 6h shows a table with example SSB bit values for a MASA format where Pre-rotation is defined in 60° steps, 110 is reserved for future use, 111 is reserved for IVAS CMR byte indication.

In some embodiments the indication that this transformation has been applied to the transport channels and that the rotation transformations in the IVAS renderer should take into account the already performed transformation could be signalled by one or more of the available or SSB bits in the IVAS ToC byte.

In some embodiments, the available bits or SSB bits within the ToC byte or header can be employed as follows (in this example positive azimuth angles indicate counter-clockwise rotations but it would be understood that in other examples or embodiments positive angles could indicate clockwise rotations): * If 1 bit available (pi radian steps) o Bit value 1, 180° azimuth pre-rotation performed o Bit value 0, no pre-rotation performed * If 2 bits available (pi/2 radian steps) o Bit value 11, 270° azimuth pre-rotation performed o Bit value 10, 180° azimuth pre-rotation performed o Bit value 01, 90° azimuth pre-rotation performed o Bit value 00, no pre-rotation performed * If 3 bits available, same as above but use 45° steps (pi/4 radian steps) * 4 or more bits, follow the same pattern and use an appropriate step size (2*pi / 2^Nbits steps) Although the above example shows the quantization steps being uniform in distribution it would be understood that any suitable quantization, for example nonuniform distribution of step sizes and/or quantization boundary locations could be 15 employed.

In some embodiments, the 1-bit signal value 1 can be used to instead signal that pre-rotation has been implemented with "enough accuracy for good quality" and further transformations should not be performed in the renderer.

In some further embodiments, the 1-bit signal value 1 may signal that there is additional pre-rotation data provided to the receiver. The pre-rotation data could be provided, for example, within a PI (processing information) frame such as described in patent application GB2313472.9 In yet further embodiments, the 1-bit signal value indicates that the pre-rotation has been implemented and the current pre-rotation value is within a specified range of rotation of the listener orientation. The specified range is the maximum expected rotation adjustment. This maximum rotation adjustment range is derived based on the rate of listener head orientation feedback signalling to the sender UE from the receiver UE.

It should be noted that MASA format is an example in this case. In practice, 30 any spatial audio format which is transmitted as two transport channels corresponding to left-right division could benefit from this approach when rendering to binaural with head-tracking.

For example, with an Ambisonics input at the encoder/sender, transport channels can be generated in any orientation which is a simple way of performing pre-rotation.

Another example of input format specific signalling can be signalling related to a multichannel (MC) input format audio signal. In a MC format, some of the transport channels can be empty and contain no audio data. In some embodiments at least one of the available or SSB bits in the IVAS ToC could be used to indicate presence of empty channels in the MC transport channels. Alternatively, in some embodiments two available or SSB bits could be used to indicate the number of empty channels (0, 1, 2, 3+), where 3-F would indicate 3 or more empty channels.

In some embodiments, where 3 SSB bits are available, the number of empty channels could be extended to cover 0 -7+ number of channels.

For example, Figures 6i to 61 show tables demonstrating example bit allocations for the SSB bits and their indicated meaning or semantic value. For example, Figure 6i shows a MC format example where empty channels are indicated up-to 3+ channels, SSB bits 100-110 are reserved for future use, SSB 111 is reserved for indicating IVAS CMR byte. Figure 6j shows a MC format example where empty channels are grouped to rear, height and LFE groups; SSB 111 is reserved for indicating IVAS CMR byte. Figure 6k shows a MC format example where empty channels are indicated up-to 6+ channels, SSB 111 is reserved for indicating IVAS CMR byte. Figure 61 shows a MC format example where empty channels are indicated as individual channels, SSB 111 is reserved for indicating IVAS CMR byte.

In some embodiments, the available or SSB bits could be used to identify specific empty channels. For example, employing 3 available or SSB bits, enables the ToC header appender 305 to indicate 8 individual channels as empty channels, which would cover all the channels in 5.1 and 7.1 MC input formats.

In some further embodiments, channels can be grouped together, and the available or SSB bits can be used to indicate the emptiness of the grouped channels. For example, the height, rear or LFE channels could be divided into separate groups for this purpose.

In some embodiments a decoder/renderer can be configured to apply a modified rendering process where there is information available that some channels are empty. For example, if the elevated channels are empty in a 5.1+4 multichannel format signal and the output played through a loudspeaker setup, the diffuse field can be created using only the horizontal speakers instead of all the speakers in a 3D set. This would produce a diffuse field closer to the original 5.1 signal, and thus produce a reproduction which is closer to the original input signal.

In some embodiments an emptiness of the transmitted channels could also be indicated by use of these available or empty bits when encoding other IVAS input formats.

In some embodiments the information signalled with respect to a format specific input could be one of signalling information associated with a stereo format input. The SSB bits being used for signalling input format specific information can be applicable for all the BLIs. Consequently, the semantics of the SSB bits is obtained in some cases as a combination of SSB bits and input format which is obtained from the IVAS frame.

In some embodiments the available or empty bits in the ToC header byte can be employed to indicate if the two stereo channels are uncorrelated. For example, if there are two talkers in the separate channels, which are derived from separate scenes. This could be achieved by using one bit from the available or SSB bits.

Another example for stereo input would be to employ the available or SSB bits to differentiate near and far stereo channels (e.g., original in W02021/053266, MASA-variant in W02009/135532). For example, if the sender is talking to a mobile phone on-ear, one channel would be captured with the microphone close to the mouth of the user (near channel) and the other channel would be captured with the further-away microphone (far channel), which would include ambient/surround/scene audio. The signalling could be achieved by attaching one bit for each channel, and the value of those bits would determine if the attached channels are near or far channels. Alternatively, the two examples could be combined by using the values of the two bits as indicators, e.g., in binary format (00 -normal operation, 01 -uncorrelated channels, 10 -near/far channel order, 11 -far/near channel order). With three bits (or with reordering of the previous indicators), more combinations could also be indicated, e.g., near/far order + correlated/uncorrelated combinations. For example this 2 bit example is related to a stereo input format, for other input formats the same bit sequences can be reemployed with different semantic choices for the signalling. Furthermore in some embodiments there may be some SSB bit sequences which are allocated for specific semantics that are not specific to a particular input format but rather relevant independent of the input format. Such SSB bit sequences can be consequently reserved or taken for all input formats.

With respect to Figures 6m and 6n are shown example stereo signalling format information. Thus, as shown in Figure 6m, is shown a table where for a stereo format the near/far and uncorrelated channel indication, SSB 110 is for future use, SSB 111 is reserved for indicating IVAS CMR byte. Figure 6n shows a table where for a stereo format the SSB comprises values for voice channel indication, SSB 011-110 is for future use, SSB 111 is reserved for indicating IVAS CMR byte. In case of CMR byte, the 4 BLI bits for requesting bitrates can cover the currently specified IVAS bitrate levels.

In another example, the employ the available or SSB bits could be used to indicate voice/non-voice channels. In the on-ear phone call example, the voice channel would be the one captured from closer to the mouth of the user and the non-voice channel the one captured with the further away microphone. Thus, the voice/non-voice indication would indicate near/far and uncorrelated channels combination with a single indicator in the SSB bits.

The voice/non-voice channel indication could also be used in other IVAS input formats. For example, this indication could be useful when transmitting MASA and object streams, where the MASA stream is carrying the voice data. In most use cases it could be expected that one of the object streams is carrying the voice and MASA the ambience, and with the voice/non-voice indication it is possible to revert this expectation. With multiple object streams, the voice/non-voice indication could also be beneficial to indicate one or more objects carrying voice data.

The IVAS codec uses EVS-conformant processing, when the input is a mono signal without spatial metadata. In practice, this means that in an IVAS session, there could be both IVAS and EVS coded frames present in the session. Different IVAS input formats are identified from the identifier bits present in the beginning of each IVAS frame. However, EVS frames do not have such identifier bits present, so in practice it is impossible to distinguish if a received frame is IVAS or EVS frame in an IVAS session, without using any additional headers.

EVS can operate in a Compact or Header-Full format. In the EVS Compact format, the EVS frames are size-protected, in other words, each bitrate is reserved for a single EVS operating mode. In the EVS Header-Full format, the bitrate and mode of the frame is distinguished with a ToC byte in front of the frame. The EVS ToC bytes are not compatible with the IVAS ToC bytes, in other words, they cannot be used in an IVAS session without complying modifications.

In the IVAS Header-Full format, the encoded frames are also preceded by a ToC byte, which indicates the bitrate or other frame content. Directly attaching an EVS frame to an IVAS ToC byte is not possible, since the EVS frame cannot be distinguished from an IVAS frame. For example, if the IVAS ToC indicates 13.2 kbps bitrate for the frame without any additional information, the frame could be either IVAS or EVS frame, since the bitrate is supported in both codecs. The decoder would then have no way of determining which one it would be.

To overcome this issue, the employ the available or SSB bits in the IVAS ToC byte could be employed to indicate an EVS frame type. The BLI part of the IVAS ToC could then be employed to determine the bitrate and operating mode for the EVS frame. Thus as shown in Figure 6c is shown a table where different EVS modes are explicitly differentiated by the SSB bit values. In Figure 6d there is shown a table where EVS interoperability is obtained by supporting the entire EVS payload including the CMR and ToC bytes. This has the benefit of reusing existing payload implementations for EVS mode within IVAS session with minimum consumption of SSB bits. In Figure 6e there is shown a table where EVS indication is integrated to the SSB + BLI (1110 or 1111) combinations. EVS interoperability is obtained by supporting the entire EVS payload including the CMR and ToC bytes. This has the benefit of reusing existing payload implementations for EVS mode within IVAS session with no consumption of SSB bits when they are used together with IVAS bitrates (BLI 0000-1101). In other words, with ToC headers with bit sequence indicating EVS, the IVAS payload signaling operates in a tunneling mode, which enables transmission of non IVAS frames within an IVAS session with IVAS RTP payload header. In different embodiments, the exact bit sequence in IVAS ToC header could be different. Also the tunneling could be used payloads other than EVS, such as split rendered output.

For example, the employ the available or SSB bits in the IVAS ToC could indicate (00 -IVAS, 01 -EVS Primary, 10 -EVS AMR-WB 10). The used bitrate and possible SPEECH_LOST, NO_DATA and SID frame indications would be determined from the combination of SSB (or extra bits) and BLI (or FT) part of the IVAS ToC bytes. As these bit sequences are related to indicating the IVAS or EVS or EVS AMR-WB 10 mode, these can be reserved bit sequences. Consequently SSB or extra bits cannot reuse the same bit sequences for any input format.

The following table, Table 1, therefore present available bitrates for the EVS codec and their associated operating modes. The bold lines indicate bitrates that are also employed in IVAS. As presented in the table, EVS has multiple bitrates that are not supported in IVAS and that are not covered by the BLI part of the IVAS ToC byte (for example as indicated by GB2313472.9). However, IVAS supports also higher bitrates that are not supported in EVS (160, 192, 256, 384, 512 kbps) and these are covered in IVAS ToC byte. Therefore as shown, there are 14 available bitrate indications in the IVAS ToC byte (13.2 -512 kbps), as presented in GB2313472.9. These bitrate indications in the IVAS ToC byte could be remapped when the SSB ToC bits indicate EVS Primary or EVS AMR-WB 10 mode to cover all the bitrates available for the associated modes. For example, for EVS Primary mode the bitrates 13.2 -128 kbps could be mapped to the first eight BLI bit sequences (00000-00111), and the remaining bitrates (2.4, 2.8, 7.2, 8, 9.6 kbps) could be mapped to the higher bitrate indications (>128 kbps) in the IVAS ToC byte (BLI 01000-01100), since those are not relevant for the EVS Primary mode. An example bitrate mapping for the EVS Primary mode bitrates in the IVAS ToC byte is presented in Table 2. EVS AMR-WB 10 mode bitrate mapping example is presented in Table 3. This approach would cover all the bitrates for EVS primary and EVS AMR-WB 10 mode with the existing ToC ranges from the EVS codec and adapt them to be used without conflicting with the IVAS codec protected sizes.

The above approach although not explicitly covering CMR (Codec mode request) functionality from the EVS codec, can implement CMR functionality using three bits from the available or SSB bits from the IVAS ToC header.

For example, as shown in Figure 6b, in some embodiments, the IVAS ToC can be repurposed for signalling CMR without specifying a separate CMR byte as in case of EVS. Figure 6b shows a ToC (CMR) 601, and three ToC (IVAS) ToC1 603, ToC2 605, ToC3 607 and respective IVAS payloads IVAS1 609, IVAS2 611, IVAS3 613. The ToC (CMR) 601 further comprises a leading bit F 602. The F (1 bit) which indicates that there are more ToCs present in the RTP packet payload. Additionally the ToC (CMR) 601 further comprises SSB bits 604 and BLI bits 606, where specifying a certain sequence of bits in SSB (e.g., 111) 604 indicates the interpretation of BLI bits 606 as the requested bitrate level. The example IVAS ToC byte employed to indicate CMR will use the F bit in the same way as in case of signalling the bitrate level indication of speech frames. The difference compared to the regular ToC byte usage is that the IVAS ToC byte for CMR shall not have a corresponding speech frame, unlike the IVAS ToC byte that is not signalling CMR.

This approach enables CMR functionality by repurposing the BLI bits for describing the speech frame bitrate as well as indicating requested bitrate. This dynamic allocation of SSB bits to indicate CMR avoids one bit being reserved for all headers, irrespective of whether CMR is being used or not.

For example, the employ the available or SSB bits could be used to indicate the frame content in binary format as (000 -IVAS, 001 -EVS Primary, 010 -EVS Primary with CMR, 011 -EVS AMR-WB 10, 100 -EVS AMR-WB 10 with CMR). The EVS frames with CMR functionality would have a CMR byte at the beginning of the payload, and the IVAS ToC byte would replace the EVS ToC byte. In this approach the BLI part of the IVAS ToC byte would have 4 bits available. The EVS bitrates presented in Table 2 and Table 3 could then be presented by removing the first bit on the left (0) (most significant bit in the column BLI bits) in the tables.

2.4 (EVS Primary SID) EVS Primary Special case EVS AMR-WB 10 EVS Primary EVS Primary 2.8 6.6 7.2 Payload EVS AMR-WB 10 184 8.85 EVS Primary 192 9.6 EVS AMR-WB 10 256 12.65 EVS Primary 264 13.2 EVS AMR-WB 10 288 14.25 EVS AMR-WB 10 320 15.85 EVS Primary 328 16.4 EVS AMR-WB 10 368 18.25 EVS AMR-WB 10 400 19.85 EVS AMR-WB 10 464 23.05 EVS AMR-WB 10 480 23.85 EVS Primary 488s 24.4 EVS Primary 640 32 EVS Primary 960 48 EVS Primary 1280 64 EVS Primary 1920 96 EVS Primary 2560 128 Table 1 -Payload sizes and bitrates for EVS codec modes, the green lines indicate bitrates colliding with the IVAS codec 13.2 16.4 24.4 2.4 (SID) I bits EVS Primary bitrate (kbps) 01001 2.8 01010 7.2 01011 8 01100 9.6 Table 2 -Example allocations for the BLI bits in the IVAS ToC byte and their associated EVS Primary mode bitrates.

Table 3-Example allocations for the BLI bits in the IVAS ToC byte and their associated EVS AMR-WB 10 mode bitrates.

In some embodiments the available or SSB bits can be employed to indicate an EVS frame in the BLI part of the IVAS ToC byte, when 5 bits is allocated for the BLI indicator. As presented in GB2313472.9, the BLI indicator in the IVAS ToC byte has available bit allocations for future use, and there is room for an EVS frame indication. For example, the value 10010 could be employed as presented in Table 4. If the BLI indicator has 4 bits allocated for it, an EVS frame could be identified with the combination of the available or SSB bits + the BLI indicator, as presented in Table 5. Either approach would enable the full use of EVS ToC and CMR bytes in the RTP packet, although an extra overhead would be introduced from the IVAS ToC byte. This is because the EVS ToC byte would be required to determine the 2.8 6.6 8.85 12.65 14.25 15.85 18.25 19.85 23.05 23.85 bitrate or the operating mode of each EVS frame, because the IVAS ToC byte would not provide that information anymore. The benefit of such an approach is that the existing EVS packetization/depacketization implementations can be reused.

BL 1 bits Frame content ndication 10010 EVS Table 4 -EVS frame indication in IVAS ToC byte (BLI part with 5 bits).

P!sse,!1!:::oit 1,Frome. 1indication rt tail M: 1;0itp + 1110 EVS Table 5 -EVS frame indication in IVAS ToC byte (3 SSB bits + BLI part with 4 bits).

The above approaches describe how the EVS codec could be integrated to the IVAS codec in Header-Full format. However, the necessary ToC bytes for the integration are not present in the Compact formats of either codecs. In the Compact format, IVAS and EVS frames could be distinguished from each other by modifying the frame size of either frame. For example, if an additional bit is added to the EVS frames, they would be distinguishable from the IVAS frames that operate with the same bitrate as the EVS frame. For example, if the operating bitrate is 13.2kbps, the size of an IVAS frame would be 264 bits and the size of an EVS frame 265 bits. The addition to the frame could be one bit, one byte or any other suitable value to disambiguate EVS frames from IVAS frames. This type of modification would only be necessary in bitrates that are supported by both codecs. The size modification of the payload could in some examples be indicated with the P (padding) bit in the RTP header.

With the above approach of payload size modification, both IVAS and EVS frames are supported in a Compact format IVAS session. Furthermore, if the unique payload sizes for both codecs (e.g. IVAS and modified EVS payload sizes) are protected during the session, the Header-Full IVAS frames would also be supported in the same session, if the Header-Full IVAS frames are ensured not to collide in size with the protected payload sizes. In some embodiments, the size collision avoidance can be ensured by zero padding the Header-Full IVAS frames. The padding can in some examples be indicated with the P (padding) bit in the RTP header.

Figure 6o shows an example of protected payload sizes in an IVAS session.

In Figure 6o, the shaded or highlighted rows indicate EVS bitrates and possibly modified payload sizes. The colliding bitrates between the codecs are explicitly differentiated with either (IVAS) or (EVS) to indicate clearly which payload size is used for which codec. In this example, the colliding EVS payload sizes are modified by adding one byte to them, the rest of the payloads are un-modified. If the sender is using IVAS Compact Format, the transmitted payload sizes can follow the table in Figure 6o. In IVAS Header-Full format, the sender ensures that the transmitted payloads are not colliding with the protected payload sizes, e.g., by adding bytes (zero padding) to the end of the payload. The receiver can identify the format of the payload (IVAS Compact or IVAS Header-Full) from the size of the payload. If the size fits one of the protected payload sizes, the payload is in IVAS Compact format and the correct bitrate and mode (IVAS or EVS) can be identified from the table of protected sizes. Otherwise, the payload is in IVAS Header-Full format and the payload is described by the ToC bytes in the payload. In case of Compact format, the payload size is the same as the protected size of the IVAS or EVS frame. In case of Header-Full format, the RTP payload size comprises one or more ToC headers, one or more IVAS frames and/or one or more EVS data with Header-Full format data. This EVS Header-Full data may be similar to any EVS Header-Full frame specified in TS 26.445. Tunnelling of IVAS Header-Full data within another IVAS Header-Full is not recommended. Only one level of tunnelling of aggregate frames of IVAS (in Header-Full mode) is recommended.

The (IVAS) decoder according to some embodiments, such as the example decoder 131, is shown in Figure 7.

The decoder in some embodiment is configured to receive or otherwise obtain the RTP packets 1600. Additionally the decoder comprises a RTP extractor 1601 which is configured to receive the RTP packets 1600 and extract from these RTP packets the PI payload and enable the decoder to render the audio signals based on the RTP packet (IVAS) payload.

The RTP extractor 1601, in some embodiments, comprises a payload type determiner 1603. The payload type determiner 1603 is configured to determine from the RTP packets the header identifier and adjust the processing of the data based on the input format of the IVAS frame and identifier values.

Figure 8 shows the operations of the example RTP extractor shown in Figure 7.

The first operation is one of obtaining or receiving RTP packet as shown in Figure 8 by step 1701.

Then for the received RTP packet read or obtain an identifier header and 10 the IVAS frame as shown in Figure 8 by step 1703.

Then interpret the identifier header accordingly and adjust the processing of the data based on the input format of the IVAS frame as shown in Figure 8 by step 1705.

With respect to Figure 9 an example electronic device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1800 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1800 comprises at least one processor or central processing unit 1807. The processor 1807 can be configured to execute 20 various program codes such as the methods such as described herein.

In some embodiments the device 1800 comprises a memory 1811. In some embodiments the at least one processor 1807 is coupled to the memory 1811. The memory 1811 can be any suitable storage means. In some embodiments the memory 1811 comprises a program code section for storing program codes implementable upon the processor 1807. Furthermore in some embodiments the memory 1811 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1807 whenever needed via the memory-processor coupling.

In some embodiments the device 1800 comprises a user interface 1805. The user interface 1805 can be coupled in some embodiments to the processor 1807.

In some embodiments the processor 1807 can control the operation of the user interface 1805 and receive inputs from the user interface 1805. In some embodiments the user interface 1805 can enable a user to input commands to the device 1800, for example via a keypad. In some embodiments the user interface 1805 can enable the user to obtain information from the device 1800. For example the user interface 1805 may comprise a display configured to display information from the device 1800 to the user. The user interface 1805 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1800 and further displaying information to the user of the device 1800.

In some embodiments the device 1800 comprises an input/output port 1809. The input/output port 1809 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1807 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).

The transceiver input/output port 1809 may be configured to receive the signals and in some embodiments obtain the focus parameters as described herein.

In some embodiments the device 1800 may be employed to generate a suitable audio signal using the processor 1807 executing suitable code. The input/output port 1809 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be a headtracked or a non-tracked headphones) or similar.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.

Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.

The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.

Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDS II, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

3GPP 3rd Generation Partnership Project AMR-WB 10 Adaptive Multi Rate Wideband Inter-Operable CMR Codec mode request EVS Enhanced Voice Services FOA First-Order Ambisonics FT Frame type (index) HOA2 2nd order Higher-Order Ambisonics HOA3 3rd order Higher-Order Ambisonics ISM Independent Streams with Metadata (i.e., type of Object-Based Audio) IVAS Immersive Voice and Audio Services kbps kilobits per second LFE Low-Frequency Effects MASA Metadata-Assisted Spatial Audio MC Multichannel OMASA Object-based audio with MASA (combined input format) OSBA Object-based audio with SBA (combined input format) PI Processing information (audio) RTP Real-Time Transport Protocol SBA Scene-Based Audio

SDP Session Description Protocol

ToC Table of Content

Claims

CLAIMS: 1. An apparatus comprising means configured to: obtain at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generate at least one audio signal data packet based on the at least one audio signal; and signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder.
2. The apparatus as claimed in claim 1, wherein the means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder is further configured to generate the at least one audio signal data packet with a defined size to signal whether the data packet is a compact or header-full format type audio packet.
3. The apparatus as claimed in any of claims 1 or 2, wherein the means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder is configured to generate the at least one audio signal data packet with a defined size to signal whether the data packet is encoded as a first or second type audio packet.
4. The apparatus as claimed in claim 1, wherein the means configured to signal information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a decoder is configured to generate at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with at least one input audio signal format to assist the processing of the at least one audio signal data packet at the decoder.
5. The apparatus as claimed in claim 4, wherein the means configured to generate at least one identifier header associated with the at least one audio signal data packet is further configured to generate the at least one identifier header to signal whether the data packet is encoded as a first or second type audio packet.
6. The apparatus as claimed in claims 3 or 5, wherein the first type audio packet is an immersive voice and audio services encoded audio packet and the second type audio packet is an enhanced voice services encoded audio packet.
7. The apparatus as claimed in claim 4 or any claim dependent on claim 4, wherein the means configured to generate at least one identifier header associated with the at least one audio signal data packet is further configured to generate the at least one identifier header to signal a codec mode request change capacity.
8. The apparatus as claimed in claim 4 or any claim dependent on claim 4, wherein the information associated with the at least one input audio signal format to assist a processing of the at least one audio signal data packet at a decoder comprises one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/non-voice channel of the at least one audio signal.
9. An apparatus comprising means configured to: receive an audio packet stream, the audio packet stream comprising at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and process the at least one audio signal based on the information.
10. The apparatus as claimed in claim 9, wherein the means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet is configured to determine the at least one audio signal data packet is encoded as a first or second type audio packet based on a defined size.
11. The apparatus as claimed in any of claims 9 or 10, wherein the means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet is configured to determine the at least one audio signal data packet is encoded as a compact or header-full format type audio packet based on a defined size.
12. The apparatus as claimed in claim 9, wherein the audio packet stream further comprises at least one identifier header associated with the at least one audio signal data packet, wherein the at least one identifier header comprises information associated with the at least one input audio signal format, and the means configured to determine information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet is configured to determine the information based on the information associated with the at least one input audio signal format.
13. The apparatus as claimed in claim 12, wherein the at least one identifier header further comprises an indication as to whether the data packet is encoded as a first or second type audio packet.
14. The apparatus as claimed in claim any of claims 10 or 13, wherein the first type audio packet is an immersive voice and audio services encoded audio packet and the second type audio packet is an enhanced voice services encoded audio packet.
15. The apparatus as claimed in claim 12 or any claim dependent on claim 12, wherein the means configured to process the at least one audio signal based on the information from the at least one identifier header is configured to decode the audio packet based on whether the at least one audio packet was encoded as a first or second type audio packet as indicated by the at least one identifier header.
16. The apparatus as claimed in claim 12 or any claim dependent on claim 12, wherein the at least one identifier header associated with the at least one audio signal data packet further comprises an indicator to signal a codec mode request change capacity.
17. The apparatus as claimed claim 12 or any claim dependent on claim 12, wherein the information associated with the at least one input audio signal format comprises one of: modification of at least one audio transport channel, the at least one audio transport channel associated with the at least one audio signal; rotation transforms prior to encoding the at least one audio signal; indicating an empty channel comprising no audio data within a multi-channel representation of the at least one audio signal; indicating an empty channel comprising no audio data within the at least one audio signal; indicating information associated with a stereo channel representation of the at least one audio signal; and indicating a voice/non-voice channel of the at least one audio signal.
18. The apparatus as claimed in any of claims 10 to 17, wherein the means configured to receive the audio packet stream as a real-time transport protocol transmission.
19. A method comprising: obtaining at least one audio signal, wherein the at least one audio signal is associated with at least one input audio signal format; generating at least one audio signal data packet based on the at least one audio signal; and signalling information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet at a 5 decoder.
20. A method comprising: receiving an audio packet stream, the audio packet stream comprising at least one audio signal data packet comprising at least one audio signal, the at least one audio signal is associated with at least one input audio signal format; determining information associated with the at least one audio signal data packet to assist a processing of the at least one audio signal data packet; and processing the at least one audio signal based on the information from the at least one identifier header.
21. The method as claimed in any of claims 19 and 20, further comprising receiving the audio packet stream as a real-time transport protocol transmission.