GB2639269A - Immersive conversational audio
- Publication number: GB2639269A
- Application number: GB2403747.5
- Authority
- GB
- United Kingdom
- Prior art keywords
- metadata
- frame
- time
- transmission mode
- audio data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- G10L19/012—Comfort noise or silence coding
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
- H04L65/61—Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
- H04L65/65—Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
- H04L65/70—Media network packetisation
- H04W76/28—Discontinuous transmission [DTX]; Discontinuous reception [DRX]
Abstract
An apparatus for controlling transmission of metadata related to a conversational immersive audio, the apparatus comprising means configured to: obtain a temporal parameter 1101, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period 1103; obtain at least one metadata 1105 during the discontinuous transmission mode period; and control 1107 the transmission of the at least one metadata based on the temporal parameter.
Description
IMMERSIVE CONVERSATIONAL AUDIO
Field
The present application relates to apparatus and methods for signalling for immersive conversational audio, but not exclusively for implementing processing information signalling for immersive conversational audio employing immersive voice and audio services (IVAS) real-time transport protocol (RTP) payload during discontinuous transmission (DTX).
Background
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is designed to be suitable for use over a communications network such as a 3GPP 4G/5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR) as well as spatial voice communication including teleconferencing. This audio codec handles the encoding, decoding and rendering of speech, music and generic audio. It provides backwards interoperable EVS mono operation and furthermore supports stereo, scene-based audio (SBA), metadata-assisted spatial audio (MASA), channel-based audio, object-based audio, and certain combinations of these input formats. The codec operates with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
The input signals are presented to the IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, the decoder can output the audio in several supported formats including a pass-through operation, where the audio can be provided in its original format after transmission (encoding/decoding).
Additionally, RTP (Real-Time Transport Protocol) is intended for end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of RTP implementations are built on top of the User Datagram Protocol (UDP), although other transport protocols may also be utilized. RTP is used together with other protocols such as H.323 and the Real-Time Streaming Protocol (RTSP).
The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and its companion protocol (RTCP) is used to periodically send control information and QoS (Quality of Service) parameters.
RTP sessions are typically initiated between client and server or between client and another client (or a multi-party topology) using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols typically use the Session Description Protocol (SDP), as defined by RFC 8866, to specify parameters for the sessions. The RTP specification recommends even port numbers for RTP, and the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols. Each RTP stream consists of RTP packets, which in turn consist of RTP header and payload pairs.
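As a non-normative illustration, the sketch below pairs an even RTP port with the next odd RTCP port and emits a minimal RFC 8866-style media description; the dynamic payload type 96 and the codec name string "IVAS" are assumptions made for the example, since no static payload type is defined here.

```python
def allocate_rtp_ports(base_port: int) -> tuple:
    """Return an (rtp_port, rtcp_port) pair with an even RTP port."""
    rtp_port = base_port if base_port % 2 == 0 else base_port + 1
    return rtp_port, rtp_port + 1

rtp_port, rtcp_port = allocate_rtp_ports(49170)

# Minimal SDP media description for the session (RFC 8866 syntax).
sdp_media = (
    f"m=audio {rtp_port} RTP/AVP 96\r\n"
    "a=rtpmap:96 IVAS/48000/2\r\n"  # payload type 96 and codec name assumed
)
```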
Summary
There is provided according to a first aspect an apparatus for controlling transmission of metadata related to a conversational immersive audio, the apparatus comprising means configured to: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.

The means configured to control the transmission of the at least one metadata may be configured to at least one of: associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted preceding the obtaining of the at least one metadata; associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted succeeding the obtaining of the at least one metadata; and associate at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
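As a non-normative illustration of this control flow, the following Python sketch buffers metadata produced during a DTX period and associates it with an outgoing frame based on a temporal parameter; all class and function names are assumptions introduced here, not part of the claimed apparatus.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataFrame:
    payload: bytes
    timestamp: int  # the temporal parameter, e.g. capture time in RTP timestamp units

@dataclass
class DtxMetadataController:
    """Sketch of the first-aspect flow: buffer metadata obtained during a
    DTX period and transmit it in association with an audio frame."""
    in_dtx: bool = False
    pending: list = field(default_factory=list)

    def on_vad_decision(self, speech_active: bool) -> None:
        # Entering or leaving the discontinuous transmission mode period.
        self.in_dtx = not speech_active

    def on_metadata(self, md: MetadataFrame, send) -> None:
        if self.in_dtx:
            # Obtained during DTX: hold for later association, e.g. with
            # a silence descriptor (SID) frame or the next active frame.
            self.pending.append(md)
        else:
            send(md, associate_with=md.timestamp)

    def on_transmit_opportunity(self, frame_timestamp: int, send) -> None:
        # A SID frame or a post-DTX active frame is about to be sent:
        # attach the buffered metadata, associated by temporal parameter.
        for md in self.pending:
            send(md, associate_with=frame_timestamp)
        self.pending.clear()
```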
The means configured to control the transmission of the at least one metadata may be configured to at least one of: associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted before the start of the discontinuous transmission mode period; associate at least one metadata based on the temporal parameter with a latest transmitted at least one immersive audio data frame or silence descriptor frame; associate at least one metadata with an immersive audio data frame transmitted after an end of the discontinuous transmission mode period; and associate at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
The means configured to control the transmission of the at least one metadata may be configured to transmit the at least one metadata during the discontinuous transmission mode period.
The means configured to control the transmission of the at least one metadata may be further configured to: determine a further time period following a start of the discontinuous transmission mode period; and select, as the associated at least one metadata, at least one of the at least one metadata obtained during the discontinuous transmission mode period and further obtained during the further time period within the discontinuous transmission mode period.
The means configured to control the transmission of the at least one metadata may be configured to delay a transmission of the at least one metadata.

The means configured to control a transmission of the at least one metadata may be configured to at least one of: select a latest of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; and generate, as the at least one metadata to be transmitted, a combination metadata based on a combination of metadata obtained during the discontinuous transmission mode period.
The means configured to select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may be configured to select, based on a sampling pattern, more than one of the at least one metadata.
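A minimal sketch of the selection strategies listed above, assuming metadata frames buffered in arrival order; the merge rule passed to combine() is an illustrative assumption, not a defined combination method.

```python
def select_latest(pending):
    # Strategy 1: transmit only the most recently obtained metadata.
    return [pending[-1]] if pending else []

def select_one(pending, index=0):
    # Strategy 2: transmit one chosen metadata frame.
    return [pending[index]] if pending else []

def select_sampled(pending, every_n=4):
    # Strategy 3: transmit more than one, chosen by a sampling pattern
    # (here: every Nth buffered frame).
    return pending[::every_n]

def combine(pending, merge):
    # Strategy 4: fold all buffered metadata into one combination frame
    # using a caller-supplied merge rule (e.g. averaging orientations).
    if not pending:
        return []
    out = pending[0]
    for md in pending[1:]:
        out = merge(out, md)
    return [out]
```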
The means configured to control the transmission of the at least one metadata based on the temporal parameter may be configured to delay the transmission of the at least one metadata frame obtained during the discontinuous transmission mode period.
The at least one metadata may comprise at least one processing information frame for assisting processing of the at least one immersive audio data frame.
The means configured to control the transmission of the at least one metadata based on the temporal parameter may be configured to packetize the at least one metadata in a transport protocol payload, the transport protocol payload may comprise at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
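One of the listed transport options, the RTP header extension, can be illustrated with the generic one-byte extension mechanism of RFC 8285; the extension ID used below is an assumed, SDP-negotiated value, and nothing here is an IVAS-specified layout.

```python
import struct

def pack_one_byte_header_extension(ext_id: int, data: bytes) -> bytes:
    """Wrap a small metadata frame in an RFC 8285 one-byte header extension."""
    assert 1 <= ext_id <= 14 and 1 <= len(data) <= 16
    element = bytes([(ext_id << 4) | (len(data) - 1)]) + data
    element += b"\x00" * ((-len(element)) % 4)  # pad to a 32-bit boundary
    # 0xBEDE marks the one-byte-header profile; length is in 32-bit words.
    return struct.pack("!HH", 0xBEDE, len(element) // 4) + element

ext = pack_one_byte_header_extension(5, b"\x01\x02\x03")  # ID 5 is assumed
```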
The means configured to control the transmission of the at least one metadata based on the temporal parameter may be configured to switch the transmission of the at least one metadata from a real-time transport protocol to real-time transport control protocol.
The at least one metadata may comprise at least one of: information identifying an orientation of the apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of the apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at the apparatus; and information about an acoustic environment.
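Purely for illustration, the metadata items listed above could be gathered in a structure along the following lines; the field names and types are assumptions made here and do not reflect a normative IVAS definition.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ProcessingInfo:
    # Orientation of the (capturing) apparatus as yaw/pitch/roll in degrees.
    device_orientation: Optional[Tuple[float, float, float]] = None
    # Microphone array constellation configuration identifier.
    mic_array_config: Optional[str] = None
    # Orientation already applied to the immersive audio data frames.
    applied_orientation: Optional[Tuple[float, float, float]] = None
    # Frames (or seconds) until an input or coded format change.
    frames_until_format_change: Optional[int] = None
    # Data allowing the receiver to revert binauralization.
    binauralization_data: Optional[bytes] = None
    # Description of the acoustic environment.
    acoustic_environment: Optional[str] = None
```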
The means may be further configured to negotiate with a further apparatus support for a discontinuous transmission mode and support for the control of the transmission of the at least one metadata based on the temporal parameter.
The means configured to negotiate with the further apparatus may be configured to negotiate employing a session description file.
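For illustration, such a negotiation could surface in the session description as media format parameters; "dtx" mirrors the EVS media-type parameter of that name, while "dtx-pi" is a hypothetical attribute invented here solely to show how support for PI frames during DTX might be declared (payload type 96 is also assumed).

```python
# Hypothetical offer/answer fmtp lines in an SDP exchange.
offer  = "a=fmtp:96 dtx=1; dtx-pi=1\r\n"   # offerer supports DTX and PI-in-DTX
answer = "a=fmtp:96 dtx=1; dtx-pi=1\r\n"   # answerer accepts both features
```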
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame.
According to a second aspect there is provided an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the apparatus comprising means configured to: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.
The means may be further configured to determine that the at least one metadata was generated during a discontinuous transmission mode period.
The means configured to determine that the at least one metadata was generated during a discontinuous transmission mode period may be configured to determine from an analysis of the transport protocol payload that the at least one metadata was generated during a discontinuous transmission mode period.
The transport protocol payload may further comprise the at least one immersive audio data frame.
The means may be further configured to process the at least one metadata to enable the at least one metadata to assist a processing of the immersive audio data frame.
The means configured to process the at least one metadata to assist the processing of the immersive audio data frame may be configured to modify an association between the at least one metadata and the at least one immersive audio data frame based on the temporal parameter.
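A receiver-side sketch of such an association, matching each metadata frame to the audio frame with the nearest temporal parameter; the nearest-timestamp rule and all names are assumptions for illustration.

```python
from collections import namedtuple

Stamped = namedtuple("Stamped", "timestamp payload")

def associate_metadata(audio_frames, metadata_frames):
    """Map each metadata frame to the audio frame with the closest timestamp."""
    out = {}
    for md in metadata_frames:
        target = min(audio_frames, key=lambda f: abs(f.timestamp - md.timestamp))
        out.setdefault(target.timestamp, []).append(md)
    return out

# Example: metadata captured mid-DTX attaches to the nearest SID frame.
frames = [Stamped(0, b"active"), Stamped(1600, b"sid"), Stamped(4800, b"active")]
meta = [Stamped(2000, b"orientation-update")]
print(associate_metadata(frames, meta))  # -> {1600: [Stamped(2000, ...)]}
```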
The means configured to process the at least one metadata to assist the processing of an immersive audio data frame may be configured to select at least one of the at least one metadata to assist the processing of an immersive audio data frame.
The means configured to process the at least one metadata to assist the processing of an immersive audio data frame may be configured to at least one of: select a latest of the at least one metadata; select one of the at least one metadata; select more than one of the at least one metadata; and generate, as the selected at least one metadata, a combination metadata.
The means configured to select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may be configured to select, based on a sampling pattern, more than one of the at least one metadata.
The combination metadata may comprise a combination of information from the at least one metadata.
The means may be further configured to extract from the transport protocol payload the at least one immersive audio data frame, wherein the at least one immersive audio data frame is at least one of: transmitted either before or after the discontinuous transmission mode period; and a silence descriptor data frame transmitted during the discontinuous transmission mode period.
The means configured to determine from the transport protocol payload at least one metadata may be configured to extract the at least one metadata from at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
The at least one metadata may comprise at least one of: information identifying an orientation of a further apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of a further apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for the at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at a further apparatus; and information about an acoustic environment.
The means may be further configured to negotiate with a further apparatus support for the discontinuous transmission mode and support for a control of transmission of the at least one metadata based on the temporal parameter at the further apparatus.
The means configured to negotiate with the further apparatus may be configured to negotiate employing a session description file.
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame.
The at least one metadata may comprise processing information.
According to a third aspect there is provided a method for an apparatus for controlling transmission of metadata related to a conversational immersive audio, the method comprising: obtaining a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determining a discontinuous transmission mode period; obtaining at least one metadata during the discontinuous transmission mode period; and controlling the transmission of the at least one metadata based on the temporal parameter.
Controlling the transmission of the at least one metadata may comprise at least one of: associating at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted preceding the obtaining of the at least one metadata; associating at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted succeeding the obtaining of the at least one metadata; and associating at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
Controlling the transmission of the at least one metadata may comprise at least one of: associating at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted before the start of the discontinuous transmission mode period; associating at least one metadata based on the temporal parameter with a latest transmitted at least one immersive audio data frame or silence descriptor frame; associating at least one metadata with an immersive audio data frame transmitted after an end of the discontinuous transmission mode period; and associating at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.

Controlling the transmission of the at least one metadata may comprise transmitting the at least one metadata during the discontinuous transmission mode period.
Controlling the transmission of the at least one metadata may further comprise: determining a further time period following a start of the discontinuous transmission mode period; and selecting, as the associated at least one metadata, at least one of the at least one metadata obtained during the discontinuous transmission mode period and further obtained during the further time period within the discontinuous transmission mode period.
Controlling the transmission of the at least one metadata may comprise delaying a transmission of the at least one metadata.
Controlling a transmission of the at least one metadata may comprise at least one of: selecting a latest of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; selecting one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; selecting more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; and generating, as the at least one metadata to be transmitted, a combination metadata based on a combination of metadata obtained during the discontinuous transmission mode period.
Selecting more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may comprise selecting, based on a sampling pattern, more than one of the at least one metadata.
Controlling the transmission of the at least one metadata based on the temporal parameter may comprise delaying the transmission of the at least one metadata frame obtained during the discontinuous transmission mode period.
The at least one metadata may comprise at least one processing information frame for assisting processing of the at least one immersive audio data frame.
Controlling the transmission of the at least one metadata based on the temporal parameter may comprise packetizing the at least one metadata in a transport protocol payload, the transport protocol payload may comprise at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
Controlling the transmission of the at least one metadata based on the temporal parameter may comprise switching the transmission of the at least one metadata from a real-time transport protocol to real-time transport control protocol.
The at least one metadata may comprise at least one of: information identifying an orientation of the apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of the apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at the apparatus; and information about an acoustic environment.
The method may further comprise negotiating with a further apparatus support for a discontinuous transmission mode and support for the control of the transmission of the at least one metadata based on the temporal parameter.
Negotiating with the further apparatus may comprise negotiating employing a session description file.
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame.
According to a fourth aspect there is provided a method for an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the method comprising: obtaining a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determining from the transport protocol payload the at least one metadata based on the temporal parameter.
The method may further comprise determining that the at least one metadata was generated during a discontinuous transmission mode period.
Determining that the at least one metadata was generated during a discontinuous transmission mode period may comprise determining from an analysis of the transport protocol payload that the at least one metadata was generated during a discontinuous transmission mode period.
The transport protocol payload may further comprise the at least one immersive audio data frame.
The method may further comprise processing the at least one metadata to enable the at least one metadata to assist a processing of the immersive audio data frame.
Processing the at least one metadata to assist the processing of the immersive audio data frame may comprise modifying an association between the at least one metadata and the at least one immersive audio data frame based on the temporal parameter.
Processing the at least one metadata to assist the processing of an immersive audio data frame may comprise selecting at least one of the at least one metadata to assist the processing of an immersive audio data frame.
Processing at least one of the at least one metadata to assist the processing of an immersive audio data frame may comprise at least one of: selecting a latest of the at least one metadata; selecting one of the at least one metadata; selecting more than one of the at least one metadata; and generating, as the selected at least one metadata, a combination metadata.
Selecting more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may comprise selecting, based on a sampling pattern, more than one of the at least one metadata.
The combination metadata may comprise a combination of information from the at least one metadata.
The method may further comprise extracting from the transport protocol payload the at least one immersive audio data frame, wherein the at least one immersive audio data frame is at least one of: transmitted either before or after the discontinuous transmission mode period; and a silence descriptor data frame transmitted during the discontinuous transmission mode period.
Determining from the transport protocol payload at least one metadata may comprise extracting the at least one metadata from at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
The at least one metadata may comprise at least one of: information identifying an orientation of a further apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of a further apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for the at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at a further apparatus; and information about an acoustic environment.
The method may further comprise negotiating with a further apparatus support for the discontinuous transmission mode and support for a control of transmission of the at least one metadata based on the temporal parameter at the further apparatus.
Negotiating with the further apparatus may comprise negotiating employing a session description file.
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame.
The at least one metadata may comprise processing information.
According to a fifth aspect there is provided an apparatus for controlling transmission of metadata related to a conversational immersive audio, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.
The apparatus caused to control the transmission of the at least one metadata may be caused to at least one of: associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted preceding the obtaining of the at least one metadata; associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted succeeding the obtaining of the at least one metadata; and associate at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
The apparatus caused to control the transmission of the at least one metadata may be caused to at least one of: associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted before the start of the discontinuous transmission mode period; associate at least one metadata based on the temporal parameter with a latest transmitted at least one immersive audio data frame or silence descriptor frame; associate at least one metadata with an immersive audio data frame transmitted after an end of the discontinuous transmission mode period; and associate at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
The apparatus caused to control the transmission of the at least one metadata may be caused to transmit the at least one metadata during the discontinuous transmission mode period.
The apparatus caused to control the transmission of the at least one metadata may be further caused to: determine a further time period following a start of the discontinuous transmission mode period; and select, as the associated at least one metadata, at least one of the at least one metadata obtained during the discontinuous transmission mode period and further obtained during the further time period within the discontinuous transmission mode period.
The apparatus caused to control the transmission of the at least one metadata may be caused to delay a transmission of the at least one metadata.

The apparatus caused to control a transmission of the at least one metadata may be caused to at least one of: select a latest of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; and generate, as the at least one metadata to be transmitted, a combination metadata based on a combination of metadata obtained during the discontinuous transmission mode period.
The apparatus caused to select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may be caused to select, based on a sampling pattern, more than one of the at least one metadata.
The apparatus caused to control the transmission of the at least one metadata based on the temporal parameter may be caused to delay the transmission of the at least one metadata frame obtained during the discontinuous transmission mode period.
The at least one metadata may comprise at least one processing information frame for assisting processing of the at least one immersive audio data frame.
The apparatus caused to control the transmission of the at least one metadata based on the temporal parameter may be caused to packetize the at least one metadata in a transport protocol payload, the transport protocol payload may comprise at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
The apparatus caused to control the transmission of the at least one metadata based on the temporal parameter may be caused to switch the transmission of the at least one metadata from a real-time transport protocol to real-time transport control protocol.
The at least one metadata may comprise at least one of: information identifying an orientation of the apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of the apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at the apparatus; and information about an acoustic environment.
The apparatus may be further caused to negotiate with a further apparatus support for a discontinuous transmission mode and support for the control of the transmission of the at least one metadata based on the temporal parameter.
The apparatus caused to negotiate with the further apparatus may be caused to negotiate employing a session description file.
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame.
According to a sixth aspect there is provided an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.
The apparatus may be further caused to determine that the at least one metadata was generated during a discontinuous transmission mode period.
The apparatus caused to determine that the at least one metadata was generated during a discontinuous transmission mode period may be caused to determine from an analysis of the transport protocol payload that the at least one metadata was generated during a discontinuous transmission mode period.
The transport protocol payload may further comprise the at least one immersive audio data frame.
The apparatus may be further caused to process the at least one metadata to enable the at least one metadata to assist a processing of the immersive audio data frame.
The apparatus caused to process the at least one metadata to assist the processing of the immersive audio data frame may be caused to modify an association between the at least one metadata and the at least one immersive audio data frame based on the temporal parameter.
The apparatus caused to process the at least one metadata to assist the processing of an immersive audio data frame may be caused to select at least one of the at least one metadata to assist the processing of an immersive audio data frame.
The apparatus caused to process the at least one metadata to assist the processing of an immersive audio data frame may be caused to at least one of: select a latest of the at least one metadata; select one of the at least one metadata; select more than one of the at least one metadata; and generate, as the selected at least one metadata, a combination metadata.
The apparatus caused to select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted may be caused to select, based on a sampling pattern, more than one of the at least one metadata.
The combination metadata may comprise a combination of information from the at least one metadata.
The apparatus may be further caused to extract from the transport protocol payload the at least one immersive audio data frame, wherein the at least one immersive audio data frame is at least one of: transmitted either before or after the discontinuous transmission mode period; and a silence descriptor data frame transmitted during the discontinuous transmission mode period.
The apparatus caused to determine from the transport protocol payload at least one metadata may be caused to extract the at least one metadata from at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
The at least one metadata may comprise at least one of: information identifying an orientation of a further apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of a further apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for the at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at a further apparatus; and information about an acoustic environment.
The apparatus may be further caused to negotiate with a further apparatus support for the discontinuous transmission mode and support for a control of transmission of the at least one metadata based on the temporal parameter at the further apparatus.
The apparatus caused to negotiate with the further apparatus may be caused to negotiate employing a session description file.
The temporal parameter may comprise at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
The at least one metadata may be at least one metadata frame. The at least one metadata may comprise processing information.
According to a seventh aspect there is provided an apparatus for controlling transmission of metadata related to a conversational immersive audio, the apparatus comprising: obtaining circuitry configured to obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determining circuitry configured to determine a discontinuous transmission mode period; obtaining circuitry configured to obtain at least one metadata during the discontinuous transmission mode period; and controlling circuitry configured to control the transmission of the at least one metadata based on the temporal parameter.
According to an eighth aspect there is provided an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the apparatus comprising: obtaining circuitry configured to obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determining circuitry configured to determine from the transport protocol payload the at least one metadata based on the temporal parameter.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for controlling transmission of metadata related to a conversational immersive audio, to perform at least the following: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for controlling reception of metadata related to a conversational immersive audio signal, to perform at least the following: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.

According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for controlling transmission of metadata related to a conversational immersive audio, to perform at least the following: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for controlling reception of metadata related to a conversational immersive audio signal, to perform at least the following: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.
According to a thirteenth aspect there is provided an apparatus for controlling transmission of metadata related to a conversational immersive audio, the apparatus comprising: means configured to obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; means configured to determine a discontinuous transmission mode period; means configured to obtain at least one metadata during the discontinuous transmission mode period; and means configured to control the transmission of the at least one metadata based on the temporal parameter.
According to a fourteenth aspect there is provided an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the apparatus comprising: means configured to obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and means configured to determine from the transport protocol payload the at least one metadata based on the temporal parameter.
According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for controlling transmission of metadata related to a conversational immersive audio, to perform at least the following: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.
According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for controlling reception of metadata related to a conversational immersive audio signal, to perform at least the following: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically example server and peer-to-peer teleconferencing systems within which embodiments may be implemented;
Figure 2 shows schematically an example Table of Content (ToC) byte structure for a Header-Full IVAS frame according to some embodiments;
Figure 3 shows schematically an example Table of Content (ToC) byte header structure according to some embodiments;
Figure 4 shows schematically an example processing information (PI) frame in an IVAS RTP payload according to some embodiments;
Figure 5 shows schematically an example IVAS extra (E) byte header structure according to some embodiments;
Figure 6 shows schematically an example RTP header with IVAS payload structure according to some embodiments;
Figure 7 shows schematically active and DTX operation in an example IVAS frame sequence;
Figure 8 shows schematically a PI frame during DTX operations;
Figure 9 shows schematically a PI frame during DTX operations implementing a two-way solution according to some embodiments;
Figure 10 shows schematically a flow diagram of session negotiation offer and answer operations for DTX PI operation support according to some embodiments;
Figure 11 shows schematically a flow diagram summarizing the operations according to some embodiments;
Figure 12a shows schematically a flow diagram for hangover PI frames in an IVAS RTP stream payload according to some embodiments;
Figure 12b shows schematically a flow diagram for delayed PI frames in an IVAS RTP stream payload according to some embodiments; and
Figure 13 shows an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of efficient IVAS audio.
An example system within which embodiments may be implemented is shown in Figure 1.
Figure 1, for example, shows an example teleconferencing system within which some embodiments can be implemented. In this example there are shown two sites or rooms, Room A 100 and Room B 102. Room A 100 comprises a 'talker' or user, Talker 1 103. Room B 102 comprises one 'talker' or user, Talker RX 141.
In the following example within room A is a suitable teleconference apparatus (or more generally telecommunications apparatus 110) configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. The apparatus can in some embodiments be implemented by a user equipment (UE) operating within a cellular communications system or accessing any suitable access network. Within each of the other rooms may be a suitable teleconference apparatus (or more generally telecommunications apparatus such as apparatus 120 within room B) configured to render a spatial audio signal to the room and furthermore is configured to capture and encode at least a mono audio and optionally configured to spatially capture and encode the audio environment.
In the following examples each room is provided with the means to spatially capture, encode spatial audio signals, receive spatial audio signals and render these to a suitable listener. It would be understood that there may be other embodiments where the system comprises some apparatus configured to only capture and encode audio signals (in other words the apparatus is a 'transmit' only apparatus), and other apparatus configured to only receive and render audio signals (in other words the apparatus is a 'receive' only apparatus). In such embodiments the system within which embodiments may be implemented may comprise apparatus with varying abilities to capture/render audio signals.
The teleconference apparatus (for each site or room) 110, 120 can be configured to call into a teleconference controlled by and implemented over a server 111.
In some embodiments the communications or teleconferencing system comprises a (peer-to-peer) communications system (rather than the server-based system shown in Figure 1) within which some embodiments can be implemented. Thus, for example, two or more UEs can be configured to interact directly with each other (for example to implement an immersive audio phone call between users). In such a scenario one of the UEs can be configured to deliver spatial ambience as one stream and employ a close-up microphone (for example a Lavalier microphone) to capture the speech as an audio object or audio source. The sender UE can be configured to encode the spatial ambience audio signals in a MASA format stream and the close-up microphone audio signal as an object format stream. The two audio streams can then be delivered as separate IVAS streams.
The sender UE, in addition, can be configured to encode processing information during the encoding to deliver the PI frames together with the IVAS frames to the receiver UE.
The teleconference apparatus can be configured to spatially capture and encode the audio environment and furthermore can be configured to render a spatial audio signal to the room. In this example only the communications or signalling path from the Room A 100 to the Room B 102 is shown for simplicity but a duplex or multipoint communication system comprising multiple signalling paths can be implemented using the methods as described herein without significant inventive input.
The teleconference apparatus (for each site or room) 110, 120 is further configured to communicate with each other to implement a teleconference function.
As shown in Figure 1, the apparatus 110, 120 and server 111 can comprise suitable encoder and decoder functionality. For example, the apparatus 110 is shown comprising an (IVAS) encoder 101, the server 111 is shown comprising an (IVAS) decoder and encoder 121 and the apparatus 120 is shown comprising an (IVAS) decoder 131. In such a manner an object 120 (the audio signals representing the user or talker 1 111) can be encoded by the encoder 101 which generates a bitstream 106 to be passed to a server 111. The server 111 can then decode (optionally mix with other objects and otherwise process the audio signals) and re-encode the audio signals to generate the bitstream 108 to be passed to the apparatus 120. The apparatus 120 can then decode the audio signals and present them to the user or talker 'Talker RX' 141.
Although this example shows a teleconference application the encoder/decoder functionality can be applied to the streaming of any suitable media.
The IVAS decoder/renderer for each of the teleconference apparatus 102 can be furthermore configured to handle multiple input streams that may each originate from a different encoder.
As discussed previously, RTP is intended for end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP is furthermore designed to carry a multitude of multimedia formats, which permits the transport of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may therefore require profile and payload format specifications.
The profile is configured to define the codec used to encode the payload data and the mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header.
For example, the RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using Session Description Protocol (SDP). The latter mechanism is used for newer video codecs such as the RTP payload format for H.264 Video defined in RFC 6184 or the RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798.
An RTP session can be established for each multimedia stream. Audio and video streams may be implemented which use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification can furthermore be configured to recommend port numbers for RTP, and furthermore to recommend the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.
Each RTP stream can comprise RTP packets, and the RTP packet in turn can comprise a RTP header and payload pair.
Enhanced Voice Services (EVS) is a mono voice codec standardized in 3GPP and described in the TS 26.445 specification document. The codec has two operating modes: EVS Primary and EVS AMR-WB IO (Adaptive Multi Rate Wideband Inter-Operable).
The IVAS codec can be considered an extension to the EVS codec, and as such the IVAS and EVS codecs can have some similarities in terms of design and implementation. The RTP payload structure, while not having been specified for the IVAS codec, is envisioned to have similarities to the RTP payload structure of EVS.
The RTP payload format of EVS is described in 3GPP TS 26.445 Annex A. In EVS, the RTP payload format is divided into two different formats: a Compact format and a Header-Full format. In the EVS Compact payload format, an RTP packet includes only a single EVS speech frame for EVS Primary mode. For EVS AMR-WB IO mode, the compact RTP packet also includes a 3-bit Codec Mode Request (CMR) field in front of the speech frame. In the EVS Compact format, the different modes and bitrates for the speech frames are identified by the size of the RTP payload. For example, an RTP packet of size 328 bits is assigned for EVS Primary mode with 16.4 kbps bitrate, as shown in Table A.1 in TS 26.445 Annex A. In the EVS Header-Full format, the RTP payload consists of the speech frame(s) accompanied by an optional CMR byte and Table of Content (ToC) bytes. The CMR byte is used by the receiver to request a change in the bitrate or coding mode it wants to receive; the request is sent as part of the Header-Full EVS packet. In the EVS AMR-WB IO Compact format, the CMR functionality is also present as 3-bit signalling at the beginning of the packet.
The ToC bytes in EVS Header-Full format describe the mode and bitrate for the accompanied EVS coded speech.
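To make the size-based identification concrete, the following Python sketch maps a Compact payload size to a mode and bitrate. Only the 328-bit/16.4 kbps entry is stated above; the other sizes are derived from bitrate multiplied by the 20-ms frame duration and stand in for Table A.1 rather than reproduce it.

```python
# Illustrative lookup of the EVS Compact format by RTP payload size.
# A Compact payload carries exactly one 20-ms frame, so size_bits = kbps * 20.
EVS_PRIMARY_KBPS = [7.2, 8.0, 9.6, 13.2, 16.4, 24.4, 32.0, 48.0, 64.0, 96.0, 128.0]

SIZE_TO_MODE = {round(kbps * 20): ("EVS Primary", kbps) for kbps in EVS_PRIMARY_KBPS}

def identify_compact_payload(payload_bits: int):
    """Return (mode, bitrate_kbps) for a Compact payload size, or None."""
    return SIZE_TO_MODE.get(payload_bits)

# The 328-bit payload size identifies EVS Primary at 16.4 kbps.
assert identify_compact_payload(328) == ("EVS Primary", 16.4)
```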
One technique employed in audio encoding and in various speech processing algorithms is Voice Activity Detection (VAD), also known as speech activity detection or, more generally, signal activity detection. For example, VAD can be employed in speech codecs for detecting the presence or absence of human speech. It can be generalized to detection of an active signal, i.e., a sound source other than background noise. Based on a VAD decision, it is possible to utilize, e.g., a certain encoding mode in a speech encoder, e.g., active signal encoding instead of background noise encoding.
Discontinuous Transmission (DTX) is a technique utilizing VAD that is intended to temporarily shut off parts of active signal processing (such as speech coding according to certain modes) and the frame-by-frame transmission of encoded audio (e.g., transmission of RTP packets). Instead of normal encoded frames, simplified update frames can be sent to drive a CNG at the decoder. Typically, this is done at a slower update rate, e.g., only every Nth frame is updated/transmitted.
Otherwise, no data is transmitted. The use of DTX can help with reducing interference and/or preserving/reallocating capacity in a mobile network. It can also help with battery life of the device. Consequently, the transmission of simplified update frames (to drive a CNG at the decoder) can be used to detect DTX operation.
Comfort Noise Generation (CNG) is a technique for creating a synthetic background noise to fill silence periods that would otherwise be observed, e.g., under DTX operation. Complete silence can be confusing or annoying to a receiving user. For example, the listener could judge that the transmission may have been lost and then unnecessarily say "hello, are you still there?" to confirm, or simply hang up. On the other hand, sudden changes in sound level (from total silence to active background and speech or vice versa) could also be very annoying. Thus, CNG is applied. Typically, this is based on a highly simplified transmission of noise parameters (e.g., spectral shape) derived from the real captured background noise.
Silence Descriptor (SID) frames are sent during speech inactivity to keep the receiver CNG sufficiently aligned with the background noise level and spectral shape at the sender side. This is of particular importance at the onset of each new talk spurt. Thus, SID frames should not be too old when a new talk spurt starts.
Commonly SID frames are sent regularly, e.g., every 8th frame, but a codec may also allow variable-rate SID updates. SID frames are typically quite small, e.g., a 2.4 kbps SID bitrate equals 48 bits per frame for the typical 20-ms frame size used in modern communications codecs (AMR, AMR-WB, ITU-T G.718, EVS, and IVAS).
The IVAS SID bitrate is 5.2 kbps (104 bits per 20-ms frame), and the default SID frame interval in IVAS is 8 frames. Other update intervals are furthermore supported. IVAS includes EVS, and in EVS Primary mode operation the 48-bit SID is kept.
In an IVAS session, processing information (PI) data frames are associated with the corresponding audio frames. The processing information can also be referred to as rendering metadata, PI data, extension metadata or any suitable term denoting data carried in addition to the IVAS audio data. The PI frames do not increase the media time of the packet, and the timestamp of a PI frame is the same as the timestamp of an associated audio frame. In an IVAS session, there can be DTX periods where no audio frames are transmitted. During a DTX period new PI data might however be available. In this case, the audio frames that should be associated with the new PI data frames are missing (as there would be no transmission). Therefore, the PI frames cannot be transmitted, as they are not associated with any timestamp, and the new PI frames are lost. Consequently, it is to be specified how to handle PI frames during a DTX period where no audio is available. This can be even more critical in the case of PI frame information carried in the reverse direction for split rendering purposes. Furthermore, there is currently no known method to transmit information when RTP streams are not active to transmit any PI frame information. Furthermore, SID frames during the DTX period progress the timeline (similar to regular speech or audio frames of IVAS); however, the DTX period with its low-bitrate SID frames is a low-bitrate period as far as transmission bitrate is concerned. The SID frame transmission frequency is lower than that of the usual IVAS frames (in order to reduce bitrate), which increases the temporal gap between the different PI data which may be associated with the SID frames. Consequently, handling of PI data updates during the DTX period needs special consideration and handling.
Thus, as discussed in further detail in the following embodiments, the concept relates to apparatus and methods for handling PI frames during a DTX period within an immersive conversational audio session. This can, for example, be implemented by apparatus and methods that associate a PI frame with a previously transmitted audio data frame (prior to the start of DTX operation) or with a previously transmitted data frame such as a SID frame (during the DTX operation), or that delay the transmission of a PI frame until after the end of the DTX period, in order to enable transmission and handling of PI frames during a DTX period in an immersive conversational audio session.
These embodiments therefore achieve the ability to update the metadata information carried in the PI frames during the DTX period, or to provide the information covering the DTX period after its end. In other embodiments, the update rate might be reduced during the DTX period, with the full-frequency update rate resuming after the DTX period.
In other words, the embodiments have the benefit of either providing updates during the DTX period (with update rates similar to the pre-DTX period or with reduced update rates) or providing total radio silence during the DTX period and providing the information after the end of the DTX period. The radio silence during the DTX period may be motivated by power saving, especially if there are no other media streams that need to continue (e.g., a video stream in an audio-video phone call).
In some embodiments, the PI frame transmission during DTX period as well as the mode of transmission during DTX period is negotiated during session setup. This applies to PI frames carrying data in the forward direction (e.g., orientation of the scene or audio objects in the IVAS frames) as well as PI frames carrying data in the reverse direction (e.g., playback device orientation). For scenarios such as split rendering, reverse direction PI frame information may be important since playback device orientation for split rendering (receiver UE2 120 to sender UE1 110) is independent of the DTX mode selection for the audio stream encoding/transmission (from sender UE1 110 to receiver UE2 120). Furthermore, the session negotiation can also include the aspect of transmission of PI data during DTX together with the SID frames.
In some embodiments, the PI frame transmission during DTX period can be performed either via the RTP stream or via RTCP stream.
In some embodiments, the updating of the metadata information carried in the PI frames during the DTX period can be summarized as follows (a minimal sketch follows the receiver steps below):
Sender:
* PI frame transmission during DTX operation is negotiated.
* Encoder determines start of DTX period, DTX is now active or on.
* The last transmitted audio frame prior to the start of the DTX period is referred to or designated "frame0".
* Obtain a new PI frame (PI_new).
* Associate the new PI frame to frame0 by using the timestamp of frame0 for the new PI frame.
* Transmit the new PI frame to the receiver without any audio frames (or with, e.g., NO_DATA audio frames).
Receiver:
* Receive an RTP packet containing a PI data frame.
* Detect that the timestamp of the packet is the same as a previously received RTP packet that contained audio (or the last audio frame prior to the start of DTX operation).
* Take the newly received PI data into account in the processing.
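A minimal Python sketch of this timestamp-reuse behaviour is given below. The class and field names (RtpPacket, DtxPiSender, and so on) are hypothetical, and the RTP machinery is reduced to plain integers and lists.

```python
# Hypothetical model of the sender and receiver steps above.
from dataclasses import dataclass, field

@dataclass
class RtpPacket:
    timestamp: int                           # RTP timestamp (16 kHz clock units)
    audio_frames: list = field(default_factory=list)
    pi_frames: list = field(default_factory=list)

class DtxPiSender:
    def __init__(self):
        self.frame0_timestamp = None         # timestamp of "frame0"
        self.dtx_active = False

    def send_audio(self, frame, timestamp):
        self.frame0_timestamp = timestamp
        return RtpPacket(timestamp, audio_frames=[frame])

    def start_dtx(self):
        self.dtx_active = True               # encoder detected speech inactivity

    def send_pi_during_dtx(self, pi_frame):
        assert self.dtx_active
        # Associate PI_new with frame0 by reusing its timestamp; transmit
        # without audio (or with, e.g., a NO_DATA frame).
        return RtpPacket(self.frame0_timestamp, pi_frames=[pi_frame])

class DtxPiReceiver:
    def __init__(self, apply_pi):
        self.apply_pi = apply_pi             # hook into the processing pipeline
        self.last_audio_timestamp = None

    def receive(self, pkt: RtpPacket):
        if pkt.audio_frames:
            self.last_audio_timestamp = pkt.timestamp
        elif pkt.timestamp == self.last_audio_timestamp:
            # Timestamp repeats the last audio packet: a DTX-period PI update.
            for pi in pkt.pi_frames:
                self.apply_pi(pi)

# Usage: the PI-only packet reuses frame0's timestamp.
sender, receiver = DtxPiSender(), DtxPiReceiver(apply_pi=print)
receiver.receive(sender.send_audio("IVAS frame 0", timestamp=320))
sender.start_dtx()
receiver.receive(sender.send_pi_during_dtx("PI_new"))   # prints PI_new
```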
In some embodiments, the provision of the information covering the DTX period after its end can be summarized as follows (see the sketch after the receiver steps below):
Sender:
* PI frame transmission during DTX operation is negotiated.
* Encoder determines start of DTX period, DTX is now active or on.
* The last transmitted audio frame is referred to or designated "frame0".
* Obtain a new PI frame (PI_new).
* Delay the transmission of the new PI frame until the DTX period is over.
* When the DTX is deactivated or off, in other words the DTX period is over, associate PI_new with the first audio frame that is transmitted after the DTX period.
* Transmit PI_new to the receiver with audio frame(s).
Receiver:
* Receive a packet with multiple PI data frames of the same type (that originated during the DTX period and were delayed to be transmitted after the DTX period) and audio frame(s).
* Detect the most recent PI data from the received PI data frames of the same type and use that in the processing. In an embodiment, if more than one PI data is available, then the data can be combined in a suitable manner before taking it into use.
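The delay-based handling can be sketched as follows; again all names are illustrative, and PI frames are modelled as plain dictionaries with a "type" key.

```python
# Sketch of the delay-based handling: PI frames produced during DTX are
# buffered and sent with the first post-DTX audio frame; the receiver keeps
# only the newest PI frame of each type.

class DelayedPiSender:
    def __init__(self):
        self.dtx_active = False
        self.delayed = []                      # PI frames held during DTX

    def start_dtx(self):
        self.dtx_active = True

    def new_pi(self, pi_frame):
        if self.dtx_active:
            self.delayed.append(pi_frame)      # delay until DTX is over
            return None
        return {"pi": [pi_frame]}              # normally sent with audio

    def end_dtx(self, first_audio_frame, timestamp):
        # Associate all delayed PI frames with the first post-DTX audio frame.
        packet = {"timestamp": timestamp,
                  "audio": [first_audio_frame],
                  "pi": list(self.delayed)}
        self.delayed.clear()
        self.dtx_active = False
        return packet

def most_recent_per_type(pi_frames):
    """Receiver-side helper: keep the newest PI frame of each type."""
    latest = {}
    for pi in pi_frames:                       # frames are in generation order
        latest[pi["type"]] = pi
    return list(latest.values())

sender = DelayedPiSender()
sender.start_dtx()
sender.new_pi({"type": "orientation", "yaw": 10})
sender.new_pi({"type": "orientation", "yaw": 20})
pkt = sender.end_dtx("IVAS frame N", timestamp=64000)
assert most_recent_per_type(pkt["pi"]) == [{"type": "orientation", "yaw": 20}]
```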
The Header-Full IVAS frames (or IVAS frames as RTP payload with a payload header described as ToC) comprise one or several Table-of-Content (ToC) indicators and their associated IVAS speech/audio frames.
With respect to Figure 2 is shown an example Table-of-Content (ToC) byte structure 201 for an IVAS frame.
The (F) bit 203 indicates whether more frames follow this entry. The F (1 bit) 203: If set to 1, the bit indicates that the corresponding frame is followed by another speech or PI frame in this payload, implying that another ToC byte follows this entry. If set to 0, the bit indicates that this frame is the last frame in this payload and no further header entry follows this entry.
The Bitrate Level Indication (BLI) 207 describes the content of the associated frame, e.g., the bitrate of an IVAS frame. The Supplemental Signaling Bits (SSB) 205 can be used to describe various things, for example IVAS input format specific signaling. The SSB 205 can in some embodiments be referred to as extra bits.
The BLI (4-5 bits) 207 can, for example, be configured to indicate the bitrate or other frame content indication for the frame. From the content indication, the receiver can determine the size of the received frame either directly from the bitrate or from pre-determined frame sizes, e.g., for the SPEECH_LOST and NO_DATA frames. The BLI (or in some implementations the Frame Type or FT) bits can indicate, for example, the bitrate of an IVAS speech frame, a SPEECH_LOST, NO_DATA or comfort noise (SID) frames.
An example of BLI bit values (for 5 bits) is presented in the following table. At this point of the codec development, it is not decided if the aforementioned three types (SPEECH_LOST, NO_DATA, SID) are supported in the final codec. The SPEECH_LOST, NO_DATA and SID frames are part of the EVS codec, and it is likely that at least the SID and NO_DATA frame types are being incorporated into the IVAS specification. It is understood that the final IVAS codec specification, however, might have different frame types present than presented here.
BLI bits       Bitrate (kbps) or other frame content indication
00000          13.2
00001          16.4
00010          24.4
00011          32
00100          48
00101          64
00110          80
00111          96
01000          128
01001          160
01010          192
01011          256
01100          384
01101          512
01110          SPEECH_LOST
01111          NO_DATA
10000          SID (5.2)
10001-11111    (reserved for future use)

In some embodiments, 4 bits are reserved for the BLI part. The frame type bit values and content indications when using 4 bits are presented below. In these embodiments the BLI part identifies the bitrate where the BLI bits take values other than a defined value (for example 1111); other aspects, for example the SPEECH_LOST, NO_DATA and SID frames, can be identified with the combination of the BLI indicator OTHER (the defined value) and the extra bits.
BLI bits    Bitrate (kbps) or other frame content indication
0000        13.2
0001        16.4
0010        24.4
0011        32
0100        48
0101        64
0110        80
0111        96
1000        128
1001        160
1010        192
1011        256
1100        384
1101        512
1110        Reserved for future use
1111        OTHER

Extra + BLI bits    Frame content indication
000 + 1111          SPEECH_LOST
001 + 1111          NO_DATA
010 + 1111          SID (2.4)
011 + 1111          Reserved for future use
100 + 1111          Reserved for future use
101 + 1111          Reserved for future use
110 + 1111          Reserved for future use
111 + 1111          Reserved for future use

As shown above, there are 15 available bit allocations for the BLI bits reserved for future use (bit allocations 10001-11111) when 5 bits are reserved for the BLI indicator. In some embodiments when 4 bits are reserved for the BLI indicator, there are 5 available bit allocations for future use when BLI value 1111 is used. BLI value 1110 provides an additional 8 bit allocations for future use when combined with the extra bits. These available bit allocations can be used in the future to indicate frame contents that will be defined, such as the PI frames described hereafter.
These bit allocation values are examples only and it would be appreciated that in some embodiments the bit allocation values can be otherwise configured.
The number of bits in the SSB 205 can vary between 2 and 3 depending on how many bits are reserved for the BLI bits 207 in the IVAS ToC byte 201. The SSB (2-3 bits) 205 can be bits reserved for future use. If 4 bits are reserved for the BLI indicator, the extra 3 bits can be partly used to identify frame content other than bitrates/frame sizes (e.g., SPEECH_LOST, NO_DATA, SID) as demonstrated in further detail in GB2313472.9, where the BLI bits are referred to as FT bits (frame type index) and the SSB as extra bits.
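As a concrete illustration of the ToC byte of Figure 2, the following sketch packs and parses the three fields, assuming the 5-bit BLI variant with 2 SSB bits and a most-significant-first ordering (F, then SSB, then BLI). The exact bit positions are an assumption for illustration, not taken from a specification.

```python
# Hypothetical bit layout: [F:1][SSB:2][BLI:5], F in the most significant bit.

def pack_toc(f_bit: int, ssb: int, bli: int) -> int:
    assert f_bit in (0, 1) and 0 <= ssb < 4 and 0 <= bli < 32
    return (f_bit << 7) | (ssb << 5) | bli

def parse_toc(byte: int) -> dict:
    return {"F": (byte >> 7) & 0x1,      # 1 = another frame follows this entry
            "SSB": (byte >> 5) & 0x3,    # supplemental signaling bits
            "BLI": byte & 0x1F}          # bitrate level indication

toc = pack_toc(f_bit=1, ssb=0, bli=0b10000)   # SID frame, more frames follow
assert parse_toc(toc) == {"F": 1, "SSB": 0, "BLI": 0b10000}
```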
Annex A of 3GPP TS 26.253, v1.1.0 describes the most up-to-date design for the IVAS ToC byte. Figure 3 shows an example IVAS ToC byte structure according to the current design, where the ToC byte from EVS is extended to cover IVAS bitrates. In summary, the ToC byte consists of an H-bit 301 (1 bit, set to 0 for a ToC byte) which separates the ToC byte from the E-byte (which is further described below), an F-bit (1 bit) 303 similar to the F-bit mentioned above, and ToC data 305 comprising a 2-bit mode field 307 differentiating between IVAS, EVS and AMR-WB IO modes and a 4-bit field 309 indicating the bitrate.
In order to enable rendering or consumption of the IVAS audio bitstream data, additional (non-audio) data can be added to streamed IVAS RTP packets.
Specifically, there can be inserted data that requires maintaining a sufficient or exact alignment with the IVAS audio bitstream data. The alignment can be considered, for example, relative to an IVAS audio frame. For example, any external orientations from the sender UE side could be included in these (audio) processing information (PI) frames.
An RTP payload with respect to these embodiments can comprise one or more of the following: an immersive audio coded bitstream; an immersive audio coded bitstream payload header, which can be a ToC header byte comprising supplemental signaling bits or a frame type indication, bitrate level indication bits, etc.; and a processing information frame.
Packetization furthermore with respect to these embodiments can be a process of including the immersive audio coded bitstream header, and at least one of the immersive audio coded bitstream or processing information frame as RTP payload in order to deliver the immersive audio coded bitstream over real-time transport protocol. In some embodiments, the packetization can also be performed for inclusion and carriage of control information as payload of the real-time transport control protocol.
An example PI frame structure is shown with respect to Figure 4.
The example PI frame structure 401 comprises a "format" field 403. The format field 403 is configured to describe the type of the PI data, for example, orientation data.
The example PI frame structure 401 can further comprise a "usage" field 405. The usage field 405 is configured to describe how the PI data should be used. For example, the usage field 405 is configured to define whether the data should be applied to the next IVAS frame, to all frames in the RTP packet, or if the data is more general information to be sent to the receiver. In other words, the usage field can describe that the PI data should not be taken into account in the rendering, i.e., that the data is more general information sent to the receiver.
Additionally the example PI frame structure 401 can further comprise a "validity" field 407. The validity field 407 in some embodiments describes how long the PI data is valid at the receiver end. For example, the field could describe that for X amount of processing frames the receiver should apply the PI data described in the frame, and if no new PI data is received within X frames, stop applying the data. The validity could also be indicated in other time formats than number of processing frames, e.g., in milliseconds. Furthermore, some audio frame values may be "hold" whereas other values may be "instantaneous" only. This enables the renderer to perform rendering accordingly.
Furthermore the example PI frame structure 401 can comprise an optional "size" field 409. The size field 409 in some embodiments is configured to define the size of the PI data 411. In some embodiments to minimize the number of additional bits introduced by the PI frames, the maximum size of the PI frames can be restricted. For example, the frames can be restricted to have a maximum size of K bits that is common for all IVAS bitrates. Alternatively, the maximum size of the PI frames could be linked to the used bitrate of the IVAS speech frames, e.g., by having the maximum number of bits for PI frames be some percentage share of the bits used for the speech frames. In addition, the use of PI frames could be restricted to be used only in the highest IVAS bitrates to reduce the risk of adding too much load to the sent packets. In some embodiments the "usage" and "validity" fields could be also combined into a single field, for example, to a "scope" field. The "scope" field would indicate both how the data should be applied and how long the data is valid.
In another embodiment, the PI data carried is metadata carried in addition to the audio data (e.g., IVAS speech or audio frames and SID frames) as part of the RTP stream via indication that this data supplements the IVAS frames. The PI data may also carry indication of whether it is applied in the forward direction (e.g., receiver of the RTP stream utilizes the PI data for rendering) or the reverse direction (e.g., receiver of the RTP stream utilizes the PI data for encoding).
Additionally the PI frame structure comprises the PI data 411.
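A possible serialization of this structure is sketched below; the one-byte width of each header field is an assumption, since the text above defines only the roles of the fields.

```python
# Illustrative (de)serialization of the PI frame of Figure 4. One byte per
# header field is an assumption; the patent text defines only the field roles.
from dataclasses import dataclass

@dataclass
class PiFrame:
    pi_format: int    # "format" field 403: type of PI data, e.g., orientation
    usage: int        # "usage" field 405: how the PI data should be applied
    validity: int     # "validity" field 407: how long the data remains valid
    data: bytes       # PI data 411

    def encode(self) -> bytes:
        # The optional "size" field 409 is encoded here as len(data).
        return bytes([self.pi_format, self.usage, self.validity,
                      len(self.data)]) + self.data

    @classmethod
    def decode(cls, buf: bytes) -> "PiFrame":
        size = buf[3]
        return cls(buf[0], buf[1], buf[2], buf[4:4 + size])

pi = PiFrame(pi_format=1, usage=0, validity=8, data=b"\x01\x02")
assert PiFrame.decode(pi.encode()) == pi
```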
Figure 5 shows an example Extra byte (E-byte) structure as defined by the IVAS RTP payload design (described in Annex A of TS 26.253, v1.1.0). The payload header bytes in an IVAS session are divided into two categories based on the first (H) bit 501. H=0 indicates that the header byte is a ToC byte describing the bitrate for an associated audio frame. H=1 indicates that the header byte is an extra information (E) byte which contains or identifies additional information related to the session (e.g., CMR or PI frames). Furthermore the extra byte structure comprises the E-data 505 or extra data or extra information related to the session.
Figure 6 shows an example full structure for an IVAS RTP packet 600. An RTP header (with possible RTP header extension) 601 precedes the IVAS payload 602. The IVAS payload 602 consists of the IVAS payload header 603 and data frames (PI or IVAS) 605. The IVAS payload header 603 identifies the different data frames (IVAS or PI) in the payload (through ToC or E-bytes) and contains optional information, e.g., CMR through E-bytes.
In normal operation mode in an IVAS session, PI frames are tied/linked/associated to the timestamp of the associated audio frames (IVAS or EVS in mono mode). That is, a PI frame does not increase the media time of the packet. The RTP timestamp defines the sampling instant (media time) of the first sample of the first audio frame in an RTP packet.
In DTX (discontinuous transmission) operation mode, SID frames can be transmitted at a given SID update interval (for example, 8 frames interval or about six SID frames per second). The media timeline progresses with the SID frames, however, the update rate is lower, because typically the SID frames are transmitted at a lower frame rate to reduce bitrate consumption.
In some situations, relevant PI data frames could still be available for transmission. For example, the updated orientation values of the capturing device could still be available for transmission. These values can then be received at suitable intervals from a tracker that does not depend on the input audio in any way. As mentioned above, the PI frames are tied/linked/associated to the timestamp of associated audio frames, however in DTX, these audio frames are not available, rather only lower frame rate SID frames may be available.
The examples below describe DTX during an IVAS session. The examples indicate transmitted data frames as IVAS frames; however, in an IVAS session these could also be audio frames coded by the EVS codec, SID frames, or any data frames that can be associated with a sampling instance. Additionally, the PI related DTX embodiments described herein can also be applied to other (future) codecs where similar (processing) information is transmitted or required to be transmitted when an associated bitstream or frames are paused or not transmitted. Figure 7, for example, shows an example in which, as shown by 701, there is a regular transmission of IVAS frames 703. In other words, in active operation IVAS frames are transmitted continuously.
In DTX operation, as shown by 711, there can be IVAS frame transmission, such as shown by IVAS frame 0 713, a time where no IVAS audio/speech frames are transmitted (only SID frames are transmitted at their update interval, which can be, e.g., 8 frames or also substantially longer) 715, and then a restoration of IVAS frame transmission as shown by IVAS frame N 717. The figure does not show the transmission of SID frames, for simplicity of illustration.
Figure 8 shows the effect of DTX operation such as shown in Figure 7 on associated PI frames. Thus, Figure 8 shows the frame timeline 711 with IVAS frame transmission, such as shown by IVAS frame 0 713, a time where no IVAS audio/speech frames are transmitted 715, and then a restoration of IVAS frame transmission as shown by IVAS frame N 717. The figure does not show the transmission of SID frames, for simplicity of illustration.
Additionally shown in Figure 8 is the PI frame timeline 801, where there is PI frame 0 803 associated with IVAS frame 0 713. The PI frame timeline 801 furthermore shows PI frame 1 805 and PI frame 2 807, which are not associated with any IVAS frame and occur during the DTX operation time. Following the restoration of transmission of IVAS frames (or the deactivation of the DTX operation) the PI timeline 801 shows a PI frame N 809 which is associated with IVAS frame N 717. There is thus shown a presence of PI frames during the DTX period, in which there are PI frames available but the associated IVAS frames are missing.
In such an example an RTP timestamp defines the sampling instant (media time) of the first sample of the first IVAS frame in an RTP packet. The duration of one IVAS frame is 20 ms (though other examples may implement other frame durations). Thus, the media time is increased for each successive IVAS frame of an RTP packet by 320 units (when the timestamp clock frequency is set to 16 kHz). The PI frames in the IVAS RTP payload have the media time assigned that is defined by the RTP timestamp and the media time increment of the IVAS frames in the payload.
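This timestamp arithmetic can be written out directly; the sketch below simply encodes the 16 kHz clock and 20-ms frame duration stated above.

```python
# With a 16 kHz RTP clock, a 20-ms frame advances the media time by
# 16000 * 0.020 = 320 units. PI frames take the timestamp of their
# associated audio frame and add no media time of their own.
RTP_CLOCK_HZ = 16_000
FRAME_MS = 20

def timestamp_increment(clock_hz: int = RTP_CLOCK_HZ,
                        frame_ms: int = FRAME_MS) -> int:
    return clock_hz * frame_ms // 1000

def audio_frame_timestamp(packet_ts: int, frame_index: int) -> int:
    """RTP timestamp of the frame_index-th audio frame within a packet."""
    return packet_ts + frame_index * timestamp_increment()

assert timestamp_increment() == 320
assert audio_frame_timestamp(1000, 2) == 1640
```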
In some embodiments, PI frame transmission is continued within a PI frame hangover period or extension period. Thus, following an activation or start of a DTX period, a 'hangover period' or 'extension period' is defined for PI frames. The hangover period is a time period or duration during the DTX operation where any new PI frames are associated with (or tied or linked to) the last transmitted IVAS frame.
In such embodiments the timestamps of the PI frames are set using the timestamp of the last transmitted IVAS frame.
This is shown for example in Figure 9 which shows the IVAS frame timeline 711 as described above.
Additionally is shown a modified PI frame timeline 901, which is based on the PI frame timeline 801 and modified according to the above hangover or extension period as discussed above.
Thus the PI frame 0 803 is associated 902 with the IVAS frame 0 713 in the manner described and shown with respect to Figure 9.
However, following the activation of DTX, a PI DTX hangover period 905 is initiated (starting at the beginning of the DTX period) and continues for a defined time.
During the hangover period if there is a new PI frame created (PI frame 1 805), then the newly created PI frame (PI frame 1 805) is associated 904 with the last transmitted IVAS frame (IVAS frame 0 713).
The duration of the hangover period can be pre-defined with a static value, in terms of frames or time. In some embodiments, the duration of the hangover period can depend on the PI frame type. For example, an orientation PI frame type can have a first defined hangover period while a different PI frame type has a second hangover period.
In some embodiments the duration of the hangover period could be set dynamically per session. This can be managed, for example, through a suitable SDP based negotiation.
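A toy lookup combining these options (a static default, per-PI-type overrides and a per-session negotiated value) might look as follows; the type names and durations are hypothetical.

```python
# Hypothetical hangover-period lookup, in audio frames (20 ms each).
DEFAULT_HANGOVER_FRAMES = 1                    # e.g., one IVAS frame interval
HANGOVER_FRAMES_BY_TYPE = {"orientation": 5, "labelling": 2}

def hangover_frames(pi_type, negotiated=None):
    """Hangover duration for a PI type; a per-session SDP value wins."""
    if negotiated is not None:                 # set dynamically per session
        return negotiated
    return HANGOVER_FRAMES_BY_TYPE.get(pi_type, DEFAULT_HANGOVER_FRAMES)

assert hangover_frames("orientation") == 5
assert hangover_frames("orientation", negotiated=0) == 0   # hangover disabled
```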
In some embodiments, following the end of the hangover period but while still during DTX operation, any new PI frames which are generated are delayed and not transmitted until the DTX period ends and active transmission resumes.
Furthermore, following the end of the DTX period, at least one of the delayed PI frames (all or some) is associated with (or tied to or linked to) the first transmitted IVAS frame following the resumption of active transmission of IVAS frames.
In some embodiments, the delayed PI frames are transmitted with the first transmitted IVAS frame after the end of the DTX period. Furthermore, during the delay period, there might be multiple new PI frames for a certain PI frame type. For example, a device orientation might be updated several times during the delay period which would be indicated with several PI frames with the same PI frame type (for both forward direction PI frames as well as reverse direction PI frames such as in case of split rendering).
In some embodiments, all of the device orientation PI frames are transmitted to the receiver, in which case the receiver might choose to use only the most up-to-date device orientation in the processing and discard the other device orientation information.
In some other embodiments, the sender can choose to send or select only the latest device orientation if there are multiple device orientations available after the delay period.
In other words, and more generically, for a specific PI frame type, where there is more than one PI frame available, one PI frame is selected to be transmitted when the DTX operation is stopped and IVAS frames are transmitted. In some embodiments the selection is a sub-set (more than one but not all) of the PI frames generated during the DTX period but after the hangover period.
The selection can be the latest or last (latest generated) PI frame or frames, the first or initial PI frame or frames, or some other determined selection (for example, a sampling of PI frames which may be uniformly distributed or have some sampling bias).
This is shown, for example, with respect to Figure 9, where the PI frame 2 807, which is generated during the DTX period 715 but after the PI frame hangover period 905, is delayed and transmitted as PI frame 2 907 and associated 906 with IVAS frame N 717 (as is PI frame N 809, which is generated after the DTX period ends).
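The selection options above can be sketched as a small helper; the policy names are illustrative only.

```python
# Sketch of selection policies for delayed PI frames of one type: the
# latest frame, the first frame, or a uniform sub-sampling.
def select_delayed(pi_frames, policy="latest", max_frames=2):
    if not pi_frames:
        return []
    if policy == "latest":
        return [pi_frames[-1]]
    if policy == "first":
        return [pi_frames[0]]
    if policy == "subsample":
        # keep up to max_frames frames spread over the DTX period
        step = max(1, len(pi_frames) // max_frames)
        return pi_frames[::step][:max_frames]
    raise ValueError(f"unknown policy {policy!r}")

frames = ["pi1", "pi2", "pi3", "pi4"]
assert select_delayed(frames) == ["pi4"]
assert select_delayed(frames, "subsample") == ["pi1", "pi3"]
```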
In some embodiments, rather than selecting one or more PI frames to be transmitted after the DTX period, values associated with the PI frames (in other words, PI data from the delayed PI frames) are used to generate a combined PI frame value which is transmitted as a combined PI frame after the DTX period. For example, all or some of the delayed PI frames of the same type could be combined in some manner.
Thus, for device orientation based PI data, a set of delayed device orientations could be combined to create a smooth orientation shift from a previously received orientation (for example, data from the last PI frame in the PI frame hangover period) to a most-recently generated orientation (the device orientation generated following the end of the DTX period). This could be achieved, for example, by interpolating from the currently applied device orientation to the most recent orientation. If the most recent device orientation value were applied immediately, the orientation shift may be too abrupt and might cause artifacts in the perceived audio. The combination process could be handled at the sender or at the receiver.
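A minimal sketch of the interpolation idea follows, using linear interpolation of yaw/pitch/roll angles with wrap-around handling. A production renderer would more likely interpolate quaternions (e.g., slerp); this is only an illustration of the principle.

```python
# Smooth a delayed orientation update over a few frames instead of jumping.
def interpolate_orientation(current, target, steps):
    """Yield `steps` intermediate (yaw, pitch, roll) tuples in degrees."""
    def shortest(a, b):                 # shortest signed angular difference
        return (b - a + 180.0) % 360.0 - 180.0
    for i in range(1, steps + 1):
        t = i / steps
        yield tuple(a + shortest(a, b) * t for a, b in zip(current, target))

# e.g., smooth a 40-degree yaw change over 4 frames (80 ms at 20 ms/frame)
path = list(interpolate_orientation((0.0, 0.0, 0.0), (40.0, 0.0, 0.0), 4))
assert path[-1] == (40.0, 0.0, 0.0)
```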
In some embodiments, when the receiver detects DTX transmission, the rendering processing is modified or altered (for example, CNG is generated for playback).
In some embodiments the receiver, during the hangover period, is configured to receive updated PI data from the sender. For example, updated orientation values may be transmitted to the receiver. These orientation values could be applied to the rendering by always selecting the latest available orientation. For example, there may be network congestion that results in multiple orientations arriving to the receiver simultaneously during DTX period. The receiver can then choose to select and use only the latest orientation for the rendering process. Alternatively, in some embodiments the received orientations can be interpolated for a smoother orientation processing in the rendering.
In some embodiments only the PI frame delays are employed (in other words an effective zero hangover period is employed). Furthermore in some embodiments only the hangover period is implemented and PI frames generated after the hangover period but before the end of the DTX period are discarded. In some embodiments both hangover period and delays are employed.
In some embodiments, any new PI data generated during a DTX period is transmitted with the SID frames. In this case the timestamp of the PI data shall be the SID frame's timestamp. In many cases, the SID frames and the PI data can be part of the same RTP payload.
For example, in a manner similar to the frame delay examples described above where the PI frames are delayed until an IVAS frame is transmitted, the PI frames during a DTX period could be delayed until a SID frame is transmitted. In such a manner all or some of the delayed PI frames could then be transmitted with the SID frame (or a combination of the values for the delayed PI frames is obtained or determined and the combined value or values are transmitted with the SID frame). The SID frames progress the timeline in the same way as the usual IVAS audio frames; thus the operation for PI data transmission is similar, but at a reduced frequency to align with the lower frame rate of SID frames.
In some further embodiments, an RTCP stream (or any suitable data channel or stream) could be used to transmit PI frames during DTX period. In general, RTCP data packets are transmitted less frequently than RTP packets, which makes RTP transmission for PI frames more suitable during active audio transmission (at least for the more time critical PI types). However, during DTX no audio is transmitted in the RTP stream, and the transmission of PI frames could be shifted to employ RTCP for the duration of DTX.
In some other embodiments, an immediate (or delayed) transmission of some PI frame data during a DTX period can be determined based on a PI type.
For example, if the PI frame indicates an input format change that is occurring at some time in the future (e.g., INPUT_FORMAT_CHANGE_FUTURE as described in UK patent application GB2317940.1), the information is time-critical for the receiver and should be transmitted as soon as possible. In this case, the PI frame could be transmitted in IVAS RTP packets together with NO_DATA (or SID) frames to the receiver.
In other words a type of the PI frame can, in some embodiments, trigger a forced transmission of the PI frame even during a DTX period. In such examples a timestamp of the RTP packet can be set to the most recent available (i.e., the timestamp of the last transmitted data frame). To signal the exact time when the future change is applied, the timestamp when to apply the change could, for example, be signaled as part of the transmitted data.
In some situations, it can be beneficial to transmit all the delayed PI frames to the receiver and not just the latest available. For example, if the receiver is recording the session, it can be beneficial for the receiver to have all the PI data for recording purposes, and for later playback.
In some embodiments a DTX PI capability or capacity can be explicitly negotiated for an IVAS session, e.g., through SDP. An SDP parameter (e.g., dtx-pi) could be used to indicate if any DTX specific processing is applied to PI frames during the session. The dtx-pi could accept values 0 (to disable) and 1 (to enable) any DTX specific processing for PI frames during the session. If dtx-pi is not present in the SDP negotiation, the DTX processing for PI frames could be enabled by default.
In addition, the dtx-pi parameter could, in some embodiments, be extended to cover receiving and sending direction aspects with dtx-pi-recv and dtx-pi-send, respectively. The individual parameters could be used to control the DTX operations for the two transmission directions independently.
Below is an example SDP offer describing PI data use during the session and during the DTX period.
m=audio 49152 RTP/AVP 96 97 98 99
a=rtpmap:96 IVAS/16000
a=fmtp:96 br=512; pi_data=[0-1]; pi_data_dtx=[0-8]
a=ptime:20
a=maxptime:240
Example SDP answer
m=audio 49152 RTP/AVP 96
a=rtpmap:96 IVAS/16000
a=fmtp:96 br=512; pi_data=1; pi_data_dtx=1
a=ptime:20
a=maxptime:240

The media format parameters pi_data and pi_data_dtx indicate the two additional parameters that control the use of PI data in the session and the use of PI data during the DTX period in the session.
pi_data 0 indicates no use of PI data in the session whereas 1 indicates the use of PI data in the session.
pi_data_dtx shall have one or more of the following possible values in the offer. The answer shall select a subset of the values.
In addition, the dtx-pi or pi_data_dtx parameter could be extended to cover the enabling or disabling of the specific DTX operations for PI frames. For example, the dtx-pi parameter could have the values below to describe the session:
* 0 to disable DTX specific processing for PI frames.
* 1 to enable DTX specific processing for PI frames; PI data shall be transmitted during the DTX period with association (i.e., using the timestamp) to the last audio frame prior to the start of the DTX period. The PI data transmission rate shall not be impacted by the DTX mode.
* 2 to enable only the hangover period DTX operation for PI frames; PI data shall be transmitted with association to the audio frames during the hangover period within the DTX period. The default hangover period can be 20 ms (or one IVAS frame interval), for example. The length of the hangover period can also be negotiated explicitly.
* 3 to enable only the delay based DTX operation for PI frames; PI data shall be transmitted after the end of the DTX period. The PI data shall be generated during the DTX period at the same rate as during the non-DTX period. The PI data shall be associated with the first frame after the end of the DTX period.
* 4 to enable DTX specific processing for PI frames AND to tie/link the DTX PI frames to SID frames.
* 5 to enable only the hangover period DTX operation for PI frames AND to tie/link the DTX PI frames to SID frames.
* 6 to enable only the delay based DTX operation for PI frames AND to tie/link the DTX PI frames to SID frames.
* 7 to only tie/link the DTX PI frames to SID frames; PI data shall be transmitted during the DTX period with association (i.e., using the timestamp) to the SID frames during the DTX period.
* 8 to transmit PI data via RTCP during the DTX period. This may be done for forward direction PI data (capture device orientation) as well as reverse direction PI data such as format change requests or playback device orientation.

In another embodiment, the above listed specific DTX operations for PI frames could be controlled with explicit SDP parameters, e.g., dtx-pi-enable-hangover, dtx-pi-enable-delay and dtx-pi-enable-sid-attaching.
In some embodiments the above parameters can take the values 0 (to disable) and 1 (to enable) the respective DTX operations for PI frames.
Additionally in some embodiments if the parameters are not present, the associated DTX operations could be disabled by default. Alternatively, the aforementioned SDP parameters could be replaced by their "disable" counterparts (e.g., dtx-pi-disable-hangover) to explicitly state disabling of the specific DTX operations for PI frames and enable the operations by default if the parameters are not present.
For the DTX hangover period, the duration of the period could be explicitly negotiated through SDP. For example, an SDP parameter dtx-pi-hangover-duration could be used to indicate the duration of the hangover period in number of audio processing frames (or any suitable time period). A value of 0 could be used to disable the hangover DTX operation. This would mean that all the PI frames during the DTX period would be delayed until a new IVAS frame is available for transmission (if the delay based DTX operation for PI frames is enabled for the session).
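Parsing such parameters from the fmtp line is straightforward; the sketch below uses the pi_data/pi_data_dtx names from the example offer and answer above, which are illustrative rather than finalized IVAS signalling.

```python
# Sketch: extract the hypothetical fmtp parameters from an SDP answer line.
def parse_fmtp(line: str) -> dict:
    """Parse 'a=fmtp:96 br=512; pi_data=1; pi_data_dtx=1' into a dict."""
    _, _, params = line.partition(" ")        # drop the 'a=fmtp:96' prefix
    out = {}
    for item in params.split(";"):
        key, _, value = item.strip().partition("=")
        if key:
            out[key] = value
    return out

answer = parse_fmtp("a=fmtp:96 br=512; pi_data=1; pi_data_dtx=1")
assert answer["pi_data_dtx"] == "1"   # mode 1: PI data sent during DTX
```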
In some embodiments, the DTX approach or operation used for PI frames could vary between different PI types. For example, some less time critical PI frames (e.g., labelling data) could always be associated with the previously transmitted audio frame (following embodiment 1). Some more time critical PI frames (e.g., orientations) could always be tied/linked to the next transmitted audio frame.
Figure 10, for example, shows an example flow diagram for session negotiation where the support for DTX operations for PI frames is negotiated. This can be divided into session negotiation offer creation and answer creation processes according to some embodiments.
With respect to the session negotiation offer creation process there is shown, by 1001, obtaining and including a parameter to enable DTX operations for PI frames in the session description file.
Following this, as shown by 1003, is generating session negotiation offer to a receiver user equipment.
With respect to the session negotiation answer process there is shown, by 1011, the operation of receiving a session negotiation offer.
Following this, as shown by 1013, is parsing one or more IVAS DTX PI indications in the received session negotiation offer.
Then is shown, by 1015, including DTX PI parameter indicating support for DTX operations for PI frames within a session negotiation answer session description file.
This then results, as shown by 1017, in transmitting the session negotiation answer to the sender user equipment which delivered the session negotiation offer.
With respect to Figure 11 is shown a flow diagram representing transmission and reception examples which summarize the embodiments described above.
With respect to the encoder or transmitter or sender there is shown, by 1101, the operation of obtaining a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame. The temporal parameter can, for example, represent timestamps associated with the at least one immersive audio data frame. For example, the temporal parameter is a timestamp related to the RTP packet, or a composition time or a playback time related to another encoder specification related to MPEG or QUIC. The temporal parameter may comprise a generation time, covering the situation where the audio is generated virtually and not captured. The temporal parameter may further comprise a time of appliance or future appliance time or future playback time or deferred playback time, covering the aspect that the PI frame data should be applied in the future.

Following this, as shown by 1103, is determining a discontinuous transmission mode period. In other words, determining that the apparatus is operating in a discontinuous transmission mode.
Then is shown, by 1105, obtaining at least one metadata during the discontinuous transmission mode period. As discussed above the metadata can be processing information or any suitable element or parameter related to the audio data.
After which is shown, by 1107, controlling the transmission of the at least one metadata based on the temporal parameter. The controlling of the transmission can, as discussed above, be a delaying of the transmission of the metadata until after the end of the discontinuous transmission mode period, or until a (next) SID frame. In such circumstances the transmission timestamp associated with the metadata is that of a packet after the period, or that of the SID frame. In some embodiments the controlling of the transmission can be transmission of the metadata within the discontinuous transmission mode period (for an extension or hangover period), where the timestamp associated with the transmission of the metadata is that of the last transmitted audio frame.
With respect to the receiver or decoder there is shown, by 1121, the operation of obtaining a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the metadata having been generated during a discontinuous transmission mode period.
Then as shown by 1123 is the operation of determining (for example extracting) from the transport protocol payload the at least one metadata based on the temporal parameter.
Following this, as shown by 1125, is optionally the operation of determining that the at least one metadata was generated during a discontinuous transmission mode period. In some embodiments this determination is implicit in the determination operation.
Then is shown, by 1127, optionally the operation of processing the at least one metadata to enable the at least one metadata to assist the processing of the immersive audio data frame.
With respect to Figures 12a and 12b there are shown flow diagrams representing packetization examples for IVAS payloads with PI frames during a DTX period according to the embodiments described above, and specifically the hangover period and delay frame implementations, respectively.
Figure 12a for example shows flow diagrams of the packetization and depacketization of hangover period implementation embodiments.
With respect to the packetization of the PI DTX IVAS payload with respect to the encoder or sender there is shown, by 1201, the operation, during the DTX period, of obtaining PI data and generating a PI frame from the data. This implements the operations 1103, 1105 as discussed above.
Following this, as shown by 1202, is attaching the PI data frame to an empty audio frame (e.g., NO_DATA frame). This implements the operation 1107 as discussed above.
Then is shown, by 1203, generating an IVAS RTP payload header identifying the PI frame and the empty audio frame, and attaching the header with the above data frames. This implements the operation 1107 as discussed above.
After which is shown, by 1204, generating an RTP header where the timestamp is set to the same value as the last transmitted packet containing audio. This implements the operations 1101, 1107 as discussed above.
There follows, as shown by 1205, attaching the resulting bitstream (IVAS RTP payload header + data frames) with the RTP header, resulting in an RTP packet. This implements the operation 1107 as discussed above.
This then results in the operation, as shown by 1206, of transmitting RTP packet to the decoder/receiver user equipment (for example UE2). This implements the operation 1107 as discussed above.
With respect to the depacketization of the PI DTX IVAS payload there is shown, by 1211, the operation of receiving an RTP packet and detecting that the timestamp is the same as in a previously received packet. This implements the operations 1121, 1125 as discussed above.
Following this, as shown by 1212, is extracting RTP payload and determining presence of PI frame(s) in the RTP payload. This implements the operation 1123 as discussed above.
Then is shown, by 1213, extracting the PI frame. This implements the operation 1123 as discussed above.
After which is shown, by 1214, delivering the PI frame to the processing pipeline. This implements the operation 1127 as discussed above.
Figure 12b furthermore shows flow diagrams of the packetization and depacketization of the frame delay implementation embodiments.
With respect to the packetization of the PI DTX IVAS payload with respect to the encoder or sender there is shown, by 1221, the operation, during the DTX period, of obtaining PI data and storing it for later use. This implements the operations 1103, 1105 as discussed above.
Following this, as shown by 1222, is following the end of the DTX period, gathering the delayed (stored) PI data and possibly new PI data and generating PI frame(s) from the data. This implements the operation 1105 as discussed above.
After this, as shown by 1223, is attaching the PI data frame to an IVAS frame.
This implements the operation 1107 as discussed above.
Then is shown, by 1224, generating an IVAS RTP payload header identifying the PI frame and the IVAS frame, and attaching the header with the above data frames. This implements the operation 1107 as discussed above.
After which is shown, by 1225, generating an RTP header with a timestamp representing the audio (IVAS) progression. This implements the operations 1101 as discussed above.
There follows, as shown by 1226, attaching the resulting bitstream (IVAS RTP payload header + data frames) with the RTP header resulting in an RTP packet. This implements the operation 1107 as discussed above.
This then results in the operation, as shown by 1227, of transmitting RTP packet to the decoder/receiver user equipment (for example UE2). This implements the operations 1107 as discussed above.
With respect to the depacketization of the PI DTX IVAS payload there is shown, by 1231, the operation of receiving an RTP packet. This implements the operation 1121 as discussed above.
Following this, as shown by 1232, is extracting RTP payload and determining presence of PI frame(s) in the RTP payload. This implements the operations 1123 as discussed above.
Then is shown, by 1233, extracting the PI frame and determining whether there are PI frames with similar types present. This implements the operations 1123 as discussed above.
After which is shown, by 1234, determining if the PI frames with a similar type are present and selecting one or more (for example the most recent PI data available). This implements the operation 1123 as discussed above.
Following this is shown, by 1235, delivering the PI frame to the processing pipeline. This implements the operation 1127 as discussed above.
With respect to Figure 13 an example electronic device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1900 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1900 comprises at least one processor or central processing unit 1907. The processor 1907 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1900 comprises a memory 1911. In some embodiments the at least one processor 1907 is coupled to the memory 1911. The memory 1911 can be any suitable storage means. In some embodiments the memory 1911 comprises a program code section for storing program codes implementable upon the processor 1907. Furthermore in some embodiments the memory 1911 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1907 whenever needed via the memory-processor coupling.
In some embodiments the device 1900 comprises a user interface 1905. The user interface 1905 can be coupled in some embodiments to the processor 1907. In some embodiments the processor 1907 can control the operation of the user interface 1905 and receive inputs from the user interface 1905. In some embodiments the user interface 1905 can enable a user to input commands to the device 1900, for example via a keypad. In some embodiments the user interface 1905 can enable the user to obtain information from the device 1900. For example the user interface 1905 may comprise a display configured to display information from the device 1900 to the user. The user interface 1905 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1900 and further displaying information to the user of the device 1900.
In some embodiments the device 1900 comprises an input/output port 1909.
The input/output port 1909 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1907 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1909 may be configured to receive the signals and in some embodiments obtain the focus parameters as described herein.
In some embodiments the device 1900 may be employed to generate a suitable audio signal using the processor 1907 executing suitable code. The input/output port 1909 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones (which may be a headtracked or a non-tracked headphones) or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California, and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
3GPP 3rd Generation Partnership Project
AMR-WB IO Adaptive Multi Rate Wideband Inter-Operable
BLI Bitrate Level Indication
CMR Codec mode request
CNG Comfort Noise Generator
DTX Discontinuous Transmission
EVS Enhanced Voice Services
FOA First-Order Ambisonics
FT Frame type (index)
HOA2 2nd order Higher-Order Ambisonics
HOA3 3rd order Higher-Order Ambisonics
ISM Independent Streams with Metadata (i.e., type of Object-Based Audio)
IVAS Immersive Voice and Audio Services
kbps kilobits per second
MASA Metadata-Assisted Spatial Audio
MC Multichannel
OMASA Object-based audio with MASA (combined input format)
OSBA Object-based audio with SBA (combined input format)
PI Processing information (audio)
RTCP Real-Time Transport Control Protocol
RTP Real-Time Transport Protocol
SBA Scene-Based Audio
SDP Session Description Protocol
SID Silence Descriptor
SSB Supplemental Signaling Bits
ToC Table of Contents
UE User equipment
VAD Voice Activity Detection
Claims (31)
- CLAIMS: 1. An apparatus for controlling transmission of metadata related to conversational immersive audio, the apparatus comprising means configured to: obtain a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determine a discontinuous transmission mode period; obtain at least one metadata during the discontinuous transmission mode period; and control the transmission of the at least one metadata based on the temporal parameter.
- 2. The apparatus as claimed in claim 1, wherein the means configured to control the transmission of the at least one metadata is configured to at least one of: associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted preceding the obtaining of the at least one metadata; associate at least one metadata based on the temporal parameter with the at least one immersive audio data frame transmitted succeeding the obtaining of the at least one metadata; and associate at least one metadata with a silence descriptor frame transmitted during the discontinuous transmission mode period.
- 3. The apparatus as claimed in claim 2, wherein the means configured to control the transmission of the at least one metadata is configured to transmit the at least one metadata during the discontinuous transmission mode period.
- 4. The apparatus as claimed in any of claims 2 or 3, wherein the means configured to control the transmission of the at least one metadata is further configured to: determine a further time period following a start of the discontinuous transmission mode period; and select, as the associated at least one metadata, at least one of the at least one metadata obtained during the discontinuous transmission mode period and further obtained during the further time period within the discontinuous transmission mode period.
- 5. The apparatus as claimed in any of claims 1 to 4, wherein the means configured to control the transmission of the at least one metadata is configured to delay a transmission of the at least one metadata.
- 6. The apparatus as claimed in claim 5, wherein the means configured to control a transmission of the at least one metadata is configured to at least one of: select a latest of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; select more than one of the at least one metadata obtained during the discontinuous transmission mode period to be transmitted; and generate, as the at least one metadata to be transmitted, a combination metadata based on a combination of metadata obtained during the discontinuous transmission mode period.
- 7. The apparatus as claimed in any of claims 1 to 6, wherein the means configured to control the transmission of the at least one metadata based on the temporal parameter is configured to delay the transmission of the at least one metadata frame obtained during the discontinuous transmission mode period.
- 8. The apparatus as claimed in any of claims 1 to 7, wherein the at least one metadata comprises at least one processing information frame for assisting processing of the at least one immersive audio data frame.
- 9. The apparatus as claimed in any of claims 1 to 8, wherein the means configured to control the transmission of the at least one metadata based on the temporal parameter is configured to packetize the at least one metadata in a transport protocol payload, the transport protocol payload comprising at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
- 10. The apparatus as claimed in any of claims 1 to 9, wherein the at least one metadata comprises at least one of: information identifying an orientation of the apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of the apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at the apparatus; and information about an acoustic environment.
- 11. The apparatus as claimed in any of claims 1 to 10, wherein the means is further configured to negotiate with a further apparatus support for a discontinuous transmission mode and support for the control of the transmission of the at least one metadata based on the temporal parameter.
- 12. The apparatus as claimed in claim 11, wherein the means configured to negotiate with the further apparatus is configured to negotiate employing a session description file.
- 13. The apparatus as claimed in any of claims 1 to 12, wherein the temporal parameter comprises at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
- 14. The apparatus as claimed in any of claims 1 to 13, wherein the at least one metadata is at least one metadata frame.
- 15. An apparatus for controlling reception of metadata related to a conversational immersive audio signal, the apparatus comprising means configured to: obtain a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determine from the transport protocol payload the at least one metadata based on the temporal parameter.
- 16. The apparatus as claimed in claim 15, wherein the means is further configured to determine that the at least one metadata was generated during a discontinuous transmission mode period.
- 17. The apparatus as claimed in claim 16, wherein the means configured to determine that the at least one metadata was generated during a discontinuous transmission mode period is configured to determine from an analysis of the transport protocol payload that the at least one metadata was generated during a discontinuous transmission mode period.
- 18. The apparatus as claimed in any of claims 15 to 17, wherein the transport protocol payload further comprises the at least one immersive audio data frame, and the means is further configured to process the at least one metadata to enable the at least one metadata to assist a processing of the immersive audio data frame.
- 19. The apparatus as claimed in claim 18, wherein the means configured to process the at least one metadata to assist the processing of the immersive audio data frame is configured to modify an association between the at least one metadata and the at least one immersive audio data frame based on the temporal parameter.
- 20. The apparatus as claimed in any of claims 18 or 19, wherein the means configured to process the at least one metadata to assist the processing of an immersive audio data frame is configured to select at least one of the at least one metadata to assist the processing of an immersive audio data frame.
- 21. The apparatus as claimed in claim 18, wherein the means configured to process the at least one metadata to assist the processing of an immersive audio data frame is configured to at least one of: select a latest of the at least one metadata; select one of the at least one metadata; select more than one of the at least one metadata; and generate, as the selected at least one metadata, a combination metadata.
- 22. The apparatus as claimed in claim 21, wherein the combination metadata comprises a combination of information from the at least one metadata.
- 23. The apparatus as claimed in any of claims 15 to 22, wherein the means is further configured to extract from the transport protocol payload the at least one immersive audio data frame, wherein the at least one immersive audio data frame is at least one of: transmitted either before or after the discontinuous transmission mode period; and a silence descriptor data frame transmitted during the discontinuous transmission mode period.
- 24. The apparatus as claimed in any of claims 15 to 23, wherein the means configured to determine from the transport protocol payload at least one metadata is configured to extract the at least one metadata from at least one of: a real-time transport protocol payload; a real-time transport protocol header extension; and a real-time transport control protocol payload.
- 25. The apparatus as claimed in any of claims 15 to 24, wherein the at least one metadata comprises at least one of: information identifying an orientation of a further apparatus or a capturing apparatus; information identifying a microphone array constellation configuration of a further apparatus or capturing apparatus; information identifying an orientation applied to the at least one immersive audio data frame; information identifying a subsequent input format change for the at least one immersive audio data frame; information identifying a subsequent coded format change for the at least one immersive audio data frame; information identifying a specified number of frames or number of seconds until a change in an input format for the at least one immersive audio data frame; information identifying binauralization data that can be used to revert the binauralization at a further apparatus; and information about an acoustic environment.
- 26. The apparatus as claimed in any of claims 15 to 25, wherein the means is further configured to negotiate with a further apparatus support for the discontinuous transmission mode and support for a control of transmission of the at least one metadata based on the temporal parameter at the further apparatus.
- 27. The apparatus as claimed in claim 26, wherein the means configured to negotiate with the further apparatus is configured to negotiate employing a session description file.
- 28. The apparatus as claimed in any of claims 15 to 27, wherein the temporal parameter comprises at least one of: a generation time; a time of appliance; a future appliance time; a future playback time; a deferred playback time; a time stamp related to real time protocol packet; a capture time; a time of audio capture; a composition time; and a playback time.
- 29. The apparatus as claimed in any of claims 15 to 28, wherein the at least one metadata is at least one metadata frame.
- 30. A method for an apparatus for controlling transmission of metadata related to conversational immersive audio, the method comprising: obtaining a temporal parameter, the temporal parameter being associated with at least one immersive audio data frame; determining a discontinuous transmission mode period; obtaining at least one metadata during the discontinuous transmission mode period; and controlling the transmission of the at least one metadata based on the temporal parameter.
- 31. A method for an apparatus for controlling reception of metadata related to a conversational immersive audio signal, the method comprising: obtaining a transport protocol payload, the transport protocol payload comprising at least one metadata associated with a temporal parameter associated with at least one immersive audio data frame, the at least one metadata having been generated during a discontinuous transmission mode period; and determining from the transport protocol payload the at least one metadata based on the temporal parameter.
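The transmit-side behaviour of claims 1, 2, 5 and 6 can be summarised in a minimal sketch: metadata obtained during a discontinuous transmission mode period is buffered together with its temporal parameter, its transmission is delayed, and a selected metadata (here, the latest by capture time) is associated with a silence descriptor frame. All names, the use of a capture time as the temporal parameter, and the callback signatures below are illustrative assumptions, not prescribed by the claims.

```python
# Hypothetical transmit-side sketch of claims 1, 2, 5 and 6.
from collections.abc import Callable
from dataclasses import dataclass, field

@dataclass
class MetadataFrame:
    payload: bytes
    capture_time: float  # the temporal parameter (one option in claim 13)

@dataclass
class DtxMetadataController:
    in_dtx: bool = False
    pending: list[MetadataFrame] = field(default_factory=list)

    def start_dtx_period(self) -> None:
        # The determined discontinuous transmission mode period begins.
        self.in_dtx = True

    def on_metadata(self, md: MetadataFrame,
                    send: Callable[[MetadataFrame], None]) -> None:
        if self.in_dtx:
            # Claim 5: delay transmission of metadata obtained during DTX.
            self.pending.append(md)
        else:
            send(md)

    def on_sid_frame(self, send_with_sid: Callable[[MetadataFrame], None]) -> None:
        # Claim 2 (third option): associate metadata with a silence
        # descriptor frame transmitted during the DTX period.
        if self.pending:
            # Claim 6 (first option): select the latest metadata obtained
            # during the DTX period, judged by its temporal parameter.
            latest = max(self.pending, key=lambda m: m.capture_time)
            send_with_sid(latest)
            self.pending.clear()
```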
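The "combination metadata" option of claims 6, 21 and 22 leaves the combination itself open. Assuming, purely for illustration, that each metadata frame carries a device orientation as an azimuth angle in degrees (one of the information types listed in claims 10 and 25), a reasonable combination over the frames obtained during the discontinuous transmission mode period is a circular mean:

```python
# Illustrative combination of orientation metadata; the azimuth
# representation is an assumption, not taken from the claims.
import math

def combine_azimuths(azimuths_deg: list[float]) -> float:
    """Circular mean of azimuth angles, returned in degrees (0..360)."""
    s = sum(math.sin(math.radians(a)) for a in azimuths_deg)
    c = sum(math.cos(math.radians(a)) for a in azimuths_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Example: combine_azimuths([350.0, 10.0]) returns 0.0, whereas a naive
# arithmetic mean of the two captured orientations would give 180.0.
```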
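For the transport options of claims 9 and 24, one concrete but non-normative packetisation is an RTP header extension in the RFC 8285 one-byte format. The extension element ID and the packing of the temporal parameter as a 32-bit millisecond value are assumptions for illustration only:

```python
# Sketch of carrying metadata plus its temporal parameter in an RTP
# header extension (RFC 8285, one-byte format). Layout is illustrative.
import struct

def one_byte_header_extension(element_id: int, metadata: bytes,
                              capture_time_ms: int) -> bytes:
    body = struct.pack("!I", capture_time_ms) + metadata  # temporal parameter first
    assert 1 <= element_id <= 14, "one-byte format element IDs are 1..14"
    assert 1 <= len(body) <= 16, "one-byte format carries 1..16 bytes per element"
    element = bytes([(element_id << 4) | (len(body) - 1)]) + body
    pad = (-len(element)) % 4          # extension data is padded to 32-bit words
    words = (len(element) + pad) // 4
    # 0xBEDE is the fixed identifier of the one-byte extension format.
    return struct.pack("!HH", 0xBEDE, words) + element + b"\x00" * pad
```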
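The negotiation of claims 11, 12, 26 and 27 could be carried in the SDP offer/answer exchange of the session description file. The attribute names below ("dtx" and "md-temporal") and the payload type 97 are invented for illustration; they are not defined by the claims or by any payload format specification:

```python
# Hypothetical SDP capability check for DTX support and for
# temporally-parameterized metadata transmission.
OFFER_FMTP = "a=fmtp:97 dtx=1;md-temporal=1"

def answer_supports(fmtp_line: str) -> bool:
    """True if a (hypothetical) answer fmtp line keeps both capabilities."""
    if " " not in fmtp_line:
        return False
    params = fmtp_line.split(" ", 1)[1].split(";")
    return "dtx=1" in params and "md-temporal=1" in params
```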
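On the receive side (claims 15 and 19), the metadata recovered from the transport protocol payload can be (re-)associated with the immersive audio data frame whose timestamp lies closest to the metadata's temporal parameter. A minimal sketch, assuming both are expressed on a common clock:

```python
# Receive-side association sketch; frame/timestamp representations are
# assumptions for illustration.
def associate(metadata_time: float, frame_times: list[float]) -> int:
    """Index of the audio frame to which the metadata is taken to apply."""
    return min(range(len(frame_times)),
               key=lambda i: abs(frame_times[i] - metadata_time))
```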
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2403747.5A GB2639269A (en) | 2024-03-15 | 2024-03-15 | Immersive conversational audio |
| PCT/EP2025/055527 WO2025190692A1 (en) | 2024-03-15 | 2025-02-28 | Immersive conversational audio |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2403747.5A GB2639269A (en) | 2024-03-15 | 2024-03-15 | Immersive conversational audio |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| GB202403747D0 GB202403747D0 (en) | 2024-05-01 |
| GB2639269A true GB2639269A (en) | 2025-09-17 |
| GB2639269A8 GB2639269A8 (en) | 2025-10-08 |
Family
ID=90826055
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB2403747.5A Pending GB2639269A (en) | 2024-03-15 | 2024-03-15 | Immersive conversational audio |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2639269A (en) |
| WO (1) | WO2025190692A1 (en) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2605555A2 (en) * | 2011-12-12 | 2013-06-19 | Broadcom Corporation | Enhanced discontinuous mode operation with shared radio frequency resources |
| WO2023031498A1 (en) * | 2021-08-30 | 2023-03-09 | Nokia Technologies Oy | Silence descriptor using spatial parameters |
Non-Patent Citations (1)
| Title |
|---|
| 3GPP DRAFT, vol. 3GPP SA4, 2024, LASSE LAAKSONEN ET AL., "TS 26.253 v1.1.0 (Codec for Immersive Voice and Audio Services; Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions)" * |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202403747D0 (en) | 2024-05-01 |
| GB2639269A8 (en) | 2025-10-08 |
| WO2025190692A1 (en) | 2025-09-18 |
Similar Documents
| Publication | Title |
|---|---|
| CN107408395B (en) | Conference Audio Management |
| US7680099B2 | Jitter buffer adjustment |
| AU2019380367B2 (en) | Audio processing in immersive audio services |
| JP5528811B2 (en) | Receiver operation and implementation for efficient media handling |
| CN101790754B (en) | System and method for providing amr-wb dtx synchronization |
| WO2025140862A1 (en) | Rendering support in immersive conversational audio |
| WO2025103718A1 (en) | Apparatus and methods for implementing multiple immersive voice and audio services (ivas) streams within a single real-time transport protocol packet |
| GB2639269A (en) | Immersive conversational audio |
| GB2640667A (en) | Apparatus and methods |
| GB2635735A (en) | Immersive conversational audio |
| WO2021255327A1 (en) | Managing network jitter for multiple audio streams |
| GB2633320A (en) | Apparatus and methods |
| GB2634887A (en) | Fragmenting immersive audio payloads |
| GB2641548A (en) | An apparatus and method for controlling codec capability level |
| US20250315207A1 (en) | Power Saving for Audio Streams |
| WO2025061468A1 (en) | Apparatus and methods |
| WO2024245695A1 (en) | Apparatus, methods and computer program for selecting a mode for an input format of an audio stream |
| WO2025181267A1 (en) | Backwards compatible rtp payload format, encoding and decoding for split rendering support in immersive audio |
| WO2024134010A1 (en) | Complexity reduction in multi-stream audio |
| GB2640555A (en) | Immersive communication sessions |