GB2631410A - Rendering of transported audio data
- Publication number: GB2631410A (application GB2309834.6)
- Authority: GB (United Kingdom)
- Legal status: Pending
Classifications
- H04L 1/0045: Arrangements for detecting or preventing errors in the information received, by using forward error control; arrangements at the receiver end
- G10L 19/005: Speech or audio coding/decoding; correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L 19/02: Speech or audio analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
Abstract
The invention provides determining a loss of a packet comprising first and second audio data. As the audio data are transported in the same packet for contemporaneous rendering, they suffer the same transportation loss. The effect of the loss is remedied by generating substitution audio data differently for the first audio data and the second audio data. The generation is configured to conceal (obscure) the packet loss by producing audio data having the expected audio characteristics, which creates perceptual similarity. The first generation process can, for example, use silence or simulated noise to mimic gaps in speech, without attempting to reproduce previous speech. The second generation process can, for example, mimic background noise by reproducing long-term characteristics of previous background noise. In examples, the packet further comprises metadata for positioning the audio source in three dimensions. The invention may provide packet loss concealment for Immersive Voice and Audio Services (IVAS) calls, for example in mobile radio telecommunication devices, or for the Metadata-assisted spatial audio (MASA) format optimized for direct mobile device use.
Description
TITLE
Rendering of transported audio data.
TECHNOLOGICAL FIELD
Examples of the disclosure relate to rendering of transported audio data, and in particular rendering of audio data following a loss of transported audio data.
BACKGROUND
It is common in modern telecommunications for data, including audio data, to be transferred from a source to a destination as packets.
The packets can be transferred by different routes or can experience time-varying interference in the same route.
The desire is to transfer information (data) from source to destination. However, information can be lost. For example, a packet may not arrive or may arrive late. For example, a packet can be corrupted in the transfer.
If the transferred audio data is being rendered to a user, the loss of audio data can have a negative impact on the user experience.
The impact on the user experience can depend upon the nature of the audio data.
In modern audio systems, different audio data streams can be sent from source to destination.
In some examples, each audio stream is sent separately in separate packets. It may therefore be possible to improve transfer of one type of audio data (one type of packet) in preference to another type of audio data (another type of packet).
In some examples, each audio stream is sent using shared packets. It is not therefore possible to improve transfer of one type of audio data in preference to another type of audio data by improving transfer of one type of packet in preference to another type of packet.
BRIEF SUMMARY
According to various, but not necessarily all, examples there is provided an apparatus comprising: means for determining a loss of a packet comprising first audio data and second audio data; means for generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; means for generating, using a second generation process different to the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and means for causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
In some but not necessarily all examples, the packet is a packet communicated over a radio interface and the packet comprises a first data structure for transporting the first audio data, and a second data structure for transporting the second audio data. In some but not necessarily all examples, the first data structure comprises metadata for positioning a source of the first audio data in three dimensions.
In some but not necessarily all examples, the means for determining a loss of a packet comprising first audio data and second audio data comprise means for determining that an expected packet has not arrived, and/or means for determining that an expected packet has arrived late, and/or means for determining that an expected packet has arrived corrupted.
In some but not necessarily all examples, the means for generating, using the first generation process, first substitution audio data to be used as first audio data instead of first audio data comprises: means for generating, using the first generation process, first substitution audio data to be used as first audio data instead of first audio data of a delayed packet until the first audio data of the delayed packet is available for rendering. In some but not necessarily all examples, the apparatus comprises means for rendering the first audio data of the delayed packet at a rate faster than a normal rendering rate at which rendering occurs without packet delay.
In some but not necessarily all examples, the first generation process produces audio data having audio characteristics similar to expected audio characteristics of the first audio data of a transferred packet, wherein the expected audio characteristics of the first audio data are those expected based on analysis of previous audio characteristics of the first audio data; and wherein the second generation process produces audio data having audio characteristics similar to expected audio characteristics of the second audio data of the transferred packet, wherein the expected audio characteristics of the second audio data are those expected based on analysis of previous audio characteristics of the second audio data.
In some but not necessarily all examples, the first generation process is dependent upon first audio data of one or more previously received packets, and/or the first generation process is dependent upon a classification of first audio data of one or more previously received packets, and/or the first generation process is dependent upon audio energy density across the frequency domain and/or the time domain of first audio data of one or more previously received packets.
In some but not necessarily all examples, the first generation process is dependent upon classification of first audio data of previously received packets as speech. In some but not necessarily all examples, the second generation process is dependent upon second audio data of one or more previously received packets, and/or the second generation process is dependent upon a classification of second audio data of one or more previously received packets.
In some but not necessarily all examples, the second generation process is dependent upon classification of second audio data of previously received packets as ambient audio.
In some but not necessarily all examples, the first generation process is one of: muting or speeding up.
In some but not necessarily all examples, the second generation process is one of: attenuation, replication or prediction.
In some but not necessarily all examples, the first audio data is discontinuous in the time domain and/or frequency domain and the second audio data is continuous in the time domain and/or frequency domain; and/or the first audio data is speech and the second audio data is not speech; and/or the first audio data is directional audio and the second audio data is not directional audio; and/or the first audio data is a point source and the second audio data is a diffuse ambient source; and/or the first audio data comprises larger scale energy fluctuations in the time domain and/or the frequency domain compared to the second audio data and the second audio data comprises random smaller scale energy fluctuations in the time domain and/or the frequency domain compared to the first audio data; and/or the first audio data is a foreground or proximal audio source and the second audio data is a background or distal audio source; and/or the first audio data is an audio source of higher energy than the second audio data.
In some but not necessarily all examples, the apparatus comprises means for receiving for immediate rendering a stream of packets comprising first audio data and second audio data.
In some but not necessarily all examples, the apparatus is configured as a mobile apparatus and comprises a radio transceiver for receiving packets comprising first audio data and second audio data.
According to various, but not necessarily all, examples there is provided a method comprising: determining a loss of a packet comprising first audio data and second audio data; generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
According to various, but not necessarily all, examples there is provided a computer program comprising instructions that when run on one or more processors of an apparatus cause: determining a loss of a packet comprising first audio data and second audio data; generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
According to various, but not necessarily all, examples there is provided examples as claimed in the appended claims.
While the above examples of the disclosure and optional features are described separately, it is to be understood that their provision in all possible combinations and permutations is contained within the disclosure. It is to be understood that various examples of the disclosure can comprise any or all of the features described in respect of other examples of the disclosure, and vice versa. Also, it is to be appreciated that any one or more or all of the features, in any combination, may be implemented by/comprised in/performable by an apparatus, a method, and/or computer program instructions as desired, and as appropriate.
BRIEF DESCRIPTION
Some examples will now be described with reference to the accompanying drawings in which:
FIG 1 illustrates an example of an apparatus 10 configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24;
FIG 2 illustrates an example of a packet 20 shared by different types of audio data 22, 24;
FIG 3 illustrates another example of a packet 20 shared by different types of audio data 22, 24;
FIG 4A illustrates, before transfer, a sequence 30 of packets 20 shared by different types of audio data 22, 24 that therefore provides two packet-based streams of audio data 22, 24;
FIG 4B illustrates, after transfer and differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20_3 shared by the audio data 22, 24, that first substitution audio data 32 is used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost and second substitution audio data 34 is used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost;
FIG 5 illustrates a method of differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20_3 shared by the audio data 22, 24, where first substitution audio data 32 is used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost and second substitution audio data 34 is used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost;
FIG 6 illustrates an example of an apparatus 10 configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24;
FIG 7A illustrates an example of a controller for an apparatus 10 configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24;
FIG 7B illustrates an example of a computer program for an apparatus 10 that configures the apparatus 10 to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24;
FIG 8 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss of a packet shared by the audio data, where first substitution audio data is used as first audio data (object audio data 22) instead of first audio data of the packet 20 determined as lost and second substitution audio data 34 is used as second audio data (ambience audio data 24) instead of second audio data of the packet 20 determined as lost;
FIG 9 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss of a packet shared by the audio data, where first substitution audio data (mute data) is used as first audio data (object audio data 22) instead of first audio data of the packet 20 determined as lost and second substitution audio data 34 (repetition of previous ambience audio data) is used as second audio data (ambience audio data 24) instead of second audio data of the packet 20 determined as lost;
FIG 10 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss (delay) of a packet shared by the audio data, where first substitution audio data (mute data) is used as first audio data (object audio data 22) instead of first audio data of the delayed packet 20 while it is delayed and second substitution audio data 34 (repetition of previous ambience audio data) is used as second audio data (ambience audio data 24) instead of second audio data of the delayed packet 20 while it is delayed.
The figures are not necessarily to scale. Certain features and views of the figures can be shown schematically or exaggerated in scale in the interest of clarity and conciseness. For example, the dimensions of some elements in the figures can be exaggerated relative to other elements to aid explication. Similar reference numerals are used in the figures to designate similar features. For clarity, all reference numerals are not necessarily displayed in all figures.
DEFINITIONS
Transport: transport is synonymous with transfer and relates to distribution of data. It does not imply a particular protocol layer.
Packet: a packet is a data structure used to carry data in a packet-switched network. It comprises control information and payload data.
Audio data: audio data is data configured to enable the production (rendering) of audio (sound).
Generate: Generate is synonymous with produce or create.
Substitution audio data: Substitution audio data is audio data that is used as a substitute for (in the place of) audio data.
Rendering: rendering means production and is often applied to creating audio signals as sound waves or in a format suitable for generating sound waves.
Contemporaneous rendering: means rendering (producing) at the same time. It includes simultaneous rendering but has a broader scope covering, for example, audio signals that are perceptually at the same time.
Loss: loss is used to refer to the non-maintenance of data. For example, data (information) that was present at transmission is not present when receiving at the expected time of receipt. This may for example arise because of non-delivery of the data, delayed delivery of the data or corrupted delivery of the data. Corrupted delivery means delivery of data but not in the form as transmitted.
Independent: independent, when used to describe the relationship of one process to another process, means that at least some aspects of the processes are independently (separately) configurable. The processes can, for example, be of different types or categories, or of the same or similar type or category but configured differently. The processes are not, however, identical.
DETAILED DESCRIPTION
FIG 1 illustrates an example of an apparatus 10 configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24.
The apparatus 10 comprises: means 12 for determining a loss of a packet 20 comprising first audio data 22 and second audio data 24; means 14 for generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; means 16 for generating, using a second generation process different to the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and means 18 for causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
In some but not necessarily all examples, the loss of a packet 20 is a loss, in transport, of a packet 20. However, loss can occur in other ways.
In some but not necessarily all examples, the first generation process and the second generation process are not only different but are additionally independent.
As the first audio data 22 and second audio data 24 are transported in the same packet 20 for contemporaneous rendering, they suffer the same transportation loss. However, the effect of the loss is remedied differently for the first audio data 22 and the second audio data 24: the first generation process and the second generation process are different.
The loss, in transport, of a packet 20 can for example be as a consequence of non-delivery of the packet 20, delayed delivery of the packet 20 or corrupted delivery of the packet 20.
The means 12 for determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24 can, for example, comprise means for determining that an expected packet 20 has not arrived, and/or means for determining that an expected packet 20 has arrived late, and/or means for determining that an expected packet 20 has arrived corrupted.
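For illustration only, a minimal sketch of how these three conditions might be detected at a receiver follows. The packet fields mirror the time stamp 21 and checksum 23 described later; the CRC-32 checksum, the millisecond clock and the rendering deadline are assumptions, not details taken from this disclosure.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReceivedPacket:
    sequence: int      # position in the transmitted sequence of packets
    timestamp_ms: int  # sender time stamp (cf. time stamp 21)
    payload: bytes     # first audio data + second audio data
    checksum: int      # integrity check (cf. checksum 23); CRC-32 assumed here

def classify_arrival(packet: Optional[ReceivedPacket],
                     deadline_ms: int, now_ms: int) -> str:
    """Classify an expected packet as 'not arrived', 'corrupted', 'late' or 'ok'."""
    if packet is None:
        return "not arrived"                # expected packet never arrived
    if zlib.crc32(packet.payload) != packet.checksum:
        return "corrupted"                  # arrived, but corrupted in transport
    if now_ms > deadline_ms:
        return "late"                       # arrived after its rendering deadline
    return "ok"
```

Any of the first three outcomes counts as a loss for the purposes of the generation processes described below.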
In at least some examples, the first generation process and the second generation process are configured to conceal (obscure) packet loss. The first generation process can, for example, produce audio data having audio characteristics similar to expected audio characteristics of the first audio data 22 of a transferred packet 20. This creates perceptual similarity because the generated first audio data 22 has perceptually expected audio characteristics. The second generation process can, for example, produce audio data having audio characteristics similar to expected audio characteristics of the second audio data 24 of the transferred packet 20. This creates perceptual similarity because the generated second audio data 24 has perceptually expected audio characteristics.
The term expected is used in a statistical/mathematical sense. The expected audio characteristics are those expected based on analysis of previous audio characteristics.
The first generation process can, for example, mimic gaps in speech. It can use silence or simulated noise to simulate gaps in speech, without attempting to reproduce previous speech. The second generation process can, for example, mimic background noise. It can reproduce long-term characteristics of previous background noise.
This can, for example, be achieved by configuring the first generation process to be dependent upon first audio data 22 of one or more previously received packets 20 and/or the second generation process to be dependent upon second audio data 24 of one or more previously received packets 20.
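A minimal sketch of such a pair of generation processes is shown below, assuming fixed-length PCM frames held as NumPy arrays. The comfort-noise level and the averaged magnitude spectrum are illustrative choices for mimicking gaps in speech and long-term background-noise characteristics respectively; they are not prescribed by this disclosure.

```python
import numpy as np

def generate_first_substitution(frame_len: int, noise_level: float = 0.001):
    """First generation process: mimic a gap in speech with silence
    (noise_level == 0) or low-level simulated noise, without attempting
    to reproduce previous speech."""
    return noise_level * np.random.randn(frame_len)

def generate_second_substitution(previous_ambience_frames):
    """Second generation process: reproduce long-term characteristics of
    previous background noise by shaping white noise with the average
    magnitude spectrum of recent ambience frames."""
    envelope = np.mean([np.abs(np.fft.rfft(f)) for f in previous_ambience_frames],
                       axis=0)                      # long-term spectral envelope
    frame_len = len(previous_ambience_frames[-1])
    spec = np.fft.rfft(np.random.randn(frame_len))
    # Keep the random phase, impose the long-term magnitude envelope.
    return np.fft.irfft(spec / (np.abs(spec) + 1e-9) * envelope, n=frame_len)
```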
In some examples, the first generation process is dependent upon a classification of first audio data 22 of one or more previously received packets 20. For example, a particular first generation process can be selected based on the classification.
In some examples, the first generation process is dependent upon audio energy density across the frequency domain and/or the time domain of first audio data 22 of previously received packets 20.
For example, speech (conversational voice) has a distinctive distribution of audio energy density across the frequency domain and the time domain which enables the classification of previously received first audio data as speech.
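Purely as an illustration of such a classification, a crude detector might exploit the strong short-term energy fluctuation of speech; the window and threshold below are assumptions.

```python
import numpy as np

def is_speech_like(frames, fluctuation_threshold: float = 4.0) -> bool:
    """frames: recent fixed-length PCM frames (1-D NumPy arrays).
    Speech alternates between voiced bursts and pauses, so its peak-to-mean
    short-term energy ratio is high; steady ambience keeps the ratio low."""
    energies = np.array([np.mean(f ** 2) for f in frames])
    return float(np.max(energies) / (np.mean(energies) + 1e-12)) > fluctuation_threshold
```

A production classifier could equally be a trained model; this sketch only shows the kind of energy-density evidence such a decision can rest on.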
A gap in speech arising from a loss of a packet 20 can, for example, be handled by not rendering any audio (muting). A delay in speech arising from a loss (delay) of a packet 20 can, for example, be handled by rendering the late received speech audio from the packet at a faster rate. There may be an upper limit to the faster rate of playback, for example, up to twice (x2) the normal rate.
The first generation process can, for example, generate first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of a delayed packet 20 until the first audio data 22 of the delayed packet 20 is available for rendering. The apparatus 10 can be configured to render the first audio data 22 of a delayed packet 20 (when it eventually arrives) at a rate faster than a normal rendering rate. The normal rate is the rate at which rendering occurs without packet delay (packet loss). The apparatus 10 can be configured to render the first audio data 22 of a delayed packet 20 (when it eventually arrives) at a rate faster than a previous rendering rate used for previously received packets 20 before the delay.
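The catch-up behaviour might be sketched as follows. Real systems would use pitch-preserving time-scale modification; the naive linear-interpolation resampling here (which alters pitch) only keeps the example short, and the 2x cap follows the limit mentioned above. The catch-up budget is an assumption.

```python
import numpy as np

def catch_up_rate(delay_ms: float, budget_ms: float = 200.0,
                  max_rate: float = 2.0) -> float:
    """Choose a playback rate that absorbs delay_ms over budget_ms,
    capped at max_rate (e.g. twice the normal rendering rate)."""
    return min(max_rate, 1.0 + delay_ms / budget_ms)

def speed_up(frame, rate: float):
    """Render len(frame)/rate output samples, i.e. play the frame faster."""
    out_len = max(1, int(len(frame) / rate))
    positions = np.linspace(0, len(frame) - 1, out_len)
    return np.interp(positions, np.arange(len(frame)), frame)
```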
In some examples, the second generation process is dependent upon a classification of second audio data 24 of one or more previously received packets 20. For example, a particular second generation process can be selected based on the classification.
For example, the second audio data 24 can be classified as ambient audio.
For example, the second audio data 24 can be classified as music or as a vehicle.
A gap in music arising from a loss of a packet 20 can, for example, be handled by not rendering any audio (muting), however, a delay in music arising from a loss (delay) of a packet 20 is not handled by rendering the late received music from the packet at a faster rate.
A gap in vehicle noise arising from a loss of a packet 20 can, for example, be handled by not rendering any audio (muting), however, a delay in vehicle noise arising from a loss (delay) of a packet 20 is not handled by rendering the late received vehicle noise from the packet at a faster rate.
In some examples, the second audio data 24 can be classified as ambient audio, for example ambient noise.
A gap in ambient noise arising from a loss of a packet 20 can, for example, be handled by not rendering any audio (muting), replicating second audio data 24 from a previously received packet (for example the immediately preceding packet), or by predicting second audio data 24 using a prediction algorithm, which may, for example, be based on machine learning.
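The three options named above might be sketched as follows; the trivial "predictor" is an attenuated repeat standing in for a trained model, and the 0.9 factor is an arbitrary illustrative value.

```python
import numpy as np

def conceal_ambience(previous_frame, method: str = "replicate"):
    """Generate substitution ambience from the immediately preceding frame."""
    if method == "mute":
        return np.zeros_like(previous_frame)   # render no audio
    if method == "replicate":
        return previous_frame.copy()           # repeat previous packet's ambience
    if method == "predict":
        # Placeholder prediction: attenuated repetition. A machine-learning
        # model trained on previous ambience would be substituted here.
        return 0.9 * previous_frame
    raise ValueError(f"unknown concealment method: {method}")
```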
The first audio data 22 can have different characteristics than the second audio data 24. In some examples, the first generation process is suitable for maintaining continuity with a stream of the first audio data but is not necessarily suitable for maintaining continuity with a stream of second audio data 24. In some examples, the (different) second generation process is suitable for maintaining continuity with a stream of the second audio data 24 but is not necessarily suitable for maintaining continuity with a stream of the first audio data 22.
In at least some examples, these different characteristics can be detected and used to control the first generation process used to generate the first substitution audio data 32 and/or to control the second generation process used to generate the second substitution audio data 34.
In some examples, the first audio data 22 is discontinuous in the time domain and/or frequency domain and the second audio data 24 is continuous in the time domain and/or frequency domain.
In some examples, the first audio data 22 is speech and the second audio data 24 is not speech.
In some examples, the first audio data 22 is directional audio and the second audio data 24 is not directional audio.
In some examples, the first audio data 22 is a point source and the second audio data 24 is a diffuse ambient source.
In some examples, the first audio data 22 comprises large scale energy fluctuations in the time domain and/or the frequency domain and the second audio data 24 comprises random small scale energy fluctuations in the time domain and/or the frequency domain. The first audio data 22 can, for example, comprise larger energy fluctuations over a longer time period than the second audio data 24. The energy fluctuations of the first audio data 22 can, for example, exceed the energy fluctuations of the second audio data 24 by a threshold value over a defined time period. The first audio data 22 can, for example, comprise larger energy fluctuations over a broader frequency range than the second audio data 24. The energy fluctuations of the first audio data 22 can, for example, exceed the energy fluctuations of the second audio data 24 by a threshold value over a defined frequency range.
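One possible realization of such a comparison, with an arbitrary window and threshold, is sketched below.

```python
import numpy as np

def energy_fluctuation(frames) -> float:
    """Standard deviation of per-frame energy over a window of recent frames."""
    energies = np.array([np.mean(f ** 2) for f in frames])
    return float(np.std(energies))

def first_fluctuates_more(first_frames, second_frames,
                          threshold: float = 0.01) -> bool:
    """True when the first stream's energy fluctuation exceeds the second's
    by the threshold over the defined time window."""
    return energy_fluctuation(first_frames) - energy_fluctuation(second_frames) > threshold
```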
In some examples, the first audio data 22 is a foreground or proximal audio source and the second audio data 24 is a background or distal audio source.
In some examples, the first audio data 22 is a higher energy audio source and the second audio data 24 is a lower energy audio source.
FIG 2 illustrates an example of a packet 20 shared by different types of audio data 22, 24.
In this example, the packet 20 is a transport packet communicated to the apparatus 10, for example via a radio interface. The packet 20 comprises a first data structure 26 for transporting the first audio data 22, and a second data structure 28 for transporting the second audio data 24.
In some examples, the first data structure 26 is specifically configured to transport speech data as the first audio data 22.
In some examples, the first data structure 26 is specifically configured to transport spatial audio data, that is audio data that has a specified three-dimensional (3D) position.
In some examples, the second data structure 28 is specifically configured to transport ambient audio data as the second audio data 24.
Optionally, the packet 20 comprises a time stamp 21 that enables the sequential ordering of packets 20. The time stamps 21 can, for example, be used for detecting and quantifying delays.
Optionally, the packet 20 comprises a checksum 23 for detecting packet 20 corruption during transport.
FIG 3 illustrates an example of the packet 20 as illustrated in FIG 2. In this example, the first data structure 26 is specifically configured to transport spatial audio data, that is audio data that has a specified 3D position. The first data structure 26 comprises metadata 25 for positioning a source of the first audio data 22 in three dimensions.
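A possible in-memory representation of the packet of FIGs 2 and 3 is sketched below. The field names and types are illustrative; the actual IVAS payload layout is defined by 3GPP, not by this sketch.

```python
from dataclasses import dataclass

@dataclass
class SpatialMetadata:        # metadata 25: positions the source in 3D
    azimuth_deg: float
    elevation_deg: float
    distance_m: float

@dataclass
class FirstDataStructure:     # first data structure 26 (e.g. speech object)
    audio: bytes
    metadata: SpatialMetadata

@dataclass
class SecondDataStructure:    # second data structure 28 (e.g. ambience)
    audio: bytes

@dataclass
class Packet:                 # packet 20
    timestamp: int            # time stamp 21: ordering and delay detection
    first: FirstDataStructure
    second: SecondDataStructure
    checksum: int             # checksum 23: corruption detection
```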
FIG 4A illustrates, before transfer, a sequence 30 of packets 20. The sequence 30 provides a stream of first audio data 22 and a stream of second audio data 24.
The stream of first audio data is comprised exclusively of first audio data 22 of the packets 20 in a sequence 30 of packets 20.
The stream of second audio data is comprised exclusively of second audio data 24 of the packets 20 in the sequence 30 of packets 20.
Each packet 20 in the sequence comprises first audio data 22 and second audio data 24. The packet 20_(Pn) is transmitted at time tn.
FIG 4B illustrates, after transfer, a sequence 30 of packets 20. The sequence 30 provides a stream of first audio data 22 and a stream of second audio data 24. There is a loss of packet 20_3. The other packets are not lost.
The stream of first audio data is comprised of first audio data 22 of the packets 20 in a sequence 30 of packets 20 that have not been lost and first substitution audio data 32 that replaces the first audio data 22 of a lost packet 20_3. The first substitution audio data 32 is used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost.
The stream of second audio data is comprised of second audio data 24 of the packets 20 in the sequence 30 of packets 20 that have not been lost and second substitution audio data 34 that replaces the second audio data 24 of the lost packet 20_3. The second substitution audio data 34 is used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost.
The stream of first audio data now comprises the first substitution audio data 32 at the time slot equivalent to t3 and the stream of second audio data now comprises the second substitution audio data 34 at the time slot equivalent to t3.
FIG 5 illustrates a method 500 of differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20_3 shared by the audio data 22, 24, where first substitution audio data 32 is used as first audio data instead of first audio data 22 of the packet 20 determined as lost and second substitution audio data 34 is used as second audio data instead of second audio data 24 of the packet 20 determined as lost.
The method 500 comprises, at block 502, determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24.
The method 500 comprises, at block 504, generating, using a first generation process, first substitution audio data 32 to be used as first audio data instead of first audio data 22 of the packet 20 determined as lost.
The method 500 comprises, at block 506, generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data instead of second audio data 24 of the packet 20 determined as lost.
The method 500 comprises, at block 508, causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
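The four blocks can be tied together as in the sketch below, with deliberately simple stand-in generation processes (silence for the first stream, repetition for the second); any of the processes discussed above could be substituted.

```python
import numpy as np

def method_500(packet, last_first, last_second, render):
    """packet: dict with 'first' and 'second' frames, or None when block 502
    determines the packet as lost."""
    if packet is None:
        first = np.zeros_like(last_first)   # block 504: first generation process
        second = last_second.copy()         # block 506: second generation process
    else:
        first, second = packet["first"], packet["second"]
    render(first, second)                   # block 508: contemporaneous rendering

# Example use with a trivial renderer:
prev = {"first": np.ones(160), "second": np.ones(160)}
method_500(None, prev["first"], prev["second"],
           lambda f, s: print(len(f), len(s)))
```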
FIG 6 illustrates an example of an apparatus 10 configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24.
In this example, the packet 20 is a transport packet communicated over a radio interface 80 and the packet 20 comprises a first data structure 26 for transporting the first audio data 22, and a second data structure 28 for transporting the second audio data 24.
The apparatus 10 comprises a radio transceiver 50 for receiving packets 20 comprising first audio data 22 and second audio data 24. The radio transceiver 50 can, for example, be a radio transceiver configured according to 3GPP specifications.
The apparatus 10 comprises a controller 400 coupled to the radio transceiver 50 and to a rendering apparatus 100 for rendering audio data to a user. The rendering apparatus 100 can, for example, comprise one or more loudspeakers. In some examples, the rendering apparatus 100 is a headset.
In some examples, the rendering apparatus 100 is housed in the apparatus 10. In other examples, it is separated from or interconnected to the apparatus 10.
The controller 400 is configured to: determine a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generate, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generate, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and cause contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
In this example, a sequence 30 of packets 20 is received at the radio transceiver 50 and provided, as stream 29, to the controller 400.
In this example, a stream 40 of audio data is provided to the rendering apparatus 100 for separate rendering. The stream 40 of audio data comprises a first audio data stream for rendering and a second audio data stream for rendering.
The first audio data stream for rendering comprises the original first audio data 22 and the generated first substitution audio data 32. The second audio data stream for rendering comprises the original second audio data 24 and the generated second substitution audio data 34.
FIG 7A illustrates an example of the controller 400 for the apparatus 10. The controller 400 is configured to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24.
Implementation of the controller 400 may be as controller circuitry. The controller 400 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
As illustrated in Fig 7A the controller 400 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 406 in a general-purpose or special-purpose processor 402 that may be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 402.
The processor 402 is configured to read from and write to the memory 404. The processor 402 may also comprise an output interface via which data and/or commands are output by the processor 402 and an input interface via which data and/or commands are input to the processor 402.
The memory 404 stores a computer program 406 comprising computer program instructions (computer program code) that controls the operation of the apparatus 10 when loaded into the processor 402. The computer program instructions, of the computer program 406, provide the logic and routines that enable the apparatus to perform the methods illustrated in the accompanying Figs. The processor 402, by reading the memory 404, is able to load and execute the computer program 406.
The apparatus 10 comprises: at least one processor 402; and at least one memory 404 including computer program code, the at least one memory 404 and the computer program code configured to, with the at least one processor 402, cause the apparatus 10 at least to perform: determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
The apparatus 10 comprises: at least one processor 402; and at least one memory 404 including computer program code, the at least one memory storing instructions that, when executed by the at least one processor 402, cause the apparatus 10 at least to: determine a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generate, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generate, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and cause contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
FIG 7B illustrates an example of a computer program for an apparatus 10 that configures the apparatus 10 to apply differential remediation for different types of audio data 22, 24 lost as a consequence of a loss of a packet 20 shared by the audio data 22, 24.
The computer program 406 may arrive at the apparatus 10 via any suitable delivery mechanism 408. The delivery mechanism 408 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, an article of manufacture that comprises or tangibly embodies the computer program 406. The delivery mechanism may be a signal configured to reliably transfer the computer program 406. The apparatus 10 may propagate or transmit the computer program 406 as a computer data signal.
Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following: determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
Although the memory 404 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/ dynamic/cached storage.
Although the processor 402 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable. The processor 402 may be a single core or multi-core processor.
The above examples can be used for packet loss concealment of Immersive Voice and Audio Services (IVAS) calls, for example, in mobile radio telecommunication devices.
The IVAS speech and audio codec is expected to have a mode where speech is transmitted as a separate object channel and ambience in its own separate channel.
In the Third Generation Partnership project (3GPP) IVAS packets 20 comprise first audio data 22 (object audio data) and second audio data 24 (ambience audio data).
In at least some examples, IVAS can support different audio formats and the audio data 22, 24 can have any suitable format. Immersive audio formats include, for example, spatial audio formats. Spatial audio enables automated 3D sound localization and the controlled automated positioning of a sound source (a sound object) in 3D space. In at least some examples the position of a sound source is not fixed and spatial audio data can be adapted to change a position of a sound source.
In at least some examples, IVAS can support different audio formats including, for example: speaker channel-based audio (including stereo and multi-channel configurations, e.g. 5.1, 7.1+4, etc.), binaural audio, scene-based audio (i.e. higher-order Ambisonics) and object-based audio.
IVAS is expected to support Metadata-assisted spatial audio (MASA), a parametric spatial audio format optimized for direct mobile device use.
IVAS is expected to enable complex scenarios with multiple participants, whose speech (and audio surroundings) is transmitted as streams and spatially rendered on the receiving mobile device at 3D positions, for example to match a video scene.
IVAS is expected to support scenarios where an intermediate call server combines multiple participants into an immersive 3D scene, each participant having a particular 3D position in the 3D scene.
The IVAS codec is expected to support streaming and virtual reality and augmented reality applications.
IVAS is expected to support headset rendering. Head-tracking technology associated with the headset will enable first-person perspective mediated reality where what is rendered changes with the user's point of view as determined by the headset.
In 3DoF mediated reality, an orientation of the user determined by the headset controls a virtual orientation of a virtual user. There is a correspondence between the user orientation and the virtual orientation such that a change in the user orientation produces the same change in the virtual orientation. The virtual orientation of the virtual user defines a virtual sound scene. The virtual orientation of the virtual user in combination with a virtual field of view can define a virtual visual scene within a virtual visual space. In 3DoF mediated reality, a change in the location of the user does not change the virtual location or virtual orientation of the virtual user.
In the example of 6DoF mediated reality, the situation is as described for 3DoF and in addition it is possible to change the rendered virtual sound scene and any displayed virtual visual scene by movement of a location of the user. For example, there may be a mapping between the location of the user and the virtual location of the virtual user.
A change in the location of the user produces a corresponding change in the virtual location of the virtual user. A change in the virtual location of the virtual user changes the rendered virtual sound scene and can also change the rendered virtual visual scene.
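The 3DoF/6DoF distinction might be captured as below. A real renderer would track orientation with quaternions; plain yaw/pitch/roll angles and a one-to-one location mapping are simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    yaw: float = 0.0    # orientation (tracked in both 3DoF and 6DoF)
    pitch: float = 0.0
    roll: float = 0.0
    x: float = 0.0      # location (tracked only in 6DoF)
    y: float = 0.0
    z: float = 0.0

def update_virtual_pose(virtual: Pose, user_delta: Pose, six_dof: bool) -> Pose:
    """Apply a change in the user's pose to the virtual user's pose."""
    virtual.yaw += user_delta.yaw
    virtual.pitch += user_delta.pitch
    virtual.roll += user_delta.roll
    if six_dof:  # in 3DoF, a change of user location changes nothing
        virtual.x += user_delta.x
        virtual.y += user_delta.y
        virtual.z += user_delta.z
    return virtual
```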
Packet loss management is necessary for mobile radio telecommunication devices because radio links are susceptible to packet loss (corruption, non-delivery, delayed delivery). This means that real-time streaming of audio, especially over radio telecommunication links, will suffer from data loss and thus interruptions in audio quality. Missing audio data typically causes gaps. Gaps can be concealed using gap concealment methods.
The apparatus 10, previously described, comprises: means 14 for generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; means 16 for generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and means 18 for causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
The generation processes generate substitution audio data 32, 34 used for gap concealment.
The generation processes for gap concealment can, for example, include: muting erroneous audio signal parts; repeating previous non-erroneous audio signal over erroneous parts; or predicting replacement audio, for example using machine learning, to replace erroneous parts.
The playback of late-received (delayed) audio data can be sped up to catch up with the correct timing of packets that are not late. Data packets 20 typically have time stamps that indicate when they should be played in relation to each other.
Different error concealment methods are used for the object audio data 22 and the ambience audio data 24 even though the loss/error (missing, corrupted or late packets) is in packet(s) 20 that contain both the object audio data 22 and the ambience audio data 24.
Object audio data 22 can be one of many things depending on the implementation. Object audio data 22 can, for example, represent the nearest or loudest user speech, speech in general, a Lavalier microphone audio or generally a separate microphone signal, any directional sound source, sounds from a pre-defined direction such as a mouth reference point, or anything other than noise-like sounds.
Ambient audio data 24 can be any one or more of: noise-like sounds, all sounds other than speech, all sounds other than the closest or loudest speaker, non-directional sounds (sounds rendered diffusely, for example with reverberation, without a controlled position), sounds from pre-defined directions such as all directions other than a mouth reference point, sounds captured by an ambient microphone, etc.
In IVAS, object audio data 22 and ambient audio data 24 are transmitted in the same packets and therefore packet loss affects both object audio data 22 and ambient audio data 24. However, some concealment methods are better for ambience audio data 24 and some for object audio data 22 because of their continuous and transient natures respectively. Error concealment methods are applied differently to object audio data 22 and ambient audio data 24 to achieve perceptually the least disturbing output audio.
Speech is transient in nature both in time and frequency domains. Ambient sounds are often more continuous meaning that the energy level of ambient sounds typically fluctuates less than in speech.
Transient sounds are tolerant to multiple methods of error concealment whereas continuous sounds suffer especially from muting.
For example, speech is tolerant to (limited) speeding up, which does not affect intelligibility too much if done conservatively. Intelligibility is the most important aspect of speech quality.
For example, muting and speeding up ambient sounds typically works poorly. For example, background hum, traffic noise, rustle of wind, and waterfalls are examples of ambient sounds where muting works poorly. Music and car sounds are examples of ambient sounds where speeding up works poorly because speeding up abruptly changes the rhythm of the music and can modify apparent vehicle speed and/or direction.
Even if a same category of error concealment is chosen for ambient audio data 24 and object audio data 22, it is done differently with different parameters for object audio and ambient sounds.
Examples of parameters include, but are not limited to: processing window length, forgetting factor, and filter coefficients.
In some examples, different concealment methods (muting, copying, prediction, machine learning) with different parameters are performed separately for the object audio data 22 and the ambient audio data 24, using the object audio data 22 and the ambient audio data 24 of previously correctly received packets. It is then possible to identify which concealment method (with which parameters) gives the best result (least error compared to actual received packets) for the object audio data 22. It is also possible to identify which concealment method (with which parameters) gives the best result (least error compared to actual received packets) for the ambient audio data 24.
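The selection idea can be sketched as follows: run each candidate method against frames whose true content is known (from correctly received packets) and keep, per stream, the method with the least error. Mean-squared error and the candidate set are illustrative choices.

```python
import numpy as np

def best_method(methods, previous_frame, actual_frame) -> str:
    """methods: dict name -> function(previous_frame) -> concealed frame.
    Returns the name with the smallest mean-squared error against the
    actually received frame."""
    errors = {name: float(np.mean((fn(previous_frame) - actual_frame) ** 2))
              for name, fn in methods.items()}
    return min(errors, key=errors.get)

# Illustrative candidates (the 0.9 attenuation factor is arbitrary):
candidates = {
    "mute": lambda prev: np.zeros_like(prev),
    "copy": lambda prev: prev,
    "attenuate": lambda prev: 0.9 * prev,
}
```

Running best_method separately on object frames and on ambience frames then yields a per-stream choice, which is the differential behaviour described above.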
Audio characteristics in general can also be taken into account when choosing the best error concealment methods, while still treating the object and ambience signals separately. For example, muting can be used for an object signal whereas mere attenuation can be used for the ambient signal. The level of attenuation may depend on how continuous the ambient signal is. Speeding up can also be used for the ambient signal if it is neither music nor car sounds. The amount of speeding up that can be used for speech is limited, to maintain intelligibility, to approximately twice (2x) the normal rate, whereas ambience can be sped up more if it is noise-like.
FIG 8 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss of a packet shared by the audio data (object audio data 22, ambience audio data 24) where first substitution audio data is used as first audio data (object audio data 22) instead of first audio data (object audio data 22) of the packet 20 determined as lost and second substitution audio data 34 is used as second audio data (ambience audio data 24) instead of second audio data (ambience audio data 24) of the packet 20 determined as lost.
The apparatus 10, previously described, comprises: means 14 for generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; means 16 for generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and means 18 for causing contemporaneous rendering of the first substitution audio data 32 and the second substitution audio data 34.
At block 302 of method 300, a transmitter transmits (Tx) a stream 29 of first audio data 22 and second audio data 24 as a series of N data packets 20. The data packets comprise first audio data 22 and second audio data 24. The data packet 20_n is transmitted at time tn.
At block 304, a data packet 20 is lost in transport. In this example, the packet 20_3 transmitted at time t3 is lost. It is not received (Rx) at the apparatus 10.
At block 306, the apparatus 10 does however receive other packets 20 including preceding and following packets 20. If n packets are lost then N-n packets are received.
At block 308, the apparatus 10 decodes the received packets 20 into BOTH first audio data (object) 22 and second audio data (ambience) 24.
At block 310, data packet loss is concealed in the first audio data (object) 22 using a first concealment method. The apparatus 10 generates first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost.
At block 312, data packet loss is concealed in the second audio data (ambience) 24 using a second concealment method, different to the first concealment method. The apparatus 10 generates second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost. At block 314, the concealed first audio data (object) and the concealed second audio data (ambience) are rendered and combined for playback.
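A minimal sketch of blocks 308 to 314, assuming decoded packets arrive as (object frame, ambience frame) pairs with `None` marking a packet determined as lost; the frame length and the simple additive mix standing in for rendering at block 314 are illustrative assumptions.

```python
import numpy as np

FRAME_LEN = 960  # e.g. 20 ms at 48 kHz (illustrative)

def conceal_stream(packets):
    last_ambience = np.zeros(FRAME_LEN)
    rendered = []
    for packet in packets:
        if packet is None:
            # Blocks 310/312: packet lost -> conceal each type differently.
            obj = np.zeros(FRAME_LEN)   # first method: mute the object
            amb = last_ambience         # second method: repeat the ambience
        else:
            obj, amb = packet           # block 308: normal decode
            last_ambience = amb
        rendered.append(obj + amb)      # block 314: combine for playback
    return np.concatenate(rendered)
```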
FIG 9 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss of a packet shared by the audio data (object audio data 22, ambience audio data 24), where first substitution audio data 32 (muted data) is used as first audio data (object audio data 22) instead of first audio data (object audio data 22) of the packet 20 determined as lost and second substitution audio data 34 (repetition of previous ambience audio data) is used as second audio data (ambience audio data 24) instead of second audio data (ambience audio data 24) of the packet 20 determined as lost.
In this example, the packet 20_3 transmitted at time t3 is not received.
The first substitution audio data 32 is generated as muted audio data.
The second substitution audio data 34 is generated by repetition of the immediately preceding received second audio data 24. Thus, the second audio data 24 received in the packet 20_2 transmitted at time t2 is used instead of the second audio data 24 in the packet 20_3 transmitted at time t3, which was not received.
Repetition of audio data can be replaced with prediction or machine learning based methods.
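For instance, a short autoregressive model fitted to previously received samples can extrapolate into the lost frame instead of repeating it. The sketch below is one such possibility; the model order, the least-squares fit and the frame length are illustrative assumptions rather than a method prescribed above.

```python
import numpy as np

def ar_extrapolate(history, frame_len, order=16):
    """Fit x[t] ~ sum_k a[k] * x[t-k] on received audio (a 1-D array
    longer than `order`), then extrapolate frame_len new samples."""
    rows = np.array([history[t - order:t]
                     for t in range(order, len(history))])
    coeffs, *_ = np.linalg.lstsq(rows, history[order:], rcond=None)
    buf = list(history[-order:])
    for _ in range(frame_len):
        # Predict the next sample from the previous `order` samples.
        buf.append(float(np.dot(coeffs, buf[-order:])))
    return np.array(buf[order:])
```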
The method illustrated comprises: determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and contemporaneous rendering 100_1 of the first substitution audio data 32 and rendering 100_2 of the second substitution audio data 34.
FIG 10 illustrates a method of differential remediation for different types of audio data (object audio data 22, ambience audio data 24) lost as a consequence of a loss (delay) of a packet shared by the audio data (object audio data 22, ambience audio data 24), where, while the packet 20 determined as lost (delayed) remains delayed, first substitution audio data 32 (muted data) is used as first audio data (object audio data 22) instead of first audio data (object audio data 22) of that packet, and second substitution audio data 34 (repetition of previous ambience audio data) is used as second audio data (ambience audio data 24) instead of second audio data (ambience audio data 24) of that packet.
The first generation process is configured to generate first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of a delayed packet 20 until the first audio data 22 of a delayed packet 20 is available for rendering. The apparatus 10 is then configured to render the first audio data 22 of the delayed packet 20 at a rate faster than a normal rendering rate but does not speed up rendering of the second audio data 24 of the delayed packet 20.
In this example, the packet 20_3 transmitted at time t3 is received but with a delay. There is therefore a period of time during which the packet 20_3 is lost. This period extends from when the packet 20_3 was expected to be received until it was actually received. This is a 'loss' time period.
There is therefore also a period of time after the delayed packet 20_3 is received and before the next packet 20_4 is expected to be received. This is a 'catch-up' time period.
The first substitution audio data 32 is generated as muted audio data during the loss time period. The first audio data 22 of the delayed packet 20_3 is played back at a faster speed during the 'catch-up' time period.
The second substitution audio data 34 is generated by repetition of the immediately preceding received second audio data 24. Thus, during the loss time period, the second audio data 24 received in the packet 20_2 transmitted at time t2 is used instead of the second audio data 24 in the packet 20_3 transmitted at time t3, which had not yet been received. The second audio data 24 of the delayed packet 20_3 is played back at normal speed during the 'catch-up' time period.
Thus "repetition" is used for ambience (the second audio data 24) and "muting and speed-up" is used for object audio data (the first audio data 22).
Repetition of audio data can be replaced with prediction or machine learning based methods.
The method illustrated comprises: determining a loss, in transport, of a packet 20 comprising first audio data 22 and second audio data 24; generating, using a first generation process, first substitution audio data 32 to be used as first audio data 22 instead of first audio data 22 of the packet 20 determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data 34 to be used as second audio data 24 instead of second audio data 24 of the packet 20 determined as lost; and contemporaneous rendering 100_1 of the first substitution audio data 32 and rendering 100_2 of the second substitution audio data 34.
References to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc. As used in this application, the term 'circuitry' may refer to one or more or all of the following: (a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory or memories that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (for example, firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
The blocks illustrated in the accompanying Figs may represent steps in a method and/or sections of code in the computer program 406. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied.
Furthermore, it may be possible for some blocks to be omitted.
Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
The systems, apparatus, methods, and computer programs may use machine learning which can include statistical learning. Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. The computer learns from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. The computer can often learn from prior training data to make predictions on future data. Machine learning includes wholly or partially supervised learning and wholly or partially unsupervised learning. It may enable discrete outputs (for example classification, clustering) and continuous outputs (for example regression). Machine learning may, for example, be implemented using different approaches such as cost function minimization, artificial neural networks, support vector machines and Bayesian networks. Cost function minimization may, for example, be used in linear and polynomial regression and K-means clustering. Artificial neural networks, for example with one or more hidden layers, model complex relationships between input vectors and output vectors. Support vector machines may be used for supervised learning. A Bayesian network is a directed acyclic graph that represents the conditional independence of a number of random variables.
As used here 'module' refers to a unit or apparatus that excludes certain parts/components that would be added by an end manufacturer or a user. The controller 400 can be a module. The apparatus 10 can be a module.
The above-described examples find application as enabling components of: automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
The apparatus can be provided in an electronic device, for example, a mobile terminal, according to an example of the present disclosure. It should be understood, however, that a mobile terminal is merely illustrative of an electronic device that would benefit from examples of implementations of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure to the same. While in certain implementation examples, the apparatus can be provided in a mobile terminal, other types of electronic devices, such as, but not limited to: mobile communication devices, hand portable electronic devices, wearable computing devices, portable digital assistants (PDAs), pagers, mobile computers, desktop computers, televisions, gaming devices, laptop computers, cameras, video recorders, GPS devices and other types of electronic systems, can readily employ examples of the present disclosure. Furthermore, devices can readily employ examples of the present disclosure regardless of their intent to provide mobility.
The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to "comprising only one..." or by using "consisting".
In this description, the wording 'connect', 'couple' and 'communication' and their derivatives mean operationally connected/coupled/in communication. It should be appreciated that any number or combination of intervening components can exist (including no intervening components), i.e., so as to provide direct or indirect connection/coupling/communication. Any such intervening components can include hardware and/or software components.
As used herein, the term "determine/determining" (and grammatical variants thereof) can include, not least: calculating, computing, processing, deriving, measuring, investigating, identifying, looking up (for example, looking up in a table, a database or another data structure), ascertaining and the like. Also, "determining" can include receiving (for example, receiving information), accessing (for example, accessing data in a memory), obtaining and the like. Also, "determine/determining" can include resolving, selecting, choosing, establishing, and the like.
In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can' or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
The term 'a', 'an' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/an/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a', 'an' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning, but the absence of these terms should not be taken to imply any exclusive meaning.
The presence of a feature (or combination of features) in a claim is a reference to that feature (or combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way, to achieve substantially the same result.
In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
The above description describes some examples of the present disclosure however those of ordinary skill in the art will be aware of possible alternative structures and method features which offer equivalent functionality to the specific examples of such structures and features described herein above and which for the sake of brevity and clarity have been omitted from the above description. Nonetheless, the above description should be read as implicitly including reference to such alternative structures and method features which provide equivalent functionality unless such alternative structures or method features are explicitly excluded in the above description of the examples of the present disclosure.
Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.
I/we claim:
Claims (18)
- 1. An apparatus comprising: means for determining a loss of a packet comprising first audio data and second audio data; means for generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; means for generating, using a second generation process different to the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and means for causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
- 2. An apparatus as claimed in claim 1, wherein the packet is a packet communicated over a radio interface and the packet comprises a first data structure for transporting the first audio data, and a second data structure for transporting the second audio data.
- 3. An apparatus as claimed in claim 2, wherein the first data structure comprises metadata for positioning a source of the first audio data in three-dimensions.
- 4. An apparatus as claimed in any preceding claim, wherein the means for determining a loss of a packet comprising first audio data and second audio data comprise means for determining that an expected packet has not arrived, and/or means for determining that an expected packet has arrived late, and/or means for determining that an expected packet has arrived corrupted.
- 5. An apparatus as claimed in any preceding claim, wherein the means for generating, using the first generation process, first substitution audio data to be used as first audio data instead of first audio data comprises: means for generating, using the first generation process, first substitution audio data to be used as first audio data instead of first audio data of a delayed packet until the first audio data of the delayed packet is available for rendering.
- 6. An apparatus as claimed in claim 5, comprising means for rendering the first audio data of the delayed packet at a rate faster than a normal rendering rate at which rendering occurs without packet delay.
- 7. An apparatus as claimed in any preceding claim, wherein the first generation process produces audio data having audio characteristics similar to expected audio characteristics of the first audio data of a transferred packet, wherein the expected audio characteristics of the first audio data are those expected based on analysis of previous audio characteristics of the first audio data; and wherein the second generation process produces audio data having audio characteristics similar to expected audio characteristics of the second audio data of the transferred packet, wherein the expected audio characteristics of the second audio data are those expected based on analysis of previous audio characteristics of the second audio data.
- 8. An apparatus as claimed in any preceding claim, wherein the first generation process is dependent upon first audio data of one or more previously received packets, and/or wherein the first generation process is dependent upon a classification of first audio data of one or more previously received packets, and/or wherein the first generation process is dependent upon audio energy density across the frequency domain and/or the time domain of first audio data of one or more previously received packets.
- 9. An apparatus as claimed in any preceding claim, wherein the first generation process is dependent upon classification of first audio data of previously received packets as speech.
- 10. An apparatus as claimed in any preceding claim, wherein the second generation process is dependent upon second audio data of one or more previously received packets, and/or wherein the second generation process is dependent upon a classification of second audio data of one or more previously received packets.
- 11. An apparatus as claimed in any preceding claim, wherein the second generation process is dependent upon classification of second audio data of previously received packets as ambient audio.
- 12. An apparatus as claimed in any preceding claim, wherein the first generation process is one of: muting and speeding up.
- 13. An apparatus as claimed in any preceding claim, wherein the second generation process is one of: attenuation, replication or prediction.
- 14. An apparatus as claimed in any preceding claim, wherein the first audio data is discontinuous in the time domain and/or frequency domain and the second audio data is continuous in the time domain and/or frequency domain; and/or wherein the first audio data is speech and the second audio data is not speech; and/or wherein the first audio data is directional audio and the second audio data is not directional audio; and/or wherein the first audio data is a point source and the second audio data is a diffuse ambient source; and/or wherein the first audio data comprises larger scale energy fluctuations in the time domain and/or the frequency domain compared to the second audio data and the second audio data comprises random smaller scale energy fluctuations in the time domain and/or the frequency domain compared to the first audio data; and/or wherein the first audio data is a foreground or proximal audio source and the second audio data is a background or distal audio source; and/or wherein the first audio data is an audio source of higher energy than the second audio data.
- 15. An apparatus as claimed in any preceding claim, comprising means for receiving for immediate rendering a stream of packets comprising first audio data and second audio data.
- 16. An apparatus as claimed in any preceding claim, configured as a mobile apparatus, and comprising a radio transceiver for receiving packets comprising first audio data and second audio data.
- 17. A method comprising: determining a loss of a packet comprising first audio data and second audio data; generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
- 18. A computer program comprising instructions that when run on one or more processors of an apparatus cause: determining a loss of a packet comprising first audio data and second audio data; generating, using a first generation process, first substitution audio data to be used as first audio data instead of first audio data of the packet determined as lost; generating, using a second generation process independent of the first generation process, second substitution audio data to be used as second audio data instead of second audio data of the packet determined as lost; and causing contemporaneous rendering of the first substitution audio data and the second substitution audio data.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2309834.6A GB2631410A (en) | 2023-06-29 | 2023-06-29 | Rendering of transported audio data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB202309834D0 GB202309834D0 (en) | 2023-08-16 |
| GB2631410A true GB2631410A (en) | 2025-01-08 |
Family
ID=87556805
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB2309834.6A Pending GB2631410A (en) | 2023-06-29 | 2023-06-29 | Rendering of transported audio data |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2631410A (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2015003027A1 (en) * | 2013-07-05 | 2015-01-08 | Dolby International Ab | Packet loss concealment apparatus and method, and audio processing system |
| WO2020126120A1 (en) * | 2018-12-20 | 2020-06-25 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and apparatus for controlling multichannel audio frame loss concealment |
| WO2022008571A2 (en) * | 2020-07-08 | 2022-01-13 | Dolby International Ab | Packet loss concealment |