RU2023120601A

RU2023120601A - OPTIMIZING SOUND DELIVERY FOR VIRTUAL REALITY APPS

Info

Publication number: RU2023120601A
Application number: RU2023120601A
Authority: RU
Inventors: Adrian MURTAZA; Harald FUCHS; Bernd CZELHAN; Jan PLOGSTIES; Matteo AGNELLI; Ingo HOFMANN
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2017-10-12
Filing date: 2023-08-07
Publication date: 2025-02-07

Claims

1. A system (102) for receiving audio streams to be reproduced, comprising:

at least one audio decoder (104) configured to decode audio signals (108) from at least one audio stream (106) and/or one adaptation set,

wherein the system (102) is configured to request (112) at least one audio stream (106), and/or one audio element in the audio stream, and/or one adaptation set, and/or one audio element in the adaptation set based on at least current user movement data, and/or user interaction metadata, and/or user position data (110),

wherein at least two audio scenes are defined, wherein at least one first and second audio element is associated with the first audio scene and at least one third audio element is associated with the second audio scene,

wherein the system is configured to receive metadata describing that the at least one second audio element is additionally associated with the second audio scene,

wherein the system is configured to receive the at least one first and second audio elements when the user's position is associated with the first audio scene,

wherein the system is configured to receive the at least one second and third audio elements when the user's position is associated with the second audio scene, and

wherein the system is configured to receive at least the first, second and third audio elements in the event of a transition between the first audio scene and the second audio scene.
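As a non-limiting illustration of the selection logic recited in claim 1, the following is a minimal sketch, assuming a simple in-memory mapping from scenes to audio elements. The names `SCENE_ELEMENTS`, `scene_1`, `scene_2` and `elements_to_request` are hypothetical and do not appear in the claims.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical metadata: the audio elements each audio scene needs.
# The element "second" is associated with both scenes, as the received
# metadata would describe.
SCENE_ELEMENTS: Dict[str, FrozenSet[str]] = {
    "scene_1": frozenset({"first", "second"}),
    "scene_2": frozenset({"second", "third"}),
}


def elements_to_request(
    current_scene: str, target_scene: Optional[str] = None
) -> FrozenSet[str]:
    """Return the audio elements to request for the user's position.

    When target_scene is given and differs from current_scene, the user is
    treated as being in a transition, and the union of both scenes' elements
    is returned.
    """
    needed = SCENE_ELEMENTS[current_scene]
    if target_scene is not None and target_scene != current_scene:
        needed = needed | SCENE_ELEMENTS[target_scene]
    return needed
```

In this sketch the shared element is requested only once during a transition, because the union of the two sets contains it a single time.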

2. The system according to claim 1, configured to submit a request with current user movement data, and/or user interaction metadata, and/or user position data (110) for receiving at least one audio stream (106), and/or one audio element of the audio stream, and/or one adaptation set, and/or one audio element of the adaptation set.

3. A system according to any of the preceding claims, configured to:

receive information about current user movement data, and/or user interaction metadata, and/or user position data, and/or any information characterizing changes caused by user actions, and

receive information about the availability of adaptation sets and information describing the association of at least one adaptation set with at least one scene, and/or movement data, and/or interaction metadata, and/or position data.

4. The system according to any of the preceding claims, configured to decide whether to reproduce at least one audio element of the audio stream and/or one adaptation set for the current user movement data and/or user interaction metadata and/or user position, wherein the system is configured to receive this at least one audio element at the current virtual position of the user.

5. A system according to any of the preceding claims, configured to predict whether at least one audio element (152) of an audio stream and/or one adaptation set will become relevant and/or audible, based on at least current user movement data and/or user interaction metadata and/or position (110),

wherein the system is configured to receive this at least one audio element and/or audio stream and/or adaptation set at a specific virtual position of the user prior to the predicted movement and/or interaction of the user,

wherein the system is configured to reproduce at least one audio element and/or audio stream and/or one adaptation set and/or one audio element in the adaptation set, upon its reception, in the said specific virtual position of the user after the user's movement and/or interaction.
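As a non-limiting illustration of the prediction recited in claim 5, the following is a minimal sketch, assuming linear extrapolation of the user's position and a fixed audibility radius per audio element. The claim does not prescribe a prediction method; `AudioElement`, `audible_radius`, `predict_position` and `elements_to_prefetch` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


@dataclass(frozen=True)
class AudioElement:
    name: str
    position: Vec
    audible_radius: float  # assumption: element is audible within this distance


def predict_position(position: Vec, velocity: Vec, lookahead_s: float) -> Vec:
    """Linearly extrapolate the user's position (an assumed predictor)."""
    return (
        position[0] + velocity[0] * lookahead_s,
        position[1] + velocity[1] * lookahead_s,
    )


def elements_to_prefetch(
    elements: Sequence[AudioElement],
    position: Vec,
    velocity: Vec,
    lookahead_s: float,
) -> List[str]:
    """Names of elements not audible now but predicted to become audible."""
    predicted = predict_position(position, velocity, lookahead_s)
    result = []
    for element in elements:
        audible_now = math.dist(position, element.position) <= element.audible_radius
        audible_later = math.dist(predicted, element.position) <= element.audible_radius
        if audible_later and not audible_now:
            result.append(element.name)
    return result
```

Elements returned by `elements_to_prefetch` would be received before the movement is completed and reproduced once the user reaches the predicted position.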

6. The system according to any of the preceding claims, configured to receive at least one audio element (152) with a lower bit rate and/or quality level at the virtual position of the user before the user's movement and/or interaction, wherein the system is configured to receive this at least one audio element with a higher bit rate and/or quality level at the virtual position of the user after said movement and/or interaction of the user.

7. The system of any of the preceding claims, wherein each audio element is associated with a position associated with a scene, and the system is configured to receive streams with a higher bit rate and/or quality level for audio elements closer to the user than for audio elements further from the user.

8. A system according to any of the preceding claims, wherein at least one audio element is associated with a position associated with a scene,

wherein the system is configured to request different streams with different bit rates and/or quality levels for audio elements based on their relevance and/or audibility level at each user position,

wherein the system is configured to request an audio stream and/or one adaptation set with a higher bit rate and/or quality level for audio elements that are more relevant and/or better heard in the current virtual position of the user, and/or an audio stream and/or one adaptation set with a lower bit rate and/or quality level for audio elements that are less relevant and/or worse heard in the current virtual position of the user.
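As a non-limiting illustration of the relevance-based requests recited in claims 7 and 8, the following is a minimal sketch, assuming that audibility falls off linearly with distance and that three representations are available. The bit rates, the thresholds and the names `audibility` and `select_bitrate_kbps` are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def audibility(user_pos: Vec, element_pos: Vec, max_distance: float) -> float:
    """Assumed audibility level in [0, 1]: 1 at the element, 0 at max_distance."""
    distance = math.dist(user_pos, element_pos)
    return max(0.0, 1.0 - distance / max_distance)


def select_bitrate_kbps(level: float) -> int:
    """Map an audibility level to one of three assumed representations."""
    if level >= 0.66:
        return 256  # more relevant and/or better audible: higher bit rate
    if level >= 0.33:
        return 128
    return 64  # less relevant and/or less audible: lower bit rate
```

A client following this sketch would re-evaluate `select_bitrate_kbps` for each audio element whenever the user's virtual position changes.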

9. The system of any of the preceding claims, wherein each audio element is associated with a position and/or region in the environment associated with the first or second audio scene, and the system is configured to periodically send a request with current user movement data, and/or user interaction metadata, and/or user position data (110), so that:

for a first position, a stream with a higher bit rate and/or quality level is received, and

for a second position, a stream with a lower bit rate and/or quality level is received,

wherein the first position is closer to said at least one audio element (152) than the second position.

10. A system according to any of the preceding claims, in which first streams associated with a first, current audio scene are provided, and, in the event of the user moving to a second, distant audio scene, both streams associated with the first audio scene and second streams associated with the second audio scene are provided.

11. A system according to any of the preceding claims, in which

first streams associated with the first audio scene are provided for playing the first audio scene if the position or virtual position of the user is associated with the first audio scene,

second streams associated with the second audio scene are provided to reproduce the second audio scene if the position or virtual position of the user is associated with the second audio scene, and

both first streams associated with the first audio scene and second streams associated with the second audio scene are provided in the event of a transition of the position or virtual position of the user between the first audio scene and the second audio scene.

12. A system according to any of the preceding claims, in which a plurality of scenes (150A, 150B) are defined,

wherein the system is configured to receive first streams associated with the first audio scene for reproducing the first audio scene if the user's position is associated with the first audio scene,

wherein the system is configured to receive second streams associated with the second audio scene for reproducing the second audio scene if the user's virtual position is associated with the second audio scene, and

wherein the system is configured to receive both the first streams associated with the first audio scene and the second streams associated with the second audio scene if the user's virtual position is in a transition position (150AB) between the first audio scene and the second audio scene.

13. The system according to any one of claims 10-12, wherein

the first streams associated with the first audio scene are received with a higher bitrate and/or quality level when the user position is associated with the first audio scene,

whereas the second streams associated with the second audio scene are received at a lower bit rate and/or quality level when the user is at the beginning of the transition from the first audio scene to the second audio scene, and

the first streams associated with the first audio scene are received with a lower bit rate and/or quality level, and the second streams associated with the second audio scene are received with a higher bit rate and/or quality level when the user moves from the first audio scene to the second audio scene,

wherein said lower bitrate and/or quality level is lower than said higher bitrate and/or quality level.
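As a non-limiting illustration of the quality levels recited in claim 13, the following is a minimal sketch, assuming that the transition is described by a progress value between 0 (user in the first audio scene) and 1 (user in the second audio scene). The progress value, the 0.5 switch-over point, the behaviour at progress 1 and the name `stream_quality` are assumptions and are not recited in the claim.

```python
from typing import Dict, Optional


def stream_quality(progress: float) -> Dict[str, Optional[str]]:
    """Quality level requested for the first and second scene's streams.

    None means the streams of that scene are not requested at all.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be within [0, 1]")
    if progress == 0.0:
        # User position is associated with the first audio scene only.
        return {"first": "high", "second": None}
    if progress < 0.5:
        # Beginning of the transition: second scene arrives at lower quality.
        return {"first": "high", "second": "low"}
    if progress < 1.0:
        # User moves on toward the second scene: the qualities are swapped.
        return {"first": "low", "second": "high"}
    # Assumption: once the transition is complete only the second scene is kept.
    return {"first": None, "second": "high"}
```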

14. The system of any one of the preceding claims, wherein a plurality of N audio elements is defined, and, if the user's distance to a position or region of these audio elements is greater than a predetermined threshold, the N audio elements are processed to obtain a smaller number M of audio elements (M<N) associated with a position or region close to the position or region of the N audio elements, so as to:

provide the system with at least one audio stream or one adaptation set associated with the N audio elements if the user's distance to the position or region of the N audio elements is less than the predetermined threshold, or

provide the system with at least one audio stream or one adaptation set associated with the M audio elements if the user's distance to the position or region of the N audio elements is greater than the predetermined threshold.
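As a non-limiting illustration of the threshold decision recited in claim 14, the following is a minimal sketch, assuming that the region of the N audio elements is summarized by the centroid of their positions. The centroid, the labels `"N_elements"` and `"M_elements"`, and the function names are hypothetical; the claim leaves the case of a distance exactly equal to the threshold open, and this sketch treats it as far.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float]


def centroid(positions: Sequence[Vec]) -> Vec:
    """Mean position of the N audio elements (an assumed region summary)."""
    count = len(positions)
    return (
        sum(p[0] for p in positions) / count,
        sum(p[1] for p in positions) / count,
    )


def choose_adaptation_set(
    user_pos: Vec, element_positions: Sequence[Vec], threshold: float
) -> str:
    """Pick the full N-element set when near, the reduced M-element set when far."""
    distance = math.dist(user_pos, centroid(element_positions))
    return "N_elements" if distance < threshold else "M_elements"
```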

15. The system (102) according to any of the preceding claims, wherein at least one scene of the environment is associated with at least one set of N audio elements, N≥2, wherein each audio element is associated with a position and/or region in the environment,

wherein at least said at least one plurality of N audio elements is provided in at least one representation with a high bit rate and/or quality level,

wherein at least said at least one plurality of N audio elements is provided in at least one representation with a low bit rate and/or quality level, where this at least one representation is obtained by processing the N audio elements to obtain a smaller number M of audio elements (M<N) associated with a position or region close to the position or region of the N audio elements,

wherein the system is configured to request a higher bit rate and/or quality level representation for the audio elements if the audio elements are more relevant and/or better audible at the user's current virtual position in the scene,

wherein the system is configured to request a representation with a lower bit rate and/or quality level for the audio elements if the audio elements are less relevant and/or less audible at the user's current virtual position in the scene.

16. The system of claim 14 or 15, wherein if the user distance and/or relevance and/or audibility level and/or angular orientation are below a predetermined threshold, then different streams are received for different audio elements.

17. The system of any one of the preceding claims, wherein the system is configured to receive streams based on user orientation, and/or user direction of movement, and/or user interactions in the scene.

18. A system according to any one of the preceding claims, configured to receive first audio streams or first adaptation sets and second audio streams, wherein the first audio elements in the first audio streams or first adaptation sets are more relevant and/or better audible than the second audio elements in the second audio streams or second adaptation sets, wherein the first audio streams or first adaptation sets are requested and/or received at a higher bit rate and/or quality level than the bit rate and/or quality level of the second audio streams or second adaptation sets.

19. The system of any one of the preceding claims, wherein at least one first audio element is provided in at least one first audio stream and/or adaptation set, at least one second audio element is provided in at least one second audio stream and/or adaptation set, and at least one third audio element is provided in at least one third audio stream and/or adaptation set, wherein at least the first audio scene is described by metadata as a complete scene that requires at least the first and second audio streams and/or adaptation sets, wherein the second audio scene is described by metadata as an incomplete scene that requires at least the third audio stream and/or adaptation set and at least the second audio stream and/or adaptation set associated with at least the first audio scene,

wherein the system comprises a metadata processor configured to operate on metadata to enable the second audio stream belonging to the first audio scene and the third audio stream associated with the second audio scene to be combined into a new single stream if the user's position is associated with the second audio scene.

20. The system according to any one of the preceding claims, wherein the system comprises a metadata processor configured to operate on metadata in at least one audio stream before the at least one audio decoder based on current user movement data and/or user interaction metadata and/or user position data.

21. The system of claim 20, wherein the metadata processor is configured to enable and/or disable at least one audio element in at least one audio stream or adaptation set before the at least one audio decoder based on current user movement data and/or user interaction metadata and/or user position data, wherein

the metadata processor is configured to disable at least one audio element in at least one audio stream or adaptation set before the at least one audio decoder if the system decides that this audio element no longer needs to be reproduced as a result of current user movement data and/or user interaction metadata and/or user position data, and wherein

the metadata processor is configured to enable at least one audio element in at least one audio stream or adaptation set before the at least one audio decoder if the system decides that this audio element needs to be reproduced as a result of current user movement data and/or user interaction metadata and/or user position data.
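As a non-limiting illustration of the metadata processor recited in claims 20 and 21, the following is a minimal sketch, assuming that each audio element carries an activity flag in the stream metadata and that the decoder skips inactive elements. `ElementMetadata`, `process_metadata` and the stand-in `decode` function are hypothetical and do not model any real bitstream syntax.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ElementMetadata:
    name: str
    active: bool  # assumed per-element enable flag carried in the metadata


def process_metadata(
    stream_metadata: Iterable[ElementMetadata], needed: Set[str]
) -> List[ElementMetadata]:
    """Enable the elements that need to be reproduced and disable the others.

    This runs before the decoder, so the decoder never sees disabled elements
    as active.
    """
    return [replace(meta, active=(meta.name in needed)) for meta in stream_metadata]


def decode(stream_metadata: Iterable[ElementMetadata]) -> List[str]:
    """Stand-in for the audio decoder: returns the names it would decode."""
    return [meta.name for meta in stream_metadata if meta.active]
```

The set `needed` would be derived from the current user movement data, interaction metadata and/or position data; how it is derived is outside this sketch.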

22. The system of any one of the preceding claims, configured to disable decoding of audio elements selected based on current user movement data and/or user interaction metadata and/or user position data.

23. The system of any one of the preceding claims, configured to disable decoding and/or playback of at least one stream based on metadata associated with the at least one stream and based on current user movement data and/or user interaction metadata and/or user position data.

24. The system of any of the preceding claims, further configured to operate on metadata associated with the group of selected audio streams based on at least current or estimated user movement data and/or user interaction metadata and/or user position data, in order to:

select and/or enable and/or activate the audio elements that make up the audio scene to be played; and/or

ensure that all selected audio streams are combined into a single audio stream.

25. A system according to any of the preceding claims, wherein information is received for each audio element or audio object, wherein the information includes descriptive information about the locations at which the sound scene or audio elements are active.

26. A system according to any of the preceding claims, configured to create or use at least adaptation sets so that:

multiple adaptation sets are associated with a single audio scene; and/or

additional information is provided that relates each adaptation set to one audio scene; and/or

additional information is provided, which may include:

information about the boundaries of one audio scene, and/or

information about the relationship between one adaptation set and one audio scene (e.g., an audio scene is encoded into three streams, which are contained in three adaptation sets), and/or

information about the relationship between audio scene boundaries and the set of adaptation sets.
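As a non-limiting illustration of the additional information recited in claim 26, the following is a minimal sketch, assuming that each audio scene has an axis-aligned rectangular boundary and a list of adaptation set identifiers. `SceneInfo`, the boundary encoding and the identifiers used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SceneInfo:
    scene_id: str
    # Assumed boundary encoding: (x_min, y_min, x_max, y_max).
    bounds: Tuple[float, float, float, float]
    # Adaptation sets associated with this scene (several per scene allowed).
    adaptation_sets: Tuple[str, ...]

    def contains(self, x: float, y: float) -> bool:
        x_min, y_min, x_max, y_max = self.bounds
        return x_min <= x <= x_max and y_min <= y <= y_max


def adaptation_sets_for_position(
    scenes: Sequence[SceneInfo], x: float, y: float
) -> List[str]:
    """Adaptation sets of every scene whose boundary contains the position."""
    result: List[str] = []
    for scene in scenes:
        if scene.contains(x, y):
            for set_id in scene.adaptation_sets:
                if set_id not in result:  # a shared set is listed only once
                    result.append(set_id)
    return result
```

In this sketch a position inside two overlapping boundaries stands for a transition, so the adaptation sets of both scenes are returned.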

27. The system according to any of the preceding claims, wherein the system is further configured to:

receive at least one first adaptation set comprising at least one audio stream associated with at least one first audio scene;

receive at least one second adaptation set comprising at least one second audio stream associated with at least two audio scenes, including said at least one first audio scene; and

provide for joining at least one first audio stream and at least one second audio stream into a new audio stream for decoding based on metadata available regarding current user movement data and/or user interaction metadata and/or user position data and/or information describing an association of at least one first adaptation set with at least one first audio scene and/or an association of at least one second adaptation set with at least one first audio scene.
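As a non-limiting illustration of joining streams as recited in claim 27, the following is a minimal sketch, assuming that each audio stream is a time-ordered list of packets and that joining means interleaving the packets by timestamp. The `Packet` layout and the name `merge_streams` are hypothetical; an actual implementation would follow the packet syntax of the codec in use.

```python
import heapq
from typing import List, Sequence, Tuple

# Assumed packet layout: (timestamp_ms, element_name, payload).
Packet = Tuple[int, str, bytes]


def merge_streams(*streams: Sequence[Packet]) -> List[Packet]:
    """Interleave several time-ordered streams into one time-ordered stream."""
    # heapq.merge requires each input to be sorted by the same key already.
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))
```

The single merged stream can then be handed to one audio decoder instead of running one decoder per received stream.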

28. A system according to any of the preceding claims, configured to:

decide whether to reproduce at least one audio element from at least one audio scene embedded in at least one stream and at least one additional audio element from at least one additional audio scene embedded in at least one additional stream; and

cause, if the decision is positive, an operation of joining, or composing, or multiplexing, or superimposing, or combining said at least one additional stream of the additional audio scene with said at least one stream of the at least one audio scene.

29. The system according to any of the preceding claims, configured to operate on metadata associated with the selected audio streams based on at least current user movement data and/or user interaction metadata and/or user position data, in order to:

select and/or enable and/or activate the audio elements that make up the audio scene that is decided to be reproduced; and/or

ensure that all selected audio streams are combined into a single audio stream.

30. A method for receiving audio streams for playback, comprising the steps of:

decoding audio signals from audio streams; and

requesting and/or receiving at least one audio stream based on current user movement data and/or user interaction metadata and/or user position data;

wherein the method includes receiving metadata describing that at least one second audio element is further associated with the second audio scene,

wherein the method includes requesting and/or receiving at least one first and second audio elements when the user's position is associated with the first audio scene,

wherein the method includes requesting and/or receiving at least one second and third audio elements when the user's position is in the second audio scene, and

wherein the method includes requesting and/or receiving at least the first, second and third audio elements in the event of a transition between the first audio scene and the second audio scene.

31. A computer program containing instructions that, when executed by a processor, cause the processor to perform the method of claim 30.