
WO2025073355A1 - Selective muting of sounds in extended reality spaces - Google Patents

Selective muting of sounds in extended reality spaces

Info

Publication number
WO2025073355A1
WO2025073355A1 (PCT/EP2023/077297)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
mute
unmute
stream
indicia
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/077297
Other languages
French (fr)
Inventor
Pex TUFVESSON
Michael Björn
Niklas LINDSKOG
Daniel Lindström
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB filed Critical Telefonaktiebolaget LM Ericsson AB
Priority to PCT/EP2023/077297
Publication of WO2025073355A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/157Conference systems defining a virtual conference space and using avatars or agents
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path

Definitions

  • the present disclosure relates to rendering extended reality (XR) spaces and associated XR devices, and more particularly to streaming of video and audio between XR devices and XR servers.
  • XR extended reality
  • Some embodiments disclosed herein are directed to a method by an XR system for rendering XR environments through XR devices to participants.
  • the method includes obtaining at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment, and obtaining at least one video stream from at least one camera capturing objects in the first real environment.
  • the method further includes rendering an XR space containing virtual representations of the objects for display through a display associated with a first XR device, and mapping a first audio component in the at least one audio stream to a first object in the at least one video stream.
  • Some other related embodiments are directed to another XR system for rendering XR environments through XR devices to participants.
  • the XR system is operative to obtain at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment, and obtain at least one video stream from at least one camera capturing objects in the first real environment.
  • the XR system is operative to render an XR space containing virtual representations of the objects for display through a display associated with a first XR device.
  • the XR system is operative to map a first audio component in the at least one audio stream to a first object in the at least one video stream.
  • the XR system is operative to provide the at least one audio stream for playout through a speaker associated with the first XR device.
  • the XR system is operative to render a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, the XR system is operative to selectively mute or unmute the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
  • Some embodiments of the present disclosure are directed to improving how audio content is shared in computer-generated Venn rooms.
  • Each participant's real environment can include one or more microphones from which sound can be processed to determine directionality to the sound source and which may distinguish between different types of sound sources.
  • the directionality and/or types of sound sources can be characterized as profile information that is included with the audio streams, such as in metadata sent to the XR server.
  • the XR server can operate to generate the shared space (e.g., Venn room) of the XR meeting session by sharing sounds in the participants’ shared spaces with inclusion of some sound sources which were selected for inclusion by respective participant(s) and exclusion of some other sound sources which were selected for exclusion by respective participant(s).
  • the XR server also combines virtual representations of the real objects captured by the cameras for display on XR devices of respective participants.
  • Potential advantages can include enabling a higher level of immersion, where participants can be invited to another person’s broadcast “1-to-many” while sharing the sounds of the shared space.
  • In a “many-to-many” shared space, e.g., Venn room, the higher level of immersion is enabled also from the shared acoustical environment, while enabling keeping the noise level down by participants selectively including or excluding sound sources (auditory elements) in their own broadcasts.
  • the first and second XR devices 150 can communicate with the XR server 100 through one or more wired or wireless communication networks 160.
  • the XR server 100 includes a network interface 140 operative to communicate with XR devices via the networks 160, and at least one processing circuitry 110 ("processor") and at least one memory circuit 120 ("memory").
  • the memory 120 stores an XR application 122 and an object sound muting application 132 containing instructions executable by the processor to perform operations in accordance with embodiments disclosed herein.
  • the XR server 100 may include a display 142, user interface, and other elements.
  • although the XR devices 150 are illustrated in Figure 1 as headsets, they may be wearable in other ways or not configured to be wearable.
  • the XR devices 150 may be a single integrated device or a collection of camera(s) 178 operative to provide video stream(s) (e.g., through network interface 175) capturing observed objects in an environment, microphone(s) 180 operative to provide audio stream(s) sensed in the environment, display(s) 174 operative to display video stream(s) of the XR space from the XR server 100, speaker(s) 176 operative to play out audio stream(s) of the XR space from the XR server 100, processor(s), and communication circuit(s) operative to communicate with the XR server 100.
  • the XR devices 150 may operate in an AR mode, a VR mode, or a combined VR and AR mode.
  • the display 174 may be configured as a see-through display (e.g., see-through LCD screen, reciprocal mirror light transparent on one side and light reflective on the other side, etc.).
  • the XR devices 150 include at least one processing circuitry 170 ("processor") and at least one memory circuit 172 ("memory").
  • the memory stores an XR application and may store an object sound muting application containing instructions executable by the processor 170 to perform operations in accordance with embodiments disclosed herein.
  • XR application 122 creates immersive XR spaces for the participants by sending first audio and video streams to the first XR device 150 for video display through displays and for audio play out through speakers, and by sending second audio and video streams to the second XR device 150 for display and audio play out.
  • the XR application 122 may be an AR application, a VR application, a combined AR and VR application, etc.
  • the first and second XR devices 150 may each map the audio components to the corresponding objects in their environment from which the sound originated.
  • the determined mapping of audio components to the corresponding objects can be indicated through metadata sent with the audio and video XR streams.
  • the mapping may correspond to indicating directionality of the incident sound in metadata provided with the audio stream to the XR server 100.
  • Incident sound directionality may be determined using a directional microphone and performing operations to determine the direction to the object creating the sound based on how sound gain changes with tracked orientation (pose) of the microphone.
  • the incident sound directionality may be determined using an array of microphones and determining from phasing of sound signals and relative positions of the microphones the direction from the XR device to the object creating the sound.
  • the operations anchor the rendering of the first mute-unmute indicia to remain proximately located to a rendering of the first object when displayed through the first XR device and/or to remain proximately located to a defined area in a room.
  • the first mute-unmute indicia can remain proximately located to the rendering of the first object, and rotate out of the field-of-view of the participant as the first object similarly rotates out of the field-of-view.
  • the participant may define an area in a room where sound arriving therefrom can be selectively muted in the outgoing stream.
  • Figure 3 illustrates an example participant's VR view through one of the XR devices 150 displaying a pair of selectable mute/unmute indicia adjacent to the corresponding sound creating objects, and which can be selected to mute/unmute the corresponding audio components in accordance with some embodiments.
  • the XR server 100 has determined that the television corresponds to another particular audio component (audio of televised programming) in the audio stream, and updates the user interface displayed by the XR device to include another mute/unmute indicia 310 adjacent to the television.
  • the first/second participant can select the indicia 310 to cause the XR server 100 to mute/unmute the audio component (audio of the televised programming) in the audio stream provided to that XR device and/or provided to one or more other XR devices, in accordance with some embodiments.
  • the XR server 100 sends audio and video streams to the first XR device 150 to render a first XR space for the first participant and similarly sends different audio and video streams to the second XR device 150 to render a second XR space for the second participant.
  • the first XR space includes renderings of real objects and sounds observed (sensed by camera(s) and microphone(s)) in the real environment of the second participant
  • the second XR space includes renderings of real objects and sounds observed in the real environment of the first participant.
  • the second XR space of the second participant can include virtual representations of the real dog and television observed in the real environment of the first participant.
  • the virtual representations can correspond to a realtime video feed of the real object (e.g., video of the dog without the background, video of the television without the background, etc.) and/or a computer-generated graphical representation of the real object.
  • the first XR device 150 sends 408 the audio streams with the metadata to the XR server 100.
  • a first audio stream may be sent with metadata indicating it is mapped to a location of the first object (e.g., dog), which can correspond to an indication of directionality of the sound incident to the first XR device 150 from the first object generating the sound.
  • a second audio stream may be sent with metadata indicating it is mapped to a location of the second object (e.g., television), which can correspond to an indication of directionality of the sound incident to the first XR device 150 from the second object generating the sound.
  • XR server 100 sends 410 the audio components in the audio streams and sends the video streams to the second XR device 150.
  • the XR server 100 updates 412 the user interface of the first and/or second XR devices 150 to include participant selectable mute/unmute indicia associated with the first and second objects.
  • the XR server 100 may embed the indicia in the video streams sent to the first and/or second XR devices 150 along with commands that enable participant indication of selection of one or more of the displayed indicia.
  • the second XR device 150 locally displays 416 the updated user interface which includes the first and second participant selectable mute/unmute indicia (e.g., indicia 300 and 310 in Fig. 3), and detects the second participant's selection of the first mute indicia (e.g., indicia 300 in Fig. 3) associated with the first object (e.g., dog), and sends an indication of the selection to the XR server 100.
  • Figure 5 illustrates a flowchart of operations by an XR system for rendering XR environments through XR devices to participants.
  • the operations selectively mute or unmute 512 the first audio component in the at least one audio stream provided for play out through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
  • one or more of the other participants who is located in a different environment from the object generating the sound can intuitively mute the sound to reduce or prevent it from being received for playout to the participant and/or unmute the sound to increase or allow it to be received for playout to the participant.
  • the operations may further include to map a second audio component in the at least one audio stream to a second object in the at least one video stream, and render a second mute-unmute indicia associated with the second object for display through the first XR device. Responsive to an indication that the participant selected the second mute-unmute indicia, the operations selectively mute or unmute the second audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the second mute-unmute indicia with the second object and based on the mapping of the second object to the second audio component.
  • the at least one audio stream may be obtained from at least one microphone communicatively connected to the first XR device and located to provide the plurality of audio components from sensed sounds generated by the objects in the first real environment.
  • the at least one video stream may be obtained from the at least one camera communicatively connected to the first XR device and located to capture the objects in the first real environment in the at least one video stream.
  • “first XR device” and “second XR device” are used interchangeably herein. Accordingly, the first XR device may be located in Figure 1 to observe the second participant's observed real environment 160 and the second XR device may be located to observe the first participant's observed real environment 160.
  • circuitry in the first XR device performs the operations to obtain 500 the at least one audio stream, to obtain 502 the at least one video stream, and to map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream, and to further indicate in metadata the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream, and provide to an XR server the metadata, the at least one audio stream, and the at least one video stream.
  • Circuitry in the XR server performs the operations to render 504 the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, provide 508 the at least one audio stream for playout through the speaker associated with the first XR device, and render 510 the first mute-unmute indicia associated with the first object for display through the first XR device.
  • the operation to selectively mute or unmute 512 the first audio component in the at least one audio stream can be performed by the circuitry of the first XR device or the XR server.
  • the circuitry in an XR server performs the operations to obtain 500 the at least one audio stream, obtain 502 the at least one video stream, map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream, render 504 the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, provide 508 the at least one audio stream for play out through the speaker associated with the first XR device, and render 510 the first mute-unmute indicia associated with the first object for display through the first XR device.
  • the operation to selectively mute or unmute 512 the first audio component in the at least one audio stream can be performed by the circuitry of the first XR device or the XR server.
  • the terms “comprise”, “comprising”, “comprises”, “include”, “including”, “includes”, “have”, “has”, “having”, or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof.
  • the common abbreviation “e.g.”, which derives from the Latin phrase “exempli gratia” may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item.
  • the common abbreviation “i.e.”, which derives from the Latin phrase “id est,” may be used to specify a particular item from a more general recitation.
  • Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits.
  • These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An XR system renders XR environments through XR devices to participants. Operations obtain audio stream(s) containing audio components from sensed sounds generated by objects in a first real environment, and obtain video stream(s) from camera(s) capturing objects in the first real environment. Operations render an XR space containing virtual representations of the objects for display through a display associated with a first XR device, and map a first audio component in the audio stream(s) to a first object in the video stream(s). Operations provide the audio stream(s) for playout through a speaker associated with the first XR device, and render a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, operations selectively mute or unmute the first audio component in the audio stream(s) provided for playout through the speaker.

Description

SELECTIVE MUTING OF SOUNDS IN EXTENDED REALITY SPACES
TECHNICAL FIELD
[0001] The present disclosure relates to rendering extended reality (XR) spaces and associated XR devices, and more particularly to streaming of video and audio between XR devices and XR servers.
BACKGROUND
[0002] Online meetings are typically performed using a two-dimensional (2D) video stream from a single camera and an audio stream from a single microphone. The meeting participant's perceived level of immersion is relatively low because of the limitation of watching and listening to the other person(s) through a computer.
[0003] Immersive extended reality (XR) spaces have been developed which provide a myriad of different types of user experiences for virtual meetings, gaming, social networking, co-creation of products, etc. Immersive XR spaces (also referred to as "XR spaces") can include virtual reality (VR) spaces where human users only see computer generated graphical renderings, augmented reality (AR) spaces where users see a combination of computer-generated graphical renderings overlaid on a view of the physical real-world through, e.g., see-through display screens, and blended VR and AR spaces.
[0004] Example XR rendering devices (also called "XR devices") include, without limitation, AR headsets, VR headsets, XR headsets, gaming consoles, smartphones, and tablet/laptop/desktop computers which communicate with an XR space server (XR server) generating an immersive XR space. Oculus Quest is an example VR device and Google Glass is an example AR device.
[0005] Immersive XR spaces, such as meeting spaces, can be configured to display computer generated avatars which represent poses of human users in virtual meeting rooms of the immersive XR spaces. Virtual objects can include post-it-notes, virtual whiteboards and static 3D object imports, etc. which can be generated in the virtual meeting rooms and interacted with by XR meeting participants. The level of immersion is higher but confined to occurring within an XR space that has been created by a third party and which is used by all participants.
[0006] There has previously been no or limited capability for a participant to invite other online AR/VR meeting participants “into the participant's space”, since the virtual meeting is constrained to a selected meeting space within an immersive XR space. Participants have limited ability to increase others' level of immersion, since the participant's avatar's look, movements, and audio are constrained within the selected meeting space.
SUMMARY
[0007] Some embodiments disclosed herein are directed to a method by an XR system for rendering XR environments through XR devices to participants. The method includes obtaining at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment, and obtaining at least one video stream from at least one camera capturing objects in the first real environment. The method further includes rendering an XR space containing virtual representations of the objects for display through a display associated with a first XR device, and mapping a first audio component in the at least one audio stream to a first object in the at least one video stream. The method further includes providing the at least one audio stream for play out through a speaker associated with the first XR device, and rendering a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, the method includes selectively muting or unmuting the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
[0008] Some other related embodiments are directed to an XR system for rendering XR environments through XR devices to participants. The XR system includes at least one processor circuit and at least one memory circuit storing instructions executable by the at least one processor circuit to perform operations. The operations include to obtain at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment, and obtain at least one video stream from at least one camera capturing objects in the first real environment. The operations render an XR space containing virtual representations of the objects for display through a display associated with a first XR device. The operations map a first audio component in the at least one audio stream to a first object in the at least one video stream. The operations provide the at least one audio stream for playout through a speaker associated with the first XR device. The operations render a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, the operations selectively mute or unmute the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
[0009] Some other related embodiments are directed to another XR system for rendering XR environments through XR devices to participants. The XR system is operative to obtain at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment, and obtain at least one video stream from at least one camera capturing objects in the first real environment. The XR system is operative to render an XR space containing virtual representations of the objects for display through a display associated with a first XR device. The XR system is operative to map a first audio component in the at least one audio stream to a first object in the at least one video stream. The XR system is operative to provide the at least one audio stream for playout through a speaker associated with the first XR device. The XR system is operative to render a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, the XR system is operative to selectively mute or unmute the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
[0010] Potential advantages that may be provided by these and further embodiments disclosed herein include that intuitive user interfaces are provided for one or more participants in XR spaces to visually observe mute-unmute indicia which are displayed with an association to various objects which are being rendered as virtual representations in the XR space. The participant(s) can selectively mute and/or unmute sound associated with one of the objects by selecting the mute-unmute indicia associated with that object. Responsive to an indication that a participant selected the first mute-unmute indicia, operations are performed to selectively mute or unmute the sound associated with that object. In this manner, one of the participants who is located in the same environment as the object generating the sound can intuitively mute the sound to reduce or prevent its sending for playout to other participants and/or unmute the sound to increase or allow its sending for playout to the other participants. Alternatively or additionally, one or more of the other participants who is located in a different environment from the object generating the sound can intuitively mute the sound to reduce or prevent it from being received for playout to the participant and/or unmute the sound to increase or allow it to be received for playout to the participant.
[0011] Other methods and related XR systems according to embodiments will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional methods and related XR systems be included within this description and protected by the accompanying claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying drawings. In the drawings:
[0013] Figure 1 illustrates an XR system that includes an XR server which generates a shared XR space combining renderings and sounds of real objects sensed in the separate environments of participants, and which are displayed and played out through XR devices operated by participants, in accordance with some embodiments of the present disclosure;
[0014] Figure 2 illustrates components of the XR device in Figure 1 which are configured in accordance with some embodiments of the present disclosure;
[0015] Figure 3 illustrates an example participant's XR view through one of the XR devices displaying a pair of selectable mute/unmute indicia adjacent to the corresponding sound creating objects, and which can be selected to mute/unmute the corresponding audio components in accordance with some embodiments;
[0016] Figure 4 illustrates flowcharts of operations that can be performed by the first XR device, the XR server, and the second XR device in accordance with some embodiments; and
[0017] Figure 5 illustrates a flowchart of operations by an XR system for rendering XR environments through XR devices to participants.
DETAILED DESCRIPTION
[0018] Inventive concepts will now be described more fully hereinafter with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of various present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in another embodiment.
[0019] Potential problems that can arise with prior approaches are explained. As explained above, there has previously been no or limited capability for a participant to invite other online AR/VR meeting participants “into the participant's environment." If a participant were to share the participant's environment to increase the level of immersion by other participants, the imagery and audio presented to the other participants can rapidly overload the participants' cognitive processing comfort level. Moreover, if all participants were to broadcast their respective environments, the resulting presentation would likely overload the participants' comfort levels. The operation for sharing part of a participant's environment (e.g., video and sound) in an AR/VR meeting can correspond to publicly available operations for generating a Venn room.
[0020] Some embodiments of the present disclosure are directed to improving how audio content is shared in computer-generated Venn rooms.
[0021] A Venn room can be generated as an immersive XR space in which participants use local camera(s) and microphone(s) to stream video and audio to an XR server, which combines the video and audio to generate a merged space that includes components that have been sensed by cameras, microphones, etc. in the real environments of the participants.
[0022] For example, assume two people wearing XR devices (e.g., headsets) occupy the same XR space, but are located in different real environments (e.g., rooms in their respective homes). When they meet in the XR space, real objects (e.g., furniture, decorations, electronic devices, people and pets, etc.) in their rooms which are viewed by cameras (e.g., on XR headsets or separate therefrom) are virtualized and combined (blended) to generate a combined XR space in which both participants are immersed. The size, shape and layout rendered for the XR objects and spaces can depend upon metrics which are determined based on what is observed in the participants' real environments. If the two people want to sit together, for instance, they may move real furniture to align the corresponding XR furniture as desired in the XR space. This creates opportunities for people to share their real environments to collectively create shared XR spaces as a form of extended reality.
[0023] To increase the immersion level, participants can share multiple sound sources in their real environments, as well as acoustic properties of their real environments (e.g., room size, location of participant in the room, etc.). However, the participants' environments may have unwanted sounds that the sender or receiver participant does not want to hear. Some embodiments of the present disclosure are directed to operations to enable XR participants to selectively include/exclude individual sound sources, such as by tapping a mute icon attached to (e.g., displayed within a threshold distance of) noise sources around him/her. The XR space may, for example, enable one participant to select an indicia for “Mute all dogs”, “Unmute that harmonica”, etc. Participants in an XR meeting are provided intuitive operational ability to selectively mute/unmute sound sources that others are sharing.
[0024] Each participant's real environment can include one or more microphones from which sound can be processed to determine directionality to the sound source and which may distinguish between different types of sound sources. The directionality and/or types of sound sources can be characterized as profile information that is included with the audio streams, such as in metadata sent to the XR server. During an XR meeting session, the XR server and/or XR device of a participant can be configured to enable a first participant to select, using the XR device of the first participant, which local sound sources they want to include or exclude in their personal representation of the shared space, e.g., Venn room, during the XR meeting session. A second participant in the XR meeting can select, using the XR device of the second participant, which remote sound sources, captured from the first participant's space, they want to include or exclude in their personal representation of the shared space, e.g., Venn room, during the XR meeting session. The XR server can operate to generate the shared space (e.g., Venn room) of the XR meeting session by sharing sounds in the participants’ shared spaces with inclusion of some sound sources which were selected for inclusion by respective participant(s) and exclusion of some other sound sources which were selected for exclusion by respective participant(s). The XR server also combines virtual representations of the real objects captured by the cameras for display on XR devices of respective participants.
[0025] Potential advantages can include enabling a higher level of immersion, where participants can be invited to another person’s broadcast “1-to-many” while sharing the sounds of the shared space. In a “many-to-many” shared space, e.g., Venn room, the higher level of immersion is enabled also from the shared acoustical environment, while enabling keeping the noise level down by participants selectively including or excluding sound sources (auditory elements) in their own broadcasts.
[0026] Figure 1 illustrates an XR system that includes an XR space server 100 (XR server) which generates a shared XR space combining renderings and sounds of real objects observed in the separate environments of participants, and which are displayed and played out through XR devices 150 operated by participants, in accordance with some embodiments of the present disclosure.
[0027] Referring to Figure 1, the first XR device 150 is worn by a first participant in a first real environment (e.g., home office). One or more cameras in the first XR device 150 and/or located elsewhere provide one or more video streams of observed objects to the XR server 100. One or more microphones in the first XR device 150 and/or located elsewhere provide to the XR server 100 one or more audio streams having audio components which correspond to sensed sounds from some of the observed objects in the first real environment.
[0028] Similarly, a second XR device 150 is worn by a second participant in a second real environment (e.g., outdoor patio). One or more cameras in the second XR device 150 and/or located elsewhere provide one or more video streams of observed objects to the XR server 100. One or more microphones in the second XR device 150 and/or located elsewhere provide to the XR server 100 one or more audio streams having audio components which correspond to sensed sounds from some of the observed objects in the second real environment.
[0029] When more than one camera or more than one microphone provides streams, the first/second XR device 150 may provide corresponding separate video streams and/or audio streams to the XR server 100, or may combine the more than one video stream into a combined video stream and/or combine the more than one audio stream into a combined audio stream that is provided to the XR server 100. Circuitry for processing video and/or audio may reside in the XR device 150 or in a connected device, such as a networked (e.g., WiFi networked, cellular networked, etc.) smartphone, tablet computer, etc.
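To make the stream-combining option above concrete, the following Python sketch (illustrative only; the function and variable names are assumptions, not part of the disclosure) mixes several locally captured microphone buffers into one combined buffer before it would be provided to the XR server 100.
```python
import numpy as np

def combine_audio_streams(buffers: list[np.ndarray]) -> np.ndarray:
    """Mix several equal-length mono microphone buffers into one stream.

    Each buffer holds float32 samples in [-1.0, 1.0]. Summing followed by
    peak normalization is only one of many possible combining strategies.
    """
    if not buffers:
        raise ValueError("at least one audio buffer is required")
    mixed = np.sum(np.stack(buffers), axis=0)
    peak = np.max(np.abs(mixed))
    if peak > 1.0:                      # avoid clipping after summation
        mixed = mixed / peak
    return mixed.astype(np.float32)

# Example: two microphones captured by the XR device or a connected smartphone
mic_front = np.zeros(480, dtype=np.float32)   # 10 ms of audio at 48 kHz
mic_left = np.zeros(480, dtype=np.float32)
combined = combine_audio_streams([mic_front, mic_left])
```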
[0030] The first and second XR devices 150 can communicate with the XR server 100 through one or more wired or wireless communication networks 160. The XR server 100 includes a network interface 140 operative to communicate with XR devices via the networks 160, and at least one processing circuitry 110 ("processor") and at least one memory circuit 120 ("memory"). The memory 120 stores an XR application 122 and an object sound muting application 132 containing instructions executable by the processor to perform operations in accordance with embodiments disclosed herein. The XR server 100 may include a display 142, user interface, and other elements.
[0031] Although the example XR devices 150 are illustrated in Figure 1 as headsets, they may be wearable in other ways or not configured to be wearable. The XR devices 150 may be a single integrated device or a collection of camera(s) 178 operative to provide video stream(s) (e.g., through network interface 175) capturing observed objects in an environment, microphone(s) 180 operative to provide audio stream(s) sensed in the environment, display(s) 174 operative to display video stream(s) of the XR space from the XR server 100, speaker(s) 176 operative to play out audio stream(s) of the XR space from the XR server 100, processor(s), and communication circuit(s) operative to communicate with the XR server 100. Moreover, the XR devices 150 may operate in an AR mode, a VR mode, or a combined VR and AR mode. The display 174 may be configured as a see-through display (e.g., see-through LCD screen, reciprocal mirror light transparent on one side and light reflective on the other side, etc.). The XR devices 150 include at least one processing circuitry 170 ("processor") and at least one memory circuit 172 ("memory"). The memory stores an XR application and may store an object sound muting application containing instructions executable by the processor 170 to perform operations in accordance with embodiments disclosed herein.
[0032] The XR server 100 detects objects in the first real environment (e.g., home office) captured in the video stream(s) from the first XR device 150. Similarly, XR server 100 detects objects in the second real environment (e.g., outdoor patio) captured in the video stream(s) from the second XR device 150. The XR application 122 generates XR streams based on the detected objects that can be rendered on displays of the XR devices 150 and further generates the XR streams based on the audio streams. XR application 122 creates immersive XR spaces for the participants by sending first audio and video streams to the first XR device 150 for video display through displays and for audio play out through speakers, and by sending second audio and video streams to the second XR device 150 for display and audio play out. The XR application 122 may be an AR application, a VR application, a combined AR and VR application, etc.
[0033] In some embodiments, the first and second XR devices 150 may each map the audio components to the corresponding objects in their environment from which the sound originated. The determined mapping of audio components to the corresponding objects can be indicated through metadata sent with the audio and video XR streams.
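As one hypothetical illustration of such metadata, the sketch below shows how mappings of audio components to objects and incident-sound directions might be serialized alongside the audio and video XR streams; the field names and identifiers are assumptions rather than a format defined by the disclosure.
```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AudioComponentMapping:
    """Illustrative metadata entry linking one audio component to one object."""
    component_id: str      # identifier of the audio component in the stream
    object_id: str         # identifier of the detected object (e.g., the dog)
    azimuth_deg: float     # incident sound direction relative to the XR device
    elevation_deg: float
    confidence: float      # how certain the device is about the mapping

# Metadata sent with the audio and video XR streams (all names are assumptions)
metadata = {
    "device_id": "xr-device-150-first",
    "mappings": [
        asdict(AudioComponentMapping("audio-1", "object-dog",
                                     azimuth_deg=-35.0, elevation_deg=-10.0,
                                     confidence=0.9)),
        asdict(AudioComponentMapping("audio-2", "object-tv",
                                     azimuth_deg=60.0, elevation_deg=0.0,
                                     confidence=0.8)),
    ],
}
payload = json.dumps(metadata)   # carried with the streams to the XR server 100
```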
[0034] The mapping may correspond to indicating directionality of the incident sound in metadata provided with the audio stream to the XR server 100. Incident sound directionality may be determined using a directional microphone and performing operations to determine the direction to the object creating the sound based on how sound gain changes with tracked orientation (pose) of the microphone. Alternatively or additionally, the incident sound directionality may be determined using an array of microphones and determining from phasing of sound signals and relative positions of the microphones the direction from the XR device to the object creating the sound.
[0035] In accordance with some optional embodiments, the XR server 100 uses the metadata provided with the audio stream(s) from the first XR device 150 to further map the audio components to the corresponding objects of the first real environment (e.g., home office) observed in the video stream(s) from the first XR device 150. Similarly, the XR server 100 uses the metadata provided with the audio stream(s) from the second XR device 150 to further map the audio components to the corresponding objects of the second real environment (e.g., outdoor patio) observed in the video stream(s) from the second XR device 150.
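The array-of-microphones approach in paragraph [0034] can be pictured as a time-difference-of-arrival estimate between two microphones; the following sketch is a minimal, assumption-based example and not the disclosed implementation.
```python
import numpy as np

def estimate_azimuth(sig_left: np.ndarray, sig_right: np.ndarray,
                     mic_spacing_m: float, sample_rate: int,
                     speed_of_sound: float = 343.0) -> float:
    """Estimate the direction of arrival (azimuth in radians) of a sound source
    from the time difference of arrival between two microphones."""
    # Lag (in samples) at which the two microphone signals best align
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_right) - 1)
    tdoa = lag / sample_rate                          # seconds
    # Clamp to the physically possible range before taking the arcsine
    ratio = np.clip(tdoa * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.arcsin(ratio))
```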
[0036] For example, the XR server 100 may map an audio component to an object in a video stream based on determining the direction of the audio component in the audio stream(s) is correlated in time to a direction of the object captured in the video stream(s). The XR server 100 may be informed through the metadata or from another information source regarding orientation and spacing between the camera(s) to the microphone(s), e.g., which may be components of the XR device 150.
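A minimal sketch of that time-correlation idea follows; it assumes per-frame bearings (in degrees) are already available for the audio component and for each tracked object, and the object identifiers are illustrative.
```python
import numpy as np

def map_audio_to_object(audio_azimuths: np.ndarray,
                        object_azimuths: dict[str, np.ndarray],
                        max_mean_error_deg: float = 15.0):
    """Pick the tracked object whose bearing over time best matches the bearing
    of an audio component; return None when nothing matches closely enough."""
    best_id, best_err = None, float("inf")
    for object_id, bearings in object_azimuths.items():
        err = float(np.mean(np.abs(audio_azimuths - bearings)))
        if err < best_err:
            best_id, best_err = object_id, err
    return best_id if best_err <= max_mean_error_deg else None

# Example: per-frame bearings of one audio component and two detected objects
audio_track = np.array([-34.0, -35.5, -33.8])
objects = {"object-dog": np.array([-35.0, -35.0, -34.0]),
           "object-tv": np.array([60.0, 61.0, 59.5])}
assert map_audio_to_object(audio_track, objects) == "object-dog"
```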
[0037] In accordance with some further embodiments, the XR server 100 controls (updates) the user interface of one or both the XR devices 150 to include selectable indicia associated with the mapped objects. One participant can select the indicia to cause muting or unmuting of the mapped audio component. Muting can correspond to reducing amplitude of the audio component or entirely eliminating the audio component from being streamed from the XR server 100 to the XR device 150 of that participant. Unmuting can correspond to increasing amplitude of the audio component or returning to an earlier amplitude level the audio component or resuming streaming of the audio component from the XR server 100 to the XR device 150 of that participant. These and other operations are described in further detail below with regard to the example embodiments of Figure 4.
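The muting and unmuting semantics above can be pictured as per-component gains applied before the stream for a given XR device is mixed; in the assumption-based sketch below, a gain of 0.0 eliminates a component, intermediate values reduce its amplitude, and 1.0 restores it.
```python
import numpy as np

# Per-recipient mute state keyed by audio component (names are illustrative)
mute_gains = {"audio-dog": 0.0,   # muted: component removed from the mix
              "audio-tv": 1.0}    # unmuted: full amplitude

def mix_for_recipient(components: dict[str, np.ndarray],
                      gains: dict[str, float]) -> np.ndarray:
    """Apply per-component gains before mixing the stream sent to one XR device."""
    out = np.zeros_like(next(iter(components.values())))
    for component_id, samples in components.items():
        out += gains.get(component_id, 1.0) * samples
    return out

# Example with two 10 ms components at 48 kHz
components = {"audio-dog": np.ones(480, dtype=np.float32) * 0.2,
              "audio-tv": np.ones(480, dtype=np.float32) * 0.1}
mixed = mix_for_recipient(components, mute_gains)   # contains only the TV audio
```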
[0038] In some further embodiments, the operations anchor the rendering of the first mute-unmute indicia to remain proximately located to a rendering of the first object when displayed through the first XR device and/or to remain proximately located to a defined area in a room. For example, as a participant wearing the first XR device rotates his/her head the first mute-unmute indicia can remain proximately located to the rendering of the first object, and rotate out of the field-of-view of the participant as the first object similarly rotates out of the field-of-view. In another example, the participant may define an area in a room where sound arriving therefrom can be selectively muted in the outgoing stream. As a participant wearing the first XR device rotates his/her head the first mute-unmute indicia can remain proximately located to the defined area of the room, and rotate out of the field-of-view of the participant as the defined area of the room similarly rotates out of the field-of-view.
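A minimal sketch of this anchoring behavior follows, assuming a simple horizontal field-of-view model; the function name, angle convention, and thresholds are illustrative assumptions.
```python
def indicia_view_position(object_azimuth_deg: float, head_yaw_deg: float,
                          half_fov_deg: float = 45.0):
    """Return the horizontal screen position (-1..1) of a mute-unmute indicia
    anchored to an object, or None when the object lies outside the field of view."""
    relative = (object_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
    if abs(relative) > half_fov_deg:
        return None                     # indicia rotates out of view with the object
    return relative / half_fov_deg      # -1 = left edge of view, +1 = right edge

# The indicia stays near the dog while the dog is in view, and disappears otherwise
assert indicia_view_position(object_azimuth_deg=-35.0, head_yaw_deg=-30.0) is not None
assert indicia_view_position(object_azimuth_deg=-35.0, head_yaw_deg=90.0) is None
```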
[0039] Figure 2 illustrates components of the XR device 150 in Figure 1 configured in accordance with some embodiments of the present disclosure.
[0040] Referring to Figure 2, the XR device 150 includes camera(s) 210 operative to provide video stream(s) capturing observed objects in an environment, an array of microphones 220 (located on the left and right ear supports and/or along a front surface of the device 150) operative to provide audio streams from audio sensed in the environment, display(s) 200 operative to display video stream(s) of the XR space from the XR server 100, speaker(s) 230 operative to playout audio stream(s) from the XR server 100, and circuitry operative to process the audio and video streams, user interface input (e.g., via touch interface selections, camera sensed and recognized hand gestures for commands, and/or microphone sensed voice commands), and provide communication connectivity with the XR server 100.
[0041] Figure 3 illustrates an example participant's VR view through one of the XR devices 150 displaying a pair of selectable mute/unmute indicia adjacent to the corresponding sound creating objects, and which can be selected to mute/unmute the corresponding audio components in accordance with some embodiments.
[0042] In one illustrative operational scenario for the system of Figure 1, the first participant has decided to share video and audio of the real environment (e.g., room). From a user interface display view, first participant can see part of the real environment which a camera is sending as a video stream to the XR server 100 (for sharing in the XR space of participant 2) and can hear sounds that a microphone is sending as an audio stream to the XR server 100. Sound sources that are active “currently making sound” are highlighted or otherwise associated with indicia 300 and 310 that can be selected by first participant to selectively mute or unmute that sound in the audio stream sent to the XR server 100 or in the audio stream which is sent from the XR server 100 to other XR devices (second participant). The sounds of the first participant's environment may be continuously and automatically operationally scanned and localized through the analysis of reverberation and echoes from sounds already in the environment, and associated with real objects in the environment.
[0043] The second participant is also a part of the online shared XR space (e.g., Venn room) and can see and hear the environment of the first participant through video and audio streams provided by the XR server 100. Second participant's displayed user interface may be similar to the first participant's, and may highlight sound sources that are active “currently making sound” with indicia 300 and 310 that can be selected by second participant to selectively mute or unmute that sound in the audio stream sent from the XR server 100 to second participant's XR device 150. For example, the second participant does not desire to hear a dog barking from the first participant's environment and can selectively mute the dog by selecting the mute/unmute indicia 300 displayed proximate to and above the dog. This action, in some embodiments, may not be revealed (indicated) to the first participant to enable confidential muting/unmuting actions to be performed by other participants with respect to audio from first participant's environment.
[0044] From the operational perspective of the XR server 100, it has determined that the dog corresponds to a particular audio component (whining/barking) in the audio stream, and updates the user interface displayed by the XR device to include a mute/unmute indicia 300 proximate to and above the dog. The first/second participant can select the indicia 300 to cause the XR server 100 to mute/unmute the audio component (whining/barking) in the audio stream provided to that XR device and/or provided to one or more other XR devices, in accordance with some embodiments.
[0045] Similarly, the XR server 100 has determined that the television corresponds to another particular audio component (audio of televised programming) in the audio stream, and updates the user interface displayed by the XR device to include another mute/unmute indicia 310 adjacent to the television. The first/second participant can select the indicia 310 to cause the XR server 100 to mute/unmute the audio component (audio of the televised programming) in the audio stream provided to that XR device and/or provided to one or more other XR devices, in accordance with some embodiments.
[0046] Figure 4 illustrates flowcharts of operations that can be performed by the first XR device 150, the XR server 100, and the second XR device 150 in accordance with some embodiments.
[0047] Referring to Figure 4, the first and second XR devices 150 each establish 400 through the XR server 100 an XR session for their respective participants. The XR server 100 generates XR spaces which combine objects observed (sensed by camera(s) and microphone(s)) in the real environments of the first and second participants, e.g., so as to provide a Venn room to the participants. For example, the XR server 100 can render 402 first and second XR spaces based on a combination of a first set of objects observed in video from the first XR device 150 and a second set of objects observed in video from the second XR device 150.
[0048] The XR server 100 sends audio and video streams to the first XR device 150 to render a first XR space for the first participant and similarly sends different audio and video streams to the second XR device 150 to render a second XR space for the second participant. The first XR space includes renderings of real objects and sounds observed (sensed by camera(s) and microphone(s)) in the real environment of the second participant, and the second XR space includes renderings of real objects and sounds observed in the real environment of the first participant. Thus, for example, the second XR space of the second participant can include virtual representations of the real dog and television observed in the real environment of the first participant. The virtual representations can correspond to a realtime video feed of the real object (e.g., video of the dog without the background, video of the television without the background, etc.) and/or a computer-generated graphical representation of the real object.
[0049] The XR server 100 and/or the first XR device 150 operate to map 404 audio components in a first audio stream from a microphone(s) to a first object (e.g., dog) in the first set, and indicate the mapping in metadata. The XR server 100 and/or the first XR device 150 also operate to map 406 audio components in the first audio stream and/or a second audio stream from the microphone(s) to a second object (e.g., television) in the first set, and indicate the mapping in the metadata. The XR server 100 and/or the first XR device 150 perform similar mappings of sounds to other objects and record indications of the mapping(s) in the metadata.
[0050] The first XR device 150 sends 408 the audio streams with the metadata to the XR server 100. Thus, for example, a first audio stream may be sent with metadata indicating it is mapped to a location of the first object (e.g., dog), which can correspond to an indication of directionality of the sound incident to the first XR device 150 from the first object generating the sound. Similarly, a second audio stream may be sent with metadata indicating it is mapped to a location of the second object (e.g., television), which can correspond to an indication of directionality of the sound incident to the first XR device 150 from the second object generating the sound.
[0051] When the XR server 100 performs the mapping of sounds to objects, the operation 408 is performed (without use of metadata for any such indication from XR device 150 to XR server 100) before the operations 404 and 406 which are then performed by the XR server 100 instead of the first XR device 150.
[0052] XR server 100 sends 410 the audio components in the audio streams and sends the video streams to the second XR device 150. The XR server 100 updates 412 the user interface of the first and/or second XR devices 150 to include participant selectable mute/unmute indicia associated with the first and second objects. For example, the XR server 100 may embed the indicia in the video streams sent to the first and/or second XR devices 150 along with commands that enable participant indication of selection of one or more of the displayed indicia. A participant may indicate selection by touch-selecting a region on the display of XR device 150 corresponding to the indicia, by steering a displayed cursor to the indicia, by voice command, by operating a button/joystick/touchpad to trigger the indicia to be selected or highlighted and then selected, etc.
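As one illustration of the participant selection indication mentioned above, an XR device might report a small event such as the following to the XR server 100; the message format and field names are assumptions, since the disclosure does not define a wire format.
```python
import json

# Hypothetical selection event sent when a participant selects a displayed indicia
selection_event = {
    "type": "indicia_selected",
    "session_id": "xr-session-42",
    "sender_device": "xr-device-150-second",
    "indicia_id": "indicia-300",        # e.g., the indicia above the dog
    "object_id": "object-dog",
    "action": "mute",                   # or "unmute"
}
payload = json.dumps(selection_event)   # sent over the network interface
```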
[0053] The first XR device 150 locally displays 414 the updated user interface which includes the first and second participant selectable mute/unmute indicia (e.g., indicia 300 and 310 in Fig. 3). For example, with the example of Figure 3, the first participant may be in the same room as the dog and television. The first participant wearing the first XR device 150 may select indicia 300 or indicia 310 to selectively mute the audio component associated therewith, dog or television, from being streamed to the second participant wearing the second XR device 150. The second participant can be located remotely from the first participant and only able to hear the audio component associated with the dog or television when present in the audio stream.
[0054] The second XR device 150 locally displays 416 the updated user interface which includes the first and second participant selectable mute/unmute indicia (e.g., indicia 300 and 310 in Fig. 3), and detects the second participant's selection of the first mute indicia (e.g., indicia 300 in Fig. 3) associated with the first object (e.g., dog), and sends an indication of the selection to the XR server 100.
[0055] The XR server 100 responds to receipt of the indicated selection of the first mute indicia by stopping 418 forwarding of the first component in the audio stream corresponding to the first object (e.g., dog) to the second XR device 150 or attenuating volume of the first component in the audio stream sent to the second XR device 150, thereby muting that sound from the second participant. In some embodiments, the first participant is provided notification of operation to mute or unmute the sound. The XR server 100 may further update the user interface displayed on the second XR device 150 to visually indicate through the indicia associated with the first object that the associated sound is muted or likewise when it is unmuted.
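The server-side reaction could be sketched as updating a per-recipient forwarding table, where a gain of 0.0 represents stopped forwarding and a small non-zero gain represents attenuation; all names below are illustrative assumptions rather than the disclosed implementation.
```python
# Per-recipient gains applied by the XR server before forwarding audio components
forwarding_state: dict[str, dict[str, float]] = {
    "xr-device-150-second": {"audio-dog": 1.0, "audio-tv": 1.0},
}

def handle_selection(event: dict, attenuation: float = 0.0) -> None:
    """Stop forwarding (gain 0.0) or attenuate a mapped audio component for the
    recipient that selected the mute indicia; restore gain 1.0 on unmute."""
    recipient = event["sender_device"]
    component = "audio-" + event["object_id"].removeprefix("object-")
    gains = forwarding_state.setdefault(recipient, {})
    gains[component] = 1.0 if event["action"] == "unmute" else attenuation

handle_selection({"sender_device": "xr-device-150-second",
                  "object_id": "object-dog", "action": "mute"})
assert forwarding_state["xr-device-150-second"]["audio-dog"] == 0.0
```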
[0056] The first XR device 150 can operate to detect 420 the first participant's selection of the second mute indicia (e.g., indicia 310 in Fig. 3) associated with the second object (e.g., television), and to either locally mute the sound associated with the second object or send an indication of the selection to the XR server 100. The first participant may mute the sound from being sent to all other participants, or only to participants who are pre-defined or selected by the first participant. When the XR server 100 performs the muting, it responds to receipt of the indicated selection of the second mute indicia by stopping 422 forwarding of the second component in the audio stream to the second XR device 150, or by attenuating the volume of the second component in the audio stream sent to the second XR device 150, and possibly to all other XR devices or to selected XR devices as explained above.
[0057] When the indication is sent to the XR server 100 for muting, the XR server 100 responds to receipt of the indicated selection of the second mute indicia by stopping forwarding of the second component in the audio stream corresponding to the second object (e.g., television) to the second XR device 150, thereby muting that sound from the second participant, and possibly from all other participants or only from participants who are pre-defined or selected by the first participant. The XR server 100 may further update the user interface displayed on the first XR device 150 to visually indicate, through the indicia associated with the second object, that the associated sound is muted, or likewise when it is unmuted.
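The per-recipient behavior described in paragraphs [0056]-[0057] could be tracked with a small mute-policy structure such as the sketch below; the class, its methods, and the wildcard convention are assumptions used only to illustrate muting toward all other participants versus a selected subset.

```python
# Assumed bookkeeping for "mute toward everyone" vs. "mute toward selected participants".
class MutePolicy:
    def __init__(self):
        # object_id -> set of participant ids the component is muted for,
        # or the wildcard "*" meaning "muted for all other participants"
        self._muted_for: dict[str, set[str]] = {}

    def mute(self, object_id: str, participants: set[str] | None = None) -> None:
        self._muted_for[object_id] = participants if participants is not None else {"*"}

    def unmute(self, object_id: str) -> None:
        self._muted_for.pop(object_id, None)

    def is_muted(self, object_id: str, participant: str) -> bool:
        targets = self._muted_for.get(object_id, set())
        return "*" in targets or participant in targets
```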
[0058] The first XR device 150, XR server 100, and second XR device 150 can repeat the operations 404-422 to continue 424 the XR session while dynamically tracking and reacting to appearance of new objects that generate sounds and/or disappearance of earlier objects, and dynamically tracking and reacting to participant selections of mute/unmute indicia.
[0059] Figure 5 illustrates a flowchart of operations by an XR system for rendering XR environments through XR devices to participants.
[0060] Referring to Figure 5, the operations include to obtain 500 at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment. The operations obtain 502 at least one video stream from at least one camera capturing objects in the first real environment. The operations render 504 an XR space containing virtual representations of the objects for display through a display associated with a first XR device. The operations map 506 a first audio component in the at least one audio stream to a first object in the at least one video stream. The operations provide 508 the at least one audio stream for playout through a speaker associated with the first XR device. The operations render 510 a first mute-unmute indicia associated with the first object for display through the first XR device. Responsive to an indication that a participant selected the first mute-unmute indicia, the operations selectively mute or unmute 512 the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.

[0061] Potential advantages that may be provided by these and further embodiments disclosed herein include that intuitive user interfaces are provided for one or more participants in an XR space to visually observe mute-unmute indicia which are displayed in association with various objects being rendered as virtual representations in the XR space. A participant can selectively mute and/or unmute sounds associated with one of the objects by selecting the mute-unmute indicia associated with that object. In response to an indication that a participant selected the first mute-unmute indicia, operations are performed to selectively mute or unmute the sound associated with that object. In this manner, a participant who is located in the same environment as the object generating the sound can intuitively mute the sound to reduce or prevent its sending for playout to other participants and/or unmute the sound to increase or allow its sending for playout to the other participants. Alternatively or additionally, one or more of the other participants who are located in a different environment from the object generating the sound can intuitively mute the sound to reduce or prevent it from being received for playout and/or unmute the sound to increase or allow it to be received for playout to the participant.
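For orientation, the following toy, self-contained sketch wires the Figure 5 steps together with in-memory stand-ins. It is not an implementation of the disclosed system; every name and data structure in it is assumed for illustration.

```python
# Toy sketch of the Figure 5 control flow (steps 500-512); all names are assumptions.
import numpy as np

audio_components = {"dog": np.random.randn(480), "television": np.random.randn(480)}  # 500
video_objects = ["dog", "television", "sofa"]                                         # 502
xr_space = {obj: f"virtual-{obj}" for obj in video_objects}                           # 504
mapping = {comp: comp for comp in audio_components if comp in video_objects}          # 506
indicia = {obj: {"state": "unmuted"} for obj in mapping.values()}                     # 510


def on_indicia_selected(obj: str):
    """Step 512: toggle mute for the object's component, then remix the playout (508)."""
    indicia[obj]["state"] = "muted" if indicia[obj]["state"] == "unmuted" else "unmuted"
    return sum(frames for comp, frames in audio_components.items()
               if indicia[comp]["state"] == "unmuted")


playout = on_indicia_selected("dog")  # mutes the dog; only the television remains audible
```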
[0062] Responsive to the indication that the participant selected the first mute-unmute indicia, the operations may update the rendering of the first mute-unmute indicia to indicate triggering of the selective muting or unmuting of the first audio component in the at least one audio stream.
[0063] In some further embodiments, the operations may further include mapping a second audio component in the at least one audio stream to a second object in the at least one video stream, and rendering a second mute-unmute indicia associated with the second object for display through the first XR device. Responsive to an indication that the participant selected the second mute-unmute indicia, the operations selectively mute or unmute the second audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the second mute-unmute indicia with the second object and based on the mapping of the second object to the second audio component.
[0064] The at least one audio stream may be obtained from at least one microphone communicatively connected to the first XR device and located to provide the plurality of audio components from sensed sounds generated by the objects in the first real environment. The at least one video stream may be obtained from the at least one camera communicatively connected to the first XR device and located to capture the objects in the first real environment in the at least one video stream.

[0065] The terms "first XR device" and "second XR device" are used interchangeably herein. Accordingly, the first XR device may be located in Figure 1 to observe the second participant's observed real environment 160 and the second XR device may be located to observe the first participant's observed real environment 160. The first XR device may thus be located in a second environment that is separate and isolated from the first environment, in which case the at least one audio stream does not contain audio components from sounds generated by objects in the second real environment and the at least one video stream does not capture the objects in the second real environment.
[0066] In some embodiments, circuitry in the first XR device performs the operations to obtain 500 the at least one audio stream, to obtain 502 the at least one video stream, to map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream, to indicate in metadata the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream, and to provide to an XR server the metadata, the at least one audio stream, and the at least one video stream. Circuitry in the XR server performs the operations to render 504 the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, provide 508 the at least one audio stream for playout through the speaker associated with the first XR device, and render 510 the first mute-unmute indicia associated with the first object for display through the first XR device. The operation to selectively mute or unmute 512 the first audio component in the at least one audio stream can be performed by the circuitry of the first XR device or the XR server.
[0067] The operation to map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream may include determining an indication of direction between the first object and the first XR device. The operation to indicate in the metadata the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream can include generating the metadata to include the indication of direction.
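The disclosure does not prescribe how the indication of direction is obtained. One common approach, shown as a hedged sketch below, is time-difference-of-arrival estimation between two microphones using plain cross-correlation; the function name and parameters are assumptions, and a real device could use any other direction-finding technique.

```python
# Hedged sketch: far-field TDOA direction estimate from a two-microphone array.
import numpy as np


def estimate_azimuth_deg(left: np.ndarray, right: np.ndarray,
                         sample_rate: int, mic_spacing_m: float,
                         speed_of_sound: float = 343.0) -> float:
    """Estimate the azimuth of a sound source relative to the microphone baseline."""
    correlation = np.correlate(left, right, mode="full")
    lag = np.argmax(correlation) - (len(right) - 1)       # inter-microphone delay in samples
    tdoa = lag / sample_rate                              # delay in seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```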
[0068] The operation to map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream may include determining an indication of direction between the first object and the first XR device, and determining the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction.

[0069] In some further embodiments, circuitry in an XR server performs the operations to obtain 500 the at least one audio stream, obtain 502 the at least one video stream, map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream, render 504 the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, provide 508 the at least one audio stream for playout through the speaker associated with the first XR device, and render 510 the first mute-unmute indicia associated with the first object for display through the first XR device. The operation to selectively mute or unmute 512 the first audio component in the at least one audio stream can be performed by the circuitry of the first XR device or the XR server.
[0070] The operation to map 506 the first audio component in the at least one audio stream to the first object in the at least one video stream may include determining, from metadata obtained with the at least one audio stream or the at least one video stream, an indication of direction between the first object and the first XR device, and determining the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction.
[0071] The operation to determine the mapping 506 of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction may include identifying the first object from among the objects captured in the at least one video stream by correlating the indication of direction between the first object and the first XR device with the location of the first object in the at least one video stream. Alternatively, the mapping 506 may be based on sound type identification, which may include, for example, mapping an audio component containing characteristics of barking sounds to a dog identified as one of the objects, mapping another audio component containing characteristics of a television program to a television identified as another one of the objects, mapping another audio component containing characteristics of a child crying to a child identified as another one of the objects, etc.
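Both mapping strategies can be summarized in a short, illustrative matcher: correlate the estimated sound direction with the on-screen direction of each detected object, and fall back to matching a sound-classifier label to an object class. The object detection and sound classification models are assumed to exist and are not shown; all names, labels, and the tolerance value are hypothetical.

```python
# Illustrative matcher for paragraph [0071]: direction correlation first,
# sound-type identification as a fallback.
def map_component_to_object(sound_azimuth_deg: float | None,
                            sound_label: str | None,
                            detections: list[dict],
                            angle_tolerance_deg: float = 15.0) -> str | None:
    """detections: [{"object_id": "dog-1", "class": "dog", "azimuth_deg": -20.0}, ...]"""
    if sound_azimuth_deg is not None:
        nearest = min(detections,
                      key=lambda d: abs(d["azimuth_deg"] - sound_azimuth_deg),
                      default=None)
        if nearest and abs(nearest["azimuth_deg"] - sound_azimuth_deg) <= angle_tolerance_deg:
            return nearest["object_id"]
    if sound_label is not None:
        label_to_class = {"bark": "dog", "tv_program": "television", "crying": "child"}
        wanted = label_to_class.get(sound_label)
        for d in detections:
            if d["class"] == wanted:
                return d["object_id"]
    return None   # no confident mapping; leave the component unmapped
```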
[0072] In some further embodiments, the operations anchor the rendering of the first mute-unmute indicia to remain proximately located to a rendering of the first object when displayed through the first XR device and/or to remain proximately located to a defined area in a room. For example, as a participant wearing the first XR device rotates his/her head, the first mute-unmute indicia can remain proximately located to the rendering of the first object, and rotate out of the field-of-view of the participant as the first object similarly rotates out of the field-of-view.

[0073] In a further embodiment, the operation to render 510 the first mute-unmute indicia associated with the first object for display through the first XR device includes rendering the first mute-unmute indicia to be within a threshold distance of the first object when displayed through the first XR device.
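A minimal sketch of the threshold-distance anchoring in paragraph [0073], assuming world-space positions are available as NumPy vectors: place the indicia at a preferred offset from the object's anchor point and clamp the offset so it never exceeds the threshold distance. The function, its parameters, and the example values are assumptions for illustration.

```python
# Sketch of threshold-distance anchoring: keep the indicia near its object.
import numpy as np


def place_indicia(object_position: np.ndarray,
                  preferred_offset: np.ndarray,
                  threshold_distance: float) -> np.ndarray:
    """Return a world-space indicia position within threshold_distance of the object."""
    offset = np.asarray(preferred_offset, dtype=float)
    distance = np.linalg.norm(offset)
    if distance > threshold_distance and distance > 0.0:
        offset = offset * (threshold_distance / distance)   # clamp onto the threshold sphere
    return np.asarray(object_position, dtype=float) + offset


# e.g., float the indicia 20 cm above the dog, never more than 30 cm away
indicia_pos = place_indicia(np.array([1.0, 0.0, 2.0]), np.array([0.0, 0.2, 0.0]), 0.3)
```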
[0074] Further Definitions and Embodiments:
[0075] In the above description of various embodiments of present inventive concepts, it is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of present inventive concepts. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which present inventive concepts belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[0076] When an element is referred to as being "connected", "coupled", "responsive", or variants thereof to another element, it can be directly connected, coupled, or responsive to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected", "directly coupled", "directly responsive", or variants thereof to another element, there are no intervening elements present. Like numbers refer to like elements throughout. Furthermore, "coupled", "connected", "responsive", or variants thereof as used herein may include wirelessly coupled, connected, or responsive. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Well-known functions or constructions may not be described in detail for brevity and/or clarity. The term "and/or" includes any and all combinations of one or more of the associated listed items.
[0077] It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus, a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts. The same reference numerals or the same reference designators denote the same or similar elements throughout the specification.
[0078] As used herein, the terms "comprise", "comprising", "comprises", "include", "including", "includes", "have", "has", "having", or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof. Furthermore, as used herein, the common abbreviation "e.g.", which derives from the Latin phrase "exempli gratia," may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item. The common abbreviation "i.e.", which derives from the Latin phrase "id est," may be used to specify a particular item from a more general recitation.
[0079] Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).
[0080] These computer program instructions may also be stored in a tangible computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, embodiments of present inventive concepts may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor such as a digital signal processor, which may collectively be referred to as "circuitry," "a module" or variants thereof.
[0081] It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Finally, other blocks may be added/inserted between the blocks that are illustrated, and/or blocks/operations may be omitted without departing from the scope of inventive concepts. Moreover, although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
[0082] Many variations and modifications can be made to the embodiments without substantially departing from the principles of the present inventive concepts. All such variations and modifications are intended to be included herein within the scope of present inventive concepts. Accordingly, the above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended examples of embodiments are intended to cover all such modifications, enhancements, and other embodiments, which fall within the spirit and scope of present inventive concepts. Thus, to the maximum extent allowed by law, the scope of present inventive concepts is to be determined by the broadest permissible interpretation of the present disclosure including the following examples of embodiments and their equivalents, and shall not be restricted or limited by the foregoing detailed description.

CLAIMS:
1. A method by an XR system for rendering XR environments through XR devices to participants, the method comprising: obtaining (500) at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment; obtaining (502) at least one video stream from at least one camera capturing objects in the first real environment; rendering (504) an XR space containing virtual representations of the objects for display through a display associated with a first XR device; mapping (506) a first audio component in the at least one audio stream to a first object in the at least one video stream; providing (508) the at least one audio stream for playout through a speaker associated with the first XR device; rendering (510) a first mute-unmute indicia associated with the first object for display through the first XR device; and responsive to an indication that a participant selected the first mute-unmute indicia, selective muting or unmuting (512) the first audio component in the at least one audio stream provided for playout through the speaker associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
2. The method of Claim 1, further comprising: responsive to the indication that the participant selected the first mute-unmute indicia, updating the rendering of the first mute-unmute indicia to indicate triggering of the selective muting or unmuting of the first audio component in the at least one audio stream.
3. The method of any of Claims 1 to 2, further comprising: mapping a second audio component in the at least one audio stream to a second object in the at least one video stream; rendering a second mute-unmute indicia associated with the second object for display through the first XR device; and responsive to an indication that the participant selected the second mute-unmute indicia, selective muting or unmuting the second audio component in the at least one audio stream provided for play out through the speaker associated with the first XR device, based on the association of the second mute-unmute indicia with the second object and based on the mapping of the second object to the second audio component.
4. The method of any of Claims 1 to 3, wherein: the at least one audio stream is obtained from at least one microphone communicatively connected to the first XR device and located to provide the plurality of audio components from sensed sounds generated by the objects in the first real environment; and the at least one video stream is obtained from the at least one camera communicatively connected to the first XR device and located to capture the objects in the first real environment in the at least one video stream.
5. The method of any of Claims 1 to 4, wherein: the first XR device is located in a second environment that is separate and isolated from the first environment, where with the first XR device in the second environment the at least one audio stream does not contain audio components from sounds generated by objects in the second real environment and the at least one video stream does not capture the objects in the second real environment.
6. The method of any of Claims 1 to 5, further comprising: performing by circuitry in the first XR device, the obtaining (500) of the at least one audio stream, the obtaining (502) of the at least one video stream, and the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream, indicating in metadata the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream, and providing to an XR server the metadata, the at least one audio stream, and the at least one video stream; and performing by circuitry in the XR server, the rendering (504) of the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, the providing (508) of the at least one audio stream for play out through the speaker associated with the first XR device, and the rendering (510) of the first mute-unmute indicia associated with the first object for display through the first XR device, wherein the selective muting or unmuting (512) of the first audio component in the at least one audio stream is performed by the circuitry of the first XR device or the XR server.
7. The method of Claim 6, wherein the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream, comprises determining an indication of direction between the first object and the first XR device, and the indicating in the metadata the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream, comprises generating the metadata to include the indication of direction.
8. The method of Claim 7, further comprising repeating the mapping (506) and updating the metadata to respond to movement of the first object and/or the first XR device.
9. The method of any of Claims 6 to 8, wherein the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream, comprises: determining an indication of direction between the first object and the first XR device; and determining the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction.
10. The method of any of Claims 1 to 5, further comprising: performing by circuitry in an XR server, the obtaining (500) of the at least one audio stream, the obtaining (502) of the at least one video stream, the rendering (504) of the XR space, the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream, the rendering (504) of the XR space containing the virtual representations of the objects for display through the display associated with the first XR device, the providing (508) of the at least one audio stream for play out through the speaker associated with the first XR device, and the rendering (510) of the first mute-unmute indicia associated with the first object for display through the first XR device, and wherein the selective muting or unmuting (512) of the first audio component in the at least one audio stream is performed by the circuitry of the first XR device or the XR server.
11. The method of Claim 10, wherein the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream, comprises: determining from metadata obtained with the at least one audio stream or the at least one video stream, an indication of direction between the first object and the first XR device; and determining the mapping of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction.
12. The method of Claim 11, wherein the determining of the mapping (506) of the first audio component in the at least one audio stream to the first object in the at least one video stream based on the indication of direction, comprises: identifying the first object from among the objects captured in the at least one video stream based on correlating the indication of direction between the first object and the first XR device to location of the first object in the at least one video stream.
13. The method of any of Claims 1 to 12, further comprising: anchoring the rendering of the first mute-unmute indicia to remain proximately located to a rendering of the first object when displayed through the first XR device and/or to remain proximately located to a defined area in a room.
14. The method of any of Claims 1 to 13, wherein the rendering (510) of the first mute- unmute indicia associated with the first object for display through the first XR device, comprises: rendering the first mute-unmute indicia to be within a threshold distance of the first object when displayed through the first XR device.
15. An XR system (450) for rendering XR environments through XR devices (150) to participants, the XR system (450) comprising: at least one processor circuit (110, 170); and at least one memory circuit (120, 172) storing instructions executable by the at least one processor circuit (110, 170) to perform operations to: obtain at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment; obtain at least one video stream from at least one camera (178) capturing objects in the first real environment; render an XR space containing virtual representations of the objects for display through a display (174) associated with a first XR device (150); map a first audio component in the at least one audio stream to a first object in the at least one video stream; provide the at least one audio stream for play out through a speaker (176) associated with the first XR device (150); render a first mute-unmute indicia associated with the first object for display through the first XR device; and responsive to an indication that a participant selected the first mute-unmute indicia, selective mute or unmute the first audio component in the at least one audio stream provided for play out through the speaker (176) associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
16. The XR system of Claim 15, wherein the operations further comprise to perform the method of any of Claims 2 to 14.
17. An XR system (450) for rendering XR environments through XR devices (150) to participants, the XR system (450) operative to: obtain at least one audio stream containing a plurality of audio components from sensed sounds generated by objects in a first real environment; obtain at least one video stream from at least one camera (178) capturing objects in the first real environment; render an XR space containing virtual representations of the objects for display through a display (174) associated with a first XR device (150); map a first audio component in the at least one audio stream to a first object in the at least one video stream; provide the at least one audio stream for play out through a speaker (176) associated with the first XR device (150); render a first mute-unmute indicia associated with the first object for display through the first XR device; and responsive to an indication that a participant selected the first mute-unmute indicia, selective mute or unmute the first audio component in the at least one audio stream provided for play out through the speaker (176) associated with the first XR device, based on the association of the first mute-unmute indicia with the first object and based on the mapping of the first object to the first audio component.
18. The XR system (450) of Claim 17, wherein the operations further comprise to perform the method of any of Claims 2 to 14.