
WO2026005772A1 - Immersive sound field communication using a loudspeaker array - Google Patents

Immersive sound field communication using a loudspeaker array

Info

Publication number
WO2026005772A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
communication system
sound field
array
listening area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/035601
Other languages
French (fr)
Inventor
Dongeek Shin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to PCT/US2024/035601
Publication of WO2026005772A1
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M 3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H04S 7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2227/00 Details of public address [PA] systems covered by H04R 27/00 but not provided for in any of its subgroups
    • H04R 2227/003 Digital PA systems using, e.g. LAN or internet
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R 2499/10 General applications
    • H04R 2499/15 Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 27/00 Public address systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/12 Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • Remote communication technologies have long provided ways for people separated by distance to communicate with one another in real time.
  • The telephone provided a form of real-time audio communication between remotely separated users for decades before other more sophisticated technologies (e.g., cellular technologies, Voice-over-Internet-Protocol (VoIP) technologies, etc.) were developed to let people communicate with more flexibility, lower cost, higher quality of service, and so forth.
  • methods and systems described herein are configured to detect whether a user is located where the array of loudspeakers can properly present the 3D sound field, or, if the user is not so located, to direct the presentation of the audio in ways that preserve the immersiveness as much as possible given the circumstances.
  • one implementation described herein involves a method that may be performed by a first communication system at a first scene.
  • This example method may include, for instance: 1) receiving, by the first communication system at the first scene from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; 2) presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • the ceasing presenting the 3D sound field using the array of loudspeakers may coincide with the array of loudspeakers instead presenting a flattened sound field (i.e., the sound captured at the second scene, but without imposing the 3D effects) or with the 3D sound field instead being presented by a headphone device worn by the user.
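  • The following minimal sketch (not from the publication; all function and parameter names are hypothetical) illustrates the control flow of the example method for one frame of the received audio data stream, assuming the caller supplies playback callbacks for the loudspeaker array and the headphone device:

      def route_audio_frame(frame, in_reference_area, wearing_headphones,
                            play_3d_on_speakers, play_3d_on_headphones,
                            play_flat_on_speakers):
          """Apply steps 2-4 of the example method to one received frame (step 1)."""
          if in_reference_area:
              play_3d_on_speakers(frame)      # step 2: 3D sound field via loudspeakers
          elif wearing_headphones:
              play_3d_on_headphones(frame)    # steps 3-4: cease on speakers, go binaural
          else:
              play_flat_on_speakers(frame)    # steps 3-4: flattened, non-3D fallback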
  • the second communication system may perform similar functions at the same time.
  • the second communication system may also receive an audio data stream representing a 3D sound field captured at the first scene by an array of microphones of the first communication system, present this 3D sound field to a user at the second scene, and adjust the presentation in accordance with the dynamic location of the second user in a similar manner as described herein for the first user.
  • the communication system may include a display screen, an array of loudspeakers configured to present 3D sound configured for a reference listening area from where a user views the display screen, and a processor.
  • the processor may be communicatively coupled to the array of loudspeakers and may be configured to perform a process that includes: 1) receiving, from an additional communication system, an audio data stream representing a 3D sound field captured by an array of microphones of the additional communication system; 2) presenting, using the array of loudspeakers and based on the audio data stream, the 3D sound field to the user as the user is located within the reference listening area; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • the communication system may perform other actions to continue presenting sound captured by the additional communication system in as immersive a way as possible.
  • the array of loudspeakers could present a flattened sound field or the 3D sound field could continue being presented by a headphone device worn by the user as they are outside of the reference listening area.
  • Yet another example implementation described herein involves a non-transitory computer-readable medium storing instructions that, when executed by a processor of a first communication system at a first scene, cause the processor of the first communication system to perform a process.
  • the process may include: 1) receiving, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; 2) presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • the array of loudspeakers could present a flattened sound field or the 3D sound field could continue being presented by a headphone device worn by the user.
  • FIG. 1 shows certain aspects of an illustrative implementation of immersive sound field communication using a loudspeaker array configured to present a 3D sound field for a user located in a reference listening area in accordance with principles described herein.
  • FIG. 2 shows illustrative aspects of a communication session involving a user who is wearing a headphone device and leaves a reference listening area in accordance with principles described herein.
  • FIG. 3 shows illustrative aspects of an additional implementation of immersive sound field communication using a loudspeaker array in accordance with principles described herein.
  • FIG. 4 shows illustrative aspects of a communication session involving a user who is not wearing a headphone device and leaves a reference listening area in accordance with principles described herein.
  • FIG. 5 shows an illustrative method for immersive sound field communication using a loudspeaker array in accordance with principles described herein.
  • FIG. 6 shows illustrative aspects of a transition between a 3D sound presentation on a loudspeaker array and on a headphone device in accordance with principles described herein.
  • FIG. 7 shows illustrative aspects of a transition between a 3D sound presentation on a headphone device and on a loudspeaker array in accordance with principles described herein.
  • FIGS. 8A-8C show illustrative aspects of how a communication system may detect when a user is located within a reference listening area and when the user leaves the reference listening area in accordance with principles described herein.
  • FIG. 9 shows an illustrative computing system that may be used to implement various devices and/or systems described herein.
  • While certain communication technologies may be desirable as a direct result of convenient ways that they deviate from this standard (e.g., text messaging providing a non-real-time means of written communication; audio-only phone calls providing a verbal-only means of communication in which people do not need to worry about how they look or where they are located; etc.), other communication technologies may attempt to replicate this gold standard as nearly as possible so as to serve as a substitute for in-person, face-to-face communication (e.g., for when actual in-person communication is not possible or convenient).
  • Real-time video communication provides one example of a communication technology that has the potential to be far more immersive and realistic than conventional audio-only communication. This is true for video communication carried out on various types of communication systems including, for example, smartphones, tablets, laptops, etc., and is particularly true for emerging three-dimensional (3D) communication technologies (sometimes referred to by other terms such as teleportation, telepresence, holoportation, spatial conferencing, etc.). 3D communication technologies provide even more immersive and realistic remote communication experiences by using 3D capture and projection techniques to create, in some examples, life-sized, 3D images of people and environments in real-time.
  • Audio is an important part of any communication experience, but the immersive illusion of in-person communication that may otherwise be provided by video communication (and especially by 3D video communication) tends to decline when sound produced during a remote conversation conspicuously fails to mimic the behavior that real-world sound would exhibit if the communication were actually in person.
  • The realistic reproduction of sound is thus a technical problem that video communication systems must address, with the objective generally being for reproduced sound to simulate real-world sound in quality (e.g., with high fidelity), in time (e.g., with minimal latency), and in space (e.g., simulating a 3D sound field that seems to originate from locations in space that correspond to where the sound source is visually perceived).
  • a common situation that occurs with communication systems tends to exacerbate and emphasize the technical problem of localized sound simulation described above.
  • This situation involves one participant in a communication session moving outside of a reference listening area where an array of loudspeakers may be focused and configured to reproduce a 3D sound field.
  • this reference listening area may often be associated with (e.g., fully or partially overlapping or coinciding with) a location in front of a display screen wherein the participant (hereafter referred to as a user of the communication system) may be located to view the display screen and interact with a 3D representation of another user presented thereon.
  • An array of loudspeakers could be integrated with the display screen or placed near the display screen or around the room in such a way that the communication system may use the array of loudspeakers to present a 3D sound field captured by a plurality of microphones at the remote scene.
  • The loudspeakers may apply various effects and principles such as interaural level differences (ILDs, in which one ear hears a sound at a higher volume because the sound source is closer to that ear and partially blocked by the head for the other ear), interaural time differences (ITDs, in which one ear hears a sound at a slightly earlier time because the sound source is closer to that ear, so that sound takes longer to propagate to the other ear), head-related transfer functions (representing the head pose of the user and how it would affect 3D sound propagation), and so forth.
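  • As a concrete illustration of the ITD and ILD cues mentioned above, the following sketch (a simplification under stated assumptions, not taken from the publication) uses Woodworth’s spherical-head formula for ITD and a crude broadband sine model for ILD:

      import math

      HEAD_RADIUS_M = 0.0875   # typical adult head radius (assumed)
      SPEED_OF_SOUND = 343.0   # m/s at room temperature

      def interaural_time_difference(azimuth_rad):
          """Woodworth's spherical-head ITD approximation for a distant source."""
          theta = abs(azimuth_rad)
          itd = (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))
          return math.copysign(itd, azimuth_rad)   # sign encodes left vs. right

      def interaural_level_difference_db(azimuth_rad, max_ild_db=6.0):
          """Very crude broadband ILD; real ILDs vary strongly with frequency."""
          return max_ild_db * math.sin(azimuth_rad)

      # A source 45 degrees to the right arrives ~0.38 ms earlier at the right ear.
      print(interaural_time_difference(math.radians(45)))   # ~3.8e-4 s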
  • the array of loudspeakers may be disposed in static locations with respect to the display screen and assumptions about where the user will be located during a given communication session may be integrated into the algorithms for reproducing the 3D sound field using the loudspeaker array.
  • the physical location within the scene where proper spatialization of the 3D sound field can be produced by the loudspeaker array is referred to herein as a reference listening area associated with the loudspeaker array.
  • the reference listening area may be static with respect to the scene, or at least may be static with respect to the communication system and the loudspeakers integrated therein (e.g., such that the reference listening area moves with the communication system and the speaker array).
  • the communication system controlling that loudspeaker array may be able to reproduce an immersive 3D sound field captured by an array of microphones at the other scene being communicated with.
  • a technical problem may arise, however, if the user moves out of the reference listening area for some reason. For example, during the course of a conversation, the user could temporarily step away from the reference listening area to retrieve an object that they wish to show to the other user they are communicating with.
  • Not only may the user perceive that the 3D sound field no longer includes the immersive details that convincingly simulated the sound originating from the direction of specific sound sources, the user may further perceive distracting artifacts related to the presentation of the 3D sound field in the reference listening area.
  • These artifacts may be an aural analog, for example, to what happens visually when anamorphic art (e.g., a sidewalk painting of an object that is designed to look three dimensional from a specific vantage point or angle) is viewed from an unintended angle that not only causes the 3D effect to disappear, but also reveals the object to be highly distorted in order to achieve the 3D effect at the particular focus point.
  • Methods and systems described herein for immersive sound field communication using a loudspeaker array present technical solutions to this technical problem of immersive 3D sound simulation by a loudspeaker array being diminished, and even becoming an undesirable distraction, when the listening user is not located in accordance with constructive and destructive summing assumptions built into the system (e.g., such as when the user leaves the reference listening area).
  • methods and systems described herein may be configured to recognize the issue when it occurs. For example, a first communication system at a first scene may receive an audio data stream from a second communication system at a second scene, the audio data stream representing a 3D sound field captured at the second scene by an array of microphones at the second scene.
  • the first communication system may present the 3D sound field, based on the audio data stream, to a user located within a reference listening area at the first scene.
  • If the user then leaves the reference listening area, the first communication system may detect this event and respond accordingly to mitigate the technical problem described above and to otherwise direct the presentation of the audio in ways that help preserve the immersiveness as much as possible given the circumstances.
  • If the user is wearing a headphone device (e.g., wireless earbuds, etc.), the communication system may cease presenting the 3D sound field using the array of loudspeakers and instead transition to presenting the 3D sound field using the headphone device.
  • Otherwise, the communication system may cease presenting the 3D sound field and instead use the array of loudspeakers to present a flattened, non-3D sound field (i.e., without the various 3D effects that lead to the undesirable artifacts for listeners outside the reference listening area).
  • The technical effect of implementing one of these technical solutions is that the sound field will continuously be presented to the user in a manner that, from the user’s perspective, sounds immersive and true-to-life and is devoid of distracting artifacts even when the user is listening from an unintended location.
  • the technical effect may be a continuously immersive sound presentation that transitions seamlessly from the loudspeakers to the headphone device while always simulating the 3D sound field according to the pose and/or dynamic movements of the user (e.g., based on the 3 degrees of freedom (3DOF) of the user’s head orientation, based on the 6 degrees of freedom (6DOF) of the user’s head position and orientation, based on a head-related transfer function (HRTF) computed for the user’s head, etc.).
  • The technical effect may be to simulate the sound with immersive 3D spatialization when the user’s position allows it, while still presenting a version of the sound that, while less immersive, at least avoids distracting artifacts when the user’s position (outside of the reference listening area) does not allow for the 3D reproduction of the sound field.
  • the user may enjoy a high-quality communication session even while moving about freely (without being anchored to the reference listening area).
  • the user may be presented with immersive sound that supports the sensation of natural, in-person communication without disruptive reminders that the communication is actually being mediated by technology.
  • FIG. 1 shows certain aspects of an illustrative implementation of immersive sound field communication using a loudspeaker array configured to present a 3D sound field for a user located in a reference listening area in accordance with principles described herein. More particularly, the implementation of FIG. 1 is shown at two moments in time labeled as moment 100-A (on the left-hand side of FIG. 1) and as moment 100-B (on the right-hand side of FIG. 1).
  • The implementation of FIG. 1 is shown to comprise a communication system 102.
  • communication system 102 is shown to be implemented as a video communication system that includes a display screen 104 and an array of loudspeakers 106.
  • communication system 102 may be communicatively coupled to a second communication system (e.g., another video communication system, not explicitly shown in FIG. 1) at a second scene, and display screen 104 may be configured to present video captured at the second scene by one or more cameras of the second communication system (or, in some examples, to present 3D representations constructed based on such video).
  • the loudspeakers 106 of the array of loudspeakers will be understood to be integrated within display screen 104 at each corner of the screen in this example.
  • each of the loudspeakers 106 of the array is illustrated in FIG. 1 as a dashed-line circle disposed in a particular corner of display screen 104 of communication system 102.
  • the array of loudspeakers 106 may be configured to present 3D sound 108 (illustrated as sound waves that originate from loudspeakers 106 to propagate outward into the scene) configured for a reference listening area 110 from where a user 112-1 views display screen 104.
  • While four coplanar speakers are used in this example to form a 2D array of speakers that is essentially coplanar with the display screen 104, it will be understood that other examples may include other types of arrangements for the speakers. For instance, two collinear speakers may form a 1D array of speakers, three collinear speakers may form a 1D array or three coplanar (but not collinear) speakers may form a 2D array, or four or more speakers may form 1D, 2D, or 3D arrays as may serve a particular implementation.
  • user 112-1 and communication system 102 are shown to be located at a first scene 114-1, and user 112-1 is shown to be using communication system 102 to engage in a communication session with a second user 112-2 that is using a similar communication system at a second scene 114-2 (e.g., a scene that is remote from scene 114-1 where user 112-1 and the first communication system 102 are located).
  • The communication systems may be relatively large so as to present life-sized representations (e.g., 3D representations in certain examples) of people and objects in scenes 114-1 and 114-2.
  • Reference listening area 110 may, in this example, be a static area within scene 114-1 from where user 112-1 views display screen 104.
  • the reference listening area 110 is an area located at a predetermined position relative to the display screen 104 (and/or relative to the array of loudspeakers 106).
  • reference listening area 110 may be more dynamic in these examples.
  • communication system 102 may be configured to serve as a virtual portal to allow users 112-1 and 112-2 to see and hear one another in a manner that simulates in-person interaction.
  • a spatialized 3D sound field on each side of the communicative link may be captured, transmitted, and presented in a manner that reproduces various 3D effects in the audio heard by the users 112-1 and 112-2.
  • an array of microphones associated with the second communication system may capture a 3D sound field at scene 114-2 while user 112-2 is speaking from a location at the scene.
  • Communication system 102 may then receive and present that 3D sound field as 3D sound 108 using the array of loudspeakers 106.
  • 3D effects may be produced by an array of loudspeakers such as loudspeakers 106.
  • 3D sound 108 may be steered and otherwise simulated to originate from a desirable virtual location, such as a virtual location corresponding to where user 112-1 sees the sound source of user 112-2 in the example where user 112-2 is speaking.
  • This 3D sound steering and reproduction may be implemented in any suitable way.
  • loudspeakers 106 may be configured to present sounds using phase-delayed transmissions to cause constructive and/or destructive summing of the sound at relevant points near the left and right ears of user 112-1.
  • the 3D sound 108 produced by the array of loudspeakers 106 may constructively and destructively sum in ways that properly steer the sound and recreate the 3D sound field that was captured at second scene 114-2.
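  • One simple way to realize such phase-delayed (delay-and-sum) focusing is sketched below; the geometry, names, and integer-free delay simplification are assumptions for illustration, and a practical system would add fractional delays, per-frequency filtering, and crosstalk cancellation:

      import numpy as np

      SPEED_OF_SOUND = 343.0  # m/s

      def focusing_delays(speaker_positions_m, focal_point_m):
          """Per-speaker delays (seconds) so that identical emissions from all
          loudspeakers arrive at the focal point simultaneously and sum
          constructively there."""
          positions = np.asarray(speaker_positions_m, dtype=float)
          dists = np.linalg.norm(positions - np.asarray(focal_point_m), axis=1)
          return (dists.max() - dists) / SPEED_OF_SOUND   # nearer speakers wait longer

      # Hypothetical geometry: speakers at the corners of a 1.2 m x 0.7 m screen,
      # focal point near the listener's right ear, 1.5 m in front of the screen.
      speakers = [(-0.6, 0.35, 0.0), (0.6, 0.35, 0.0),
                  (-0.6, -0.35, 0.0), (0.6, -0.35, 0.0)]
      print(focusing_delays(speakers, (0.1, 0.0, 1.5)))   # ~[0, 0.21, 0, 0.21] ms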
  • a consequence of this approach is that away from the predesignated focal point (i.e., outside of reference listening area 110), 3D sound 108 may sound wrong (e.g., unrealistic, unnatural, artificial, etc.) due to undesirable artifacts of the 3D effects introduced based on the assumption that the listener is within reference listening area 110.
  • 3D sound 108 is illustrated (e.g., by dotted lines extending from loudspeakers 106 to reference listening area 110) as being produced by loudspeakers 106 and propagating to the reference listening area where the sound can be experienced by user 112-1 while user 112-1 is located there. It will be understood that sound emanating from loudspeakers 106 may also be audible elsewhere within scene 114-1, but the sound will only have the desired 3D effects at or near that focal point encompassed by reference listening area 110. Loudspeakers 106 may be positioned, oriented, and configured to operate so as to implement the reference listening area 110 at a desirable point in space. For example, for a video communication system such as communication system 102, the system design would typically be such that reference listening area 110 coincides with a natural place that user 112-1 may wish to be during a communication session in which the user is viewing display screen 104.
  • Regardless of how naturally located reference listening area 110 is, however, there may still be reasons for user 112-1 to at least occasionally leave reference listening area 110. For these occasions, implementations described herein allow communication system 102 to detect that user 112-1 has moved away from reference listening area 110 and to take action to avoid a scenario where user 112-1 hears the disruptive artifacts of 3D sound 108 from locations external to reference listening area 110 where the assumptions baked into 3D sound 108 do not hold.
  • Moment 100-A shows that loudspeakers 106 present 3D sound 108 to user 112-1 while user 112-1 is located within reference listening area 110.
  • A path 116-A that user 112-1 may have traversed prior to moment 100-A shows that user 112-1 may have some limited mobility within reference listening area 110 while still experiencing the immersive realism of 3D sound 108 as presented by the array of loudspeakers 106. For example, the user may shift around slightly within this area, may sit or stand, etc., while still enjoying the 3D effects produced by the array of loudspeakers 106.
  • Since reference listening area 110 is a relatively small area directly in front of display screen 104, movements such as those illustrated by path 116-A clearly do not allow user 112-1 much freedom of motion while enjoying 3D sound 108 as presented by loudspeakers 106. Accordingly, the immersive experience of the communication session could be diminished if the user leaves the reference listening area 110 to retrieve something from another room (e.g., an object that user 112-1 wishes to show user 112-2 during the video communication session), to perform tasks outside of reference listening area 110 (e.g., tidying up around the house while talking to a friend, etc.), to relocate to a more comfortable place (e.g., to sit on a sofa that is still within scene 114-1 but further away from display screen 104), or for some other reason.
  • Moment 100-B shows a path 116-B that user 112-1 may follow at some point after moment 100-A.
  • user 112-1 may leave reference listening area 110 to move through other parts of scene 114-1 external to reference listening area 110 (where 3D sound 108 is not configured to function properly and where undesirable artifacts of the sound could be heard).
  • At some points along path 116-B, user 112-1 may still be able to see display screen 104.
  • At other points, user 112-1 may be completely out of the field of view of communication system 102 and may be unable to see display screen 104. At all of these locations, however, it may be desirable for user 112-1 to still be able to hear user 112-2 as user 112-2 speaks and/or as other sound is communicated from scene 114-2. Indeed, to the extent possible, it would be desirable for user 112-1 to perceive a 3D sound field originating from the virtual portal into scene 114-2 that communication system 102 represents.
  • The array of loudspeakers 106 may not be configured to provide this immersive sound for user 112-1 after leaving reference listening area 110, since the array is specifically associated with reference listening area 110 (e.g., a static area with respect to communication system 102).
  • Communication system 102 may, however, be configured to detect when user 112-1 leaves reference listening area 110 (e.g., such as when user 112-1 follows path 116-B and steps away from reference listening area 110 to go to other locations within scene 114-1).
  • In response, communication system 102 may cease presenting 3D sound 108 on loudspeakers 106. This is depicted at moment 100-B by the absence of the dotted lines and of 3D sound 108 emerging from loudspeakers 106.
  • user 112-1 may be detected to be wearing a headphone device 118 that communication system 102 may use to present the 3D sound field for as long as user 112-1 remains outside reference listening area 110.
  • headphone device 118 may be configured to initially calibrate to the orientation of user 112-1 (e.g., facing display screen 104) and to the location of the current sound source being presented (e.g., speech originating from user 112-2).
  • An inertial measurement unit (IMU) within headphone device 118 or another suitable mechanism for tracking the pose of user 112-1 may be employed to determine a 3DOF or 6DOF pose of headphone device 118 so that spatialized 3D sound may be transitioned from being presented on the array of loudspeakers 106 to being presented on headphone device 118 as 3D sound 120.
  • Binauralization algorithms may generate 3D sound 120 in the left and right sides of headphone device 118 based on the detected pose of user 112-1 and based on the 3D sound field that is being captured at scene 114-2, streamed from the second communication system at scene 114-2 to communication system 102, and provided (e.g., by way of Bluetooth, WiFi direct, or another suitable wireless or wired protocol) from communication system 102 to headphone device 118.
  • the detected pose may be a 3DOF or 6DOF pose of the head of user 112-1.
  • signals from loudspeakers 106 may then serve as an anchor to avoid an IMU-based drift and to achieve a seamless screen-locked audio experience regardless of where user 112-1 is located within scene 114-1.
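  • The anchoring idea might be combined with IMU yaw tracking roughly as follows (a sketch under stated assumptions: anchor_yaw is a hypothetical absolute yaw estimate derived from the loudspeaker anchor signals, and only the yaw axis of a 3DOF pose is shown):

      import math

      def wrap_angle(a):
          """Wrap an angle to [-pi, pi)."""
          return (a + math.pi) % (2.0 * math.pi) - math.pi

      def corrected_yaw(imu_yaw, anchor_yaw, alpha=0.02):
          """Complementary-filter step: slowly pull the drifting IMU yaw toward
          the absolute estimate recovered from the loudspeaker anchor signals."""
          return wrap_angle(imu_yaw + alpha * wrap_angle(anchor_yaw - imu_yaw))

      def screen_locked_azimuth(screen_azimuth, head_yaw):
          """Azimuth of the screen-locked virtual source relative to the head,
          usable to binauralize 3D sound 120 for headphone device 118."""
          return wrap_angle(screen_azimuth - head_yaw)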
  • FIG. 2 shows other views of the communication session being carried out by the implementation of FIG. 1. More particularly, FIG. 2 shows a view 200-A that corresponds to a first moment of the communication session when user 112-1 is located within reference listening area 110 (e.g., similar to moment 100-A of FIG. 1) and a view 200-B that corresponds to a second moment of the communication session when user 112-1 has left reference listening area 110 (e.g., similar to moment 100-B of FIG. 1). While FIG. 1 illustrates a perspective view only of communication system 102 as display screen 104 allows user 112-1 to look from scene 114-1 (where the user is located) into scene 114-2 (where the second user 112-2 is located), views 200-A and 200-B each show portions of both scenes 114-1 and 114-2 from an overhead view that helps illustrate additional aspects of the communication session.
  • view 200-A shows, from a top view, both a communication system 102-1 within scene 114-1 (i.e., the communication system 102 shown in FIG. 1) and a communication system 102-2 within scene 114-2 (which may be similar to communication system 102 but was not explicitly shown in the perspective view of FIG. 1).
  • communication system 102-1 is shown to include an array of loudspeakers 106 that produce 3D sound 108 to be perceived as immersive and realistic sound to user 112-1 as long as the user is located within reference listening area 110 (as is the case at the moment shown by view 200-A).
  • view 200-A shows 3D sound 108 being directed from loudspeakers 106 to reference listening area 110.
  • sound produced by these loudspeakers will be audible elsewhere in scene 114-1, but the sound is configured to properly simulate the 3D sound field only within reference listening area 110.
  • Scene 114-2 is shown to be separated from scene 114-1 by a barrier 202 that will be understood to represent any suitable distance between the scenes.
  • the two scenes could be as close together as two rooms in the same building or even two portions of the same room in certain examples, while the scenes may be far more remote (e.g., in different buildings, cities, countries, or even opposite parts of the world) in other examples.
  • an audio data stream 204 is communicated between communication system 102-1 and communication system 102-2 during the communication session.
  • Audio data stream 204 may represent bidirectional streams of data being concurrently transmitted by communication interfaces of each communication system throughout the session. While not explicitly shown in FIG. 2, audio data stream 204 may be transmitted using any suitable data transmission technologies to be carried over any suitable networks (e.g., local WiFi networks, other local area networks, wide area networks, private carrier networks, the public internet, etc.) or other communicative means.
  • Communication system 102-1 and communication system 102-2 will be understood to represent similar or identical communication systems having similar or identical features and capabilities. However, because the examples described herein involve user 112-2 speaking while user 112-1 listens, communication system 102-1 is illustrated to include the array of loudspeakers 106 while communication system 102-2 is shown to include an array of microphones 206 that capture the 3D sound field of scene 114-2, including speech sound 208 originating from user 112-2. As with the array of loudspeakers 106, it will be understood that microphones 206 of the array of microphones in communication system 102-2 may be disposed at any suitable locations with respect to communication system 102-2 and its display screen.
  • the microphones 206 could be placed around the room and be communicatively coupled (e.g., by wired or local wireless communication) to communication system 102-2.
  • the microphones could be integrated with the display screen in a similar manner as described for the integrated array of loudspeakers 106.
  • the array of microphones may be configured to capture a 3D sound field that includes the speech sound 208 from user 112-2 along with other sounds (e.g., reverb and echoes from speech sound 208, other people speaking in scene 114-2, other noises from objects at the scene that are not explicitly shown, etc.).
  • communication systems 102-1 and 102-2 may collectively simulate a verbal interaction 210-A in which user 112-1 perceives that user 112-2 is standing nearby and speaking in a similar way as might be done during an actual in-person conversation.
  • users 112-1 and 112-2 may converse back and forth through their communication systems as if a portal is open between them and they can see and hear each other in like manner as if they were together in a common space.
  • In view 200-B, many of the same elements are shown, but user 112-1 is illustrated as having left reference listening area 110 to move along path 216 to a new location external to reference listening area 110 (a larger portion of scene 114-1 is shown to accommodate the movement of user 112-1 away from communication system 102-1).
  • communication system 102-1 may be configured to detect when user 112-1 leaves reference listening area 110 and to cease presenting 3D sound 108 on loudspeakers 106. Accordingly, while the array of loudspeakers 106 is still shown in view 200-B, no sound is shown to be produced by these speakers in view 200-B.
  • communication system 102-1 may be configured to present the 3D sound field (i.e., the same 3D sound field being captured by the array of microphones 206 that was previously presented using the array of loudspeakers 106) using headphone device 118 worn by user 112-1.
  • 3D sound 120 may be presented by headphone device 118 based on audio data stream 204.
  • 3D sound 120 may replace 3D sound 108 and may persist while the presenting of the 3D sound field using the array of loudspeakers 106 is ceased (e.g., for as long as user 112-1 is located outside of reference listening area 110).
  • The relative positioning of users 112-1 and 112-2 in view 200-B is such that the users may not be able to see one another through their screens in the same way they could at the earlier moment shown by view 200-A.
  • a verbal interaction 210-B shows that spatialized 3D audio may continue to be presented and enjoyed as a consequence of transitioning the 3D sound to the headphone device 118.
  • verbal interaction 210-B may be perceived, by user 112-1, as if the sound is still emerging from communication system 102-1. In this way, the illusion of having an open portal between the scenes may persist regardless of where user 112-1 moves within scene 114-1.
  • the verbal interaction could be perceived as if the sound is still emerging from where user 112-2 appears on the screen to be located (not shown in FIG. 2, but as would be illustrated by an arrow that extends straight from user 112-2 to user 112-1 without necessarily passing through communication system 102-1).
  • the presenting of the 3D sound field (i.e., 3D sound 120) using headphone device 118 may be performed based on head movement tracking of user 112-1 by an inertial measurement unit (IMU) within headphone device 118.
  • For example, the IMU could track a 3DOF pose of the head of user 112-1 with respect to the virtual position of user 112-2 (i.e., the sound source in this example) or with respect to communication system 102-1 (where the sound would originate if the communication systems implemented an actual portal).
  • Alternatively, a 6DOF or other suitable pose may be tracked and accounted for in the binauralization of the 3D sound field for presentation on headphone device 118.
  • The examples of FIGS. 1 and 2 may apply for a listening user (e.g., user 112-1) that is wearing a headphone device that the communication system 102 can use to present the 3D sound field when the user leaves the reference listening area 110. Even when the user does not happen to be wearing or have access to such a headphone device, however, communication system 102 may still be configured to detect when the user leaves the reference listening area and to take action to try to avoid negative consequences such as have been described (e.g., the user perceiving the sound as artificial or noticing distracting artifacts of 3D sound 108 that is not tailored for the present location of the user).
  • FIG. 3 shows illustrative aspects of an additional implementation of immersive sound field communication using a loudspeaker array in accordance with principles described herein. Except as otherwise noted, FIG. 3 is very similar to FIG. 1 and illustrates the same principles described above.
  • FIG. 3 shows communication system 102 implemented within scene 114-1 as a video communication system with a display screen 104 configured to present video captured at scene 114-2 by a second communication system.
  • communication system 102 also includes the array of loudspeakers 106 that are shown to be disposed at static locations within scene 114-1 such that reference listening area 110 is a static area from where user 112-1 can view display screen 104.
  • user 112-1 may be present within reference listening area 110 and may be presented with 3D sound 108 (e.g., a 3D sound field tailored to reference listening area 110 as illustrated by the directionality of 3D sound 108 being projected to reference listening area 110) by the array of loudspeakers 106 as the user views scene 114-2 and user 112-2 through display screen 104.
  • user 112-1 may have some limited mobility, such as to follow path 116- A, while still experiencing the 3D sound as intended, but may perceive distracting artifacts outside of reference listening area 110 if action is not taken by communication system 102.
  • Moment 300-B on the right-hand side of FIG. 3 illustrates the situation where the user 112-1 without a headphone device has traversed path 116-B away from reference listening area 110. Without the headphone device 118 worn by the user in the example of FIG. 1, communication system 102 is not able to transition the presentation of the 3D sound field from the array of loudspeakers 106 to 3D sound 120 on the headphone device as described above. Instead, as shown, communication system 102 may present a flattened sound field on the array of loudspeakers 106. That is, sound 320, which no longer includes 3D effects tailored for reference listening area 110, is shown to be directed outward to scene 114-1.
  • the presentation of sound 320 at least will allow user 112-1 to continue engaging in the communication session and will not distract user 112-1 with any unwanted artifacts of presentation of 3D sound 108.
  • FIG. 4 illustrates this same scenario with a view 400-A that is parallel to view 200-A of FIG. 2 (corresponding to an earlier moment such as moment 300-A), and with a view 400-B that is parallel to view 200-B of FIG. 2 (corresponding to a later moment such as moment 300-B).
  • a verbal interaction 410-A may be achieved by loudspeakers 106 presenting 3D sound 108 for user 112-1 while the user is located in reference listening area 110.
  • a verbal interaction 410-B involves communication system 102-1 presenting, based on audio data stream 204 and while the presenting of the 3D sound field using the array of loudspeakers 106 (i.e., 3D sound 108) is ceased, a flattened sound field labeled as sound 320.
  • the flattened sound field of sound 320 is shown to be presented using the array of loudspeakers 106 and will be understood to represent sound captured at scene 114-2 by the array of microphones 206.
  • the flattened sound field of sound 320 may comprise a mono or a stereo sound.
  • the flattened sound field of sound 320 may comprise a smaller number of audio channels than loudspeakers 106.
  • In some examples, the flattened sound field of sound 320 is presented by emitting the same audio channel from some or all of the loudspeakers 106, in phase and/or at the same amplitude.
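  • A minimal sketch of such flattening is shown below (assumptions: a naive equal-weight downmix is acceptable, and four loudspeakers are fed; an ambisonics stream might instead use only its omnidirectional W channel):

      import numpy as np

      def flatten_for_loudspeakers(sound_field_channels, num_loudspeakers=4):
          """Downmix a multichannel 3D sound field to one channel and feed the
          identical signal, in phase and at equal amplitude, to every speaker."""
          channels = np.asarray(sound_field_channels, dtype=float)
          mono = channels.mean(axis=0)        # naive equal-weight downmix
          return [mono] * num_loudspeakers    # same channel on each loudspeaker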
  • user 112-1 may hear the verbal interaction 410-B that includes the speech sound 208 from user 112-2 even without headphones.
  • While verbal interaction 410-B may not have the same immersive, 3D nature as the 3D sound of verbal interaction 410-A, it still advantageously avoids undesirable artifacts even though the user has stepped away from reference listening area 110.
  • FIG. 5 shows an illustrative method 500 for immersive sound field communication using a loudspeaker array in accordance with principles described herein. More particularly, method 500 shows one sequence of operations that may be performed by a communication system such as communication system 102 of FIGS. 1 and 3 (communication system 102-1 of FIGS. 2 and 4). While the perspective of communication system 102-1 is chosen arbitrarily for these examples, it will be understood that a similar method from the perspective of the second communication system 102-2 may also be performed in a similar way.
  • FIG. 5 shows illustrative operations 502-508 according to one implementation, though it will be understood that other implementations of method 500 could omit, add to, reorder, and/or modify any of operations 502-508 shown in FIG. 5.
  • FIG. 5 While operations shown in FIG. 5 are illustrated with arrows suggestive of a sequential order of operation, it will be understood that some or all of the operations of method 500 may be performed concurrently (e.g., in parallel) with one another.
  • Each of operations 502-508 of method 500 will now be described in more detail as the operations may be performed by a first communication system (e.g., communication system 102-1) used by a first user (e.g., user 112-1) who is listening during a communication session with a second communication system (e.g., communication system 102-2) used by a second user (e.g., user 112-2) who is speaking.
  • At operation 502, the first communication system may receive, at a first scene (e.g., scene 114-1), from the second communication system at a second scene (e.g., scene 114-2), an audio data stream (e.g., audio data stream 204) representing a 3D sound field.
  • the 3D sound field may be captured at the second scene by an array of microphones of the second communication system.
  • the array of microphones may include omnidirectional and/or unidirectional microphones configured to capture the 3D sound field using ambisonics and/or other spatialized sound capture principles.
  • the 3D sound field may then be encoded into the audio data stream and packaged with other communication information such as video information (e.g., visual 3D representations of people and objects at scene 114-2, etc.), metadata, and so forth, before being transmitted (e.g., over one or more networks, etc.) from the second communication system to the first communication system.
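  • As one example of the ambisonics capture principle mentioned above (a sketch, not the publication’s method; FuMa channel ordering and weighting are assumed), a mono source at a known direction can be encoded into first-order B-format as follows:

      import numpy as np

      def encode_first_order_ambisonics(signal, azimuth_rad, elevation_rad):
          """Encode a mono source into first-order B-format (FuMa W, X, Y, Z)."""
          s = np.asarray(signal, dtype=float)
          w = s / np.sqrt(2.0)                                   # omnidirectional
          x = s * np.cos(azimuth_rad) * np.cos(elevation_rad)    # front-back
          y = s * np.sin(azimuth_rad) * np.cos(elevation_rad)    # left-right
          z = s * np.sin(elevation_rad)                          # up-down
          return np.stack([w, x, y, z])                          # shape (4, n)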
  • At operation 504, the first communication system may present the 3D sound field using an array of loudspeakers.
  • the first communication system may present the 3D sound field, based on the audio data stream received at operation 502, to the first user as the first user is located within a reference listening area at the first scene.
  • For example, the audio data stream and the array of loudspeakers may be configured such that the 3D sound field is presented to a reference listening area that is in a convenient location, such as a location where the first user might naturally be expected to reside while viewing a display screen of the first communication system (e.g., in the event that the first communication system is a video communication system such as shown in FIGS. 1-4).
  • However convenient the reference listening area may be, it may nevertheless be the case that the first user does not wish to be limited to this area during any active communication session.
  • At operation 506, the first communication system may detect, during the presentation of the 3D sound field, that the first user leaves the reference listening area.
  • the user may step away temporarily (for any of a variety of reasons) with an intention to come right back.
  • In other examples, the user may choose to engage in much or the entirety of the communication session from a location away from the reference listening area (e.g., from a sofa across the room that does not happen to be within the reference listening area, etc.).
  • the first communication system may make a determination that the user is not present in the reference listening area that the 3D sound field is being presented for, such that action may be taken.
  • The detection that the first user leaves (or otherwise is not present within) the reference listening area may be performed in any suitable way, several examples of which will be described and illustrated in more detail below.
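  • Ahead of the detection examples referenced above, one very simple approach (illustrative only; the circular region and its radius are assumptions) is to test whether a tracked head position lies within a region around the center of the reference listening area:

      def user_in_reference_area(head_xy_m, center_xy_m, radius_m=0.6):
          """Treat the reference listening area as a circle in the floor plane."""
          dx = head_xy_m[0] - center_xy_m[0]
          dy = head_xy_m[1] - center_xy_m[1]
          return (dx * dx + dy * dy) <= radius_m * radius_m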
  • At operation 508, the first communication system may, in response to the detecting at operation 506 and while the first user remains outside the reference listening area, cease presenting the 3D sound field using the array of loudspeakers. This is not to say that the first communication system necessarily ceases presenting any audio to the first user during the communication session. Rather, in recognition that the user is not located in the reference listening area that the 3D sound field is referenced to or reproduced for, the first communication system may cease using the array of loudspeakers to present the 3D sound field and may instead present the audio to the first user in another suitable way that will help keep the communication experience as immersive as possible for the first user.
  • the first communication system may transition to presenting the 3D sound field based on the audio data stream received at operation 502, but using a headphone device worn by the first user (rather than the array of loudspeakers). This type of example was described and illustrated above in relation to FIGS. 1 and 2.
  • the first communication system may transition to presenting a flattened sound field (rather than a 3D sound field) using the array of loudspeakers. For example, this approach may be useful if the first user is not wearing a headphone device, as illustrated and described above in relation to FIGS. 3 and 4. This presentation may still be based on the audio data stream received at operation 502.
  • Method 500 may be performed by any communication system described herein, including implementations of communication system 102 described and illustrated above (e.g., communication system 102-1, communication system 102-2, etc.).
  • such communication systems may include, along with the array of loudspeakers, one or more processors communicatively coupled to the array of loudspeakers and configured to perform a process implementing method 500 (i.e., a process including operations 502-508 and/or additional or alternative operations in certain examples).
  • Another way that method 500 may be implemented is as instructions embodied on a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may store instructions that, when executed, cause a processor of a communication system at a scene to perform a process implementing method 500.
  • FIG. 6 shows illustrative aspects of a transition 600 between a 3D sound presentation on a loudspeaker array and the 3D sound presentation on a headphone device (i.e., the transition from the loudspeakers to the headphones) in accordance with principles described herein.
  • FIG. 7 shows illustrative aspects of a transition between a 3D sound presentation on a headphone device and on a loudspeaker array (i.e., the transition back from the headphones to the loudspeakers) in accordance with principles described herein.
  • While transitions 600 and 700 are described and illustrated for an implementation in which the headphone device is available and being worn by the user for an uninterrupted 3D audio experience, it will be understood that similar principles could be applied to an implementation (such as illustrated in FIGS. 3 and 4) in which no headphones are being used and the transition is from a 3D sound field to a flattened sound field.
  • In FIG. 6, a scenario is illustrated in which the communication system (e.g., communication system 102-1) may determine, during the presenting of the 3D sound field on the array of loudspeakers and prior to user 112-1 leaving reference listening area 110, that user 112-1 is near a boundary 602 of reference listening area 110.
  • a path 616 that leads user 112-1 out of reference listening area 110 may take user 112-1 across an inner boundary 604-I (‘I’ for “inner”) into a transition region 606 that is defined just inside the boundary 602 of reference listening area 110.
  • path 616 may lead user 112-1 through a transition region 608 that is defined with an outer boundary 604-O (‘O’ for “outer”) just outside the boundary 602 of reference listening area 110 and then out to portions of the scene that are not particularly proximate to reference listening area 110.
  • a transition 610 from the 3D sound field being presented on the loudspeaker array (e.g., the array of loudspeakers 106) to being presented on the headphone device (e.g., headphone device 118) may be performed as user 112-1 passes through transition regions 606 and 608.
  • the communication system may present the 3D sound field using both the array of loudspeakers and the headphone device worn by the user in a manner that transitions from one to the other.
  • the loudspeaker array is shown to be used exclusively (or nearly exclusively) while user 112-1 is located within reference listening area 110 and not particularly close to boundary 602 (i.e., outside of transition region 606).
  • the graph shows how the 3D sound presentation may transition from the loudspeaker array (which is shown to ramp down during transition 610) to the headphone device (which is shown to ramp up during transition 610) until the headphone device is used exclusively (or nearly exclusively) for presenting the 3D sound field.
  • While transition 610 in FIG. 6 is drawn linearly, it will be understood that the transition may take any suitable shape and may occur over any suitable distance.
  • either or both of the ramps associated with transition 610 may be non-linear (e.g., parabolic shaped, logarithmically shaped, exponentially shaped, etc.).
  • only one of the audio mechanisms may transition at the rate shown, while the other may change instantaneously (e.g., a step function) or at a different rate.
  • the headphone device could begin presenting the 3D sound field at full volume as soon as user 112-1 crosses inner boundary 604-I or boundary 602, while the loudspeaker array may drop out more gradually according to the transition curves shown.
  • transition 610 may be performed smoothly and with certain limits or guardrails to ensure that no change is made so abruptly as to risk distracting the user from the immersive experience.
  • a refresh rate for detecting the user’s location with respect to the various boundaries 602, 604-I, and 604-O may be selected that is high enough to produce a smooth transition even if the user is moving relatively quickly.
  • the user location detection (which may be performed in various ways described in more detail below) may be performed or updated several times per second. In some cases, the detection rate may be equal to or related to the frame rate at which video and/or audio are being captured by the communication system.
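One way to realize transition 610 is as a position-driven gain crossfade that is re-evaluated at the detection rate just described. The following Python sketch is a minimal illustration only, assuming circular boundaries centered on reference listening area 110 and a normalized ramp; the function name and the specific ramp shapes are hypothetical.

```python
import math

def compute_transition_gains(user_pos, center, r_inner, r_outer, shape="linear"):
    """Crossfade gains for the loudspeaker array versus the headphone device.

    user_pos, center: (x, y) scene coordinates, where center marks the
    middle of reference listening area 110.
    r_inner: radius of inner boundary 604-I (inside it: loudspeakers only).
    r_outer: radius of outer boundary 604-O (outside it: headphones only).
    Returns (loudspeaker_gain, headphone_gain), each in [0.0, 1.0].
    """
    d = math.dist(user_pos, center)
    if d <= r_inner:              # well inside the reference listening area
        t = 0.0
    elif d >= r_outer:            # past transition region 608
        t = 1.0
    else:                         # within transition regions 606 and 608
        t = (d - r_inner) / (r_outer - r_inner)
        if shape == "parabolic":  # the ramps need not be linear
            t = t * t
    return 1.0 - t, t             # complementary (equal-gain) crossfade
```

Re-evaluating these gains several times per second (e.g., at a video frame rate such as 30 Hz) keeps the crossfade smooth even for a quickly moving user, and clamping the per-update gain change would implement the kind of guardrail mentioned above.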
  • In FIG. 7, a scenario is illustrated in which the communication system may determine, subsequent to ceasing the presenting of the 3D sound field using the array of loudspeakers while user 112-1 is external to reference listening area 110, that user 112-1 reenters reference listening area 110.
  • a path 716 that leads user 112-1 back into reference listening area 110 may take user 112-1 across an outer boundary 704-O (‘O’ for “outer”) into a transition region 708 that is defined just outside a boundary 702 of reference listening area 110.
  • path 716 may lead user 112-1 through a transition region 706 that is defined with an inner boundary 704-I (‘I’ for “inner”) just inside the boundary 702 of reference listening area 110 and then into the reference listening area 110.
  • a graph below reference listening area 110 in FIG. 7 shows a transition 710 from the 3D sound field being presented using the headphone device (e.g., headphone device 118) to being presented using the loudspeaker array (e.g., the array of loudspeakers 106). Transition 710 may be performed as user 112-1 passes through transition regions 708 and 706.
  • While the communication system may eventually cease presenting the 3D sound field using the headphone device (and again present the 3D sound field using the array of loudspeakers) once user 112-1 arrives back in reference listening area 110, the system may perform a transition similar to the one described above when it is determined that user 112-1 is in the process of reentering reference listening area 110 (e.g., as user 112-1 crosses outer boundary 704-O and then boundary 702). As shown during transition 710, for instance, the communication system may present the 3D sound field using both the headphone device worn by the user and the array of loudspeakers in a manner that transitions from one to the other.
  • the headphone device is shown to be used exclusively (or nearly exclusively) while user 112-1 is located outside of reference listening area 110 and not particularly close to boundary 702 (i.e., outside of transition region 708).
  • the graph shows how the 3D sound presentation may transition from the headphone device (which is shown to ramp down during transition 710) to the array of loudspeakers (which is shown to ramp up during transition 710) until the loudspeakers are again used exclusively (or nearly exclusively) for presenting the 3D sound field.
  • the shape and length of these transitional ramps may be implemented in any manner as may serve a particular implementation.
  • the tracking rate of the user as the user moves about in the scene may be any suitable rate such as described above (e.g., several times per second, matching the frame rate used by the communication systems, etc.).
  • FIGS. 8A-8C show illustrative aspects of how a communication system may detect when a user is located within a reference listening area and when the user leaves the reference listening area in accordance with principles described herein. More specifically, FIG. 8A shows a detection technique 800-A and FIG. 8B shows a detection technique 800-B that are each based on analyzing certain characteristics of an acoustic signal emitted by the communication system to determine whether certain predefined thresholds are satisfied.
  • FIG. 8C shows a detection technique 800-C that is based on visual tracking of the user with respect to the reference listening area.
  • detection technique 800-A is illustrated by showing an acoustic signal 801 emitted from communication system 102-1 (e.g., by the array of loudspeakers thereof) to be received by headphone device 118 as worn by user 112-1.
  • this acoustic signal may be an acoustic-based excitation signal that can be detected by headphone device 118 so that various characteristics of the signal can be analyzed to determine if user 112-1 is likely to be located within the reference listening area 110.
  • acoustic signal 801 may be specifically configured only for the purpose of location detection and, as such, may have certain properties tailored to that use case.
  • the acoustic signal may be an inaudible signal that is emitted outside of a frequency range associated with human hearing capability (e.g., a subsonic signal including frequencies less than about 20 Hz, an ultrasonic signal including frequencies greater than about 20 kHz, etc.).
  • Such a signal may be produced using one or more of the loudspeakers 106 or using other mechanisms or loudspeakers specifically configured for this purpose.
  • acoustic signal 801 may be produced in short pulses or chirps that are relatively straightforward for headphone device 118 to identify and analyze. For example, such pulses may be performed at intervals associated with the detection rate described above (e.g., several times per second, etc.).
  • acoustic signal 801 may be audible to the user (i.e., within the frequency range associated with human hearing, such as between 20-20,000 Hz). In some implementations, acoustic signal 801 may be included as part of the sound (e.g., 3D sound 108, flattened sound 320, etc.) that communication system 102-1 is presenting using the loudspeakers 106.
  • communication system 102-1 may determine whether a predefined threshold associated with the characteristic is satisfied. More particularly, in this example, the characteristic may include a time-of-flight of acoustic signal 801 as received by headphone device 118 and the predefined threshold may include a time-of-flight threshold configured to be satisfied when headphone device 118 is outside reference listening area 110.
  • communication system 102-1 and headphone device 118 may be synchronized and pulses of acoustic signal 801 may be emitted according to a schedule known to both communication system 102-1 and headphone device 118 (e.g., a pulse every 25 ms, a pulse every 100 ms, etc.). Based on an assumption that a pulse of acoustic signal 801 was emitted at a prescheduled time, an estimate of the position of headphone device 118 with respect to the speaker emitting acoustic signal 801 may be computed based on when headphone device 118 receives the pulse. The time between the emission and receipt of acoustic signal 801 is the time-of-flight characteristic. In a similar implementation, the headphone device could send the pulses or immediately repeat the pulses (for transmission to and receipt by communication system 102-1) so that the time-of-flight characteristic could be determined in other ways.
  • FIG. 8A illustrates a distance graph beneath communication system 102-1 and user 112-1 that shows, along one dimension, how acoustic signal 801 may propagate from a starting position 802 corresponding to communication system 102-1 to an ending position 804 corresponding to headphone device 118.
  • the distance graph is also shown to double as a timeline (“Distance/Time”), since the time of flight of acoustic signal 801 from starting position 802 to ending position 804 is directly related to the distance between them.
  • starting position 802 may also represent a first time when acoustic signal 801 is emitted while ending position 804 may also represent a second time when acoustic signal 801 is received by headphone device 118.
  • additional positions 806-1 and 806-2 are also shown to represent threshold times that, together, may define a time-of-flight threshold 808.
  • If acoustic signal 801 is received within time-of-flight threshold 808 (i.e., between the threshold times represented by positions 806-1 and 806-2), the threshold may be considered to not be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 is located within reference listening area 110.
  • Conversely, if acoustic signal 801 is received outside time-of-flight threshold 808, the threshold may be considered to be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 has left reference listening area 110.
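A minimal Python sketch of this time-of-flight check follows, assuming synchronized clocks, a known pulse schedule, and flight times shorter than one pulse period; the constants and helper names are assumptions made for illustration.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound at room temperature
PULSE_PERIOD_S = 0.025      # assumed schedule: one pulse every 25 ms

def time_of_flight_s(receive_time_s: float) -> float:
    """Flight time of the most recently received pulse, assuming pulses are
    emitted at exact multiples of PULSE_PERIOD_S on a clock synchronized
    with the receiver, and that flight times never exceed one period."""
    emit_time_s = (receive_time_s // PULSE_PERIOD_S) * PULSE_PERIOD_S
    return receive_time_s - emit_time_s

def tof_threshold_satisfied(receive_time_s: float,
                            tof_min_s: float, tof_max_s: float) -> bool:
    """True when time-of-flight threshold 808 is satisfied, i.e., when the
    measured flight time falls outside the window bounded by the threshold
    times at positions 806-1 and 806-2, suggesting that headphone device
    118 has left reference listening area 110."""
    tof = time_of_flight_s(receive_time_s)
    return not (tof_min_s <= tof <= tof_max_s)
```

For scale, sound travels roughly 0.34 m per millisecond, so a threshold window a few milliseconds wide corresponds to a listening area on the order of a meter deep.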
  • detection technique 800-B shown in FIG. 8B is illustrated by again showing acoustic signal 801 emitted from communication system 102-1 to be received by headphone device 118 as worn by user 112-1.
  • detection technique 800-B shows an example in which the characteristic being analyzed includes an amplitude of acoustic signal 801 as received by headphone device 118, and in which the predefined threshold includes an amplitude threshold configured to be satisfied when headphone device 118 is outside reference listening area 110.
  • communication system 102-1 may emit acoustic signal 801 (e.g., as a series of pulses/chirps or in another form) at a predetermined amplitude (e.g., volume, sound intensity level, etc.), such that the amplitude of the signal as received by headphone device 118 may allow an estimate of the position of headphone device 118 with respect to communication system 102-1 to be computed based on the detected amplitude and a known rate of decay of the signal over distance (e.g., stored in a lookup table, computed based on physical principles, etc.).
  • FIG. 8B illustrates a distance vs. amplitude graph beneath communication system 102-1 and user 112-1 that shows, along one dimension, how an amplitude 810 of acoustic signal 801 may decay from a starting position 812 corresponding to communication system 102-1 (where the signal is emitted) to an ending position 814 corresponding to headphone device 118 (where the signal is received).
  • At starting position 812, amplitude 810 of acoustic signal 801 is shown to be relatively high, but, as acoustic signal 801 propagates through the scene to eventually be detected at headphone device 118, the graph shows that amplitude 810 decreases at a known rate.
  • If amplitude 810 is measured to be between the amplitudes at threshold distances 816-1 and 816-2, or, in other words, if the measured amplitude is within an amplitude threshold 818 range, communication system 102-1 may determine that it is likely that headphone device 118 is located within reference listening area 110.
  • Conversely, if the measured amplitude falls outside the amplitude threshold 818 range, the threshold may be considered to be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 has left reference listening area 110.
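The amplitude check can be sketched in the same spirit. The version below (illustrative only) converts the received level to an implied distance by assuming free-field spherical spreading of about 6 dB per doubling of distance; a lookup table calibrated for the actual room could replace this closed-form model, and the emission-level constant is an assumed value.

```python
EMITTED_LEVEL_DB_AT_1M = 60.0  # assumed calibrated emission level (dB at 1 m)

def implied_distance_m(received_level_db: float) -> float:
    """Distance implied by the received level under free-field spherical
    spreading, where level falls off as 20*log10(distance / 1 m)."""
    level_drop_db = EMITTED_LEVEL_DB_AT_1M - received_level_db
    return 10.0 ** (level_drop_db / 20.0)

def amplitude_threshold_satisfied(received_level_db: float,
                                  d_min_m: float, d_max_m: float) -> bool:
    """True when amplitude threshold 818 is satisfied, i.e., when the implied
    distance falls outside the range spanned by threshold distances 816-1
    and 816-2, suggesting that headphone device 118 has left the area."""
    d = implied_distance_m(received_level_db)
    return not (d_min_m <= d <= d_max_m)
```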
  • Detection technique 800-A and detection technique 800-B may be efficient and advantageous techniques for detecting whether a user is located within a reference listening area, since emitting the acoustic signal and detecting and analyzing it in the ways that have been described may be relatively straightforward and inexpensive operations (e.g., in terms of computing, in terms of necessary equipment, etc.). Additionally, in the event that communication system 102-1 is implemented by an audio-only communication device or otherwise does not include cameras or video equipment, detection techniques 800-A and 800-B may be advantageous since they do not rely on video data or image analysis.
  • In other examples, a communication device may be implemented with video support (such as described above for communication systems 102-1 and 102-2), such that a plurality of cameras, computation associated with image capture and object tracking of users at the scene, and other such resources are already available for other purposes.
  • FIG. 8C shows a detection technique 800-C that uses a plurality of cameras 820 included within (e.g., integrated with) communication system 102-1 to capture imagery of the scene, including images of user 112-1 as the user moves about the scene.
  • the detecting that user 112-1 leaves reference listening area 110 may include: 1) tracking a location of user 112-1 based on video captured by cameras 820 of communication system 102-1; and 2) determining, based on the tracking, that the location of user 112-1 is outside reference listening area 110.
  • that tracked location may be compared to a predetermined region associated with the reference listening area to determine whether the location is inside or outside the reference listening area.
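As a rough illustration of this comparison, the Python sketch below models the reference listening area as an axis-aligned floor rectangle and debounces the decision over several frames of tracked positions; the region shape, the debounce count, and all names are assumptions rather than requirements of detection technique 800-C.

```python
from dataclasses import dataclass

@dataclass
class FloorRect:
    """Axis-aligned floor-plan rectangle approximating reference listening area 110."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def user_left_area(recent_positions, area: FloorRect,
                   debounce_frames: int = 5) -> bool:
    """Declare the user outside the area only after several consecutive
    out-of-area estimates, so single-frame tracking noise from the cameras
    does not trigger a spurious transition.

    recent_positions: (x, y) location estimates, newest last, e.g., derived
    from per-frame person detections in the captured video.
    """
    last = list(recent_positions)[-debounce_frames:]
    return (len(last) == debounce_frames and
            all(not area.contains(x, y) for x, y in last))
```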
  • the detecting that user 112-1 leaves reference listening area 110 may be performed as a multi-modal detection based on any combination of detection techniques 800-A, 800-B, 800-C, and/or other suitable detection techniques such as determining the headphone location using a non-acoustic wireless signal (e.g., a WiFi signal, a Bluetooth signal, an ultra-wideband signal, etc.).
  • the multi-modal detection could include both: 1) the tracking of the location of user 112-1 based on the video of detection technique 800-C; and 2) at least one other indication that user 112-1 is outside reference listening area 110 (e.g., indications derived from the time-of-flight and/or amplitude characteristics described in relation to detection techniques 800-A and 800-B).
  • the multi-modal detection may not involve video, but may instead be based on a combination of acoustic signal characteristics such as the time-of-flight and/or amplitude characteristics described in relation to detection techniques 800-A and 800-B.
  • the multi-modal detection could be based on a weighted combination of all of these and/or other factors.
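Such a weighted combination might look like the following minimal sketch, in which each modality contributes a boolean vote and an assumed weight; the modality names and the 0.5 decision threshold are illustrative choices.

```python
def multimodal_user_outside(votes: dict, weights: dict,
                            decision_threshold: float = 0.5) -> bool:
    """Weighted vote across detection modalities (e.g., 'tof', 'amplitude',
    'video', 'uwb'). Each vote is True when that modality indicates the user
    is outside the reference listening area; weights are assumed to sum to 1.
    """
    score = sum(weights[name] for name, vote in votes.items() if vote)
    return score >= decision_threshold
```

Replacing the boolean votes with per-modality confidence values in [0, 1] would give a softer fusion that degrades gracefully when any single modality is noisy.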
  • various methods and processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer- readable medium and executable by one or more computing devices.
  • a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.), and executes those instructions, thereby performing one or more operations such as the operations described herein.
  • Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
  • a computer-readable medium includes any non-transitory medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer).
  • a medium may take many forms, including, but not limited to, non-volatile media, and/or volatile media.
  • Non-volatile media may include, for example, optical or magnetic disks and other persistent memory.
  • Volatile media may include, for example, dynamic random-access memory (DRAM), which typically constitutes a main memory.
  • Computer-readable media include, for example, a disk, hard disk, magnetic tape, any other magnetic medium, a compact disc read-only memory (CD-ROM), a digital video disc (DVD), any other optical medium, random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), FLASH-EEPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
  • FIG. 9 shows an illustrative computing system 900 that may be used to implement various devices and/or systems described herein.
  • computing system 900 may include or implement (or partially implement) communication systems such as communication system 102, any implementations thereof, any components thereof, and/or other devices used therewith.
  • computing system 900 may include a communication interface 902, a processor 904, a storage device 906, and an input/output (I/O) module 908 communicatively connected via a communication infrastructure 910. While an illustrative computing system 900 is shown in FIG. 9, the components illustrated in FIG. 9 are not intended to be limiting. Additional or alternative components may be used in other embodiments. Components of computing system 900 shown in FIG. 9 will now be described in additional detail.
  • Communication interface 902 may be configured to communicate with one or more computing devices.
  • Examples of communication interface 902 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
  • Processor 904 generally represents any type or form of processing unit capable of processing data or interpreting, executing, and/or directing execution of one or more of the instructions, processes, and/or operations described herein. Processor 904 may direct execution of operations in accordance with one or more applications 912 or other computer-executable instructions such as may be stored in storage device 906 or another computer-readable medium.
  • Storage device 906 may include one or more data storage media, devices, or configurations and may employ any type, form, and combination of data storage media and/or device.
  • storage device 906 may include, but is not limited to, a hard drive, network drive, flash drive, magnetic disc, optical disc, RAM, dynamic RAM, other non-volatile and/or volatile data storage units, or a combination or sub-combination thereof.
  • Electronic data, including data described herein, may be temporarily and/or permanently stored in storage device 906.
  • data representative of one or more executable applications 912 configured to direct processor 904 to perform any of the operations described herein may be stored within storage device 906.
  • data may be arranged in one or more databases residing within storage device 906.
  • I/O module 908 may include one or more I/O modules configured to receive user input and provide user output. One or more I/O modules may be used to receive input for a single virtual experience. I/O module 908 may include any hardware, firmware, software, or combination thereof supportive of input and output capabilities. For example, I/O module 908 may include hardware and/or software for capturing user input, including, but not limited to, a keyboard or keypad, a touchscreen component (e.g., touchscreen display), a receiver (e.g., an RF or infrared receiver), motion sensors, and/or one or more input buttons.
  • I/O module 908 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
  • I/O module 908 is configured to provide graphical data to a display for presentation to a user.
  • the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
  • Example 1 A method comprising: receiving, by a first communication system at a first scene from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • Example 2 The method of any of the preceding examples, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
  • Example 3 The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area includes: emitting an acoustic signal from the first communication system to the headphone device; and determining, based on a characteristic of the acoustic signal as received by the headphone device, that a predefined threshold associated with the characteristic is satisfied.
  • Example 4 The method of any of the preceding examples, wherein: the characteristic includes an amplitude of the acoustic signal as received by the headphone device; and the predefined threshold includes an amplitude threshold configured to be satisfied when the headphone device is outside the reference listening area.
  • Example 5 The method of any of the preceding examples, wherein: the characteristic includes a time-of-flight of the acoustic signal as received by the headphone device; and the predefined threshold includes a time-of-flight threshold configured to be satisfied when the headphone device is outside the reference listening area.
  • Example 6 The method of any of the preceding examples, wherein the acoustic signal is an inaudible signal emitted outside of a frequency range associated with human hearing capability.
  • Example 7 The method of any of the preceding examples, wherein the presenting the 3D sound field using the headphone device is performed based on head movement tracking of the user by an inertial measurement unit within the headphone device.
  • Example 8 The method of any of the preceding examples, further comprising: determining, during the presenting of the 3D sound field and prior to the user leaving the reference listening area, that the user is near a boundary of the reference listening area; and in response to the determining that the user is near the boundary, presenting the 3D sound field using both the array of loudspeakers and the headphone device worn by the user.
  • Example 9 The method of any of the preceding examples, further comprising: determining, subsequent to the ceasing presenting the 3D sound field using the array of loudspeakers, that the user reenters the reference listening area; and in response to the determining that the user reenters the reference listening area, ceasing presenting the 3D sound field using the headphone device and again presenting the 3D sound field using the array of loudspeakers.
  • Example 10 The method of any of the preceding examples, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured at the second scene, the flattened sound field presented using the array of loudspeakers.
  • Example 11 The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area includes: tracking a location of the user based on video captured by a camera of the first communication system; and determining, based on the tracking, that the location of the user is outside the reference listening area.
  • Example 12 The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area is performed as a multi-modal detection based on: the tracking of the location of the user based on the video; and at least one other indication that the user is outside the reference listening area.
  • Example 13 The method of any of the preceding examples, wherein the first communication system and the second communication system are video communication systems, the first communication system including a display screen configured to present video captured at the second scene by a camera of the second communication system.
  • Example 14 The method of any of the preceding examples, wherein the display screen and the array of loudspeakers of the first communication system are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
  • Example 15 A communication system comprising: a display screen; an array of loudspeakers configured to present 3D sound configured for a reference listening area from where a user views the display screen; and a processor communicatively coupled to the array of loudspeakers and configured to perform a process comprising: receiving, from an additional communication system, an audio data stream representing a 3D sound field captured by an array of microphones of the additional communication system; presenting, using the array of loudspeakers and based on the audio data stream, the 3D sound field to the user as the user is located within the reference listening area; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • Example 16 The communication system of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
  • Example 17 The communication system of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured by the array of microphones of the additional communication system, the flattened sound field presented using the array of loudspeakers.
  • Example 18 The communication system of any of the preceding examples, wherein: the communication system is a first video communication system at a first scene and the additional communication system is a second video communication system at a second scene; the display screen is configured to present video captured at the second scene by a camera of the second video communication system; and the display screen and the array of loudspeakers are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
  • Example 19 A non-transitory computer-readable medium storing instructions that, when executed, cause a processor of a first communication system at a first scene to perform a process comprising: receiving, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
  • Example 20 The non-transitory computer-readable medium of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
  • Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • spatially relative terms such as “beneath,” “below,” “lower,” “above,” “upper,” and the like, may be used herein for ease of description to describe one element or feature in relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as “below” or “beneath” other elements or features would then be oriented “above” the other elements or features. Thus, the term “below” can encompass both an orientation of above and below.
  • the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein may be interpreted accordingly.
  • Unless otherwise defined, the terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which these concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
  • a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user's social network, social actions, or activities, profession, a user's preferences, or a user's current location), and if the user is sent content or communications from a server.
  • certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed.
  • a user's identity may be treated so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, zip code, or state level), so that a particular location of a user cannot be determined.
  • the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Stereophonic System (AREA)

Abstract

Methods and systems for immersive sound field communication using a loudspeaker array are described herein. In one implementation, a first communication system at a first scene receives, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system. Based on the audio data stream, the first communication system uses an array of loudspeakers to present the 3D sound field to a user located within a reference listening area at the first scene. During the presenting of the 3D sound field, the first communication system detects that the user leaves the reference listening area, and, in response to this detecting and while the user remains outside the reference listening area, ceases presenting the 3D sound field using the array of loudspeakers. Corresponding methods, systems, and media are also disclosed.

Description

IMMERSIVE SOUND FIELD COMMUNICATION USING A LOUDSPEAKER ARRAY
BACKGROUND
[0001] Remote communication technologies have long provided ways for people separated by distance to communicate with one another in real time. For example, the telephone provided a form of real-time audio communication between remotely separated users for decades before other more sophisticated technologies (e.g., cellular technologies, Voice-over-Internet-Protocol (VoIP) technologies, etc.) were developed to let people communicate with more flexibility, lower cost, higher quality of service, and so forth. More recently, even more immersive communication technologies have been deployed to mimic in- person interactions in various ways as people communicate over long distances.
SUMMARY
[0002] One way that remote communication has been made more immersive in recent years is by providing audio communication in which an entire three-dimensional (3D) sound field at a scene is captured, transmitted, and rendered (e.g., as opposed to a more basic mono or stereo capture typical of conventional communication systems). Communication systems described herein are associated with arrays of loudspeakers that can be used to present such 3D sound fields, though the immersive effects of 3D sound field presentations may only be experienced when the listening user is located in a particular physical location with respect to the loudspeaker array. If the user leaves this location, not only may certain 3D effects be lost, but certain artifacts of the 3D sound presentation may be perceivable that the user could find distracting and unrealistic, such that the sound experienced by the user outside of the designated area may be less immersive than a flat (non-3D) rendering of the sound would be. Accordingly, methods and systems described herein are configured to detect whether a user is located where the array of loudspeakers can properly present the 3D sound field, or, if the user is not so located, to direct the presentation of the audio in ways that preserve the immersiveness as much as possible given the circumstances. For instance, if the user is wearing a headphone device, the 3D sound field presentation may be transitioned from being presented on the loudspeaker array to being presented on the headphone device to maintain the immersive 3D sound experience. As another example, if the user is not wearing such a headphone device, the sound presentation may be flattened so that, while somewhat less immersive, there are at least no audible artifacts or other distractions while the user remains outside of the designated listening area.
[0003] To this end, one implementation described herein involves a method that may be performed by a first communication system at a first scene. This example method may include, for instance: 1) receiving, by the first communication system at the first scene from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; 2) presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers. For example, as will be described in more detail below, the ceasing presenting the 3D sound field using the array of loudspeakers may coincide with the array of loudspeakers instead presenting a flattened sound field (i.e., the sound captured at the second scene, but without imposing the 3D effects) or with the 3D sound field instead being presented by a headphone device worn by the user.
[0004] While this implementation is described from the perspective of the first communication system, it will be understood that the second communication system may perform similar functions at the same time. For example, the second communication system may also receive an audio data stream representing a 3D sound field captured at the first scene by an array of microphones of the first communication system, present this 3D sound field to a user at the second scene, and adjust the presentation in accordance with the dynamic location of the second user in a similar manner as described herein for the first user.
[0005] Another example implementation described herein involves a communication system (e.g., a video communication system) configured to perform the operations described above. For example, the communication system may include a display screen, an array of loudspeakers configured to present 3D sound configured for a reference listening area from where a user views the display screen, and a processor. The processor may be communicatively coupled to the array of loudspeakers and may be configured to perform a process that includes: 1) receiving, from an additional communication system, an audio data stream representing a 3D sound field captured by an array of microphones of the additional communication system; 2) presenting, using the array of loudspeakers and based on the audio data stream, the 3D sound field to the user as the user is located within the reference listening area; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers. Here again, as the communication system ceases presenting the 3D sound field using the array of loudspeakers, it may perform other actions to continue presenting sound captured by the additional communication system in as immersive a way as possible. For instance, as mentioned above, the array of loudspeakers could present a flattened sound field or the 3D sound field could continue being presented by a headphone device worn by the user as they are outside of the reference listening area.
[0006] Yet another example implementation described herein involves a non- transitory computer-readable medium storing instructions that, when executed by a processor of a first communication system at a first scene, cause the processor of the first communication system to perform a process. For example, the process may include: 1) receiving, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; 2) presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; 3) detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and 4) in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers. Other actions may also be performed such that sound captured by the additional communication system can still be presented in an immersive way while the user is outside the reference listening area. For instance, as has been mentioned, the array of loudspeakers could present a flattened sound field or the 3D sound field could continue being presented by a headphone device worn by the user.
[0007] Various additional actions and operations may be added to these processes and methods as may serve a particular implementation, examples of which will be described in more detail below. Additionally, it will be understood that each of the processes and operations described as being performed by different types of implementations in the examples above (e.g., the non-transitory computer readable medium, the method, the communication system, etc.) may additionally or alternatively be performed by other types of implementations as well. For example, a process described above as being included in a computer readable medium could be performed as a method or could be performed by the processor of the communication system. Similarly, the method set forth above could be encoded in instructions stored by a computer readable medium or stored within the memory of the communication system, and so forth.
[0008] The details of these and other implementations are set forth in the accompanying drawings and the description below. Other features will also be made apparent from the following description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 shows certain aspects of an illustrative implementation of immersive sound field communication using a loudspeaker array configured to present a 3D sound field for a user located in a reference listening area in accordance with principles described herein.
[0010] FIG. 2 shows illustrative aspects of a communication session involving a user who is wearing a headphone device and leaves a reference listening area in accordance with principles described herein.
[0011] FIG. 3 shows illustrative aspects of an additional implementation of immersive sound field communication using a loudspeaker array in accordance with principles described herein.
[0012] FIG. 4 shows illustrative aspects of a communication session involving a user who is not wearing a headphone device and leaves a reference listening area in accordance with principles described herein.
[0013] FIG. 5 shows an illustrative method for immersive sound field communication using a loudspeaker array in accordance with principles described herein.
[0014] FIG. 6 shows illustrative aspects of a transition between a 3D sound presentation on a loudspeaker array and on a headphone device in accordance with principles described herein.
[0015] FIG. 7 shows illustrative aspects of a transition between a 3D sound presentation on a headphone device and on a loudspeaker array in accordance with principles described herein.
[0016] FIGS. 8A-8C show illustrative aspects of how a communication system may detect when a user is located within a reference listening area and when the user leaves the reference listening area in accordance with principles described herein.
[0017] FIG. 9 shows an illustrative computing system that may be used to implement various devices and/or systems described herein.
DETAILED DESCRIPTION
[0018] Methods and systems are described herein for immersive sound field communication using a loudspeaker array. In-person communication between people who are physically present in a shared space and free to communicate verbally and non-verbally (e.g., using hand gestures, intonation, etc.) in real time may be considered a gold standard of human communication. While certain communication technologies may be desirable as a direct result of convenient ways that they deviate from this standard (e.g., text messaging providing a non-real-time means of written communication; audio-only phone calls providing a verbal-only means of communication in which people do not need to worry about how they look or where they are located; etc.), other communication technologies may attempt to replicate this gold standard as nearly as possible so as to serve as a substitute for in-person, face-to-face communication (e.g., for when actual in-person communication is not possible or convenient).
[0019] Real-time video communication provides one example of a communication technology that has the potential to be far more immersive and realistic than conventional audio-only communication. This is true for video communication carried out on various types of communication systems including, for example, smartphones, tablets, laptops, etc., and is particularly true for emerging three-dimensional (3D) communication technologies (sometimes referred to by other terms such as teleportation, telepresence, holoportation, spatial conferencing, etc.). 3D communication technologies provide even more immersive and realistic remote communication experiences by using 3D capture and projection techniques to create, in some examples, life-sized, 3D images of people and environments in real-time.
[0020] Audio is an important part of any communication experience, but the immersive illusion of in-person communication that may otherwise be provided by video communication (and especially by 3D video communication) tends to decline when sound produced during a remote conversation conspicuously fails to mimic the behavior that real-world sound would exhibit if the communication were actually in person. The realistic reproduction of sound is thus a technical problem that video communication systems must address, with the objective generally being for reproduced sound to simulate real-world sound in quality (e.g., with high fidelity), in time (e.g., with minimal latency), and in space (e.g., simulating a 3D sound field that seems to originate from locations in space that correspond to where the sound source is visually perceived).
[0021] A common situation that occurs with communication systems tends to exacerbate and emphasize the technical problem of localized sound simulation described above. This situation involves one participant in a communication session moving outside of a reference listening area where an array of loudspeakers may be focused and configured to reproduce a 3D sound field. For video communications sessions carried out using 3D video communication systems, this reference listening area may often be associated with (e.g., fully or partially overlapping or coinciding with) a location in front of a display screen wherein the participant (hereafter referred to as a user of the communication system) may be located to view the display screen and interact with a 3D representation of another user presented thereon.
[0022] An array of loudspeakers could be integrated with the display screen or placed near the display screen or around the room in such a way that the communication system may use the array of loudspeakers to present a 3D sound field captured by a plurality of microphones at the remote scene. For example, the loudspeakers may apply various effects and principles such as interaural level differences (ILDs in which one ear hears a sound at a higher volume due to the sound source being closer to that ear and partially blocked by the head for the other ear), interaural time differences (ITDs in which one ear hears a sound at a slightly earlier time due to the sound source being closer to that ear so that sound takes longer to propagate to the other ear), head-related transfer functions (representing the head pose of the user and how it would affect 3D sound propagation), and so forth.
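As illustrative background for the magnitudes of such interaural time differences (and not a feature of the systems described herein), the classic Woodworth spherical-head approximation relates ITD to source azimuth; the head radius used in this Python sketch is an assumed population average.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average adult head radius
SPEED_OF_SOUND_M_S = 343.0   # approximate speed of sound in air

def woodworth_itd_s(azimuth_rad: float) -> float:
    """Woodworth spherical-head estimate of the interaural time difference
    for a distant source at the given azimuth (0 = straight ahead; the
    approximation is intended for 0 <= azimuth <= pi/2)."""
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (
        azimuth_rad + math.sin(azimuth_rad))
```

At an azimuth of 90 degrees this yields roughly 0.66 ms, consistent with the commonly cited maximum human ITD.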
[0023] While all these and other types of effects may be applied to simulate a 3D sound field with a high degree of realism, certain assumptions may be made by the communication system in order to properly drive the array of loudspeakers to impose these effects and produce this immersive experience. In particular, the array of loudspeakers may be disposed in static locations with respect to the display screen and assumptions about where the user will be located during a given communication session may be integrated into the algorithms for reproducing the 3D sound field using the loudspeaker array. The physical location within the scene where proper spatialization of the 3D sound field can be produced by the loudspeaker array is referred to herein as a reference listening area associated with the loudspeaker array. Just as the loudspeakers themselves may be disposed at static locations within the scene, the reference listening area, too, may be static with respect to the scene, or at least may be static with respect to the communication system and the loudspeakers integrated therein (e.g., such that the reference listening area moves with the communication system and the speaker array).
[0024] As long as the user is positioned within the reference listening area for which a loudspeaker array is configured, the communication system controlling that loudspeaker array may be able to reproduce an immersive 3D sound field captured by an array of microphones at the other scene being communicated with. A technical problem may arise, however, if the user moves out of the reference listening area for some reason. For example, during the course of a conversation, the user could temporarily step away from the reference listening area to retrieve an object that they wish to show to the other user they are communicating with. In this case, as the user leaves the reference listening area, the user may perceive not only that the 3D sound field no longer includes the immersive details that convincingly simulated the sound originating from the direction of specific sound sources, the user may further perceive distracting artifacts related to the presentation of the 3D sound field in the reference listening area. These artifacts may be an aural analog, for example, to what happens visually when anamorphic art (e.g., a sidewalk painting of an object that is designed to look three dimensional from a specific vantage point or angle) is viewed from an unintended angle that not only causes the 3D effect to disappear, but also reveals the object to be highly distorted in order to achieve the 3D effect at the particular focus point.
[0025] Methods and systems described herein for immersive sound field communication using a loudspeaker array present technical solutions to this technical problem of immersive 3D sound simulation by a loudspeaker array being diminished, and even becoming an undesirable distraction, when the listening user is not located in accordance with constructive and destructive summing assumptions built into the system (e.g., such as when the user leaves the reference listening area). First, methods and systems described herein may be configured to recognize the issue when it occurs. For example, a first communication system at a first scene may receive an audio data stream from a second communication system at a second scene, the audio data stream representing a 3D sound field captured at the second scene by an array of microphones at the second scene. Using an array of loudspeakers, the first communication system may present the 3D sound field, based on the audio data stream, to a user located within a reference listening area at the first scene.
[0026] Should the user leave the reference listening area during this presentation, however, the first communication system may detect this event and respond accordingly to mitigate the technical problem described above and to otherwise direct the presentation of the audio in ways that help preserve the immersiveness as much as possible given the circumstances. Various such circumstances and ways of handling the situation will be described and illustrated herein. As one example, it could be determined that the user is wearing a headphone device (e.g., wireless earbuds, etc.) and, consequently, the communication system may cease presenting the 3D sound field using the array of loudspeakers and instead transition to presenting the 3D sound field using the headphone device. As another example, it could be determined that the user has no such headphone device and, consequently, the communication system may cease presenting the 3D sound field and instead use the array of loudspeakers to present a flattened, non-3D sound field (i.e., without the various 3D effects that lead to the undesirable artifacts for listeners outside the reference listening area).
[0027] In either of these examples or others described herein, the technical effect of implementing one of these technical solutions is that the sound field will continuously be presented to the user in a manner that, from the user’s perspective, sounds immersive and true-to-life and is devoid of distracting artifacts even when the user is listening from an unintended location. In examples where the user is wearing a headphone device, the technical effect may be a continuously immersive sound presentation that transitions seamlessly from the loudspeakers to the headphone device while always simulating the 3D sound field according to the pose and/or dynamic movements of the user (e.g., based on the 3 degrees of freedom (3DOF) of the user’s head orientation, based on the 6 degrees of freedom (6DOF) of the user’s head position and orientation, based on a head-related transfer function (HRTF) computed for the user’s head, etc.). Even in examples where such a headphone device is not available or being worn by the user, the technical effect may be to simulate the sound with immersive 3D spatialization when the user’s position allows it, while still presenting a version of the sound that, while less immersive, at least avoids distracting artifacts when the user’s position (outside of the reference listening area) does not allow for the 3D reproduction of the sound field.
[0028] In either situation, the user may enjoy a high-quality communication session even while moving about freely (without being anchored to the reference listening area). In or out of the reference listening area, the user may be presented with immersive sound that supports the sensation of natural, in-person communication without disruptive reminders that the communication is actually being mediated by technology.
[0029] Various implementations will now be described in more detail with reference to the figures. It will be understood that particular implementations described below are provided as non-limiting examples and may be applied in various situations. Additionally, it will be understood that other implementations not explicitly described herein may also fall within the scope of the claims set forth below. Systems and methods described herein for immersive sound field communication using a loudspeaker array may result in any or all of the technical effects mentioned above, as well as various additional effects and benefits that will be described and/or made apparent below.
[0030] FIG. 1 shows certain aspects of an illustrative implementation of immersive sound field communication using a loudspeaker array configured to present a 3D sound field for a user located in a reference listening area in accordance with principles described herein. More particularly, the implementation of FIG. 1 is shown at two moments in time labeled as moment 100-A (on the left-hand side of FIG. 1) and as moment 100-B (on the right-hand side of FIG. 1).
[0031] Referring first to moment 100-A, the implementation of FIG. 1 is shown to comprise a communication system 102. In this example, communication system 102 is shown to be implemented as a video communication system that includes a display screen 104 and an array of loudspeakers 106. As will be described in more detail below, communication system 102 may be communicatively coupled to a second communication system (e.g., another video communication system, not explicitly shown in FIG. 1) at a second scene, and display screen 104 may be configured to present video captured at the second scene by one or more cameras of the second communication system (or, in some examples, to present 3D representations constructed based on such video).
[0032] The loudspeakers 106 of the array of loudspeakers will be understood to be integrated within display screen 104 at each corner of the screen in this example. Thus, each of the loudspeakers 106 of the array is illustrated in FIG. 1 as a dashed-line circle disposed in a particular corner of display screen 104 of communication system 102. The array of loudspeakers 106 may be configured to present 3D sound 108 (illustrated as sound waves that originate from loudspeakers 106 to propagate outward into the scene) configured for a reference listening area 110 from where a user 112-1 views display screen 104. While four coplanar speakers are used in this example to form a 2D array of speakers that is essentially coplanar with the display screen 104, it will be understood that other examples may include other types of arrangements for the speakers. For instance, two collinear speakers may form a 1D array of speakers, three collinear speakers may form a 1D array or three coplanar (but not collinear) speakers may form a 2D array, or four or more speakers may form 1D, 2D, or 3D arrays as may serve a particular implementation.
[0033] In this example, user 112-1 and communication system 102 are shown to be located at a first scene 114-1, and user 112-1 is shown to be using communication system 102 to engage in a communication session with a second user 112-2 that is using a similar communication system at a second scene 114-2 (e.g., a scene that is remote from scene 114-1 where user 112-1 and the first communication system 102 are located). On both sides of the communication session, the communication systems may be relatively large so as to present life-sized representations (e.g., 3D representations in certain examples) of people and objects in scenes 114-1 and 114-2. As such, FIG. 1 shows that display screen 104 and the array of loudspeakers 106 of communication system 102 may be disposed at static locations (e.g., permanent or semi-permanent locations) within scene 114-1. Accordingly, reference listening area 110, too, may, in this example, be a static area within scene 114-1 from where user 112-1 views display screen 104. The reference listening area 110 is an area located at a predetermined position relative to the display screen 104 (and/or relative to the array of loudspeakers 106). It will be understood that other types of communication systems (e.g., audio-only communication systems without display screens, more portable video communication systems with smaller display screens that are not configured to present life-sized representations, etc.) may be more readily moved and repositioned such that reference listening area 110 may be more dynamic in these examples.
[0034] During the communication session illustrated in FIG. 1, communication system 102 may be configured to serve as a virtual portal to allow users 112-1 and 112-2 to see and hear one another in a manner that simulates in-person interaction. As has been mentioned, one aspect contributing to the immersive sense of realism offered by this type of communication system is that a spatialized 3D sound field on each side of the communicative link may be captured, transmitted, and presented in a manner that reproduces various 3D effects in the audio heard by the users 112-1 and 112-2. For example, as will be described in more detail below, an array of microphones associated with the second communication system may capture a 3D sound field at scene 114-2 while user 112-2 is speaking from a location at the scene. Communication system 102 may then receive and present that 3D sound field as 3D sound 108 using the array of loudspeakers 106.
[0035] As has been mentioned, various 3D effects may be produced by an array of loudspeakers such as loudspeakers 106. For example, 3D sound 108 may be steered and otherwise simulated to originate from a desirable virtual location, such as a virtual location corresponding to where user 112-1 sees the sound source of user 112-2 in the example where user 112-2 is speaking. This 3D sound steering and reproduction may be implemented in any suitable way. For instance, loudspeakers 106 may be configured to present sounds using phase-delayed transmissions to cause constructive and/or destructive summing of the sound at relevant points near the left and right ears of user 112-1. In this way, realistic interaural level and time differences (ILDs, ITDs, etc.) may be reproduced, appropriate reverb and/or other such effects may be introduced that user 112-1 would experience if user 112-2 were physically present at the location presented on display screen 104, and so forth.
[0036] Introducing these 3D effects, however, requires certain assumptions about where user 112-1 is located, which direction they are facing, and so forth. For example, 3D sound 108 may only properly sum (in constructive and/or destructive ways) to create the intended steering effect when user 112-1 is located at a particular focal point, or at least within a limited area near that focal point. This area is represented in FIG. 1 by reference listening area 110. Within reference listening area 110, the 3D sound 108 produced by the array of loudspeakers 106 may constructively and destructively sum in ways that properly steer the sound and recreate the 3D sound field that was captured at second scene 114-2. A consequence of this approach, however, is that away from the predesignated focal point (i.e., outside of reference listening area 110), 3D sound 108 may sound wrong (e.g., unrealistic, unnatural, artificial, etc.) due to undesirable artifacts of the 3D effects introduced based on the assumption that the listener is within reference listening area 110.
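By way of illustration only, the following minimal Python sketch shows one way the phase-delayed transmissions described above could be computed, under a simplifying free-field assumption (delay-and-sum focusing toward a single focal point). The function name focusing_delays and the geometry values are hypothetical, not drawn from the disclosure; a production system would also account for per-speaker gains, room reflections, and frequency-dependent effects.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air at room temperature

    def focusing_delays(speaker_positions, focal_point):
        # Distance from each loudspeaker to the intended focal point (meters).
        distances = np.linalg.norm(speaker_positions - focal_point, axis=1)
        # Delay the nearer speakers so that all wavefronts arrive at the focal
        # point simultaneously; the emitted signals then sum constructively at
        # (and near) that point, i.e., within the reference listening area.
        return (distances.max() - distances) / SPEED_OF_SOUND

    # Hypothetical example: four speakers at the corners of a display screen,
    # with a focal point 1.5 m in front of the screen's center.
    speakers = np.array([[-0.6, 0.0, 0.0], [0.6, 0.0, 0.0],
                         [-0.6, 1.0, 0.0], [0.6, 1.0, 0.0]])
    print(focusing_delays(speakers, np.array([0.0, 0.5, 1.5])))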
[0037] In FIG. 1, 3D sound 108 is illustrated (e.g., by dotted lines extending from loudspeakers 106 to reference listening area 110) as being produced by loudspeakers 106 and propagating to the reference listening area where the sound can be experienced by user 112-1 while user 112-1 is located there. It will be understood that sound emanating from loudspeakers 106 may also be audible elsewhere within scene 114-1, but the sound will only have the desired 3D effects at or near that focal point encompassed by reference listening area 110. Loudspeakers 106 may be positioned, oriented, and configured to operate so as to implement the reference listening area 110 at a desirable point in space. For example, for a video communication system such as communication system 102, the system design would typically be such that reference listening area 110 coincides with a natural place that user 112-1 may wish to be during a communication session in which the user is viewing display screen 104.
[0038] Regardless of how naturally located reference listening area 110 is, however, there may still be reasons for user 112-1 to at least occasionally leave reference listening area 110. For these occasions, implementations described herein allow communication system 102 to detect that user 112-1 has moved away from reference listening area 110 and to take action to avoid a scenario where user 112-1 hears the disruptive artifacts of the 3D sound 108 from locations external to reference listening area 110 where the assumptions baked into 3D sound 108 do not hold.
[0039] To illustrate, moment 100-A shows that loudspeakers 106 present 3D sound 108 to user 112-1 while user 112-1 is located within reference listening area 110. A path 116-A that user 112-1 may have traversed prior to moment 100-A shows that user 112-1 may have some limited mobility within reference listening area 110 while still experiencing the immersive realism of 3D sound 108 as presented by the array of loudspeakers 106. For example, the user may shift around slightly within this area, may sit or stand, etc., while still enjoying the 3D effects produced by the array of loudspeakers 106. However, since reference listening area 110 is a relatively small area directly in front of display screen 104, movements such as those illustrated by path 116-A clearly do not allow user 112-1 much freedom of motion while enjoying 3D sound 108 as presented by loudspeakers 106. Accordingly, the immersive experience of the communication session could be diminished if the user leaves the reference listening area 110 to retrieve something from another room (e.g., an object that user 112-1 wishes to show user 112-2 during the video communication session), to perform tasks outside of reference listening area 110 (e.g., tidying up around the house while talking to a friend, etc.), to relocate to a more comfortable place (e.g., to sit on a sofa that is still within scene 114-1 but further away from display screen 104), or for some other reason.
[0040] In accordance with one of these or other suitable reasons that user 112-1 may have to leave reference listening area 110, moment 100-B shows a path 116-B that user 112-1 may follow at some point after moment 100-A. Specifically, as shown, user 112-1 may leave reference listening area 110 to move through other parts of scene 114-1 external to reference listening area 110 (where 3D sound 108 is not configured to function properly and where undesirable artifacts of the sound could be heard). In some of the locations traversed by path 116-B, user 112-1 may still be able to see display screen 104. In other locations traversed by path 116-B (such as the final location behind communication system 102 where user 112-1 is depicted at moment 100-B), user 112-1 may be completely out of the field of view of communication system 102 and may be unable to see display screen 104. At all of these locations, however, it may be desirable for user 112-1 to still be able to hear user 112-2 as user 112-2 speaks and/or as other sound is communicated from scene 114-2. Indeed, to the extent possible, it would be desirable for user 112-1 to perceive a 3D sound field originating from the virtual portal into scene 114-2 that communication system 102 represents.
[0041] As has been mentioned, the array of loudspeakers 106 may not be configured to provide this immersive sound for user 112-1 after the user leaves reference listening area 110, since the array is specifically associated with reference listening area 110 (e.g., a static area with respect to communication system 102). Communication system 102 may, however, be configured to detect when user 112-1 leaves reference listening area 110 (e.g., such as when user 112-1 follows path 116-B and steps away from reference listening area 110 to go to other locations within scene 114-1). When user 112-1 is detected to leave reference listening area 110, communication system 102 may cease presenting 3D sound 108 on loudspeakers 106. This is depicted at moment 100-B by the absence of the dotted lines and the 3D sound 108 emerging from loudspeakers 106.
[0042] Instead, for this example, user 112-1 may be detected to be wearing a headphone device 118 that communication system 102 may use to present the 3D sound field for as long as user 112-1 remains outside reference listening area 110. For example, at an earlier moment (e.g., moment 100-A), headphone device 118 may be configured to initially calibrate to the orientation of user 112-1 (e.g., facing display screen 104) and to the location of the current sound source being presented (e.g., speech originating from user 112-2).
[0043] As user 112-1 then moves around and traverses path 116-B at later times (e.g., including at moment 100-B), an inertial measurement unit (IMU) within headphone device 118 or another suitable mechanism for tracking the pose of user 112-1 may be employed to determine a 3DOF or 6DOF pose of headphone device 118 so that spatialized 3D sound may be transitioned from being presented on the array of loudspeakers 106 to being presented on headphone device 118 as 3D sound 120. For example, binauralization algorithms may generate 3D sound 120 in the left and right sides of headphone device 118 based on the detected pose of user 112-1 and based on the 3D sound field that is being captured at scene 114-2, streamed from the second communication system at scene 114-2 to communication system 102, and provided (e.g., by way of Bluetooth, WiFi direct, or another suitable wireless or wired protocol) from communication system 102 to headphone device 118. For example, the detected pose may be a 3DOF or 6DOF pose of the head of user 112-1. If, at a later moment, user 112-1 then returns to reference listening area 110, signals from loudspeakers 106 may then serve as an anchor to avoid IMU-based drift and to achieve a seamless screen-locked audio experience regardless of where user 112-1 is located within scene 114-1.
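As a rough, non-authoritative sketch of such binauralization, the Python function below pans a mono stream to the two ears using a spherical-head (Woodworth) approximation of the interaural time difference and a crude level difference, keyed to a tracked head yaw. All names and constants here are illustrative assumptions; a real implementation would instead convolve the signal with a measured HRTF and handle full 3DOF/6DOF poses.

    import numpy as np

    SPEED_OF_SOUND = 343.0   # m/s
    HEAD_RADIUS = 0.0875     # m, nominal head radius for the Woodworth ITD model

    def binauralize(mono, sample_rate, source_azimuth, head_yaw):
        # Direction of the source in head coordinates (radians, positive to the
        # listener's left), wrapped to the range [-pi, pi).
        theta = (source_azimuth - head_yaw + np.pi) % (2 * np.pi) - np.pi
        # Woodworth approximation of the interaural time difference (ITD).
        itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(theta) + theta)
        delay = int(round(abs(itd) * sample_rate))
        delayed = np.concatenate([np.zeros(delay), mono])[:len(mono)]
        # Crude interaural level difference: attenuate the far ear off-center.
        far_gain = 0.5 + 0.5 * np.cos(theta)
        if itd >= 0:  # source toward the left: right ear delayed and attenuated
            left, right = mono, delayed * far_gain
        else:         # source toward the right: left ear delayed and attenuated
            left, right = delayed * far_gain, mono
        return np.stack([left, right], axis=1)  # (num_samples, 2) stereo output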
[0044] FIG. 2 shows other views of the communication session being carried out by the implementation of FIG. 1. More particularly, FIG. 2 shows a view 200-A that corresponds to a first moment of the communication session when user 112-1 is located within reference listening area 110 (e.g., similar to moment 100-A of FIG. 1) and a view 200-B that corresponds to a second moment of the communication session when user 112-1 has left reference listening area 110 (e.g., similar to moment 100-B of FIG. 1). While FIG. 1 illustrates only a perspective view of communication system 102, as display screen 104 allows user 112-1 to look from scene 114-1 (where the user is located) into scene 114-2 (where the second user 112-2 is located), views 200-A and 200-B each show portions of both scenes 114-1 and 114-2 from an overhead view that helps illustrate additional aspects of the communication session.
[0045] Specifically, view 200-A shows, from a top view, both a communication system 102-1 within scene 114-1 (i.e., the communication system 102 shown in FIG. 1) and a communication system 102-2 within scene 114-2 (which may be similar to communication system 102 but was not explicitly shown in the perspective view of FIG. 1). Similarly as described in relation to FIG. 1, communication system 102-1 is shown to include an array of loudspeakers 106 that produce 3D sound 108 to be perceived as immersive and realistic sound to user 112-1 as long as the user is located within reference listening area 110 (as is the case at the moment shown by view 200-A). As in the illustration of FIG. 1, view 200-A shows 3D sound 108 being directed from loudspeakers 106 to reference listening area 110. As described above, it will be understood that sound produced by these loudspeakers will be audible elsewhere in scene 114-1, but the sound is configured to properly simulate the 3D sound field only within reference listening area 110.
[0046] Scene 114-2 is shown to be separated from scene 114-1 by a barrier 202 that will be understood to represent any suitable distance between the scenes. For example, the two scenes could be as close together as two rooms in the same building or even two portions of the same room in certain examples, while the scenes may be far more remote (e.g., in different buildings, cities, countries, or even opposite parts of the world) in other examples. As shown, an audio data stream 204 is communicated between communication system 102-1 and communication system 102-2 during the communication session. For example, audio data stream 204 may represent bidirectional streams of data being concurrently transmitted by communication interfaces of each communication system throughout the session. While not explicitly shown in FIG. 2, it will be understood that audio data stream 204 may be transmitted using any suitable data transmission technologies to be carried over any suitable networks (e.g., local WiFi networks, other local area networks, wide area networks, private carrier networks, the public internet, etc.) or other communicative means.
[0047] Communication system 102-1 and communication system 102-2 will be understood to represent similar or identical communication systems having similar or identical features and capabilities. However, because the examples described herein involve user 112-2 speaking while user 112-1 listens, communication system 102-1 is illustrated to include the array of loudspeakers 106 while communication system 102-2 is shown to include an array of microphones 206 that capture the 3D sound field of scene 114-2, including speech sound 208 originating from user 112-2. As with the array of loudspeakers 106, it will be understood that microphones 206 of the array of microphones in communication system 102-2 may be disposed at any suitable locations with respect to communication system 102-2 and its display screen. For instance, the microphones 206 could be placed around the room and be communicatively coupled (e.g., by wired or local wireless communication) to communication system 102-2. In other examples, the microphones could be integrated with the display screen in a similar manner as described for the integrated array of loudspeakers 106.
[0048] Wherever the microphones 206 are located, the array of microphones may be configured to capture a 3D sound field that includes the speech sound 208 from user 112-2 along with other sounds (e.g., reverb and echoes from speech sound 208, other people speaking in scene 114-2, other noises from objects at the scene that are not explicitly shown, etc.). By transmitting this 3D sound field (e.g., as part of audio data stream 204 and along with other communication data such as video data and metadata that are not explicitly shown) and then presenting it using loudspeakers 106 as has been described, communication systems 102-1 and 102-2 may collectively simulate a verbal interaction 210-A in which user 112-1 perceives that user 112-2 is standing nearby and speaking in a similar way as might be done during an actual in-person conversation. Combined with similar video effects, users 112-1 and 112-2 may converse back and forth through their communication systems as if a portal is open between them and they can see and hear each other in like manner as if they were together in a common space.
[0049] In view 200-B, many of the same elements are shown, but user 112-1 is illustrated as having left reference listening area 110 to move along path 216 to a new location external to reference listening area 110 (a larger portion of scene 114-1 is shown to accommodate the movement of user 112-1 away from communication system 102-1). As described above in relation to FIG. 1, communication system 102-1 may be configured to detect when user 112-1 leaves reference listening area 110 and to cease presenting 3D sound 108 on loudspeakers 106. Accordingly, while the array of loudspeakers 106 is still shown in view 200-B, no sound is shown to be produced by these speakers in view 200-B. Instead, as described above in relation to moment 100-B, communication system 102-1 may be configured to present the 3D sound field (i.e., the same 3D sound field being captured by the array of microphones 206 that was previously presented using the array of loudspeakers 106) using headphone device 118 worn by user 112-1. As with the previous presentation of the 3D sound field, 3D sound 120 may be presented by headphone device 118 based on audio data stream 204. At this moment, however, 3D sound 120 may replace 3D sound 108 and may persist while the presenting of the 3D sound field using the array of loudspeakers 106 is ceased (e.g., for as long as user 112-1 is located outside of reference listening area 110).
[0050] The relative positioning of users 112-1 and 112-2 in view 200-B is such that the users may not be able to see one another through their screens in the same way they could at the earlier moment shown by view 200- A. However, a verbal interaction 210-B shows that spatialized 3D audio may continue to be presented and enjoyed as a consequence of transitioning the 3D sound to the headphone device 118. As shown, verbal interaction 210-B may be perceived, by user 112-1, as if the sound is still emerging from communication system 102-1. In this way, the illusion of having an open portal between the scenes may persist regardless of where user 112-1 moves within scene 114-1. In another implementation, the verbal interaction could be perceived as if the sound is still emerging from where user 112-2 appears on the screen to be located (not shown in FIG. 2, but as would be illustrated by an arrow that extends straight from user 112-2 to user 112-1 without necessarily passing through communication system 102-1).
[0051] As described above, the presenting of the 3D sound field (i.e., 3D sound 120) using headphone device 118 may be performed based on head movement tracking of user 112-1 by an inertial measurement unit (IMU) within headphone device 118. The IMU could track a 3DOF pose of the head of user 112-1 with respect to the virtual position of user 112-2 (i.e., the sound source in this example) or with respect to the communication system 102-1 (where the sound would originate if the communication systems implemented an actual portal). In other examples, a 6DOF or other suitable pose may be tracked and accounted for in the binauralization of the 3D sound field for presentation on headphone device 118.
[0052] The implementation and examples shown and described in relation to FIGS. 1 and 2 may apply for a listening user (e.g., user 112-1) that is wearing a headphone device that the communication system 102 can use to present the 3D sound field when the user leaves the reference listening area 110. Even when the user does not happen to be wearing or have access to such a headphone device, however, communication system 102 may still be configured to detect when the user leaves the reference listening area and to take action to try to avoid negative consequences such as have been described (e.g., the user perceiving the sound as artificial or noticing distracting artifacts of 3D sound 108 that is not tailored for the present location of the user). One alternative action that may be taken when the user is not wearing a headphone device will now be described in relation to FIGS. 3 and 4.
[0053] FIG. 3 shows illustrative aspects of an additional implementation of immersive sound field communication using a loudspeaker array in accordance with principles described herein. Except as otherwise noted, FIG. 3 is very similar to FIG. 1 and illustrates the same principles described above. For example, FIG. 3 shows communication system 102 implemented within scene 114-1 as a video communication system with a display screen 104 configured to present video captured at scene 114-2 by a second communication system. Along with the display screen 104, communication system 102 also includes the array of loudspeakers 106 that are shown to be disposed at static locations within scene 114-1 such that reference listening area 110 is a static area from where user 112-1 can view display screen 104. For example, as shown at a moment 300-A on the left-hand side of FIG. 3, user 112-1 may be present within reference listening area 110 and may be presented with 3D sound 108 (e.g., a 3D sound field tailored to reference listening area 110 as illustrated by the directionality of 3D sound 108 being projected to reference listening area 110) by the array of loudspeakers 106 as the user views scene 114-2 and user 112-2 through display screen 104. As described above, user 112-1 may have some limited mobility, such as to follow path 116-A, while still experiencing the 3D sound as intended, but may perceive distracting artifacts outside of reference listening area 110 if action is not taken by communication system 102.
[0054] Moment 300-B on the right-hand side of FIG. 3 illustrates the situation where user 112-1, without a headphone device, has traversed path 116-B away from reference listening area 110. Without the headphone device 118 worn by the user in the example of FIG. 1, communication system 102 is not able to transition the presentation of the 3D sound field from the array of loudspeakers 106 to 3D sound 120 on the headphone device as described above. Instead, as shown, communication system 102 may present a flattened sound field on the array of loudspeakers 106. That is, sound 320, which no longer includes 3D effects tailored for reference listening area 110, is shown to be directed outward to scene 114-1. In this way, user 112-1 may still hear user 112-2 speaking as user 112-1 leaves reference listening area 110. Even though the immersive realism of the 3D sound field may be lost in this example, the presentation of sound 320 will at least allow user 112-1 to continue engaging in the communication session and will not distract user 112-1 with any unwanted artifacts of the presentation of 3D sound 108.
[0055] FIG. 4 illustrates this same scenario with a view 400-A that is parallel to view 200-A of FIG. 2 (corresponding to an earlier moment such as moment 300-A), and with a view 400-B that is parallel to view 200-B of FIG. 2 (corresponding to a later moment such as moment 300-B). As shown in view 400-A, a verbal interaction 410-A may be achieved by loudspeakers 106 presenting 3D sound 108 for user 112-1 while the user is located in reference listening area 110. When user 112-1 leaves reference listening area 110 on path 416, however, a verbal interaction 410-B involves communication system 102-1 presenting, based on audio data stream 204 and while the presenting of the 3D sound field using the array of loudspeakers 106 (i.e., 3D sound 108) is ceased, a flattened sound field labeled as sound 320. Just like the 3D sound field, the flattened sound field of sound 320 is shown to be presented using the array of loudspeakers 106 and will be understood to represent sound captured at scene 114-2 by the array of microphones 206. The flattened sound field of sound 320 may comprise a mono or a stereo sound. The flattened sound field of sound 320 may comprise a smaller number of audio channels than loudspeakers 106. For example, the flattened sound field of sound 320 may be presented by emitting the same audio channel from some or all of the loudspeakers 106, in phase and/or at the same amplitude. In this way, user 112-1 may hear the verbal interaction 410-B that includes the speech sound 208 from user 112-2 even without headphones. While verbal interaction 410-B may not have the same immersive, 3D nature as the 3D sound of verbal interaction 410-A, it still advantageously avoids undesirable artifacts even though the user has stepped away from reference listening area 110.
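One plausible (hypothetical) way to derive such a flattened sound field is sketched below: the spatial channels are collapsed to a single mono channel that is then emitted identically, in phase and at equal amplitude, from every loudspeaker. The channel layout and the function names are assumptions for illustration only.

    import numpy as np

    def flatten_sound_field(channels):
        # channels: array of shape (num_channels, num_samples). Averaging the
        # channels discards the spatial (3D) cues; for an ambisonic stream, the
        # omnidirectional W channel alone could be used instead of an average.
        return channels.mean(axis=0)

    def render_flattened(mono, num_loudspeakers):
        # Emit the identical signal, in phase and at equal amplitude, from each
        # loudspeaker; no focal point is formed, so no off-axis 3D artifacts arise.
        return np.tile(mono, (num_loudspeakers, 1))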
[0056] FIG. 5 shows an illustrative method 500 for immersive sound field communication using a loudspeaker array in accordance with principles described herein. More particularly, method 500 shows one sequence of operations that may be performed by a communication system such as communication system 102 of FIGS. 1 and 3 (communication system 102-1 of FIGS. 2 and 4). While the perspective of communication system 102-1 is chosen arbitrarily for these examples, it will be understood that a similar method from the perspective of the second communication system 102-2 may also be performed in a similar way. FIG. 5 shows illustrative operations 502-508 according to one implementation, though it will be understood that other implementations of method 500 could omit, add to, reorder, and/or modify any of operations 502-508 shown in FIG. 5. While operations shown in FIG. 5 are illustrated with arrows suggestive of a sequential order of operation, it will be understood that some or all of the operations of method 500 may be performed concurrently (e.g., in parallel) with one another. Each of operations 502-508 of method 500 will now be described in more detail as the operations may be performed by a first communication system (e.g., communication system 102-1) used by a first user (e.g., user 112-1) who is listening during a communication session with a second communication system (e.g., communication system 102-2) used by a second user (e.g., user 112-2) who is speaking.
[0057] At operation 502, the first communication system may receive, at a first scene (e.g., scene 114-1) from the second communication system at a second scene (e.g., scene 114-2), an audio data stream (e.g., audio data stream 204) representing a 3D sound field. For example, as described above, the 3D sound field may be captured at the second scene by an array of microphones of the second communication system. The array of microphones may include omnidirectional and/or unidirectional microphones configured to capture the 3D sound field using ambisonics and/or other spatialized sound capture principles. The 3D sound field may then be encoded into the audio data stream and packaged with other communication information such as video information (e.g., visual 3D representations of people and objects at scene 114-2, etc.), metadata, and so forth, before being transmitted (e.g., over one or more networks, etc.) from the second communication system to the first communication system.
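To make the ambisonics reference concrete, the sketch below encodes a mono source into traditional first-order B-format (W, X, Y, Z) channels for a given direction of arrival. This is one standard formulation, offered as an assumption about what such an encoding could look like rather than the encoding the disclosure prescribes; a capture pipeline would derive these channels from the microphone array signals rather than from known source angles.

    import numpy as np

    def encode_first_order_ambisonics(mono, azimuth, elevation):
        # Traditional B-format encoding weights for a plane-wave source at the
        # given azimuth/elevation (radians).
        w = mono * (1.0 / np.sqrt(2.0))                  # omnidirectional component
        x = mono * np.cos(azimuth) * np.cos(elevation)   # front/back component
        y = mono * np.sin(azimuth) * np.cos(elevation)   # left/right component
        z = mono * np.sin(elevation)                     # up/down component
        return np.stack([w, x, y, z])                    # (4, num_samples) stream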
[0058] At operation 504, the first communication system may present the 3D sound field using an array of loudspeakers. For example, the first communication system may present the 3D sound field, based on the audio data stream received at operation 502, to the first user as the first user is located within a reference listening area at the first scene. As has been described and illustrated, the audio data stream and the array of loudspeakers may be configured such that the 3D sound field is presented to a reference listening area that is in a convenient location, such as a location where the first user might naturally be expected to reside while viewing a display screen of the first communication system (e.g., in the event that the first communication system is a video communication system such as shown in FIGS. 1-4). However, as conveniently located as the reference listening area may be, it may nevertheless be the case that the first user does not wish to be limited to this area during any active communication session.
[0059] Accordingly, at operation 506, the first communication system may detect, during the presentation of the 3D sound field, that the first user leaves the reference listening area. In some examples, the user may step away temporarily (for any of a variety of reasons) with an intention to come right back. In other examples, the user may choose to engage in much or an entirety of the communication session from a location away from the reference listening area (e.g., from a sofa across the room that does not happen to be within the reference listening area, etc.). In either case, the first communication system may make a determination that the user is not present in the reference listening area that the 3D sound field is being presented for, such that action may be taken. The detection that the first user leaves (or otherwise is not present in) the reference listening area may be performed in any suitable way, several examples of which will be described and illustrated in more detail below.
[0060] At operation 508, the first communication system may, in response to the detecting at operation 506 and while the first user remains outside the reference listening area, cease presenting the 3D sound field using the array of loudspeakers. This is not to say that the first communication system necessarily ceases presenting any audio to the first user during the communication session. Rather, in recognition that the user is not located in the reference listening area that the 3D sound field is referenced to or reproduced for, the first communication system may cease using the array of loudspeakers to present the 3D sound field and may instead present the audio to the first user in another suitable way that will help keep the communication experience as immersive as possible for the first user.
[0061] As a first example of how the sound presentation may change in connection with operation 508, the first communication system may transition to presenting the 3D sound field based on the audio data stream received at operation 502, but using a headphone device worn by the first user (rather than the array of loudspeakers). This type of example was described and illustrated above in relation to FIGS. 1 and 2. As another example of how the sound presentation may change in connection with operation 508, the first communication system may transition to presenting a flattened sound field (rather than a 3D sound field) using the array of loudspeakers. For example, this approach may be useful if the first user is not wearing a headphone device, as illustrated and described above in relation to FIGS. 3 and 4. This presentation may still be based on the audio data stream received at operation 502.
[0062] Method 500 may be performed by any communication system described herein, including implementations of communication system 102 described and illustrated above (e.g., communication system 102-1, communication system 102-2, etc.). In some examples, such communication systems may include, along with the array of loudspeakers, one or more processors communicatively coupled to the array of loudspeakers and configured to perform a process implementing method 500 (i.e., a process including operations 502-508 and/or additional or alternative operations in certain examples). Another way that method 500 may be implemented is as instructions embodied on a non-transitory computer-readable medium. For example, the non-transitory computer-readable medium may store instructions that, when executed, cause a processor of a communication system at a scene to perform a process implementing method 500.
[0063] For various examples described above, two discrete moments during a communication session have been described and illustrated: 1) a moment in which the 3D sound field is being presented on the array of loudspeakers and nothing is being presented on the headphone device, and 2) a moment in which the 3D sound field is being presented on the headphone device and nothing is being presented on the array of loudspeakers. Corresponding moments have also been described for scenarios where no headphone device is available and the sound field is flattened for the loudspeakers at the second moment. As will now be described and illustrated, it may be desirable, between these discrete moments, for the sound presentation to transition smoothly between the two different modes of sound presentation (e.g., between the loudspeakers and the headphones or between the 3D sound and the flattened sound).
[0064] Specifically, FIG. 6 shows illustrative aspects of a transition 600 between a 3D sound presentation on a loudspeaker array and the 3D sound presentation on a headphone device (i.e., the transition from the loudspeakers to the headphones) in accordance with principles described herein. FIG. 7 then shows illustrative aspects of a transition 700 between a 3D sound presentation on a headphone device and on a loudspeaker array (i.e., the transition back from the headphones to the loudspeakers) in accordance with principles described herein. While these transitions 600 and 700 are described and illustrated for an implementation in which the headphone device is available and being worn by the user for an uninterrupted 3D audio experience, it will be understood that similar principles could be applied to an implementation (such as illustrated in FIGS. 3 and 4) in which no headphones are being used and the transition is from a 3D sound field to a flattened sound field.
[0065] In FIG. 6, a scenario is illustrated in which the communication system (e.g., communication system 102-1) may determine, during the presenting of the 3D sound field on the array of loudspeakers and prior to user 112-1 leaving reference listening area 110, that user 112-1 is near a boundary 602 of reference listening area 110. Specifically, as shown, a path 616 that leads user 112-1 out of reference listening area 110 (analogous to paths described above such as paths 116-B, 216, 416, etc.) may take user 112-1 across an inner boundary 604-I (‘I’ for “inner”) into a transition region 606 that is defined just inside the boundary 602 of reference listening area 110. Thereafter, path 616 may lead user 112-1 through a transition region 608 that is defined with an outer boundary 604-O (‘O’ for “outer”) just outside the boundary 602 of reference listening area 110 and then out to portions of the scene that are not particularly proximate to reference listening area 110.
[0066] As shown on the graph below reference listening area 110 in FIG. 6, a transition 610 from the 3D sound field being presented on the loudspeaker array (e.g., the array of loudspeakers 106) to being presented on the headphone device (e.g., headphone device 118) may be performed as user 112-1 passes through transition regions 606 and 608. For example, in response to the determining that user 112-1 is near boundary 602 (e.g., based on the user having crossed inner boundary 604-I), the communication system may present the 3D sound field using both the array of loudspeakers and the headphone device worn by the user in a manner that transitions from one to the other. Specifically, as shown in the graph depicting what mechanism is used to present the 3D sound field (“3D Sound Presentation”) with respect to the user’s location (“User Position”), the loudspeaker array is shown to be used exclusively (or nearly exclusively) while user 112-1 is located within reference listening area 110 and not particularly close to boundary 602 (i.e., outside of transition region 606). Once user 112-1 crosses into transition region 606 and until the user leaves region 608, the graph then shows how the 3D sound presentation may transition from the loudspeaker array (which is shown to ramp down during transition 610) to the headphone device (which is shown to ramp up during transition 610) until the headphone device is used exclusively (or nearly exclusively) for presenting the 3D sound field.
[0067] While the ramps shown in transition 610 in FIG. 6 are drawn linearly, it will be understood that the transition may take any suitable shape and may occur over any suitable distance. For instance, in certain implementations, either or both of the ramps associated with transition 610 may be non-linear (e.g., parabolic shaped, logarithmically shaped, exponentially shaped, etc.). Additionally, in certain implementations, only one of the audio mechanisms may transition at the rate shown, while the other may change instantaneously (e.g., a step function) or at a different rate. For example, the headphone device could begin presenting the 3D sound field at full volume as soon as user 112-1 crosses inner boundary 604-I or boundary 602, while the loudspeaker array may drop out more gradually according to the transition curves shown.
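A minimal sketch of such a transition is shown below, assuming a linear ramp keyed to the user's distance from the display screen. The boundary positions and the linear shape are illustrative choices; a constant-power crossfade or any of the non-linear shapes mentioned above could be substituted.

    def transition_gains(user_distance, inner_boundary, outer_boundary):
        # Returns (loudspeaker_gain, headphone_gain), each in [0, 1], for a
        # user at the given distance (meters) from the display screen.
        if user_distance <= inner_boundary:   # well inside the listening area
            return 1.0, 0.0                   # loudspeaker array only
        if user_distance >= outer_boundary:   # past the outer transition region
            return 0.0, 1.0                   # headphone device only
        # Linear crossfade across the transition regions (e.g., 606 and 608).
        t = (user_distance - inner_boundary) / (outer_boundary - inner_boundary)
        return 1.0 - t, t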
[0068] To make the communication session as immersive as possible, transition 610 may be performed smoothly and with certain limits or guardrails to ensure that no change is made so abruptly as to risk distracting the user from the immersive experience. To this end, a refresh rate for detecting the user’s location with respect to the various boundaries 602, 604-I, and 604-O may be selected that is high enough to produce a smooth transition even if the user is moving relatively quickly. For example, the user location detection (which may be performed in various ways described in more detail below) may be performed or updated several times per second. In some cases, the detection rate may be equal to or related to the frame rate at which video and/or audio are being captured by the communication system.
[0069] In FIG. 7, a scenario is illustrated in which the communication system may determine, subsequent to ceasing the presenting of the 3D sound field using the array of loudspeakers while user 112-1 is external to reference listening area 110, that user 112-1 reenters the reference listening area 110. Specifically, as shown, a path 716 that leads user 112-1 back into reference listening area 110 (analogous to the reverse of paths described above such as paths 116-B, 216, 416, 616, etc.) may take user 112-1 across an outer boundary 704-O (‘O’ for “outer”) into a transition region 708 that is defined just outside a boundary 702 of reference listening area 110. Thereafter, path 716 may lead user 112-1 through a transition region 706 that is defined with an inner boundary 704-I (‘I’ for “inner”) just inside the boundary 702 of reference listening area 110 and then into the reference listening area 110.
[0070] Similarly as described above in relation to FIG. 6, a graph below reference listening area 110 in FIG. 7 shows a transition 710 from the 3D sound field being presented using the headphone device (e.g., headphone device 118) to being presented using the loudspeaker array (e.g., the array of loudspeakers 106). Transition 710 may be performed as user 112-1 passes through transition regions 708 and 706. For example, while the communication system may eventually cease presenting the 3D sound field using the headphone device (and again present the 3D sound field using the array of loudspeakers) once user 112-1 arrives back in reference listening area 110, the system may perform a similar transition as described above when it is determined that user 112-1 is in the process of reentering reference listening area 110 (e.g., as user 112-1 crosses outer boundary 704-O and then boundary 702). As shown during transition 710, for instance, the communication system may present the 3D sound field using both the headphone device worn by the user and the array of loudspeakers in a manner that transitions from one to the other.
[0071] More particularly, as shown in the graph depicting what mechanism is used to present the 3D sound field (“3D Sound Presentation”) with respect to the user’s location (“User Position”), the headphone device is shown to be used exclusively (or nearly exclusively) while user 112-1 is located outside of reference listening area 110 and not particularly close to boundary 702 (i.e., outside of transition region 708). Once user 112-1 crosses into transition region 708 and until the user leaves region 706, the graph then shows how the 3D sound presentation may transition from the headphone device (which is shown to ramp down during transition 710) to the array of loudspeakers (which is shown to ramp up during transition 710) until the loudspeakers are again used exclusively (or nearly exclusively) for presenting the 3D sound field. As described above in relation to FIG. 6, the shape and length of these transitional ramps may be implemented in any manner as may serve a particular implementation. Additionally, the tracking rate of the user as the user moves about in the scene may be any suitable rate such as described above (e.g., several times per second, matching the frame rate used by the communication systems, etc.).
[0072] As has been described, one operation that communication systems described herein may perform as part of providing immersive sound field communication using a loudspeaker array is to detect when a user leaves (and later reenters) a reference listening area. For example, this detection was described above in relation to operation 506 of method 500 and was mentioned in connection with the transitions illustrated in FIGS. 6 and 7. The detecting of a user leaving a reference listening area may be performed using any of a variety of techniques or a combination of such techniques. To illustrate a few examples, FIGS. 8A-8C show illustrative aspects of how a communication system may detect when a user is located within a reference listening area and when the user leaves the reference listening area in accordance with principles described herein. More specifically, FIG. 8A shows a detection technique 800-A and FIG. 8B shows a detection technique 800-B that are each based on analyzing certain characteristics of an acoustic signal emitted by the communication system to determine whether certain predefined thresholds are satisfied. FIG. 8C then shows a detection technique 800-C that is based on visual tracking of the user with respect to the reference listening area.
[0073] In FIG. 8A, detection technique 800-A is illustrated by showing an acoustic signal 801 emitted from communication system 102-1 (e.g., by the array of loudspeakers thereof) to be received by headphone device 118 as worn by user 112-1. For example, this acoustic signal may be an acoustic-based excitation signal that can be detected by headphone device 118 so that various characteristics of the signal can be analyzed to determine if user 112-1 is likely to be located within the reference listening area 110.
[0074] In some examples, acoustic signal 801 may be specifically configured only for the purpose of location detection and, as such, may have certain properties tailored to that use case. For instance, the acoustic signal may be an inaudible signal that is emitted outside of a frequency range associated with human hearing capability (e.g., a subsonic signal including frequencies less than about 20 Hz, an ultrasonic signal including frequencies greater than about 20 kHz, etc.). Such a signal may be produced using one or more of the loudspeakers 106 or using other mechanisms or loudspeakers specifically configured for this purpose. Additionally, acoustic signal 801 may be produced in short pulses or chirps that are relatively straightforward for headphone device 118 to identify and analyze. For example, such pulses may be performed at intervals associated with the detection rate described above (e.g., several times per second, etc.).
[0075] In other examples, acoustic signal 801 may be audible to the user (i.e., within the frequency range associated with human hearing, such as between about 20 Hz and 20,000 Hz). In some implementations, acoustic signal 801 may be included as part of the sound (e.g., 3D sound 108, flattened sound 320, etc.) that communication system 102-1 is presenting using the loudspeakers 106.
[0076] Based on a particular characteristic of acoustic signal 801 as received by headphone device 118, communication system 102-1 may determine whether a predefined threshold associated with the characteristic is satisfied. More particularly, in this example, the characteristic may include a time-of-flight of acoustic signal 801 as received by headphone device 118 and the predefined threshold may include a time-of-flight threshold configured to be satisfied when headphone device 118 is outside reference listening area 110. For example, communication system 102-1 and headphone device 118 may be synchronized and pulses of acoustic signal 801 may be emitted according to a schedule known to both communication system 102-1 and headphone device 118 (e.g., a pulse every 25 ms, a pulse every 100 ms, etc.). Based on an assumption that a pulse of acoustic signal 801 was emitted at a prescheduled time, an estimate of the position of headphone device 118 with respect to the speaker emitting acoustic signal 801 may be computed based on when headphone device 118 receives the pulse. The time between the emission and receipt of acoustic signal 801 is the time-of-flight characteristic. In a similar implementation, the headphone device could send the pulses or immediately repeat the pulses (for transmission to and receipt by communication system 102-1) so that the time-of-flight characteristic could be determined in other ways.
[0077] FIG. 8A illustrates a distance graph beneath communication system 102-1 and user 112-1 that shows, along one dimension, how acoustic signal 801 may propagate from a starting position 802 corresponding to communication system 102-1 to an ending position 804 corresponding to headphone device 118. The distance graph is also shown to double as a timeline (“Distance/Time”), since the time of flight of acoustic signal 801 from starting position 802 to ending position 804 is directly related to the distance between them. As such, starting position 802 may also represent a first time when acoustic signal 801 is emitted while ending position 804 may also represent a second time when acoustic signal 801 is received by headphone device 118. On either side of the boundary demarcating reference listening area 110, additional positions 806-1 and 806-2 are also shown to represent threshold times that, together, may define a time-of-flight threshold 808. In other words, if the detected time of flight between starting position 802 and ending position 804 is within the time-of-flight threshold 808 range (as would be expected for the current location of user 112-1 and headphone device 118 shown in FIG. 8A), the threshold may be considered to not be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 is located within reference listening area 110. In contrast, if the detected time of flight between starting position 802 and ending position 804 is outside the time-of-flight threshold 808 range (e.g., as would occur if user 112-1 and headphone device 118 were located outside of reference listening area 110), the threshold may be considered to be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 has left reference listening area 110.
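The time-of-flight check could be reduced to something like the following sketch, in which the pulse schedule, the threshold values, and the function name are all hypothetical assumptions rather than values from the disclosure:

    SPEED_OF_SOUND = 343.0  # m/s

    def outside_area_by_tof(emit_time, receive_time, tof_threshold_range):
        # The emit time is known from the pulse schedule shared by the
        # communication system and the headphone device; the threshold range
        # corresponds to positions 806-1 and 806-2 bracketing the area boundary.
        time_of_flight = receive_time - emit_time
        low, high = tof_threshold_range
        return not (low <= time_of_flight <= high)  # True: likely outside the area

    # Hypothetical example: the reference listening area spans roughly 1.0 m to
    # 2.5 m from the emitting loudspeaker; a 9 ms flight time (about 3.1 m)
    # satisfies the threshold, indicating the user has likely left the area.
    print(outside_area_by_tof(0.000, 0.009, (1.0 / SPEED_OF_SOUND, 2.5 / SPEED_OF_SOUND)))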
[0078] Similar to FIG. 8A, detection technique 800-B shown in FIG. 8B is illustrated by again showing acoustic signal 801 emitted from communication system 102-1 to be received by headphone device 118 as worn by user 112-1. Rather than the time-of-flight characteristic described and illustrated in relation to detection technique 800-A, however, detection technique 800-B shows an example in which the characteristic being analyzed includes an amplitude of acoustic signal 801 as received by headphone device 118, and in which the predefined threshold includes an amplitude threshold configured to be satisfied when headphone device 118 is outside reference listening area 110. For example, communication system 102-1 may emit acoustic signal 801 (e.g., as a series of pulses/chirps or in another form) at a predetermined amplitude (e.g., volume, sound intensity level, etc.), such that an estimate of the position of headphone device 118 with respect to communication system 102-1 may be computed based on the amplitude of the signal as received by headphone device 118 and a known rate of decay of the signal over distance (e.g., stored in a lookup table, computed based on physical principles, etc.).
[0079] FIG. 8B illustrates a distance vs. amplitude graph beneath communication system 102-1 and user 112-1 that shows, along one dimension, how an amplitude 810 of acoustic signal 801 may decay from a starting position 812 corresponding to communication system 102-1 (where the signal is emitted) to an ending position 814 corresponding to headphone device 118 (where the signal is received). At starting position 812, the amplitude 810 of acoustic signal 801 is shown to be relatively high, but, as acoustic signal 801 propagates through the scene to eventually be detected at headphone device 118, the graph shows that amplitude 810 decreases at a known rate. Accordingly, if amplitude 810 is measured to be between amplitudes at threshold distances 816-1 and 816-2, or, in other words, if the measured amplitude is within an amplitude threshold 818 range, communication system 102-1 may determine that it is likely that headphone device 118 is located within reference listening area 110. In contrast, if the amplitude detected at ending position 814 is outside the amplitude threshold 818 range (e.g., as would occur if user 112-1 and headphone device 118 were located outside of reference listening area 110), the threshold may be considered to be satisfied, such that communication system 102-1 may determine that it is likely that headphone device 118 has left reference listening area 110.
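Analogously, the amplitude check might look like the following sketch, which assumes a free-field inverse-distance (1/r) pressure decay; as noted above, a lookup table calibrated for the room could replace this closed-form model. The names and the reference-distance convention are illustrative assumptions.

    def estimate_distance(received_amplitude, emitted_amplitude, reference_distance=1.0):
        # Free-field pressure decays roughly as 1/r; inverting that relationship
        # estimates how far the headphone device is from the emitting loudspeaker,
        # where emitted_amplitude is the level measured at the reference distance.
        return reference_distance * emitted_amplitude / received_amplitude

    def outside_area_by_amplitude(received_amplitude, amplitude_threshold_range):
        # The threshold range corresponds to the amplitudes expected at the
        # threshold distances 816-1 and 816-2 bracketing the area boundary.
        low, high = amplitude_threshold_range
        return not (low <= received_amplitude <= high)  # True: likely outside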
[0080] Detection technique 800-A and detection technique 800-B may be efficient and advantageous techniques for detecting whether a user is located within a reference listening area, since emitting the acoustic signal and then detecting and analyzing it in the ways that have been described may be relatively straightforward and inexpensive operations (e.g., in terms of computing, in terms of necessary equipment, etc.). Additionally, in the event that communication system 102-1 is implemented by an audio-only communication device or otherwise does not include cameras or video equipment, detection techniques 800-A and 800-B may be advantageous since they do not rely on video data or image analysis.
[0081] On the other hand, it could also be the case that a communication device is implemented with video support (such as described above for communication systems 102-1 and 102-2), such that a plurality of cameras, computation associated with image capture and object tracking of users at the scene, and other such resources are already available for other purposes. In this scenario, it may be convenient to reuse tracking operations that have already been performed (or at least to reuse video data that is being captured anyway) for tracking the user location and detecting when the user leaves reference listening area 110.
[0082] To illustrate, FIG. 8C shows a detection technique 800-C that uses a plurality of cameras 820 included within (e.g., integrated with) communication system 102-1 to capture imagery of the scene, including images of user 112-1 as the user moves about the scene. In this example, the detecting that user 112-1 leaves reference listening area 110 may include: 1) tracking a location of user 112-1 based on video captured by cameras 820 of communication system 102-1; and 2) determining, based on the tracking, that the location of user 112-1 is outside reference listening area 110. For example, if the user location is being tracked already (e.g., so that the user may be properly modeled, so that the 3D sound field may be properly interpreted and represented, etc.), that tracked location may be compared to a predetermined region associated with the reference listening area to determine whether the location is inside or outside the reference listening area.
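With camera-based tracking, the comparison against the predetermined region may reduce to a simple geometric test, such as the illustrative circular-region check sketched below (the region shape, coordinate convention, and function name are assumptions; a rectangular or arbitrary polygon region would work similarly):

    def user_outside_reference_area(tracked_location, area_center, area_radius):
        # tracked_location and area_center are (x, z) floor-plane coordinates in
        # meters, as might be produced by a vision-based user tracker; the
        # reference listening area is modeled here as a circle of area_radius.
        dx = tracked_location[0] - area_center[0]
        dz = tracked_location[1] - area_center[1]
        return (dx * dx + dz * dz) ** 0.5 > area_radius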
[0083] As mentioned above, the detecting that user 112-1 leaves reference listening area 110 may be performed as a multi-modal detection based on any combination of detection techniques 800-A, 800-B, 800-C, and/or other suitable detection techniques such as determining the headphone location using a non-acoustic wireless signal (e.g., a WiFi signal, a Bluetooth signal, an ultra-wideband signal, etc.). For example, in one specific implementation the multi-modal detection could include both: 1) the tracking of the location of user 112-1 based on the video of detection technique 800-C; and 2) at least one other indication that user 112-1 is outside reference listening area 110 (e.g., indications derived from the time-of-flight and/or amplitude characteristics described in relation to detection techniques 800-A and 800-B). As another example, the multi-modal detection may not involve video, but may instead be based on a combination of acoustic signal characteristics such as the time-of-flight and/or amplitude characteristics described in relation to detection techniques 800-A and 800-B. As yet another example, the multi-modal detection could be based on a weighted combination of all of these and/or other factors.
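One hypothetical form such a weighted multi-modal combination could take is sketched below; the weights and the decision threshold are arbitrary placeholders rather than values taken from the disclosure.

    def user_has_left_area(tof_outside, amplitude_outside, video_outside,
                           weights=(0.25, 0.25, 0.5), threshold=0.5):
        # Fuse the boolean outputs of the individual detectors (techniques
        # 800-A, 800-B, and 800-C) into a single weighted decision; weights
        # and threshold here are illustrative placeholders.
        votes = (tof_outside, amplitude_outside, video_outside)
        score = sum(w * float(v) for w, v in zip(weights, votes))
        return score > threshold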
[0084] As has been mentioned, various methods and processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices. In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium (e.g., a memory, etc.), and executes those instructions, thereby performing one or more operations such as the operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
[0085] A computer-readable medium (also referred to as a processor-readable medium) includes any non-transitory medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media, and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks and other persistent memory. Volatile media may include, for example, dynamic random-access memory (DRAM), which typically constitutes a main memory. Common forms of computer-readable media include, for example, a disk, hard disk, magnetic tape, any other magnetic medium, a compact disc read-only memory (CD-ROM), a digital video disc (DVD), any other optical medium, random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), FLASH-EEPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
[0086] FIG. 9 shows an illustrative computing system 900 that may be used to implement various devices and/or systems described herein. For example, computing system 900 may include or implement (or partially implement) communication systems such as communication system 102, any implementations thereof, any components thereof, and/or other devices used therewith.
[0087] As shown in FIG. 9, computing system 900 may include a communication interface 902, a processor 904, a storage device 906, and an input/output (I/O) module 908 communicatively connected via a communication infrastructure 910. While an illustrative computing system 900 is shown in FIG. 9, the components illustrated in FIG. 9 are not intended to be limiting. Additional or alternative components may be used in other embodiments. Components of computing system 900 shown in FIG. 9 will now be described in additional detail.
[0088] Communication interface 902 may be configured to communicate with one or more computing devices. Examples of communication interface 902 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
[0089] Processor 904 generally represents any type or form of processing unit capable of processing data or interpreting, executing, and/or directing execution of one or more of the instructions, processes, and/or operations described herein. Processor 904 may direct execution of operations in accordance with one or more applications 912 or other computer-executable instructions such as may be stored in storage device 906 or another computer-readable medium.
[0090] Storage device 906 may include one or more data storage media, devices, or configurations and may employ any type, form, and combination of data storage media and/or device. For example, storage device 906 may include, but is not limited to, a hard drive, network drive, flash drive, magnetic disc, optical disc, RAM, dynamic RAM, other non-volatile and/or volatile data storage units, or a combination or sub-combination thereof. Electronic data, including data described herein, may be temporarily and/or permanently stored in storage device 906. For example, data representative of one or more executable applications 912 configured to direct processor 904 to perform any of the operations described herein may be stored within storage device 906. In some examples, data may be arranged in one or more databases residing within storage device 906.
[0091] I/O module 908 may include one or more I/O modules configured to receive user input and provide user output. One or more I/O modules may be used to receive input for a single virtual experience. I/O module 908 may include any hardware, firmware, software, or combination thereof supportive of input and output capabilities. For example, I/O module 908 may include hardware and/or software for capturing user input, including, but not limited to, a keyboard or keypad, a touchscreen component (e.g., touchscreen display), a receiver (e.g., an RF or infrared receiver), motion sensors, and/or one or more input buttons.
[0092] I/O module 908 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, I/O module 908 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0093] The following examples describe implementations of immersive sound field communication using a loudspeaker array during a communication session in accordance with principles described herein.
[0094] Example 1: A method comprising: receiving, by a first communication system at a first scene from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
[0095] Example 2: The method of any of the preceding examples, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
[0096] Example 3: The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area includes: emitting an acoustic signal from the first communication system to the headphone device; and determining, based on a characteristic of the acoustic signal as received by the headphone device, that a predefined threshold associated with the characteristic is satisfied.
[0097] Example 4: The method of any of the preceding examples, wherein: the characteristic includes an amplitude of the acoustic signal as received by the headphone device; and the predefined threshold includes an amplitude threshold configured to be satisfied when the headphone device is outside the reference listening area.

[0098] Example 5: The method of any of the preceding examples, wherein: the characteristic includes a time-of-flight of the acoustic signal as received by the headphone device; and the predefined threshold includes a time-of-flight threshold configured to be satisfied when the headphone device is outside the reference listening area.
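A minimal sketch of the threshold tests of Examples 3 through 5 follows, assuming the speed of sound and the specific constants shown; none of these values or names are prescribed by the disclosure.

```python
# Hypothetical threshold checks on an acoustic signal emitted by the
# communication system and measured at the headphone device.
SPEED_OF_SOUND_M_PER_S = 343.0


def amplitude_threshold_satisfied(received_db: float, threshold_db: float) -> bool:
    # Received amplitude drops with distance, so a reading below the
    # threshold suggests the headphone device is outside the listening area.
    return received_db < threshold_db


def tof_threshold_satisfied(tof_s: float, max_path_m: float) -> bool:
    # A longer time-of-flight implies a longer acoustic path; exceeding the
    # path bound of the listening area satisfies the threshold.
    return tof_s * SPEED_OF_SOUND_M_PER_S > max_path_m


print(amplitude_threshold_satisfied(-42.0, -35.0))  # True: too quiet
print(tof_threshold_satisfied(0.012, 3.0))          # True: ~4.1 m path > 3 m
```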
[0099] Example 6: The method of any of the preceding examples, wherein the acoustic signal is an inaudible signal emitted outside of a frequency range associated with human hearing capability.
[0100] Example 7: The method of any of the preceding examples, wherein the presenting the 3D sound field using the headphone device is performed based on head movement tracking of the user by an inertial measurement unit within the headphone device.
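Example 7 can be pictured with a small sketch. It assumes the 3D sound field is carried as first-order ambisonics (W, X, Y, Z) and that only IMU-reported yaw is compensated, with one common sign convention chosen arbitrarily; the disclosure does not specify this representation.

```python
# Hypothetical yaw compensation for head-tracked headphone rendering.
# W (omnidirectional) and Z (vertical) are invariant under yaw rotation.
import math


def rotate_ambisonics_yaw(w: float, x: float, y: float, z: float,
                          yaw_rad: float) -> tuple[float, float, float, float]:
    # Counter-rotate the field by the head yaw so sources stay fixed in the
    # room as the head turns.
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return w, c * x + s * y, -s * x + c * y, z


# Example: a 90-degree head turn moves a source from the X component into
# the (negative) Y component.
print(rotate_ambisonics_yaw(1.0, 1.0, 0.0, 0.0, math.pi / 2))
```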
[0101] Example 8: The method of any of the preceding examples, further comprising: determining, during the presenting of the 3D sound field and prior to the user leaving the reference listening area, that the user is near a boundary of the reference listening area; and in response to the determining that the user is near the boundary, presenting the 3D sound field using both the array of loudspeakers and the headphone device worn by the user.
[0102] Example 9: The method of any of the preceding examples, further comprising: determining, subsequent to the ceasing presenting the 3D sound field using the array of loudspeakers, that the user reenters the reference listening area; and in response to the determining that the user reenters the reference listening area, ceasing presenting the 3D sound field using the headphone device and again presenting the 3D sound field using the array of loudspeakers.
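The handoff behavior of Examples 8 and 9, including the near-boundary overlap, can be sketched as a simple distance-based selector; the radius, boundary margin, and names are illustrative assumptions only.

```python
# Hypothetical output selection for the loudspeaker/headphone handoff.
def select_outputs(distance_m: float,
                   area_radius_m: float = 1.5,
                   boundary_margin_m: float = 0.3) -> set[str]:
    if distance_m <= area_radius_m - boundary_margin_m:
        return {"loudspeakers"}                # well inside: 3D field via array
    if distance_m <= area_radius_m:
        return {"loudspeakers", "headphones"}  # near boundary: present on both
    return {"headphones"}                      # outside: array ceased


for d in (0.5, 1.4, 2.0, 0.8):
    print(d, select_outputs(d))
# The final 0.8 m sample models reentry: the array resumes and the
# headphones cease, mirroring Example 9.
```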
[0103] Example 10: The method of any of the preceding examples, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured at the second scene, the flattened sound field presented using the array of loudspeakers.
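One way to picture the flattened presentation of Example 10, again assuming a first-order ambisonic stream; the disclosure does not prescribe a particular downmix, so this choice is an assumption.

```python
# Hypothetical flattening: keep only the omnidirectional W component,
# discarding the directional X/Y/Z terms, so the array plays a
# non-spatial mix while the user is outside the listening area.
def flatten_ambisonic_frame(w: float, x: float, y: float, z: float) -> float:
    return w


print(flatten_ambisonic_frame(0.8, 0.3, -0.2, 0.1))  # 0.8
```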
[0104] Example 11: The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area includes: tracking a location of the user based on video captured by a camera of the first communication system; and determining, based on the tracking, that the location of the user is outside the reference listening area.
[0105] Example 12: The method of any of the preceding examples, wherein the detecting that the user leaves the reference listening area is performed as a multi-modal detection based on: the tracking of the location of the user based on the video; and at least one other indication that the user is outside the reference listening area.

[0106] Example 13: The method of any of the preceding examples, wherein the first communication system and the second communication system are video communication systems, the first communication system including a display screen configured to present video captured at the second scene by a camera of the second communication system.
[0107] Example 14: The method of any of the preceding examples, wherein the display screen and the array of loudspeakers of the first communication system are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
[0108] Example 15: A communication system comprising: a display screen; an array of loudspeakers configured to present 3D sound configured for a reference listening area from where a user views the display screen; and a processor communicatively coupled to the array of loudspeakers and configured to perform a process comprising: receiving, from an additional communication system, an audio data stream representing a 3D sound field captured by an array of microphones of the additional communication system; presenting, using the array of loudspeakers and based on the audio data stream, the 3D sound field to the user as the user is located within the reference listening area; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
[0109] Example 16: The communication system of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
[0110] Example 17: The communication system of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured by the array of microphones of the additional communication system, the flattened sound field presented using the array of loudspeakers.
[0111] Example 18: The communication system of any of the preceding examples, wherein: the communication system is a first video communication system at a first scene and the additional communication system is a second video communication system at a second scene; the display screen is configured to present video captured at the second scene by a camera of the second video communication system; and the display screen and the array of loudspeakers are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
[0112] Example 19: A non-transitory computer-readable medium storing instructions that, when executed, cause a processor of a first communication system at a first scene to perform a process comprising: receiving, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system; presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene; detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
[0113] Example 20: The non-transitory computer-readable medium of any of the preceding examples, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
[0114] Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0115] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the description and claims. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
[0116] Specific structural and functional details disclosed herein are merely representative for purposes of describing example implementations. Example implementations, however, may be embodied in many alternate forms and should not be construed as limited to only the implementations set forth herein.
[0117] It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. A first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the implementations of the disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0118] The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the implementations. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used in this specification, specify the presence of the stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
[0119] It will be understood that when an element is referred to as being “coupled,” “connected,” or “responsive” to, or “on,” another element, it can be directly coupled, connected, or responsive to, or on, the other element, or intervening elements may also be present. In contrast, when an element is referred to as being “directly coupled,” “directly connected,” or “directly responsive” to, or “directly on,” another element, there are no intervening elements present. As used herein the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0120] Spatially relative terms, such as “beneath,” “below,” “lower,” “above,” “upper,” and the like, may be used herein for ease of description to describe one element or feature in relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as “below” or “beneath” other elements or features would then be oriented “above” the other elements or features. Thus, the term “below” can encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein may be interpreted accordingly.

[0121] Unless otherwise defined, the terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which these concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[0122] Further to the descriptions above, a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), and whether the user is sent content or communications from a server. In addition, certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity may be treated so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, zip code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.
[0123] While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes, and equivalents may occur to those skilled in the art. It is therefore to be understood that the appended claims are intended to cover such modifications and changes as fall within the scope of the implementations. It will be understood that the implementations have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The implementations described herein can include various combinations and/or sub-combinations of the functions, components, and/or features of the different implementations described. As such, the scope of the present disclosure is not limited to the particular combinations hereafter claimed, but instead extends to encompass any combination of features or example implementations described herein irrespective of whether or not that particular combination has been specifically enumerated in the accompanying claims at this time.

Claims

WHAT IS CLAIMED IS:
1. A method comprising:
receiving, by a first communication system at a first scene from a second communication system at a second scene, an audio data stream representing a three-dimensional (3D) sound field captured at the second scene by an array of microphones of the second communication system;
presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene;
detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and
in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
2. The method of claim 1, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
3. The method of claim 2, wherein the detecting that the user leaves the reference listening area includes: emitting an acoustic signal from the first communication system to the headphone device; and determining, based on a characteristic of the acoustic signal as received by the headphone device, that a predefined threshold associated with the characteristic is satisfied.
4. The method of claim 3, wherein: the characteristic includes an amplitude of the acoustic signal as received by the headphone device; and the predefined threshold includes an amplitude threshold configured to be satisfied when the headphone device is outside the reference listening area.
5. The method of claim 3 or 4, wherein: the characteristic includes a time-of-flight of the acoustic signal as received by the headphone device; and the predefined threshold includes a time-of-flight threshold configured to be satisfied when the headphone device is outside the reference listening area.
6. The method of any of claims 3 to 5, wherein the acoustic signal is an inaudible signal emitted outside of a frequency range associated with human hearing capability.
7. The method of any of claims 2 to 6, wherein the presenting the 3D sound field using the headphone device is performed based on head movement tracking of the user by an inertial measurement unit within the headphone device.
8. The method of any of claims 2 to 7, further comprising: determining, during the presenting of the 3D sound field and prior to the user leaving the reference listening area, that the user is near a boundary of the reference listening area; and in response to the determining that the user is near the boundary, presenting the 3D sound field using both the array of loudspeakers and the headphone device worn by the user.
9. The method of any of claims 2 to 8, further comprising: determining, subsequent to the ceasing presenting the 3D sound field using the array of loudspeakers, that the user reenters the reference listening area; and in response to the determining that the user reenters the reference listening area, ceasing presenting the 3D sound field using the headphone device and again presenting the 3D sound field using the array of loudspeakers.
10. The method of claim 1, further comprising presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured at the second scene, the flattened sound field presented using the array of loudspeakers.
11. The method of any of claims 1 to 10, wherein the detecting that the user leaves the reference listening area includes: tracking a location of the user based on video captured by a camera of the first communication system; and determining, based on the tracking, that the location of the user is outside the reference listening area.
12. The method of claim 11, wherein the detecting that the user leaves the reference listening area is performed as a multi-modal detection based on: the tracking of the location of the user based on the video; and at least one other indication that the user is outside the reference listening area.
13. The method of any of claims 1 to 12, wherein the first communication system and the second communication system are video communication systems, the first communication system including a display screen configured to present video captured at the second scene by a camera of the second communication system.
14. The method of claim 13, wherein the display screen and the array of loudspeakers of the first communication system are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
15. A communication system comprising:
a display screen;
an array of loudspeakers configured to present 3D sound configured for a reference listening area from where a user views the display screen; and
a processor communicatively coupled to the array of loudspeakers and configured to perform a process comprising:
receiving, from an additional communication system, an audio data stream representing a 3D sound field captured by an array of microphones of the additional communication system;
presenting, using the array of loudspeakers and based on the audio data stream, the 3D sound field to the user as the user is located within the reference listening area;
detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and
in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
16. The communication system of claim 15, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
17. The communication system of claim 15, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, a flattened sound field captured by the array of microphones of the additional communication system, the flattened sound field presented using the array of loudspeakers.
18. The communication system of any of claims 15 to 17, wherein: the communication system is a first video communication system at a first scene and the additional communication system is a second video communication system at a second scene; the display screen is configured to present video captured at the second scene by a camera of the second video communication system; and the display screen and the array of loudspeakers are disposed at static locations within the first scene such that the reference listening area is a static area from where the user views the display screen.
19. A non-transitory computer-readable medium storing instructions that, when executed, cause a processor of a first communication system at a first scene to perform a process comprising:
receiving, from a second communication system at a second scene, an audio data stream representing a 3D sound field captured at the second scene by an array of microphones of the second communication system;
presenting, by the first communication system using an array of loudspeakers and based on the audio data stream, the 3D sound field to a user located within a reference listening area at the first scene;
detecting, during the presenting of the 3D sound field, that the user leaves the reference listening area; and
in response to the detecting and while the user remains outside the reference listening area, ceasing presenting the 3D sound field using the array of loudspeakers.
20. The non-transitory computer-readable medium of claim 19, wherein the process further comprises presenting, based on the audio data stream and while the presenting of the 3D sound field using the array of loudspeakers is ceased, the 3D sound field using a headphone device worn by the user.
PCT/US2024/035601 2024-06-26 2024-06-26 Immersive sound field communication using a loudspeaker array Pending WO2026005772A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2024/035601 WO2026005772A1 (en) 2024-06-26 2024-06-26 Immersive sound field communication using a loudspeaker array


Publications (1)

Publication Number Publication Date
WO2026005772A1

Family

ID=91950292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/035601 Pending WO2026005772A1 (en) 2024-06-26 2024-06-26 Immersive sound field communication using a loudspeaker array

Country Status (1)

Country Link
WO (1) WO2026005772A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180332420A1 (en) * 2017-05-09 2018-11-15 Microsoft Technology Licensing, Llc Spatial audio for three-dimensional data sets
US20210306448A1 (en) * 2020-03-25 2021-09-30 Nokia Technologies Oy Controlling audio output
US20220103963A1 (en) * 2020-09-25 2022-03-31 Apple Inc. Systems, Methods, and Graphical User Interfaces for Using Spatialized Audio During Communication Sessions
US20230396945A1 (en) * 2015-12-27 2023-12-07 Philip Scott Lyren Switching Binaural Sound

