WO2025259283A1 - Systems and method for audio processing in extended reality - Google Patents

Systems and method for audio processing in extended reality

Info

Publication number
WO2025259283A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
output audio
energy level
masking
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/033844
Other languages
French (fr)
Inventor
James Tracey
Mark Brandon HERTENSTEINER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Magic Leap Inc
Original Assignee
Magic Leap Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Magic Leap Inc filed Critical Magic Leap Inc
Priority to PCT/US2024/033844 priority Critical patent/WO2025259283A1/en
Publication of WO2025259283A1 publication Critical patent/WO2025259283A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00 - Stereophonic arrangements
    • H04R5/033 - Headphones for stereophonic communication
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K - SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K15/00 - Acoustics not otherwise provided for
    • G10K15/02 - Synthesis of acoustic waves
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/04 - Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 - Signal processing covered by H04R, not provided for in its groups
    • H04R2430/01 - Aspects of volume control, not necessarily automatic, in sound systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2460/00 - Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
    • H04R2460/09 - Non-occlusive ear tips, i.e. leaving the ear canal open, for both custom and non-custom tips
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 - Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 - General applications
    • H04R2499/15 - Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 - Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/35 - Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception using translation techniques
    • H04R25/356 - Amplitude, e.g. amplitude shift or compression

Definitions

  • An XR head-worn audio visual system may include, e.g., a virtual reality (VR) head-worn audio visual system, an augmented reality (AR) head-worn audio visual system, or a mixed reality (MR) head-worn audio visual system.
  • a VR head-worn audio visual system typically involves the presentation of digital or virtual image information without transparency to other actual real-world visual input.
  • An AR head-worn audio visual system or MR head-worn audio visual system typically involves presentation of virtual objects to a user in relation to real objects of the physical world.
  • an AR head-worn audio visual system involves the presentation of digital or virtual image information as an augmentation to visualization of the actual world around the user.
  • An MR head-worn audio visual system involves the presentation of AR image content that appears to be blocked by or is otherwise perceived to interact with objects in the real world.
  • a method for presenting an extended reality experience to a user includes a display subsystem presenting images corresponding to image data to the user.
  • the method also includes a microphone capturing input sound from an ambient acoustic environment of the user and converting the captured input sound to input audio data.
  • the method further includes an audio controller analyzing the input audio data to determine a masking level.
  • the method includes the audio controller rendering output audio data corresponding to source audio data.
  • the method includes the audio controller analyzing the output audio data to determine an energy level of the output audio data.
  • the method also includes the audio controller comparing the energy level of the output audio data with the masking level to determine a comparison outcome.
  • the method further includes the audio controller modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data.
  • the method also includes an audio subsystem presenting output sound corresponding to the modified output audio data to the user.
  • modifying the output audio data includes increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
  • Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.
  • the dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome.
  • the compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee.
  • the method may include reducing a dynamic range by adjusting the compression associated parameter.
  • the dynamic range compression may include reducing a compression threshold based on the comparison outcome.
  • the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
  • the frequency domain may have 10 bands including the band.
  • Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level.
  • the method may include calculating an auditory masking curve of the ambient acoustic environment.
  • the auditory masking curve may be critically band partitioned.
  • the auditory masking curve may describe masking in a frequency domain or a time domain.
  • the auditory masking curve may result in upward spread of masking in the output sound.
  • the method includes the audio controller, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective source audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data.
  • the method also includes the audio subsystem, for each band of the frequency domain, presenting respective output sound corresponding to the respective modified output audio data to the user. Sound corresponding to source audio data below the masking level may be inaudible to the user.
  • the masking level may be determined by analyzing the input audio data using a psychoacoustic model.
  • an extended reality system in another embodiment, includes a display subsystem configured to present images corresponding to image data to a user.
  • the system also includes a microphone configured to capture input sound from an ambient acoustic environment of the user and convert the captured input sound to input audio data.
  • the system further includes an audio controller, including a processor, and a non-transitory computer readable storage medium storing thereupon a sequence of instructions which, when executed by the processor, causes the processor to perform a set of acts.
  • the set of acts includes analyzing the input audio data to determine a masking level, rendering output audio data corresponding to source audio data, analyzing the output audio data to determine an energy level of the output audio data, comparing the energy level of the output audio data with the masking level to determine a comparison outcome, and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data.
  • the system includes an audio subsystem configured to present sound corresponding to the modified output audio data to the user.
  • modifying the output audio data comprises increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
  • Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.
  • the dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome.
  • the compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee.
  • the set of acts may further include reducing a dynamic range by adjusting the compression associated parameter.
  • the dynamic range compression may include reducing a compression threshold based on the comparison outcome.
  • the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
  • the frequency domain may have 10 bands including the band.
  • Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level.
  • the set of acts may further include calculating an auditory masking curve of the ambient acoustic environment.
  • the auditory masking curve may be critically band partitioned.
  • the auditory masking curve may describe masking in a frequency domain or a time domain.
  • the auditory masking curve may result in upward spread of masking in the output sound.
  • the set of acts further includes, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective source audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data.
  • the method also includes the audio subsystem, for each band of the frequency domain, presenting respective output sound corresponding to the respective modified output audio data to the user. Sound corresponding to source audio data below the masking level may be inaudible to the user.
  • the masking level may be determined by analyzing the input audio data using a psychoacoustic model.
  • a computer program product includes a non-transitory computer readable storage medium having stored thereupon a sequence of instructions which, when executed by a processor, causes the processor to perform a set of acts.
  • the set of acts includes analyzing input audio data to determine a masking level.
  • the set of acts also includes rendering output audio data corresponding to source audio data.
  • the set of acts further includes analyzing the output audio data to determine an energy level of the output audio data.
  • the set of acts includes comparing the energy level of the output audio data with the masking level to determine a comparison outcome.
  • the set of acts includes modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data.
  • modifying the output audio data includes increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
  • Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.
  • the dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome.
  • the compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee.
  • the set of acts may include reducing a dynamic range by adjusting the compression associated parameter.
  • the dynamic range compression may include reducing a compression threshold based on the comparison outcome.
  • the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
  • the frequency domain may have 10 bands including the band.
  • Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level.
  • the set of acts may include calculating an auditory masking curve of the ambient acoustic environment.
  • the auditory masking curve may be critically band partitioned.
  • the auditory masking curve may describe masking in a frequency domain or a time domain.
  • the auditory masking curve may result in upward spread of masking in the output sound.
  • the set of acts includes, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective source audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data.
  • Sound corresponding to source audio data below the masking level may be inaudible to the user.
  • the masking level may be determined by analyzing the input audio data using a psychoacoustic model.
  • Figure 1 illustrates a user’s view of an AR/MR/XR scene using an example AR/MR/XR system according to some embodiments.
  • Figures 2 to 5 schematically depict users using XR systems according to some embodiments.
  • Figure 6 schematically depicts a processing pipeline for multi-band compression and equalization of output audio data according to some embodiments.
  • Figure 7 is a flowchart depicting a method for presenting an extended reality experience to a user according to some embodiments.
  • Figure 8 is a flowchart depicting a method for modifying output audio data according to some embodiments.
  • Figure 9 is a flowchart depicting a method for multi-band compression and equalization of output audio data according to some embodiments.
  • Figure 10 is a block diagram schematically depicting an illustrative computing system according to some embodiments.
  • the embodiments described herein include head-worn XR systems that reduce the compression threshold and increase the gain to raise the energy level of output audio data above a masking level to generate modified output audio data to increase an intelligibility of output sound corresponding to the modified output audio data, while minimizing increases in sound pressure level and related battery consumption and potential harm to the user’s ears.
  • the embodiments depict an adaptive audio approach using a psychoacoustic masking model that exploits the psychoacoustic phenomenon of auditory masking. Auditory masking occurs when perception of one sound is affected and compromised by the presence of another sound. Auditory masking in the frequency domain is known as simultaneous masking, frequency masking or spectral masking.
  • Auditory masking in the time domain is called temporal or non-simultaneous masking.
  • the psychoacoustic model also takes advantage of what is called upward spread of masking. Upward spread of masking refers to the higher growth rate of masking for maskers lower in frequency than the signal, compared to maskers at the signal frequency.
  • XR systems disclosed herein can include a head-worn display subsystem 104 which presents computer-generated (i.e., virtual) imagery (video/image data) to a viewer/wearer/user 50.
  • the head-worn display subsystems 104 are wearable, which may advantageously provide a more realistic and immersive XR experience.
  • Various components of head-worn XR systems 100 are depicted in Figures 2 to 5.
  • speakers may be placed at a distance from the ear canal, e.g., using a bone conduction technology.
  • the speaker 106 is a nonoccluding speaker that allows sound from the ambient acoustic environment of the user 50 to enter the ear canal of the user 50.
  • the display subsystem 104 is designed to present the eyes 52 of the end user 50 with light patterns that can be comfortably perceived as augmentations to physical reality, with high-levels of image quality and three-dimensional perception, as well as being capable of presenting two-dimensional content.
  • the display subsystem 104 presents a sequence of frames at high frequency that provides the perception of a single coherent scene.
  • the display subsystem 104 employs “optical see-through” display through which the user can directly view transmitted light from real objects via transparent or semi-transparent elements.
  • the transparent/semi-transparent element often referred to as a “combiner,” superimposes light from the display over the user’s view of the real world.
  • the display subsystem 104 may include a partially transparent display 110.
  • the partially transparent display 110 may be electronically controlled.
  • the partially transparent display 110 may include dimming to modify the opacity of the partially transparent display 110 or one or more portions thereof.
  • the partially transparent display 110 may include global dimming to control opacity of the entirety of the partially transparent display 110.
  • the partially transparent display 110 is positioned in the end user’s 50 field of view between the eyes 52 of the end user 50 and an ambient real-world external environment, such that direct light from the ambient real-world external environment is transmitted through the partially transparent display 110 to the eyes 52 of the end user 50.
  • an image projection assembly provides light to the partially transparent display 110, thereby combining with the direct light from the ambient real-world external environment, and being transmitted from the partially transparent display 110 to the eyes 52 of the user 50.
  • the projection subsystem may include various light sources and spatial light modulators, and the partially transparent display 110 may be a waveguide-based display into which the light from the projection subsystem is injected to produce, e.g., images at a single optical viewing distance closer than infinity (e.g., arm’s length), images at multiple, discrete optical viewing distances or focal planes, and/or image layers stacked at multiple viewing distances or focal planes to represent volumetric 3D objects.
  • the display subsystem 104 may be monocular or binocular.
  • the head-worn XR system 100 may also include a microphone assembly 107 that captures input sound from the ambient acoustic environment of the user 50, and converts the captured input sound to electrical signals that are then delivered to an audio processor of the head-worn XR system 100.
  • a pair of microphone assemblies 107 is mounted to the pair of arms of the frame structure 102.
  • the head-worn XR system 100 may also include one or more sensors (not shown) mounted to the frame structure 102 for detecting the position and movement of the head 54 of the end user 50 and/or the eye position and inter-ocular distance of the end user 50.
  • sensors may include image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros. Many of these sensors operate on the assumption that the frame 102 on which they are affixed is in turn substantially fixed to the user’s head, eyes, and ears.
  • the head-worn XR system 100 may also include outwardly-facing cameras 660 to capture images of a field-of-view of a user.
  • the head-worn XR system 100 may also include one or more dimmers 692 (e.g., passive photochromic dimmers or active LCD dimmers), which together with the respective LOEs 690, form display subsystems 104.
  • the head-worn XR system 100 may also include a user orientation detection module.
  • the user orientation module detects the instantaneous position of the head 54 of the end user 50 (e.g., via sensors coupled to the frame 102) and may predict the position of the head 54 of the end user 50 based on position data received from the sensors.
  • Detecting the instantaneous position of the head 54 of the end user 50 facilitates determination of the specific actual object that the end user 50 is looking at, thereby providing an indication of the specific virtual object to be generated in relation to that actual object and further providing an indication of the position in which the virtual object is to be displayed.
  • the user orientation module may also track the eyes of the end user 50 based on the tracking data received from the sensors.
  • the head-worn XR system 100 may also include a control subsystem that may take any of a large variety of forms.
  • the control subsystem includes a number of controllers, for instance one or more microcontrollers, microprocessors or central processing units (CPUs), digital signal processors, graphics processing units (GPUs), other integrated circuit controllers, such as application specific integrated circuits (ASICs), display bridge chips, display controllers, programmable gate arrays (PGAs), for instance field PGAs (FPGAs), and/or programmable logic controllers (PLCs).
  • the local processing and data module 170 may be mounted in a variety of configurations, such as fixedly attached to the frame structure 102 (Figure 2), fixedly attached to a helmet or hat 56 (Figure 3), removably attached to the torso 58 of the end user 50 (Figure 4), or removably attached to the hip 60 of the end user 50 in a belt-coupling style configuration (Figure 5).
  • the head-worn XR system 100 may also include a remote processing module 174 and remote data repository 176 operatively coupled, such as by a wired lead or wireless connectivity 178, 180, to the local processing and data module 170 and the local display bridge 142, such that these remote modules 174, 176 are operatively coupled to each other and available as resources to the local processing and data module 170 and the local display bridge 142.
  • the local processing and data module 170 and the local display bridge 142 may each include a power-efficient processor or controller, as well as digital memory, such as flash memory, both of which may be utilized to assist in the processing, caching, and storage of data captured from the sensors and/or acquired and/or processed using the remote processing module 174 and/or remote data repository 176, possibly for passage to the display subsystem 104 after such processing or retrieval.
  • the remote processing module 174 may include one or more relatively powerful processors or controllers configured to analyze and process data and/or image information.
  • the remote data repository 176 may include a relatively large-scale digital data storage facility, which may be available through the internet or other networking configuration in a “cloud” resource configuration. In some embodiments, all data is stored and all computation is performed in the local processing and data module 170 and the local display bridge 142, allowing fully autonomous use from any remote modules.
  • the couplings 172, 178, 180 between the various components described above may include one or more wired interfaces or ports for providing wired or optical communications, or one or more wireless interfaces or ports, such as via RF, microwave, and IR for providing wireless communications.
  • all communications may be wired, while in other implementations all communications may be wireless.
  • the choice of wired and wireless communications may be different from that illustrated in Figures 2 to 5. Thus, the particular choice of wired or wireless communications should not be considered limiting.
  • the user orientation module is contained in the local processing and data module 170 and/or the local display bridge 142, while the CPU and GPU are contained in the remote processing module.
  • the CPU, GPU, or portions thereof may be contained in the local processing and data module 170 and/or the local display bridge 142.
  • the 3D database can be associated with the remote data repository 176 or disposed locally.
  • Figure 6 schematically depicts a processing pipeline 600 for multi-band compression and equalization of output audio data according to some embodiments.
  • the left and right microphone assemblies capture respective left and right input sound from an ambient acoustic environment of the user of a head-worn XR system and convert the captured left and right input sound to respective left and right input audio data.
  • left and right output audio data is rendered from source audio data.
  • the source audio data may be provided by the head-worn XR system for a particular XR scenario.
  • the left and right input audio data is sent to an energy calculation module of an audio controller.
  • the rendered left and right output audio data is sent to the energy calculation module of the audio controller.
  • the rendered left and right output audio data is also sent to a multiband compressor 650.
  • the masking calculation module of the audio controller analyzes the left and right input audio data and the rendered left and right output audio data to determine respective masking levels for the rendered left and right output audio data.
  • the respective masking levels are determined based on the psychoacoustic model. Analyzing the left and right input audio data and the rendered left and right output audio data may include determining respective left and right masking levels below which left and right output sound corresponding to the rendered left and right output audio data would be inaudible/unintelligible (over the ambient input sound corresponding to the left and right input audio data).
  • a masking level for the microphones may be selected such that a difference between the input audio energy level and the masking level would be at least a threshold amount below the output audio energy level for each audio channel (e.g., left and right) of each band of a frequency domain.
  • the multiband compressor 650 may be a 10-band compressor with 10 bands in its frequency domain.
  • the energy calculation module of the audio controller compares the output audio energy level for each audio channel (e.g., left and right) of each band of the frequency domain (e.g., 10 bands) with the masking level (e.g., by subtraction) to determine an amount by which the band compression threshold is reduced for each channel and each band of the frequency domain. Reducing the band compression threshold reduces the dynamic range. Simultaneously, the band gain is increased to a sufficient level to raise the output energy level for each audio channel of each band of the frequency domain above the calculated microphone masking level (see the sketch following this list).
  • the modified output audio data is generated by the processing pipeline 600 and sent to an audio subsystem to provide sound corresponding to the modified output audio data for each audio channel.
  • Figure 7 depicts a method 710 for presenting an XR experience to a user according to some embodiments.
  • a display subsystem of an XR system presents images corresponding to image data to the user.
  • the images may be configured to generate a three-dimensional XR experience for the user.
  • the display subsystem may be similar to the display subsystem 104 depicted in Figures 2 to 5.
  • a microphone of the XR system captures input sound from an ambient acoustic environment of the user and converts the captured input sound to input audio data.
  • the microphone may be similar to the microphone 107 depicted in Figures 2 to 5.
  • an audio controller generates modified output audio data based on the input audio data corresponding to the input sound captured by the microphone.
  • the audio controller may be similar to the audio controllers described herein and may have a processing pipeline similar to the processing pipeline 600 depicted in Figure 6. Details regarding the generation of the modified output audio data are depicted in Figure 8 and described herein.
  • an audio subsystem of the XR system presents output sound corresponding to the modified output audio data to the user.
  • the audio subsystem may include speakers 106 such as those depicted in Figures 2 to 5.
  • Figure 8 depicts a method 810 for multi-band compression and equalization of output audio data by an audio processor to generate modified output audio data according to some embodiments.
  • the audio processor analyzes input audio data generated by the microphone at step 712 depicted in Figure 7 to determine a masking level for the microphone based on the psychoacoustic model. This analysis occurs at 630 in the processing pipeline 600 depicted in Figure 6. Analyzing the input audio data may include determining a masking level below which output sound corresponding to rendered output audio data would be inaudible/unintelligible (over the ambient input sound corresponding to the input audio data). For example, a masking level for a microphone may be selected such that a difference between the input audio energy level and the masking level would be at least a threshold amount below the output audio energy level.
  • the audio processor renders output audio data from, and corresponding to, the source audio data. This rendering occurs at 660L and 660R in the processing pipeline 600 depicted in Figure 6.
  • the source audio data may be provided by the head-worn XR system for a particular XR scenario.
  • the audio processor analyzes the output audio data to determine an energy level of the output audio data.
  • the audio processor compares the energy level of the output audio data with the masking level to determine a comparison outcome. This comparison occurs at 640 in the processing pipeline 600 depicted in Figure 6.
  • the comparison outcome may be an amount by which a band compression threshold is reduced.
  • the audio processor modifies the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level.
  • Figure 9 depicts a method 910 for modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data according to some embodiments.
  • the audio processor increases a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
  • the audio processor reduces the compression threshold (e.g., by applying a dynamic range compression based on the comparison outcome).
  • the dynamic range compression comprises adjusting a compression associated parameter based on the comparison outcome.
  • the compression associated parameter may include a compression threshold (as shown in Figure 9), a compression ratio, or a response curve knee. Reducing the compression threshold maintains a maximum energy level of the output audio data above the raised energy level.
  • While the output audio data modification method 910 depicted in Figure 9 includes both reducing the compression threshold and raising an energy level, other output audio data modification methods include reducing the compression threshold or raising an energy level independently, without the other function.
  • output audio data modification methods may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above a raised energy level.
  • an audio subsystem of the XR system presents output sound corresponding to the modified output audio data to the user at 716 depicted in Figure 7.
  • the audio subsystem may include speakers 106 such as those depicted in Figures 2 to 5.
  • the processing pipeline 600 and methods 710, 810, 910 depicted in Figures 6 to 9 depict embodiments of adaptive audio processing including auditory masking based on a psychoacoustic masking model.
  • the auditory masking depicted in Figures 6 to 9 may be in the frequency domain (e.g., simultaneous masking, frequency masking or spectral masking) and/or the time domain (e.g., temporal or non-simultaneous masking).
  • the auditory masking depicted in Figures 6 to 9 may also result in upward spread of masking, with higher growth rate of masking lower in frequency compared to the signal frequency of the input audio data.
  • the audio controller may modify output audio data for each band of a frequency domain. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level. Sound corresponding to source audio data below the masking level may be inaudible to the user.
  • FIG. 10 is a block diagram of an illustrative computing system 1000 suitable for implementing an embodiment of the present disclosure.
  • Computer system 1000 includes a bus 1006 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as processor 1007, system memory 1008 (e.g., RAM), static storage device 1009 (e.g., ROM), disk drive 1010 (e.g., magnetic or optical), communication interface 1014 (e.g., modem or Ethernet card), display 1011 (e.g., CRT or LCD), input device 1012 (e.g., keyboard), and cursor control.
  • computer system 1000 performs specific operations by processor 1007 executing one or more sequences of one or more instructions contained in system memory 1008. Such instructions may be read into system memory 1008 from another computer readable/usable medium, such as static storage device 1009 or disk drive 1010.
  • hard-wired circuitry may be used in place of or in combination with software instructions to implement the disclosure.
  • embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software.
  • the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.
  • Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 1010.
  • Volatile media includes dynamic memory, such as system memory 1008.
  • Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM (e.g., NAND flash, NOR flash), any other memory chip or cartridge, or any other medium from which a computer can read.
  • execution of the sequences of instructions to practice the disclosure is performed by a single computer system 1000.
  • two or more computer systems 1000 coupled by communication link 1015 may perform the sequence of instructions required to practice the disclosure in coordination with one another.
  • Computer system 1000 may transmit and receive messages, data, and instructions, including program code (i.e., application code), through communication link 1015 and communication interface 1014.
  • Received program code may be executed by processor 1007 as it is received, and/or stored in disk drive 1010, or other non-volatile storage for later execution.
  • Database 1032 in storage medium 1031 may be used to store data accessible by system 1000 via data interface 1033.
  • While audio processing systems and methods are described herein as implemented in various head-worn audio visual systems, the audio processing systems and methods described herein may be implemented in various other audio visual systems.
  • the devices and methods described herein can advantageously be at least partially implemented using, for example, computer software, hardware, firmware, or any combination of software, hardware, and firmware.
  • Software modules can include computer executable code, stored in a computer’s memory, for performing the functions described herein.
  • computer-executable code is executed by one or more general purpose computers.
  • any module that can be implemented using software to be executed on a general purpose computer can also be implemented using a different combination of hardware, software, or firmware.
  • such a module can be implemented completely in hardware using a combination of integrated circuits.
  • such a module can be implemented completely or partially using specialized computers designed to perform the particular functions described herein rather than by general purpose computers.
  • a module can be implemented completely or partially using specialized computers designed to perform the particular functions described herein rather than by general purpose computers.
  • methods are described that are, or could be, at least in part carried out by computer software, it should be understood that such methods can be provided on non-transitory computer-readable media that, when read by a computer or other processing device, cause it to carry out the method.
  • the various processors and other electronic components described herein are suitable for use with any optical system for projecting light.
  • the various processors and other electronic components described herein are also suitable for use with any audio system for receiving voice commands.
  • the disclosure includes methods that may be performed using the subject devices.
  • the methods may include the act of providing such a suitable device. Such provision may be performed by the end user.
  • the “providing” act merely requires the end user obtain, access, approach, position, set-up, activate, power-up or otherwise act to provide the requisite device in the subject method.
  • Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as in the recited order of events.
  • any optional feature of the variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein.
  • Reference to a singular item includes the possibility that there are plural of the same items present. More specifically, as used herein and in claims associated hereto, the singular forms “a,” “an,” “said,” and “the” include plural referents unless specifically stated otherwise.
  • use of the articles allows for “at least one” of the subject item in the description above as well as claims associated with this disclosure. It is further noted that such claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
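
The multi-band behavior summarized in the bullets above for processing pipeline 600 can be sketched in a few lines of code. This is a hypothetical illustration only: the function names, the log-spaced 10-band split, the 3 dB margin, and the use of raw ambient band energy as a stand-in for the psychoacoustically modeled masking level are assumptions made for the sketch, not details taken from the disclosure.

    import numpy as np

    NUM_BANDS = 10  # the disclosure gives a 10-band compressor as one example

    def band_energies_db(frame, sample_rate):
        # Energy per band, in dB, of a short mono frame (hypothetical band split).
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        edges = np.logspace(np.log10(50.0), np.log10(sample_rate / 2.0), NUM_BANDS + 1)
        energies = np.empty(NUM_BANDS)
        for b in range(NUM_BANDS):
            in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
            energies[b] = 10.0 * np.log10(spectrum[in_band].sum() + 1e-12)
        return energies

    def per_channel_adjustments(mic_frame, rendered_frame, sample_rate, margin_db=3.0):
        # Masking level per band (here crudely approximated by the ambient band
        # energy; the disclosure instead derives it from a psychoacoustic model).
        masking_db = band_energies_db(mic_frame, sample_rate)
        output_db = band_energies_db(rendered_frame, sample_rate)
        # Comparison outcome: how far each band sits below the masking level.
        deficit_db = np.maximum(masking_db + margin_db - output_db, 0.0)
        band_gain_db = deficit_db            # raise band energy above masking
        threshold_reduction_db = deficit_db  # reduce band compression threshold
        return band_gain_db, threshold_reduction_db

Running per_channel_adjustments once per audio frame for the left and right channels mirrors the flow from the masking analysis (630) and energy comparison (640) into the multiband compressor 650.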

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method for presenting an extended reality experience to a user includes a display subsystem presenting images corresponding to image data to the user. The method also includes a microphone capturing input sound from an ambient acoustic environment and converting the captured input sound to input audio data. The method further includes an audio controller analyzing the input audio data to determine a masking level, rendering output audio data corresponding to source audio data, analyzing the output audio data to determine an energy level of the output audio data, comparing the energy level with the masking level to determine a comparison outcome, and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data. The method also includes an audio subsystem presenting output sound corresponding to the modified output audio data to the user.

Description

SYSTEMS AND METHOD FOR AUDIO PROCESSING IN EXTENDED REALITY
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is related to U.S. Patent Application Serial Number 15/423,415 filed on February 2, 2017 (under attorney docket number ML.20043.00) and issued as U.S. Patent Number 10,536,783 on January 14, 2020; U.S. Patent Application Serial Number 15/666,210 filed on August 1, 2017 (under attorney docket number ML.20041.00) and issued as U.S. Patent Number 10,390,165 on August 20, 2019; U.S. Patent Application Serial Number 15/703,946 filed on September 13, 2017 (under attorney docket number ML.20068.00) and issued as U.S. Patent Number 10,448,189 on October 15, 2019; and PCT Application Number PCT/US2022/032229 filed on June 3, 2022 (under attorney docket number ML-1162WO). The contents of the patent applications and patents mentioned herein are hereby expressly and fully incorporated by reference in their entirety, as though set forth in full. Described in the aforementioned incorporated patent applications and patents are various embodiments of extended reality systems and methods including audio systems and methods. Described herein are further embodiments of extended reality systems and methods including audio processing.
COPYRIGHT NOTICE
[0002] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD OF THE INVENTION
[0003] The present disclosure generally relates to head-worn audio visual systems and devices with audio processing, such as those that can be used in extended reality environments.
BACKGROUND
[0004] Modern computing and display technologies have facilitated the development of head-worn audio visual systems for so-called extended reality (XR) experiences that create an environment for a user in which some or all of the environment is generated by presenting digitally reproduced images (e.g., virtual objects) to a user in a manner where they seem to be, or may be perceived as, real. XR head-worn audio visual systems may be useful for many applications, spanning the fields of scientific visualization, medical training, engineering design and prototyping, tele-manipulation and tele-presence, and personal entertainment. An XR head-worn audio visual system may include, e.g., a virtual reality (VR) head-worn audio visual system, an augmented reality (AR) head-worn audio visual system, or a mixed reality (MR) head-worn audio visual system. A VR head-worn audio visual system typically involves the presentation of digital or virtual image information without transparency to other actual real-world visual input. An AR head-worn audio visual system or MR head-worn audio visual system typically involves presentation of virtual objects to a user in relation to real objects of the physical world. In particular, an AR head-worn audio visual system involves the presentation of digital or virtual image information as an augmentation to visualization of the actual world around the user. An MR head-worn audio visual system involves the presentation of AR image content that appears to be blocked by or is otherwise perceived to interact with objects in the real world.
[0005] Figure 1 depicts an example AR/MR/XR scene 1 where a user sees a real- world park setting 6 featuring people, trees, buildings in the background, and a concrete platform 20. In addition to these items, computer-generated imagery is also presented to the user. The computer-generated imagery can include, for example, a robot statue 10 standing upon the real-world platform 20, and a cartoon-like avatar character 12 flying by which seems to be a personification of a bumble bee, even though these elements 12, 10 are not actually present in the real-world environment.
[0006] Various optical systems generate images at various depths for displaying VR, AR, MR, or XR scenarios. Some such optical systems are described in U.S. Utility Patent Application Serial No. 14/555,585 filed on November 27, 2014 (under attorney docket number ML.20011.00), the contents of which are hereby expressly and fully incorporated by reference in their entirety, as though set forth in full. Other such optical systems for displaying AR/MR/XR experiences are described in U.S. Utility Patent Application Serial No. 14/738,877 filed on June 13, 2015 (under attorney docket number ML.20019.00), the contents of which are hereby expressly and fully incorporated by reference in their entirety, as though set forth in full.
[0007] In order to enhance the VR/AR/MR/XR experience for the user, sound generated by real sound sources and/or sound generated by virtual sound sources may be conveyed to the user via speakers incorporated into or otherwise connected to the head-worn audio visual system. Whether the sound is generated from a real sound source or a virtual sound source, it is desirable for the user to preferentially receive/hear the generated sound compared to extraneous sound from the ambient acoustic environment of the user, so that the user can hear sounds from an object or objects in the VR/AR/MR/XR experience. Extraneous sound from the ambient acoustic environment of the user especially interferes with sound generated as part of a VR/AR/MR/XR experience in systems with non-occluding speakers.
[0008] While occluding speakers physically reduce extraneous sound, they do so non-selectively and to a degree that is non-variable. Another current solution that preferentially presents generated sound over extraneous sound increases the playback volume of the generated sound. However, increasing the playback volume may expose the user to potentially harmful playback levels and is inefficient in that increased playback volume will reduce battery life.
[0009] Thus, there is a need to preferentially convey to an end user sound generated by a head-worn audio visual system as part of a VR/AR/MR/XR experience compared to extraneous sound from the ambient acoustic environment of the user while addressing the system requirements described herein.
SUMMARY
[0010] In one embodiment, a method for presenting an extended reality experience to a user includes a display subsystem presenting images corresponding to image data to the user. The method also includes a microphone capturing input sound from an ambient acoustic environment of the user and converting the captured input sound to input audio data. The method further includes an audio controller analyzing the input audio data to determine a masking level. Moreover, the method includes the audio controller rendering output audio data corresponding to source audio data. In addition, the method includes the audio controller analyzing the output audio data to determine an energy level of the output audio data. The method also includes the audio controller comparing the energy level of the output audio data with the masking level to determine a comparison outcome. The method further includes the audio controller modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data. The method also includes an audio subsystem presenting output sound corresponding to the modified output audio data to the user.
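
As a plain-code restatement of the acts recited in this paragraph, the sketch below strings them together in order; the subsystem objects and their method names are hypothetical stand-ins for illustration, not an API from the disclosure.

    def present_xr_frame(display, microphone, audio_controller, audio_subsystem,
                         image_data, source_audio):
        # Display subsystem presents images corresponding to image data.
        display.present(image_data)
        # Microphone captures ambient input sound as input audio data.
        input_audio = microphone.capture()
        # Audio controller: masking level, rendering, energy level, comparison.
        masking_level = audio_controller.analyze_masking(input_audio)
        output_audio = audio_controller.render(source_audio)
        energy_level = audio_controller.analyze_energy(output_audio)
        outcome = energy_level - masking_level  # comparison outcome
        # Modify the output audio data based on the outcome to raise its energy.
        modified = audio_controller.modify(output_audio, outcome)
        # Audio subsystem presents the corresponding output sound.
        audio_subsystem.present(modified)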
[0011] In one or more embodiments, modifying the output audio data includes increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level. Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level. The dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome. The compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee. The method may include reducing a dynamic range by adjusting the compression associated parameter. The dynamic range compression may include reducing a compression threshold based on the comparison outcome.
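
A static soft-knee compressor curve makes the compression associated parameters named here concrete. The gain computer below is the standard textbook form, offered as an illustration; the particular threshold, ratio, and knee values are arbitrary assumptions, and the disclosure does not prescribe this exact curve.

    def compressor_gain_db(level_db, threshold_db=-24.0, ratio=4.0, knee_db=6.0):
        # Gain (dB, <= 0) applied at a given input level (dB). Lowering
        # threshold_db, raising ratio, or widening knee_db each reduce
        # dynamic range.
        over = level_db - threshold_db
        if over <= -knee_db / 2.0:
            return 0.0                      # below the knee: unity gain
        if over < knee_db / 2.0:            # inside the knee: quadratic blend
            return -(1.0 - 1.0 / ratio) * (over + knee_db / 2.0) ** 2 / (2.0 * knee_db)
        return -(1.0 - 1.0 / ratio) * over  # above the knee: full ratio

Lowering threshold_db based on the comparison outcome pushes more of the signal into the compressed region, reducing dynamic range while the added band gain keeps the output energy above the masking level.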
[0012] In one or more embodiments, the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain. The frequency domain may have 10 bands including the band. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level. The method may include calculating an auditory masking curve of the ambient acoustic environment. The auditory masking curve may be critically band partitioned. The auditory masking curve may describe masking in a frequency domain or a time domain. The auditory masking curve may result in upward spread of masking in the output sound.
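
The auditory masking curve and upward spread of masking mentioned in this paragraph can be approximated with a simple per-band spreading function. The slopes used below (roughly 27 dB per band toward lower bands, 12 dB per band toward higher bands) are illustrative values in the spirit of classic psychoacoustic models, not figures from the disclosure.

    import numpy as np

    def masking_curve_db(band_energies_db, lower_slope_db=27.0, upper_slope_db=12.0):
        # Each band's masking level is the strongest contribution spread from
        # any masker band. The shallower upward slope means a low-frequency
        # masker keeps masking higher bands: the upward spread of masking.
        n = len(band_energies_db)
        curve = np.full(n, -np.inf)
        for masker in range(n):
            for band in range(n):
                dist = band - masker
                slope = upper_slope_db if dist >= 0 else lower_slope_db
                curve[band] = max(curve[band], band_energies_db[masker] - slope * abs(dist))
        return curve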
[0013] In one or more embodiments, the method includes the audio controller, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data. The method also includes the audio subsystem, for each band of the frequency domain, presenting respective output sound corresponding to the respective modified output audio data to the user. Sound corresponding to source audio data below the masking level may be inaudible to the user. The masking level may be determined by analyzing the input audio data using a psychoacoustic model.
[0014] In another embodiment, an extended reality system includes a display subsystem configured to present images corresponding to image data to a user. The system also includes a microphone configured to capture input sound from an ambient acoustic environment of the user and convert the captured input sound to input audio data. The system further includes an audio controller, including a processor, and a non-transitory computer readable storage medium storing thereupon a sequence of instructions which, when executed by the processor, causes the processor to perform a set of acts. The set of acts includes analyzing the input audio data to determine a masking level, rendering output audio data corresponding to source audio data, analyzing the output audio data to determine an energy level of the output audio data, comparing the energy level of the output audio data with the masking level to determine a comparison outcome, and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data. Moreover, the system includes an audio subsystem configured to present sound corresponding to the modified output audio data to the user.
[0015] In one or more embodiments, modifying the output audio data comprises increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level. Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level. The dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome. The compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee. The set of acts may further include reducing a dynamic range by adjusting the compression associated parameter. The dynamic range compression may include reducing a compression threshold based on the comparison outcome.

[0016] In one or more embodiments, the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain. The frequency domain may have 10 bands including the band. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level. The set of acts may further include calculating an auditory masking curve of the ambient acoustic environment. The auditory masking curve may be critically band partitioned. The auditory masking curve may describe masking in a frequency domain or a time domain. The auditory masking curve may result in upward spread of masking in the output sound.
[0017] In one or more embodiments, the set of acts further includes, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data. The audio subsystem, for each band of the frequency domain, presents respective output sound corresponding to the respective modified output audio data to the user. Sound corresponding to source audio data below the masking level may be inaudible to the user. The masking level may be determined by analyzing the input audio data using a psychoacoustic model.
[0018] In yet another embodiment, a computer program product includes a non- transitory computer readable storage medium having stored thereupon a sequence of instructions which, when executed by a processor, causes the processor to perform a set of acts. The set of acts includes analyzing input audio data to determine a masking level. The set of acts also includes rendering output audio data corresponding to source audio data. The set of acts further includes analyzing the output audio data to determine an energy level of the output audio data. Moreover, the set of acts includes comparing the energy level of the output audio data with the masking level to determine a comparison outcome. In addition, the set of acts includes modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data.
[0019] In one or more embodiments, modifying the output audio data includes increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level. Modifying the output audio data may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level. The dynamic range compression may include adjusting a compression associated parameter based on the comparison outcome. The compression associated parameter may include a compression threshold, a compression ratio, or a response curve knee. The set of acts may include reducing a dynamic range by adjusting the compression associated parameter. The dynamic range compression may include reducing a compression threshold based on the comparison outcome.
[0020] In one or more embodiments, the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain. The frequency domain may have 10 bands including the band. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level. The set of acts may include calculating an auditory masking curve of the ambient acoustic environment. The auditory masking curve may be critically band partitioned. The auditory masking curve may describe masking in a frequency domain or a time domain. The auditory masking curve may result in upward spread of masking in the output sound.
[0021] In one or more embodiments, the set of acts includes, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data. Sound corresponding to source audio data below the masking level may be inaudible to the user. The masking level may be determined by analyzing the input audio data using a psychoacoustic model.

[0022] Additional and other objects, features, and advantages of the disclosure are described in the detailed description, figures, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] The drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present disclosure. The drawings illustrate the design and utility of preferred embodiments of the present disclosure, in which similar elements are referred to by common reference numerals. In order to better appreciate how the above-recited and other advantages and objects of the present disclosures are obtained, a more particular description of the present disclosures briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the accompanying drawings. Understanding that these drawings depict only typical embodiments of the disclosure and are not therefore to be considered limiting of its scope, the disclosure will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
[0024] Figure 1 illustrates a user’s view of an AR/MR/XR scene using an example AR/MR/XR system according to some embodiments.
[0025] Figures 2 to 5 schematically depict users using XR systems according to some embodiments.
[0026] Figure 6 schematically depicts a processing pipeline for multi-band compression and equalization of output audio data according to some embodiments.
[0027] Figure 7 is a flowchart depicting a method for presenting an extended reality experience to a user according to some embodiments.

[0028] Figure 8 is a flowchart depicting a method for multi-band compression and equalization of output audio data according to some embodiments.
[0029] Figure 9 is a flowchart depicting a method for modifying output audio data according to some embodiments.
[0030] Figure 10 is a block diagram schematically depicting an illustrative computing system according to some embodiments.
DETAILED DESCRIPTION
Summary of Problems and Solutions
[0031] As described above, there is a need to preferentially convey to an end user sound generated by a head-worn audio visual system as part of a VR/AR/MR/XR experience over extraneous sound from the ambient acoustic environment of the user, while addressing the system requirements described herein.
[0032] Current solutions to the problem of preferentially conveying generated sound include occlusive speakers, which have the drawback of isolating the user from the ambient audio environment, and increased playback volume, which may expose the user to potentially harmful playback levels and is inefficient in that increased playback volume shortens battery life.
[0033] The embodiments described herein include head-worn XR systems that reduce the compression threshold and increase the gain to raise the energy level of output audio data above a masking level to generate modified output audio data, increasing the intelligibility of output sound corresponding to the modified output audio data while minimizing increases in sound pressure level, related battery consumption, and potential harm to the user’s ears. The embodiments describe an adaptive audio approach using a psychoacoustic masking model that exploits the psychoacoustic phenomenon of auditory masking. Auditory masking occurs when perception of one sound is degraded by the presence of another sound. Auditory masking in the frequency domain is known as simultaneous masking, frequency masking, or spectral masking. Auditory masking in the time domain is called temporal or non-simultaneous masking. The psychoacoustic model also takes advantage of what is called upward spread of masking. Upward spread of masking refers to the higher growth rate of masking for maskers lower in frequency than the signal, compared to maskers at the signal frequency.
Illustrative Head-Worn XR System with Audio Processing
[0034] The description that follows relates to head-worn audio video systems and methods to be used in XR systems. However, it is to be understood that while the disclosure lends itself well to applications in XR systems, the disclosure, in its broadest aspects, is not so limited. For example, the disclosed systems and methods can be applied to general audio systems, as well as head-worn hearing aid devices that do not utilize displays for presenting an XR experience to the user. Thus, while often described herein in terms of XR systems, the teachings should not be limited to such systems or such uses.
[0035] XR systems disclosed herein can include a head-worn display subsystem 104 which presents computer-generated (i.e., virtual) imagery (video/image data) to a viewer/wearer/user 50. The head-worn display subsystems 104 are wearable, which may advantageously provide a more realistic and immersive XR experience. Various components of head-worn XR systems 100 are depicted in Figures 2 to 5. The head-worn XR system 100 includes a frame structure 102 worn by an end user 50, a display subsystem 104 carried by the frame structure 102, such that the display subsystem 104 is positioned in front of the eyes 52 of the end user 50, and a speaker 106 carried by the frame structure 102, such that the speaker 106 is positioned adjacent the ear canal of the end user 50 (optionally, another speaker (not shown) is positioned adjacent the other ear canal of the end user 50 to provide for stereo/shapeable sound control). Although the speaker 106 is described as being positioned adjacent the ear canal, other types of speakers that are not located adjacent the ear canal can be used to convey sound to the end user 50. For example, speakers may be placed at a distance from the ear canal, e.g., using a bone conduction technology. In any case, the speaker 106 is a nonoccluding speaker that allows sound from the ambient acoustic environment of the user 50 to enter the ear canal of the user 50.
[0036] The display subsystem 104 is designed to present the eyes 52 of the end user 50 with light patterns that can be comfortably perceived as augmentations to physical reality, with high-levels of image quality and three-dimensional perception, as well as being capable of presenting two-dimensional content. The display subsystem 104 presents a sequence of frames at high frequency that provides the perception of a single coherent scene.
[0037] In the illustrated embodiments, the display subsystem 104 employs “optical see-through” display through which the user can directly view transmitted light from real objects via transparent or semi-transparent elements. The transparent/semi-transparent element, often referred to as a “combiner,” superimposes light from the display over the user’s view of the real world. To this end, the display subsystem 104 may include a partially transparent display 110. In some embodiments, the partially transparent display 110 may be electronically controlled. In some embodiments, the partially transparent display 110 may include dimming to modify the opacity of the partially transparent display 110 or one or more portions thereof. In some embodiments, the partially transparent display 110 may include global dimming to control opacity of the entirety of the partially transparent display 110. The partially transparent display 110 is positioned in the end user’s 50 field of view between the eyes 52 of the end user 50 and an ambient real-world external environment, such that direct light from the ambient real-world external environment is transmitted through the partially transparent display 110 to the eyes 52 of the end user 50.
[0038] In the illustrated embodiments, an image projection assembly provides light to the partially transparent display 110, thereby combining with the direct light from the ambient real-world external environment, and being transmitted from the partially transparent display 110 to the eyes 52 of the user 50. The projection subsystem may include various light sources and spatial light modulators, and the partially transparent display 110 may be a waveguide-based display into which the light from the projection subsystem is injected to produce, e.g., images at a single optical viewing distance closer than infinity (e.g., arm’s length), images at multiple, discrete optical viewing distances or focal planes, and/or image layers stacked at multiple viewing distances or focal planes to represent volumetric 3D objects. These layers in the light field may be stacked closely enough together to appear continuous to the human visual system (i.e., one layer is within the cone of confusion of an adjacent layer). Additionally or alternatively, picture elements (i.e., sub-images) may be blended across two or more layers to increase perceived continuity of transition between layers in the light field, even if those layers are more sparsely stacked (i.e., one layer is outside the cone of confusion of an adjacent layer). The display subsystem 104 may be monocular or binocular.
[0039] The head-worn XR system 100 may also include a microphone assembly 107 that captures input sound from the ambient acoustic environment of the user 50, and converts the captured input sound to electrical signals that are then delivered to an audio processor of the head-worn XR system 100. In the illustrated embodiment, a pair of microphone assemblies 107 is mounted to the pair of arms of the frame structure 102.
[0040] The head-worn XR system 100 may also include one or more sensors (not shown) mounted to the frame structure 102 for detecting the position and movement of the head 54 of the end user 50 and/or the eye position and inter-ocular distance of the end user 50. Such sensors may include image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros. Many of these sensors operate on the assumption that the frame 102 on which they are affixed is in turn substantially fixed to the user’s head, eyes, and ears.
[0041] The head-worn XR system 100 may also include outwardly-facing cameras 660 to capture images of a field-of-view of a user. The head-worn XR system 100 may also include one or more dimmers 692 (e.g., passive photochromic dimmers or active LCD dimmers), which, together with the respective LOEs 690, form display subsystems 104.

[0042] The head-worn XR system 100 may also include a user orientation detection module. The user orientation module detects the instantaneous position of the head 54 of the end user 50 (e.g., via sensors coupled to the frame 102) and may predict the position of the head 54 of the end user 50 based on position data received from the sensors. Detecting the instantaneous position of the head 54 of the end user 50 facilitates determination of the specific actual object that the end user 50 is looking at, thereby providing an indication of the specific virtual object to be generated in relation to that actual object and further providing an indication of the position in which the virtual object is to be displayed. The user orientation module may also track the eyes of the end user 50 based on the tracking data received from the sensors.
[0043] The head-worn XR system 100 may also include a control subsystem that may take any of a large variety of forms. The control subsystem includes a number of controllers, for instance, one or more microcontrollers, microprocessors or central processing units (CPUs), digital signal processors, graphics processing units (GPUs), other integrated circuit controllers, such as application specific integrated circuits (ASICs), display bridge chips, display controllers, programmable gate arrays (PGAs), for instance field PGAs (FPGAs), and/or programmable logic controllers (PLCs).
[0044] The control subsystem of head-worn XR system 100 may include a central processing unit (CPU), a graphics processing unit (GPU), one or more frame buffers, and a three-dimensional database for storing three-dimensional scene data. The CPU may control overall operation, while the GPU may render frames (i.e., translating a three-dimensional scene into a two-dimensional image) from the three-dimensional data stored in the three-dimensional database and store these frames in the frame buffers. One or more additional integrated circuits may control the reading into and/or reading out of frames from the frame buffers and operation of the image projection assembly of the display subsystem 104. The control subsystem of the head-worn XR system 100 may include an audio controller configured to render and process audio data to provide output sound to enhance the XR experience. In some embodiments, the audio controller may be implemented in the CPU.
[0045] The various processing components of the head-worn XR system 100 may be physically contained in a distributed subsystem. For example, as illustrated in Figures 2 to 5, the head-worn XR system 100 may include a local processing and data module 170 operatively coupled, such as by a wired lead or wireless connectivity 172, to a local display bridge (not shown), the display subsystem 104, and sensors. The local processing and data module 170 may be mounted in a variety of configurations, such as fixedly attached to the frame structure 102 (Figure 2), fixedly attached to a helmet or hat 56 (Figure 3), removably attached to the torso 58 of the end user 50 (Figure 4), or removably attached to the hip 60 of the end user 50 in a belt-coupling style configuration (Figure 5). The head-worn XR system 100 may also include a remote processing module 174 and remote data repository 176 operatively coupled, such as by a wired lead or wireless connectivity 178, 180, to the local processing and data module 170 and the local display bridge 142, such that these remote modules 174, 176 are operatively coupled to each other and available as resources to the local processing and data module 170 and the local display bridge 142.
[0046] The local processing and data module 170 and the local display bridge 142 may each include a power-efficient processor or controller, as well as digital memory, such as flash memory, both of which may be utilized to assist in the processing, caching, and storage of data captured from the sensors and/or acquired and/or processed using the remote processing module 174 and/or remote data repository 176, possibly for passage to the display subsystem 104 after such processing or retrieval. The remote processing module 174 may include one or more relatively powerful processors or controllers configured to analyze and process data and/or image information. The remote data repository 176 may include a relatively large-scale digital data storage facility, which may be available through the internet or other networking configuration in a “cloud” resource configuration. In some embodiments, all data is stored and all computation is performed in the local processing and data module 170 and the local display bridge 142, allowing fully autonomous use from any remote modules.
[0047] The couplings 172, 178, 180 between the various components described above may include one or more wired interfaces or ports for providing wired or optical communications, or one or more wireless interfaces or ports, such as via RF, microwave, and IR for providing wireless communications. In some implementations, all communications may be wired, while in other implementations all communications may be wireless. In still further implementations, the choice of wired and wireless communications may be different from that illustrated in Figures 2 to 5. Thus, the particular choice of wired or wireless communications should not be considered limiting.

[0048] In some embodiments, the user orientation module is contained in the local processing and data module 170 and/or the local display bridge 142, while CPU and GPU are contained in the remote processing module. In alternative embodiments, the CPU, GPU, or portions thereof may be contained in the local processing and data module 170 and/or the local display bridge 142. The 3D database can be associated with the remote data repository 176 or disposed locally.
[0049] Figure 6 schematically depicts a processing pipeline 600 for multi-band compression and equalization of output audio data according to some embodiments. At 610L and 610R, the left and right microphone assemblies capture respective left and right input sound from an ambient acoustic environment of the user of a head-worn XR system and convert the captured left and right input sound to respective left and right input audio data. At 660L and 660R, left and right output audio data is rendered from source audio data. The source audio data may be provided by the head-worn XR system for a particular XR scenario.
[0050] At 620, the left and right input audio data is sent to an energy calculation module of an audio controller. At 670, the rendered left and right output audio data is sent to the energy calculation module of the audio controller. The rendered left and right output audio data is also sent to a multiband compressor 650.
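For illustration only, the per-band energy analysis performed by the energy calculation module might resemble the following Python sketch. The band edges, frame handling, and function name below are assumptions made for this example; the embodiments do not specify a particular band layout.

```python
import numpy as np

# Hypothetical band edges (Hz) for a 10-band analysis; the actual band
# layout of the multiband compressor 650 is not specified here.
BAND_EDGES_HZ = [22, 44, 88, 177, 354, 707, 1414, 2828, 5657, 11314, 22050]

def band_energies_db(frame: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return the energy (dB) of one audio frame in each of the 10 bands."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))  # windowed FFT
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energies = []
    for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        band_power = power[(freqs >= lo) & (freqs < hi)].sum()
        energies.append(10.0 * np.log10(band_power + 1e-12))  # avoid log(0)
    return np.array(energies)
```

The same routine could be applied per channel to the input audio data at 620 and to the rendered output audio data at 670.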
[0051] At 630, the masking calculation module of the audio controller analyzes the left and right input audio data and the rendered left and right output audio data to determine respective masking levels for the rendered left and right output audio data. The respective masking levels are determined based on the psychoacoustic model. Analyzing the left and right input audio data and the rendered left and right output audio data may include determining respective left and right masking levels below which left and right output sound corresponding to the rendered left and right output audio data would be inaudible/unintelligible (over the ambient input sound corresponding to the left and right input audio data). For example, a masking level for the microphones may be selected such that a difference between the input audio energy level and the masking level would be at least a threshold amount below the output audio energy level for each audio channel (e.g., left and right) of each band of a frequency domain. For instance, the multiband compressor 650 may be a 10-band compressor with 10 bands in its frequency domain.

[0052] At 640, the energy calculation module of the audio controller compares the output audio energy level for each audio channel (e.g., left and right) of each band of the frequency domain (e.g., 10 bands) with the masking level (e.g., by subtraction) to determine an amount by which the band compression threshold is reduced for each channel and each band of the frequency domain. Reducing the band compression threshold reduces the dynamic range. Simultaneously, the band gain is increased to a level sufficient to raise the output energy level for each audio channel of each band of the frequency domain above the calculated microphone masking level.
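A minimal sketch of this per-band comparison follows, under the assumption that the comparison outcome is simply the decibel shortfall of the output energy relative to the masking level plus a headroom margin; the constant, the names, and the one-to-one mapping from shortfall to threshold reduction are illustrative, not mandated by the embodiments.

```python
import numpy as np

HEADROOM_DB = 6.0  # assumed margin by which output should clear the mask

def per_band_adjustments(output_energy_db, masking_level_db):
    """For one channel, compute per-band gain boosts and compression
    threshold reductions so the output clears the masking level."""
    out_db = np.asarray(output_energy_db, dtype=float)
    mask_db = np.asarray(masking_level_db, dtype=float)
    # Positive where the rendered output would otherwise be masked.
    shortfall_db = np.maximum(0.0, mask_db + HEADROOM_DB - out_db)
    gain_boost_db = shortfall_db     # raise the band energy above the mask
    threshold_cut_db = shortfall_db  # lower the band compression threshold,
                                     # reducing dynamic range correspondingly
    return gain_boost_db, threshold_cut_db
```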
[0053] The modified output audio data is generated by the processing pipeline 600 and sent to an audio subsystem to provide sound corresponding to the modified output audio data for each audio channel.
[0054] Figure 7 depicts a method 710 for presenting an XR experience to a user according to some embodiments. At 712, a display subsystem of an XR system presents images corresponding to image data to the user. The images may be configured to generate a three-dimensional XR experience for the user. The display subsystem may be similar to the display subsystem 104 depicted in Figures 2 to 5.
[0055] At 714, a microphone of the XR system captures input sound from an ambient acoustic environment of the user and converts the captured input sound to input audio data. The microphone may be similar to the microphone 107 depicted in Figures 2 to 5.

[0056] At 810, an audio controller generates modified output audio data based on the input audio data corresponding to the input sound captured by the microphone. The audio controller may be similar to the audio controllers described herein and may have a processing pipeline similar to the processing pipeline 600 depicted in Figure 6. Details regarding the generation of the modified output audio data are depicted in Figure 8 and described herein.
[0057] At 716, an audio subsystem of the XR system presents output sound corresponding to the modified output audio data to the user. The audio subsystem may include speakers 106 such as those depicted in Figures 2 to 5.
[0058] Figure 8 depicts a method 810 for multi-band compression and equalization of output audio data by an audio processor to generate modified output audio data according to some embodiments. At 812, the audio processor analyzes input audio data generated by the microphone at step 714 depicted in Figure 7 to determine a masking level for the microphone based on the psychoacoustic model. This analysis occurs at 630 in the processing pipeline 600 depicted in Figure 6. Analyzing the input audio data may include determining a masking level below which output sound corresponding to rendered output audio data would be inaudible/unintelligible (over the ambient input sound corresponding to the input audio data). For example, a masking level for a microphone may be selected such that a difference between the input audio energy level and the masking level would be at least a threshold amount below the output audio energy level.
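As a crude stand-in for a psychoacoustic model, the masking level in each band could be approximated by offsetting the input band energy by a fixed signal-to-mask ratio, as sketched below; a real model would derive the offset from tonality, band index, and inter-band spreading, so the constant here is purely illustrative.

```python
import numpy as np

SIGNAL_TO_MASK_OFFSET_DB = 10.0  # illustrative; a psychoacoustic model would
                                 # vary this with tonality and band index

def masking_levels_db(input_band_energy_db):
    """Stand-in for the masking calculation at 630: output sound at or
    below the returned per-band level would be masked by ambient input."""
    return np.asarray(input_band_energy_db, dtype=float) - SIGNAL_TO_MASK_OFFSET_DB
```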
[0059] At 814, the audio processor renders output audio data from, and corresponding to, source audio data. This rendering occurs at 660L and 660R in the processing pipeline 600 depicted in Figure 6. The source audio data may be provided by the head-worn XR system for a particular XR scenario.
[0060] At 816, the audio processor analyzes the output audio data to determine an energy level of the output audio data.

[0061] At 818, the audio processor compares the energy level of the output audio data with the masking level to determine a comparison outcome. This comparison occurs at 640 in the processing pipeline 600 depicted in Figure 6. The comparison outcome may be an amount by which a band compression threshold is reduced.
[0062] At 910, the audio processor modifies the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level.
[0063] Figure 9 depicts a method 910 for modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data according to some embodiments. At 912, the audio processor increases a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
[0064] At 914, which may be simultaneous with 912, the audio processor reduces the compression threshold (e.g., by applying a dynamic range compression based on the comparison outcome). The dynamic range compression comprises adjusting a compression associated parameter based on the comparison outcome. The compression associated parameter may include a compression threshold (as shown in Figure 9), a compression ratio, or a response curve knee. Reducing the compression threshold maintains a maximum energy level of the output audio data above the raised energy level.

[0065] While the output audio data modification method 910 depicted in Figure 9 includes reducing the compression threshold and raising an energy level, in other embodiments output audio data modification methods include reducing the compression threshold or raising an energy level independently without the other function. In other embodiments, output audio data modification methods may include applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above a raised energy level.
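To make the roles of the compression threshold, ratio, and knee concrete, the following sketch implements the standard static soft-knee compression curve from the audio effects literature; it is a textbook form offered for illustration, not the specific compressor of the embodiments.

```python
def compressor_gain_db(level_db: float, threshold_db: float,
                       ratio: float, knee_db: float) -> float:
    """Static soft-knee compression curve: gain (dB) to apply at a given
    input level, for per-band parameters chosen from the comparison outcome."""
    overshoot = level_db - threshold_db
    if knee_db > 0.0 and 2.0 * abs(overshoot) <= knee_db:
        # Inside the knee: quadratic blend between unity and compressed slopes.
        out_db = level_db + ((1.0 / ratio - 1.0)
                             * (overshoot + knee_db / 2.0) ** 2 / (2.0 * knee_db))
    elif overshoot > 0.0:
        out_db = threshold_db + overshoot / ratio  # above threshold: compressed
    else:
        out_db = level_db                          # below threshold: unity gain
    return out_db - level_db
```

Lowering threshold_db moves more of a band's signal into the compressed region, which is how reducing the compression threshold at 914 shrinks the dynamic range while the simultaneous gain increase at 912 lifts the band above the masking level.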
[0066] As described above, an audio subsystem of the XR system presents output sound corresponding to the modified output audio data to the user at 716 depicted in Figure 7. The audio subsystem may include speakers 106 such as those depicted in Figures 2 to 5.
[0067] The processing pipeline 600 and methods 710, 810, 910 depicted in Figures 6 to 9 depict embodiments of adaptive audio processing including auditory masking based on a psychoacoustic masking model. The auditory masking depicted in Figures 6 to 9 may be in the frequency domain (e.g., simultaneous masking, frequency masking, or spectral masking) and/or the time domain (e.g., temporal or non-simultaneous masking). The auditory masking depicted in Figures 6 to 9 may also result in upward spread of masking, with a higher growth rate of masking for maskers lower in frequency than the signal frequency of the input audio data.
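One classical approximation of how masking spreads between bands is the Schroeder spreading function, sketched below. It is offered only to illustrate why maskers below the signal frequency mask more strongly than maskers above it; the embodiments do not assert this particular model.

```python
import math

def schroeder_spread_db(delta_bark: float) -> float:
    """Schroeder spreading function (dB) at a signal-minus-masker distance
    in Bark. The shallow decay for positive distances is the upward spread
    of masking: a masker below the signal in frequency masks it strongly."""
    d = delta_bark + 0.474
    return 15.81 + 7.5 * d - 17.5 * math.sqrt(1.0 + d * d)

# A masker 2 Bark below a signal attenuates it far less than one 2 Bark above:
print(round(schroeder_spread_db(+2.0), 1))  # ~ -12.3 dB (masker below signal)
print(round(schroeder_spread_db(-2.0), 1))  # ~ -27.6 dB (masker above signal)
```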
[0068] While the methods depicted in Figures 7 to 9 generally depict modifying output audio data, in other embodiments the audio controller may modify output audio data for each band of a frequency domain. Modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data may increase an intelligibility of the output sound while minimizing increases in sound pressure level. Sound corresponding to source audio data below the masking level may be inaudible to the user.
System Architecture Overview
[0069] Figure 10 is a block diagram of an illustrative computing system 1000 suitable for implementing an embodiment of the present disclosure. Computer system 1000 includes a bus 1006 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as processor 1007, system memory 1008 (e.g., RAM), static storage device 1009 (e.g., ROM), disk drive 1010 (e.g., magnetic or optical), communication interface 1014 (e.g., modem or Ethernet card), display 1011 (e.g., CRT or LCD), input device 1012 (e.g., keyboard), and cursor control.
[0070] According to one embodiment of the disclosure, computer system 1000 performs specific operations by processor 1007 executing one or more sequences of one or more instructions contained in system memory 1008. Such instructions may be read into system memory 1008 from another computer readable/usable medium, such as static storage device 1009 or disk drive 1010. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software. In one embodiment, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.
[0071] The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to processor 1007 for execution. Such a medium may take many forms, including but not limited to, nonvolatile media and volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 1010. Volatile media includes dynamic memory, such as system memory 1008.
[0072] Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM (e.g., NAND flash, NOR flash), any other memory chip or cartridge, or any other medium from which a computer can read.
[0073] In an embodiment of the disclosure, execution of the sequences of instructions to practice the disclosure is performed by a single computer system 1000. According to other embodiments of the disclosure, two or more computer systems 1000 coupled by communication link 1015 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions required to practice the disclosure in coordination with one another.
[0074] Computer system 1000 may transmit and receive messages, data, and instructions, including program code, i.e., application code, through communication link 1015 and communication interface 1014. Received program code may be executed by processor 1007 as it is received, and/or stored in disk drive 1010, or other non-volatile storage for later execution. Database 1032 in storage medium 1031 may be used to store data accessible by system 1000 via data interface 1033.
[0075] While the audio processing systems and methods are described herein as implemented in various head-worn audio visual systems, they may also be implemented in various other audio visual systems.
[0076] Certain aspects, advantages and features of the disclosure have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the disclosure. Thus, the disclosure may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.
[0077] Embodiments have been described in connection with the accompanying drawings. However, it should be understood that the figures are not drawn to scale. Distances, angles, etc. are merely illustrative and do not necessarily bear an exact relationship to actual dimensions and layout of the devices illustrated. In addition, the foregoing embodiments have been described at a level of detail to allow one of ordinary skill in the art to make and use the devices, systems, methods, and the like described herein. A wide variety of variation is possible. Components, elements, and/or steps may be altered, added, removed, or rearranged.
[0078] The devices and methods described herein can advantageously be at least partially implemented using, for example, computer software, hardware, firmware, or any combination of software, hardware, and firmware. Software modules can include computer executable code, stored in a computer’s memory, for performing the functions described herein. In some embodiments, computer-executable code is executed by one or more general purpose computers. However, a skilled artisan will appreciate, in light of this disclosure, that any module that can be implemented using software to be executed on a general purpose computer can also be implemented using a different combination of hardware, software, or firmware. For example, such a module can be implemented completely in hardware using a combination of integrated circuits. Alternatively or additionally, such a module can be implemented completely or partially using specialized computers designed to perform the particular functions described herein rather than by general purpose computers. In addition, where methods are described that are, or could be, at least in part carried out by computer software, it should be understood that such methods can be provided on non-transitory computer-readable media that, when read by a computer or other processing device, cause it to carry out the method.
[0079] While certain embodiments have been explicitly described, other embodiments will become apparent to those of ordinary skill in the art based on this disclosure.
[0080] The various processors and other electronic components described herein are suitable for use with any optical system for projecting light. The various processors and other electronic components described herein are also suitable for use with any audio system for receiving voice commands.
[0081] Various exemplary embodiments of the disclosure are described herein. Reference is made to these examples in a non-limiting sense. They are provided to illustrate more broadly applicable aspects of the disclosure. Various changes may be made to the disclosure described and equivalents may be substituted without departing from the true spirit and scope of the disclosure. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the present disclosure. Further, as will be appreciated by those with skill in the art, each of the individual variations described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present disclosure. All such modifications are intended to be within the scope of claims associated with this disclosure.
[0082] The disclosure includes methods that may be performed using the subject devices. The methods may include the act of providing such a suitable device. Such provision may be performed by the end user. In other words, the “providing” act merely requires the end user obtain, access, approach, position, set-up, activate, power-up or otherwise act to provide the requisite device in the subject method. Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as in the recited order of events.
[0083] Exemplary aspects of the disclosure, together with details regarding material selection and manufacture have been set forth above. As for other details of the present disclosure, these may be appreciated in connection with the above-referenced patents and publications as well as generally known or appreciated by those with skill in the art. The same may hold true with respect to method-based aspects of the disclosure in terms of additional acts as commonly or logically employed.
[0084] In addition, though the disclosure has been described in reference to several examples optionally incorporating various features, the disclosure is not to be limited to that which is described or indicated as contemplated with respect to each variation of the disclosure. Various changes may be made to the disclosure described and equivalents (whether recited herein or not included for the sake of some brevity) may be substituted without departing from the true spirit and scope of the disclosure. In addition, where a range of values is provided, it is understood that every intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the disclosure.
[0085] Also, it is contemplated that any optional feature of the variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein. Reference to a singular item includes the possibility that there are plural of the same items present. More specifically, as used herein and in claims associated hereto, the singular forms “a,” “an,” “said,” and “the” include plural referents unless specifically stated otherwise. In other words, use of the articles allows for “at least one” of the subject item in the description above as well as claims associated with this disclosure. It is further noted that such claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
[0086] Without the use of such exclusive terminology, the term “comprising” in claims associated with this disclosure shall allow for the inclusion of any additional element, irrespective of whether a given number of elements are enumerated in such claims, or the addition of a feature could be regarded as transforming the nature of an element set forth in such claims. Except as specifically defined herein, all technical and scientific terms used herein are to be given as broad a commonly understood meaning as possible while maintaining claim validity.

[0087] The breadth of the present disclosure is not to be limited to the examples provided and/or the subject specification, but rather only by the scope of claim language associated with this disclosure.
[0088] In the foregoing specification, the disclosure has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense.

Claims

What is claimed is:
1. A method for presenting an extended reality experience to a user, comprising: a display subsystem presenting images corresponding to image data to the user; a microphone capturing input sound from an ambient acoustic environment of the user and converting the captured input sound to input audio data; an audio controller: analyzing the input audio data to determine a masking level, rendering output audio data corresponding to source audio data, analyzing the output audio data to determine an energy level of the output audio data, comparing the energy level of the output audio data with the masking level to determine a comparison outcome, and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data; and an audio subsystem presenting output sound corresponding to the modified output audio data to the user.
2. The method of claim 1, wherein modifying the output audio data comprises increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
3. The method of claim 2, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.
4. The method of claim 3, wherein the dynamic range compression comprises adjusting a compression associated parameter based on the comparison outcome, and wherein the compression associated parameter comprises a compression threshold, a compression ratio, or a response curve knee.
5. The method of claim 4, further comprising reducing a dynamic range by adjusting the compression associated parameter.
6. The method of claim 4, wherein the dynamic range compression comprises reducing a compression threshold based on the comparison outcome.
7. The method of claim 3, wherein the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
8. The method of claim 7, wherein the frequency domain has 10 bands including the band.
9. The method of claim 1, wherein modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data increases an intelligibility of the output sound while minimizing increases in sound pressure level.

10. The method of claim 1, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above a raised energy level.

11. The method of claim 1, further comprising calculating an auditory masking curve of the ambient acoustic environment.

12. The method of claim 11, wherein the auditory masking curve is critically band partitioned.

13. The method of claim 11, wherein the auditory masking curve describes masking in a frequency domain.

14. The method of claim 11, wherein the auditory masking curve describes masking in a time domain.

15. The method of claim 11, wherein the auditory masking curve results in upward spread of masking in the output sound.

16. The method of claim 1, further comprising: the audio controller, for each band of a frequency domain, analyzing respective input audio data to determine a respective masking level, rendering respective output audio data corresponding to respective audio data, analyzing the respective output audio data to determine an energy level of the respective output audio data, comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome, and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data; and the audio subsystem, for each band of the frequency domain, presenting respective output sound corresponding to the respective modified output audio data to the user.

17. The method of claim 1, wherein sound corresponding to source audio data below the masking level is inaudible to the user.

18. The method of claim 1, wherein the masking level is determined by analyzing the input audio data using a psychoacoustic model.
19. An extended reality system, comprising: a display subsystem configured to present images corresponding to image data to a user; a microphone configured to capture input sound from an ambient acoustic environment of the user and convert the captured input sound to input audio data; an audio controller, comprising a processor, and a non-transitory computer readable storage medium storing thereupon a sequence of instructions which, when executed by the processor, causes the processor to perform a set of acts, the set of acts comprising analyzing the input audio data to determine a masking level, rendering output audio data corresponding to source audio data, analyzing the output audio data to determine an energy level of the output audio data, comparing the energy level of the output audio data with the masking level to determine a comparison outcome, and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data; and an audio subsystem configured to present sound corresponding to the modified output audio data to the user.
20. The system of claim 19, wherein modifying the output audio data comprises increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
21. The system of claim 20, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.

22. The system of claim 21, wherein the dynamic range compression comprises adjusting a compression associated parameter based on the comparison outcome, and wherein the compression associated parameter comprises a compression threshold, a compression ratio, or a response curve knee.
23. The system of claim 22, wherein the set of acts further comprises reducing a dynamic range by adjusting the compression associated parameter.
24. The system of claim 22, wherein the dynamic range compression comprises reducing a compression threshold based on the comparison outcome.
25. The system of claim 21, wherein the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
26. The system of claim 25, wherein the frequency domain has 10 bands including the band.
27. The system of claim 19, wherein modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data increases an intelligibility of the output sound while minimizing increases in sound pressure level.
28. The system of claim 19, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above a raised energy level.
29. The system of claim 19, wherein the set of acts further comprises calculating an auditory masking curve of the ambient acoustic environment.
30. The system of claim 29, wherein the auditory masking curve is critically band partitioned.
31. The system of claim 29, wherein the auditory masking curve describes masking in a frequency domain.
32. The system of claim 29, wherein the auditory masking curve describes masking in a time domain.
33. The system of claim 29, wherein the auditory masking curve results in upward spread of masking in the output sound.
34. The system of claim 19, wherein the set of acts further comprises, for each band of a frequency domain: analyzing respective input audio data to determine a respective masking level; rendering respective output audio data corresponding to respective audio data; analyzing the respective output audio data to determine an energy level of the respective output audio data; comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome; and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data, and wherein the audio subsystem is configured to, for each band of the frequency domain, present respective output sound corresponding to the respective modified output audio data to the user.
35. The system of claim 19, wherein sound corresponding to source audio data below the masking level is inaudible to the user.
36. The system of claim 19, wherein the masking level is determined by analyzing the input audio data using a psychoacoustic model.
37. A computer program product comprising a non-transitory computer readable storage medium having stored thereupon a sequence of instructions which, when executed by a processor, causes the processor to perform a set of acts, the set of acts comprising: analyzing input audio data to determine a masking level; rendering output audio data corresponding to source audio data; analyzing the output audio data to determine an energy level of the output audio data; comparing the energy level of the output audio data with the masking level to determine a comparison outcome; and modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data.
38. The computer program product of claim 37, wherein modifying the output audio data comprises increasing a gain to raise the energy level of the output audio data above the masking level to generate a raised energy level.
39. The computer program product of claim 38, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above the raised energy level.
40. The computer program product of claim 39, wherein the dynamic range compression comprises adjusting a compression associated parameter based on the comparison outcome, and wherein the compression associated parameter comprises a compression threshold, a compression ratio, or a response curve knee.
41. The computer program product of claim 40, wherein the set of acts comprises reducing a dynamic range by adjusting the compression associated parameter.
42. The computer program product of claim 40, wherein the dynamic range compression comprises reducing a compression threshold based on the comparison outcome.
43. The computer program product of claim 39, wherein the masking level, the energy level of the output audio data, the dynamic range compression, and the gain are each specific to a band of a frequency domain.
44. The computer program product of claim 43, wherein the frequency domain has 10 bands including the band.
45. The computer program product of claim 37, wherein modifying the output audio data based on the comparison outcome to raise the energy level of the output audio data to generate modified output audio data increases an intelligibility of the output sound while minimizing increases in sound pressure level.
46. The computer program product of claim 37, wherein modifying the output audio data comprises applying a dynamic range compression based on the comparison outcome to maintain a maximum energy level of the output audio data above a raised energy level.
47. The computer program product of claim 37, wherein the set of acts comprises calculating an auditory masking curve of an ambient acoustic environment.
48. The computer program product of claim 47, wherein the auditory masking curve is critically band partitioned.
49. The computer program product of claim 47, wherein the auditory masking curve describes masking in a frequency domain.
50. The computer program product of claim 47, wherein the auditory masking curve describes masking in a time domain.
51. The computer program product of claim 47, wherein the auditory masking curve results in upward spread of masking in the output sound.
52. The computer program product of claim 37, wherein the set of acts comprises, for each band of a frequency domain: analyzing respective input audio data to determine a respective masking level; rendering respective output audio data corresponding to respective audio data; analyzing the respective output audio data to determine an energy level of the respective output audio data; comparing the energy level of the respective output audio data with the respective masking level to determine a respective comparison outcome; and modifying the respective output audio data based on the respective comparison outcome to raise the energy level of the respective output audio data to generate respective modified output audio data.
53. The computer program product of claim 37, wherein sound corresponding to source audio data below the masking level is inaudible to the user.
54. The computer program product of claim 37, wherein the masking level is determined by analyzing the input audio data using a psychoacoustic model.
PCT/US2024/033844: Systems and method for audio processing in extended reality. Filed 2024-06-13, priority date 2024-06-13. Status: pending. Published as WO2025259283A1 (en).

Priority Applications (1)

PCT/US2024/033844 (WO2025259283A1, en): priority date 2024-06-13, filing date 2024-06-13. Title: Systems and method for audio processing in extended reality.


Publications (1)

WO2025259283A1, published 2025-12-18.

Family

ID=98051293

Family Applications (1)

PCT/US2024/033844 (WO2025259283A1, en, pending): priority date 2024-06-13, filing date 2024-06-13. Title: Systems and method for audio processing in extended reality.

Country Status (1)

WO: WO2025259283A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
US10638248B1 * (Facebook Technologies, LLC): Generating a modified audio experience for an audio system. Priority 2019-01-29, published 2020-04-28.
US20220415299A1 * (Nureva, Inc.): System for dynamically adjusting a soundmask signal based on realtime ambient noise parameters while maintaining echo canceller calibration performance. Priority 2021-06-25, published 2022-12-29.
WO2023064870A1 * (Magic Leap, Inc.): Voice processing for mixed reality. Priority 2021-10-15, published 2023-04-20.



Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 24943621; country of ref document: EP; kind code of ref document: A1.