
HK1190481B - Realistic occlusion for a head mounted augmented reality display - Google Patents

Realistic occlusion for a head mounted augmented reality display

Info

Publication number
HK1190481B
HK1190481B (application HK14103605.5A)
Authority
HK
Hong Kong
Prior art keywords
occlusion
data
virtual
real
virtual object
Application number
HK14103605.5A
Other languages
Chinese (zh)
Other versions
HK1190481A (en)
Inventor
K.A. Geisner
B.J. Mount
S.G. Latta
D.J. McCulloch
K.D. Lee
B.J. Sugden
J.N. Margolis
K.S. Perez
S.M. Small
M.J. Finocchio
R.L. Crocco, Jr.
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1190481A
Publication of HK1190481B


Description

Realistic occlusion for head mounted, augmented reality displays
Technical Field
The invention relates to a technique for enabling a head-mounted, augmented reality display device system to display realistic occlusion.
Background
Augmented reality, also known as mixed reality, is a technology that allows virtual images to be mixed with a user's view of the real world. In addition to making physical properties (e.g., shape, color, size, texture) of virtual objects realistic in a display, it is also desirable to realistically display the position and movement of these virtual objects relative to real objects. For example, it is desirable that, in a user field of view provided by a head mounted display device, a virtual object block another object (real or virtual) from view, and be blocked from view by it, just as a real object would.
Disclosure of Invention
The technology described herein provides realistic occlusion for head mounted, augmented reality display device systems. For a virtual object and a real object in particular, a spatial occlusion relationship identifies one or more portions of an object in the user field of view that are partially or fully occluded from view. Either the real object or the virtual object may be the occluding object or the occluded object. The occluding object at least partially occludes the occluded object from view. In the case of partial occlusion, there is at least one partial occlusion interface between the occluding object and the occluded object. In the spatial relationship, a partial occlusion interface is the intersection where the boundary of an occluding portion of the occluding object is adjacent to an unoccluded (e.g., unblocked) portion of the occluded object. For example, dashed lines 708, 710, and 712 in FIG. 7B are each an example of a partial occlusion interface between virtual dolphin 702₂ and real tree 716₂.
In addition to partial occlusion interfaces, models according to a level of detail can also be generated for adaptive (conforming) occlusion interfaces in which at least a portion of the boundary data of a virtual object conforms to at least a portion of the boundary data of a real object. The adapted occlusion may be a full occlusion or a partial occlusion.
To balance realistic display of occlusions against overall responsiveness in updating the display field of view, the level of detail of a model (e.g., a geometric model) representing the occlusion interface is determined based on one or more criteria, such as depth distance from the display device, display size, and proximity to the point of gaze. Some embodiments also include realistic, three-dimensional audio occlusion of an occluded object (real or virtual) based on the physical properties of the occluding object.
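By way of illustration only, the following sketch shows one way such level of detail criteria could be combined into a selection. The tiers, thresholds, weights, and names are assumptions made for this example rather than values taken from the embodiments described herein.

```python
from dataclasses import dataclass

# Illustrative level-of-detail tiers, from coarsest to finest model of the
# occlusion interface (names are assumptions for this sketch).
LOD_BOUNDING_SHAPE = 0   # predefined bounding geometry (cf. FIG. 7A)
LOD_COARSE_FIT = 1       # geometry fitting, first accuracy criterion (cf. FIG. 7B)
LOD_FINE_FIT = 2         # geometry fitting, higher accuracy criterion (cf. FIG. 7C)

@dataclass
class InterfaceMetrics:
    depth_m: float          # depth distance of the interface from the display
    display_size_px: float  # approximate on-display size of the interface
    gaze_angle_deg: float   # angular offset from the user's point of gaze

def select_level_of_detail(m: InterfaceMetrics) -> int:
    """Pick a modeling level of detail for an occlusion interface.

    Far, small, or peripheral interfaces get a coarse model; near, large
    interfaces close to the point of gaze get a fine model. Thresholds are
    arbitrary example values.
    """
    if m.depth_m > 10.0 or m.display_size_px < 20:
        return LOD_BOUNDING_SHAPE
    if m.gaze_angle_deg < 5.0 and m.depth_m < 3.0:
        return LOD_FINE_FIT
    return LOD_COARSE_FIT

if __name__ == "__main__":
    metrics = InterfaceMetrics(depth_m=2.0, display_size_px=150, gaze_angle_deg=2.0)
    print(select_level_of_detail(metrics))  # -> 2 (fine fit)
```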
The technology provides one or more embodiments of a method for a head mounted, augmented reality display device system to display realistic occlusion between a real object and a virtual object. A spatial occlusion relationship between an occluding object and an occluded object, which include a real object and a virtual object, is determined to exist based on overlapping three-dimensional (3D) spatial positions of the objects in a 3D mapping of at least a user field of view of the display device system. In response to identifying the spatial occlusion relationship as a partial spatial occlusion between the real object and the virtual object, object boundary data for the occluding portion of the occluding object in the partial occlusion is retrieved. A level of detail for a model representing the partial occlusion interface is determined based on level of detail criteria. The display device system, alone or with the assistance of another computer, generates a model of the partial occlusion interface in accordance with the determined level of detail based on the retrieved object boundary data. A modified version of the boundary data of the virtual object adjacent to the unoccluded portion of the real object is generated based on the model, so that the generated adjacent boundary data has a shape based on the model. The display device displays the unoccluded portion of the virtual object in accordance with the modified version of the boundary data for the virtual object.
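The following sketch outlines how the steps summarized above could be sequenced in code. It is not the claimed method; every helper is an injected, hypothetical stand-in, since the actual implementations are device- and application-specific.

```python
from typing import Any, Callable

def render_partial_occlusion(
    occluding_obj: Any,
    occluded_virtual_obj: Any,
    fov_map: Any,
    display: Any,
    positions_overlap: Callable[[Any, Any, Any], bool],
    get_boundary_data: Callable[[Any, Any], Any],
    pick_level_of_detail: Callable[[Any], int],
    model_interface: Callable[[Any, int], Any],
    conform_virtual_boundary: Callable[[Any, Any], Any],
) -> None:
    """Orchestrate the steps summarized above; helpers are hypothetical."""
    # 1. Spatial occlusion: overlapping 3D spatial positions in the 3D mapping
    #    of at least the user field of view.
    if not positions_overlap(occluding_obj, occluded_virtual_obj, fov_map):
        return
    # 2. Retrieve object boundary data for the occluding portion.
    boundary = get_boundary_data(occluding_obj, fov_map)
    # 3. Choose a level of detail for the partial occlusion interface model.
    lod = pick_level_of_detail(fov_map)
    # 4. Generate the interface model at that level of detail.
    interface_model = model_interface(boundary, lod)
    # 5. Conform the virtual object's adjacent boundary data to the model's
    #    shape and display only the unoccluded portion of the virtual object.
    display.show(conform_virtual_boundary(occluded_virtual_obj, interface_model))
```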
The technology provides one or more embodiments of a see-through, augmented reality display device system for providing realistic occlusion. A see-through, augmented reality display device system includes a see-through, augmented reality display having a user field of view and supported by a support structure of the see-through, augmented reality display device. At least one camera for capturing image data and depth data of real objects in the user field of view is also supported by the support structure. One or more software controlled processors are communicatively coupled to the at least one camera for receiving image and depth data including a user field of view. One or more software controlled processors determine a spatial occlusion relationship between an occluding object and an occluded object based on the image and depth data. The occluding object and the occluded object include a virtual object and a real object. One or more software controlled processors are communicatively coupled to the see-through, augmented reality display, and the one or more processors cause the see-through display to represent a spatial occlusion relationship in the display by modifying the display of the virtual object. In some embodiments, the one or more processors cause the see-through display to represent the spatial occlusion relationship in the display by determining a level of detail for generating a model of an occlusion interface between a real object and a virtual object based on level of detail criteria. A modified version of the object boundary data may be generated for the virtual object based on the generated model, and the see-through, augmented reality display may display the virtual object based on the modified version of the object boundary data for the virtual object.
The technology provides one or more embodiments of one or more processor-readable storage devices having instructions encoded thereon which cause one or more processors to perform a method for a head mounted, augmented reality display device system to provide realistic audiovisual occlusion between a real object and a virtual object. The method includes determining, in an environment of the head mounted, augmented reality display device, a spatial occlusion relationship between a virtual object and a real object based on a three-dimensional mapping of the environment. It is determined whether an audio occlusion relationship exists between the virtual object and the real object, and if so, the audio data of the occluded object is modified based on one or more physical properties associated with the occluding object. One or more earphones of the display device are caused to output the modified audio data.
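As a rough illustration of modifying audio data based on physical properties of an occluding object, the sketch below attenuates and low-pass filters an occluded source according to an assumed material table. The materials, coefficients, and filter choice are assumptions for this sketch, not the sound occlusion models of the embodiments.

```python
import numpy as np

# Illustrative per-material transmission properties for audio occlusion.
# The attenuation and smoothing values are invented for this sketch.
MATERIAL_AUDIO_PROPS = {
    "drywall":  {"attenuation": 0.5,  "smoothing": 0.6},
    "glass":    {"attenuation": 0.7,  "smoothing": 0.3},
    "concrete": {"attenuation": 0.15, "smoothing": 0.85},
}

def occlude_audio(samples: np.ndarray, material: str) -> np.ndarray:
    """Modify audio of an occluded object based on the occluding material.

    Applies a gain reduction plus a one-pole low-pass filter, a crude stand-in
    for a sound occlusion model associated with physical properties of the
    occluding object.
    """
    props = MATERIAL_AUDIO_PROPS.get(material, {"attenuation": 0.5, "smoothing": 0.5})
    out = np.empty_like(samples, dtype=float)
    prev = 0.0
    a = props["smoothing"]
    for i, s in enumerate(samples):
        prev = a * prev + (1.0 - a) * s   # low-pass: muffles high frequencies
        out[i] = prev * props["attenuation"]
    return out

if __name__ == "__main__":
    tone = np.sin(np.linspace(0, 200 * np.pi, 4410))  # short test tone
    muffled = occlude_audio(tone, "concrete")
    print(abs(muffled).max())
```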
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
FIG. 1A is a block diagram depicting example components of one embodiment of a see-through, augmented reality display device system.
Fig. 1B is a block diagram depicting example components of another embodiment of a see-through, augmented reality display device system.
Fig. 2A is a side view of a temple of a frame in an embodiment of a transparent, augmented reality display device embodied as eyewear that provides support for hardware and software components.
FIG. 2B is a top view of an embodiment of a display optical system of a see-through, near-eye, augmented display device.
FIG. 2C is a block diagram of one embodiment of a computing system that may be used to implement a network accessible computing system.
FIG. 3A is a block diagram of a system from a software perspective for a head mounted, augmented reality display device system to provide realistic occlusion between a real object and a virtual object.
Fig. 3B shows an example of a reference object data set.
FIG. 3C illustrates some examples of data fields in an object physical property dataset.
FIG. 4A illustrates an example of spatial occlusion resulting in audio occlusion of a virtual object by a real object.
FIG. 4B illustrates an example of spatial occlusion resulting in audio occlusion of a real object by a virtual object.
FIG. 5A is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to display realistic partial occlusion between a real object and a virtual object.
FIG. 5B is a flow diagram of an implementation example for determining a spatial occlusion relationship between virtual and real objects in a user field of view of a head mounted, augmented reality display device based on 3D spatial locations of the objects.
FIG. 5C is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to display a realistic conforming occlusion interface between real objects occluded by a conforming virtual object.
FIG. 6A is a flow diagram of an implementation example for determining a level of detail for representing an occlusion interface based on level of detail criteria including a depth position of the occlusion interface.
FIG. 6B is a flow diagram of an implementation example for determining a level of detail for representing an occlusion interface based on level of detail criteria including a display size of the occlusion interface.
FIG. 6C is a flow diagram of an implementation example for determining a level of detail for representing an occlusion interface based on level of detail criteria and a gaze priority value.
FIG. 6D is a flow diagram of an implementation example for determining a level of detail using a speed of an interface as a basis.
FIG. 7A illustrates an example of a level of detail of at least a portion of a boundary using a predefined bounding geometry.
FIG. 7B illustrates an example of a level of detail using geometry fitting with a first accuracy criterion.
FIG. 7C illustrates an example of a level of detail using geometry fitting with a second accuracy criterion indicating a higher modeled level of detail.
FIG. 7D illustrates an example of a level of detail using a bounding volume as boundary data for at least a real object.
FIG. 8A shows an example of a partial occlusion interface modeled as a triangular leg of the virtual object in FIG. 7A.
FIG. 8B illustrates an example of a partial occlusion interface modeled by geometric fitting with a first accuracy criterion for the virtual object in FIG. 7B.
Fig. 8C is a reference image of the unmodified virtual object (dolphin) in fig. 7A, 7B, 7C, and 8A and 8B.
Fig. 9A shows an example of a real person registered with an adapted virtual object.
FIG. 9B illustrates an example of an adaptive occlusion interface modeled with a first level of detail with a first accuracy criterion of a virtual object.
FIG. 9C illustrates an example of an adaptive occlusion interface modeled at a second level of detail with a second accuracy criterion for a virtual object.
FIG. 10 illustrates an example of displaying a shadow effect between an occluding real object and a virtual object.
FIG. 11 is a flow chart describing an embodiment of a process for displaying one or more virtual objects in a user field of view of a head mounted augmented reality display device.
FIG. 12 is a flow chart describing one embodiment of a process for accounting for shadows.
FIG. 13A is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to provide realistic audiovisual occlusion between a real object and a virtual object.
FIG. 13B is a flow diagram of an example implementation process for determining whether an audio occlusion relationship between a virtual object and a real object exists based on one or more sound occlusion models associated with one or more physical properties of an occluding object.
Detailed Description
Various embodiments are described for providing realistic occlusion between a real object and a virtual object by a see-through, augmented reality display device system. One or more cameras capture image data in a field of view of a display of the display device system, which field of view is hereinafter referred to as the user field of view because it approximates the user's field of view when looking through the display device. A spatial occlusion relationship between a real object and a virtual object in the user field of view is identified based on the captured image data. A 3D mapping including the 3D spatial positions of objects in at least the user field of view may be generated based on stereoscopic processing of the image data, or based on the image data together with depth data from one or more depth sensors. A 3D space is the volume of space occupied by an object.
Depending on the accuracy of the capture, the 3D space may match the 3D shape of the object or be a less precise volume of space, similar to a bounding shape around the object. Some examples of bounding shapes are a bounding box, bounding sphere, bounding cylinder, bounding ellipse, or complex polygon that is typically slightly larger than the object. As in these examples, the bounding volume may have the shape of a predefined geometric volume. In other examples, the bounding volume shape is not a predefined shape. For example, the volume of space may follow each detected edge of the object. In some embodiments, discussed further below, the bounding volume may be used as an occlusion volume. The 3D space position represents the position coordinates of the boundary of the volume or 3D space. In other words, the 3D spatial position identifies how much space an object occupies and where that occupied space is in the user field of view.
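For illustration, here is a minimal sketch of detecting a spatial occlusion from overlapping 3D spatial positions, assuming axis-aligned bounding boxes in a display-relative x, y, z coordinate system; any of the bounding shapes mentioned above could stand in for the boxes, and the names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AABB:
    """Axis-aligned bounding volume as a simple 3D space position:
    min/max corners in display-relative x, y, z (depth) coordinates."""
    min_xyz: tuple
    max_xyz: tuple

def projections_overlap(a: AABB, b: AABB) -> bool:
    """True if the x-y extents of the two volumes overlap as seen along
    the depth axis from the display."""
    return all(a.min_xyz[i] <= b.max_xyz[i] and b.min_xyz[i] <= a.max_xyz[i]
               for i in range(2))

def spatial_occlusion(a: AABB, b: AABB):
    """Return (occluding, occluded) if one volume blocks the other, else None."""
    if not projections_overlap(a, b):
        return None
    # The nearer volume (smaller depth) occludes the farther one.
    return (a, b) if a.min_xyz[2] <= b.min_xyz[2] else (b, a)

if __name__ == "__main__":
    tree = AABB((0.0, 0.0, 2.0), (1.0, 3.0, 2.5))     # real object, nearer
    dolphin = AABB((0.5, 1.0, 3.0), (2.0, 2.0, 4.0))  # virtual object, farther
    occluding, occluded = spatial_occlusion(tree, dolphin)
    print(occluding is tree, occluded is dolphin)     # True True
```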
In a spatial occlusion relationship, one object partially or fully blocks another object in the field of view. In the illustrative example of figs. 7A, 7B and 7C, a real pine tree partially occludes a virtual dolphin. In the case where a virtual object is completely blocked or occluded by a real object, the occlusion may be represented on the display simply by not displaying the virtual object. Similarly, a real object may be wholly or partially occluded by a virtual object, depending on the executing application. All or part of a virtual object may be displayed by display elements, e.g. pixels of the display, in front of all or part of the real object. In other examples, the virtual object may be sized to completely cover the real object.
However, in some instances, the virtual object will be displayed such that its shape fits over at least a portion of the real object. Since the shape of an occluding virtual object depends on the shape of at least a portion of the real object that it occludes (meaning blocks from view), there is an adaptive occlusion interface. As described below, the adaptive occlusion interface is also modeled to form a basis for generating a modified version of virtual object boundary data upon which the display of the virtual object is based. In the case of partial occlusion, there is a partial occlusion interface, which is the intersection where the object boundary of an occluding portion of an occluding object meets or is adjacent to an unoccluded portion of an occluded object. For partial or full occlusion between a real object and a virtual object, either type of object may be an occluding object or an occluded object.
For see-through displays, whether a virtual object is an occluded object or an occluding object in the occlusion, the image data of the unoccluded portion of the virtual object is modified to represent the occlusion, since the real object is actually seen through the display. The displayed image data may be moving image data such as video and still image data. For a video viewing display, both real world image data and virtual images are displayed to the user such that the user is not actually looking at the real world. The same embodiments of the methods and processes discussed below may also be applied to video viewing displays, if desired. Furthermore, Z-buffering may be performed on image data of real objects as well as virtual image data based on Z-depth testing. In the case of a video viewing display, image data of an occluded part of an object (whether it is real or virtual) is not displayed, while image data of an occluding object (whether it is real or virtual) is displayed.
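A minimal sketch of such per-pixel Z-depth testing for a video viewing display follows. The array shapes and the compositing rule are illustrative assumptions, not the patent's graphics pipeline.

```python
import numpy as np

def composite_with_z_test(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel Z-depth test for a video viewing display.

    At each pixel the nearer of the real-world image data and the virtual
    image data wins, so occluded portions (real or virtual) are simply not
    displayed. For a see-through display only the virtual layer would be
    drawn, with occluded virtual pixels cut away instead.
    """
    virt_in_front = virt_depth < real_depth           # boolean mask per pixel
    return np.where(virt_in_front[..., None], virt_rgb, real_rgb)

if __name__ == "__main__":
    h, w = 2, 3
    real_rgb = np.zeros((h, w, 3)); real_depth = np.full((h, w), 2.0)
    virt_rgb = np.ones((h, w, 3));  virt_depth = np.full((h, w), 3.0)
    virt_depth[0, 0] = 1.0                            # one virtual pixel nearer
    print(composite_with_z_test(real_rgb, real_depth, virt_rgb, virt_depth)[..., 0])
```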
How realistic a virtual object appears is related to how many display primitives (e.g., triangles, lines, polygons, etc.) are used to represent it. The more primitives are displayed and the more complex these primitives are, the more computing time the graphics pipeline takes to render them. Based on real-time factors of the occlusion interface such as depth position, display size, and proximity to the object the user is looking at, a suitable level of detail for representing or modeling the occlusion interface can be determined to improve computational efficiency while providing a realistic display of the occlusion interface. Some embodiments of audio occlusion based on spatial occlusion detected in a user's environment, including virtual or real objects, are also disclosed.
FIG. 1A is a block diagram depicting example components of an embodiment of a see-through, augmented or mixed reality display device system. System 8 includes a see-through display device, a near-eye, head-mounted display device 2 that communicates with processing unit 4, through line 6 in this example or wirelessly in other examples. In this embodiment, the head mounted display device 2 is in the shape of eyeglasses with a frame 115, the frame 115 having a display optical system 14 for each eye, wherein image data is projected into the user's eye to generate a display of the image data, while the user also views through the display optical system 14 to obtain an actual direct view of the real world. Each display optical system 14 may also be referred to as a see-through display, and the two display optical systems 14 together may also be referred to as a see-through display.
The term "actual direct view" is used to refer to the ability to see the real-world object directly with the human eye, rather than seeing the created image representation of the object. For example, looking through glasses at a room will allow a user to get an actual direct view of the room, whereas viewing a video of the room on a television set is not an actual direct view of the room. The frame 115 provides a support structure for holding the various components of the system in place and a conduit for electrical connections. In this embodiment, frame 115 provides a convenient eyeglass frame as a support for the various elements of the system discussed further below. Some other examples of near-eye support structures are eyewear frames or eyewear supports. The frame 115 includes a nose bridge portion 104, the nose bridge portion 104 having a microphone 110 for recording sound and transmitting audio data to a control circuit 136. The side arms or temples 102 of the frame rest on each ear of the user and in this example the temples 102 are shown as including control circuitry 136 for the display device 2.
As shown in figs. 2A and 2B, an image generation unit 120 is also included on each temple 102 in this embodiment. Also, not shown in this view but shown in figs. 2A and 2B, are outward facing cameras 113 for recording digital images and video and communicating the visual recordings to the control circuitry 136, which may in turn send the captured image data to the processing unit 4, which may in turn send this data to one or more computer systems 12 over a network 50.
The processing unit 4 may take various embodiments. In some embodiments, the processing unit 4 is a separate unit that may be worn on the body (e.g., waist) of the user, or may be a separate device such as a mobile device (e.g., smartphone). Processing unit 4 may communicate with one or more computing systems 12, whether located nearby or at a remote location, by wire or wirelessly (e.g., WiFi, bluetooth, infrared, RFID transmission, Wireless Universal Serial Bus (WUSB), cellular, 3G, 4G, or other wireless communication device) over a communication network 50. In other embodiments, the functionality of processing unit 4 may be integrated in the software and hardware components of display device 2 of FIG. 1B. An example of the hardware components of the processing unit 4 is shown in fig. 2C.
One or more remote, network-accessible computer systems 12 may be leveraged for processing capability and remote data access. An example of the hardware components of computing system 12 is shown in FIG. 2C. An application may execute on a computer system 12 that interacts with or performs processing for an application executing on one or more processors in the see-through, augmented reality display system 8. For example, a 3D mapping application may be executed on the one or more computer systems 12 and the user's display device system 8. In some embodiments, the application instances may execute in host and client roles, where a client copy executes on the display device system 8 and performs a 3D mapping of its user field of view; receives updates of the 3D mapping from computer system 12 in a view-independent format; receives updates of objects in its view from the host 3D mapping application; and sends image data, depth data, and object identification data (if available) back to the host copy. Additionally, in some embodiments, 3D mapping applications executing on different display device systems 8 in the same environment share data updates (e.g., object identifications of real objects and occlusion data such as occlusion volumes) either in real-time in a peer-to-peer configuration among the devices or in real-time with 3D mapping applications executing in one or more network-accessible computing systems.
The data shared in some examples may be referenced relative to a common coordinate system of the environment. In other examples, one head mounted display (HMD) device may receive data from another HMD device, including image data or data derived from image data, position data of the transmitting HMD (e.g., GPS or IR data giving a relative position), and orientation data. An example of data shared between HMDs is depth map data, which includes image data and depth data captured by the front facing cameras 113 of the transmitting HMD, and occlusion volumes for real objects in the depth map. The real objects may still be unidentified or may have been recognized by software executing on the HMD device or a supporting computer system (e.g., 12 or another display device system 8). If a common coordinate system is not used, the second HMD may map the received object locations into a depth map for its own user perspective based on the position and orientation data of the sending HMD. Any common objects identified in both the depth map data of the field of view of the recipient HMD device and the depth map data of the field of view of the sending HMD device may also be used for mapping.
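As an illustration of mapping received object locations into the second HMD's perspective, the sketch below assumes each HMD's pose is available as a rotation matrix and translation in a shared reference (e.g., from GPS or IR relative positioning); the function and variable names are hypothetical.

```python
import numpy as np

def sender_point_to_receiver_frame(p_sender: np.ndarray,
                                   R_sender: np.ndarray, t_sender: np.ndarray,
                                   R_receiver: np.ndarray, t_receiver: np.ndarray) -> np.ndarray:
    """Re-express a point from the sending HMD's camera frame in the receiving
    HMD's camera frame, given each device's orientation (rotation matrix) and
    position (translation) in a shared reference frame.

    world = R_sender @ p + t_sender, then invert the receiver's pose.
    """
    p_world = R_sender @ p_sender + t_sender
    return R_receiver.T @ (p_world - t_receiver)

if __name__ == "__main__":
    p = np.array([0.0, 0.0, 2.0])                    # 2 m in front of the sender
    R_s, t_s = np.eye(3), np.array([0.0, 0.0, 0.0])
    # Receiver stands 1 m to the right, facing the same way (illustrative pose).
    R_r, t_r = np.eye(3), np.array([1.0, 0.0, 0.0])
    print(sender_point_to_receiver_frame(p, R_s, t_s, R_r, t_r))  # [-1. 0. 2.]
```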
One example of an environment is a 360 degree visible portion of a real-world venue where a user is located. A user may only be looking at his environment as a subset of his field of view. For example, a room is an environment. An individual may be at home and looking at the top shelf of a refrigerator in the kitchen. The top shelf of the refrigerator is in his field of view, the kitchen is his environment, but his upstairs room is not part of his current environment, as the walls and ceiling block his view of the upstairs room. Of course, as he moves, his environment changes. Some other examples of an environment may be a court, a street venue, a portion of a store, a customer portion of a coffee shop, and so forth. The venue may include multiple environments, for example, a home may be a venue. The user and his friends may be playing games with their display device system, which occurs anywhere in the home. As each player moves around in the home, its environment changes. Similarly, a perimeter around several blocks may be a place, and different intersections provide different environments to view as different intersections come into view.
In the illustrative embodiment of FIGS. 1A and 1B, computer system 12 and display device system 8 also have network access to 3D image capture device 20. The capture device 20 may be, for example, a camera that visually monitors one or more users and the surrounding space so that gestures and/or movements performed by the one or more users and structures of the surrounding space including surfaces and objects may be captured, analyzed, and tracked. Such information may be used, for example, to update a display portion of the virtual object, display location-based information to a user, and to identify gestures to indicate one or more controls or actions to an executing application (e.g., a gaming application).
The capture device 20 may be a depth camera. According to an example embodiment, each capture device 20 may be configured with RGB and IR components to capture video with depth information, including a depth image that may include depth values, via any suitable technique including, for example, time-of-flight, structured light, stereo imaging, or the like. According to one embodiment, the capture device 20 may organize the depth information into "Z layers" (i.e., layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight). The depth image may include a two-dimensional (2-D) pixel area of the captured field of view, where each pixel in the 2-D pixel area may represent a distance of an object in the captured field of view from the camera (e.g., in centimeters, millimeters, etc.).
FIG. 1B is a block diagram depicting example components of another embodiment of a see-through, augmented or mixed reality display device system 8 that can communicate with other devices over a communication network 50. In this embodiment, the control circuitry 136 of the display device 2 incorporates the functionality provided by the processing unit in FIG. 1A and communicates wirelessly with one or more computer systems 12 over the communications network 50 via a wireless transceiver (see 137 in FIG. 2A).
Fig. 2A is a side view of temple 102 of frame 115 in an embodiment of a see-through, augmented reality display device 2 embodied as eyewear that provides support for hardware and software components. A camera 113 facing the physical environment is located in front of the frame 115, which camera is capable of capturing video and still images (typically in color) of the real world to map real objects in the field of view of the see-through display and hence the user. In some examples, the camera 113 may also be a depth sensitive camera that transmits and detects infrared light from which depth data may be determined. In other examples, a separate depth sensor (not shown) in front of the frame 115 may also provide depth data to objects and other surfaces in the field of view. The depth data and image data form a depth map in the captured field of view of the camera 113, which is calibrated to include the user field of view. A three-dimensional (3D) mapping of the user field of view may be generated based on the depth map. Some examples of depth sensing technologies that may be included on head mounted display device 2 are, but are not limited to, SONAR, LIDAR, structured light, and/or time-of-flight.
In some embodiments, stereo vision is used instead of or in addition to a depth sensor to determine depth information. The outward facing cameras 113 provide overlapping image data from which depth information for objects in the image data can be determined based on stereo vision. In the captured image data, parallax and contrasting features (such as color contrast) may be used to resolve the relative position of one real object with respect to another, for example for objects beyond the depth resolution of the depth sensor.
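For reference, the standard pinhole stereo relation that such processing relies on can be sketched as follows; the focal length, baseline, and disparity values in the example are arbitrary placeholders, not parameters of the device.

```python
def depth_from_disparity(focal_length_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Classic pinhole stereo relation: Z = f * B / d.

    focal_length_px: focal length of the calibrated outward facing cameras, in pixels.
    baseline_m:      distance between the two cameras on the frame, in meters.
    disparity_px:    horizontal shift of the same feature between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("feature must have positive disparity to recover depth")
    return focal_length_px * baseline_m / disparity_px

if __name__ == "__main__":
    # e.g. 700 px focal length, 14 cm between the temple cameras, 35 px disparity
    print(depth_from_disparity(700.0, 0.14, 35.0))  # -> 2.8 m
```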
The cameras 113 are also referred to as outward facing cameras, meaning facing outward from the user's head. Each illustrated camera 113 is a forward facing camera that is calibrated with respect to a reference point of its respective display optical system 14. One example of such a reference point is the optical axis (see 142 in fig. 2B) of its respective display optical system 14. This calibration allows the field of view of the display optical system 14 (also referred to as the user field of view as described above) to be determined from the data captured by the cameras 113.
Control circuitry 136 provides various electronics that support other components of head mounted display device 2. In this example, the right temple 102r includes control circuitry 136 for the display device 2 that includes a processing unit 210, a memory 244 accessible to the processing unit 210 for storing processor readable instructions and data, a wireless interface 137 communicatively coupled to the processing unit 210, and a power supply 239 that provides power to the components of the control circuitry 136 and other components of the display 2 (e.g., the camera 113, microphone 110, and sensor units discussed below). Processing unit 210 may include one or more processors, including a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU), particularly in embodiments without a separate processing unit 4.
An earphone 130 of a pair of earphones 130, inertial sensors 132, and one or more location or proximity sensors 144 (some examples of which are a GPS transceiver, an Infrared (IR) transceiver, or a radio frequency transceiver for processing RFID data) are located inside the temple 102 or mounted to the temple 102. In one embodiment, inertial sensors 132 include a three axis magnetometer, a three axis gyroscope, and a three axis accelerometer. The inertial sensors are used to sense the position, orientation, and sudden accelerations of head mounted display device 2. From these movements, the head position, and thus the orientation of the display device, may also be determined. In this embodiment, each device that processes analog signals in its operation includes control circuitry that interfaces digitally with the digital processing unit 210 and memory 244 and that generates or converts analog signals, or both, for its respective device. Some examples of devices that process analog signals are the sensor devices 144, 132 and earphones 130, as described above, as well as the microphone 110, cameras 113, IR illuminator 134A, and IR detector or camera 134B.
An image source or image generation unit 120 that produces visible light representing an image is mounted on or within the temple 102. The image generation unit 120 can display a virtual object as appearing at a specified depth location in the field of view to provide a realistic, in-focus three-dimensional display of the virtual object interacting with one or more real objects. Some examples of embodiments of the image generation unit 120 that can display virtual objects at various depths are described in the following applications, which are incorporated by reference herein: U.S. patent application No. 12/941,825, "Automatic Variable Virtual Focus for Augmented Reality Displays," filed November 8, 2010, inventors Avi Bar-Zeev and John Lewis; and U.S. patent application No. 12/949,650, "Automatic Focus Improvement for Augmented Reality Displays," filed November 18, 2010, inventors Avi Bar-Zeev and John Lewis. In these examples, the focal length of the image generated by the microdisplay is changed by adjusting a displacement between an image source such as a microdisplay and at least one optical element such as a lens, or by adjusting the optical power of the optical elements receiving the light representing the image. The change in focal length results in a change in the region of the field of view of the display device in which the image of the virtual object appears to be displayed. In one example, multiple images, each including a virtual object, may be displayed to the user at a rate fast enough so that human temporal image fusion makes the images appear to the human eye to be present simultaneously. In another embodiment, a composite image of the in-focus portions of the virtual image generated at different focal regions is displayed.
In one embodiment, the image generation unit 120 includes a microdisplay for projecting an image of one or more virtual objects and coupling optics, such as a lens system, for directing the image from the microdisplay to the reflective surface or element 124. The microdisplay may be implemented in a variety of technologies, including projection technology, micro-Organic Light Emitting Diode (OLED) technology, or reflective technologies such as Digital Light Processing (DLP), Liquid Crystal on Silicon (LCOS), and display technology from Qualcomm, Inc. The reflective surface 124 directs light from the microdisplay 120 into the light guide optical element 112, and the light guide optical element 112 directs the light representing the image into the eye of the user.
Fig. 2B is a top view of an embodiment of a side of a see-through, near-eye, augmented reality display device including display optical system 14. A portion of the frame 115 of the near-eye display device 2 will surround the display optics 14 for providing support and making electrical connections. To illustrate the various components of display optical system 14 (in this case, right eye system 14 r) in head mounted display device 2, a portion of frame 115 surrounding the display optical system is not depicted.
In the illustrated embodiment, the display optical system 14 is an integrated eye tracking and display system. The system embodiment comprises: an opacity filter 114 for enhancing the contrast of the virtual image, in this example behind and aligned with an optional see-through lens 116; a light guide optical element 112 for projecting image data from the image generation unit 120, which is behind and aligned with the opacity filter 114; and an optional see-through lens 118 behind and aligned with the light guide optical element 112.
The light guide optical element 112 transmits the light from the image generation unit 120 to the eye 140 of the user wearing the head-mounted display device 2. The light guide optical element 112 also allows light to be transmitted from the front of the head mounted display device 2 to the eye 140 through the light guide optical element 112 as depicted by arrow 142 representing the optical axis of the display optical system 14r, thereby allowing the user to have an actual direct view of the space in front of the head mounted display device 2 in addition to receiving the virtual image from the image generation unit 120. Thus, the walls of the light guide optical element 112 are see-through. The light guide optical element 112 is a planar waveguide in this embodiment and includes a first reflective surface 124 (e.g., a mirror or other surface) that reflects incident light from the image generation unit 120 such that the light is trapped within the waveguide. Representative reflective elements 126 represent one or more optical elements such as mirrors, gratings, and other optical elements that direct visible light representing an image from a planar waveguide to a user's eye 140.
The infrared illumination and reflections also traverse the planar waveguide 112 of the eye tracking system 134 for tracking the position of the user's eyes, which may be used for applications such as gaze detection, blink command detection, and collecting biological information indicative of the personal physical state of the user. The eye tracking system 134 includes an eye tracking IR illumination source 134A (infrared Light Emitting Diode (LED)) or a laser (e.g., VCSEL) and an eye tracking IR sensor 134B (e.g., an IR camera, an arrangement of IR photodetectors, or an IR Position Sensitive Detector (PSD) for tracking the location of glints). In this embodiment, the representative reflective element 126 also implements a bi-directional Infrared (IR) filter that directs IR illumination toward the eye 140, preferably centered on the optical axis 142, and receives IR reflections from the user's eye 140. In some examples, the reflective element 126 may include a hot mirror or a grating for implementing bi-directional IR filtering. Wavelength selective filter 123 passes visible spectrum light from reflective surface 124 and directs infrared wavelength illumination from eye tracking illumination source 134A into planar waveguide 112. Wavelength-selective filter 125 passes visible and infrared illumination in the direction of the optical path toward nose-bridge 104. The wavelength-selective filter 125 directs infrared radiation from the waveguide including infrared reflections of the user's eye 140, preferably including reflections captured around an optical axis 142, from the waveguide 112 to the IR sensor 134B.
In other embodiments, the eye tracking unit optics are not integrated with the display optics. For more examples of eye tracking systems for HMD devices, see U.S. patent 7,401,920, entitled "Head Mounted Eye Tracking and Display System," issued to Kranz et al. on July 22, 2008; U.S. patent application No. 13/245,739 to Lewis et al., entitled "Gaze Detection in a See-Through, Near-Eye, Mixed Reality Display," filed August 30, 2011; and U.S. patent application No. 13/245,700 to Bohn, entitled "Integrated Eye Tracking and Display System," filed September 26, 2011, all of which are incorporated herein by reference.
Opacity filter 114, aligned with light guide optical element 112, selectively blocks natural light from passing through light guide optical element 112 for enhancing the contrast of the virtual image. When the system renders a scene for an augmented reality display, the system notes which real-world objects are in front of which virtual objects, and vice versa. If a virtual object is in front of a real-world object, then opacity is turned on for the coverage area of the virtual object. If the virtual object is (virtually) behind a real-world object, the opacity and any color for that display region are turned off so that, for the corresponding area of real light, the user will see only the real-world object. The opacity filter helps the image of the virtual object appear more realistic and represent a full range of colors and intensities. In this embodiment, electrical control circuitry (not shown) for the opacity filter receives instructions from the control circuitry 136 via electrical connections routed through the frame. Further details of an opacity filter are provided in U.S. patent application No. 12/887,426, "Opacity Filter for See-Through Mounted Display," filed September 21, 2010, the entire contents of which are incorporated herein by reference.
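A simplified sketch of deriving the per-pixel opacity described above is shown below, assuming depth values are available for both the virtual image data and the mapped real world at each display pixel; the actual opacity filter control is hardware-specific and the function names are illustrative.

```python
import numpy as np

def opacity_mask(virt_depth: np.ndarray, real_depth: np.ndarray,
                 virt_coverage: np.ndarray) -> np.ndarray:
    """Per-pixel opacity for the filter behind a see-through display.

    Opacity is turned on where the virtual object covers the pixel and is
    nearer than the real world at that pixel; elsewhere it stays off so the
    user sees real light directly.
    """
    in_front = virt_depth < real_depth
    return (virt_coverage & in_front).astype(float)   # 1.0 = opaque, 0.0 = clear

if __name__ == "__main__":
    virt_depth = np.array([[1.0, 4.0]])   # first pixel nearer than real world
    real_depth = np.array([[2.0, 2.0]])
    coverage = np.array([[True, True]])
    print(opacity_mask(virt_depth, real_depth, coverage))  # [[1. 0.]]
```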
Also, fig. 2A and 2B show only half of the head mounted display device 2. A complete head-mounted display device may include another set of selectable see-through lenses 116 and 118, another opacity filter 114, another light guide optical element 112, another image generation unit 120, a physical environment facing camera 113 (also referred to as outward facing or forward facing camera 113), an eye tracking component 134, and headphones 130. Additional details of the head mounted display device system are shown in U.S. patent application No. 12/905952 entitled "fusing virtual Content Into Real Content," filed on 10/15/2010, which is hereby incorporated by reference in its entirety.
FIG. 2C is a block diagram of one embodiment of a computing system that may be used to implement one or more network-accessible computing systems 12 or processing units 4, which processing units 4 may host at least some of the software components in the computing environment 54 or other elements depicted in FIG. 3A. With reference to fig. 2C, the exemplary system includes a computing device, such as computing device 200. In most basic configurations, computing device 200 typically includes one or more processing units 202, including one or more Central Processing Units (CPUs) and one or more Graphics Processing Units (GPUs). Computing device 200 also includes memory 204. Depending on the exact configuration and type of computing device, memory 204 may include volatile memory 205 (such as RAM), non-volatile memory 207 (such as ROM, flash memory, etc.) or some combination of the two. This most basic configuration is illustrated in fig. 2C by dashed line 206. Additionally, device 200 may also have additional features/functionality. For example, device 200 may also include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in FIG. 2C by removable storage 208 and non-removable storage 210.
Device 200 may also contain communication connections 212, such as one or more network interfaces and transceivers, that allow the device to communicate with other devices. Device 200 may also have input device(s) 214 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 216 such as a display, speakers, printer, etc. may also be included. All of these devices are well known in the art and need not be discussed at length here.
FIG. 3A is a block diagram of a system from a software perspective for a head mounted, augmented reality display device system to provide realistic occlusion between a real object and a virtual object. FIG. 3A illustrates a computing environment embodiment 54 from a software perspective that may be implemented by a head-mounted display device system, such as system 8, one or more remote computing systems 12 in communication with one or more display device systems, or a combination thereof. In addition, the display device system may communicate with other display device systems to share data and processing resources. Network connectivity allows for full utilization of available computing resources. As shown in the embodiment of FIG. 3A, the software components of the computing environment 54 include an image and audio processing engine 191 in communication with an operating system 190. The image and audio processing engine 191 processes image data (e.g., moving data such as video or still data) and audio data in order to support applications for execution by an HMD device system, such as the see-through, augmented reality display device system 8. The image and audio processing engine 191 includes an object recognition engine 192, a gesture recognition engine 193, a virtual data engine 195, eye tracking software 196 (if eye tracking is used), an occlusion engine 302, a 3D positional audio engine 304 with a sound recognition engine 194, and a scene mapping engine 306, all in communication with each other.
The computing environment 54 also stores data in an image and audio data buffer 199. The buffer provides: a memory for receiving image data captured from the outward facing capture device 113, image data captured by other capture devices (if available), image data from an eye tracking camera (if used) of the eye tracking component 134; a buffer for holding image data of a virtual object to be displayed by the image generating unit 120; and buffers for both input and output audio data, such as sound captured from the user through the microphone 110, and sound effects from the 3D audio engine 304 for the application to be output to the user through the headphones 130.
A 3D mapping of a user field of view of the see-through display may be determined by the scene mapping engine 306 based on captured image data and depth data for the user field of view. The depth map may represent the captured image data and depth data. A view-dependent coordinate system may be used for the mapping of the user field of view, since whether an object occludes another object depends on the user's viewpoint. An example of a view-dependent coordinate system is an x, y, z coordinate system in which the z-axis or depth axis extends perpendicularly, or as a normal, from the front of the see-through display. In some examples, the image or depth data of the depth map representing the user field of view is received from the cameras 113 on the front of the display device 2.
Occlusion processing may be performed even before a real object is recognized or identified. Before completing object identification, the object recognition engine 192 may detect the boundary of the real object in the depth map and may assign a bounding volume as a 3D space around the real object. Bounding volumes are identified to the 3D scene mapping engine 306 and the occlusion engine 302. For example, the object recognition engine 192 may identify the bounding volume in a message to the operating system 190, which the operating system 190 broadcasts to other engines such as the scene mapping engine and occlusion engine and applications that have registered for such data. Even before the object recognition is performed, the bounding volume can be used as an occlusion volume for performing occlusion processing. For example, a fast moving object may cause occlusions that are processed based on the occlusion volume and depth map data, even if the object moves out of view before it is identified. The boundaries of the occlusion body can be used at least in part as a basis for generating an occlusion interface. The scene mapping engine 306 may assign a 3D spatial position for one or more real objects detected in the user's field of view based on the depth map. As described below, when objects are identified by the object recognition engine 192, the 3D space or volume of these objects in the map may be refined to better match the actual shape of the real-world object. The 3D spatial location of the virtual object may be determined to be within the 3D map of the user's field of view by the virtual data engine 195 or the executing application. The occlusion engine 302 may also assign occlusion volumes to virtual objects based on level of detail criteria.
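As a sketch of assigning a bounding volume to a detected but not yet identified real object, the following derives a coarse occlusion volume from the depth-map pixels belonging to the object; the returned representation is an assumption made for illustration.

```python
import numpy as np

def occlusion_volume_from_depth_blob(depth_map: np.ndarray,
                                     mask: np.ndarray) -> dict:
    """Derive a coarse axis-aligned occlusion volume for a detected but not yet
    identified real object from the depth-map pixels belonging to it.

    depth_map: per-pixel depth in meters; mask: boolean pixels of the object.
    Returns pixel-space x/y extents plus a depth range, which can serve as the
    occlusion volume until object recognition refines the boundary.
    """
    ys, xs = np.nonzero(mask)
    depths = depth_map[mask]
    return {
        "x_px": (int(xs.min()), int(xs.max())),
        "y_px": (int(ys.min()), int(ys.max())),
        "z_m": (float(depths.min()), float(depths.max())),
    }

if __name__ == "__main__":
    depth = np.full((4, 4), 5.0)
    blob = np.zeros((4, 4), dtype=bool)
    blob[1:3, 1:3] = True       # a small detected object
    depth[blob] = 2.0
    print(occlusion_volume_from_depth_blob(depth, blob))
```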
Sensor data can be used to help map things around a user in the user's environment. Data from orientation sensors 132 (e.g., three axis accelerometer 132C and three axis magnetometer 132A) determine changes in position of the user's head, and correlation of these changes in head position with changes in image and depth data from forward facing camera 113 may identify the position of the respective images relative to each other. As described above, depth map data of another HMD device that is currently or previously in the environment, along with position and head orientation data for the other HMD device, may also be used to map what is in the user environment. The shared real objects in their depth maps may be used for image alignment and other techniques for image mapping. With this position and orientation data, it can also be predicted what object is entering the view, so that occlusion and other processing can start even before the object is in the view.
The scene mapping engine 306 may also use a view-independent coordinate system for 3D mapping. The map may also be stored in a view-independent coordinate system at a storage location (e.g., 324) accessible by other display device systems 8, other computer systems 12, or both, retrieved from memory, and updated over time as one or more users enter or re-enter the environment. In some examples, registering the image and the object in a common coordinate system may be performed using an extrinsic calibration process. Registration and alignment of images (or objects within images) on a common coordinate system allows the scene mapping engine to compare and integrate real world objects, landmarks, or other features extracted from different images into a unified 3D map associated with the real world environment.
When a user enters an environment, the scene mapping engine 306 may first search for a pre-generated 3D map identifying 3D spatial locations and object identification data, which map may be stored locally or accessed from another display device system 8 or a network accessible computer system 12. The map may include stationary objects. The map may also include objects moving in real time and current lighting and shading conditions if the map is currently being updated by another system. Additionally, the pre-generated map may include identification data of objects that tend to enter the environment at particular times, to speed up the recognition process. The pre-generated map may also store occlusion data, as described below. The pre-generated maps may be stored in a network accessible database, such as image and map database 324.
The environment may be identified by venue data. The venue data can be used as an index for searching in the venue-indexed images and pre-generated 3D map database 324 or in the internet-accessible images 326 for map or image related data that can be used to generate a map. GPS data from GPS transceiver 144, for example, of the location and proximity sensors on display device 2, may identify the user's location. Further, the IP address of a WiFi hotspot or cell site with which display device system 8 has a connection may identify a venue. Cameras at known locations within a venue may identify a user or other people through facial recognition. Furthermore, maps and map updates or at least object identification data can be exchanged between the display device systems 8 in a certain location as the range of the signal allows, by infrared, bluetooth or WUSB.
An example of image-related data that may be used to generate a map is metadata associated with any matching image data by which an object and its location within the coordinate system of the venue may be identified. For example, the relative position of one or more objects in the image data from the outward facing camera of the user's display device system 8 relative to one or more GPS-tracked objects at the venue may be determined, whereby other relative positions of real and virtual objects may be identified.
As described in the discussion of figs. 1A and 1B, the image data used to map the environment may come from cameras other than the cameras 113 on the user's display device 2. The image and depth data from multiple perspectives may be received in real time from other 3D image capture devices 20 under control of one or more network accessible computer systems 12, or from at least one other display device system 8 in the environment. Depth maps from the multiple perspectives are combined based on a view-independent coordinate system describing the environment (e.g., an x, y, z representation of a room, a store space, or a geofenced area) for creating a spatial volume or 3D map. For example, if the scene mapping engine 306 receives depth maps from multiple cameras, the engine 306 correlates the images and derives a common coordinate system by aligning the images and using the depth data to create a volumetric description of the environment.
In some examples, the 3D map (whether it is a depth map of the user's field of view, a location in a 3D map or view-independent coordinate system of the environment, or some other location in between) may be modeled as a 3D mesh of the environment. A mesh may include detailed geometric representations of various features and surfaces within a particular environment or region of an environment. A 3D point cloud representing a surface of an object including objects such as walls and floors in a space may be generated based on captured image data and depth data of a user environment. A 3D mesh of the surfaces in the environment may then be generated from the point cloud. More information regarding the generation of 3D graphs can be found in U.S. patent application 13/017,690 entitled "Three-Dimensional environmental Reconstruction," which is incorporated by reference herein in its entirety.
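A minimal sketch of generating such a point cloud by back-projecting depth pixels through a pinhole camera model follows; the intrinsic parameters and array sizes are placeholders, and mesh generation from the cloud is not shown.

```python
import numpy as np

def depth_map_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                             cx: float, cy: float) -> np.ndarray:
    """Back-project each depth pixel into a 3D point (pinhole camera model).

    depth: HxW array of depth values in meters; fx, fy, cx, cy: intrinsics of
    the calibrated outward facing camera. Returns an Nx3 array of points that
    could seed a 3D mesh of the surfaces described above.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (us.ravel() - cx) * z / fx
    y = (vs.ravel() - cy) * z / fy
    return np.column_stack([x, y, z])

if __name__ == "__main__":
    cloud = depth_map_to_point_cloud(np.full((2, 2), 1.5), 500, 500, 1.0, 1.0)
    print(cloud.shape, cloud[0])
```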
In addition to sharing data for scene mapping, in some embodiments, scene mapping may be a collaborative effort using other display device systems 8, or other network accessible image capture devices (e.g., 20) in the venue providing the image data and depth data, or a combination thereof, and one or more network accessible computer systems 12 to facilitate computing and sharing map updates. (for more information on collaborative scene mapping between HMDs such as system 8 and hub computer system 12 having access to image data, see "Low-Latency Fusing of Virtual and Real Content" filed on 27/10/2010 with U.S. patent application No. 12/912,937 and inventor Avi Bar-Zeev, which is incorporated herein by reference.) in some instances, a scene mapping engine 306 on the network-accessible computer system 12 receives image data for multiple user views from multiple perspective, augmented reality display device systems 8 in an environment and correlates these data based on the time of capture of their image data in order to track objects and changes in lighting and shading in the environment in Real-time. The 3D map updates may then be sent to multiple display device systems 8 in the environment. The 3D mapping data may be saved according to pre-generated criteria for faster retrieval in the future. Some examples of such pre-generated criteria include stationary objects, time of day, and environmental conditions that affect lighting and shadows. In other examples, display device system 8 may broadcast its image data or 3D map updates to other display device systems 8 in the environment and likewise receive such updates from other device systems. Each local scene mapping engine 306 then updates its 3D mapping according to these broadcasts.
As described above, the scene mapping engine 306 (and in particular the scene mapping engine executing on the display device system 8) may map the user field of view based on the image data and depth data captured by the cameras 113 on the device. The user field of view 3D map may also be determined remotely or using a combination of remote and local processing. Scene mapping engine 306 (typically executing on one or more network accessible computer systems 12) may also generate, for each of a plurality of display device systems 8, a 3D map of the unique user field of view for a respective subset of the environment, based on combining depth and image data from the respective depth images received from the respective display device systems 8 with the 3D map of the environment being updated in the view-independent coordinate system.
The object recognition engine 192 of the image and audio processing engine 191 detects, tracks, and identifies objects that are in the user's field of view and the user's 3D environment based on the captured image data and depth data (if available) or depth locations determined from stereo vision. The object recognition engine 192 distinguishes real objects from each other by marking object boundaries and comparing the object boundaries to the structure data. One example of marking object boundaries is detecting edges within the detected or derived depth data and image data, connecting the edges and comparing with stored structure data in order to find matches within probability criteria. As described above, polygonal meshes may also be used to represent the boundaries of objects. One or more databases of the structured data 200 accessible through one or more communication networks 50 may include structural information about the objects. As in other image processing applications, a person may be a type of object, so an example of structural data is a stored skeletal model of the person, which may be referenced to help identify body parts. The structure data 200 may include structural information about one or more inanimate objects to help identify the one or more inanimate objects, examples of which are furniture, sports equipment, automobiles, and so forth.
The structure data 200 may store the structure information as image data or use the image data as a reference for pattern recognition and face recognition. The object recognition engine 192 may also perform facial and pattern recognition on the image data of the object based on stored image data from other sources such as: user profile data 197 for the user; other user profile data 322 accessible to the hub; images indexed by location and 3D maps 324 and internet accessible images 326. Motion capture data from the image and depth data may also identify motion characteristics of the object.
The object recognition engine 192 may also check detected properties of an object, such as its size, shape, and motion characteristics, against reference properties of objects. An example of such a set of reference properties for an object is a reference object data set stored in reference object data sets 318.
FIG. 3B illustrates an example of a reference object data set 318N with some examples of data fields. A reference data set 318 available to the object recognition engine 192 may have been predetermined manually offline by an application developer or by pattern recognition software, and stored. In addition, a reference object data set may be generated when a user takes inventory of an object by viewing it with the display device system 8 and entering data into the data fields. Reference object data sets may also be created and stored for sharing with other users, as indicated in share permissions. The data fields include an object type 341, which may be a data record that itself includes subfields. For the object type 341, the other data fields provide data records identifying the types of physical attributes available for that type of object. For example, these other data records identify physical interaction characteristics 342, size ranges 343, available shape selections 344, typical material types 345, available colors 347, available patterns 348, available surfaces 351, typical surface textures 346, and a geometric orientation 350 of each available surface 351.
FIG. 3C illustrates some examples of data fields in an object physical properties data set 320N stored for a specific real object or a specific virtual object, including data values detected or otherwise determined based on captured data of the real object, or data predefined or generated by an application for the particular virtual object. Example data fields include an object type 381 and physical interaction characteristics 382, which are based on a size 383 (three-dimensional in this example), a shape 384 (also 3D in this example), a structure 399 (also three-dimensional, e.g., a skeleton or the structure of an inanimate object), boundary data 400, and a material type 385, among other physical attributes. For example, as a real object comes closer to the user in the field of view, more detected boundary data, such as data points and edges representing the additional detail detectable at closer range, may be stored and may also form a basis for the motion data 395 of the object. Some other exemplary data fields include pattern 386, color 387, and surfaces 388N. 3D location data 394 for the object may also be stored. In this example, the location data 394 includes motion data 395 that tracks the direction of movement through past locations in a venue.
Surfaces 388N represent an exemplary data set for each identified surface. The data set includes one or more surface textures 390, a geometric orientation 393 of the surface N, a surface shape 389 (e.g., flat, round, curved, uneven, etc.), and other factors such as surrounding free space (3D) data 392, as well as illumination 396, shadow 397, and reflectivity 398 of the respective surface determined from the image data. Ambient free space (3D) 392 may be determined from position data 391 of surface N relative to one or more surfaces of one or more other objects (real or virtual) in the real environment. These other objects are typically nearest neighbor objects. Furthermore, in general, the positions of the surfaces of the same object relative to each other may be the basis for determining overall thickness and 3D shape. Ambient free space and position data can be used to determine when audio occlusion exists.
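As a loose illustration only, the per-object and per-surface records described above might be organized along the following lines. The field names echo the reference numerals in the text, but the types, units, and structure are assumptions made for this sketch rather than a definition from the described system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SurfaceData:                      # loosely mirrors surface 388N
    surface_textures: List[str]
    geometric_orientation: Tuple[float, float, float]
    surface_shape: str                  # e.g. "flat", "round", "curved", "uneven"
    surrounding_free_space_m: float     # nearest-neighbor clearance, used for audio occlusion
    lighting: float
    shadow: float
    reflectivity: float

@dataclass
class ObjectPhysicalProperties:         # loosely mirrors data set 320N
    object_type: str                    # 381
    size_m: Tuple[float, float, float]  # 383, three-dimensional
    shape_3d: str                       # 384
    boundary_data: List[Tuple[float, float, float]]  # 400, detected boundary points
    material_type: str                  # 385
    location_3d: Tuple[float, float, float]           # 394
    motion_direction: Optional[Tuple[float, float, float]] = None  # 395
    surfaces: List[SurfaceData] = field(default_factory=list)      # 388N
```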
These different attributes are weighted, and a probability is assigned as to whether an object in the image data is of a certain object type. The resulting real object physical attribute data set 335 may be stored in one or more network accessible data stores 320.
Upon detection of one or more objects by the object recognition engine 192, other ones of the image and audio processing engines 191, such as the scene mapping engine 306 and the occlusion engine 302, receive the identity and corresponding position and/or orientation of each detected object. This object data is also reported to the operating system 190, and the operating system 190 passes the object data to other executing applications, such as other upper level applications 166.
As described above, whether an occlusion exists between objects depends on the viewpoint of the viewer. What the viewer sees from his viewpoint is his field of view; the viewpoint is also referred to as the perspective. In some embodiments, the perspective of the user wearing the display device (referred to herein as the user perspective), and the user field of view from that perspective, may be approximated by a view-dependent coordinate system having mutually perpendicular X, Y and Z axes, in which the Z-axis represents a depth position from one or more reference points in front of, or relative to, the display device system 8 (e.g., an approximate position of the fovea of the user's retina). In some examples, for faster processing, the depth map coordinate system of the depth camera 113 may be used to approximate the view-dependent coordinate system of the user field of view. The occlusion engine 302 identifies occlusions between objects, and in particular between real and virtual objects, based on the volumetric position data of the recognized objects within the view-dependent coordinate system of the 3D mapping of the user field of view, as updated by the object recognition engine 192 and the scene mapping engine 306.
The 3D spatial position of an object is volumetric position data because it represents the volume of space the object occupies and the position of that volume in the coordinate system. For each incoming display update, the occlusion engine 302 compares the 3D spatial positions of objects in the user field of view from the user perspective. The occlusion engine 302 may process objects currently in view, as noted by the scene mapping engine 306, as well as objects predicted to enter the view. Occlusions may be identified by overlapping portions in the coordinates of the respective 3D spatial positions. For example, the virtual object and the real object share a region covering the same X and Y coordinates from the user perspective but have different depths, e.g., one object is in front of the other. In one implementation example, the 3D object boundary data represented in the 3D spatial positions is projected as a mask of the object boundary data into the 2D viewing plane of the image buffer 199 in order to determine the overlap boundary. The depth data associated with the boundary data is then used to identify which boundary data belongs to the occluding object and which belongs to the occluded object.
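The overlap-and-depth comparison just described can be sketched minimally as follows, assuming each object's boundary data has already been projected to an axis-aligned 2D footprint in the viewing plane along with a representative depth. The names `Projection` and `classify_occlusion` are invented for the example; a fuller implementation would compare rasterized masks and per-pixel depth, as the text notes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Projection:
    """Object footprint projected into the 2D viewing plane, plus a depth."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    depth: float  # distance along the Z-axis from the user perspective

def classify_occlusion(virtual: Projection, real: Projection
                       ) -> Optional[Tuple[str, str]]:
    # Overlap exists only if the X and Y extents both intersect.
    overlap_x = min(virtual.x_max, real.x_max) - max(virtual.x_min, real.x_min)
    overlap_y = min(virtual.y_max, real.y_max) - max(virtual.y_min, real.y_min)
    if overlap_x <= 0 or overlap_y <= 0:
        return None  # no spatial occlusion between this pair
    # The object nearer the display (smaller depth) is the occluding object.
    if virtual.depth < real.depth:
        return ("virtual", "real")   # virtual object occludes the real object
    return ("real", "virtual")       # real object occludes the virtual object
```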
As described above, in the event that the virtual object is completely occluded by the real object, the occlusion engine may notify the virtual data engine 195 (see below) not to display the virtual object. In case the real object is completely occluded by the virtual object, the size of the virtual object or parts thereof may be determined to completely cover the real object and parts thereof. However, in the case of partial occlusion, the display is updated to show the portion of the virtual object that is relevant to the real object. In the case of a see-through display, the display is updated to show portions of virtual objects, while portions of real objects are still visible through the display device 2. The occlusion engine 302 identifies and stores object boundary data for occluded portions (also referred to as blocked or overlapping portions) of occluded objects in an occlusion data set as a basis for generating a partial occlusion interface. There may be more than one partial occlusion interface between the same pair of virtual and real objects in a spatial occlusion. The processing may be performed independently for each partial occlusion interface. Additionally, the virtual object may adapt its shape to at least a portion of the real object. Portions of the object boundary data of both real and virtual objects that are in the adapted portion are also stored in the occlusion dataset for use in representing or modeling the adapted interface.
Again, using a see-through display device, the user is actually looking at real objects present in the field of view. Regardless of which object is occluding, a modified version of the virtual object is generated to represent the occlusion. For either type of interface, a modified version of the boundary data of the virtual object is generated by the occlusion engine 302. The virtual data engine 195 displays the unoccluded portion of the virtual object according to its modified boundary data. Using the partial occlusion interface as an illustrative example, the boundary data (e.g., a polygonal mesh region or a sequence of edges) of the virtual object is modified so that its occluded portion now has a boundary adjacent to the unoccluded portion of the real object, and the shape of this new boundary data is similar to the shape of the model generated for the partial occlusion interface. As described above, a video viewing device may utilize various embodiments of the same methods and processes.
The occlusion engine 302 determines a level of detail for the generated model of the partial occlusion interface, which is used to display the unoccluded portion of the virtual object adjacent to the partial occlusion interface. The more closely the model of the interface matches the details of the boundary data of the overlapping portion, the more realistic the interface will look in the display. The engine 302 can also determine a level of detail for a conforming occlusion interface. A level of detail defines which techniques are available, and their parameters, for shaping the geometry of the resulting model of either type of interface. The rule set 311 for the different occlusion detail levels controls which geometric modeling techniques may be used and their accuracy criteria, for example how much of the object boundary data (determined by detecting the object, or stored in a detailed version of the object) is to be incorporated in the unmodified model, and the smoothing tolerance. For example, for the same set of boundary data represented as an edge sequence, one level of detail may cause the generated model of the edge sequence to be a curve, which incorporates more of the unmodified object boundary data than another level of detail that models the same edge sequence as a straight line. Another example of a level of detail uses bounding volumes or occlusion volumes as object boundary data and tracks the occlusion volumes in depth map data for faster occlusion processing, rather than waiting for object recognition.
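One way such a rule set might be organized is sketched below; the level numbers, technique names, and tolerance values are invented for illustration and are not taken from the described rule set 311.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OcclusionDetailRule:
    technique: str             # e.g. "bounding_volume", "line_fit", "curve_fit", "raw_edges"
    boundary_fraction: float   # share of stored boundary points fed into the model
    smoothing_tolerance_m: float  # max deviation allowed from the detected boundary
    gap_tolerance_m: float        # gap permitted at the interface before filling it in

# Hypothetical discrete levels, keyed by increasing realism.
DETAIL_RULES = {
    0: OcclusionDetailRule("none", 0.0, float("inf"), float("inf")),  # skip occlusion processing
    1: OcclusionDetailRule("bounding_volume", 0.1, 0.20, 0.05),
    2: OcclusionDetailRule("line_fit", 0.4, 0.05, 0.02),
    3: OcclusionDetailRule("curve_fit", 0.8, 0.01, 0.005),
    4: OcclusionDetailRule("raw_edges", 1.0, 0.0, 0.0),
}
```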
A level of detail criterion is a factor that affects how much detail the user will actually perceive, whether because of the approximations allowed by the limits of human perception or because of display resolution. Examples of level of detail criteria, which may be stored in memory as occlusion level of detail criteria 310, include depth position, display size, speed of the interface in the user field of view, and distance from the point of regard; these criteria, and the determinations made based on them, are discussed in detail with reference to FIGS. 6A-6D.
An occlusion data set 308 generated by the occlusion engine 302 or received from another system (8 or 12) is also stored in memory. In some embodiments, occlusion data is associated with a virtual object and a real object, and includes one or more models generated at one or more levels of detail for at least one occlusion interface between the virtual object and the real object. As described above, unmodified boundary data for the occlusion interface in question is also stored in the occlusion data set. Occlusion level of detail criteria 310 and occlusion level of detail rules 311 are also stored for use by the occlusion engine in determining how to model a partial occlusion interface or adapt an occlusion interface. Occlusion data may be shared with pre-generated maps, as well as object identification data and location data, or as data useful for generating 3D maps.
Occlusion data may first be generated by one mobile display device system. When later display device systems encounter the same occlusion, they may download the occlusion interfaces already generated at different levels of detail rather than regenerate them. For example, a previously generated model of a partial occlusion interface may be reused for a level of detail when the viewer is within a range of depth distances from the object and within a range of user perspective angles. Such saved occlusion data is particularly useful for stationary real objects in the environment, such as buildings. However, saved occlusion data may also save time for mobile real objects that have a predictable speed range and path through a venue, such as buses running on a schedule through a street scene. Whether an object is stationary or movable, and its likely rate of movement, may be determined based on the object's object type 381.
In addition to detecting spatial occlusions in the user's field of view, other occlusions in the user's environment or place but not in the user's field of view may also be identified by the occlusion engine 302 based on the 3D spatial location of objects relative to the user. The occlusion engine 302 executing in the display device system 8 or the hub 12 may identify occlusions. Although not visible, such occlusion relative to the user may cause audio data associated with the occluded object to be modified based on the physical properties of the occluding object.
The 3D audio engine 304 is a positional 3D audio engine that receives input audio data and outputs audio data for the headphones 130. The received input audio data may be audio data of a virtual object, or may be audio data generated by a real object. Audio data of a virtual object generated by an application may be output to the headphones so as to sound as if it came from the direction in which the virtual object is projected in the user field of view. An example of a positional 3D audio engine that may be used with an augmented reality system is disclosed in U.S. patent application No. 12/903,610 by Flaks et al., entitled "System and Method for High-Precision 3-Dimensional Audio for Augmented Reality," filed October 13, 2010, the contents of which are incorporated herein by reference. The output audio data may come from a sound library 312.
The sound recognition software 194 of the 3D audio engine identifies audio data from the real world received through the microphone 110 for application control through voice commands and environmental and object recognition. In addition to identifying the content of the audio data (e.g., a voice command or a piece of music), the 3D audio engine 304 also attempts to identify which object issued the audio data. Based on the sound library 312, the engine 304 may identify sounds, such as horn sounds associated with a certain make or model of car, with physical objects. In addition, the voice data files stored in user profile data 197 or user profile 322 may also identify speakers associated with the human objects mapped in the environment.
In addition to uploading their image data, the display device systems 8 and the 3D image capture devices 20 at a location also upload their captured audio data to the hub computing system 12. Sometimes this is the user's voice, but it may also include sounds made in the user's environment. Based on sound quality, on the identification of objects in the vicinity of the user, and on object type according to the sound library used by the sound recognition software component, it can be determined which object in the environment or location made a sound. Furthermore, a pre-generated 3D map of a location may provide an audio index of the sounds of objects fixed at the location, or of objects that regularly enter and leave it, such as trains and buses. Sharing data about objects (real and virtual), including the sounds they make, among the multiple display device systems 8 and the hub 12 facilitates identifying which object made a sound. Thus, sound object candidates identified based on matches in the sound library 312 or voice data files may be compared with identified objects in the environment, and even the location, for a match.
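A very reduced sketch of that matching idea follows, assuming sound-library candidates and mapped environment objects both carry an object type; the dictionary layout is made up for the example and stands in for the richer matching described above.

```python
def identify_sound_source(sound_candidates, environment_objects):
    """Return mapped objects whose object type matches a sound-library candidate
    for the captured audio (a simplification of the matching idea)."""
    candidate_types = {c["object_type"] for c in sound_candidates}
    return [obj for obj in environment_objects if obj["object_type"] in candidate_types]

sounds = [{"object_type": "bus", "confidence": 0.8}]
objects = [{"id": 1, "object_type": "bus"}, {"id": 2, "object_type": "tree"}]
print(identify_sound_source(sounds, objects))  # -> [{'id': 1, 'object_type': 'bus'}]
```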
Once a real or virtual object associated with the input audio data is identified by the occlusion engine 302 as being in a spatial occlusion, and the spatial occlusion causes the object to be audibly occluded, the 3D audio engine 304 can access a sound occlusion model 316 for the audibly occluded object, the model providing rules for modifying the sound data as output by the headphones 130 to represent the occlusion.
The following method figures provide some examples of how to determine whether a spatial occlusion has caused an auditory occlusion. For example, one criterion is whether the sound-emanating part of an occluded object is blocked in a spatial occlusion. Fig. 4A and 4B provide examples of audio occlusion due to spatial occlusion.
FIG. 4A illustrates an example of spatial occlusion resulting in audio occlusion of a virtual object by a real object. FIG. 4A also shows occlusion of a sound emission area. A user's hand 404, seen in the user field of view indicated by lines of sight 401l and 401r, is identified as being positioned over monster 402 in that field of view and at substantially the same depth distance as monster 402, and the audio of monster 402 is therefore muffled according to the sound damping characteristics of a human hand. In another example, the distance between the occluding object and the occluded object may indicate that there is no appreciable audio occlusion for a sound effect such as muffling, or may be a factor in weighting characteristics associated with the audio data such as volume, tone, and pitch. Monster 403 is partially occluded by the user's arm 405 in this field of view, but monster 403 is several feet behind the depth of the arm and of monster 402. The sound absorption of a single human body part has a very small range, so there is no audible occlusion effect on an occluded object several feet away, such as monster 403.
FIG. 4B illustrates an example of spatial occlusion resulting in audio occlusion of a real object by a virtual object. In this example, virtual brick wall 410 appears in the respective head mounted display devices 2 as users Bob 406 and George 408 play a quest-style game together, and the virtual brick wall 410 appears as a trigger for an action by George. In this example, to provide a realistic experience, neither George 408 nor Bob 406 can hear the other, because of the sound absorbing characteristics a thick brick wall (e.g., 18 inches) between them would have if it were real. In FIG. 4B, audio data generated by George (e.g., his call for help) is blocked, or removed from the audio received via Bob's microphone and sent to Bob's headphones. Likewise, George's 3D audio engine modifies the audio data received at George's headphones to remove the audio data generated by Bob.
To hear the audio of the virtual object generated by the executing application and sent to the 3D audio engine 304, the user typically uses headphones to hear more clearly. In the case of real objects, the sound of the real object received at the microphone may be buffered before being output to the user's headphones, so that the user experiences an audio occlusion effect that is applied to the real object audio when the user uses the headphones.
The object properties, including the material type of the object, are used to determine one or more effects on the audio data. The sound occlusion model 316 may include rules representing one or more effects that the 3D audio engine 304 can implement. For example, one material type may be primarily a sound absorber, in which the amplitude of the sound wave is damped and the sound energy is converted into heat; absorbers are useful for sound insulation. The sound occlusion model may, for example, indicate a damping coefficient for the amplitude of the audio data to represent the absorption effect. Another material type may reflect sound waves such that the angle of incidence is a predefined percentage of the angle of reflection of a sound wave striking the material; echo and Doppler effects may be output by the 3D audio engine as a result. A third type of material acts as a sound diffuser, reflecting incident sound waves in all directions. The sound occlusion model associated with objects of this material type has rules, implemented by the 3D audio engine, for generating reflections of the audio data in random directions away from the size and shape of the occluding object. Within these general categories of sound characteristics there may be more specific cases, such as a resonant absorber that damps the amplitude of the sound wave as it is reflected. A 3D audio engine, such as those used for interactive gaming in fully artificial display environments, has techniques for modifying sound waves to create echo, Doppler, absorption, transmission, and scattering effects.
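A crude sketch of an absorption-style rule such as a sound occlusion model might express is shown below, assuming a per-material transmission factor applied to the occluded object's audio samples. The material names and coefficients are illustrative only, and a fuller model would also handle reflection and scattering as described above.

```python
# Hypothetical fraction of amplitude transmitted through an occluding material;
# the names and values are made up for the sketch, not taken from the described system.
TRANSMISSION_FACTOR = {
    "human_hand": 0.6,
    "thick_brick_wall": 0.02,
    "glass_pane": 0.5,
}

def apply_audio_occlusion(samples, occluder_material):
    """Damp the occluded object's audio samples according to the occluder's
    absorption characteristics (a very rough stand-in for a sound occlusion model)."""
    factor = TRANSMISSION_FACTOR.get(occluder_material, 1.0)  # 1.0 = no audible effect
    return [s * factor for s in samples]

print(apply_audio_occlusion([0.2, -0.5, 0.8], "thick_brick_wall"))
```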
In an embodiment of the display device system 8, the outward facing camera 113 in conjunction with the object recognition engine 192 and the gesture recognition engine 193 implement a Natural User Interface (NUI). Blink commands or gaze duration data identified by the eye tracking software 196 are also examples of physical action user input. The voice commands may also supplement other physical actions recognized, such as gestures and eye gaze.
The gesture recognition engine 193 can identify actions performed by the user that indicate a control or command to an executing application. The action may be performed by a body part of the user (e.g., a hand or a finger), but an eye blink sequence may also be a gesture. In one embodiment, the gesture recognition engine 193 includes a collection of gesture filters, each comprising information about a gesture that may be performed by at least a part of a skeletal model. The gesture recognition engine 193 compares the skeletal model and the movements associated with it, derived from the captured image data, against the gesture filters in a gesture library to identify when the user (as represented by the skeletal model) has performed one or more gestures. In some examples, rather than skeletal tracking, image data is matched against image models of the user's hand or finger established during a gesture training session in order to recognize gestures.
More information about detecting and tracking objects can be found in U.S. patent application 12/641,788, "Motion Detection Using Depth Images," filed December 18, 2009, and U.S. patent application 12/475,308, "Device for Identifying and Tracking Multiple Humans Over Time," both of which are incorporated herein by reference in their entirety. More information about the gesture recognition engine 193 can be found in U.S. patent application 12/422,661, "Gesture Recognizer System Architecture," filed April 13, 2009, incorporated herein by reference in its entirety. For more information on recognizing gestures, see U.S. patent application 12/391,150, "Standard Gestures," filed February 23, 2009, and U.S. patent application 12/474,655, "Gesture Tool," filed May 29, 2009, both of which are incorporated herein by reference in their entirety.
The virtual data engine 195 processes the virtual objects and registers the 3D spatial positions and orientations of the virtual objects with respect to one or more coordinate systems (e.g., in user-field-of-view dependent coordinates or in view-independent 3D map coordinates). The virtual data engine 195 determines the position of image data of a virtual object or image (e.g., shadow) in terms of display coordinates for each display optical system 14. In addition, the virtual data engine 195 performs translation, rotation, and scaling operations to display the virtual objects at the correct size and perspective. The virtual object position may depend on the position of the corresponding object (which may be real or virtual). The virtual data engine 195 may update the scene mapping engine with respect to the spatial location of the processed virtual object.
The device data 198 may include: a unique identifier for the computer system 8, a network address (e.g., IP address), a model number, configuration parameters (such as installed devices), an identification of the operating system, and what applications are available in the display device system 8 and are executing in the display system 8, and so on. Particularly for the see-through, augmented reality display device system 8, the device data may also include data from or determined from the sensors, such as the orientation sensor 132, the temperature sensor 138, the microphone 110, and the one or more venue and proximity transceivers 144.
For illustrative purposes, the following method embodiments are described in the context of the system embodiments described above. However, the method embodiments are not limited to operating in the system embodiments described above, but may be implemented in other system embodiments. Furthermore, the method embodiments are performed continuously, and there may be multiple occlusions between real and virtual objects being processed for the current user field of view. For example, when a user wearing a head mounted, augmented reality display device system moves at least her head, and the real and virtual objects also move, the user's field of view continuously changes as well as the observable occlusions. The display typically has a display or frame rate that updates faster than the human eye can sense, for example 30 frames per second.
Fig. 5A-5C illustrate some embodiments that may be used to cause a see-through display or other head mounted display to represent a spatial occlusion relationship in the display by modifying the display of a virtual object.
FIG. 5A is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to display realistic partial occlusion between a real object and a virtual object. At step 502, the occlusion engine identifies a partial spatial occlusion between the real object and the virtual object based on their 3D spatial positions from the user perspective, and at step 506 retrieves object boundary data for the occluding portion of the occluding object in the partial occlusion. At step 508, the occlusion engine 302 determines a level of detail for a model (e.g., a geometric model) representing the partial occlusion interface based on level of detail criteria, and at step 510 generates the model of the partial occlusion interface based on the retrieved object boundary data according to the determined level of detail. At step 512, the occlusion engine 302 generates a modified version of the boundary data of the virtual object based on the model, so that it includes boundary data adjacent to the unoccluded portion of the real object whose shape is based on the model of the partial occlusion interface. For example, the shape of the adjacent boundary data is the same as the shape of the model. At step 514, the virtual data engine causes the image generation unit to display the unoccluded portion of the virtual object according to the modified version of the boundary data of the virtual object. A video viewing HMD device may modify the embodiment of FIG. 5A so that steps 512 and 514 are performed with respect to the occluding object (whether real or virtual), because a video viewing display is not a see-through display but instead displays real world image data, which can be manipulated just as the image data of virtual objects can. In other embodiments, a see-through display may employ a hybrid approach and modify at least a portion of the boundary of the real object, displaying its image data according to the modified boundary portion.
FIG. 5B is a flow diagram of an implementation example for determining a spatial occlusion relationship between virtual and real objects in a user field of view of a head mounted, augmented reality display device based on 3D spatial location data of the objects. At step 522, the occlusion engine 302 identifies, from the user perspective, an overlap of the 3D spatial location of the real object and the 3D spatial location of the virtual object in the 3D mapping of the user field of view. The occlusion engine 302 identifies at step 524 which object is an occluded object and which object is an occluding object for the overlap based on the depth data of the respective portions of the virtual object and the real object in the overlap. In step 526, the occlusion engine 302 determines whether the occlusion is full or partial based on the position coordinates of the 3D spatial positions of the real and virtual objects in terms of the non-depth axis of the 3D mapping.
In the case of a full occlusion, which type of object is occluded affects the occlusion process. For example, the occlusion engine 302 can notify the virtual data engine 195 not to display a virtual object that is completely occluded by a real object. In the case where the virtual object completely occludes the real object and the shape of the virtual object does not depend on the shape of the real object, the occlusion engine 302 does not modify the boundary of the virtual object for this occlusion.
In some occlusions (whether partial or full), the virtual object occludes at least a portion of the real object and conforms its shape to the shape of the real object. For example, a user may have indicated in his user profile 322 settings that, when the scene mapping engine 306 or a higher-level application 166 identifies that he is in the field of view of other display device systems 8, those other display device systems 8 are to display an avatar conformed to him. The other viewers then see the avatar, rather than him, from their respective perspectives, and the avatar mimics his movements.
FIG. 5C is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to display a realistic conforming occlusion interface between a conforming virtual object and the real object it occludes. In step 532, responsive to the overlap being an occlusion in which at least a portion of the virtual object conforms to at least a portion of the boundary data of the real object, the occlusion engine 302 retrieves the object boundary data of the at least a portion of the occluding virtual object and of the at least a portion of the occluded real object. At step 534, a level of detail is determined for the occluded version of the boundary data of the virtual object based on level of detail criteria and the retrieved object boundary data of the real and virtual objects. In step 536, the occlusion engine 302 generates an occlusion interface model for the at least a portion of the boundary data of the virtual object based on the determined level of detail, and in step 537 generates a modified version of the boundary data of the virtual object based on the occlusion interface model. At step 538, the virtual data engine 195 displays the virtual object according to the modified version of the boundary data of the virtual object.
FIGS. 6A, 6B, 6C, and 6D describe examples of method steps for selecting a level of detail for displaying an occlusion interface based on different types of level of detail criteria, including depth, display size, speed of the interface in the user field of view, and positional relationship to the point of regard.
FIG. 6A is a flow diagram of an implementation example for determining a level of detail for representing a partial occlusion interface or an adaptive occlusion interface based on level of detail criteria that include the depth position of the occlusion interface. The occlusion engine 302 tracks the depth position of the occlusion interface in the user field of view at step 542 and selects a level of detail based on the depth position in the field of view at step 544. Tracking the depth position includes monitoring changes in the depth position of each object, or of the portions of each object, that are in the occlusion, in order to know where the interface is and to predict where it will be at a future reference time. Where a depth camera is available, the scene mapping engine updates the position values based on readings from the depth sensor or the depth camera. In addition, as an alternative or a supplement to depth data, the scene mapping engine may identify depth changes from the disparity determined based on the positions of individual image elements (e.g., pixels) of the same object in image data captured separately by the front facing cameras 113.
Parallax is an apparent difference in the position of an object when it is viewed along at least two different lines of sight, and is measured by the angle between the two lines. Closer objects exhibit greater parallax than more distant ones. For example, when driving along a road lined with trees, as his car approaches a tree, the parallax the user's eyes detect for that tree increases. However, no parallax is detected for the moon in the sky, because the moon is so far away, even though the user is moving relative to it. An increase or decrease in parallax can indicate a change in the depth position of an object. Furthermore, a change in parallax can also indicate a change in the viewing perspective.
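A small illustration of the geometric relationship behind that cue, under a simple symmetric-viewing assumption; the function name and baseline value are made up for the example.

```python
import math

def parallax_angle_deg(baseline_m: float, depth_m: float) -> float:
    """Angle between two sight lines, separated by `baseline_m`, that converge
    on a point `depth_m` away; closer points subtend a larger angle."""
    return math.degrees(2.0 * math.atan((baseline_m / 2.0) / depth_m))

# With a 6.4 cm baseline (roughly an interpupillary distance), a tree 2 m away
# subtends about 1.8 degrees, while one 20 m away subtends about 0.18 degrees.
print(parallax_angle_deg(0.064, 2.0), parallax_angle_deg(0.064, 20.0))
```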
The level of detail may increase continuously, as in a continuous level of detail scheme, or there may be a respective range of distances associated with each discrete level of detail in a set. The crossover distance between two discrete levels of detail can be treated as a region in which the virtual data engine applies level of detail transition techniques to avoid a "popping" effect as the modeling of the occlusion interface becomes more detailed. Some examples of these techniques are alpha blending and geometric morphing.
As described above, the selected level of detail determines how accurately, and therefore how naturally or realistically, the occlusion interface will be modeled, as if the virtual object in the spatial occlusion relationship were a real object. The level of detail can include a level of detail for a geometric model of the occlusion interface. One example of a level of detail that may be selected for a geometric model is a rule that uses at least a portion of the boundary of a predefined bounding geometric shape (such as a circle, square, rectangle, or triangle) as the model or representation of the occlusion interface. At higher levels of detail, geometry fitting (such as straight line or curve fitting) may be used to fit the object boundary data points in the data set representing the occlusion interface; examples of accuracy criteria include a smoothing criterion and the percentage of the object boundary data stored for the occlusion that is to be incorporated in the resulting curve, line, or other fitted or generated geometry.
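For the geometry-fitting style of level of detail, a polyline simplification such as the Ramer-Douglas-Peucker algorithm is one way a smoothing tolerance could be realized: boundary points that deviate from the fitted segments by less than the tolerance are dropped. This is a sketch of that general idea, not necessarily the technique used by the described system; the tolerance would come from the accuracy criteria of the selected level of detail.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def _point_line_distance(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

def simplify_boundary(points: List[Point], tolerance: float) -> List[Point]:
    """Ramer-Douglas-Peucker: keep only points deviating more than `tolerance`."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord joining the endpoints.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_line_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        return [points[0], points[-1]]  # everything within tolerance: one straight segment
    left = simplify_boundary(points[: index + 1], tolerance)
    right = simplify_boundary(points[index:], tolerance)
    return left[:-1] + right  # avoid duplicating the split point
```

A coarser level of detail would simply pass a larger tolerance, collapsing more of the detected boundary into straight segments.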
Another example of a level of detail is one that affects the detail of at least the boundary data of the real object in the occlusion, in which the boundary data of at least the real object is a bounding volume or occlusion volume. An application may be displaying virtual objects that are moving quickly, or the user wearing the HMD may be moving quickly, so that occlusions appear quickly. Less detailed bounding shapes allow faster processing by taking advantage of the limits of human perception in noticing details of fast moving objects. For example, the boundary data of a tree may be represented as a cylinder, and an ellipse may bound a person in the field of view. The adaptive occlusion interface may be modeled as at least a portion of the bounding volume of the real object. For partial occlusions, using the bounding volume as boundary data simplifies the interface. In step 506, if the tree is an occluding object, the retrieved object boundary data of the occluding portion is part of a cylinder. In step 534 of FIG. 5C for adaptive interface processing, cylinder boundary data is retrieved for the tree rather than a more detailed and realistic version of the boundary data. In some embodiments, the virtual object may also be represented by a bounding volume, which can further simplify the interface.
At such a level of detail, occlusions may be processed based on depth map data, such as may be captured from a front facing camera 113, as bounding volumes may be assigned prior to refining the boundaries and real object identification.
Another example of a display aspect that rules for a level of detail can govern is a respective gap tolerance between real and virtual objects that meet at an occlusion interface. The less closely the geometric representation fits the object boundary data, the more likely it is that one or more gaps are created. For example, when a user's real fingers partially occlude a virtual ball, small gaps may appear between the displayed portions of the ball and the object boundary data representing the user's fingers. The real world, or another virtual object, behind such a gap will be visible. A small gap at the partial occlusion interface is less distracting to the human eye than the virtual ball portions overlapping the real fingers in the display would be. In FIGS. 7A and 8A, the triangular model results in a gap, because the dolphin is shown with its left and right sides adjacent to triangle sides 704 and 706, respectively.
In some embodiments, the set of levels of detail may include a level at which the virtual object is allowed to be displayed without any correction for the occlusion. A criterion that may allow this is the display size of the partial occlusion interface being smaller than a display element (e.g., a picture element, or pixel) of the display, i.e., below the display resolution. Another factor that also affects the level of detail is the number of edges or data points determined from the raw image data. In other embodiments, a very high level of detail may indicate that the detected edges themselves are used as the model of the partial occlusion interface, resulting in a very detailed representation of the interface in the display.
The realism of the displayed occlusion is balanced against efficiency in updating the display to represent the motion of virtual objects and in updating the 3D mapping of the user environment. Other level of detail criteria may include an efficiency factor indicating the time by which display of the occlusion is to be completed. Compliance with this factor may be determined based on status messages about available processing time from the various processing units (including graphics processing units) of the display device system 8, the cooperating processors of one or more network accessible computer systems 12, and other display device systems 8 that make their additional processing capacity available. If processing resources are not available, a lower, less realistic level of detail than the depth position would warrant may be selected.
However, the hub computer system or another display device system 8 may have generated and stored a model representing a partial occlusion interface or an adapted occlusion interface, as well as image data for an occlusion interface that presents the same real and virtual objects at a level of detail. In particular for occlusions with static real objects, the occlusion dataset may store a generated model of a partial occlusion interface or an adapted occlusion interface at a particular level of detail, and the hub computer system 12 may retrieve the stored model and send it over a network to the display device system 8 that has the same occlusion in its field of view at a depth position appropriate for that level of detail. The display device system 8 may translate, rotate, and scale the occlusion data for its perspective. Hub computing system 12 may also retrieve image data for the occlusion interface from another display device system and perform scaling, rotation, or translation as needed for the perspective of display device system 8, and send modified image data to display device system 8, the modified image data being in a format ready to be processed by image generation unit 120. Sharing of occlusion and image data may also enable a more detailed level of detail to comply with processing efficiency criteria.
Illumination and shading affect detail visibility. For example, at a particular depth location, more detail of a real object may be visible during the bright day than during the night or in a shadow cast by another real or virtual object. On cloudy, rainy days, it may be computationally inefficient to render an occlusion interface of a virtual object with a real object at some level of detail for bright daylight. Returning to FIG. 6A, in step 546, the occlusion engine 302 determines an illumination value for the 3D spatial location of the occlusion interface, optionally based on values assigned by the scene mapping software for illumination level, degree of shadowing, and reflectivity, and in step 548, modifies the selected level of detail, optionally based on the illumination value and taking into account the depth location.
FIG. 6B is a flow diagram of an implementation example for determining a level of detail for representing an occlusion interface based on level of detail criteria that include the display size of the occlusion interface. In step 552, the occlusion engine 302 tracks the depth position of the occlusion interface, and in step 554 identifies physical attributes of the portions of the virtual and real objects at the occlusion interface, including object size and shape, for example based on their respective associated object physical properties data sets 320N.
At step 556, in response to a request from the occlusion engine 302, the virtual data engine 195 may determine the display size of the portion of the virtual object at the occlusion interface by calculating, based on the depth position, the identified physical properties of the object portions (including object size and shape), and a coordinate transformation, how many display elements (e.g., pixels or sub-pixels) of the display will represent the image of the occlusion interface. For example, if the display size is significantly smaller than the pixel resolution of the display, a level of detail indicating no occlusion processing may be selected, since the occlusion will be barely visible and hardly worth the computational cost. In step 558, the occlusion engine 302 selects a level of detail corresponding to the determined display size.
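A rough sketch of the pixel-footprint estimate under a simple pinhole-projection assumption; the focal length value and function name are illustrative, not from the described system.

```python
def display_size_px(object_extent_m: float, depth_m: float,
                    focal_length_px: float = 1000.0) -> float:
    """Approximate on-screen size, in pixels, of a feature of the given
    physical extent at the given depth (simple pinhole projection)."""
    return focal_length_px * object_extent_m / depth_m

# A 2 cm occlusion interface 10 m away covers about 2 pixels here,
# which could map to a "skip occlusion processing" level of detail.
print(display_size_px(0.02, 10.0))  # -> 2.0
```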
FIG. 6C is a flow diagram of an implementation example for determining a level of detail for representing an occlusion interface based on level of detail criteria and a gaze priority value. At step 562, the eye tracking software 196 identifies the point of regard in the user field of view. For example, the point of regard may be determined by detecting the pupil position of each of the user's eyes, extending a line of sight from each of the user's approximate retinal locations based on an eyeball model, and identifying the intersection of the lines of sight in the 3D mapping of the user field of view. The intersection is the point of regard, which may be an object in the field of view. The point of regard in the coordinate system may be stored in a memory location accessible to other software for processing. At step 564, the occlusion engine 302 assigns a priority value to each occlusion interface based on its position relative to the point of regard, and at step 566 selects a level of detail for generating a model of a partial occlusion interface or an adaptive occlusion interface based on the level of detail criteria and the priority value. In some examples, the priority value may be based on a distance criterion from the point of regard. In other examples, occlusion interfaces located within Panum's fusional area (the zone of single vision for human binocular vision) may receive greater priority values than those located in front of it.
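One possible way to turn distance from the point of regard into a priority and then a detail level is sketched below; the thresholds and function names are invented for illustration.

```python
import math

def gaze_priority(interface_pos, point_of_regard) -> float:
    """Higher priority for occlusion interfaces closer to the point of regard."""
    dist = math.dist(interface_pos, point_of_regard)  # Euclidean distance in map units
    return 1.0 / (1.0 + dist)

def level_for_priority(priority: float) -> int:
    if priority > 0.5:   # interface very near the point of regard
        return 3         # most detailed modeling
    if priority > 0.2:
        return 2
    return 1             # coarse modeling for peripheral interfaces

print(level_for_priority(gaze_priority((1.0, 0.0, 2.0), (1.0, 0.2, 2.0))))  # -> 3
```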
FIG. 6D is a flow diagram of an implementation example for determining a level of detail using the speed of the interface as a basis. At step 572, the occlusion engine 302 determines the speed of the occlusion interface based on the speeds of the objects in the occlusion. The occlusion may be a predicted or future occlusion based on those speeds. In step 574, the occlusion engine 302 uses the speed as a basis for selecting a level of detail. Like gaze, depth distance, and display size, speed may be one of several factors considered in determining a level of detail for processing the occlusion. The higher the speed, the less detailed the occlusion processing, and a level of no occlusion processing may be selected if things are moving too fast.
Some geometry fitting techniques, such as the examples above, fit a straight line, a curve, or at least a portion of the boundary of a predefined geometric shape to the boundary data, with an accuracy criterion that adjusts the closeness of the fit. An example of using at least a portion of the boundary of a predefined geometric shape, by using the sides of a triangle as the model of partial occlusion interfaces 704 and 706, is shown in FIGS. 7A and 8A. FIGS. 7B and 8B show examples of line fitting, as a form of geometry fitting, with a first accuracy criterion, while FIG. 7C shows line fitting with a second, more precise accuracy criterion. FIG. 8C is an unmodified reference image of the virtual object (i.e., the dolphin) of FIGS. 7A, 7B, 7C, 8A, and 8B.
FIG. 7A illustrates an example of a level of detail that uses at least a portion of the boundary of a predefined bounding geometric shape. FIG. 8A shows an example of the partial occlusion interfaces for the virtual object in FIG. 7A being modeled as legs of a triangle. For example, pine tree 7161 in FIG. 7A is not a triangle, but its boundary has triangle-like properties. Referring to FIG. 8C, in FIG. 7A the central portion of the dolphin, including the fin, is hidden by the pine tree. In this example, because of the depth of the virtual object and the pine tree, there are two partial occlusion interfaces modeled as sides of a triangle. Because of the distance from the tree, for this level of detail in this example, a larger gap tolerance is permitted between the ends of the real branches and the beginning of each virtual dolphin side.
As the user moves closer to the tree, the user will see more detail with natural sight. In addition, the image and depth sensors can also determine depth more accurately, and more of the shape of the real object, the pine tree, is now taken into account. FIG. 7B illustrates an example of a level of detail using geometry fitting with a first accuracy criterion. In FIG. 7B, a straight line fitting algorithm with a smoothing criterion may be used. For example, the smoothing criterion may indicate a maximum distance the fitted geometry may be from the initially detected boundary data (e.g., points and edges), or a level of complexity of the polygons (e.g., triangles versus tetrahedrons) that may be used to represent a polygon mesh version of the portion of the object retrieved from storage upon identification of the object. The third, fourth, and fifth layers of branches down are too far from the fitted straight lines for their shapes to be represented in the geometries of partial occlusion interfaces 708, 710, and 712. FIG. 8B shows the resulting partial occlusion interfaces 708, 710, and 712, which include serrations for the spaces between the layers of branches.
FIG. 7C illustrates an example of a level of detail using geometry fitting with a second accuracy criterion indicating a higher level of modeled detail. At the distances of FIG. 7C, a geometry fitting algorithm such as a curve or line fit may be used to model the boundary data of the tree, of which more detail is now detected, including branches with pine needles that can be seen through in places, so that more detail is represented in the partial occlusion interfaces. In this example, as the dolphin swims around the pine tree, the user field of view is identified at a moment when the user is looking through a portion of the tree 7163 whose middle branches are in front of the dolphin's fin. In this example, the geometry fitting algorithm may have more boundary data from the captured image and depth data to process, and the accuracy criterion indicates a lower tolerance from the boundary data. As the user moves toward the tree and the virtual dolphin 7023 keeps swimming, the partial occlusion interfaces keep changing. At the moment of this current field of view, the branches are in front of the dolphin, and several representative partial occlusion interfaces are noted. Partial occlusion interfaces 724N represent the interfaces between the trunk of the tree and the dolphin seen between the branches. Interfaces 721N represent the occlusion interfaces of the branch portions among the pine needles. Interfaces 720N represent the occlusion interfaces of the pine needle portions on the branches in front of the dolphin from the user perspective.
FIG. 7D illustrates an example of a level of detail using a bounding volume as boundary data for at least one real object. In this example, a person (i.e., Bob 406) is being viewed through a see-through display device 2, such as might be worn by George 408 in this example. George is looking at virtual monster 732, as indicated by gaze lines 731l and 731r in this display frame. Monsters 732 and 733 are jumping around the room quickly, so as they do, the occlusion engine 302 tracks Bob across the different display frames based on depth map data, using a bounding volume of a predefined shape (an ellipse in this example). Bob 406 is treated as a real object even though he may not yet have been identified as a person by the object recognition engine 192. The occlusion engine 302 uses the ellipse to model the occlusion interfaces with respect to the monsters. Monster 732 is clipped for display at the ellipse boundary rather than at Bob's right arm. Monster 733 is similarly clipped, or is not shown, for the portion occluded by the ellipse. Because of the speed of the occlusions as the monsters jump around the room, less detailed occlusion interfaces may be presented in accordance with the level of detail criteria.
FIG. 9A shows an example of a real person registered with an adapted virtual object. A person in the user field of view, here Sam, is wearing a T-shirt 804. Sam's body protrudes outward at his middle, as shown on his T-shirt by bulges 8061 and 8062, which appear fairly close together. Sam is at an event at which a person can be seen wearing a virtual sweater indicating the university he attends. Sam's virtual sweater 902 conforms to Sam's body, as a garment normally would. FIG. 9B illustrates an example of an adaptive occlusion interface of the virtual object modeled at a first level of detail with a first accuracy criterion. Another user wearing her see-through, augmented reality display device system 8 has Sam directly facing her in her field of view, less than 7 feet away. The dashed lines in FIGS. 9B and 9C indicate the conforming and partial occlusion interfaces between the occluding virtual sweater 902 and the real object portions of Sam, such as his T-shirt 804, arms, shoulders, and pants. Occlusion interface 910 is an adaptive interface, so the position of the volume or 3D space occupied by the virtual shoulders of the sweater is adapted based on the shape and size of Sam's real shoulders. Sweater portions 9081 and 9082 have partial occlusion interfaces with Sam's T-shirt 804 and pants. Placket portions 9061 and 9062 obtain their shape from the portion of Sam's body that includes bulges 8061 and 8062. Thus, the middle placket portions 9061 and 9062 are not flat but follow the contours of the bulges. FIG. 9C illustrates an example of an adaptive occlusion interface of the virtual object modeled at a second level of detail with a second accuracy criterion. In this current field of view, the wearer again sees Sam at the center of her field of view, but at least twice as far away. Based on the distance, the bulge contours of virtual placket portions 9061 and 9062 are not shown; instead, smoother, less detailed curves are used for plackets 9081 and 9082 on Sam's sweater 902.
Occlusions can result in shadows, and shadow effects also have an impact on how realistic an occlusion looks. FIG. 10 illustrates an example of displaying shadow effects between occluding real and virtual objects. The shadow of a virtual object may be displayed on a real object, and a virtual object may be displayed with the shadow of a real object appearing on it. As discussed in U.S. patent application serial No. 12/905,952, entitled "Fusing Virtual Content Into Real Content," mentioned above, the shadowed areas may be identified by display coordinates, and in some embodiments the opacity filter 114 in front of the display optical system 14 may reduce the incident light at those display coordinates so that they appear darker, giving a shadow effect. Shadow image data may also be displayed so as to appear on a virtual or real object using conventional real-time shadow generation techniques. The location of a real object's shadow may be determined by conventional shadow detection techniques used in image processing. Based on illumination detection and shadow detection techniques, the scene mapping engine 306 may determine where a virtual object casts a shadow and whether the virtual object is to be displayed as being in shadow. In FIG. 10, balls 932 and 940 are real objects and box 936 is a virtual object. The scene mapping engine 306 detects a shadow 934 of ball 932 and a shadow 942 of ball 940 from image and depth data captured for the user field of view by the front facing cameras 113 or other cameras in the environment. The scene mapping engine 306 updates the 3D mapping of the user field of view to identify these shadows, and other applications such as the occlusion engine 302 and the virtual data engine 195 receive notice of the real shadow locations when they retrieve their next map update. The 3D position of the virtual box 936 in the user field of view is determined, and the occlusion engine 302 determines that the virtual box 936 is partially occluded by ball 932 and slightly occludes ball 940. The occlusion engine 302 also determines whether there is a shadow occlusion, meaning that the occluding object casts a shadow on the occluded object, based on the shadow positions in the 3D mapping.
Based on the lighting and shadow effects indicated by the scene mapping engine 306 in the mapping of the 3D spatial locations where the two balls and the box are located, the occlusion engine 302 determines whether an occlusion produces a shadow and whether the shadow falls on an object in the occlusion relationship. In addition to the partial occlusion interface 933, the engine 302 determines that the shadow of the occluding real ball 932 extends onto the surface of the occluded virtual box 936. The occlusion engine may identify one or more shadow occlusion boundaries 935 for the virtual box 936 indicating the portion of the virtual box that is to be in shadow. The shadow may have a degree of transparency. As described above, a partial occlusion interface identified as being in shadow may receive a less detailed level of detail for its modeling because of the shadow effect.
The occlusion engine 302 also identifies a partial occlusion interface 937 where the virtual box 936 occludes the real ball 940 and a shadow occlusion boundary 939 on the real ball 940. The virtual data engine 195 is notified of the modified boundary data due to the partial occlusion interface and shadow occlusion boundary for updating the display accordingly. Boundaries such as polygonal meshes and edges are not typically displayed. Which are the basis for the shape and size information used by the virtual data engine 195 to identify the image data.
FIG. 11 is a flow chart describing an embodiment of a process for displaying one or more virtual objects in a user field of view of a see-through, augmented reality display device (e.g., the see-through, augmented reality display devices of FIGS. 1A-2B). Steps that may be performed by or for an opacity filter are also described. The methods of FIGS. 11 and 12 may be performed in a display device system without opacity filter 114, in which case the steps associated with the opacity filter are omitted. At step 950, the virtual data engine 195 accesses a 3D mapping of the user field of view from the user perspective. For a virtual image, which may include one or more virtual objects, the system has a target 3D spatial location at which to insert the virtual image.
In step 954, the system renders the previously created three-dimensional model of the environment, from the user viewpoint (i.e., the user perspective) of the see-through, augmented reality display device 2, into a z-buffer without rendering any color information into the corresponding color buffer. This effectively leaves the rendered image of the environment all black, but stores the z (depth) data for objects in the environment. In step 956, the virtual content (e.g., virtual images corresponding to virtual objects) is rendered into the same z-buffer. Steps 954 and 956 result in a depth value being stored for each pixel (or for a subset of pixels).
In step 958, the virtual data engine 195 determines the color information for the virtual content to be written into the corresponding color buffer. This determination may be performed in a variety of ways. In some embodiments, a z or depth test is performed for each pixel. If a pixel is part of a virtual object that is closer to the display device than any other object (real or virtual) at that pixel, the color information of the virtual object is selected; in other words, the pixel corresponds to an unoccluded portion of the virtual object. In the case of a video see-through display, color information may be selected for unoccluded real objects as well as unoccluded virtual objects. Returning to the see-through display case, if a pixel corresponds to an occluded portion of a virtual object, no color information is selected for that pixel.
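The per-pixel z-test of step 958 can be sketched as follows for a see-through display. The buffer layout, resolutions, and numeric values are illustrative assumptions rather than specifics from the patent.

```python
import numpy as np

H, W = 480, 640
FAR = np.inf   # depth value meaning "nothing rendered at this pixel"

# Depth of the real environment per pixel (e.g., from the rendered environment model),
# plus depth and color of the rendered virtual content.
env_depth = np.full((H, W), 2.0)                 # a real surface 2 m away everywhere
virt_depth = np.full((H, W), FAR)
virt_color = np.zeros((H, W, 3), dtype=np.uint8)

# A virtual object covering a rectangle of pixels: its left half is in front of the
# real surface (1.5 m), its right half is behind it (2.5 m).
virt_depth[100:200, 100:300] = 1.5
virt_depth[100:200, 300:500] = 2.5
virt_color[100:200, 100:500] = (0, 200, 255)

# Per-pixel z-test: keep virtual color only where the virtual content is nearest.
unoccluded = virt_depth < env_depth
color_buffer = np.zeros((H, W, 3), dtype=np.uint8)   # black = draw nothing on a see-through display
color_buffer[unoccluded] = virt_color[unoccluded]
# Pixels where the virtual object is behind the real surface stay black, so the
# occluded right half of the virtual object is simply not drawn.
```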
In some embodiments, the modified boundary data of a virtual object, determined and modeled based on the occlusion interface, may be used as the basis for selecting which color information of the virtual content is written to which pixels. In other examples, the virtual content buffered for display is a version that already includes any image data modifications resulting from the occlusion processing of the occlusion interface at the applicable level of detail, so the color information can simply be written to a color buffer for that virtual content. Either approach effectively allows the virtual image to be drawn on microdisplay 120 while taking into account the real-world objects or other virtual objects that occlude all or part of the virtual object. In other words, either approach results in the see-through display representing the spatial occlusion relationship by modifying the display of the virtual object.
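As a hedged illustration of the first approach, modified boundary data could be rasterized into a 2D mask that gates which pixels receive virtual color. The polygon, the resolution, and the use of matplotlib's Path class are assumptions for this sketch, not the device's actual pipeline.

```python
import numpy as np
from matplotlib.path import Path

H, W = 480, 640

# A simplified 2D silhouette, in display pixel coordinates, of the unoccluded
# portion of a virtual object, e.g., produced from boundary data modified at the
# occlusion interface (last vertex repeats the first to close the polygon).
boundary_px = np.array([[120, 150], [380, 150], [380, 320],
                        [250, 260], [120, 320], [120, 150]])

# Rasterize the polygon into a boolean mask over the display pixels.
ys, xs = np.mgrid[0:H, 0:W]
pixels = np.column_stack([xs.ravel(), ys.ravel()])
mask = Path(boundary_px).contains_points(pixels).reshape(H, W)

# Write virtual color only inside the mask; everything outside stays black
# (i.e., not drawn), which is how the occluded portion is suppressed.
virt_color = np.full((H, W, 3), (0, 200, 255), dtype=np.uint8)
color_buffer = np.zeros((H, W, 3), dtype=np.uint8)
color_buffer[mask] = virt_color[mask]
```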
As part of optional opacity processing, in optional step 960, the system identifies the pixels of microdisplay 120 that display the virtual image. In optional step 962, an alpha value is determined for each pixel of microdisplay 120. As in a conventional chroma key system, the alpha value indicates, on a pixel-by-pixel basis, how opaque the image is. In some applications, the alpha value may be binary (e.g., on or off). In other applications, the alpha value may be a number within a range. In one example, each pixel identified in step 960 has a first alpha value and all other pixels have a second alpha value.
At optional step 964, the pixels of the opacity filter are determined based on the alpha values. In one example, the opacity filter has the same resolution as microdisplay 120, and thus the opacity filter can be controlled using alpha values. In another embodiment, the opacity filter has a different resolution than microdisplay 120, and thus data for darkening or not darkening the opacity filter will be derived from the alpha values using any of a variety of mathematical algorithms for converting between resolutions. Other means for deriving control data for the opacity filter based on alpha values (or other data) may also be used.
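A rough sketch of steps 960-964 follows, assuming a binary alpha per microdisplay pixel and an opacity filter at one quarter of the display resolution in each dimension. The resolutions, the block-averaging conversion, and the threshold are assumptions rather than the device's actual algorithm.

```python
import numpy as np

H, W = 480, 640            # assumed microdisplay resolution
FH, FW = 120, 160          # assumed (coarser) opacity filter resolution

# Binary alpha: 1 where the microdisplay shows virtual content, 0 elsewhere.
color_buffer = np.zeros((H, W, 3), dtype=np.uint8)
color_buffer[100:200, 100:300] = (0, 200, 255)
alpha = (color_buffer.sum(axis=2) > 0).astype(np.float32)

# Convert to opacity filter control data by averaging each block of display
# pixels that maps onto one filter cell (here 4x4 display pixels per cell).
by, bx = H // FH, W // FW
filter_control = alpha.reshape(FH, by, FW, bx).mean(axis=(1, 3))

# Optionally threshold so a filter cell darkens whenever enough of its display
# pixels carry virtual content.
darken = filter_control > 0.25
```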
In step 966, the images in the z-buffer and color buffer, as well as the alpha values and the control data for the opacity filter (if used), are adjusted to account for light sources (virtual or real) and shadows (virtual or real). More details of step 966 are provided below with reference to FIG. 12. In step 968, the composite image based on the z-buffer and color buffer is sent to microdisplay 120. That is, the virtual image is sent to microdisplay 120 to be displayed at the appropriate pixels, taking into account perspective and occlusion. In a further optional step, the control data for the opacity filter is transmitted from one or more processors or processing units of control circuitry 136 to control opacity filter 114. Note that the process of FIG. 11 may be performed many times per second (e.g., at the display refresh rate).
FIG. 12 is a flow chart describing one embodiment of a process for accounting for light sources and shadows, which is an example implementation of step 966 of FIG. 11. In step 970, the scene mapping engine 306 identifies one or more light sources that need to be accounted for. For example, a real light source may need to be accounted for when drawing a virtual image. If the system adds a virtual light source to the user's view, the effect of that virtual light source can also be accounted for in the head mounted display device 2. For more details of other implementation examples for changing lighting on real and virtual objects and additional ways of generating shadow effects, see "Display of Shadows Via See-Through Display" by inventor Matthew Lamb, filed in December 2011, which is hereby incorporated by reference in its entirety.
At step 972, portions of the 3D map of the user field of view (including the virtual image) illuminated by the light source are identified. At step 974, the image depicting the illumination is added to the color buffer described above.
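Steps 972-974 might look like the following simple Lambertian adjustment of the color buffer for one identified directional light. The shading model, the synthetic normals, and the constants are assumptions used only to illustrate adding illumination to buffered virtual content.

```python
import numpy as np

H, W = 480, 640

# Per-pixel surface normals and base color for the buffered virtual content
# (normals would normally come from the renderer; here they are synthetic).
normals = np.zeros((H, W, 3), dtype=np.float32)
normals[..., 2] = -1.0                         # surfaces facing the viewer (-z)
base_color = np.zeros((H, W, 3), dtype=np.float32)
base_color[100:200, 100:300] = (0.0, 0.8, 1.0)

# An identified directional light (real or virtual), given as its travel direction.
light_dir = np.array([0.3, -1.0, 0.5], dtype=np.float32)
light_dir /= np.linalg.norm(light_dir)
ambient, intensity = 0.3, 0.7

# Lambertian term per pixel (clamped at zero), applied to the color buffer so
# the illuminated portions of the virtual image appear brighter.
ndotl = np.clip(-(normals @ light_dir), 0.0, 1.0)
lit_color = base_color * (ambient + intensity * ndotl)[..., None]
```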
At step 976, the scene mapping engine 306 and the occlusion engine 302 identify one or more shadow regions, produced by the occlusion, that the virtual data engine 195 needs to add, optionally with the aid of the opacity filter. For example, if a virtual image is added to a region that is in shadow, the shadow is accounted for by adjusting the color buffer at step 978 when drawing the virtual image. At step 980, if a virtual shadow is to be added where no virtual image exists, the occlusion engine 302 indicates the shadow occlusion interface on the real object and a transparency level for the shadow, and based on these the virtual data engine 195 generates and renders the shadow as virtual content registered to the real object that is in the virtual shadow. Optionally, alternatively or additionally, at step 982, those pixels of opacity filter 114 that correspond to the location of the virtual shadow are darkened.
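A minimal sketch of steps 978-982 follows, assuming a boolean shadow mask and a single darkening factor. The split between darkening buffered virtual color, rendering a semi-transparent overlay registered to a real object, and darkening opacity filter pixels follows the description above, but the data structures are illustrative assumptions.

```python
import numpy as np

H, W = 480, 640

# Buffered virtual color and a flag for where virtual content is drawn.
lit_color = np.zeros((H, W, 3), dtype=np.float32)
lit_color[100:200, 100:500] = (0.0, 0.8, 1.0)
virtual_present = lit_color.sum(axis=2) > 0

# A shadow region produced by the occlusion, as identified by the scene mapping
# and occlusion engines.
shadow_mask = np.zeros((H, W), dtype=bool)
shadow_mask[150:250, 300:600] = True
shadow_strength = 0.6                          # 0 = no darkening, 1 = fully dark

# Step 978: where a virtual image is drawn inside the shadow region, darken its
# color in the color buffer.
in_shadow_virtual = shadow_mask & virtual_present
lit_color[in_shadow_virtual] *= (1.0 - shadow_strength)

# Steps 980/982: where the shadow falls on a real object with no virtual image,
# render a semi-transparent dark overlay registered to the real object and/or
# darken the corresponding opacity filter pixels so less real light passes through.
in_shadow_real = shadow_mask & ~virtual_present
shadow_overlay = np.zeros((H, W, 4), dtype=np.float32)     # RGBA overlay
shadow_overlay[in_shadow_real] = (0.0, 0.0, 0.0, shadow_strength)
opacity_darken = in_shadow_real                            # opacity filter cells to darken
```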
As with the other method aspects described above, the steps for displaying a partial occlusion interface may be performed by the see-through, augmented reality display device system 8 alone, or in cooperation with one or more hub computing systems 12 and/or other display device systems 8.
FIG. 13A is a flow diagram of an embodiment of a method for a head mounted, augmented reality display device system to provide realistic audiovisual occlusion between a real object and a virtual object. In step 1002, the occlusion engine 302 determines a spatial occlusion relationship between a virtual object and a real object in the environment of the head mounted, augmented reality display device based on three-dimensional data representing object volumes or spatial locations. In step 1004, the occlusion engine 302 determines whether the spatial occlusion relationship satisfies a field of view criterion of the display device. Some examples of field of view criteria are whether the occlusion is in the field of view and an expected time for the occlusion to enter the field of view based on motion tracking data for the objects. If the occlusion satisfies the field of view criteria, a determination is made at step 1006 as to whether the spatial occlusion is a partial occlusion. In response to the occlusion being a partial occlusion, processing for displaying a realistic partial occlusion is performed at step 1008. Otherwise, at step 1010, processing is performed for displaying a realistic whole occlusion of one object by another.
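The field of view criteria of step 1004 could be approximated as below with a simple frustum test plus a constant-velocity prediction from motion tracking data. The field-of-view angles, the prediction horizon, and the helper names are assumptions, not values from the patent.

```python
import numpy as np

def in_view(point_cam, h_fov_deg=30.0, v_fov_deg=17.5, near=0.3, far=5.0):
    """Rough view-frustum test for a point in the display's camera coordinates
    (x right, y up, z forward)."""
    x, y, z = point_cam
    if not (near <= z <= far):
        return False
    return (abs(np.degrees(np.arctan2(x, z))) <= h_fov_deg / 2 and
            abs(np.degrees(np.arctan2(y, z))) <= v_fov_deg / 2)

def expected_entry_time(point_cam, velocity_cam, horizon_s=2.0, step_s=0.1):
    """Estimate when a tracked occlusion location will enter the field of view,
    assuming roughly constant velocity over a short horizon. Returns None if it
    is not expected to enter within the horizon."""
    p = np.asarray(point_cam, dtype=float)
    v = np.asarray(velocity_cam, dtype=float)
    for t in np.arange(0.0, horizon_s + step_s, step_s):
        if in_view(p + v * t):
            return t
    return None

occlusion_pos = np.array([1.2, 0.0, 2.0])     # currently off to the right of the view
occlusion_vel = np.array([-0.8, 0.0, 0.0])    # moving left, toward the field of view
t_enter = expected_entry_time(occlusion_pos, occlusion_vel)   # about 0.9 s in this example
```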
If the spatial occlusion does not satisfy the field of view criteria, or once processing for displaying the spatial occlusion in the field of view is underway or has been performed, a determination is made at step 1012 as to whether an audio occlusion relationship exists between the virtual object and the real object. If no audio occlusion relationship exists, the audio data is output at step 1016. If an audio occlusion relationship exists, the audio data of the occluded object in the relationship is modified at step 1014 based on one or more physical properties associated with the occluding object in the relationship, and the modified audio data is output at step 1018.
FIG. 13B is a flow diagram of an example implementation process for determining whether an audio occlusion relationship exists between a virtual object and a real object based on one or more sound occlusion models associated with one or more physical properties of the occluding object. At step 1022, the 3D audio engine 304 identifies at least one sound occlusion model associated with one or more physical properties of the occluding object, the model(s) representing at least one sound effect and at least one distance range over which the at least one effect applies. At step 1024, the 3D audio engine retrieves the depth distance between the objects in the spatial occlusion relationship and determines at step 1026 whether the occluded object is within the at least one distance range. If not, the unmodified audio data is output as in step 1016.
In response to the occluded object being within the at least one distance range, the 3D audio engine 304 determines at step 1028 whether the sound-emitting portion of the occluded object associated with the audio data is occluded. The portion of the object that emits the sound can be identified based on the object type of the occluded object and the sound identified as being emitted by it. Whether the sound-emitting portion is blocked can then be determined from the 3D spatial position data of the occluded object and the occluding object. For example, if a partially occluded object is a person but the person's face is not blocked at all, there is no audio occlusion of voice data from that person.
In response to the sound-emitting portion being occluded (i.e., blocked) by the occluding object, at step 1030, the 3D audio engine 304 modifies the audio data according to the at least one sound effect represented by the identified sound occlusion model and, at step 1018, outputs the modified audio data.
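The audio path of FIG. 13B might be sketched as follows, assuming a lookup of sound occlusion models keyed by a physical property (material) of the occluding object and a one-pole low-pass filter plus attenuation as the sound effect. The model parameters and the filter choice are illustrative assumptions, not the engine's actual models.

```python
import numpy as np

# A sound occlusion model associated with a physical property of the occluding
# object: e.g., a thick wooden surface muffles and attenuates sound within 3 m.
SOUND_OCCLUSION_MODELS = {
    "wood_thick": {"max_range_m": 3.0, "attenuation": 0.4, "lowpass_alpha": 0.15},
}

def one_pole_lowpass(samples, alpha):
    """Simple one-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out = np.empty_like(samples)
    acc = 0.0
    for i, x in enumerate(samples):
        acc += alpha * (x - acc)
        out[i] = acc
    return out

def apply_audio_occlusion(samples, material, distance_m, emitter_blocked):
    model = SOUND_OCCLUSION_MODELS[material]
    if distance_m > model["max_range_m"] or not emitter_blocked:
        return samples                                    # step 1016: output unmodified audio
    muffled = one_pole_lowpass(samples, model["lowpass_alpha"])
    return model["attenuation"] * muffled                 # steps 1014/1030: modified audio

# Example: a person's voice (the occluded object) behind a virtual wooden door.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
voice = 0.5 * np.sin(2 * np.pi * 220 * t)
occluded_voice = apply_audio_occlusion(voice, "wood_thick", distance_m=1.2, emitter_blocked=True)
```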
The example computer systems illustrated in the figures include examples of computer readable storage devices. A computer readable storage device is also a processor readable storage device. Such devices include volatile and nonvolatile, removable and non-removable memory devices implemented in any method or technology for storage of information such as processor readable instructions, data structures, program modules or other data. Examples of processor or computer readable storage devices include RAM, ROM, EEPROM, cache, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, memory sticks or cards, magnetic cassettes, magnetic tape, media drives, hard disks, magnetic disk storage or other magnetic storage devices, or any other device which can be used to store the desired information and which can be accessed by a computer.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method for a head mounted, augmented reality display device system to display realistic occlusion between a real object and a virtual object, comprising:
determining (522, 524) that a spatial occlusion relationship exists between an occluding object and an occluded object, which comprise the real object and the virtual object, based on three-dimensional (3D) positions of the occluding object and the occluded object overlapping in a three-dimensional (3D) mapping of at least one user field of view of the display device system;
determining (502,526,532) an occlusion interface for the spatial occlusion relationship;
determining (508,534) a level of detail for the occlusion interface model based on level of detail criteria;
generating (510,536) the occlusion interface model based on the determined level of detail;
generating (512,537) a modified version of boundary data for the virtual object based on the occlusion interface model; and
displaying (514,538) the virtual object according to the modified version of the virtual object's boundary data.
2. The method of claim 1, wherein the occlusion interface is a partial occlusion interface (704, 706, 708, 710, 712, 720N, 721N, 724N), the partial occlusion interface being an intersection of object boundary data of an occluding portion of the occluding object and an unoccluded portion of the occluded object;
wherein generating a modified version of the boundary data for the virtual object based on the occlusion interface model further comprises generating (512), based on the model, a modified version of the boundary data for the virtual object adjacent to the unoccluded portion of the real object, the generated adjacent boundary data having a shape based on the model; and
wherein displaying the virtual object according to the modified version of the boundary data of the virtual object further comprises displaying (514) an unobstructed portion of the virtual object according to the modified version of the boundary data of the virtual object.
3. The method of claim 2, wherein the level of detail further includes respective gap tolerances between the real object and the virtual object adjacent to the partial occlusion interface.
4. The method of claim 1, wherein the occlusion interface is an adaptive occlusion interface, wherein the virtual object is an occluding object and the real object is an occluded object, and at least a portion of the boundary data of the virtual object is adapted to at least a portion of the boundary data of the real object; and
wherein determining a level of detail for the occlusion interface model based on the level of detail criteria further comprises determining (534) a level of detail of the occlusion interface model for object boundary data of the virtual object based on the level of detail criteria and the object boundary data of at least said portions of the occluding virtual object and the occluded real object.
5. The method of claim 1, wherein determining a level of detail for the occlusion interface model based on level of detail criteria further comprises:
selecting, from a set of levels of detail comprising a plurality of geometry fitting models having different accuracy criteria, a level of detail of the geometric model to be represented in the resulting fitted geometry.
6. The method of claim 1, wherein the level of detail criteria comprises at least one of:
a depth position of the occlusion interface,
a display size of at least a portion of the virtual object in the spatial occlusion relationship,
an illumination value for the 3D spatial position of the occlusion interface,
a distance of the occlusion interface from a point of gaze, and
a speed of the occlusion interface.
7. An augmented reality display device system (8) for providing realistic occlusion, comprising:
an augmented reality display (14) having a user field of view and supported by a support structure (115) of the augmented reality display device system;
at least one camera (113) supported by the support structure for capturing image data and depth data of real objects in a user field of view of the augmented reality display;
one or more processors (202, 210) communicatively coupled to the at least one camera, the one or more processors to receive image and depth data comprising the user field of view;
the one or more software controlled processors are to determine a spatial occlusion relationship between an occluding object and an occluded object based on their three dimensional (3D) positions overlapping in a three dimensional mapping of the user field of view of the augmented reality display device system, the occluding object and the occluded object comprising a real object and a virtual object;
the one or more software controlled processors are communicatively coupled to the augmented reality display, and the one or more processors cause the augmented reality display to represent the spatial occlusion relationship in the display by modifying display of the virtual object (402,403,702,732,733,902,936);
the one or more software controlled processors are communicatively coupled to one or more computer systems (8, 12), and the one or more processors and the one or more computer systems cooperatively determine, in real time, a three-dimensional map of an environment of a user wearing the augmented reality display device system in a common coordinate system based on captured images and depth data of the environment; and
the one or more software controlled processors and the one or more computer systems share in real time at least one occlusion data set (308) comprising a model of an occlusion interface.
8. The system of claim 7, wherein the augmented reality display (14) is a see-through display.
9. The system of claim 7, wherein the one or more processors causing the augmented reality display to represent the spatial occlusion relationship in the display by modifying display of the virtual object further comprises:
the one or more software-controlled processors determine a level of detail for generating a model of an occlusion interface between the real object and the virtual object based on level of detail criteria and generate a modified version of object boundary data for the virtual object based on the generated model; and
the augmented reality display displays the virtual object based on a modified version of object boundary data of the virtual object.
10. The system of claim 7, further comprising:
an earpiece (130) attached to the support structure;
a microphone (110) attached to the support structure;
the one or more software controlled processors are communicatively coupled to the microphone and receive audio data from the microphone;
the one or more software controlled processors are communicatively coupled to the headphones for controlling output of audio data;
the one or more software-controlled processors identifying (1012) an audio occlusion relationship existing between the real object and the virtual object based on the spatial occlusion relationship;
the one or more software-controlled processors executing a three-dimensional audio engine (304) that modifies (1014) audio data of an occluded object in the spatial occlusion relationship based on one or more physical properties associated with the occluding object;
the one or more software controlled processors identify audio data from the microphone as coming from the real object that is the occluded object and modify the audio data of the real object based on one or more physical properties associated with the virtual object that is the occluding object; and
the one or more software controlled processors cause headphones of the display device system to output the modified audio data.
HK14103605.5A 2012-04-10 2014-04-15 Realistic occlusion for a head mounted augmented reality display HK1190481B (en)

Applications Claiming Priority (1)

Application Number: US13/443,368
Priority Date: 2012-04-10

Publications (2)

Publication Number Publication Date
HK1190481A (en) 2014-07-04
HK1190481B (en) 2018-03-02
