
HK1173690B - Methods and systems for determining and tracking extremities of a target - Google Patents

Methods and systems for determining and tracking extremities of a target

Info

Publication number
HK1173690B
Authority
HK
Hong Kong
Prior art keywords
head
extremities
location
voxels
determining
Application number
HK13100850.4A
Other languages
Chinese (zh)
Other versions
HK1173690A1 (en)
Inventor
T. Leyvand
S. P. Stachniak
C. Peeper
Shao Liu
J. Li
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from US 12/616,471 (published as US 8,963,829 B2)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1173690A1
Publication of HK1173690B


Description

Method and system for determining and tracking extremities of a target
Technical Field
The present invention relates to a system and a method for determining and tracking extremities of a user in a scene.
Background
Many computing applications, such as computer games, multimedia applications, and the like, use controls to allow a user to manipulate game characters or other aspects of the application. Such controls are typically entered using, for example, a controller, remote control, keyboard, mouse, etc. Unfortunately, these controls can be difficult to learn, thereby creating a barrier between the user and these games and applications. Further, these control commands may be different from the actual game actions or other application actions for which these control commands are intended. For example, a game control that causes a game character to swing a baseball bat may not correspond to the actual motion of swinging a baseball bat.
Disclosure of Invention
Disclosed herein are systems and methods for tracking extremities of a user in a scene. For example, an image such as a depth image of a scene may be received or observed. A grid of voxels may be generated based on the depth image in order to downsample the depth image. For example, a depth image may include a plurality of pixels, which may be divided into portions or blocks. Voxels may then be generated for each portion or block such that the received depth image is downsampled into a grid of voxels.
According to an example embodiment, the background included in the grid of voxels may be removed to isolate one or more voxels associated with a foreground object (such as a human target). For example, each voxel in the grid may be analyzed to determine whether the voxel may be associated with a foreground object (such as a human target) or a background object. Voxels associated with background objects may then be removed or discarded to isolate foreground objects such as human targets.
A location or position of one or more extremities of the isolated human target may then be determined. For example, in one embodiment, the location of an extremity (such as a centroid or center, head, shoulder, hip, arm, hand, elbow, leg, foot, knee, etc.) of the isolated human target may be determined. According to example embodiments, the location or position of one or more extremities may be determined using a scoring technique applied to candidates for those extremities, using one or more anchor points and averages associated with those extremities, using a block associated with those extremities, and so on. The location or position of one or more extremities may also be refined based on pixels associated with the one or more extremities in the non-downsampled depth image.
One or more extremities may be further processed. For example, in one embodiment, a model, such as a skeletal model, may be generated and/or adjusted based on the location or position of one or more extremities.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Drawings
FIGS. 1A and 1B illustrate an example embodiment of a target recognition, analysis, and tracking system in which a user is playing a game.
FIG. 2 illustrates an example embodiment of a capture device that may be used in a target recognition, analysis, and tracking system.
FIG. 3 illustrates an example embodiment of a computing environment that may be used to interpret one or more gestures in a target recognition, analysis, and tracking system and/or animate an avatar or on-screen character displayed by the target recognition, analysis, and tracking system.
FIG. 4 illustrates another example embodiment of a computing environment that may be used to interpret one or more gestures in a target recognition, analysis, and tracking system and/or animate an avatar or on-screen character displayed by the target recognition, analysis, and tracking system.
FIG. 5 depicts a flow diagram of an example method for determining extremities of a user in a scene.
FIG. 6 illustrates an example embodiment of a depth image that may be used to track a user's extremity.
FIGS. 7A-7B illustrate example embodiments of a portion of a depth image being downsampled.
FIG. 8 illustrates an example embodiment of a centroid or center estimated for a human target.
FIG. 9 illustrates an example embodiment of a bounding box that may be defined to determine a core volume.
FIG. 10 illustrates an example embodiment of candidate cylinders, such as a head cylinder and a shoulder cylinder, that may be created to score extremity candidates, such as head candidates.
FIG. 11 illustrates an example embodiment of a head-to-center vector determined based on the head and the centroid or center of the human target.
FIG. 12 illustrates an example embodiment of limb segments, such as a shoulder segment and a hip segment, determined based on the head-to-center vector.
FIG. 13 illustrates an example embodiment of extremities, such as shoulders and hips, that may be calculated based on shoulder and hip volumes.
FIG. 14 illustrates an example embodiment of a cylinder that may represent a core volume.
FIGS. 15A-15C illustrate example embodiments of an extremity, such as a hand, determined based on anchor points.
FIG. 16 illustrates an example embodiment of extremities, such as hands and feet, that may be calculated based on an average position of the extremities, such as arms and legs, and/or an anchor point.
FIG. 17 illustrates an example embodiment of a model that may be generated.
Detailed description of illustrative embodiments
FIGS. 1A and 1B illustrate an example embodiment of a configuration of the target recognition, analysis, and tracking system 10 in which a user 18 is playing a boxing game. In an example embodiment, the target recognition, analysis, and tracking system 10 may be used to recognize, analyze, and/or track a human target such as the user 18.
As shown in FIG. 1A, the target recognition, analysis, and tracking system 10 may include a computing environment 12. The computing environment 12 may be a computer, a gaming system or console, or the like. According to an example embodiment, the computing environment 12 may include hardware components and/or software components such that the computing environment 12 may be used to execute applications such as gaming applications, non-gaming applications, and the like. In one embodiment, the computing environment 12 may include a processor, such as a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions, including, for example, instructions for receiving a depth image; instructions for generating a voxel grid based on the depth image; instructions for removing a background included in the voxel grid to isolate one or more voxels associated with the human target; instructions for determining a location or position of one or more extremities of the isolated human target; instructions for adjusting the model based on the location or position of the one or more extremities; or any other suitable instructions, which will be described in more detail below.
As shown in FIG. 1A, the target recognition, analysis, and tracking system 10 may also include a capture device 20. The capture device 20 may be, for example, a camera that may be used to visually monitor one or more users, such as the user 18, so that gestures and/or movements performed by the one or more users may be captured, analyzed, and tracked to perform one or more controls or actions in an application and/or animate an avatar or on-screen character, as will be described in more detail below.
According to one embodiment, the target recognition, analysis, and tracking system 10 may be connected to an audiovisual device 16, such as a television, a monitor, a high-definition television (HDTV), or the like, that may provide game or application visuals and/or audio to a user, such as the user 18. For example, the computing environment 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audiovisual signals associated with the game application, non-game application, or the like. The audiovisual device 16 may receive the audiovisual signals from the computing environment 12 and may then output the game or application visuals and/or audio associated with the audiovisual signals to the user 18. According to one embodiment, the audiovisual device 16 may be connected to the computing environment 12 via, for example, an S-video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, or the like.
As shown in fig. 1A and 1B, the target recognition, analysis, and tracking system 10 may be used to recognize, analyze, and/or track a human target such as a user 18. For example, the user 18 may be tracked using the capture device 20 such that gestures and/or movements of the user 18 may be captured to animate an avatar or on-screen character and/or the gestures and/or movements of the user 18 may be interpreted as control commands that may be used to affect an application executed by the computing environment 12. Thus, according to one embodiment, the user 18 may move his or her body to control an application and/or animate an avatar or on-screen character.
As shown in FIGS. 1A and 1B, in an exemplary embodiment, the application executing on the computing environment 12 may be a boxing game that the user 18 may be playing. For example, the computing environment 12 may use the audiovisual device 16 to provide a visual representation of a boxing opponent 38 to the user 18. The computing environment 12 may also use the audiovisual device 16 to provide a visual representation of a player avatar 40 that the user 18 may control with his or her movements. For example, as shown in FIG. 1B, the user 18 may throw a punch in physical space to cause the player avatar 40 to throw a punch in game space. Thus, according to an example embodiment, the computing environment 12 and the capture device 20 of the target recognition, analysis, and tracking system 10 may be used to recognize and analyze the punch of the user 18 in physical space such that the punch may be interpreted as a game control of the player avatar 40 in game space and/or the motion of the punch may be used to animate the player avatar 40 in game space.
Other movements of the user 18 may also be interpreted as other controls or actions, and/or used to animate the player avatar, such as controls to bob, weave, shuffle, block, jab, or throw a variety of different power punches. Further, certain movements may be interpreted as controls that may correspond to actions other than controlling the player avatar 40. For example, in one embodiment, a player may use movements to end, pause or save a game, select a level, view high scores, communicate with friends, and so forth. According to another embodiment, the player may use the movements to select a game or other application from the main user interface. Thus, in an example embodiment, the full range of motion of user 18 may be obtained, used, and analyzed in any suitable manner to interact with the application.
In example embodiments, a human target, such as the user 18, may hold an object. In these embodiments, a user of the electronic game may hold the object so that the motions of the player and the object may be used to adjust and/or control parameters of the game. For example, the motion of a player holding a racket may be tracked and utilized to control an on-screen racket in an electronic sports game. In another example embodiment, the motion of a player holding an object may be tracked and utilized to control an on-screen weapon in an electronic combat game.
According to other example embodiments, the target recognition, analysis, and tracking system 10 may also be used to interpret target movements as operating system and/or application controls outside the realm of gaming. For example, virtually any controllable aspect of an operating system and/or application may be controlled by movements of a target such as the user 18.
FIG. 2 illustrates an example embodiment of a capture device 20 that may be used in the target recognition, analysis, and tracking system 10. According to an example embodiment, the capture device 20 may be configured to capture video with depth information, including a depth image that may include depth values, via any suitable technique including, for example, time-of-flight, structured light, stereo image, or the like. According to one embodiment, the capture device 20 may organize the depth information into "Z layers," or layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.
As shown in FIG. 2, the capture device 20 may include an image camera component 22. According to one exemplary embodiment, the image camera component 22 may be a depth camera that may capture a depth image of a scene. The depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as, for example, a length or distance in centimeters, millimeters, or the like of an object in the captured scene from the camera.
As shown in FIG. 2, according to an exemplary embodiment, the image camera component 22 may include an IR light component 24, a three-dimensional (3-D) camera 26, and an RGB camera 28 that may be used to capture a depth image of a scene. For example, in time-of-flight analysis, the IR light component 24 of the capture device 20 may emit infrared light onto the scene and may then use sensors (not shown) to detect the backscattered light from the surface of one or more targets and objects in the scene with, for example, the 3-D camera 26 and/or the RGB camera 28. In some embodiments, pulsed infrared light may be used such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on a target or object in the scene. Additionally, in other exemplary embodiments, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
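For the phase-shift variant, the distance follows directly from the measured phase difference and the modulation frequency of the emitted light. A minimal sketch in Python; the 30 MHz modulation frequency below is an illustrative assumption, not a value from this patent:

    import math

    C = 299_792_458.0  # speed of light, m/s

    def tof_distance(phase_shift_rad, mod_freq_hz):
        # A round-trip phase shift phi at modulation frequency f implies
        # a one-way distance d = c * phi / (4 * pi * f); the factor of 2
        # for the out-and-back path is folded into the 4*pi.
        return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

    # A pi/2 phase shift at an assumed 30 MHz modulation frequency
    # corresponds to roughly 1.25 m from the sensor.
    print(tof_distance(math.pi / 2, 30e6))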
According to another exemplary embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the target or object by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.
In another exemplary embodiment, the capture device 20 may use structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the scene via, for example, the IR light component 24. Upon striking the surface of one or more targets or objects in the scene, the pattern may become deformed in response. Such a deformation of the pattern may be captured by, for example, the 3-D camera 26 and/or the RGB camera 28, and may then be analyzed to determine a physical distance from the capture device to a particular location on the targets or objects.
According to another embodiment, the capture device 20 may include two or more physically separated cameras that may view a scene from different angles to obtain visual stereo data that may be resolved to generate depth information.
The capture device 20 may also include a microphone 30. Microphone 30 may include a transducer or sensor that may receive sound and convert it into an electrical signal. According to one embodiment, the microphone 30 may be used to reduce feedback between the capture device 20 and the computing environment 12 in the target recognition, analysis, and tracking system 10. Additionally, the microphone 30 may be used to receive audio signals that may also be provided by the user to control applications such as gaming applications, non-gaming applications, etc., that may be executed by the computing environment 12.
In an exemplary embodiment, the capture device 20 may also include a processor 32 that may be in operable communication with the image camera component 22. Processor 32 may include a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions including, for example, instructions for receiving a depth image; instructions for generating a voxel grid based on the depth image; instructions for removing a background included in the voxel grid to isolate one or more voxels associated with the human target; instructions for determining a location or position of one or more extremities of the isolated human target; or any other suitable instructions, which will be described in more detail below.
The capture device 20 may also include a memory component 34, where the memory component 34 may store instructions executable by the processor 32, images or frames of images captured by the 3-D camera or RGB camera, or any other suitable information, images, or the like. According to an example embodiment, the memory component 34 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 2, in one embodiment, the memory component 34 may be a separate component in communication with the image capture component 22 and the processor 32. According to another embodiment, the memory component 34 may be integrated into the processor 32 and/or the image capture component 22.
As shown in FIG. 2, the capture device 20 may communicate with the computing environment 12 via a communication link 36. The communication link 36 may be a wired connection including, for example, a USB connection, a firewire connection, an ethernet cable connection, etc., and/or a wireless connection such as a wireless 802.11b, g, a, or n connection, etc. According to one embodiment, the computing environment 12 may provide a clock to the capture device 20 via the communication link 36 that may be used to determine when to capture, for example, a scene.
Additionally, the capture device 20 may provide the depth information and images captured by, for example, the 3-D camera 26 and/or the RGB camera 28, and/or a skeletal model that may be generated by the capture device 20 to the computing environment 12 via the communication link 36. The computing environment 12 may then use the model, depth information, and captured images to, for example, control an application such as a game or word processor and/or animate an avatar or on-screen character. For example, as shown in FIG. 2, the computing environment 12 may include a gestures library 190. The gestures library 190 may include a collection of gesture filters, each comprising information about a gesture that the model may perform (as the user moves). The data captured by the cameras 26, 28 and capture device 20 in the form of models and the movements associated therewith may be compared to gesture filters in the gesture library 190 to identify when the user (as represented by the models) has performed one or more gestures. Those gestures may be associated with various controls of an application. Thus, the computing environment 12 may use the gestures library 190 to interpret movements of the model and control the application based on the movements.
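As a loose illustration of how such a comparison might work, a gesture filter could hold a reference joint trajectory and accept a captured trajectory whose mean deviation stays small. The data layout and distance test below are assumptions for illustration, not the actual API of the gestures library 190:

    import numpy as np

    def matches_filter(captured, reference, tolerance=0.15):
        # captured/reference: frames x joints x 3 arrays of model-space
        # joint positions (X, Y, Z); the gesture is recognized when the
        # mean per-joint deviation stays within the tolerance.
        if captured.shape != reference.shape:
            return False
        return float(np.linalg.norm(captured - reference, axis=-1).mean()) <= tolerance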
FIG. 3 illustrates an example embodiment of a computing environment that may be used to interpret one or more gestures in a target recognition, analysis, and tracking system and/or animate an avatar or on-screen character displayed by the target recognition, analysis, and tracking system. The computing environment such as the computing environment 12 described above with reference to FIGS. 1A-2 may be a multimedia console 100 such as a gaming console. As shown in FIG. 3, the multimedia console 100 has a Central Processing Unit (CPU) 101 having a level 1 cache 102, a level 2 cache 104, and a flash ROM (Read Only Memory) 106. The level 1 cache 102 and the level 2 cache 104 temporarily store data and thus reduce the number of memory access cycles, thereby improving processing speed and throughput. The CPU 101 may be provided with more than one core, and thus additional level 1 caches 102 and level 2 caches 104. The flash ROM 106 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 100 is powered ON.
A Graphics Processing Unit (GPU) 108 and a video encoder/video codec (coder/decoder) 114 form a video processing pipeline for high speed and high resolution graphics processing. Data is carried from the graphics processing unit 108 to the video encoder/video codec 114 via a bus. The video processing pipeline outputs data to an a/V (audio/video) port 140 for transmission to a television or other display. A memory controller 110 is connected to the GPU 108 to facilitate processor access to various types of memory 112, such as, but not limited to, RAM (random access memory).
The multimedia console 100 includes an I/O controller 120, a system management controller 122, an audio processing unit 123, a network interface controller 124, a first USB host controller 126, a second USB controller 128, and a front panel I/O subassembly 130 that are preferably implemented on a module 118. The USB controllers 126 and 128 serve as hosts for peripheral controllers 142(1)-142(2), a wireless adapter 148, and an external memory device 146 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface controller 124 and/or wireless adapter 148 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.
System memory 143 is provided to store application data that is loaded during the boot process. A media drive 144 is provided and may comprise a DVD/CD drive, hard drive, or other removable media drive, among others. The media drive 144 may be internal or external to the multimedia console 100. Application data may be accessed via the media drive 144 for execution, playback, etc. by the multimedia console 100. The media drive 144 is connected to the I/O controller 120 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).
The system management controller 122 provides various service functions related to ensuring availability of the multimedia console 100. The audio processing unit 123 and the audio codec 132 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is transmitted between the audio processing unit 123 and the audio codec 132 via a communication link. The audio processing pipeline outputs data to the A/V port 140 for reproduction by an external audio player or device having audio capabilities.
The front panel I/O subassembly 130 supports the functionality of the power button 150 and the eject button 152, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 100. The system power supply module 136 provides power to the components of the multimedia console 100. A fan 138 cools the circuitry within the multimedia console 100.
The CPU 101, GPU 108, memory controller 110, and various other components within the multimedia console 100 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures may include a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, and the like.
When the multimedia console 100 is powered ON, application data may be loaded from the system memory 143 into memory 112 and/or caches 102, 104, and executed on the CPU 101. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 100. In operation, applications and/or other media contained within the media drive 144 may be launched or played from the media drive 144 to provide additional functionalities to the multimedia console 100.
The multimedia console 100 may be operated as a standalone system by simply connecting the system to a television or other display. In the standalone mode, the multimedia console 100 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface controller 124 or the wireless adapter 148, the multimedia console 100 may further be operated as a participant in a larger network community.
When the multimedia console 100 is powered on, a set amount of hardware resources may be reserved for system use by the multimedia console operating system. These resources may include a reserve of memory (such as 16 MB), a reserve of CPU and GPU cycles (such as 5%), a reserve of network bandwidth (such as 8 kbps), and so on. Because these resources are reserved at system boot time, the reserved resources are not present from the application's perspective.
In particular, the memory reservation is preferably large enough to contain the launch kernel, concurrent system applications and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, the idle thread will consume any unused cycles.
For the GPU reservation, lightweight messages generated by system applications (e.g., popups) are displayed by using a GPU interrupt to schedule code to render popup into an overlay. The amount of memory required for the overlay depends on the overlay area size, and the overlay preferably scales with the screen resolution. Where the concurrent system application uses a full user interface, it is preferable to use a resolution that is independent of the application resolution. A scaler may be used to set this resolution so that there is no need to change the frequency and cause a TV resynch.
After the multimedia console 100 boots and system resources are reserved, concurrent system applications execute to provide system functionality. The system functions are encapsulated in a set of system applications that execute within the aforementioned reserved system resources. The operating system kernel identifies threads as system application threads versus gaming application threads. The system applications are preferably scheduled to run on the CPU 101 at predetermined times and intervals in order to provide a consistent view of system resources to the application. The scheduling is to minimize cache disruption for the gaming application running on the console.
When the concurrent system application requires audio, audio processing is asynchronously scheduled to the gaming application due to time sensitivity. A multimedia console application manager (described below) controls the audio level (e.g., mute, attenuate) of the gaming application while system applications are active.
The input devices (e.g., controllers 142(1) and 142(2)) are shared by the gaming application and the system applications. Rather than reserving resources, the input devices are switched between the system application and the gaming application so that each has a focus of the device. The application manager preferably controls the switching of the input stream without the gaming application's knowledge, and a driver maintains state information regarding focus switches. The cameras 26, 28 and capture device 20 may define additional input devices for the multimedia console 100.
FIG. 4 illustrates another example embodiment of a computing environment 220, which may be the computing environment 12 shown in FIGS. 1A-2 for interpreting one or more gestures in the target recognition, analysis, and tracking system and/or animating an avatar or on-screen character displayed by the target recognition, analysis, and tracking system. The computing environment 220 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the disclosed subject matter. Neither should the computing environment 220 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing environment 220. In some embodiments, the various depicted computing elements may include circuitry configured to instantiate specific aspects of the present disclosure. For example, the term "circuitry" as used in this disclosure may include dedicated hardware components configured to perform functions through firmware or switches. In other example embodiments, the term circuitry may include a general purpose processing unit, memory, etc., configured by software instructions that implement logic that may be used to perform functions. In example embodiments where circuitry includes a combination of hardware and software, an implementer may write source code embodying logic, and the source code can be compiled into machine readable code that can be processed by the general purpose processing unit. Because those skilled in the art will appreciate that the state of the art has evolved to the point where there is little difference between hardware, software, or a combination of hardware/software, the selection of hardware or software to implement a particular function is a design choice left to the implementer. More specifically, those skilled in the art will appreciate that a software process can be transformed into an equivalent hardware structure, and a hardware structure can itself be transformed into an equivalent software process. Thus, the choice of hardware or software implementation is one of design choice and left to the implementer.
In FIG. 4, the computing environment 220 includes a computer 241, the computer 241 typically including a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computer 241 and includes both volatile and nonvolatile media, removable and non-removable media. The system memory 222 includes computer storage media in the form of volatile and/or nonvolatile memory such as Read Only Memory (ROM) 223 and Random Access Memory (RAM) 260. A basic input/output system 224 (BIOS), containing the basic routines that help to transfer information between elements within computer 241, such as during start-up, is typically stored in ROM 223. RAM 260 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 259. By way of example, and not limitation, FIG. 4 illustrates operating system 225, application programs 226, other program modules 227, and program data 228.
The computer 241 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 4 illustrates a hard disk drive 238 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 239 that reads from or writes to a removable, nonvolatile magnetic disk 254, and an optical disk drive 240 that reads from or writes to a removable, nonvolatile optical disk 253 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 238 is typically connected to the system bus 221 through a non-removable memory interface such as interface 234, and magnetic disk drive 239 and optical disk drive 240 are typically connected to the system bus 221 by a removable memory interface, such as interface 235.
The drives and their associated computer storage media discussed above and illustrated in FIG. 4 provide storage of computer readable instructions, data structures, program modules, and other data for the computer 241. In FIG. 4, for example, hard disk drive 238 is illustrated as storing operating system 258, application programs 257, other program modules 256, and program data 255. Note that these components can either be the same as or different from operating system 225, application programs 226, other program modules 227, and program data 228. Operating system 258, application programs 257, other program modules 256, and program data 255 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 241 through input devices such as a keyboard 251 and pointing device 252, commonly referred to as a mouse, trackball, or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 259 through a user input interface 236 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port, or a Universal Serial Bus (USB). The cameras 26, 28 and capture device 20 may define additional input devices for the computer 241. A monitor 242 or other type of display device is also connected to the system bus 221 via an interface, such as a video interface 232. In addition to the monitor, computers may also include other peripheral output devices such as speakers 244 and printer 243, which may be connected through an output peripheral interface 233.
The computer 241 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 246. The remote computer 246 may be a personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer 241, although only a memory storage device 247 has been illustrated in FIG. 4. The logical connections depicted in FIG. 4 include a Local Area Network (LAN) 245 and a Wide Area Network (WAN) 249, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, the computer 241 is connected to the LAN 245 through a network interface or adapter 237. When used in a WAN networking environment, the computer 241 typically includes a modem 250 or other means for establishing communications over the WAN 249, such as the Internet. The modem 250, which may be internal or external, may be connected to the system bus 221 via the user input interface 236, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 241, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 4 illustrates remote application programs 248 as residing on memory storage device 247. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
FIG. 5 depicts a flow diagram of an example method 300 for determining extremities of a user in a scene. The example method 300 may be implemented using, for example, the capture device 20 and/or the computing environment 12 of the target recognition, analysis, and tracking system 10 described with reference to FIGS. 1A-4. In an example embodiment, the example method 300 may take the form of program code (i.e., instructions) that may be executed by, for example, the capture device 20 and/or the computing environment 12 of the target recognition, analysis, and tracking system 10 described with reference to FIGS. 1A-4, a processor, a server, a computer, a mobile device such as a mobile phone, or any other suitable electronic device hardware component.
According to one embodiment, at 305, a depth image may be received. For example, the target recognition, analysis, and tracking system may include a capture device such as the capture device 20 described with reference to FIGS. 1A-2. A capture device may capture or observe a scene that may include one or more targets. In an example embodiment, the capture device may be a depth camera configured to obtain a depth image of the scene using any suitable technique, such as time-of-flight analysis, structured light analysis, stereo vision analysis, and so forth.
The depth image may be a plurality of observed pixels, where each observed pixel has an observed depth value. For example, the depth image may include a two-dimensional (2D) pixel area of the captured scene where each pixel in the 2D pixel area may represent a depth value, such as a length or distance, e.g., in centimeters, millimeters, or the like, of an object in the captured scene from the capture device.
FIG. 6 illustrates an example embodiment of a depth image 400 that may be received at 305. According to an example embodiment, the depth image 400 may be an image or frame of a scene captured by, for example, the 3-D camera 26 and/or the RGB camera 28 of the capture device 20 described above with reference to FIG. 2. As shown in FIG. 6, the depth image 400 may include a human target 402a corresponding to a user, such as the user 18 described with reference to FIGS. 1A and 1B, and one or more non-human targets 404, such as walls, tables, monitors, and the like, in the captured scene. As described above, the depth image 400 may include a plurality of observed pixels, where each observed pixel has an observed depth value associated therewith. For example, the depth image 400 may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may have a depth value, such as a length or distance, e.g., in centimeters, millimeters, or the like, of an object or target in the captured scene from the capture device.
In one embodiment, the depth image 400 may be colorized such that different colors of the pixels of the depth image correspond to and/or visually depict different distances of the human target 402a and the non-human targets 404 from the capture device. For example, according to one embodiment, pixels in the depth image associated with a target closest to the capture device may be shaded with red and/or orange shades, while pixels in the depth image associated with a target farther away may be shaded with green and/or blue shades.
Referring back to FIG. 5, in one embodiment, upon receiving the depth image at 305, processing may be performed on the depth image such that depth information associated with the depth image may be used to generate a model, track a user, and the like. For example, highly variable and/or noisy depth values may be removed, depth values may be smoothed, missing depth information may be filled in and/or reconstructed, or any other suitable processing may be performed on the depth image.
According to an example embodiment, at 310, a grid of one or more voxels may be generated based on the received depth image. For example, the target recognition, analysis, and tracking system may downsample the received depth image by generating one or more voxels using information included in the received depth image so that a downsampled depth image may be generated. In one embodiment, the one or more voxels may be volume elements that may represent data or values of the information included in the received depth image on a sub-sampled grid.
For example, as described above, the depth image may include a 2-D pixel area of the captured scene where each pixel has an X value, a Y value, and a depth value (or Z value) associated therewith. In an embodiment, the depth image may be downsampled by reducing the pixels in the 2-D pixel area to a grid of one or more voxels. For example, the depth image may be divided into portions or blocks of pixels, such as 4x4 blocks of pixels, 5x5 blocks of pixels, 8x8 blocks of pixels, 10x10 blocks of pixels, and so forth. Each portion or block may be processed to generate a voxel of the depth image that may represent the position, in real-world space, of the portion or block of pixels of the 2-D depth image. According to an example embodiment, the position of each voxel may be based on, for example, an average of the valid or non-zero depth values of the pixels in the block or portion that the voxel may represent, a minimum, maximum, and/or median depth value of those pixels, an average of the X values and Y values of those pixels that have valid depth values, or any other suitable information provided by the depth image. Thus, according to an example embodiment, each voxel may represent a sub-volume portion or block of the depth image whose values are derived from the X values, Y values, and depth values of the corresponding portion or block of pixels of the depth image received at 305.
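A minimal sketch of this downsampling step, assuming the depth image arrives as a NumPy array, an 8x8 block size, and zero as the invalid depth value (one of the conventions the passage allows):

    import numpy as np

    def downsample_to_voxels(depth, block=8):
        # Reduce a 2-D depth image to a coarser grid by replacing each
        # block x block patch with the mean of its valid (non-zero) depths.
        h, w = depth.shape
        grid = np.zeros((h // block, w // block), dtype=np.float32)
        for gy in range(h // block):
            for gx in range(w // block):
                patch = depth[gy*block:(gy+1)*block, gx*block:(gx+1)*block]
                valid = patch[patch > 0]
                grid[gy, gx] = valid.mean() if valid.size else 0  # 0 = invalid voxel
        return grid

The per-voxel value could equally be the minimum, maximum, or median of the patch, as the paragraph above notes; the mean is used here only as one concrete choice.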
In one embodiment, a grid of one or more voxels in the downsampled depth image may be layered. For example, the target recognition, analysis, and tracking system may generate voxels as described above. The target recognition, analysis, and tracking system may then stack the generated voxel on one or more other generated voxels in the grid.
According to an example embodiment, the target recognition, analysis, and tracking system may stack voxels in the grid, for example, around edges of objects in the scene captured in the depth image. For example, the depth image received at 305 may include a human target and a non-human target (such as a wall). At, for example, an edge of the human target, the human target may overlap the non-human target (such as the wall). In one embodiment, the overlapping edge may include information such as depth values, X values, Y values, and the like associated with both the human target and the non-human target that may be captured in the depth image. The target recognition, analysis, and tracking system may generate a voxel associated with the human target and a voxel associated with the non-human target at the overlapping edge so that the voxels may be stacked while the information (such as depth values, X values, Y values, etc.) of the overlapping edge may be maintained in the grid.
According to another embodiment, a grid of one or more voxels may be generated at 310 by, for example, projecting information (such as depth values, X values, Y values, etc.) into three-dimensional (3-D) space. For example, the depth values may be mapped to 3-D points in the 3-D space using a transformation such as a camera, image, or perspective transformation, so that the information may be transformed into a trapezoidal or conical shape in the 3-D space. In one embodiment, the 3-D space having the trapezoidal or conical shape may be divided into blocks, such as cubes, that may create a grid of voxels, such that each of the blocks or cubes may represent a voxel in the grid. For example, the target recognition, analysis, and tracking system may superimpose a 3-D grid on the 3-D points that correspond to objects in the depth image. The target recognition, analysis, and tracking system may then divide or cut the grid into blocks representing voxels to downsample the depth image to a lower resolution. According to an example embodiment, each voxel in the grid may include an average of the valid or non-zero depth values of the pixels associated with the 3-D space in the grid that the voxel may represent, a minimum and/or maximum depth value of those pixels, an average of the X values and Y values of those pixels that have valid depth values, or any other suitable information provided by the depth image.
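A hedged sketch of such a projection using a pinhole-style perspective transformation; the intrinsics fx, fy, cx, cy are assumed camera parameters for illustration, not values given in the patent:

    import numpy as np

    def backproject(depth, fx, fy, cx, cy):
        # Map each pixel (u, v) with depth z to a 3-D point using the
        # usual pinhole relations X = (u - cx) * z / fx, Y = (v - cy) * z / fy.
        v, u = np.indices(depth.shape)
        z = depth.astype(np.float32)
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        # H x W x 3 point cloud; slicing this volume into cubes would
        # yield the voxel grid described above.
        return np.stack([x, y, z], axis=-1)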
FIGS. 7A-7B illustrate example embodiments of a portion of a depth image being downsampled. For example, as shown in FIG. 7A, a portion 410 of the depth image 400 described above with reference to FIG. 6 may include a plurality of pixels 420, where each pixel 420 may have an X value, a Y value, and a depth value (or Z value) associated therewith. According to one embodiment, a depth image (such as the depth image 400) may be downsampled by reducing the pixels in the 2-D pixel area to a grid of one or more voxels, as described above. For example, as shown in FIG. 7A, the portion 410 of the depth image 400 may be divided into a portion or block 430 of the pixels 420, such as an 8x8 block of the pixels 420. The target recognition, analysis, and tracking system may process the portion or block 430 to generate a voxel 440 that may represent the position, in real-world space, of the portion or block 430 associated with the pixels 420, as shown in FIGS. 7A-7B.
Referring back to FIG. 5, at 315, the background may be removed from the downsampled depth image. For example, a background (such as non-human targets or objects) in the downsampled depth image may be removed to isolate a foreground object (such as a human target associated with the user). As described above, the target recognition, analysis, and tracking system may downsample the captured or observed depth image by generating a grid of one or more voxels for the captured or observed depth image. The target recognition, analysis, and tracking system may analyze each voxel of the downsampled depth image to determine whether the voxel may be associated with a background object (such as one or more non-human targets of the depth image). If a voxel may be associated with a background object, the voxel may be removed or discarded from the downsampled depth image so that the foreground object (such as the human target) and the one or more voxels in the grid associated with the foreground object may be isolated.
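A deliberately simple sketch of the voxel cull: the passage describes a per-voxel analysis against background objects, which is reduced here to a bare depth threshold (an assumption for illustration only). Zero marks a removed/invalid voxel, matching the convention used later when searching the grid:

    import numpy as np

    def remove_background(grid, max_depth_mm):
        # Zero out voxels judged to belong to the background.  Here the
        # test is a bare depth threshold; the system described above
        # would use richer analysis (object identification, etc.).
        fg = grid.copy()
        fg[fg > max_depth_mm] = 0  # 0 = invalid/removed voxel
        return fg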
At 320, one or more extremities (such as one or more body parts) may be determined for the isolated foreground object (such as the human target). For example, in one embodiment, the target recognition, analysis, and tracking system may apply one or more heuristics or rules to the isolated human target to determine, for example, a centroid or center, a head, shoulders, a torso, arms, legs, and so forth associated with the isolated human target. According to one embodiment, based on the determination of the extremities, the target recognition, analysis, and tracking system may generate and/or adjust a model of the isolated human target. For example, if the depth image received at 305 may be included in an initial frame observed or captured by a capture device (such as the capture device 20 described above with reference to FIGS. 1A-2), a model may be generated by assigning joints of the model to the locations of the extremities (such as the centroid, head, shoulders, arms, hands, legs, etc.) determined at 320, as will be described in more detail below. Alternatively, if the depth image may be included in a subsequent or non-initial frame observed or captured by the capture device, a model that may have been previously generated may be adjusted based on the locations of the extremities determined at 320.
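As a toy sketch of this generate-or-adjust step, a skeletal model can be held as a mapping from joint names to positions that is created outright on the first frame and nudged toward new extremity estimates afterwards. The joint layout, blending rule, and alpha value are illustrative assumptions, not the patent's method:

    from typing import Dict, Optional, Tuple

    Vec3 = Tuple[float, float, float]

    def update_model(model: Optional[Dict[str, Vec3]],
                     extremities: Dict[str, Vec3],
                     alpha: float = 0.5) -> Dict[str, Vec3]:
        # Initial frame: assign each joint directly to its determined
        # extremity location.  Subsequent frames: move each joint a
        # fraction alpha of the way toward the new estimate.
        if model is None:
            return dict(extremities)
        return {name: tuple(m + alpha * (e - m)
                            for m, e in zip(model[name], new_pos))
                for name, new_pos in extremities.items()}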
According to an example embodiment, upon isolating the foreground object (such as the human target) at 315, the target recognition, analysis, and tracking system may, at 320, calculate an average of the voxels in the human target to, for example, estimate a centroid or center of the human target. For example, the target recognition, analysis, and tracking system may calculate an average position of the voxels included in the human target, which may provide an estimate of the centroid or center of the human target. In one embodiment, the target recognition, analysis, and tracking system may calculate the average position of the voxels associated with the human target based on the X values, Y values, and depth values associated with the voxels. For example, as described above, the target recognition, analysis, and tracking system may calculate the X value of a voxel by averaging the X values of the pixels associated with the voxel, may calculate the Y value of the voxel by averaging the Y values of the pixels associated with the voxel, and may calculate the depth value of the voxel by averaging the depth values of the pixels associated with the voxel. At 320, the target recognition, analysis, and tracking system may average the X values, Y values, and depth values of the voxels included in the human target to calculate the average position, which may provide the estimate of the centroid or center of the human target.
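That averaging translates directly to code; here the isolated foreground voxels are assumed to arrive as an N x 3 array of (X, Y, depth) values:

    import numpy as np

    def estimate_center(voxels_xyz):
        # Estimate the human target's centroid or center as the mean of
        # the X, Y, and depth values of its isolated foreground voxels.
        return voxels_xyz.mean(axis=0)

    # e.g., three voxels -> center at (1.0, 1.0, 2000.0)
    print(estimate_center(np.array([[0, 0, 1800],
                                    [1, 1, 2000],
                                    [2, 2, 2200]], dtype=float)))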
FIG. 8 illustrates an example embodiment of a centroid or center estimated for human target 402 b. According to an example embodiment, the location or position 802 of the centroid or center may be based on an average position or location of voxels associated with the isolated human target 402b as described above.
Referring back to FIG. 5, the target recognition, analysis, and tracking system may then, at 320, define a bounding box for the human target to determine a core volume of the human target that may include, for example, a head and/or a torso of the human target. For example, upon determining an estimate of the centroid or center of the human target, the target recognition, analysis, and tracking system may search horizontally along the X direction to determine a width of the human target that may be used to define the bounding box associated with the core volume. According to one embodiment, to search horizontally along the X direction to determine the width of the human target, the target recognition, analysis, and tracking system may search in a leftward direction and a rightward direction along the X axis from the centroid or center until the target recognition, analysis, and tracking system reaches an invalid voxel, such as a voxel that may not include a depth value associated therewith, or a voxel that may be associated with another object identified in the scene. For example, as described above, the voxels associated with the background may be removed at 315 to isolate the human target and the voxels associated therewith. As described above, according to an example embodiment, to remove the voxels at 315, the target recognition, analysis, and tracking system may replace the X values, Y values, and/or depth values associated with the voxels of the background objects with a value of zero, or another suitable indicator or flag that may indicate the voxels are invalid. At 320, the target recognition, analysis, and tracking system may search in a leftward direction from the centroid of the human target until reaching a first invalid voxel at a left side of the human target, and may search in a rightward direction from the centroid of the human target until reaching a second invalid voxel at a right side of the human target. The target recognition, analysis, and tracking system may then calculate or measure the width based on, for example, the difference between the X value of a first valid voxel adjacent to the first invalid voxel reached in the leftward direction and the X value of a second valid voxel adjacent to the second invalid voxel reached in the rightward direction.
The target recognition, analysis, and tracking system may then search vertically along the Y direction to determine a height of the human target, such as a height from the head to the hips, that may be used to define the bounding box associated with the core volume. According to one embodiment, to search vertically along the Y direction to determine the height of the human target, the target recognition, analysis, and tracking system may search in an upward direction and a downward direction along the Y axis from the centroid or center until the target recognition, analysis, and tracking system reaches an invalid voxel, such as a voxel that may not include a depth value associated therewith, a voxel that may be labeled or may have an invalid indicator associated therewith, a voxel that may be associated with another object identified in the scene, or the like. For example, at 320, the target recognition, analysis, and tracking system may search in an upward direction from the centroid of the human target until reaching a third invalid voxel at an upper portion of the human target, and may search in a downward direction from the centroid of the human target until reaching a fourth invalid voxel at a lower portion of the human target. The target recognition, analysis, and tracking system may then calculate or measure the height based on, for example, the difference between the Y value of a third valid voxel adjacent to the third invalid voxel reached in the upward direction and the Y value of a fourth valid voxel adjacent to the fourth invalid voxel reached in the downward direction.
According to example embodiments, the target recognition, analysis, and tracking system may also search diagonally at various angles (such as 30 degrees, 45 degrees, 60 degrees, etc.) relative to the X and Y axes to determine other distances and values that may be used to define the bounding box associated with the core volume.
Additionally, the target recognition, analysis, and tracking system may define the bounding box associated with the core volume based on ratios of distances or values. For example, in one embodiment, the target recognition, analysis, and tracking system may define the width of the bounding box based on the height determined as described above multiplied by a constant (such as 0.2, 0.25, 0.3, or any other suitable value).
The target recognition, analysis, and tracking system may then define the bounding box that may represent the core volume based on the first and second valid voxels determined by the horizontal search along the X axis, the third and fourth valid voxels determined by the vertical search along the Y axis, other distances and values determined by, for example, diagonal searches, ratios of distances or values, and so forth. For example, in one embodiment, the target recognition, analysis, and tracking system may generate a first vertical line of the bounding box along the Y axis at the X value of the first valid voxel, and a second vertical line of the bounding box along the Y axis at the X value of the second valid voxel. Additionally, the target recognition, analysis, and tracking system may generate a first horizontal line of the bounding box along the X axis at the Y value of the third valid voxel, and a second horizontal line of the bounding box along the X axis at the Y value of the fourth valid voxel. According to an example embodiment, the first and second horizontal lines may intersect the first and second vertical lines to form a rectangle or square that may represent the bounding box associated with the core volume of the human target.
FIG. 9 illustrates an example embodiment of a bounding box 804 that may be defined to determine the core volume. As shown in FIG. 9, the bounding box 804 may form a rectangle based on the intersections of the first and second vertical lines VL1, VL2 and the first and second horizontal lines HL1, HL2 determined as described above.
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may then determine an extremity, such as the head of the human target. For example, in one embodiment, after determining the core volume and defining the bounding box associated therewith, the target recognition, analysis, and tracking system may determine a location or position of the head of the human target.
To determine the location or position of an extremity such as the head, the target recognition, analysis, and tracking system may determine candidates at locations or positions suitable for the extremity, may score each of the candidates, and may then select the location of the extremity from the candidates based on the scores. According to one embodiment, to determine candidates for the extremity (such as the head), the target recognition, analysis, and tracking system may search the absolute highest voxel of the human target and/or voxels adjacent or near the absolute highest voxel, one or more incremental voxels based on the location of the head determined for a previous frame, the highest voxel on an upward vector that may extend vertically from, for example, the centroid or center and/or voxels adjacent or near the highest voxel on the upward vector determined for a previous frame, the highest voxel on a previous upward vector between the centroid or center and the highest voxel determined for a previous frame, or any other suitable voxels.
The target recognition, analysis, and tracking system may then score the candidates. According to one embodiment, the candidates may be scored based on 3-D pattern matching. For example, the target recognition, analysis, and tracking system may create or generate one or more candidate cylinders, such as a head cylinder and a shoulder cylinder. The target recognition, analysis, and tracking system may then calculate a score for the candidate based on the number of voxels associated with the candidate that may be included in one or more candidate cylinders, such as the head cylinder, the shoulder cylinder, and the like, as will be described in more detail below.
FIG. 10 illustrates an example embodiment of a head cylinder 806 and a shoulder cylinder 808 that may be created to score candidates associated with an extremity such as the head. According to an example embodiment, the target recognition, analysis, and tracking system may calculate a score for a candidate based on the number of voxels associated with the candidate included in the head cylinder 806 and the shoulder cylinder 808. For example, the target recognition, analysis, and tracking system may determine a first total number of voxels associated with the candidate inside the head cylinder 806 and/or the shoulder cylinder 808 based on the locations of those voxels, and a second total number of voxels associated with the candidate outside the head cylinder 806 (e.g., within the region 807) and/or the shoulder cylinder 808 based on the locations of those voxels. The target recognition, analysis, and tracking system may also calculate a measure of symmetry based on a function of an absolute value of a difference between a first number of voxels in the left half LH of the shoulder cylinder 808 and a second number of voxels in the right half RH of the shoulder cylinder 808. In an example embodiment, the target recognition, analysis, and tracking system may then calculate the score for the candidate by subtracting the second total number outside the head cylinder 806 and/or the shoulder cylinder 808 from the first total number inside the head cylinder 806 and/or the shoulder cylinder 808, and further subtracting the symmetry measure from that difference. According to one embodiment, the target recognition, analysis, and tracking system may multiply the first and second totals by constants determined by the target recognition, analysis, and tracking system before subtracting the second total from the first total.
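For illustration, a hedged sketch in Python of the scoring just described follows: voxels inside the head and/or shoulder cylinders add to the score, voxels outside subtract from it, and a symmetry penalty compares the left and right halves of the shoulder cylinder. The cylinder objects with a contains() test, the center_x attribute, and the weighting constants are assumptions, not details given above.

def score_candidate(voxels, head_cyl, shoulder_cyl, w_in=1.0, w_out=1.0):
    inside = outside = left = right = 0
    for v in voxels:  # v is assumed to expose x, y, z coordinates
        in_head = head_cyl.contains(v)
        in_shoulder = shoulder_cyl.contains(v)
        if in_head or in_shoulder:
            inside += 1
            if in_shoulder:
                if v.x < shoulder_cyl.center_x:
                    left += 1
                else:
                    right += 1
        else:
            outside += 1
    symmetry = abs(left - right)  # a function of the left/right difference
    return w_in * inside - w_out * outside - symmetry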
Referring back to FIG. 5, according to one embodiment, if the score associated with one of the candidates exceeds an extremity threshold score, the target recognition, analysis, and tracking system may determine a location or position of the extremity, such as the head, based on the voxels associated with that candidate at 320. For example, in one embodiment, the target recognition, analysis, and tracking system may select the location or position of the head based on the highest point; the highest voxel on the upward vector that may extend vertically from, for example, the centroid or center, and/or voxels adjacent or near the highest voxel on the upward vector determined for a previous frame; the highest voxel on the previous upward vector of a previous frame; the average location of all voxels within a region (such as a box, cube, etc.) around the location or position of the head in a previous frame; or any other suitable location or position associated with the candidate having a suitable score. According to other example embodiments, the target recognition, analysis, and tracking system may calculate an average of values (such as the X-values, Y-values, and depth values) of the voxels associated with the candidate that may exceed the extremity threshold score, may determine a maximum value and/or a minimum value of the voxels associated with the candidate that may exceed the extremity threshold score, or may select any other suitable value based on the voxels associated with the candidate that may exceed the extremity threshold score. The target recognition, analysis, and tracking system may then assign one or more of such values to the location or position of the extremity such as the head. Additionally, the target recognition, analysis, and tracking system may select the location or position of the head based on a fitted line or best-fit line of the voxels associated with one or more candidates that may exceed the extremity threshold score.
Additionally, in one embodiment, if more than one candidate exceeds the head threshold score, the target recognition, analysis, and tracking system may select the candidate that may have the highest score and may then determine the location or position of the extremity (such as the head) based on the voxels associated with the candidate that may have the highest score. As described above, the target recognition, analysis, and tracking system may select the position or location of the head based on, for example, an average of the values of the voxels associated with the candidate that may have the highest score (such as the X-value, Y-value, and depth value), or any other suitable technique (such as the highest point described above, the highest voxel on the previous upward vector, etc.).
According to one embodiment, if none of the scores associated with the candidates exceeds the head threshold score, the target recognition, analysis, and tracking system may use a previous location or position of the head determined from the voxels included in the human target associated with a depth image of a previous frame for which the head score may have exceeded the head threshold score, or, if the depth image received at 305 may be in an initial frame captured or observed by the capture device, the target recognition, analysis, and tracking system may use a default location or position of the head in a default pose (such as a T-pose, a natural standing pose, etc.) of the human target.
According to another embodiment, the target recognition, analysis, and tracking system may include one or more two-dimensional (2-D) patterns associated with an extremity shape, such as the shape of the head. The target recognition, analysis, and tracking system may then score the candidates associated with the extremity, such as the head, based on the likelihood that the voxels associated with the candidates form the shape of the one or more 2-D patterns. For example, the target recognition, analysis, and tracking system may determine and sample the depth values of voxels that may be indicative of an extremity shape, such as the shape of the head. If the sampled depth value of one of the voxels that may define the extremity shape (such as the shape of the head) deviates from one or more expected or predefined depth values of the voxels of the extremity shape associated with the 2-D pattern, the target recognition, analysis, and tracking system may decrease a default or initial score to indicate that the voxel may not be the extremity such as the head. In one embodiment, the target recognition, analysis, and tracking system may then select the candidate associated with the voxel having the highest score, and may assign the location or position of the extremity, such as the head, based on the location or position of the voxel associated with the candidate having the highest score.
In one embodiment, the default or initial score may be the score of a candidate associated with an extremity such as the head calculated using the head and/or shoulder cylinders as described above. The target recognition, analysis, and tracking system may reduce this score if the candidate may not be in the shape of the head associated with the one or more 2-D patterns. As described above, the target recognition, analysis, and tracking system may then select the candidate whose score exceeds the extremity threshold score, and may assign the location or position of the extremity, such as the head, based on the location or position of that candidate.
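A small illustrative sketch of the 2-D pattern check follows: depth values sampled at voxels that should outline the head shape are compared with the pattern's expected depth values, and the candidate's default or initial score is reduced for each deviation. The tolerance and penalty values are assumptions made for this example.

def pattern_score(initial_score, sampled_depths, expected_depths,
                  tolerance=50, penalty=1.0):
    score = initial_score
    for sampled, expected in zip(sampled_depths, expected_depths):
        if abs(sampled - expected) > tolerance:
            score -= penalty  # this voxel deviates from the head pattern
    return score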
The target recognition, analysis, and tracking system may further determine other extremities, such as shoulders and hips, of the human target at 320. For example, in one embodiment, after determining the location or position of an extremity such as the head of a human target, the target recognition, analysis, and tracking system may determine the location or position of the shoulders and hips of the human target. The target recognition, analysis, and tracking system may also determine the orientation of the shoulder and hip, such as the rotation or angle of the shoulder and hip.
According to an example embodiment, to determine the location or position of extremities such as the shoulders and hips, the target recognition, analysis, and tracking system may define a head-to-center vector based on the location or position of the head and the centroid or center of the human target. For example, the head-to-center vector may be a vector or line defined between the X-value, Y-value, and depth value (or Z-value) of the location or position of the head and the X-value, Y-value, and depth value (or Z-value) of the location or position of the centroid or center.
FIG. 11 illustrates an example embodiment of a head-to-center vector based on the head and the centroid or center of a human target. As described above, a location or position 810 of the head may be determined. As shown in FIG. 11, the target recognition, analysis, and tracking system may then define a head-to-center vector 812 between the location or position 810 of the head and the location or position 802 of the centroid or center.
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may then create or define one or more extremity volumes, such as a shoulder volume box and a hip volume box, based on the head-to-center vector. According to one embodiment, the target recognition, analysis, and tracking system may define or determine the approximate location or position of extremities, such as the shoulders and hips, based on a displacement along the head-to-center vector. For example, the displacement may be a length from a body landmark, such as the location or position associated with the head or the centroid or center. The target recognition, analysis, and tracking system may then define the extremity volumes, such as the shoulder and hip volumes, around the displacement values from the body landmarks, such as the location or position associated with the head or the centroid or center.
FIG. 12 illustrates an example embodiment of extremity volumes, such as a shoulder volume SVB and a hip volume HVB, determined based on the head-to-center vector 812. According to an example embodiment, the target recognition, analysis, and tracking system may define or determine an approximate location or position of extremities, such as the shoulders and hips, based on a displacement, such as a length, along the head-to-center vector 812 from a body landmark, such as the location or position 810 associated with the head or the location or position 802 associated with the centroid or center. The target recognition, analysis, and tracking system may then define the extremity volumes, such as the shoulder volume SVB and the hip volume HVB, around the displacement values from the body landmarks.
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may also calculate the centers of extremities such as the shoulders and hips based on displacement values, such as a length along the head-to-center vector from a body landmark such as the head. For example, the target recognition, analysis, and tracking system may move down or up along the head-to-center vector by the displacement values to compute the centers of the extremities such as the shoulders and hips.
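For illustration, a minimal sketch of combining the head-to-center vector with a displacement along it to estimate the center of an extremity such as the shoulders or hips is given below. Positions are assumed to be (X, Y, depth) tuples, and the displacement values are illustrative rather than values prescribed above.

def head_to_center_vector(head, center):
    return tuple(c - h for h, c in zip(head, center))

def point_along(head, center, displacement):
    # Move `displacement` units from the head toward the centroid or center.
    v = head_to_center_vector(head, center)
    length = sum(c * c for c in v) ** 0.5 or 1.0
    unit = tuple(c / length for c in v)
    return tuple(h + displacement * u for h, u in zip(head, unit))

For example, a shoulder center might be estimated as point_along(head, center, shoulder_displacement) for some assumed shoulder_displacement value.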
According to one embodiment, the target recognition, analysis, and tracking system may also determine an orientation, such as an angle, of the extremities such as the shoulders and hips. In one embodiment, the target recognition, analysis, and tracking system may calculate a fit line of the depth values within the extremity volumes, such as the shoulder volume and the hip volume, in order to determine the orientation, such as the angle, of the respective extremities (such as the shoulders and hips). For example, the target recognition, analysis, and tracking system may calculate a best-fit line based on the X-values, Y-values, and depth values of the voxels associated with the extremity volumes (such as the shoulder volume and the hip volume) in order to calculate an extremity slope of an extremity vector that may define a bone of the respective extremity. Thus, in an example embodiment, the target recognition, analysis, and tracking system may calculate a best-fit line based on the X-values, Y-values, and depth values of the voxels associated with the shoulder and hip volumes in order to calculate a shoulder slope of a shoulder vector that may define a shoulder bone through the center of the shoulders and a hip slope of a hip vector that may define a hip bone through the center of the hips. The extremity slopes, such as the shoulder slope and the hip slope, may define the respective orientation, such as the angle, of the extremities such as the shoulders and hips.
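The description above does not prescribe a particular fitting method; as one sketch under that assumption, a best-fit direction for the voxels of an extremity volume can be obtained with a principal-component style fit via numpy:

import numpy as np

def extremity_vector(voxels):
    # voxels: an N x 3 array of the (X, Y, depth) values of the voxels
    # associated with an extremity volume such as the shoulder volume.
    pts = np.asarray(voxels, dtype=float)
    centered = pts - pts.mean(axis=0)
    # The first right singular vector gives the direction of the best-fit
    # line, whose components correspond to the extremity slope.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]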
In an example embodiment, the target recognition, analysis, and tracking system may determine the locations or positions of joints associated with the extremities (such as the shoulders and hips) based on the bones defined by the extremity vectors and their slopes. For example, in one embodiment, the target recognition, analysis, and tracking system may search in each direction along the shoulder and hip vectors until reaching the respective edges of the shoulders and hips defined by, for example, invalid voxels in the shoulder and hip volumes. The target recognition, analysis, and tracking system may then assign to the shoulder and hip joints locations or positions that include X-values, Y-values, and depth values based on one or more locations or positions including the X-values, Y-values, or depth values of valid voxels along the shoulder and hip vectors that may be adjacent or near the invalid voxels. According to other example embodiments, the target recognition, analysis, and tracking system may determine a first length of the shoulder vector between the shoulder edges and a second length of the hip vector between the hip edges. The target recognition, analysis, and tracking system may determine a location or position of the shoulder joints based on the first length and a location or position of the hip joints based on the second length. For example, in one embodiment, the shoulder joints may be assigned locations or positions that include the X-values, Y-values, and depth values of the ends of the shoulder vector at the first length, and the hip joints may be assigned locations or positions that include the X-values, Y-values, and depth values of the ends of the hip vector at the second length. According to another embodiment, the target recognition, analysis, and tracking system may adjust the first length and the second length before assigning the locations or positions to the shoulder and hip joints. For example, the target recognition, analysis, and tracking system may adjust the first length by equally subtracting a shoulder displacement value, which may include a value associated with a typical displacement between the edge of a person's shoulder or shoulder blade and the shoulder joint, from each end of the shoulder vector. Similarly, the target recognition, analysis, and tracking system may adjust the second length by equally subtracting a hip displacement value, which may include a value associated with a typical displacement between the edge of a person's hip or pelvis and the hip joint, from each end of the hip vector. Upon adjusting the first and second lengths of the shoulder and hip vectors that may define the bones of the respective shoulders and hips, the target recognition, analysis, and tracking system may assign to the shoulder joints locations or positions that include the X-values, Y-values, and depth values of the ends of the shoulder vector at the adjusted first length, and assign to the hip joints locations or positions that include the X-values, Y-values, and depth values of the ends of the hip vector at the adjusted second length.
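As an illustrative sketch of the joint placement just described, the edge positions found by searching along an extremity vector can be pulled inward by a displacement value before being assigned as joint positions; the displacement constant is an assumption for this example.

import numpy as np

def joints_from_edges(edge_left, edge_right, displacement):
    a = np.asarray(edge_left, dtype=float)
    b = np.asarray(edge_right, dtype=float)
    d = (b - a) / np.linalg.norm(b - a)  # unit direction along the vector
    # Subtract the displacement equally from each end of the vector.
    return a + displacement * d, b - displacement * d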
FIG. 13 illustrates an example embodiment of shoulders and hips that may be calculated based on the shoulder volume SVB and the hip volume HVB. As shown in FIG. 13, shoulder locations or positions 816a-b and hip locations or positions 818a-b may be determined based on the respective shoulder volume SVB and hip volume HVB as described above.
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may then determine a torso volume of the human target. In one embodiment, after determining the shoulders and hips, the target recognition, analysis, and tracking system may generate or create a torso volume that may include the voxels associated with or surrounding the head, shoulders, center, and hips. The torso volume may be a cylinder, a pill shape (such as a cylinder with rounded ends), or the like, based on the location or position of the center, head, shoulders, and/or hips.
According to one embodiment, the target recognition, analysis, and tracking system may create or generate a cylinder that may represent the torso volume, having dimensions based on the shoulders, head, hips, center, and the like. For example, the target recognition, analysis, and tracking system may create a cylinder that may have a width or diameter based on the width of the shoulders and a height based on the distance between the head and the hips. The target recognition, analysis, and tracking system may then orient or angle the cylinder that may represent the torso volume along the head-to-center vector, such that the torso volume may reflect the orientation (such as the angle) of the torso of the human target.
FIG. 14 illustrates an example embodiment of a cylinder 820 that may represent the torso volume. As shown in FIG. 14, the cylinder 820 may have a width or diameter based on the width of the shoulders and a height based on the distance between the head and the hips. The cylinder 820 may also be oriented or angled along the head-to-center vector 812.
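For illustration, one way to test whether a voxel falls inside such a tilted cylinder is sketched below; the axis is assumed to run from the head to the hips along the head-to-center vector, and the radius is an assumed parameter.

import numpy as np

def in_torso_cylinder(voxel, head, hips, radius):
    p = np.asarray(voxel, dtype=float)
    a = np.asarray(head, dtype=float)   # top of the cylinder axis
    b = np.asarray(hips, dtype=float)   # bottom of the cylinder axis
    axis = b - a
    t = np.dot(p - a, axis) / np.dot(axis, axis)
    if t < 0.0 or t > 1.0:
        return False  # beyond the head or below the hips
    closest = a + t * axis  # nearest point on the axis
    return np.linalg.norm(p - closest) <= radius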
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may then determine additional extremities of the human target, such as limbs including the arms, hands, legs, feet, etc. According to one embodiment, after the torso volume is generated or created, the target recognition, analysis, and tracking system may roughly label voxels outside of the torso volume as limbs. For example, the target recognition, analysis, and tracking system may identify each of the voxels outside of the torso volume such that the target recognition, analysis, and tracking system may label those voxels as being part of a limb.
The target recognition, analysis, and tracking system may then determine the extremities, such as the actual limbs including a right and left arm, a right and left hand, a right and left leg, a right and left foot, etc., associated with the voxels outside of the torso volume. In one embodiment, to determine an actual limb, the target recognition, analysis, and tracking system may compare a previous position or location of an identified limb (such as a previous position or location of the right arm, left arm, left leg, right leg, etc.) with the positions or locations of the voxels outside of the torso volume. According to example embodiments, the previous position or location of a previously identified limb may be a position or location of the limb in a depth image received in a previous frame, a projected position or location of the body part based on a previous movement, or any other suitable previous position or location of a representation of the human target, such as a fully connected skeleton or body model of the human target. Based on the comparison, the target recognition, analysis, and tracking system may then associate the voxels outside of the torso volume with the closest previously identified limb. For example, the target recognition, analysis, and tracking system may compare the position or location including the X-value, Y-value, and depth value of each of the voxels outside of the torso volume with the previous positions or locations including the X-values, Y-values, and depth values of the previously identified limbs (such as the previously identified left arm, right arm, left leg, right leg, etc.). The target recognition, analysis, and tracking system may then associate each of the voxels outside of the torso volume with the previously identified limb that may have the closest position or location based on the comparison.
In one embodiment, to determine an actual limb, the target recognition, analysis, and tracking system may compare a default position or location of an identified limb (such as the right arm, left arm, left leg, right leg, etc.) in a default pose of a representation of the human target with the positions or locations of the voxels outside of the torso volume. For example, the depth image received at 305 may be included in an initial frame captured or observed by the capture device. If the depth image received at 305 may be included in the initial frame, the target recognition, analysis, and tracking system may compare the default position or location of a limb (such as the default position or location of the right arm, left arm, left leg, right leg, etc.) with the positions or locations of the voxels outside of the torso volume. According to an example embodiment, the default position or location of an identified limb may be a position or location of the limb in a representation of the human target in a default pose (such as a T-pose, a da Vinci pose, a natural pose, etc.), such as a fully connected skeleton or body model of the human target in the default pose. Based on the comparison, the target recognition, analysis, and tracking system may then associate the voxels outside of the torso volume with the closest limb associated with the default pose. For example, the target recognition, analysis, and tracking system may compare the position or location including the X-value, Y-value, and depth value of each of the voxels outside of the torso volume with the default positions or locations including the X-values, Y-values, and depth values of the default limbs (such as the default left arm, right arm, left leg, right leg, etc.). The target recognition, analysis, and tracking system may then associate each of the voxels outside of the torso volume with the default limb that may have the closest position or location based on the comparison.
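A hedged sketch of this nearest-limb labeling follows; squared Euclidean distance over (X, Y, depth) tuples is an assumed distance measure, as the description above only requires associating each voxel with the closest previously identified or default limb.

def label_limbs(outside_voxels, limb_positions):
    # limb_positions: e.g. {"left_arm": (x, y, z), "right_leg": (x, y, z)}
    labels = {}
    for v in outside_voxels:  # v is an (x, y, z) tuple
        closest_limb = min(
            limb_positions.items(),
            key=lambda item: sum((a - b) ** 2 for a, b in zip(v, item[1])),
        )[0]
        labels[v] = closest_limb
    return labels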
The target recognition, analysis, and tracking system may also re-label voxels within the torso volume based on the estimated limbs. For example, in one embodiment, at least a portion of an arm (such as the left forearm) may be positioned in front of the torso of the human target. Based on the default or previous position or location of the identified arm, the target recognition, analysis, and tracking system may determine or estimate that the portion is associated with the arm, as described above. For example, the previous position or location of a previously identified limb may indicate that one or more voxels of a limb (such as an arm) of the human target may be within the torso volume. The target recognition, analysis, and tracking system may then compare the previous positions or locations including the X-values, Y-values, and depth values of the previously identified limbs (such as the previously identified left arm, right arm, left leg, right leg, etc.) with the positions or locations of the voxels included in the torso volume. The target recognition, analysis, and tracking system may then associate and re-label each of the voxels within the torso volume with the previously identified limb that may have the closest position or location based on the comparison.
According to one embodiment, after labeling the voxels associated with the limbs, the target recognition, analysis, and tracking system may determine, at 320, the location or position of, for example, portions of the labeled limbs. For example, after labeling the voxels associated with the left arm, the right arm, the left leg, and/or the right leg, the target recognition, analysis, and tracking system may determine the locations or positions of the hands and/or elbows associated with the right and left arms, and the knees and/or feet associated with the right and left legs.
The target recognition, analysis, and tracking system may determine the location or position of such a portion (such as a hand, elbow, foot, knee, etc.) based on a limb average position for each of the limbs. For example, the target recognition, analysis, and tracking system may calculate a left arm average position by adding the X-value of each of the voxels associated with the left arm, adding the Y-value of each of the voxels associated with the left arm, adding the depth value of each of the voxels associated with the left arm, and dividing each of the summed X-values, Y-values, and depth values by the total number of voxels associated with the left arm. According to one embodiment, the target recognition, analysis, and tracking system may then define a vector or line between the left shoulder and the left arm average position, such that the vector or line between the left shoulder and the left arm average position may define a first search direction for the left hand. The target recognition, analysis, and tracking system may then search from the shoulder along the first search direction defined by the vector or line for the last valid voxel, or the last voxel having a valid X-value, Y-value, and/or depth value, and may associate the location or position of the last valid voxel with the left hand.
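By way of example only, the following sketch averages the left-arm voxels, defines the first search direction from the shoulder through that average, and walks along it to the last valid voxel. The validity callback, step size, and iteration cap are assumptions.

import numpy as np

def arm_average(arm_voxels):
    return np.mean(np.asarray(arm_voxels, dtype=float), axis=0)

def find_hand(shoulder, arm_avg, is_valid, step=1.0, max_steps=500):
    direction = np.asarray(arm_avg, dtype=float) - np.asarray(shoulder, dtype=float)
    direction /= np.linalg.norm(direction)
    pos = np.asarray(shoulder, dtype=float)
    last_valid = None
    for _ in range(max_steps):
        pos = pos + step * direction
        if not is_valid(pos):
            break
        last_valid = pos.copy()
    return last_valid  # associated with the location of the hand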
According to another embodiment, the target recognition, analysis, and tracking system may calculate an anchor point. The location or position of the anchor point may be based on one or more offsets from other determined extremities, such as the head, hips, shoulders, etc. For example, the target recognition, analysis, and tracking system may calculate the X-value and depth value of the anchor point by extending the location or position of the shoulder in the respective X-direction and Z-direction by half of the X-value and depth value associated with the location or position of the shoulder. The target recognition, analysis, and tracking system may then mirror the location or position of the X-value and depth value of the anchor point around the extended location or position.
The target recognition, analysis, and tracking system may calculate the Y-value of the anchor point based on a displacement of the left arm average position from the head and/or the hips. For example, the target recognition, analysis, and tracking system may calculate the displacement or difference between the Y-value of the head and the Y-value of the left arm average position. The target recognition, analysis, and tracking system may then add the displacement or difference to, for example, the Y-value of the center of the hips to calculate the Y-value of the anchor point.
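The anchor-point construction is sketched below under one plausible reading of the description above; the extension-and-mirroring of the X-value and depth value, in particular, admits more than one interpretation, so this is illustrative only.

def anchor_point(shoulder, head_y, hip_center_y, arm_avg_y):
    sx, sy, sz = shoulder
    # Extend the shoulder's X-value and depth value by half their values...
    ext_x, ext_z = sx + 0.5 * sx, sz + 0.5 * sz
    # ...then mirror the shoulder around the extended location (assumption).
    ax = 2.0 * ext_x - sx
    az = 2.0 * ext_z - sz
    # Y-value: the head-to-arm-average displacement applied at the hip center.
    ay = hip_center_y + (head_y - arm_avg_y)
    return (ax, ay, az)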
FIGS. 15A-15C illustrate example embodiments of an extremity, such as a hand, determined based on anchor points 828a-828c. As shown in FIGS. 15A-15C, according to another embodiment, the target recognition, analysis, and tracking system may calculate the anchor points 828a-828c. The target recognition, analysis, and tracking system may then define a vector or line between the anchor points 828a-828c and the left arm average positions 826a-826c, such that the vector or line between an anchor point and the left arm average position may define a second search direction for the left hand. The target recognition, analysis, and tracking system may then search from the anchor points 828a-828c along the second search direction defined by the vector or line for the last valid voxel, or the last voxel having a valid X-value, Y-value, and/or depth value, and may associate the location or position of the last valid voxel with the left hand.
As described above, in an example embodiment, the target recognition, analysis, and tracking system may calculate the location or position of the anchor points 828a-828c based on one or more offsets from other determined extremities (such as the head, hips, shoulders, etc.). For example, the target recognition, analysis, and tracking system may calculate the X-values and depth values of the anchor points 828a-828c by extending the location or position of the shoulder in the respective X-direction and Z-direction by half of the X-value and depth value associated with the location or position of the shoulder. The target recognition, analysis, and tracking system may then mirror the locations or positions of the X-values and depth values of the anchor points 828a-828c around the extended location or position.
The target recognition, analysis, and tracking system may calculate the Y-values of the anchor points 828a-828c based on the displacement of the left arm average position from the head and/or the hips. For example, the target recognition, analysis, and tracking system may calculate the displacement or difference between the Y-value of the head and the Y-values of the left arm average positions 826a-826c. The target recognition, analysis, and tracking system may then add the displacement or difference to, for example, the Y-value of the center of the hips to calculate the Y-values of the anchor points 828a-828c.
Referring back to FIG. 5, according to an example embodiment, at 320, the target recognition, analysis, and tracking system may calculate a right arm average position that may be used to define a search direction, such as the first and second search directions described above, which may be used to determine the location or position of the right hand. The target recognition, analysis, and tracking system may also calculate a left leg average position and a right leg average position that may be used to define search directions as described above, which may be used to determine the left foot and the right foot.
FIG. 16 illustrates an example embodiment of extremities, such as hands and feet, that may be calculated based on the average positions of the extremities (such as the arms and legs) and/or the anchor points. As shown in FIG. 16, hand locations or positions 822a-b and foot locations or positions 824a-b may be determined based on the first and second search directions determined by the respective arm and leg average positions and/or the anchor points as described above.
Referring back to FIG. 5, at 320, the target recognition, analysis, and tracking system may also determine the locations or positions of the elbows and knees based on the right and left arm average positions, the right and left leg average positions, the shoulders, hips, head, etc. In one embodiment, the target recognition, analysis, and tracking system may determine the location or position of the left elbow by refining the X-value, Y-value, and depth value of the left arm average position. For example, the target recognition, analysis, and tracking system may determine the outermost voxels that may define edges associated with the left arm. The target recognition, analysis, and tracking system may then adjust the X-value, Y-value, and depth value of the left arm average position to be in the middle of, or equidistant from, the edges.
The target recognition, analysis, and tracking system may also determine additional points of interest of the isolated human target at 320. For example, the target recognition, analysis, and tracking system may determine the voxel furthest from the center of the body, the voxel closest to the camera, and the forwardmost voxel of the human target based on an orientation, such as the angle, of the shoulders.
According to an example embodiment, at 320, one or more of the extremities, such as the head, hands, arms, legs, feet, center, shoulders, hips, etc., may be refined based on depth averaging. For example, the target recognition, analysis, and tracking system may determine an initial location or position of an extremity by analyzing the voxels associated with the isolated human target using, for example, the anchor points, the head-to-center vector, the extremity volumes, the scoring techniques, the patterns, etc., described above. The target recognition, analysis, and tracking system may then refine the initial location or position of the extremity based on values, such as the depth values of pixels in a 2-D pixel region of the non-downsampled depth image that may be associated with the voxels.
For example, in one embodiment, the target recognition, analysis, and tracking system may determine a moving average for the extremity, which may include an average, such as of the X-values, Y-values, or depth values, of the locations or positions of the extremity determined for previously received frames and depth images (such as a series of three previously received frames and depth images). The target recognition, analysis, and tracking system may then determine an average volume based on the moving average. According to one embodiment, the average volume may be a region or portion of the non-downsampled depth image whose included pixels may be scanned to refine the extremity based on the moving average. For example, the target recognition, analysis, and tracking system may analyze or compare the initial location or position of the extremity against the moving average. If a value, such as the X-value, Y-value, or depth value, of the initial location or position may be close or equal to the corresponding value of the moving average, the target recognition, analysis, and tracking system may determine an average volume having the moving average as its center. If a value, such as the X-value, Y-value, or depth value, of the initial location or position may not be close or equal to the corresponding value of the moving average, the target recognition, analysis, and tracking system may determine an average volume having the value of the initial location or position as its center. Thus, in one embodiment, when the moving average differs from the initial location of the extremity, the target recognition, analysis, and tracking system may use the initial location or position as the center of the average volume being determined.
After determining the average volume, the target recognition, analysis, and tracking system may scan the pixels in the non-downsampled depth image associated with the average volume to determine a location or position of the extremity that may be used to refine the initial location or position. For example, the target recognition, analysis, and tracking system may scan each pixel in the non-downsampled depth image that may be included in or associated with the average volume. Based on the scan, the target recognition, analysis, and tracking system may calculate a refined location or position, including an X-value, Y-value, and depth value, for the extremity in the non-downsampled depth image by averaging the values of pixels in the non-downsampled depth image that may be associated with the extremity but not with the background. The target recognition, analysis, and tracking system may then adjust or refine the initial location or position of the extremity based on the refined location or position. For example, in one embodiment, the target recognition, analysis, and tracking system may assign the refined location or position to the location or position of the extremity. According to another embodiment, the target recognition, analysis, and tracking system may adjust or move the initial location or position of the extremity based on the refined location or position. For example, the target recognition, analysis, and tracking system may move or adjust the extremity from the initial location or position in one or more directions (such as from the centroid toward the tip of the extremity) based on the refined location or position. The target recognition, analysis, and tracking system may then assign the location or position of the extremity based on the movement or adjustment of the initial location or position using the refined location or position.
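For illustration, a minimal sketch of this refinement step follows: pixels of the non-downsampled depth image inside the average volume that belong to the target, rather than the background, are averaged into a refined (X, Y, depth) location. The square region bounds and the foreground test are assumptions.

import numpy as np

def refine_extremity(depth_image, center, half_size, is_foreground):
    # center: the initial or moving-average (X, Y, depth) location.
    h, w = depth_image.shape
    cx, cy = int(round(center[0])), int(round(center[1]))
    xs, ys, zs = [], [], []
    for y in range(max(0, cy - half_size), min(h, cy + half_size + 1)):
        for x in range(max(0, cx - half_size), min(w, cx + half_size + 1)):
            z = depth_image[y, x]
            if is_foreground(z):  # skip background pixels
                xs.append(x)
                ys.append(y)
                zs.append(z)
    if not xs:
        return center  # nothing usable; keep the initial location
    return (float(np.mean(xs)), float(np.mean(ys)), float(np.mean(zs)))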
At 320, the target recognition, analysis, and tracking system may also determine whether one or more of the locations or positions determined for the extremities (such as the head, shoulders, hips, hands, feet, etc.) may not be an accurate location or position of the actual extremity of the human target. For example, in one embodiment, the location or position of the right hand may be inaccurate such that the location or position of the right hand may be stuck at or adjacent to the location or position of a shoulder or hip.
According to an example embodiment, the target recognition, analysis, and tracking system may include or store, for each extremity, a list of body markers that may indicate inaccurate locations or positions for that extremity. For example, the list may include body markers around the shoulders and hips that may be associated with a hand. The target recognition, analysis, and tracking system may determine whether the location or position of the hand may be accurate based on the body markers associated with the hand in the list. For example, if the location or position of the hand may be within one of the body markers associated with the hand in the list, the target recognition, analysis, and tracking system may determine that the location or position of the hand may be inaccurate. According to one embodiment, the target recognition, analysis, and tracking system may then adjust the current location or position of the hand to the previous accurate location or position of the hand in a previous frame.
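As a small illustrative sketch, a body-marker check might treat each marker as a sphere around a shoulder or hip and flag the hand when it falls inside one; the spherical marker regions and radius are assumptions.

def hand_is_accurate(hand, markers, radius):
    # hand: (x, y, z); markers: list of (x, y, z) marker centers.
    for m in markers:
        if sum((a - b) ** 2 for a, b in zip(hand, m)) <= radius ** 2:
            return False  # stuck at a shoulder or hip marker
    return True

When hand_is_accurate returns False, the position from the last frame in which the hand was accurate would be kept instead.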
At 325, the target recognition, analysis, and tracking system may process the extremities determined at 320. For example, in one embodiment, the target recognition, analysis, and tracking system may process the extremities to generate a model, such as a skeletal model that may have one or more joints and bones defined therebetween.
FIG. 17 illustrates an example embodiment of a model 900 that may be generated. According to an example embodiment, model 900 may include one or more data structures that may represent, for example, a three-dimensional model of a human. Each body part may be characterized as a mathematical vector having X, Y and Z values that may define the joints and bones of model 900.
As shown in FIG. 17, the model 900 may include one or more joints j1-j16. According to an example embodiment, each of the joints j1-j16 may enable one or more body parts defined between the joints to move relative to one or more other body parts. For example, a model representing a human target may include a plurality of rigid and/or deformable body parts defined by one or more structural members such as "bones," with the joints j1-j16 located at the intersections of adjacent bones. The joints j1-j16 may enable the various body parts associated with the bones and joints j1-j16 to move independently of one another. For example, as shown in FIG. 17, the bone defined between the joints j10 and j12 may correspond to a forearm that may move independently of, for example, the bone defined between the joints j14 and j16, which may correspond to a calf.
Referring back to FIG. 5, at 325, the target recognition, analysis, and tracking system may also process the extremities determined at 320 by adjusting a model (such as the model 900 described above with reference to FIG. 17) based on the locations or positions determined for the extremities at 320. For example, the target recognition, analysis, and tracking system may adjust the joint j1 associated with the head to correspond to a location or position, such as the location or position 810 described above with reference to FIG. 11, determined at 320. Thus, in an example embodiment, the joint j1 may be assigned the X-value, Y-value, and depth value associated with the location or position 810 determined for the head as described above. If one or more extremities may be inaccurate based on, for example, the list of body markers as described above, the target recognition, analysis, and tracking system may keep the inaccurate joints at their previous locations or positions based on a previous frame.
In one embodiment, the target recognition, analysis, and tracking system may process the adjusted model by, for example, mapping one or more motions or movements applied to the adjusted model to an avatar or game character, such that the avatar or game character may be animated to simulate a user such as user 18 described above with reference to FIGS. 1A and 1B. For example, the visual appearance of the on-screen character may then be changed in response to changes to the adjusted model.
In another embodiment, the target recognition, analysis, and tracking system may process the adjusted model by providing the adjusted model to a gesture library in a computing environment (such as the computing environment 12 described above with reference to FIGS. 1A-4). The gesture library may be used to determine controls to perform within an application based on the positions of various body parts in the model.
It will be appreciated that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, and so forth. Also, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims (15)

1. A method for determining extremities of a user, the method comprising:
receiving a depth image;
generating a grid of voxels based on the depth image;
removing a background included in the grid of voxels to isolate one or more voxels associated with a human target;
defining a head-to-center vector based on the position or orientation of the head and the center of the isolated human target;
creating an extremity volume based on a displacement along the head-to-center vector; and
determining a location or position of one or more extremities of the isolated human target based on a fit line of depth values within the extremity volume.
2. The method of claim 1, wherein determining the location or position of the one or more extremities of the isolated human target further comprises estimating a center of the isolated human target, wherein estimating the center of the human target comprises calculating an average position of voxels in a grid associated with the isolated human target.
3. The method of claim 1, wherein determining the location or position of the one or more extremities of the isolated human target further comprises:
determining candidates for the one or more extremities;
generating a candidate cylinder based on the grid of voxels;
computing a score for the candidate based on the candidate cylinder;
determining whether a score of the candidate exceeds an extremity threshold score;
assigning a value of a voxel in a grid associated with the candidate to a location or position of the one or more extremities when the score exceeds the extremity threshold score; and
assigning a previous position of an extremity to a position or location of the one or more extremities when the score does not exceed the extremity threshold score.
4. The method of claim 1, wherein determining the location or position of the one or more extremities of the isolated human target further comprises:
sampling depth values of voxels in the grid indicative of an extremity shape;
determining whether the sampled depth values of the voxels deviate from one or more expected depth values of a two-dimensional pattern associated with the extremity shape; and
decreasing a score associated with a voxel when the sampled depth value deviates from the expected depth value;
determining whether a score associated with the voxel has a highest value; and
assigning a value of the voxel with the highest value to a location or position of the one or more extremities.
5. The method of claim 1, wherein determining the orientation of the one or more extremities comprises calculating extremity slopes of extremity vectors associated with the one or more extremities based on a fitted line of the depth values.
6. The method of claim 1, wherein determining the location or position of the one or more extremities of the isolated human target further comprises:
creating a torso volume;
identifying voxels outside the torso volume; and
identifying voxels outside of the torso volume as being associated with the one or more extremities.
7. The method of claim 1, wherein determining the location or position of the one or more extremities of the isolated human target further comprises:
determining an anchor point and an average limb position;
generating a vector between the anchor point and the average limb position, wherein the vector defines a search direction;
determining a last valid voxel along said vector by searching in said search direction from said anchor point; and
associating the location or position of the one or more extremities with the last valid voxel.
8. A method for determining extremities of a user in a scene, the method comprising:
receiving a depth image comprising pixels;
down-sampling the pixels in the received depth image to generate one or more voxels;
isolating one or more voxels associated with the human target;
defining a head-to-center vector based on the position or orientation of the head and the center of the isolated human target;
creating an extremity volume based on a displacement along the head-to-center vector; and
determining a position or orientation of a head of the isolated human target based on a fit line of depth values within the extremity volume.
9. The method of claim 8, wherein the step of determining the position or orientation of the head of the isolated human target further comprises:
determining a candidate for the head;
generating a candidate cylinder based on the grid of voxels;
computing a score for the candidate term based on the candidate cylinder;
determining whether the score of the candidate exceeds a head threshold score;
assigning a value of a voxel in a grid associated with the candidate to a position or orientation of the head when the score exceeds the head threshold score; and
assigning a previous position of the head to a position or location of the head when the score does not exceed the head threshold score.
10. The method of claim 8, wherein the step of determining the position or orientation of the head of the isolated human target further comprises:
sampling depth values of voxels in the grid indicative of a shape of the head;
determining whether the sampled depth values of the voxels deviate from one or more expected depth values of a two-dimensional pattern associated with a shape of the head; and
decreasing a score associated with a voxel when the sampled depth value deviates from the expected depth value;
determining whether a score associated with the voxel has a highest value; and
assigning the value of the voxel with the highest value to a position or orientation of the head.
11. The method of claim 8, further comprising:
determining an average volume associated with the head based on a comparison of a moving average and a position or orientation of the head;
scanning pixels in the depth image associated with the average volume;
calculating a refined position or orientation of the head by averaging one or more values of one or more pixels in the average volume; and
refining the position or orientation of the head based on the refined position or location.
12. A system for determining extremities of a user, the system comprising:
a capture device, wherein the capture device includes a camera component that receives a depth image of a scene; and
a computing device in operable communication with the capture device, wherein the computing device includes a processor that generates a downsampled depth image based on one or more pixels in the depth image received from the capture device; removes a background of the downsampled depth image to isolate a human target; defines a head-to-center vector based on a position or orientation of a head and a center of the isolated human target; creates an extremity volume based on a displacement along the head-to-center vector; and determines a position or orientation of one or more extremities of the isolated human target based on a fitted line of depth values within the extremity volume, wherein the one or more extremities include at least one of a head, a center of mass, a shoulder, a hip, a leg, an arm, a hand, or a foot.
13. The system of claim 12, wherein the processor determines the location or position of the one or more extremities by:
determining candidates for the one or more extremities;
generating a candidate cylinder based on the grid of voxels;
computing a score for the candidate based on the candidate cylinder;
determining whether a score of the candidate exceeds an extremity threshold score;
assigning a value of a voxel in a grid associated with the candidate to a location or position of the one or more extremities when the score exceeds the extremity threshold score; and
assigning a previous position of an extremity to a position or location of the one or more extremities when the score does not exceed the extremity threshold score.
14. The system of claim 12, wherein determining the orientation of the extremity comprises calculating an extremity slope of an extremity vector associated with the one or more extremities based on a fit line of the depth values.
15. The system of claim 12, wherein the processor determines the location or position of the one or more extremities by:
determining an anchor point and an average limb position;
generating a vector between the anchor point and the average limb position, wherein the vector defines a search direction;
determining a last valid voxel along said vector by searching in said search direction from said anchor point; and
associating the location or position of the one or more extremities with the last valid voxel.
HK13100850.4A 2009-11-11 2010-11-02 Methods and systems for determining and tracking extremities of a target HK1173690B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/616,471 2009-11-11
US12/616,471 US8963829B2 (en) 2009-10-07 2009-11-11 Methods and systems for determining and tracking extremities of a target
PCT/US2010/055171 WO2011059857A2 (en) 2009-11-11 2010-11-02 Methods and systems for determining and tracking extremities of a target

Publications (2)

Publication Number Publication Date
HK1173690A1 HK1173690A1 (en) 2013-05-24
HK1173690B true HK1173690B (en) 2015-07-10
