
HK1171854B - Validation analysis of human target tracking - Google Patents


Info

Publication number
HK1171854B
HK1171854B (application HK12112661.9A)
Authority
HK
Hong Kong
Prior art keywords
data
depth
pipeline
tracking
ground truth
Prior art date
Application number
HK12112661.9A
Other languages
Chinese (zh)
Other versions
HK1171854A1 (en)
Inventor
J.D.普尔西弗
P.穆哈吉尔
N.A.艾德格米
S.刘
P.O.科克
J.C.福斯特
R.O.F.小福布斯
S.P.斯塔赫尼克
T.拉万德
J.贝尔托拉米
M.T.詹尼
K.T.胡耶恩
C.C.马雷
S.D.佩罗
R.J.菲茨杰拉德
W.R.比森
C.C.佩普尔
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Priority claimed from US 12/972,341 (published as US 8,448,056 B2)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1171854A1
Publication of HK1171854B


Description

Validation analysis of human target tracking
Technical Field
The invention relates to a validation analysis for tracking of a human target.
Background
Target recognition, analysis, and tracking systems have been created that use capture devices to determine the location and movement of objects and humans in a scene. The capture devices may include a depth camera, an RGB camera, and an audio detector that provide information to a capture processing pipeline that includes hardware and software elements. The processing pipeline provides motion recognition, analysis, and motion tracking data to applications that can use the data. Exemplary applications include games and computer interfaces.
Accuracy in the tracking pipeline is desirable. Accuracy depends on the ability to determine the movement of various types of user motions within the field of view for various types of users (male, female, tall, short, etc.). Achieving accuracy in the tracking pipeline is particularly difficult when providing commercially viable devices, where the possible variations in the motion and type of user to be tracked are significantly greater than in a testing or academic environment.
Disclosure of Invention
In one embodiment, a technique for testing a target recognition, analysis, and tracking system is provided. A method for verifying the accuracy of a target recognition, analysis, and tracking system includes creating test data and providing a searchable collection of test data. The test data may be a recorded and/or synthesized depth clip with an associated ground truth. The ground truth includes an association of joint parts of the human with skeletal tracking information that has been verified to be accurate. At least a subset of the searchable set of test data is provided to the pipeline in response to a request to test the pipeline. The tracking data is output from the pipeline, and analysis of the tracking data relative to the ground truth provides an indication of the accuracy of the pipeline code.
A system for verifying the accuracy of a target recognition, analysis, and tracking system includes a searchable repository of recorded and synthesized depth clips and associated ground truth, which are available to multiple processing pipelines under test. One or more processing devices, each including at least one instance of a target recognition, analysis, and tracking pipeline, analyze selected components of the test data. A job controller provides at least a subset of the searchable set of test data to test the pipeline, and an analysis engine receives tracking data output from the pipeline over at least the subset of the searchable set. A report generator outputs an analysis of the tracking data relative to the ground truth in at least the subset to provide an output of errors relative to the ground truth.
Many features are described herein that make the described systems and methods flexible, scalable, and unique.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
FIG. 1A is a block diagram illustrating an environment for practicing the techniques described herein.
FIG. 1B is a flow diagram of an overall process for validation analysis of depth and motion tracking in a target recognition, analysis, and tracking system.
FIG. 2 is a block diagram of an analysis engine used in the environment of FIG. 1A.
FIG. 3A illustrates an example embodiment of a target recognition, analysis, and tracking system.
FIG. 3B illustrates another example embodiment of a target recognition, analysis, and tracking system.
FIG. 4 illustrates one example embodiment of a capture device that may be used in a target recognition, analysis, and tracking system.
FIG. 5 illustrates an exemplary body model for representing a human target.
FIG. 6A illustrates a front view of an exemplary skeletal model used to represent a human target.
FIG. 6B illustrates an oblique view of an exemplary skeletal model used to represent a human target.
FIG. 7A is a flow chart illustrating a process for creating test data.
FIG. 7B is a flow diagram of the high-level operation of one embodiment of a target recognition, analysis, and tracking system motion tracking pipeline.
FIG. 7C is a flow chart of the model fitting process of FIG. 7A.
FIG. 8 illustrates a conceptual block diagram of a processing pipeline for tracking targets, according to one embodiment of the present technology.
FIG. 9A is a flow diagram illustrating one embodiment for creating ground truth data from a depth clip by manually tagging skeletal data.
FIG. 9B is a flow diagram illustrating another embodiment for creating ground truth data for a test clip using calibrated sensors.
FIG. 10 is a flow diagram illustrating the alignment of motion capture coordinates to a skeletal tracking coordinate space.
FIG. 11 is a flow diagram illustrating compositing two depth clips to create a depth map composite.
FIGS. 12A-12J are representations of the compositing steps shown in FIG. 11.
FIG. 13 is a flow chart illustrating manual annotation of test data.
FIG. 14 is a flow chart illustrating the output of an analysis test pipeline with respect to ground truth data.
FIG. 15 is a flow chart illustrating the creation of a test suite.
FIG. 16 shows the model being moved.
FIG. 17 is an exemplary computing environment suitable for use in the present technology.
FIG. 18 is another exemplary computing environment suitable for use in the present technology.
Detailed Description
Techniques are provided that allow testing of target recognition, analysis, and tracking systems. A target recognition, analysis, and tracking system may be used to recognize, analyze, and/or track a human target such as a user. The target recognition, analysis, and tracking system includes a processing pipeline implemented in hardware and software to perform recognition, analysis, and tracking functions. Designers of such systems need to optimize such systems against known good data sets, and continually strive to improve the accuracy of such systems.
Test systems include a vast collection of recorded and synthesized test data. The test data includes a plurality of depth clips, each comprising a sequence of depth frames recorded during a test data capture session. The test data is correlated with motion data to ascertain the ground truth of the depth clip. The test data contains the motions and gestures of humans that a developer of the pipeline, or of a particular application designed to use the pipeline, is interested in recognizing. The ground truth reflects the known accurate data in the depth clip. Ground truth can be of different types, including skeletal data, background removal data, and floor data. The test data is annotated to allow developers to easily identify desired depth clips and build sets of depth clips into test suites. A synthesized depth clip may be created from existing clips and other three-dimensional object data, such as static objects within a scene. An analysis controller directs processing of the test data through the new pipeline, receives the tracked results from the pipeline, and manages analysis of the accuracy of the pipeline processing relative to ground truth. Analysis of individual errors, as well as a summary of pipeline performance relative to previous pipelines, is provided. In this way, the technique allows any local processing device (such as an Xbox console) to run an evaluation of a new version of the pipeline, or divides the work among many test consoles. It collects these results and provides various statistical analyses of the data to help identify problems when tracking in certain situations. There are also scalable methods for generating, compositing, and synthesizing test data to form new combinations of variables.
"Motion capture," "motion tracking," and "mocap" are used interchangeably herein to describe recording movement and converting the movement into a digital model. In a motion capture session, the movements of one or more actors are sampled multiple times per second to record the actors' movements.
The motion capture data may be a recorded or combined output of the motion capture device that is converted into a three-dimensional model. A motion capture system tracks one or more feature points of a subject in space relative to its own coordinate system. The capture information may be in any number of known formats. Motion capture data is created using any of a number of optical or non-optical systems with active, passive or non-marker systems, or inertial, magnetic or mechanical systems. In one embodiment, the model is developed by a processing pipeline in a target recognition, analysis, and tracking system. To verify the accuracy of the pipeline, the performance of the pipeline in both building the model and tracking the movement of the model is compared to known good skeletal tracking information.
Such known good skeletal tracking information is referred to herein as ground truth data. One type of ground truth data may be developed by manually or automatically tracking movements of a subject and verifying points used in a skeletal model using various techniques. Other types of ground truth include background data and floor location data. The depth clip with the ground truth can then be used to test further implementations of the pipeline. Analysis metrics are provided to allow developers to evaluate the effectiveness of various interactions and changes to the pipeline.
In general, as described below, the target recognition, analysis, and tracking system of the present technology uses depth information to define and track the motion of a user within the field of view of a tracking device. A skeletal model of the user is generated, and points on the model are used to track the movement of the user provided to the respective applications, which use the data for various purposes. The accuracy of the skeletal model and the motion tracked by the skeletal model is generally desirable.
FIG. 1A illustrates an environment in which the present technology may be used. FIG. 1A is a block diagram illustrating various test data sources, a test data store, and a plurality of different types of test systems in which the present techniques may be used. In one embodiment, to create test data for a test data store, a motion capture clip of a test subject is created while a depth sensor is used to create a depth clip of the subject. A depth clip is a sequence of depth frames, each depth frame comprising an array of depth values representing the depth sensor output of a single frame. Other methods of creating test data are also discussed herein.
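As a minimal illustration of the depth-clip structure just described, the following Python sketch models a depth frame as a two-dimensional array of depth values and a depth clip as an ordered sequence of such frames. The field names, timestamp, and millimeter unit are illustrative assumptions, not part of any recorded format used by the system.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class DepthFrame:
    """One frame of depth sensor output (depth values assumed here to be in mm)."""
    timestamp_ms: int
    depth: np.ndarray  # 2-D array, shape (height, width), one depth value per pixel

@dataclass
class DepthClip:
    """A sequence of depth frames recorded during one capture session."""
    clip_id: str
    frames: List[DepthFrame] = field(default_factory=list)

    def frame_count(self) -> int:
        return len(self.frames)

# Example: a short synthetic clip with two 240x320 frames at a constant 2.5 m depth.
clip = DepthClip(clip_id="subject01_walk")
for t in (0, 33):
    clip.frames.append(DepthFrame(timestamp_ms=t,
                                  depth=np.full((240, 320), 2500, dtype=np.uint16)))
print(clip.frame_count())  # 2
```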
In one embodiment, motion capture data is acquired using a motion capture system. The motion capture system 111 may comprise any of a number of known types of motion capture systems. In one embodiment, the motion capture system 111 is a magnetic motion capture system in which a plurality of sensors are placed on the body of the subject to measure the magnetic field generated by one or more emitters. Motion capture data differs from ground truth in that motion capture data is the position (and in some cases orientation) of the sensors relative to the motion capture system, whereas ground truth is the position, and in some cases orientation, of the subject's joints relative to the depth sensor. When using such systems, the simultaneously recorded depth clips must be correlated with the positions detected by the motion capture system's sensors to generate ground truth. The correlation is performed by alignment and calibration between the motion capture system and the depth sensor data.
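The correlation step just described can be pictured as applying a rigid transform, obtained during calibration, that maps motion-capture coordinates into the depth sensor's coordinate space. The sketch below assumes such a 4x4 homogeneous transform is already known from a prior calibration; it only illustrates the mapping, not how the calibration itself is computed, and the example transform values are purely hypothetical.

```python
import numpy as np

def mocap_to_depth_space(mocap_points: np.ndarray, transform: np.ndarray) -> np.ndarray:
    """Map Nx3 motion-capture sensor positions into depth-camera coordinates.

    `transform` is a 4x4 homogeneous rigid transform assumed to come from a prior
    alignment/calibration between the motion capture system and the depth sensor.
    """
    n = mocap_points.shape[0]
    homogeneous = np.hstack([mocap_points, np.ones((n, 1))])   # N x 4
    mapped = (transform @ homogeneous.T).T                     # N x 4
    return mapped[:, :3]

# Hypothetical calibration: a 90-degree rotation about the Y axis plus a translation.
calib = np.array([[0.0, 0.0, 1.0, 0.10],
                  [0.0, 1.0, 0.0, 0.00],
                  [-1.0, 0.0, 0.0, 2.00],
                  [0.0, 0.0, 0.0, 1.00]])
sensors = np.array([[0.0, 1.2, 0.3],    # e.g. a wrist sensor position
                    [0.0, 1.5, 0.0]])   # e.g. a shoulder sensor position
print(mocap_to_depth_space(sensors, calib))
```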
Also shown in FIG. 1A is a depth capture system that may include a depth sensor (such as depth camera 426 discussed below) that records a depth clip.
The various sources of test data 110 through 118 may provide test data and ground truth to test data repository 102. The raw motion capture data 110 is an output provided by an active or passive motion capture device, such as capture system 111. The raw motion capture data may not have been analyzed to provide associated ground truth information. Depth clip 112 may be data recorded concurrently with the motion capture data, or the depth clip may be created without association with accompanying motion capture data. Such raw depth data may be reviewed manually by an annotator who reviews each frame, or a portion of the frames, in the depth clip and models the joints of the subject in the depth space. Particular sources of motion capture and depth data include game developer motion and depth clips 114, and researcher-provided motion and depth clips 116. The game developer clips include clips specifically defined by an application developer to contain the movements required by the developer's game. For example, a tennis game may require a high degree of accuracy dedicated to playing tennis, including distinguishing forehand shots from ground strokes. Research clips 116 are provided by researchers seeking to push the development of the pipeline in a particular direction. The composite depth clip 118 is a combination of existing clips that defines movement scenarios and scenes which may not otherwise be available.
Ground truth development 115 represents the correlation of motion capture data with depth data to create ground truth, or the manual annotation of depth data by a person to match joints to depth frames.
The environment of FIG. 1A includes various types of test systems, including user test equipment 130, batch test system 145, and automatically built test system 140. Each of the test systems 130, 145, 140 accesses the data repository 102 to test the target recognition, analysis, and tracking pipeline 450 used by the target recognition, analysis, and tracking system. The pipeline is discussed with reference to FIGS. 7 and 8.
The test data repository 102 comprises a test data store 104 containing depth clips and a ground truth data store 105 containing ground truth associated with depth clips. It should be understood that data stores 104 and 105 may be combined into a single data store.
Test data repository 102 may include clip and ground truth data, as well as clip submission interface 106 and data server 108. The submission interface 106 may be a batch interface or a web server that allows data creators to provide any of the test data 110 through 118. Data repository 102 may include one or more standard databases that house the clip data and allow metadata, described below, to be associated with the clip data, thereby allowing a user to quickly and easily identify information available in individual clips, selected clips, and/or all clips and retrieve them from data server 108 for use in test systems 130, 145, and 140.
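One simple way to picture the metadata-driven retrieval described above is as tag filtering over an index of stored clips. The sketch below is only illustrative: the tag names (motion type, subject height, frame count) are assumptions for the example, and the actual annotation schema is discussed later with reference to FIG. 13.

```python
from typing import Dict, List

# Hypothetical repository index: clip id -> annotation metadata.
repository_index: Dict[str, Dict[str, object]] = {
    "clip_0001": {"motion": "tennis_forehand", "subject_height_cm": 185, "frames": 900},
    "clip_0002": {"motion": "jump",            "subject_height_cm": 160, "frames": 450},
    "clip_0003": {"motion": "tennis_serve",    "subject_height_cm": 172, "frames": 700},
}

def find_clips(index: Dict[str, Dict[str, object]], **criteria) -> List[str]:
    """Return the ids of clips whose metadata matches every supplied criterion exactly."""
    return [clip_id for clip_id, meta in index.items()
            if all(meta.get(key) == value for key, value in criteria.items())]

# A developer assembling tennis test data might query:
print(find_clips(repository_index, motion="tennis_forehand"))  # ['clip_0001']
```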
As discussed below with reference to fig. 4, each target recognition, analysis, and tracking system may include, for example, a depth imaging processing and skeletal tracking pipeline 450. The pipeline acquires motion data in the form of depth images and processes these images to compute the position, motion, identity, and other aspects of the user in the field of view of a capture device, such as capture device 20 in FIG. 4.
In one embodiment, depth image processing and skeletal tracking pipeline 450 may include any combination of hardware and code to perform the various functions described with reference to fig. 4-8, as well as the associated applications referenced herein and incorporated by reference. In one embodiment, the hardware and code may be modified by uploading the updated code 125 into the processing system.
Each of the test systems 130, 145, 140 has access to one or more versions of the motion tracking pipeline. When creating a new version of the pipeline (code 125, 127 in FIG. 1A), the test systems 130, 140, and 145 use the depth clip and ground truth data to evaluate the performance of the new version by comparing the performance of the pipeline function in tracking motion in the test data to the known ground truth data.
For example, user test equipment 130, including the processing equipment described below with reference to FIGS. 17 and 18, may include pipeline code 450 updated by new code 125. When performing a test on data selected from data repository 102, pipeline code may be executed within a user's processing device 130. When testing new code 125, user test equipment 130 is configured by the developer to access the clip data from data repository 102 through the data server on which the pipeline is tested.
The analysis engine 200 described below with reference to fig. 2 outputs an analysis of the depth image processing and skeletal tracking pipeline with respect to ground truth associated with test data input into the pipeline. The analysis engine 200 provides a plurality of reporting metrics to the analysis user interface 210.
The batch test system 145 may include a collection of one or more processing devices that includes an analysis engine 200 and an analysis user interface 210. The batch test system includes a connection to one or more processing devices, such as consoles 150 and computers 160. Each console 150 and computer 160 may execute pipeline 450 and may be updated with new pipeline code 125. The new pipeline code may be submitted to the batch test system and loaded onto the respective consoles 150 and computers 160 under the control of the batch test system.
The analysis engine 200 and job controller 220 in the batch test system 145 control the supply of test data to the pipeline associated with each console and computer, and collect the analysis of each console's and computer's output for the new pipeline code 125 that is submitted and on which the batch tests are performed.
The automatically built test system 140 is similar to the batch system in that it provides access to a plurality of consoles and computers, each having an associated processing pipeline. The processing pipeline is defined by new pipeline code, such as nightly code 127 submitted to the automatically built test system 140. The automatically built test system is designed to perform regular, periodic tests on newly submitted code 127. Thus, the code manager 142 manages when new code 127 is allowed into the system, which code is verified as testable, and which code is provided to the consoles 150 and computers 160 for periodic processing under the control of the analysis engine 200. It should be appreciated that the periodic processing may occur nightly or, for example, on some other periodic basis. The automatically built test system 140 is useful when multiple different developers are providing new pipeline code, the management of which is defined by the code manager 142. The code may be tested on a nightly basis after each check-in of developer code, or according to some other schedule defined by the automatically built test system.
FIG. 1B is a flow chart illustrating a method for implementing the present technology. At 164, a depth clip of the subject's movement of interest is created. Various embodiments for creating a depth clip and resulting test data (including the depth clip and associated ground truth for the depth clip) are discussed in FIG. 7A. In one embodiment, the test data is created without motion capture data. In another embodiment, the test data may be created by using the motion capture system to create motion clips simultaneously with the depth clip at 164, which are then used to create ground truth for the depth clip. Alternatively, at 164, the depth clip may be composited with other depth clips and three-dimensional data.
At 166, ground truth is created and associated with the test data. Ground truth data is created and/or validated through a machine process or a manual tagging process. As explained below, the target recognition, analysis, and tracking system uses a skeletal model (such as the skeletal models shown in FIGS. 6A and 6B) to identify a test subject within the field of view of a capture device. In one context, ground truth data is a verified skeletal model used to track a subject within a field of view, in which the actual motion of the subject is accurately identified in space by the model. As described above, other types of ground truth, such as background information and floor location, may be used for other purposes described herein. At 168, the annotated ground truth and depth data are stored, and the test data is annotated with metadata to allow advanced searching of the data. The annotation metadata is discussed below with reference to FIG. 13. Annotating the test data includes attaching an associated set of metadata to the test data. It should be understood that the number and types of data tags attached to the data and described below are exemplary. The metadata allows developers to quickly find a particular type of data that can be used to perform a particular test on the new pipeline code.
When a test is initiated for a particular pipeline at 170, generally one of two types of test will be performed: a custom or batch test, or a nightly (periodic) test.
If an automatic or periodic test is to be performed using, for example, the automatically built test system 140, the test data will be run through the particular processing pipeline or pipelines of interest at 174 and the output analyzed against ground truth at 176. Various reports and report summaries may be provided at 178. The process of analyzing the pipeline relative to the ground truth at 176 is explained below and is performed with reference to the detection of differences between the ground truth and the output of the pipeline, and analyzed using a plurality of different metrics. Generally, automatic or periodic tests will be run periodically with respect to the same data set to track changes in the performance of the code over time.
If custom or batch testing is to be used at 172, the testing may need to be optimized for selected features, or to test specific portions of the pipeline. At 182, optionally, a test suite is constructed. The test suite may be a subset of test data and associated ground truth customized for a particular function. For example, if an application developer wishes to test a particular pipeline with respect to use in a tennis game, the test may be optimized for the accuracy with which the pipeline detects the user arm movements that differentiate between over-shoulder shots, serves, ground strokes, forehands, and backhands. If the test suite contains enough data to perform the custom analysis, then additional data is not necessary at 184 and the test suite data may be run through the processing pipeline at 186. The output of the pipeline is again analyzed at 188 with respect to the existing ground truth in the test suite and reported at 190 in a manner similar to that described above with reference to step 178. If additional test data is needed for the particular application for which the test suite is being used, steps 172, 174, 176, and 178 may be repeated (at 192) to create the custom data that needs to be added to the test suite created at 182. The custom data may be newly recorded data or synthesized composite data, as described below.
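The custom/batch flow of steps 182 through 190 can be summarized as: assemble a suite of clips, feed each clip to the pipeline under test, and collect its tracking output for later comparison against ground truth. The sketch below is only a schematic of that loop; `run_pipeline` is an assumed placeholder standing in for the actual tracking pipeline, and the clip ids and array shapes are illustrative.

```python
from typing import Callable, Dict, List
import numpy as np

def run_test_suite(suite: List[str],
                   depth_clips: Dict[str, np.ndarray],
                   run_pipeline: Callable[[np.ndarray], np.ndarray]) -> Dict[str, np.ndarray]:
    """Feed every clip in the suite to the pipeline under test and collect its tracking output.

    Comparison of the returned results against ground truth is performed separately
    by the analysis metrics (see the metric sketch later in this description).
    """
    return {clip_id: run_pipeline(depth_clips[clip_id]) for clip_id in suite}

# Toy usage: a tennis-oriented suite of two hypothetical clips and a dummy pipeline
# that returns 20 joint positions per frame.
suite = ["clip_0001", "clip_0003"]
clips = {cid: np.zeros((10, 240, 320), dtype=np.uint16) for cid in suite}
results = run_test_suite(suite, clips, lambda clip: np.zeros((clip.shape[0], 20, 3)))
print({cid: out.shape for cid, out in results.items()})
```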
FIG. 2 illustrates an analysis engine 200, an analysis user interface 210, and a job manager 220. As described above, the analysis engine 200 may be provided in the user test system 130, the batch test system 145, or the automatically built test system 140.
Analysis user interface 210 allows a developer or other user to define the specific test data and metrics for use by the job controller and analysis engine in one or more test sequences and reports. Analysis user interface 210 also allows a user to select various test data for use in a particular test run using the build and test clip selection interface 214. For any test, a particular selection of test clips and metrics, or all clips and all metrics, may be used. The analysis user interface provides a visualization 216 to the user, which presents the results of the roll-up report generator and the various metric reports provided by the analysis engine. Job manager 220 is fed, via analysis user interface 210, the test data and the result set to be analyzed by devices 150/160.
Job manager 220 includes a pipeline load controller 224 and a job controller 226. The pipeline load controller receives new pipeline code 125 from any number of sources and ensures that the pipeline code can be installed in each of a plurality of devices using the device interface 222. Job controller 226 receives input from analysis user interface 210 and defines information that is provided to the various pipelines in each of the different devices that provide the code to be analyzed and receive the analysis performed. In other implementations, the set of batch test instructions may supplement or replace the analysis user interface and job manager 220.
The analysis engine includes an analysis manager 230, a report generator 250, and a metrics plug-in assembly 240 with a metrics process controller 245. The analysis manager 230 takes the performed analyses 232 and compiles the completed results 234. The analysis performed compares the clip tracking results from the pipeline to the ground truth for the clip. In one embodiment, individual data elements are compared between the clip results and the ground truth, and the errors are passed to a plurality of statistical metric engines. Alternatively, the metrics engine invokes a plurality of metrics plug-ins that each compare the tracking results to ground truth and further evaluate the error. For example, a skeletal metrics plug-in (SkeletonMetrics plug-in) yields the raw error, as well as any resulting statistical evaluation of the clip relative to ground truth. In another embodiment, there may be a metrics engine for each physical CPU core available to the analysis engine, and each metrics engine has a list of all the metrics plug-ins that have been requested.
A test run is defined by feeding at least a subset of the searchable collection of test data to a given tracking pipeline and logging the results (i.e., where each skeletal joint is located for each frame). Various metrics are then used in the analysis engine to compare the tracking results to the ground truth for each piece of test data. The output from the analysis engine is processed by a report generator to create an aggregated report. In addition, even where ground truth is not involved, useful information may be determined from the test run, including performance and general tracking information, such as whether any skeleton is tracked in a given frame.
The clip tracking results and the ground truth are provided to any of a number of different metric engines 242, 244, 246 that compute various reported metrics of the tracking results relative to the ground truth. Exemplary metrics are described below. The metric engines are enabled via plug-ins 240, which allow the set of metrics available for analysis to be modified and customized. The example metrics described herein are merely exemplary and illustrate one example of the present technology. Any number of different types of metrics in accordance with the present techniques may be used to analyze ground truth data relative to tracking data, as described herein. The metric engines return the analysis metric results to the analysis manager, which compiles and outputs the results to the roll-up report generator 250. The roll-up report generator provides the report and its associated data set to the analysis manager for provision to the analysis user interface 210. An exemplary summary report is described below.
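As an illustration of the plug-in style of metric described above, the sketch below computes one plausible metric, per-joint Euclidean error against ground truth, and summarizes it with mean, maximum, and 95th-percentile values. The class name, method shape, and units are assumptions made for this example; they do not reflect the system's actual plug-in interface.

```python
import numpy as np

class JointErrorMetric:
    """Hypothetical metric plug-in: distance between tracked and ground-truth joints."""

    name = "joint_position_error"

    def evaluate(self, tracked: np.ndarray, ground_truth: np.ndarray) -> dict:
        # tracked, ground_truth: arrays of shape (frames, joints, 3), assumed in meters.
        errors = np.linalg.norm(tracked - ground_truth, axis=-1)
        return {
            "mean_error_m": float(errors.mean()),
            "max_error_m": float(errors.max()),
            "p95_error_m": float(np.percentile(errors, 95)),
        }

# Toy example: a tracked result offset 2 cm from ground truth on every joint.
gt = np.zeros((100, 20, 3))
tracked = gt + np.array([0.02, 0.0, 0.0])
print(JointErrorMetric().evaluate(tracked, gt))
```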
Fig. 3A-4 illustrate a target recognition, analysis, and tracking system 10 that may be used to recognize, analyze, and/or track a human target, such as a user 18. Various embodiments of the target recognition, analysis, and tracking system 10 include a computing environment 12 for executing gaming or other applications. The computing environment 12 may include hardware components and/or software components such that the computing environment 12 may be used to execute applications such as games and non-gaming applications. In one embodiment, the computing environment 12 may include a processor, such as a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions stored on a processor readable storage device for performing the processes described herein.
The system 10 also includes a capture device 20, the capture device 20 for capturing image and audio data related to one or more users and/or objects sensed by the capture device. In various embodiments, the capture device 20 may be used to capture information related to partial or full body movements, gestures, and speech of one or more users, which is received by the computing environment and used to present, interact with, and/or control aspects of gaming or other applications. Examples of the computing environment 12 and the capture device 20 are explained in more detail below.
Various embodiments of the target recognition, analysis, and tracking system 10 may be connected to an audio/visual (A/V) device 16 having a display 14. The device 16 may be, for example, a television, a monitor, a High Definition Television (HDTV), or the like that may provide game or application visuals and/or audio to a user. For example, the computing environment 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audio/visual signals associated with a game or other application. The A/V device 16 may receive audio/visual signals from computing environment 12 and may then output game or application visuals and/or audio associated with those audio/visual signals to user 18. According to one embodiment, the audio/visual device 16 may be connected to the computing environment 12 via, for example, an S-video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, a component video cable, or the like.
In various embodiments, the computing environment 12, the A/V device 16, and the capture device 20 may cooperate to present an avatar or on-screen character 19 on the display 14. For example, FIG. 3A shows a user 18 playing a soccer game application. The movements of the user are tracked and used to animate the movements of avatar 19. In various embodiments, the avatar 19 mimics the movements of the user 18 in real world space such that the user 18 may perform movements and gestures that control the movements and actions of the avatar 19 on the display 14. In FIG. 3B, the capture device 20 is used in a NUI system where, for example, the user 18 is scrolling and controlling a user interface 21 having various menu options presented on the display 14. In these embodiments, the computing environment 12 and capture device 20 may be used to recognize and analyze movements and gestures of the user's body, and such movements and gestures may be interpreted as controls for the user interface.
The various embodiments of fig. 3A-3B are two of many different applications that may be run on the computing environment 12, and the application running on the computing environment 12 may be various other gaming and non-gaming applications.
Fig. 3A-3B illustrate an environment containing static background objects 23, such as floors, chairs, and plants. These objects are objects within the FOV captured by the capture device 20, but do not change from frame to frame. The static object may be any object viewed by the image camera in the capture device 20, other than the floor, chair, and plant shown. Additional static objects within the scene may include any walls, ceilings, windows, doors, wall decorations, and the like.
Suitable examples of the system 10 and its components are found in the following co-pending patent applications, all of which are hereby incorporated by reference in their entirety: U.S. patent application Serial No. 12/475,094, entitled "Environment And/Or Target Segmentation," filed May 29, 2009; U.S. patent application Serial No. 12/511,850, entitled "Auto Generating a Visual Representation," filed July 29, 2009; U.S. patent application Serial No. 12/474,655, entitled "Gesture Tool," filed May 29, 2009; U.S. patent application Serial No. 12/603,437, entitled "Pose Tracking Pipeline," filed October 21, 2009; U.S. patent application Serial No. 12/475,308, entitled "Device for Identifying and Tracking Multiple Humans Over Time," filed May 29, 2009; U.S. patent application Serial No. 12/575,388, entitled "Human Tracking System," filed October 7, 2009; U.S. patent application Serial No. 12/422,661, entitled "Gesture Recognizer System Architecture," filed April 13, 2009; U.S. patent application Serial No. 12/391,150, entitled "Standard Gestures," filed February 23, 2009; and U.S. patent application Serial No. 12/474,655, entitled "Gesture Tool," filed May 29, 2009.
FIG. 4 illustrates one example embodiment of a capture device 20 that may be used in the target recognition, analysis, and tracking system 10. In an example embodiment, the capture device 20 may be configured to capture video having a depth image that may include depth values via any suitable technique including, for example, time-of-flight, structured light, stereo image, or the like. According to one embodiment, the capture device 20 may organize the calculated depth information into "Z layers," or layers perpendicular to a Z axis extending from the depth camera along its line of sight. The X and Y axes may be defined as being perpendicular to the Z axis. The Y-axis may be vertical and the X-axis may be horizontal. X, Y and the Z-axis together define the 3-D real world space captured by the capture device 20.
As shown in FIG. 4, the capture device 20 may include an image camera component 422. According to an example embodiment, the image camera component 422 may be a depth camera that may capture a depth image of a scene. The depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as a length or distance in, for example, centimeters, millimeters, or the like of an object in the captured scene from the camera.
As shown in FIG. 4, according to one exemplary embodiment, the image camera component 422 may include an IR light component 424, a three-dimensional (3-D) camera 426, and an RGB camera 428 that may be used to capture a depth image of a scene. For example, in time-of-flight analysis, the IR light component 424 of the capture device 20 may emit infrared light onto the scene and may then use sensors (not shown) to detect the backscattered light from the surface of one or more targets and objects in the scene using, for example, the 3-D camera 426 and/or the RGB camera 428.
In some embodiments, pulsed infrared light may be used, such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on a target or object in the scene. Additionally, in other example embodiments, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine the phase shift. This phase shift may then be used to determine a physical distance from capture device 20 to a particular location on the targets or objects.
According to another example embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.
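The pulse-timing and phase-shift variants of time-of-flight sensing described above reduce to simple relations, shown below as a standard derivation rather than text taken from this description: c is the speed of light, delta-t the measured round-trip time, delta-phi the measured phase shift, and f the modulation frequency.

```latex
% Pulsed time-of-flight: the round trip covers 2d in time \Delta t
d = \frac{c\,\Delta t}{2}

% Phase-shift (continuous-wave) time-of-flight at modulation frequency f
d = \frac{c\,\Delta\varphi}{4\pi f}
```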
In another example embodiment, the capture device 20 may use a structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the scene via, for example, the IR light component 424. Upon landing on the surface of one or more targets or objects in the scene, the pattern may deform in response. Such deformations of the pattern may be captured by, for example, the 3-D camera 426 and/or the RGB camera 428 and may then be analyzed to determine a physical distance from the capture device 20 to a particular location on the targets or objects.
The capture device 20 may also include a microphone 430. The microphone 430 may include a transducer or sensor that may receive and convert sound into an electrical signal. According to one embodiment, the microphone 430 may be used to reduce feedback between the capture device 20 and the computing environment 12 in the target recognition, analysis, and tracking system 10. In addition, the microphone 430 may be used to receive audio signals that may also be provided by the user to control applications such as gaming applications, non-gaming applications, or the like that may be executed by the computing environment 12.
In an example embodiment, the capture device 20 may also include a processor 432 that may be in operative communication with the image camera component 422. Processor 432 may include a standard processor, a special purpose processor, a microprocessor, or the like that may execute instructions that may include instructions for receiving a depth image, determining whether a suitable target may be included in a depth image, converting a suitable target into a skeletal representation or model of the target, or any other suitable instructions.
The capture device 20 may also include a memory component 434, and the memory component 434 may store instructions executable by the processor 432, images or frames of images captured by the 3-D camera or RGB camera, or any other suitable information, images, or the like. According to an example embodiment, the memory component 434 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 4, in one embodiment, the memory component 434 may be a separate component in communication with the image camera component 422 and the processor 432. According to another embodiment, the memory component 434 may be integrated into the processor 432 and/or the image camera component 422.
As shown in FIG. 4, the capture device 20 may communicate with the computing environment 12 via a communication link 436.
Computing environment 12 includes a depth image processing and skeletal tracking pipeline 450 that uses depth images to track one or more persons that may be detected by the depth camera functionality of capture device 20. Depth image processing and skeletal tracking pipeline 450 provides tracking information to application 452, which application 452 may be a video game, a productivity application, a communications application, or other software application, among others. The audio data and visual image data may also be provided to the application 452 and to the depth image processing and skeleton tracking module 450. The application 452 provides the tracking information, audio data, and visual image data to the recognizer engine 454.
Recognizer engine 454 is associated with a set of filters 460, 462, 464, each of which comprises information concerning a gesture, action, or condition. For example, the data from capture device 20 may be processed by filters 460, 462, 464 to identify when a user has performed one or more gestures. These gestures may be associated with various controls, objects, or conditions of the application 452. Thus, the computing environment 12 may use the recognizer engine 454 with the filters to interpret and track the movement of objects (including people).
The recognizer engine 454 includes a plurality of filters 460, 462, 464. Each filter includes information defining a gesture, action, or condition, together with parameters or metadata for that gesture, action, or condition. For example, a throw, which comprises motion of one hand from behind the body past the front of the body, may be implemented as a gesture that includes information representing the movement of one of the user's hands from behind the body past the front of the body, as that motion may be captured by the depth camera. Parameters may then be set for the gesture. When the gesture is a throw, the parameters may be a threshold speed that the hand must reach, a distance the hand travels (absolute, or relative to the overall size of the user), and a confidence rating by the recognizer engine that the gesture occurred. These parameters for a gesture may vary between applications, between contexts of a single application, or within one context of one application over time.
Inputs to the filter may include such things as joint data about the user's joint position, the angle formed by the bones that intersect at the joint, RGB color data from the scene, and the rate of change of some aspect of the user. The output from the filter may include such things as the confidence that a given gesture is being made, the speed at which the gesture motion is made, and the time at which the gesture motion is made.
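A gesture filter of the kind just described can be pictured as a small object with tunable parameters that consumes per-frame joint data and emits a confidence. The sketch below implements a crude throw-like filter based only on hand speed; the class shape, threshold value, and frame rate are illustrative assumptions and do not represent the recognizer engine's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ThrowFilter:
    """Hypothetical gesture filter: detects a throw from peak hand speed."""
    min_hand_speed_m_s: float = 2.5      # tunable parameter, may vary per application
    frame_interval_s: float = 1.0 / 30.0  # assumed 30 fps capture

    def evaluate(self, hand_positions: List[Tuple[float, float, float]]) -> dict:
        if len(hand_positions) < 2:
            return {"confidence": 0.0, "peak_speed_m_s": 0.0}
        speeds = []
        for (x0, y0, z0), (x1, y1, z1) in zip(hand_positions, hand_positions[1:]):
            dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2) ** 0.5
            speeds.append(dist / self.frame_interval_s)
        peak = max(speeds)
        # Confidence scales with peak speed relative to the threshold, capped at 1.
        confidence = min(1.0, peak / self.min_hand_speed_m_s)
        return {"confidence": confidence, "peak_speed_m_s": peak}

# Hand moving 10 cm per frame at 30 fps corresponds to 3 m/s, above the assumed threshold.
positions = [(0.0, 1.0, 2.0 - 0.1 * i) for i in range(10)]
print(ThrowFilter().evaluate(positions))
```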
More information about the recognizer engine 454 may be found in U.S. patent application 12/422,661, "Gesture Recognizer System Architecture," filed April 13, 2009, the entire contents of which are incorporated herein by reference. More information about recognizing gestures may be found in U.S. patent application 12/391,150, "Standard Gestures," filed February 23, 2009, and U.S. patent application 12/474,655, "Gesture Tool," filed May 29, 2009, both of which are hereby incorporated by reference in their entirety.
FIG. 5 shows a non-limiting visual representation of an example body model 70. Body model 70 is a machine representation of a modeled target (e.g., game player 18 of fig. 3A and 3B). The body model may include one or more data structures that include a set of variables that collectively define the modeled target in the language of a game or other application/operating system.
The model of the target may be configured differently without departing from the scope of the invention. In some examples, the model may include one or more data structures representing the target as a three-dimensional model comprising rigid and/or deformable shapes, or body parts. Each body part may be characterized as a mathematical primitive, examples of which include, but are not limited to, spheres, anisotropically scaled spheres, cylinders, anisotropic cylinders, smooth cylinders, boxes, beveled boxes, prisms, and the like.
For example, body model 70 of FIG. 5 includes body parts bp1 through bp14, each representing a different portion of the modeled target. Each body part is a three-dimensional shape. For example, bp3 is a rectangular prism representing the left hand of the modeled target, while bp5 is an octagonal prism representing the upper-left arm of the modeled target. Body model 70 is exemplary in that a body model may contain any number of body parts, each of which may be any machine-understandable representation of the corresponding portion of the target being modeled.
A model comprising two or more body parts may also comprise one or more joints. Each joint may allow one or more body parts to move relative to one or more other body parts. For example, a model representing a human target may include a plurality of rigid and/or deformable body parts, some of which may represent respective anatomical body parts of the human target. Further, each body part of the model may include one or more structural members (i.e., "bones" or skeletal parts) with joints located at the intersection of adjacent bones. It should be understood that some bones may correspond to anatomical bones in the human target, and/or some bones may not have corresponding anatomical bones in the human target.
The bones and joints may together constitute a skeletal model, which may be a constituent element of a body model. In some embodiments, a skeletal model may be used in place of another type of model (such as model 70 of FIG. 5). The skeletal model may include one or more skeletal members for each body part and joints between adjacent skeletal members. An exemplary skeletal model 80 and an exemplary skeletal model 82 are shown in FIGS. 6A and 6B, respectively. FIG. 6A shows a skeletal model 80 with joints j1 through j33 as viewed from the front. FIG. 6B shows skeletal model 82 also having joints j1 through j33, as viewed from an oblique view.
The skeletal model 82 also includes roll joints j34 through j47, where each roll joint may be used to track an axial roll angle. For example, an axial roll angle may be used to define a rotational orientation of a limb relative to its parent limb and/or the torso. For example, if the skeletal model shows an axial rotation of an arm, the roll joint j40 may be used to indicate the direction in which the associated wrist is pointing (e.g., palm facing up). By examining the orientation of a limb relative to its parent limb and/or torso, the axial roll angle can be determined. For example, if a calf is being examined, the orientation of the calf relative to the associated thigh and hip can be examined to determine the axial roll angle.
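The axial roll just described can be estimated geometrically from the positions of a limb's joints relative to its child limb. The sketch below derives a roll angle for an upper arm from shoulder, elbow, and wrist positions by projecting the forearm onto the plane perpendicular to the upper-arm axis and measuring its angle against a reference direction. This is one plausible construction offered for illustration; the reference "up" vector stands in for a torso-derived reference and is an assumption, not the system's prescribed computation.

```python
import numpy as np

def axial_roll_deg(shoulder: np.ndarray, elbow: np.ndarray, wrist: np.ndarray,
                   reference_up: np.ndarray = np.array([0.0, 1.0, 0.0])) -> float:
    """Estimate the axial roll of an upper arm from the direction of its child limb.

    The roll is the signed angle, about the shoulder->elbow axis, between the forearm's
    off-axis component and a reference direction (here a world 'up' vector, assumed to
    stand in for a torso-derived reference).
    """
    axis = elbow - shoulder
    axis = axis / np.linalg.norm(axis)

    def perpendicular_part(v: np.ndarray) -> np.ndarray:
        return v - np.dot(v, axis) * axis

    forearm_perp = perpendicular_part(wrist - elbow)
    ref_perp = perpendicular_part(reference_up)
    # Signed angle between the two perpendicular components, measured about the limb axis.
    sin_term = np.dot(np.cross(ref_perp, forearm_perp), axis)
    cos_term = np.dot(ref_perp, forearm_perp)
    return float(np.degrees(np.arctan2(sin_term, cos_term)))

# Bent arm: upper arm pointing along +X, forearm pointing along +Z -> 90 degrees of roll.
print(axial_roll_deg(np.array([0.0, 1.5, 0.0]),
                     np.array([0.3, 1.5, 0.0]),
                     np.array([0.3, 1.5, 0.3])))
```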
The skeletal model may include more or fewer joints without departing from the spirit of the present invention. Other embodiments of the present system explained below operate using a skeletal model with 31 joints.
As described above, some models may include a skeleton and/or other body parts that serve as a machine representation of the object being modeled. In some embodiments, the model may alternatively or additionally include a wireframe mesh, which may include a hierarchy of rigid polygonal meshes, one or more deformable meshes, or any combination of the two.
The body part models and skeletal models described above are non-limiting example types of models that may be used as machine representations of the objects being modeled. Other models are also within the scope of the invention. For example, some models may include polygonal meshes, patches, non-uniform rational B-splines, subdivision surfaces, or other high-order surfaces. The model may also include surface textures and/or other information to more accurately represent clothing, hair, and/or other aspects of the modeled target. The model may optionally include information related to a current pose, one or more past poses, and/or model physics. It should be understood that the various models that may be posed are compatible with the target recognition, analysis, and tracking described herein.
As described above, the model serves as a representation of a target, such as game player 18 in FIGS. 3A and 3B. As the target moves in physical space, information from the capture device (such as capture device 20 in FIG. 4) may be used to adjust the pose and/or basic size/shape of the model in each frame so that it accurately represents the target.
FIG. 7A illustrates a flow diagram of one example of various methods for creating test data. Test data may be created by simultaneously capturing a motion clip at 602 and a depth clip at 604, both for the same movements that the subject is making at the same time, which are then used to create a ground truth for the depth clip at 606. As discussed below in fig. 9B and 10, this simultaneous recording is preceded by an alignment between the motion capture system and the depth capture system. Alternatively, the two test clips may be composited together, with the ground truth included therein inherited by the composited clip. Alternatively, a depth clip of the subject may be captured at 612 and annotated with skeletal tracking coordinates at 614 by the developer. At 608, test data including the depth clip and the associated ground truth is stored as test data.
FIG. 7B is a flow diagram representing the functionality of the target recognition, analysis, and tracking pipeline shown above at 450 and implemented by the new code 125 being tested by the system. The example method may be implemented using the capture device 20 and/or the computing environment 12 of the target recognition, analysis, and tracking system 10, as described with reference to FIGS. 3A-4. In step 800, depth information, or in the case of a test, a depth clip, is received from a capture device. At step 802, a determination is made as to whether the depth information or clip includes a human target. This determination is made based on the model fitting and model parsing processes described below. If a human target is determined at 804, the human target is scanned for body parts at 806, a model of the human target is generated at 808, and the model is tracked at 810.
The depth information may include depth clips created in the process discussed above with reference to fig. 7A. After receiving a clip comprising a plurality of depth images at 800, each image in the clip can be downsampled to a lower processing resolution so that the depth images can be more easily used and/or processed more quickly with less computational overhead. Additionally, one or more highly variable and/or noisy depth values may be removed and/or smoothed from the depth image; portions of missing and/or removed depth information may be filled in and/or reconstructed; and/or any other suitable processing may be performed on the received depth information so that the depth information may be used to generate a model, such as a skeletal model, as will be described in more detail below.
At 802, the target recognition, analysis, and tracking system may determine whether the depth image includes a human target. For example, at 802, each target or object in the depth image may be flood filled and compared to a pattern to determine whether the depth image includes a human target. The acquired image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value.
At 804, if the depth image does not include a human target, a new depth image of the capture area may be received at 800 such that the target recognition, analysis, and tracking system may determine at 802 whether the new depth image may include a human target.
At 804, if the depth image includes a human target, the human target may be scanned for one or more body parts at 806. According to one embodiment, the human target may be scanned to provide metrics (such as length, width, etc.) associated with one or more body parts of a user (such as user 18 described above with reference to fig. 3A and 3B) such that an accurate model of the user may be generated based on such metrics.
If the depth image of the frame includes a human target, the frame may be scanned for one or more body parts at 806. The determined values of the body parts for each frame may then be averaged such that the data structure may include average measurements such as length, width, etc. of the body parts associated with the scan for each frame. According to another embodiment, the measurements of the determined body parts may be adjusted, such as enlarged, reduced, etc., so that the measurements in the data structure more closely correspond to a typical human body model.
At 808, a model of the human target may then be generated based on the scan. For example, according to one embodiment, the measurements determined by the scanned bitmask may be used to define one or more joints in the skeletal model. The one or more joints may be used to define one or more bones that may correspond to a body part of a human.
At 810, the model may then be tracked. For example, according to an example embodiment, a skeletal model such as skeletal model 82 described above with reference to FIG. 6B may be a representation of a user such as user 18. As the user moves in physical space, information from a capture device, such as capture device 20 described above with reference to FIG. 4, may be used to adjust the skeletal model so that the skeletal model may accurately represent the user. In particular, one or more forces may be applied to one or more force-receiving aspects of the skeletal model to adjust the skeletal model to a pose that more closely corresponds to the pose of the human target in physical space.
FIG. 7C is a flow diagram of one embodiment of the present system for obtaining a model of user 18 (e.g., skeletal model 82 generated in step 808 of FIG. 7B) for a given frame or other time period. In addition to or instead of skeletal joints, the model may include one or more polygonal meshes, one or more mathematical primitives, one or more high-order surfaces, and/or other features for providing a machine representation of the target. Further, the model may exist as an instance of one or more data structures that exist on the computing system.
The method of FIG. 7C may be performed in accordance with the teachings of U.S. patent application Serial No. 12/876,418, entitled "System For Fast, Probabilistic Skeletal Tracking," filed September 7, 2010, the entire contents of which are incorporated herein by reference.
At step 812, m skeletal hypotheses are proposed using one or more computational theories that exploit some or all of the available information. One example of stateless processing for assigning a probability that a particular pixel or group of pixels represents one or more objects is sample processing. The sample processing uses a machine learning approach that employs depth images and classifies each pixel by assigning it a probability distribution over one or more objects to which it may correspond. Sample processing is further described in U.S. patent application No. 12/454,628 entitled "human body pose estimation," which is hereby incorporated by reference in its entirety.
In another embodiment, sample processing and centroid generation are used to generate probabilities for correctly identifying particular objects such as body parts and/or props. A centroid may have an associated probability that the captured object has been correctly identified as a given object (such as a hand, a face, or a prop). In one embodiment, centroids for the user's head, shoulders, elbows, wrists, and hands are generated. Sample processing and centroid generation are further described in U.S. patent application No. 12/825,657, entitled "Skeletal Joint Recognition And Tracking System," and U.S. patent application No. 12/770,394, entitled "Multiple Centroid Condensation of Probability Distribution Clouds." The entire contents of each of the above applications are incorporated herein by reference.
Next, in step 814, a rating score is calculated for each skeletal hypothesis. In step 816, a set X_t of n sampled skeletal hypotheses is filled from the m skeletal hypotheses proposed in step 812. The probability that a given skeletal hypothesis is selected into the sampled set X_t is proportional to the score assigned in step 814. Thus, once steps 812 through 816 have been performed, higher-scoring skeletal hypotheses are more likely to be represented in X_t. In this way, X_t will move toward a good state estimate. Then, in step 818, one or more sampled skeletal hypotheses (or a combination thereof) may be selected from the sampled set X_t as the output for the frame or other time period of captured data.
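The selection into the sampled set X_t with probability proportional to the rating score is, in effect, a weighted resampling step. The sketch below shows only that step in isolation; the hypothesis representation, the scores, and the sample count are placeholders for illustration, not the system's actual data types.

```python
import random
from typing import List, Sequence

def resample_hypotheses(hypotheses: Sequence[object],
                        scores: Sequence[float],
                        n: int,
                        rng: random.Random = random.Random(0)) -> List[object]:
    """Draw n hypotheses with probability proportional to their rating scores.

    Mirrors the step in which the sampled set X_t is filled from the m proposals;
    higher-scoring skeletal hypotheses are more likely to be selected, and a single
    hypothesis may be selected more than once.
    """
    return rng.choices(hypotheses, weights=scores, k=n)

# Toy example: three hypothetical skeletal hypotheses with unequal scores.
sampled = resample_hypotheses(["hyp_A", "hyp_B", "hyp_C"], [0.7, 0.2, 0.1], n=5)
print(sampled)  # a list of 5 draws, weighted toward 'hyp_A'
```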
FIG. 8 illustrates a flow diagram of an example pipeline 540 for tracking one or more targets. Pipeline 540 may be executed by a computing system (e.g., computing environment 12) to track one or more players interacting with a game or other application. In one embodiment, pipeline 540 may be used as the pipeline 450 in a target recognition, analysis, and tracking system as described above. Pipeline 540 may include a number of conceptual stages: depth image acquisition 542, background removal 544, foreground pixel assignment 546, model fitting 548 (using one or more experts 594), model parsing 550 (using arbiter 596), and skeletal tracking 560. Depth image acquisition 542, background removal 544, and foreground pixel assignment 546 may all be considered part of the pre-processing of the image data.
Depth image acquisition 542 may include receiving a depth image of a target observed within a field of view from depth camera 26 of capture device 20. The observed depth image may include a plurality of observed pixels, where each observed pixel has an observed depth value.
As shown at 554 of fig. 8, depth image acquisition 542 may optionally include downsampling the observed depth image to a lower processing resolution. Downsampling to a lower processing resolution may allow the observed depth image to be more easily used and/or more quickly processed with less computational overhead. One example of downsampling is to group pixels into tiles using a technique occasionally referred to as repetitive segmentation. Tiles may be selected to have approximately constant depth and to cover approximately equal regions of world space. This means that tiles further away from the camera appear smaller in the image. All subsequent reasoning about the depth image can be expressed in terms of tiles rather than pixels. As indicated, the downsampling step 554 of grouping pixels into tiles may be skipped to allow the pipeline to work with depth data from individual pixels.
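As a rough sketch of the tiling idea (using fixed-size image blocks of approximately constant depth rather than the equal-world-area tiles described above, and with hypothetical parameter values), downsampling might look like the following:

    def downsample_depth(depth, block=4, max_spread_mm=100):
        # Group depth pixels into block x block tiles of roughly constant depth.
        # Returns a list of (row, col, mean_depth) tiles; blocks whose depth varies
        # too much are skipped so they can be handled at full pixel resolution.
        rows, cols = len(depth), len(depth[0])
        tiles = []
        for r in range(0, rows - block + 1, block):
            for c in range(0, cols - block + 1, block):
                values = [depth[r + i][c + j]
                          for i in range(block) for j in range(block)
                          if depth[r + i][c + j] > 0]   # 0 = no depth reading
                if not values:
                    continue
                if max(values) - min(values) > max_spread_mm:
                    continue                            # not "approximately constant depth"
                tiles.append((r, c, sum(values) / len(values)))
        return tiles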
As shown at 556 of fig. 8, depth image acquisition 542 may optionally include removing and/or smoothing out one or more highly variable and/or noisy depth values from the observed depth image. Such highly variable and/or noisy depth values in an observed depth image may originate from a number of different sources, such as random and/or systematic errors occurring during the image capture process, imperfections and/or distortions due to the capture device, and so forth. Since such highly variable and/or noisy depth values may be artifacts of the image capture process, including these values in any future analysis of the image may skew the results and/or slow the computation. As such, removing such values may provide better data integrity and/or speed for future calculations.
Background removal 544 may include distinguishing human targets to be tracked from non-target, background elements in the observed depth image. As used herein, the term "background" is used to describe anything in a scene that is not part of an object to be tracked. The background may include, for example, the floor, chairs, and plants 23 in fig. 3A and 3B, but may generally include elements in front of (i.e., closer to the depth camera) or behind the target to be tracked. Distinguishing foreground elements to be tracked from negligible background elements may increase tracking efficiency and/or simplify downstream processing.
Background removal 544 may include assigning each data point (e.g., pixel) of the processed depth image a value, which may be referred to as a player index, that identifies the data point as belonging to a particular target or non-target background element. When using this approach, pixels or other data points assigned with background indices may be removed from consideration at one or more subsequent stages of pipeline 540. As an example, pixels corresponding to a first player may be assigned a player index equal to 1, pixels corresponding to a second player may be assigned a player index equal to 2, and pixels not corresponding to a target player may be assigned a player index equal to 0. Such player indices may be maintained in any suitable manner. In some embodiments, the pixel matrix may include, at each pixel address, a player index indicating whether the surface at that pixel address belongs to a background element, a first player, a second player, or the like. The player index may be a discrete index or fuzzy index indicating the probability that a pixel belongs to a particular target and/or background.
Pixels may be classified as belonging to a target or background in various ways. Some background removal techniques may use information from one or more previous frames to aid and improve the quality of background removal. For example, a depth history image may be derived from two or more frames of depth information, where the depth value for each pixel is set to the deepest depth value experienced by the pixel during the sample frame. The depth history image may be used to identify moving objects (e.g., human game players) in the foreground of the scene from the background elements that do not move. In a given frame, a moving foreground pixel may have a different depth value than the corresponding depth value in the depth history image (at the same pixel address). In a given frame, the non-moving background pixels may have depth values that match corresponding depth values in the depth history image.
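A minimal sketch of the depth history approach, assuming depth is given in millimeters, 0 denotes a missing reading, and a hypothetical tolerance value, might look like this:

    def update_depth_history(history, frame):
        # Maintain a per-pixel depth history: the deepest value seen so far.
        for r, row in enumerate(frame):
            for c, d in enumerate(row):
                if d > history[r][c]:
                    history[r][c] = d
        return history

    def moving_foreground_mask(history, frame, tolerance_mm=50):
        # A pixel is treated as moving foreground when it is meaningfully nearer
        # to the camera than the deepest depth ever seen at that pixel address.
        return [[1 if (d > 0 and history[r][c] - d > tolerance_mm) else 0
                 for c, d in enumerate(row)]
                for r, row in enumerate(frame)]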
As one non-limiting example, connected island background removal may be used. Such a technique is described, for example, in U.S. patent application No. 12/575,363 filed on October 7, 2009, which is incorporated herein by reference in its entirety. In some embodiments, a particular portion of the background, such as the floor, may be identified. In addition to being excluded when processing foreground objects, the found floor may be used as a reference surface for accurately locating virtual objects in the game space, for stopping flood filling as part of generating connected islands, and/or for rejecting islands whose centers are too close to the floor plane. One technique for finding the floor in a FOV is described, for example, in U.S. patent application No. 12/563,456 filed on September 21, 2009, which is incorporated herein by reference in its entirety. Other floor discovery techniques may be used.
Additional or alternative background removal techniques may be used to assign a player index or a background index to each data point, or otherwise distinguish foreground objects from background elements. For example, in fig. 8, pipeline 540 includes a bad body rejection 560. In some embodiments, objects initially identified as foreground objects may be rejected because they are dissimilar from any known target. For example, objects initially identified as foreground objects may be tested against baseline rules that should hold for any object to be tracked (e.g., an identifiable head and/or torso, bone lengths within a predetermined tolerance, etc.). If an object initially identified as a candidate foreground object fails such a test, it may be reclassified as a background element and/or subjected to further testing. In this way, moving objects that are not to be tracked (such as a chair pushed into the scene) may be classified as background elements because such elements are not similar to human targets. Where, for example, the pipeline is tracking the target user 18 and a second user enters the field of view, the pipeline may take several frames to confirm that the new user is actually a human. At that point, the new user may be tracked instead of, or in addition to, the target user.
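The following sketch illustrates the kind of baseline-rule test described above; the joint names and bone length tolerances are hypothetical placeholders rather than values taken from the disclosed embodiments:

    # Hypothetical tolerances; a real system would calibrate these against anthropometric data.
    BONE_LENGTH_LIMITS_M = {
        ("head", "shoulder_center"): (0.15, 0.40),
        ("shoulder_center", "spine"): (0.20, 0.60),
        ("hip_center", "knee_left"): (0.25, 0.70),
    }

    def passes_ground_rules(joints):
        # Reject candidate foreground objects that do not look like a human target.
        if "head" not in joints or "shoulder_center" not in joints:
            return False                       # a trackable body should expose a head and torso
        for (a, b), (lo, hi) in BONE_LENGTH_LIMITS_M.items():
            if a in joints and b in joints:
                (ax, ay, az), (bx, by, bz) = joints[a], joints[b]
                length = ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5
                if not (lo <= length <= hi):
                    return False               # bone length outside the predetermined tolerance
        return True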
After distinguishing foreground pixels from background pixels, pipeline 540 also classifies pixels that are deemed to correspond to foreground objects to be tracked. In particular, in foreground pixel assignment 546 of fig. 8, each foreground pixel is analyzed to determine what part of the target user's body the foreground pixel may belong to. In various embodiments, the background removal step may be omitted, and foreground objects may be determined in other ways, for example, by movement relative to past frames.
Once depth image acquisition 542, background removal 544, and foreground pixel assignment 546 are completed, pipeline 540 performs model fitting 548 to identify skeletal hypotheses used as machine representations of player target 18, and performs model parsing 550 to select, from among these skeletal hypotheses, the hypothesis (or hypotheses) estimated to be the best machine representation of player target 18. The model fitting step 548 is performed in accordance with, for example, U.S. patent application serial No. 12/876,418, entitled "System For Fast, Probabilistic Skeletal Tracking," filed on September 7, 2010 by inventor Williams, referenced above.
Generally, at 565, the target recognition, analysis, and tracking system tracks the configuration of the articulated skeletal model. After each of the images is received, information associated with the particular image may be compared to information associated with the model to determine whether the user may have performed a movement. For example, in one embodiment, the model may be rasterized into a synthesized image, such as a synthesized depth image. The pixels in the composite image may be compared to the pixels associated with the human target in each received image to determine whether the human target in the received image has moved.
According to an example embodiment, one or more force vectors may be calculated based on pixels compared between the composite image and the received image. The one or more forces may then be applied or mapped to one or more force-receiving aspects, such as joints of the model, to adjust the model to more closely correspond to the pose of the human target or user in physical space.
According to another embodiment, the model may be adjusted to fit a mask or representation of the human target in each received image, thereby adjusting the model based on the user's movements. For example, after receiving each observed image, vectors including X, Y and Z values that may define each bone and joint may be adjusted based on the mask of the human target in each received image. For example, the model may move in the X-direction and/or the Y-direction based on the X and Y values associated with the pixels of the mask of the human in each received image. In addition, joints and bones of the model may be rotated in the Z-direction based on depth values associated with pixels of the mask of the human target in each received image.
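As a simplified, hypothetical illustration of adjusting the model toward observed positions (a single relaxation step per joint rather than the force-vector computation described above):

    def adjust_model(model, targets, gain=0.5):
        # model and targets map joint name -> (x, y, z). Each joint is pulled a
        # fraction of the way toward the position suggested by the received image,
        # so the model converges toward the user's pose over successive frames.
        adjusted = {}
        for name, joint in model.items():
            if name in targets:
                adjusted[name] = tuple(j + gain * (t - j)
                                       for j, t in zip(joint, targets[name]))
            else:
                adjusted[name] = joint          # no observation for this joint; leave it unchanged
        return adjusted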
Fig. 9A and 9B illustrate a process for creating test data and associated ground truth according to step 166 of fig. 1 above. As described above, one embodiment for providing depth clip data associated with ground truth is to manually flag the depth clip data.
FIG. 9A illustrates a method for manually annotating depth data. At 904, the original depth clip of the subject's movement, or a depth clip with computed ground truth, is loaded into the analysis interface. For manually marked clips, the manual process may then generate all of the ground truth by annotating the original depth clip, or modify the existing ground truth data to match the desired skeletal model.
According to the method of FIG. 9A, for each frame 908, the user views 909 the skeletal model coordinates relative to the depth data in a viewer and manually marks 910 the skeletal data for the skeletal processing pipeline based on what is visually perceptible. If ground truth data already exists, step 910 can include identifying an offset between the existing ground truth and the observed ground truth. After each frame in a particular clip has been completed at 912, a check is made at 914 for additional clips; if any are available, another clip is loaded at 916 and the process is repeated, with processing of the additional clips continuing at step 918.
FIG. 9B illustrates an alternative process for creating ground truth, where a motion capture system is used in conjunction with depth clips to create the ground truth. At 922, the motion capture system aligns the position of the sensor with the location on the body where the true "joint" is expected. Where active sensors are used to record motion capture data, a calibration is performed at 922 between the coordinate space of the motion capture sensor and the particular joint associated with the sensor, resulting in an offset between the sensor and the joint. These offsets may be used to calibrate the recording of motion capture data and automatically correlate the data to skeletal ground truth. Each sensor provides a position and orientation from which a transformation from the sensor to one or more joints in the sensor's coordinate space can be calculated. For all motion captures, an alignment is performed between the motion capture and the depth camera. This alignment process is shown in fig. 10.
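A minimal sketch of applying a calibrated sensor-to-joint offset, assuming each sensor reports its position and a 3x3 rotation matrix for its orientation (the names are hypothetical):

    def sensor_to_joint(sensor_pos, sensor_rot, joint_offset):
        # sensor_pos: (x, y, z); sensor_rot: 3x3 rotation matrix (list of rows);
        # joint_offset: offset from the sensor to the joint in the sensor's local
        # coordinate space. Returns the joint position in motion capture coordinates.
        ox, oy, oz = joint_offset
        return tuple(sensor_pos[i]
                     + sensor_rot[i][0] * ox
                     + sensor_rot[i][1] * oy
                     + sensor_rot[i][2] * oz
                     for i in range(3))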
The depth clip and motion capture clip are recorded simultaneously at 926, and the calibration clip is loaded at 928. The depth clip is analyzed by the target recognition, analysis, and tracking pipeline at 930, and the offset between each pipeline-identified joint and the corresponding motion capture sensor coordinate system position is calculated at 932. The offset is applied to the depth clip at 934. Any offset in distance, direction, force, or motion may be determined at step 932, and the difference is used to determine calibration accuracy. The alignment accuracy is verified at 936, which may include a quality assessment of accuracy made by the person processing the clip. If the accuracy is acceptable at 938, processing continues. If the accuracy is not acceptable, additional calibration clips are recorded at 940.
FIG. 10 illustrates the process used in step 922 of FIG. 9A or 9B for aligning the coordinate space of the motion capture system to the depth capture coordinate space. Alignment accuracy is a quantified value of the difference between the skeletal model of the subject being tracked by the pipeline and the position detected by the motion capture. In one implementation, test data is created at step 1010 using a motion capture sensor that is held facing the depth sensor and moved about the physical environment within the field of view of the motion capture system and the depth sensor. The motion capture system uses its own techniques to determine the position of the motion capture sensor in the local coordinate space of the motion capture system at 1020, and the depth sensor, as described above with reference to fig. 4, determines the closest point to the sensor at 1014. The set of points in the motion capture coordinate space (from the sensor tracked by the motion capture device) and the set of points in the depth capture coordinate space are used at 1040 to compute a transformation matrix that relates the two sets of points. The alignment is then applied to any subsequently recorded motion capture data to transform it into depth camera space. Using this matrix, subsequently recorded motion data can be used to provide ground truth according to figs. 9A and 9B.
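One standard way to compute such a transformation from corresponding point sets is a least-squares rigid alignment (the Kabsch/Procrustes method); the sketch below, using NumPy, is an illustrative assumption about how step 1040 could be realized rather than a description of the actual implementation:

    import numpy as np

    def align_point_sets(mocap_pts, depth_pts):
        # Compute a rigid transform (R, t) mapping motion capture coordinates into
        # depth camera coordinates from Nx3 arrays of corresponding points.
        P = np.asarray(mocap_pts, dtype=float)
        Q = np.asarray(depth_pts, dtype=float)
        p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
        H = (P - p_mean).T @ (Q - q_mean)            # cross-covariance of centered points
        U, _, Vt = np.linalg.svd(H)
        d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # guard against reflections
        t = q_mean - R @ p_mean
        return R, t                                   # depth_point ~= R @ mocap_point + t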
As described above, there may be instances where test data for a particular situation is not available. Figs. 11 and 12 show the synthesis of combined depth data. New test data is synthesized using existing depth clip information and, where it exists, the associated ground truth.
To composite one or more depth clips, possibly with associated ground truth, into a single scene, the user starts with a base clip of a room, for example as shown in fig. 12A. As mentioned above, the room must include a floor, and may include other objects, walls, etc. The room shown in fig. 12A is a depth map image of the exemplary room shown in fig. 3A. As shown in fig. 12B, the creator adds the new clip to be composited into the scene by first removing background artifacts from the new clip, as shown in fig. 12C.
As shown in fig. 11, the steps shown in fig. 12A-12C may be performed by first retrieving the depth map of the current frame of the clip at 1162. At 1164, a background removal process is used to isolate the users in the frame. In one embodiment, background removal is performed manually by removing the floor via floor detection and then providing a bounding box separating the user from the depth map. Alternatively, other background removal algorithms may be used, such as the background removal algorithm described in U.S. patent application serial No. 12/575,363, entitled "System and Method for Removing the Background of an Image," filed on October 7, 2009, which is incorporated herein by reference in its entirety. FIG. 12B shows the isolated user in the scene.
At step 1166, the depth map is converted to a three-dimensional mesh. The three-dimensional mesh allows, at 1168, the clip's floor plane to be transformed into the ground plane of the composite clip. Coordinate matching may be used for this purpose. Matrices and transforms relating the clip's floor plane to the composite scene's floor plane are computed at 1168. The transformation uses the floor planes of the individual clips to complete the transformation map.
In step 1170, the depth map for each frame in the clip is transformed by the matrix calculated in step 1168. At step 1172, a model of the composite scene is rendered. One such composite scene is shown in fig. 12D, while another is shown in fig. 12H. At step 1174, the depth buffer is sampled and converted to depth map format. At 1176, the composite depth map is merged with the depth map computed at 1174 to complete the composite clip. If the new clip contains ground truth data, that data is transformed by the same matrix, thereby producing new ground truth in the composite clip.
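The merge rule at 1176 is not spelled out above; one plausible sketch, assuming depth values in millimeters with 0 meaning no reading and the nearer surface winning at each pixel, is:

    def merge_depth_maps(base, overlay):
        # Merge a transformed clip's depth map into the base scene's depth map.
        # At each pixel the nearer (smaller, non-zero) depth wins, so the inserted
        # user correctly occludes background geometry.
        merged = []
        for base_row, over_row in zip(base, overlay):
            row = []
            for b, o in zip(base_row, over_row):
                if o > 0 and (b == 0 or o < b):
                    row.append(o)
                else:
                    row.append(b)
            merged.append(row)
        return merged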
Fig. 12B shows a newly cropped depth map image added to the scene, while fig. 12C shows a human user in the scene without removing the background image. Next, the new clip is inserted into the composite scene as shown in FIG. 12D and discussed above. The location of the user in the new clip is provided by converting the isolated foreground image of the user to a reference frame of the target image based on the transformation matrix. The position of the new clip within the base clip may then be set as shown in fig. 12E. (Note that the user moves between the position shown in FIG. 12B and the position shown in FIG. 12E.)
The steps discussed above may be repeated for another new clip, here adding two children to the scene in fig. 12F and removing the background artifacts from the new clip in fig. 12G. In fig. 12H, the new clip's foreground (the body shapes of the children) is inserted into the scene as shown, and the users are positioned within the scene as shown in fig. 12I. Finally, a complete playback of the composite clip, with all the various users in the background scene, is shown in FIG. 12J.
It should be understood that any type of depth data (captured or synthesized) may be used in the above synthesis process. That is, a real-world depth capture of the user may be combined with computer-generated objects and composited into the scene. For example, such objects may be used to test tracking of a part of the user that may be hidden from the capture device. In addition, the user may order the input clips to occur at defined times and to play for defined durations in the composite scene.
FIG. 13 shows the steps of annotating a depth clip, discussed above in step 174 of FIG. 1B. At step 1302, bounded metadata is assigned to each available depth clip, and at step 1304 it is attached to the clip. The bounded metadata may include, for example, the metadata listed in Table 1 below. It should be appreciated that assigning metadata to the test data may occur at the time the data is recorded, after the data is analyzed, or at any point before or after the test data has been inserted into the data repository.
The primary goal of the metadata is ultimately to assist developers in tracking down problem situations and poses. Once developers have identified that a clip is problematic, they will be able to find other clips with similar characteristics to test for common causes. An additional objective is to provide valuable information as part of the report. Version, firmware version, driver version, date/time, platform, and the like may be determined automatically, which generally minimizes manual input and reduces the risk of error.
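Purely for illustration, a clip's metadata record might be modeled as below; the field names beyond those mentioned above (version, firmware, driver, date/time, platform) are hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class ClipMetadata:
        # Values shown in the example are purely illustrative.
        clip_id: str
        pipeline_build: str          # build version, captured automatically
        firmware_version: str
        driver_version: str
        recorded_at: str             # date/time, captured automatically
        platform: str
        subject_tags: list = field(default_factory=list)   # e.g. height class, handedness
        pose_tags: list = field(default_factory=list)       # e.g. "tennis serve", "left-hand wave"
        has_ground_truth: bool = True

    clip = ClipMetadata("clip_0042", "1.0.1234", "fw 2.3", "drv 5.1",
                        "2010-12-17T10:15:00", "console",
                        ["adult", "tall"], ["tennis serve"])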
Table 1 illustrates various types of metadata that may be associated with test data:
In addition, a quality control process may be performed on the ground truth to ensure that all metadata attached to the ground truth and associated clips is correct. For each frame in the depth clip, the position of a particular joint calculated by the pipeline may be compared to the established ground truth position and the ground truth data associated with the clip. If correction is required, the position of the element can be reset manually using a marker tool. Once the entire clip is annotated, the clip is saved.
FIG. 14 illustrates a process for providing a performance analysis 232 of a pipeline, which may be performed by the analysis engine 200, as shown in FIG. 2. At step 1402, test data is obtained and fed to the processing pipeline at 1404. Under the control of the job manager, the processing pipeline runs the data at 1406 and outputs the results to a log file, which is then used as input to an analysis engine that invokes a metrics plug-in to make comparisons between the tracked results and the ground truth. The analysis engine stores a portion of the core analysis comparison in a file, so that build-to-build comparisons can be performed fairly quickly. For each frame of tracking data at 1408, the tracked joint information is stored at 1412. The metric engine provides various measurements of the difference between the ground truth and the tracked position or orientation of the joints calculated by the pipeline. The process continues at 1414 for each frame of the depth clip fed into the processing pipeline. When all frames in the clip have been completed, the process continues for each pipeline at 1416. It should be understood that steps 1404-1416 may be performed simultaneously in multiple pipelines on multiple execution devices. At step 1418, for each metric for which a result is to be calculated, a respective metric calculation is performed. At 1420, the metric results are output, and at 1422, a summary report is output.
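As a simple illustration of the kind of measurement the metric engine might produce (per-joint Euclidean error against ground truth, averaged over a clip; the data layout is an assumption, not the actual log format):

    def joint_errors(tracked_frames, ground_truth_frames):
        # Each frame maps joint name -> (x, y, z). Returns the mean Euclidean error
        # per joint over all frames where both tracked and ground truth data exist.
        totals, counts = {}, {}
        for tracked, truth in zip(tracked_frames, ground_truth_frames):
            for name, gt in truth.items():
                if name not in tracked:
                    continue
                tx, ty, tz = tracked[name]
                gx, gy, gz = gt
                err = ((tx - gx) ** 2 + (ty - gy) ** 2 + (tz - gz) ** 2) ** 0.5
                totals[name] = totals.get(name, 0.0) + err
                counts[name] = counts.get(name, 0) + 1
        return {name: totals[name] / counts[name] for name in totals}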
An illustration of exemplary metrics is provided below with reference to Table 2. As described above, the number and types of metrics that can be used to evaluate the performance of a pipeline are virtually unlimited.
Exemplary metrics that can be used are shown in table 2:
In addition, a facility is provided that identifies which images from the machine learning training process were used to classify body parts in a given frame or pose, thereby providing information on how the system arrives at the conclusions presented in the final result. In this example, "similar" is defined by using a threshold and determining the skeletal distance between the frame being examined and a machine learning training image, thereby returning the most similar images that fit within the defined threshold. A weighted average is calculated from the number of images within each additional threshold point and used to derive a "popularity score". The score indicates the popularity of the gesture in the training set. In general, popular poses should be well supported by machine learning and also have good sample metric scores. If a frame has a low metric score and a low popularity score, it may be determined that the training set does not support the gesture well. If the metric score is low but the popularity score is high, it indicates a potential flaw in the training algorithm.
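One possible reading of the popularity score computation (the threshold bands, weights, and skeletal distance shown here are hypothetical, and the real system's weighting may differ) is sketched below:

    def popularity_score(query, training_images,
                         thresholds=(0.10, 0.20, 0.30), weights=(1.0, 0.5, 0.25)):
        # Count training images whose skeletal distance to the query falls within
        # each successive threshold band, then combine counts with decreasing weights.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        counts = [0] * len(thresholds)
        for img in training_images:
            d = dist(query, img)
            for i, t in enumerate(thresholds):
                if d <= t:
                    counts[i] += 1
                    break                    # count each image in its innermost band only
        return sum(w * c for w, c in zip(weights, counts))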
Another challenge is that this requires fast search and comparison across millions of training images. To meet this requirement, a clustering algorithm groups all images into clusters, with the images in each cluster located within a certain distance of the cluster center. When an image is searched, a comparison is made between the skeleton from the frame being investigated and each cluster center. If the distance to the center is too great, the entire cluster may be skipped. The clustering algorithm can improve processing time by an order of magnitude. For high performance in this search function, the data file format is also optimized to allow direct mapping of its records.
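The cluster-skipping test can be justified by the triangle inequality; a minimal sketch, assuming each cluster stores its center and radius along with its member images, is:

    def find_similar(query, clusters, threshold):
        # clusters: list of (center, radius, images); every image in a cluster lies
        # within `radius` of its center. A whole cluster is skipped when the query
        # is so far from the center that no member could be within `threshold`.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        matches = []
        for center, radius, images in clusters:
            if dist(query, center) > threshold + radius:
                continue                      # triangle inequality: safe to skip the cluster
            for img in images:
                d = dist(query, img)
                if d <= threshold:
                    matches.append((d, img))
        return sorted(matches, key=lambda m: m[0])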
In one embodiment, a summary report of the metrics is provided. Such a report uses the previously mentioned metrics to quickly and accurately communicate a high-level summary of build-over-build improvement versus degradation, optionally filtered to stable or unstable clips. The administrative user interface 210, along with the high-level summary, allows drill-down so that developers can quickly identify faulty clips and frames. The types of summaries that may be used include those discussed in Table 3:
It should be understood that the foregoing summaries are illustrative only.
FIG. 15 shows a process for creating a test suite in accordance with the discussion set forth above. At 1502, the developer determines the specific application requirements necessary to develop one or more application programs. One example discussed above is a tennis game application, which requires detecting certain types of user movements with higher accuracy than other movements. At 1504, the developer may search the metadata for test data having ground truth that suits the motions required by the particular application. At 1506, the desired test data may be acquired. If test data meeting the specific criteria is not available, additional composite clips or test clips may be created at 1508. At 1510, the test data can be ordered and each clip can be set to play for a particular duration.
At 1512, the test suite is assembled for use in a test application, and at 1514, the suite is sent to the pipeline for execution.
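As a hypothetical sketch of assembling such a suite from the metadata repository (attribute names such as pose_tags and has_ground_truth are assumptions, not the actual schema):

    def build_test_suite(repository, required_tags, max_clips=50):
        # repository: iterable of clip metadata records exposing pose_tags and
        # has_ground_truth attributes. Select clips whose pose tags cover the
        # motions the application needs and that carry ground truth.
        required = set(required_tags)
        suite = [clip for clip in repository
                 if required & set(clip.pose_tags) and clip.has_ground_truth]
        return suite[:max_clips]

    # e.g. for the tennis application discussed above:
    # suite = build_test_suite(all_clips, ["tennis serve", "forehand swing"])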
FIG. 16 illustrates an example embodiment of a model adjusted based on a movement or gesture of a user (such as user 18).
As described herein, a model of the user may be tracked and adjusted as the user forms a gesture, such as waving his or her left hand at a particular point in time. Movement information associated with the joints and bones of the model 82 for each pose may later be captured in the depth clip.
The frames associated with these gestures may be rendered at various timestamps in sequential temporal order in the depth clip. For frames at various time stamps where the human user annotating the model determines that the position of a joint or reference joint j1-j8 is incorrect, the user can adjust the reference point by moving it, as shown in FIG. 16. For example, in FIG. 16, joint j12 was found to be in an incorrect position relative to the correct position (shown in phantom).
FIG. 17 illustrates an example embodiment of a computing environment that may be used to interpret one or more locations and motions of a user in a target recognition, analysis, and tracking system. The computing environment such as the computing environment 12 described above with reference to fig. 3A-4 may be a multimedia console 600 such as a gaming console. As shown in FIG. 17, the multimedia console 600 includes a Central Processing Unit (CPU)601 having a level one cache 602, a level two cache 604, and a flash ROM 606. The level one cache 602 and the level two cache 604 temporarily store data and thus reduce the number of memory access cycles, thereby improving processing speed and throughput. The CPU601 may be configured with more than one core and thus add level one and level two caches 602 and 604. The flash ROM606 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 600 is powered ON.
A Graphics Processing Unit (GPU)608 and a video encoder/video codec (coder/decoder) 614 form a video processing pipeline for high speed and high resolution graphics processing. Data is transferred from the GPU608 to the video encoder/video codec 614 via a bus. The video processing pipeline outputs data to an a/V (audio/video) port 640 for transmission to a television or other display. A memory controller 610 is connected to the GPU608 to facilitate processor access to various types of memory 612, such as, but not limited to, RAM.
The multimedia console 600 includes an I/O controller 620, a system management controller 622, an audio processing unit 623, a network interface controller 624, a first USB host controller 626, a second USB host controller 628 and a front panel I/O subassembly 630 that are preferably implemented on a module 618. The USB controllers 626 and 628 serve as hosts for peripheral controllers 642(1)-642(2), a wireless adapter 648, and an external memory device 646 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface 624 and/or wireless adapter 648 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of various wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.
System memory 643 is provided to store application data that is loaded during the boot process. A media drive 644 is provided and may comprise a DVD/CD drive, hard drive, or other removable media drive, among others. The media drive 644 may be internal or external to the multimedia console 600. Application data may be accessed via the media drive 644 for execution, playback, etc. by the multimedia console 600. The media drive 644 is connected to the I/O controller 620 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).
The system management controller 622 provides various service functions related to ensuring availability of the multimedia console 600. The audio processing unit 623 and the audio codec 632 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is transmitted between the audio processing unit 623 and the audio codec 632 via a communication link. The audio processing pipeline outputs data to the A/V port 640 for reproduction by an external audio player or device having audio capabilities.
The front panel I/O subassembly 630 supports the functionality of the power button 650 and the eject button 652, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 600. The system power supply module 636 provides power to the components of the multimedia console 600. A fan 638 cools the circuitry within the multimedia console 600.
The CPU601, GPU608, memory controller 610, and various other components within the multimedia console 600 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures may include a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, and the like.
When the multimedia console 600 is powered ON, application data may be loaded from the system memory 643 into memory 612 and/or caches 602, 604 and executed on the CPU 601. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 600. In operation, applications and/or other media contained within the media drive 644 may be launched or played from the media drive 644 to provide additional functionalities to the multimedia console 600.
The multimedia console 600 may be operated as a standalone system by simply connecting the system to a television or other display. In this standalone mode, the multimedia console 600 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface 624 or the wireless adapter 648, the multimedia console 600 may further be operated as a participant in a larger network community.
When the multimedia console 600 is powered on, a set amount of hardware resources may be reserved for system use by the multimedia console operating system. These resources may include memory reserves (e.g., 17MB), CPU and GPU cycle reserves (e.g., 5%), network bandwidth reserves (e.g., 8kbs), and so on. Since these resources are reserved at system boot, the reserved resources are not present from an application perspective.
In particular, the memory reservation is preferably large enough to contain the launch kernel, concurrent system applications, and drivers. The CPU reservation is preferably constant so that if the reserved CPU usage is not used by the system applications, the idle thread will consume any unused cycles.
With regard to the GPU reservation, lightweight messages generated by system applications (e.g., popups) are displayed by using a GPU interrupt to schedule code to render popup into an overlay. The amount of memory required for the overlay depends on the overlay area size, and the overlay preferably scales with the screen resolution. Where a complete user interface is used by a concurrent system application, it is preferable to use a resolution that is independent of the application resolution. A scaler may be used to set this resolution so that there is no need to change the frequency and cause a TV resynch.
After the multimedia console 600 boots and system resources are reserved, concurrent system applications execute to provide system functionality. The system functions are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies the thread as a system application thread rather than a game application thread. The system applications are preferably scheduled to run on the CPU601 at predetermined times and intervals, thereby providing a consistent system resource view for the applications. The scheduling is to minimize cache disruption for the gaming application running on the console.
When the concurrent system application requires audio, audio processing is asynchronously scheduled to the gaming application due to time sensitivity. A multimedia console application manager (described below) controls the audio level (e.g., mute, attenuate) of the gaming application when system applications are active.
Input devices (e.g., controllers 642(1) and 642(2)) are shared by the gaming application and the system application. Rather than reserving resources, the input devices are switched between the system application and the gaming application so that each has a focus of the device. The application manager preferably controls the switching of the input stream without the gaming application's knowledge, and a driver maintains state information regarding focus switches. The cameras 26, 28 and capture device 20 may define additional input devices for the console 600.
FIG. 18 illustrates another example embodiment of a computing environment 720, which may be the computing environment 12 illustrated in FIGS. 3A-4 for interpreting one or more locations and motions in a target recognition, analysis, and tracking system. The computing system environment 720 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the presently disclosed subject matter. Neither should the computing environment 720 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 720. In some embodiments, each illustrated computing element may include circuitry configured to instantiate certain aspects of the present disclosure. For example, the term circuitry used in this disclosure may include dedicated hardware components configured to perform functions through firmware or switches. In other example embodiments, the term "circuitry" may include a general purpose processing unit, memory, etc., configured by software instructions embodying logic operable to perform functions. In example embodiments where circuitry includes a combination of hardware and software, an implementer may write source code embodying logic and the source code can be compiled into machine readable code that can be processed by the general purpose processing unit. The selection of hardware or software to implement a particular function is a design choice left to the implementer as it will be appreciated by those skilled in the art that the prior art has evolved to a point where there is little difference between hardware, software, or a combination of hardware/software. More specifically, those skilled in the art will appreciate that software processes may be transformed into equivalent hardware structures, and that hardware structures may themselves be transformed into equivalent software processes. Thus, the choice of hardware implementation versus software implementation is a design choice and is left to the implementer.
In FIG. 18, computing environment 720 includes a computer 741, computer 741 typically including a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computer 741 and includes both volatile and nonvolatile media, removable and non-removable media. The system memory 722 includes computer storage media in the form of volatile and/or nonvolatile memory such as ROM723 and RAM 760. A basic input/output system 724(BIOS), containing the basic routines that help to transfer information between elements within computer 741, such as during start-up, is typically stored in ROM 723. RAM760 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 759. By way of example, and not limitation, FIG. 18 illustrates operating system 725, application programs 726, other program modules 727, and program data 728. FIG. 18 also includes a Graphics Processor Unit (GPU)729 having associated video memory 730 for high speed and high resolution graphics processing and storage. The GPU729 can be connected to the system bus 721 via a graphics interface 731.
The computer 741 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 18 illustrates a hard disk drive 738 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 739 that reads from or writes to a removable, nonvolatile magnetic disk 754, and an optical disk drive 740 that reads from or writes to a removable, nonvolatile optical disk 753 such as a CDROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 738 is typically connected to the system bus 721 through a non-removable memory interface such as interface 734, and magnetic disk drive 739 and optical disk drive 740 are typically connected to the system bus 721 by a removable memory interface, such as interface 735.
The drives and their associated computer storage media discussed above and illustrated in FIG. 18, provide storage of computer readable instructions, data structures, program modules and other data for the computer 741. In FIG. 18, for example, hard disk drive 738 is illustrated as storing operating system 758, application programs 757, other program modules 756, and program data 755. Note that these components can either be the same as or different from operating system 725, application programs 726, other program modules 727, and program data 728. Operating system 758, application programs 757, other program modules 756, and program data 755 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 741 through input devices such as a keyboard 751 and pointing device 752, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 759 through a user input interface 736 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a Universal Serial Bus (USB). The cameras 26, 28 and capture device 20 may define additional input devices for the console 700. A monitor 742 or other type of display device is also connected to the system bus 721 via an interface, such as a video interface 732. In addition to the monitor, computers may also include other peripheral output devices such as speakers 744 and printer 743, which may be connected through an output peripheral interface 733.
The computer 741 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 746. The remote computer 746 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 741, although only a memory storage device 747 has been illustrated in FIG. 18. The logical connections depicted in FIG. 18 include a Local Area Network (LAN)745 and a Wide Area Network (WAN)749, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 741 is connected to the LAN745 through a network interface or adapter 737. When used in a WAN networking environment, the computer 741 typically includes a modem 750 or other means for establishing communications over the WAN749, such as the Internet. The modem 750, which may be internal or external, may be connected to the system bus 721 via the user input interface 736, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 741, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, fig. 18 illustrates remote application programs 748 as residing on memory device 747. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (5)

1. A method for verifying the accuracy of a target recognition, analysis, and tracking system, the method comprising:
providing a searchable collection (102) of recorded and synthesized depths (104) and associated ground truth data (105), the ground truth data comprising a correlation of target motion at a series of points, the ground truth data providing samples for which deviations provide errors when provided to a tracking pipeline to determine accuracy of the tracking pipeline;
returning at least a subset of the searchable set in response to a request (170, 172) to test the trace pipeline;
receiving trace data output (176, 188) from a trace pipeline on at least a subset of the searchable set; and
an analysis (178, 190) of the tracking data relative to ground truth in the at least a subset is generated to provide an output of error relative to the ground truth.
2. The method of claim 1, wherein returning at least a subset of the searchable set comprises: the method includes searching the searchable collection, creating a test suite, and outputting the searchable collection to a plurality of processing devices, each of the plurality of processing devices including an instance of the trace pipeline.
3. The method of claim 2, wherein the step of generating an analysis comprises: detecting an error between the tracking data output and the ground truth data, and analyzing the error according to one or more calculated metrics that provide an indication of tracking accuracy of the tracking pipeline.
4. The method of claim 3, wherein the test metrics measure the total progress and degradation in tracking accuracy for one or more of: an entire scene, a single frame, a single subject in a frame, or a specific body part of a user.
5. The method of claim 3, wherein the test metrics measure the total progress and degradation in tracking accuracy for one or more of: background removal, retargeting of simulated player representations, body part classification, and detection of scene components.
HK12112661.9A 2010-12-17 2012-12-07 Validation analysis of human target tracking HK1171854B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/972,341 US8448056B2 (en) 2010-12-17 2010-12-17 Validation analysis of human target
US12/972,341 2010-12-17

Publications (2)

Publication Number Publication Date
HK1171854A1 HK1171854A1 (en) 2013-04-05
HK1171854B true HK1171854B (en) 2016-07-15

Family

ID=

Similar Documents

Publication Publication Date Title
US8775916B2 (en) Validation analysis of human target
US9075434B2 (en) Translating user motion into multiple object responses
US8660303B2 (en) Detection of body and props
US8866898B2 (en) Living room movie creation
US8638985B2 (en) Human body pose estimation
CN102184020B (en) Gestures and gesture modifiers for manipulating a user-interface
US8953844B2 (en) System for fast, probabilistic skeletal tracking
US8213680B2 (en) Proxy training data for human body tracking
KR101625259B1 (en) Systems and methods for applying model tracking to motion capture
US9262673B2 (en) Human body pose estimation
US8926431B2 (en) Visual based identity tracking
US8751215B2 (en) Machine based sign language interpreter
JP5902204B2 (en) Motion recognition
US8379919B2 (en) Multiple centroid condensation of probability distribution clouds
US20110317871A1 (en) Skeletal joint recognition and tracking system
US20110199291A1 (en) Gesture detection based on joint skipping
CN102317977A (en) Method and system for gesture recognition
CN102591456B (en) To the detection of health and stage property
HK1171854B (en) Validation analysis of human target tracking
CN102375541B (en) User movement is converted into the response of multiple object
HK1171531B (en) Detection of body and props
HK1172984B (en) Living room movie creation