
HK1173691B - Visual target tracking - Google Patents

Info

Publication number: HK1173691B (also published as HK1173691A1)
Application number: HK13100912.0A
Authority: HK (Hong Kong)
Prior art keywords: pixel, synthesized, model, pixels, force
Other languages: Chinese (zh)
Inventor: R.M.盖斯
Original assignee: Microsoft Technology Licensing, LLC
Priority claimed from: U.S. patent application 12/632,599 (US8588465B2)
Description

Visual target tracking
Background
Many computer games and other computer vision applications utilize sophisticated controls to allow a user to manipulate game characters or other aspects of the application. These controls can be difficult to learn, creating a barrier to entry into the market for many gaming or other applications. Further, these controls may be quite different from the actual game action or other application action for which they are used. For example, a game control that causes a game character to wave a baseball bat may not resemble the actual motion of waving a baseball bat at all.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Various embodiments related to visual target tracking are discussed herein. The disclosed embodiments include representing the target with a model that includes a plurality of portions, each portion associated with a part index corresponding to a part of the target. A portion of the model may be classified into a pixel case based on one or both of: the part index associated with that portion of the model; and a difference between the modeled position of that portion of the model and the observed position of the corresponding part of the target. The model is then adjusted according to the pixel cases of the portions of the model.
Brief Description of Drawings
FIG. 1A illustrates an embodiment of an exemplary target recognition, analysis, and tracking system that tracks game players playing a boxing game.
FIG. 1B shows the game player of FIG. 1A throwing a punch that is tracked and interpreted as a game control that causes the player avatar to throw a punch in game space.
FIG. 2 schematically illustrates a computing system according to an embodiment of the present disclosure.
FIG. 3 illustrates an exemplary body model for representing a human target.
FIG. 4 illustrates a substantially frontal view of an exemplary skeletal model used to represent a human target.
FIG. 5 illustrates an oblique view of an exemplary skeletal model used to represent a human target.
FIG. 6 illustrates an exemplary mesh model for representing a human target.
FIG. 7 shows a flow diagram of an example method of visually tracking a target.
FIG. 8 shows an exemplary observed depth image.
FIG. 9 illustrates an exemplary synthesized depth image.
FIG. 10 schematically shows some of the pixels constituting the synthesized depth image.
FIG. 11A schematically illustrates the application of force to a force-receiving location of a model.
FIG. 11B schematically shows the result of applying a force to the force-bearing location of the model of FIG. 11A.
FIG. 12A shows a player avatar rendered from the model of FIG. 11A.
FIG. 12B shows a player avatar rendered from the model of FIG. 11B.
FIG. 13 schematically illustrates comparing a synthesized depth image with a corresponding observed depth image.
FIG. 14 schematically illustrates a region of synthesized pixels identified as mismatched in the comparison of FIG. 13.
FIG. 15 schematically illustrates another comparison of a synthesized depth image and a corresponding observed depth image, where regions of unmatched pixels correspond to various pixel cases.
FIG. 16 schematically illustrates an example embodiment of a pull pixel case.
FIG. 17 schematically shows an example embodiment of a push pixel case.
FIG. 18 shows a table detailing example relationships between various pixel cases and skeletal model joints.
FIG. 19 illustrates the application of constraints to a model representing an object.
FIG. 20 illustrates another application of constraints to a model representing an object.
FIG. 21 illustrates yet another application of constraints to a model representing an object.
Detailed Description
The present disclosure relates to target recognition, analysis, and tracking. In particular, the use of a depth camera or other source to acquire depth information for one or more targets is disclosed. As described in detail below, this depth information may then be used to efficiently and accurately model and track one or more targets. The target recognition, analysis, and tracking described herein provides a robust platform in which one or more targets may be consistently tracked at a relatively fast frame rate even if the targets move into a pose that has been considered difficult to analyze using other methods (e.g., when two or more targets partially overlap and/or occlude each other; when a portion of a target self-occludes another portion of the same target; when a target changes its local anatomical appearance (e.g., someone touches his or her head), etc.).
FIG. 1A illustrates a non-limiting example of a target recognition, analysis, and tracking system 10. In particular, FIG. 1A illustrates a computer gaming system 12, which computer gaming system 12 may be used to play a variety of different games, play one or more different media types, and/or control or manipulate non-gaming applications. FIG. 1A also shows display 14 in the form of a high definition television, or HDTV 16, which may be used to present game visuals to game players, such as game player 18. Further, FIG. 1A shows a capture device in the form of a depth camera 20 that may be used to visually monitor one or more game players, such as game player 18. The example shown in FIG. 1A is non-limiting. As described below with reference to fig. 2, a variety of different types of target recognition, analysis, and tracking systems may be used without departing from the scope of the present disclosure.
A target recognition, analysis, and tracking system may be used to recognize, analyze, and/or track one or more targets, such as game player 18. FIG. 1A shows a scenario in which game player 18 is tracked using depth camera 20 such that movements of game player 18 may be interpreted by gaming system 12 as controls that may be used to affect a game executed by gaming system 12. In other words, game player 18 may use his movements to control the game. The movements of game player 18 may be interpreted as virtually any type of game control.
The example scenario illustrated in FIG. 1A shows game player 18 playing a boxing game being executed by gaming system 12. The gaming system uses HDTV 16 to visually present a boxing opponent 22 to game player 18. In addition, the gaming system uses HDTV 16 to visually present a player avatar 24 that game player 18 controls with his movements. As shown in FIG. 1B, game player 18 may throw a punch in physical space as an instruction to player avatar 24 to throw a punch in game space. Gaming system 12 and depth camera 20 may be used to recognize and analyze a punch of game player 18 in physical space such that the punch may be interpreted as a game control that causes game avatar 24 to throw a punch in game space. For example, FIG. 1B shows HDTV 16 visually presenting that game avatar 24 throws a punch that strikes boxing opponent 22 in response to game player 18 throwing a punch in physical space.
Other movements by game player 18 may also be interpreted as other controls, such as controls to bob, weave, shuffle, block, jab, or throw a variety of different power punches. Further, some movements may be interpreted as controls for purposes other than controlling game avatar 24. For example, a player may use movements to end, pause or save a game, select a level, view high scores, communicate with friends, and so forth.
In some embodiments, the target may include a human being and an object. In these embodiments, for example, a player of an electronic game may be holding an object such that the motion of the player and the object is used to adjust and/or control parameters of the electronic game. For example, the motion of a player holding a racquet may be tracked and utilized to control an on-screen racquet in an electronic sports game. In another example, the motion of a player holding an object may be tracked and utilized to control an on-screen weapon in an electronic combat game.
The target recognition, analysis, and tracking system may be used to interpret target movements as operating system and/or application controls outside the realm of gaming. Virtually any controllable aspect of an operating system and/or application, such as the boxing game shown in FIGS. 1A and 1B, may be controlled by movements of a target such as game player 18. The illustrated boxing scenario is provided as an example, but is in no way meant to be limiting. Rather, the illustrated scenario is intended to demonstrate a general concept that may be applied to a wide variety of different applications without departing from the scope of the present disclosure.
The methods and processes described herein may be bound to a variety of different types of computing systems. FIGS. 1A and 1B show non-limiting examples in the form of a gaming system 12, an HDTV 16, and a depth camera 20. As another more general example, FIG. 2 schematically illustrates a computing system 40 that may perform one or more of the target recognition, tracking, and analysis methods and processes described herein. Computing system 40 may take a variety of different forms, including but not limited to: game consoles, personal computing gaming systems, military tracking and/or targeting systems, and feature acquisition systems that provide green screen or motion capture functionality.
Computing system 40 may include a logic subsystem 42, a data-holding subsystem 44, a display subsystem 46, and/or a capture device 48. The computing system may optionally include components not shown in fig. 2, and/or some of the components shown in fig. 2 may be peripheral components that are not integrated in the computing system.
Logic subsystem 42 may include one or more physical devices configured to execute one or more instructions. For example, the logic subsystem may be configured to execute one or more instructions that are part of one or more programs, routines, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more devices, or otherwise arrive at a desired result. The logic subsystem may include one or more processors configured to execute software instructions. Additionally or alternatively, the logic subsystem may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. The logic subsystem may optionally include individual components that are distributed across two or more devices, which may be remotely located in some embodiments.
Data-holding subsystem 44 may include one or more physical devices configured to hold data and/or instructions executable by the logic subsystem to implement the herein described methods and processes. In implementing such methods and processes, the state of data-holding subsystem 44 may be transformed (e.g., to hold different data). Data-holding subsystem 44 may include removable media and/or built-in devices. Data-holding subsystem 44 may include optical memory devices, semiconductor memory devices (e.g., RAM, EEPROM, flash memory, etc.), and/or magnetic memory devices, among others. Data-holding subsystem 44 may include devices with one or more of the following characteristics: volatile, nonvolatile, dynamic, static, read/write, read-only, random access, sequential access, location addressable, file addressable, and content addressable. In some embodiments, logic subsystem 42 and data-holding subsystem 44 may be integrated into one or more common devices, such as an application specific integrated circuit or a system on a chip.
FIG. 2 also illustrates an aspect of the data-holding subsystem in the form of computer-readable removable media 50, which may be used to store and/or transfer data and/or instructions executable to implement the herein described methods and processes.
Display subsystem 46 may be used to present a visual representation of data held by data-holding subsystem 44. As the herein described methods and processes change the data held by the data-holding subsystem, and thus transform the state of the data-holding subsystem, the state of display subsystem 46 may likewise be transformed to visually represent changes in the underlying data. As one non-limiting example, the target recognition, tracking, and analysis described herein may be reflected by display subsystem 46 in the form of a game character that changes pose in game space in response to movement of a game player in physical space. Display subsystem 46 may include one or more display devices using virtually any type of technology. These display devices may be combined in a shared enclosure with logic subsystem 42 and/or data-holding subsystem 44, or these display devices may be peripheral display devices, as shown in FIGS. 1A and 1B.
Computing system 40 also includes a capture device 48 configured to obtain depth images of one or more targets. Capture device 48 may be configured to capture video with depth information via any suitable technique (e.g., time-of-flight, structured light, stereo image, etc.). As such, capture device 48 may include a depth camera, a video camera, a stereo camera, and/or other suitable capture devices.
For example, in time-of-flight analysis, the capture device 48 may emit infrared light toward the target and may then use a sensor to detect the backscattered light from the surface of the target. In some cases, pulsed infrared light may be used, where the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device to a particular location on the target. In some cases, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift, and the phase shift may be used to determine a physical distance from the capture device to a particular location on the target.
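Although the patent text does not give the underlying arithmetic, the two time-of-flight variants described above reduce to simple conversions. The following Python sketch is illustrative only; the 30 MHz modulation frequency and the function names are assumptions, not values taken from the disclosure.

    import math

    SPEED_OF_LIGHT = 299_792_458.0  # meters per second

    def distance_from_pulse(round_trip_seconds):
        # Pulsed light travels from the capture device to the target and back,
        # so the one-way distance is half of the round-trip travel distance.
        return SPEED_OF_LIGHT * round_trip_seconds / 2.0

    def distance_from_phase(phase_shift_radians, modulation_hz=30e6):
        # A phase shift of 2*pi corresponds to one full modulation wavelength of
        # round-trip travel; the result is unambiguous only within half a wavelength.
        wavelength = SPEED_OF_LIGHT / modulation_hz
        return (phase_shift_radians / (2.0 * math.pi)) * wavelength / 2.0

    print(distance_from_pulse(20e-9))    # a 20 ns round trip is roughly 3 m
    print(distance_from_phase(math.pi))  # half a cycle at 30 MHz is roughly 2.5 m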
In another example, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device to a particular location on the target by analyzing the intensity of the reflected beam of light over time via techniques such as shuttered light pulse imaging.
In another example, capture device 48 may utilize structured light analysis to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto a target. Upon landing on the surface of the target, the pattern may deform in response, and this deformation of the pattern may be studied to determine the physical distance from the capture device to a particular location on the target.
In another example, a capture device may include two or more physically separated cameras viewing a target from different angles to obtain visual stereo data. In these cases, the visual stereo data may be parsed to generate a depth image.
In other embodiments, capture device 48 may utilize other techniques to measure and/or calculate depth values. Additionally, capture device 48 may organize the calculated depth information into "Z layers," i.e., layers perpendicular to a Z axis extending from the depth camera along its line of sight to the viewer.
In some embodiments, two or more cameras may be integrated into one integrated capture device. For example, a depth camera and a video camera (e.g., an RGB video camera) may be integrated into a common capture device. In some embodiments, two or more separate capture devices may be used in conjunction. For example, a depth camera and a separate video camera may be used. When a camera is used, the camera may be used to provide target tracking data, confirmation data to correct for target tracking, image capture, facial recognition, high precision tracking of a finger (or other small feature), light sensing, and/or other functions.
It is to be understood that at least some target analysis and tracking operations may be performed by the logic machine of one or more capture devices. The capture device may include one or more onboard processing units configured to perform one or more target analysis and/or tracking functions. The capture device may include firmware to assist in updating such onboard processing logic.
Computing system 40 may optionally include one or more input devices, such as a controller 52 and a controller 54. The input device may be used to control the operation of the computing system. In the context of a game, input devices such as controller 52 and/or controller 54 may be used to control aspects of the game that are not controlled by the target recognition, tracking, and analysis methods and processes described herein. In certain embodiments, input devices such as controller 52 and/or controller 54 may include one or more of accelerometers, gyroscopes, infrared target/sensor systems, etc., which may be used to measure movement of the controllers in physical space. In some embodiments, the computing system may optionally include and/or utilize input gloves, keyboards, mice, trackpads, trackballs, touch screens, buttons, switches, dials, and/or other input devices. As will be appreciated, target recognition, tracking, and analysis may be used to control or augment aspects of a game or other application that are conventionally controlled by an input device, such as a game controller. In some embodiments, the target tracking described herein may be used as a complete replacement for other forms of user input, while in other embodiments, such target tracking may be used to supplement one or more other forms of user input.
The computing system 40 may be configured to perform the target tracking methods described herein. However, it should be understood that computing system 40 is provided as a non-limiting example of a device that may perform such target tracking. Other devices are also within the scope of the present disclosure.
Computing system 40 or another suitable device may be configured to represent each target with a model. As described in more detail below, information derived from such a model may be compared to information obtained from a capture device, such as a depth camera, so that the base scale or shape of the model, as well as its current pose, may be adjusted to more accurately represent the modeled target. The model may be represented by one or more polygonal meshes, by a set of mathematical primitives, and/or by other suitable machine representations of the object being modeled.
FIG. 3 shows a non-limiting visual representation of an example body model 70. Body model 70 is a machine representation of a modeled target (e.g., game player 18 of fig. 1A and 1B). The body model may include one or more data structures that include a set of variables that together define the modeled target in the language of the game or other application/operating system.
The model of the target may be configured differently without departing from the scope of this disclosure. In some examples, a model (e.g., a machine-readable model) may include one or more data structures that represent a target as a three-dimensional model that includes rigid and/or deformable shapes, i.e., body parts. Each body part may be characterized as a mathematical primitive, examples of which include, but are not limited to, spheres, anisotropically scaled spheres, cylinders, anisotropic cylinders, smooth cylinders, boxes, beveled boxes, prisms, and the like.
Further, the target may be represented by a model that includes a plurality of portions, each portion associated with a part index corresponding to a part of the target. Thus, for the case where the target is a human target, the part index may be a body part index corresponding to a part of the human target. For example, body model 70 in FIG. 3 includes body parts bp1 through bp14, each of which represents a different portion of the object being modeled. Each body part is a three-dimensional shape. For example, bp3 is a rectangular prism representing the left hand of the modeled target, while bp5 is an octagonal prism representing the upper left arm of the modeled target. The body model 70 is exemplary in that the body model may contain any number of body parts, each of which may be any machine-understandable representation of the corresponding part of the object being modeled.
A model comprising two or more body parts may also comprise one or more joints. Each joint may allow one or more body parts to move relative to one or more other body parts. For example, a model representing a human target may include a plurality of rigid and/or deformable body parts, some of which may represent respective anatomical body parts of the human target. In addition, each body part of the model may include one or more structural members (i.e., "bones"), with joints located at the intersection of adjacent bones. It should be understood that some bones may correspond to anatomical bones in the human target, and/or some bones may not have corresponding anatomical bones in the human target.
As an example, a human target may be modeled as a skeleton including a plurality of skeleton points, each having a three-dimensional position in world space. Each skeletal point may correspond to an actual joint of the human target, an extremity of the human target, and/or a point that is not directly anatomically linked to the human target. Each skeletal point has at least three degrees of freedom (e.g., world space x, y, z). As such, the skeleton may be defined entirely by 3 x λ values, where λ is equal to the total number of skeleton points included in the skeleton. For example, a skeleton with 33 skeleton points may be defined by 99 values. As described in more detail below, some skeletal points may account for axial roll angles.
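As a concrete, non-authoritative illustration of the 3 x λ bookkeeping described above, a skeleton of λ points can be held as a flat array of world-space coordinates; the array layout and names below are assumptions, not structures defined by the patent.

    import numpy as np

    NUM_SKELETON_POINTS = 33                       # lambda in the example above

    # One row per skeleton point; columns hold the world-space x, y, z coordinates.
    skeleton = np.zeros((NUM_SKELETON_POINTS, 3), dtype=np.float32)

    # The full pose is therefore described by 3 * 33 = 99 values, as noted above.
    assert skeleton.size == 3 * NUM_SKELETON_POINTS == 99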
Bones and joints may together constitute a skeletal model, which may be a constituent element of the model. The skeletal model may include one or more skeletal members for each body part and joints between adjacent skeletal members. An exemplary skeletal model 80 and an exemplary skeletal model 82 are shown in FIGS. 4 and 5, respectively. FIG. 4 shows a skeletal model 80 with joints j1 through j33 viewed from the front. FIG. 5 shows a skeletal model 82 from an oblique view, also with joints j1 through j33. The skeletal model 82 also includes roll joints j34 through j47, where each roll joint may be used to track an axial roll angle. For example, an axial roll angle may be used to define the rotational orientation of a limb relative to its source limb and/or torso. For example, if the skeletal model shows an axial rotation of an arm, the roll joint j40 may be used to indicate the direction in which the associated wrist is pointing (e.g., palm facing up). Thus, while the joints may be stressed and the skeletal model adjusted, as described below, the roll joints may instead be constructed and utilized to track the axial roll angle. More generally, by examining the orientation of a limb relative to its source limb and/or torso, an axial roll angle may be determined. For example, if examining a calf, the orientation of the calf relative to the associated thigh and hip can be examined to determine the axial roll angle.
As described above, certain models may include a skeleton and/or body parts that serve as a machine representation of the object being modeled. In some embodiments, the model may alternatively or additionally include a wireframe mesh, which may include a hierarchy of rigid polygonal meshes, one or more deformable meshes, or any combination of the two. As one non-limiting example, FIG. 6 illustrates a model 90, which model 90 includes a plurality of triangles (e.g., triangles 92) arranged into a mesh that defines the shape of the body model. Such a mesh may include bending limits at each polygon edge. When a mesh is used, the number of triangles and/or other polygons that together make up the mesh may be selected to achieve a desired balance between quality and computational overhead. More triangles may provide higher quality and/or more accurate models, while fewer triangles may be less computationally demanding. A body model comprising a polygonal mesh need not comprise a skeleton, but may in some embodiments comprise a skeleton.
The above-described body part models, skeletal models, and polygonal meshes are non-limiting example types of models that may be used as machine representations of the modeled targets. Other models are also within the scope of the present disclosure. For example, some models may include patches (patch), non-uniform rational B-splines, subdivision surfaces, or other higher order surfaces. The model may also include surface textures and/or other information to more accurately represent clothing, hair, and/or other aspects of the modeled target. The model may optionally include information related to a current pose, one or more past poses, and/or model physics. It is to be understood that any model that can be posed and then rasterized into (or otherwise rendered or represented by) a synthesized depth image is compatible with the target recognition, analysis, and tracking described herein.
As described above, the model serves as a representation of an object, such as game player 18 in FIGS. 1A and 1B. As the target moves in physical space, information from a capture device, such as the depth camera 20 in FIGS. 1A and 1B, may be used to adjust the pose and/or basic size/shape of the model so that it more accurately represents the target. In particular, one or more forces may be applied to one or more force-receiving aspects of the model to adjust the model to a pose that more closely corresponds to the pose of the target in physical space. Depending on the type of model used, forces may be applied to the joints of the model, the centroids of the body parts, the vertices of the triangles, or any other suitable force-receiving aspect. Further, in certain embodiments, two or more different calculations may be used in determining the direction and/or magnitude of the force. As described in more detail below, the difference between the observed image of the target retrieved by the capture device and the rasterized (i.e., synthesized) image of the model may be used to determine the forces applied to the model to adjust the body to different poses.
FIG. 7 shows a flowchart of an example method 100 for tracking a target using a model (e.g., the body model 70 of FIG. 3). In some embodiments, the target may be a human, and the human may be one of the two or more targets tracked. As such, in some embodiments, method 100 may be performed by a computing system (e.g., gaming system 12 shown in FIG. 1 and/or computing system 40 shown in FIG. 2) to track one or more players interacting with an electronic game being played on the computing system. As introduced above, tracking players allows the physical movements of these players to be used as a real-time user interface to adjust and/or control parameters of an electronic game. For example, the tracked motions of the player may be used to move an on-screen character or avatar in an electronic role-playing game. In another example, the tracked motion of the player may be used to control an on-screen vehicle in an electronic racing game. In yet another example, the tracked motion of the player may be used to control a building or organization of objects in the virtual environment.
At 102, method 100 includes receiving an observed depth image of a target from a source. In some embodiments, the source may be a depth camera configured to obtain depth information about the target through a suitable technique such as time-of-flight analysis, structured light analysis, stereo vision analysis, or other suitable technique. The observed depth image may include a plurality of observed pixels, where each observed pixel has an observed depth value. The observed depth values include depth information of the target as seen from the source. Knowing the horizontal and vertical field of view of the depth camera, as well as the depth value of a pixel and the pixel address of the pixel, the world space location of the surface imaged by the pixel can be determined. For convenience, the world space location of the surface imaged by the pixel may be referred to as the world space location of the pixel.
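One conventional way to recover the world-space location of the surface imaged by a pixel from its pixel address, depth value, and the camera's fields of view is a pinhole back-projection. The sketch below is illustrative only; the resolution and field-of-view numbers are placeholder assumptions, not values from the patent.

    import math

    def pixel_to_world(px, py, depth, width=320, height=240,
                       fov_h_deg=58.5, fov_v_deg=45.6):
        # Focal lengths (in pixels) derived from the horizontal and vertical fields of view.
        fx = (width / 2.0) / math.tan(math.radians(fov_h_deg) / 2.0)
        fy = (height / 2.0) / math.tan(math.radians(fov_v_deg) / 2.0)
        # Back-project the pixel address through the pinhole model at the observed depth.
        x = (px - width / 2.0) * depth / fx
        y = (py - height / 2.0) * depth / fy
        return (x, y, depth)

    # World-space location of the surface imaged at pixel address [5, 7] at a depth of 3.82 m.
    print(pixel_to_world(5, 7, 3.82))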
Fig. 8 shows a visual representation of an exemplary observed depth image 140. As shown, observed depth image 140 captures an exemplary observed pose of a standing person (e.g., game player 18) with its arms raised.
As shown at 104 of fig. 7, after receiving the observed depth image, the method 100 may optionally include downsampling the observed depth image to a lower processing resolution. Downsampling to a lower processing resolution may allow observed depth images to be more easily used and/or processed more quickly with less computational overhead.
As shown at 106, after receiving the observed depth image, method 100 may optionally include removing non-player background elements from the observed depth image. Removing these background elements may include separating various regions of the observed depth image into background regions and regions occupied by the image of the target. The background region may be removed from the image or identified so that it may be ignored during one or more subsequent processing steps. Virtually any background removal technique may be used, and information from the tracking (and from the previous frame) may optionally be used to assist and improve the quality of the background removal.
As shown at 108, after receiving the observed depth image, method 100 may optionally include removing and/or smoothing one or more highly variable and/or noisy depth values from the observed depth image. Such highly variable and/or noisy depth values in an observed depth image may originate from a number of different sources, such as random and/or systematic errors occurring during the image capture process, imperfections and/or distortions due to the capture device, and so forth. Since such highly variable and/or noisy depth values may be artifacts of the image capture process, including these values in any future analysis of the image may skew the results and/or slow the computation. Thus, removing such values may provide better data integrity for future calculations.
Other depth values may also be filtered out. For example, the accuracy of the grow operation described below with reference to step 118 may be enhanced by selectively removing pixels that meet one or more removal criteria. For example, if a depth value is midway between the hand and the torso occluded by the hand, removing the pixel may prevent the grow operation from spilling one body part over another body part during subsequent processing steps.
As shown at 110, the method 100 may optionally include filling in and/or reconstructing portions of lost and/or removed depth information. Such backfilling may be accomplished by averaging nearest neighbors, filtering, and/or any other suitable method.
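A minimal sketch of one possible backfill pass follows, assuming the depth image is a NumPy array in which missing or removed values are marked with zero; this is not the patent's method, only the nearest-neighbor averaging idea mentioned above.

    import numpy as np

    def backfill_depth(depth, missing=0.0):
        # Fill each missing value with the mean of its valid 4-neighbors (one pass).
        out = depth.copy()
        h, w = depth.shape
        for y in range(h):
            for x in range(w):
                if depth[y, x] != missing:
                    continue
                neighbors = [depth[ny, nx]
                             for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                             if 0 <= ny < h and 0 <= nx < w and depth[ny, nx] != missing]
                if neighbors:
                    out[y, x] = sum(neighbors) / len(neighbors)
        return out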
As shown at 112 of fig. 7, the method 100 may include obtaining a model (e.g., the body model 70 of fig. 3). As described above, the model may include a skeleton that includes a plurality of skeleton points, one or more polygon meshes, one or more mathematical primitives, one or more higher order surfaces, and/or other features for providing a machine representation of the target. Further, the model may exist as an instance of one or more data structures that exist on the computing system.
In some embodiments of the method 100, the model may be a posed model obtained from a previous time step (i.e., frame). For example, if the method 100 is performed continuously, a posed model corresponding to a previous time step from a previous iteration of the method 100 may be obtained. In this way, the model may be adjusted from one frame to the next based on the observed depth image of the current frame and the model from the previous frame. In some cases, the model of the previous frame may be projected by momentum calculations to produce an estimated model for comparison with the currently observed depth image. This can be done without looking up the model from a database or otherwise starting every frame from scratch. Instead, incremental changes to the model may be made in successive frames.
In some embodiments, the pose may be determined by one or more algorithms that may analyze the depth image and identify at a coarse level where the object of interest (e.g., a person) is located and/or the pose of the object(s). The pose may be selected using an algorithm during an initial iteration or whenever it is believed that the algorithm may select a more accurate pose than the pose calculated during a previous time step.
In some embodiments, the model may be obtained from a database and/or other program. For example, a model may not be available during the first iteration of method 100, in which case the model may be obtained from a database that includes one or more models. In this case, a search algorithm designed to select a model exhibiting a pose similar to that of the target may be used to select a model from the database. Models from the database may be used even if models from a previous time step are available. For example, if the target has changed posture by more than a predetermined threshold, and/or according to other criteria, the model from the database may be used after a certain number of frames.
In other embodiments, the model or portions thereof may be synthesized. For example, if the body core (torso, upper abdomen, and hips) of the target is represented by a deformable polygonal model, the model may be initially constructed using the content of the observed depth image, where the outline (i.e., contour) of the target in the image may be used to shape the mesh in the X and Y dimensions. In addition, in this approach, the observed depth values in this region of the observed depth image may be used to "shape" the mesh in the XY as well as Z directions of the model to more satisfactorily represent the body shape of the target.
Another method for obtaining a model is described in U.S. patent application 12/603,437, filed on October 21, 2009, the contents of which are incorporated herein by reference in their entirety.
The method 100 may also include representing any clothing present on the target using a suitable method. Such a suitable method may include adding an auxiliary geometry in the form of a primitive or polygonal mesh to the model, and optionally adjusting the auxiliary geometry based on pose to reflect gravity, clothing simulation, and the like. This approach may facilitate modeling the model into a more realistic representation of the target.
As shown at 114, the method 100 may optionally include applying a momentum algorithm to the model. This algorithm can help to obtain the pose of the model since the momentum of various parts of the target can predict changes in the image sequence. The momentum algorithm may use the trajectory of each joint or vertex of the model over a fixed number of previous frames to help obtain the model.
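The patent leaves the exact momentum calculation open; the sketch below shows one simple constant-velocity extrapolation over the joint trajectories of previous frames, with illustrative names only.

    import numpy as np

    def predict_joints(history):
        # `history` is a list of (num_joints, 3) arrays of joint positions, oldest first.
        if len(history) < 2:
            return history[-1]
        velocity = history[-1] - history[-2]   # per-joint displacement over the last frame
        return history[-1] + velocity          # constant-velocity prediction for this frame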
In some embodiments, knowledge that different portions of the target may move a limited distance within a time frame (e.g., 1/30 or 1/60 of a second) may be used as a constraint in obtaining the model. This constraint can be used to exclude certain poses when the previous frame is known.
At 116 of FIG. 7, the method 100 may further include rasterizing the model into a synthesized depth image. Rasterization allows a model described by a mathematical primitive, polygon mesh, or other object to be converted into a synthesized depth image described by a plurality of pixels.
Rasterization may be performed using one or more different techniques and/or algorithms. For example, rasterizing the model may include projecting a representation of the model onto a two-dimensional plane. In the case of a model that includes multiple body-part shapes (e.g., body model 70 of fig. 3), rasterization may include projecting and rasterizing a set of body-part shapes onto a two-dimensional plane. For each pixel in the two-dimensional plane onto which the model is projected, a variety of different types of information may be stored.
FIG. 9 shows a visual representation 150 of an exemplary synthesized depth image corresponding to the body model 70 of FIG. 3. FIG. 10 shows a pixel matrix 160 of a portion of the same synthesized depth image. As indicated at 170, each synthesized pixel in the synthesized depth image may include a synthesized depth value. The synthesized depth value for a given synthesized pixel may be the depth value from the corresponding part of the model represented by the synthesized pixel, as determined during rasterization. In other words, if a portion of a forearm body-part (e.g., forearm body-part bp4 of FIG. 3) is projected onto a two-dimensional plane, the corresponding synthesized pixel (e.g., synthesized pixel 162 of FIG. 10) may be given a synthesized depth value (e.g., synthesized depth value 164 of FIG. 10) that is equal to the depth value of the portion of the forearm body-part. In the illustrated example, the synthesized pixel 162 has a synthesized depth value of 382 cm. Likewise, if a neighboring hand body-part (e.g., hand body-part bp3 of FIG. 3) is projected onto the two-dimensional plane, the corresponding synthesized pixel (e.g., synthesized pixel 166 of FIG. 10) may be given a synthesized depth value (e.g., synthesized depth value 168 of FIG. 10) that is equal to the depth value of the portion of the hand body-part. In the example shown, the synthesized pixel 166 has a synthesized depth value of 383 cm. The corresponding observed depth value is the depth value observed by the depth camera at the same pixel address. It will be appreciated that the above is provided as an example. The synthesized depth values may be saved in any unit of measure, or as a dimensionless number.
As indicated at 170, each synthesized pixel in the synthesized depth image may include an original body-part index determined during rasterization. This original body-part index may indicate which body part of the model the pixel corresponds to. In the example shown in FIG. 10, synthesized pixel 162 has an original body-part index bp4, while synthesized pixel 166 has an original body-part index bp3. In some embodiments, if the synthesized pixel does not correspond to a body part of the target (e.g., if the synthesized pixel is a background pixel), the original body-part index of the synthesized pixel may be 0 (nil). In some embodiments, synthesized pixels that do not correspond to a body part may be given a different type of index. The body-part index may be a discrete value or a probability distribution indicating the likelihood that a pixel belongs to two or more different body parts.
As indicated at 170, each synthesized pixel in the synthesized depth image may include an original player index determined during rasterization, where the original player index corresponds to the target. For example, if there are two targets, the synthesized pixels corresponding to the first target will have a first player index, while the synthesized pixels corresponding to the second target will have a second player index. In the illustrated example, the pixel matrix 160 corresponds to only one target, so the synthesized pixel 162 has the original player index P1, while the synthesized pixel 166 has the original player index P1. Other types of indexing systems may be used without departing from the scope of this disclosure.
As indicated at 170, each synthesized pixel in the synthesized depth image may include a pixel address. The pixel address may define the location of the pixel relative to other pixels. In the illustrated example, synthesized pixel 162 has a pixel address [5, 7], while synthesized pixel 166 has a pixel address [4, 8]. It is understood that other addressing schemes may be used without departing from the scope of this disclosure.
As indicated at 170, each synthesized pixel may optionally include other types of information, some of which may be obtained after rasterization. For example, each synthesized pixel may include an updated body-part index, which may be determined as part of a snap operation performed during rasterization, as described below. Each synthesized pixel may include an updated player index, which may be determined as part of a snap operation performed during rasterization. Each synthesized pixel may include an updated body-part index, which may be obtained as part of a grow/fix-up operation, as described below. Each synthesized pixel may include an updated player index, which may be obtained as part of a grow/fix-up operation, as described below. Each synthesized pixel may include an updated synthesized depth value, which may be obtained as part of a snap operation.
The example types of pixel information provided above are not limiting. Various different types of information may be stored as part of each pixel. Such information may include information obtained from the depth image, information obtained from rasterizing the machine-readable model, and/or information derived from one or more processing operations (e.g., a snapping operation, a growing operation, etc.). Such information may be stored as part of a common data structure, or different types of information may be stored in different data structures that may be mapped to specific pixel locations (e.g., by pixel address). As an example, the player index and/or body-part index obtained as part of a snap operation during rasterization may be stored in a rasterization map and/or snap map, while the player index and/or body-part index obtained as part of a grow/fix-up operation after rasterization may be stored in a grow map, as described below. Non-limiting examples of other types of pixel information that may be assigned to each pixel include, but are not limited to, joint indices, bone indices, vertex indices, triangle indices, centroid indices, and the like.
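Purely for illustration, the per-pixel fields enumerated above could be grouped into a record such as the following; the field names and types are assumptions, not structures defined by the patent.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class SynthesizedPixel:
        pixel_address: Tuple[int, int]              # e.g., (5, 7)
        synthesized_depth_cm: float                 # e.g., 382
        original_body_part: str                     # e.g., "bp4", set during rasterization
        original_player_index: str                  # e.g., "P1"
        snap_body_part: Optional[str] = None        # updated by the snap operation
        grow_body_part: Optional[str] = None        # updated by the grow/fix-up operation
        observed_depth_cm: Optional[float] = None   # depth-camera value at this address

    pixel_162 = SynthesizedPixel((5, 7), 382.0, "bp4", "P1")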
Although a distinction is made between observed pixels and synthesized pixels, it is to be understood that these distinctions are made merely for convenience of description. At each pixel address, data may be used to represent observed information obtained from a depth camera or other source. Also, at each pixel address, data may be used to represent information that is rasterized, derived, calculated, or otherwise synthesized. When considering observed data (e.g., an observed depth value) for a pixel, the pixel may be referred to as an observed pixel. When the synthesized data (e.g., synthesized depth values) of the same pixel is considered, the same pixel may be referred to as a synthesized pixel. As such, by comparing the observed data at the pixel address with the synthesized data at the pixel address, a comparison can be made between the observed pixel and the synthesized pixel at the same pixel address.
At 118, the method 100 of FIG. 7 may optionally include snapping and/or growing the body part index and/or the player index. In other words, the synthesized depth image may be enlarged to cause the body-part index and/or player index of certain pixels to change in an attempt to more closely correspond to the modeled target. Where reference is made to a body part index or player index without explicit reference to an index derived from the rasterization initially, an index derived from the snapping operation, or an index derived from the growing operation, it is to be understood that any one or more of these indices, as well as other indices derived from other suitable methods of estimating the player and/or body part to which the pixel belongs, may be used.
In performing the rasterization described above, one or more Z-buffers and/or body-part/player index maps may be constructed. By way of non-limiting example, a first version of such a buffer/map may be constructed by performing the following Z-test: the front-most surface at each pixel, i.e., the surface closest to the viewer (e.g., the depth camera), is selected, and the body-part index and/or player index associated with that surface is written to the corresponding pixel. This map may be referred to as a rasterization map or an original synthesized depth map, and may include an original body-part index for each pixel. A second version of the buffer/map may be constructed by performing the following Z-test: the surface of the model that is closest to the observed depth value at the pixel is selected, and the body-part index and/or player index associated with that surface is written to the corresponding pixel. This may be referred to as a snap map, and this map may include a snap body-part index for each pixel. These tests may be constrained to reject Z-distances between synthesized depth values and observed depth values that exceed a predetermined threshold. In some embodiments, two or more Z-buffers and/or two or more body-part/player index maps may be maintained, allowing two or more of the above-described tests to be performed.
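A schematic rendering of the two Z-tests described above follows, assuming the rasterizer yields, for each pixel, a list of candidate model surfaces with their depths and indices; the 0.2 m rejection threshold is an arbitrary stand-in for the predetermined threshold.

    from collections import namedtuple

    Surface = namedtuple("Surface", "depth body_part player")

    def rasterization_test(candidates):
        # First buffer/map: keep the front-most surface, i.e., the one closest to the camera.
        return min(candidates, key=lambda s: s.depth)

    def snap_test(candidates, observed_depth, max_z_distance=0.2):
        # Second buffer/map: keep the surface whose depth best matches the observed
        # depth value, rejecting matches beyond the predetermined Z-distance.
        best = min(candidates, key=lambda s: abs(s.depth - observed_depth))
        return best if abs(best.depth - observed_depth) <= max_z_distance else None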
A third version of the buffer/map may be constructed by growing and/or correcting the body-part/player index map. This may be referred to as a grow map. Starting with a copy of the snap map described above, values may be grown over any "unknown" values within a predetermined Z distance so that the space occupied by the target but not occupied by the body model may be filled in with the appropriate body-part/player index. This approach may also include replacing a known value if a more satisfactory match is identified.
The grow map may begin with a pass over the synthesized pixels of the snap map to detect pixels having neighboring pixels with different body-part/player indices. These may be considered "edge" pixels, i.e., boundaries along which values may optionally be propagated. As described above, growing a pixel value may include growing to an "unknown" or a "known" pixel. For "unknown" pixels, for example, the body-part/player index value may have been 0 before, but the pixel may now have non-zero neighboring pixels. In this case, the four directly adjacent pixels may be examined, and an adjacent pixel having an observed depth value that more closely resembles the depth value of the pixel of interest may be selected and assigned to the pixel of interest. In the case of "known" pixels, a pixel with a known non-zero body-part/player index value may be overtaken if one of its neighboring pixels has a depth value written during rasterization that more closely matches the observed depth value of the pixel of interest than the synthesized depth value of that pixel does.
Additionally, for efficiency purposes, updating the body-part/player index value of the synthesized pixel may include adding its neighboring four pixels to a pixel queue to be re-accessed on a subsequent pass. As such, the value may continue to propagate along the boundary without going through all pixels in its entirety. As another optimization, different N × N blocks of pixels (e.g., 16 × 16 blocks of pixels) occupied by the target of interest may be tracked so that other blocks not occupied by the target of interest may be ignored. This optimization can be applied in various forms at any point during target analysis after rasterization.
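A simplified sketch of the propagation just described is given below, assuming the snap map is a list of lists of body-part/player indices (None for unknown) and the depth images are NumPy arrays. For brevity this sketch seeds the queue with every known pixel rather than only the detected edge pixels, and it omits the "known pixel" overtaking rule and the N x N block optimization.

    from collections import deque
    import numpy as np

    def grow_indices(snap_index, snap_depth, observed_depth, max_z_distance=0.2):
        h, w = observed_depth.shape
        grown = [row[:] for row in snap_index]        # start from a copy of the snap map
        queue = deque((y, x) for y in range(h) for x in range(w)
                      if snap_index[y][x] is not None)
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and grown[ny][nx] is None:
                    # Grow into "unknown" neighbors whose observed depth is close enough.
                    if abs(observed_depth[ny, nx] - snap_depth[y, x]) <= max_z_distance:
                        grown[ny][nx] = grown[y][x]
                        queue.append((ny, nx))        # revisit its neighbors on a later pass
        return grown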
However, it is noted that the grow operation may take a variety of different forms. For example, various flood fills may be performed first to identify regions of similar value, and then it may be decided which regions belong to which body parts. Furthermore, the number of pixels that any body-part/player index object (e.g., left forearm body part bp4 of FIG. 3) can grow may be limited based on how many pixels such object is expected to occupy (e.g., given its shape, distance, and angle) versus how many pixels in the snap map are assigned to the body-part/player index. Additionally, the above-described method may include adding advantages or disadvantages for certain poses to offset growth for certain body parts so that growth may be correct.
If it is determined that a distribution of pixels from a body part is grouped to one depth and another distribution of pixels from the same body part is grouped to another depth, such that there is a gap between the two distributions, then a progressive snap adjustment may be made to the snap map. For example, an arm swung in front of and near the torso may "spill over" into the torso. This may result in a set of torso pixels having a body-part index indicating that these pixels are arm pixels, when in fact they should be torso pixels. By examining the distribution of synthesized depth values in the forearm, it can be determined that some arm pixels can be grouped into one depth and the rest into another depth. The gap between these two sets of depth values indicates a jump between arm pixels and pixels that should be torso pixels. Thus, in response to identifying such a gap, the spill may then be remedied by assigning a torso body-part index to the spilled pixels. As another example, progressive snap adjustment may be helpful in an "arm-over-background-object" case. In this case, a histogram may be used to identify gaps in the observed depth of the pixels of interest (i.e., pixels that are deemed to belong to the arm). Based on this gap, one or more groups of pixels may be identified as belonging correctly to the arm, and/or other groups may be rejected as background pixels. The histogram may be based on various metrics, such as absolute depth, depth error (synthesized depth minus observed depth), etc. Prior to any grow operations, progressive snap adjustments may be performed inline during rasterization.
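The histogram-based gap detection mentioned above might look something like the sketch below; the 2 cm bin size and the three-bin gap width are illustrative assumptions, not values from the disclosure.

    import numpy as np

    def find_depth_gap(depth_values_cm, bin_cm=2.0, min_gap_bins=3):
        # Histogram the depths of pixels currently labeled as one body part (e.g., an arm)
        # and return the (low, high) edges of the first sufficiently wide empty run, or None.
        d = np.asarray(depth_values_cm, dtype=float)
        n_bins = max(int((d.max() - d.min()) / bin_cm), 1)
        counts, edges = np.histogram(d, bins=n_bins)
        run_start = None
        for i, count in enumerate(counts):
            if count == 0 and run_start is None:
                run_start = i
            elif count != 0:
                if run_start is not None and i - run_start >= min_gap_bins:
                    return edges[run_start], edges[i]
                run_start = None
        return None

    # Arm pixels near 60 cm and spilled-over torso pixels near 90 cm leave a clear gap.
    print(find_depth_gap([60, 61, 62, 61, 90, 91, 92, 90]))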
At 120, method 100 of FIG. 7 may optionally include creating a height map from the observed depth image, the synthesized depth image, and the body-part/player index maps at the three stages of processing described above. The gradient of such a height map and/or a blurred version of such a height map may be used in determining the direction of adjustments to be made to the model, as described below. However, height mapping is only an optimization; alternatively or additionally, a search in all directions may be performed to identify the nearest joints to which adjustments may be applied and/or the direction in which those adjustments are to be made. When height maps are used, they may be created before, after, or simultaneously with the pixel case determinations described below. When used, the height map is designed to set the actual body of the player to a low elevation and the background elements to a high elevation. A watershed-style technique may then be used to track "downhill" in the height map to find the closest point on the player to the background, or vice versa (i.e., to search "uphill" in the height map to find the background pixel closest to a given player pixel).
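A toy version of the height map idea follows, assuming a boolean player mask; the box blur and gradient are stand-ins for whatever blurring and "downhill" search an implementation actually uses.

    import numpy as np

    def height_map_and_gradient(player_mask, blur_passes=3):
        # Player pixels sit at a low elevation (0), background pixels at a high one (1).
        height = np.where(player_mask, 0.0, 1.0)
        for _ in range(blur_passes):
            padded = np.pad(height, 1, mode="edge")
            height = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:] + padded[1:-1, 1:-1]) / 5.0
        # The gradient points "uphill" toward the background; walking against it heads
        # "downhill" toward the player, as in the watershed-style search described above.
        gy, gx = np.gradient(height)
        return height, gy, gx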
The synthesized depth image and the observed depth image may not be identical, and thus the synthesized depth image may benefit from adjustments and/or modifications so that it more closely matches the observed depth image and therefore more accurately represents the target. It is to be understood that adjustments may be made to the synthesized depth image by first making adjustments to the model (e.g., changing the pose of the model), and then synthesizing the adjusted model into a new version of the synthesized depth image.
Many different approaches may be taken to modify the synthesized depth image. In one approach, two or more different models may be obtained and rasterized to produce two or more synthesized depth images. Each synthesized depth image may then be compared to the observed depth image according to a predetermined set of comparison metrics. A synthesized depth image exhibiting a closest match to the observed depth image may be selected and the process may optionally be repeated to refine the model. When used, this process may be particularly useful for refining the body model to match the player's body shape and/or body size.
In another approach, two or more synthesized depth images may be blended via interpolation or extrapolation to produce a blended synthesized depth image. In yet another approach, two or more synthesized depth images may be mixed in such a way that the mixing techniques and parameters vary across the mixed synthesized depth images. For example, if a first synthesized depth image satisfactorily matches the observed depth image in one region and a second synthesized depth image satisfactorily matches in a second region, the selected pose in the blended synthesized depth image may be a blend similar to the pose used to create the first synthesized depth image in the first region and the pose used to create the second synthesized depth image in the second region.
In yet another approach, as shown at 122 of FIG. 7, the synthesized depth image may be compared to the observed depth image. Each synthesized pixel of the synthesized depth image may be classified based on the result of the comparison. This classification may be referred to as determining a pixel case for each pixel. The model used to create the synthesized depth image (e.g., body model 70 of fig. 3) may be systematically adjusted according to the determined pixel cases.
As described above, one or more pixel cases of each synthesized pixel may be selected based on a comparison with a corresponding pixel of the observed image having the same pixel address as the synthesized pixel. In some embodiments, the comparison may be based on one or more factors including, but not limited to: the difference between the observed depth value and the synthesized depth value for the synthesized pixel; a difference between the original body-part index, (snap) body-part index, and/or (grow) body-part index of the synthesized pixel; and/or a difference between the original player index, (snap) player index, and/or (grow) player index of the synthesized pixel. Accordingly, in some embodiments, the pixel case may be selected from a defined set of pixel cases, as described in more detail with reference to 124 and 136 of FIG. 7.
As an example, FIG. 13 illustrates an example of a synthesized depth image (e.g., synthesized depth image 150 of FIG. 9) being compared to a corresponding observed depth image (e.g., observed depth image 140 of FIG. 8) by analysis to determine pixel mismatches and thereby identify pixel cases. The synthesized pixels of the synthesized depth image 150 corresponding to the model are represented in FIG. 13 by a synthesized contour 200 depicted with a solid line, while the observed pixels of the observed depth image 140 corresponding to the target are represented in FIG. 13 by an observed contour 202 depicted with a dashed line. It will be appreciated that although this comparison is schematically depicted as a visual comparison, in practice it may be an analytical comparison of the information corresponding to each pixel address, such as that shown in FIG. 10.
After each synthesized pixel or group of synthesized pixels is compared to the corresponding observed pixel or group of observed pixels, each synthesized pixel may be associated with a pixel case. For example, for each synthesized pixel, a pixel case may be selected from a defined set of pixel cases, such as a refine-z pixel case, a magnetic pixel case, a push pixel case, a pull pixel case, a self-occluding push and/or pull pixel case, and so forth.
FIG. 14 illustrates an example region of synthesized pixels of synthesized contour 200 that do not match (e.g., the depth value of the observed depth image differs from the depth value of the synthesized depth image by more than a threshold amount), as indicated by the diagonal shading shown at 204. After identifying which synthesized pixels of the synthesized depth image do not match the pixels having the same pixel address in the observed image, the model represented in the synthesized depth image may be adjusted so that the model better represents the target.
FIG. 15 illustrates another example comparison 206 of a synthesized depth image and a corresponding observed depth image, where different pixel cases have been selected for different regions of synthesized pixels. Region 208 includes one or more portions of the model that are shifted forward or backward in the depth direction (e.g., Z-shifted) relative to corresponding portions of the observed depth image. As an example, region 208 may correspond to pixels having a refine-z pixel case. The regions identified by diagonal shading (such as the example region 210) indicate portions of the model that are shifted away from the contour of the human target in the observed depth image. As an example, region 210 may correspond to pixels having a push pixel case. The regions identified by horizontal-line shading (such as the example region 212) indicate portions of the observed depth image that lie outside the contour of the model. As an example, region 212 may correspond to pixels having a pull pixel case. The regions identified by cross-hatching (such as the example region 214) indicate portions of the model corresponding to pixels having a magnetic pixel case, such as arms and/or hands.
Returning to FIG. 7, as described above, for each synthesized pixel, a pixel case may be selected from a defined set of pixel cases, such as a refine-z pixel case, a magnetic pixel case, a push pixel case, a pull pixel case, and a self-occluding push and/or pull pixel case. The pixel mismatches identified in this way may then be corrected by adjusting the model to more closely match the observed image. Such adjustments may be made, for example, by applying forces to the model to reposition the model into a different pose that more closely matches the observed image. In certain embodiments, each force may be applied via a force vector having a magnitude and a direction, which may be applied to a force-receiving location of the model, as indicated at 141, 142, and 144 of FIG. 7. The calculation and application of each force vector may be based on the pixel case. Such a force vector may be derived from a single pixel address or from a set of two or more related pixel addresses (e.g., adjacent pixel addresses having matching values of body-part index, player index, etc.). Examples of pixel cases and associated force vectors are discussed in more detail below.
As shown at 124 of FIG. 7, determining the pixel case may include selecting a refine-z pixel case. A refine-z pixel case may be selected when the observed depth value of an observed pixel (or region of observed pixels) of the observed depth image does not match the synthesized depth value in the synthesized depth image, but is close enough to likely belong to the same object in both images, and the body-part indices match (or in some cases correspond to neighboring body parts or regions). A refine-z pixel case may be selected for a synthesized pixel if the difference between the observed depth value and the synthesized depth value for the synthesized pixel is within a predetermined range, and (optionally) the (grow) body-part index for the synthesized pixel corresponds to a body part that has not been designated to receive magnetic forces. As another example, if the synthesized depth value does not match the observed depth value, and the absolute difference between the synthesized depth value and the observed depth value is less than a predetermined threshold, then the synthesized pixel of interest may be classified with a refine-z pixel case.
The refine-z pixel case corresponds to a calculated force vector that may exert a force on the model to move the model into the correct position. In other words, a refine-z force vector may be applied to one or more force-receiving locations of the model to move a portion of the model toward the corresponding portion of the observed depth image (e.g., in a direction along the Z axis and perpendicular to the image plane). The calculated force vector may be applied along the Z axis perpendicular to the image plane, along a vector normal to an aspect of the model (e.g., the face of the corresponding body part), and/or along a vector normal to nearby observed pixels. In some embodiments, the calculated force vector may be applied along a combination of the vector normal to the face of the corresponding body part and the vector normal to the nearby observed pixels. By way of non-limiting example, such a combination may be an average, a weighted average, a linear interpolation, or the like. The magnitude of the force vector may be based on the difference between the observed depth value and the synthesized depth value, with greater differences corresponding to greater forces. In other words, in some embodiments, the force vector may increase in proportion to the absolute difference between the synthesized depth value and the observed depth value. The force-receiving location at which the force is applied may be selected as the nearest qualifying force-receiving location (e.g., the nearest torso joint) for the pixel of interest, or the force may be distributed among a weighted blend of the nearest qualifying force-receiving locations. The nearest qualifying force-receiving location may be selected; however, in some cases it may be helpful to apply a bias. For example, if a certain pixel is located halfway down the thigh, and it has been determined that the hip joint is less mobile (or flexible) than the knee, it may be helpful to bias the force for that mid-thigh pixel to act on the knee rather than the hip. Other examples of such biasing are described below.
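The following sketch illustrates one possible form of the refine-z force calculation described above. It is a minimal illustration rather than the reference implementation; the function name, the blending parameter, and the assumption that depth increases along the +Z axis are all hypothetical.

```python
import numpy as np

def refine_z_force(observed_z, synthesized_z, normal=None, blend=0.5):
    # Magnitude grows with the absolute depth mismatch; direction defaults to
    # the Z axis (perpendicular to the image plane) and may optionally be
    # blended with a supplied surface normal. Assumes depth increases along +Z.
    dz = observed_z - synthesized_z
    direction = np.array([0.0, 0.0, np.sign(dz)])   # move toward the observed depth
    if normal is not None:
        n = np.asarray(normal, float)
        n = n / np.linalg.norm(n)
        direction = (1.0 - blend) * direction + blend * n
        norm = np.linalg.norm(direction)
        if norm > 1e-9:
            direction = direction / norm
    return abs(dz) * direction                      # proportional to the mismatch
```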
In some embodiments, the nearest qualifying force-receiving location for the refine-z pixel case may be determined by comparing the distance between the synthesized pixel of interest and each qualifying force-receiving location. The nearest qualifying force-receiving location may be determined, for example, by comparing the distance between the synthesized pixel of interest and each qualifying force-receiving location on the body part associated with the body-part index of the synthesized pixel of interest. As another example, the force vector may be one of a plurality of force vectors applied to a weighted blend of a plurality of nearest qualifying force-receiving locations. Further, the application of the force vector may be biased toward qualifying force-receiving locations that are relatively more mobile. For example, the application of the force vector may be biased toward a less-near qualifying force-receiving location that is more mobile than the nearest qualifying force-receiving location.
With or without the biasing described above, the determination of which force-receiving location is closest to the pixel of interest (i.e., the synthesized pixel of interest) may be made by a brute-force search, as sketched below. To speed up the search, the set of force-receiving locations searched may be limited to only those force-receiving locations that are located at or near the body part associated with the body-part index of that pixel. BSP (binary space partitioning) trees may also be set up each time the pose changes to help speed up these searches. Each region of the body, or each body part corresponding to a body-part index, may be given its own BSP tree. If so, the biasing may be applied differently for each body part, which further allows a judicious choice of the appropriate force-receiving location.
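As a minimal sketch of the brute-force search just described (the data layout, the mobility bias, and all names are assumptions for illustration rather than part of the disclosed implementation):

```python
import numpy as np

def nearest_force_receiving_location(pixel_xyz, joints, part_joints, mobility=None):
    # 'joints' maps joint names to 3-D positions; 'part_joints' lists the
    # candidate joints for the relevant body-part index; 'mobility' optionally
    # biases the choice toward more mobile joints (e.g., knee over hip).
    p = np.asarray(pixel_xyz, float)
    best, best_score = None, float("inf")
    for name in part_joints:
        dist = float(np.linalg.norm(np.asarray(joints[name], float) - p))
        score = dist / (mobility.get(name, 1.0) if mobility else 1.0)
        if score < best_score:
            best, best_score = name, score
    return best
```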
As shown at 126 of FIG. 7, determining the pixel case may include selecting a magnetic pixel case. The magnetic pixel case may be used when the synthesized pixel being examined in the grown map corresponds to some predetermined subset of body parts, such as the arms, i.e., bp3, bp4, bp5, bp7, bp8, and bp9 of FIG. 3. Although the arms are provided as an example, in some scenarios other body parts, such as the legs or the entire body, may optionally be associated with the magnetic pixel case. Likewise, in some scenarios the arms may not be associated with the magnetic pixel case.
The pixels labeled with the magnetic case may be grouped into regions, each region being associated with a particular body part (in this example, the upper left arm, lower left arm, left hand, etc.). For example, a grow operation such as that described above may be completed before the magnetic pixels are processed. During the grow operation, each pixel may be "tagged" with the body part of the target that most likely corresponds to that pixel. However, during the grow operation it may be possible for one or more pixels to be tagged with the wrong body part (i.e., mistagged). As an example, during rapid movements of the arm and/or hand, the motion predictor may not be able to fully predict the motion, and thus rapidly moving hand pixels may not be added to the snap map, whereas slower upper-arm pixels near the shoulder may still be added to the snap map. In this case, limb pixels farther from the shoulder may have relatively more error in their assigned body-part indices. Where these pixels are mistagged, lower-arm pixels may grow down into the hand region during the grow operation. As another example, if neither the lower-arm pixels nor the hand pixels are added to the snap map, the upper-arm pixels that were added to the snap map may be grown down into the lower-arm and hand regions. Thus, for example, pixels corresponding to the hand of the human target may be tagged as "lower arm," or all arm pixels may be tagged as "upper arm." Therefore, it may be useful to discard this information when processing the magnetic pixels, as described in more detail below.
Although the grow operation may incorrectly identify which part of the limb these pixels belong to, the identification of the limb itself tends to have a higher confidence. In other words, although a lower-arm pixel may be incorrectly associated with the upper arm, the fact that the pixel corresponds to some part of the arm is still correct. Accordingly, the sub-part classification assigned during the grow operation may be discarded. As such, magnetic pixels may be grouped into broader categories (i.e., "pools"), such as "left arm," "right arm," and "other." The pixels in the left-arm and right-arm pools may then be marked as belonging to the magnetic pixel case. The above is a non-limiting example, and other methods of identifying arm pixels, or other pixels belonging to flexible body parts, may be used.
For each pixel labeled with the magnetic case (e.g., a pixel of the left-arm pool), the location of that pixel may be converted from a screen-space location having X, Y pixel coordinates and a depth value to a world-space location having coordinates identifying that location in three-dimensional space. It will be appreciated that this is only one embodiment of processing the pixels; in other embodiments, the screen-space to world-space conversion may be performed differently or at a different stage of processing.
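A minimal sketch of such a screen-space to world-space conversion is shown below. It assumes a simple pinhole camera model with known focal lengths and principal point, which are assumptions for illustration; the actual conversion used by a particular depth camera may differ.

```python
def screen_to_world(px, py, depth, fx, fy, cx, cy):
    # Back-project a pixel (px, py) with a depth value into a 3-D world-space
    # point, given focal lengths (fx, fy) and principal point (cx, cy).
    x = (px - cx) * depth / fx
    y = (py - cy) * depth / fy
    return (x, y, depth)
```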
Continuing with the processing of each of the left-arm and right-arm magnetic pixels, each pixel may be projected onto one of the "bones" that make up the arm of the model, each bone being represented by a line segment. As with all pixel cases, the pixel may be projected onto the current best-guess version of that bone. This best-guess version of the bone may come from the final pose of the previous frame (with or without momentum), or it may have been updated by any adjustments already made during the current frame (e.g., running the refine-z pixel case to adjust the bone and then using the adjusted bone for the magnetic pixels). In other words, the joints may be progressively updated at any point during the processing of a frame, and subsequent processing in the current or subsequent frames may be performed using the updated joints.
As an example of magnetic processing, the arm may comprise three bone segments, namely an upper arm bone, a lower arm bone and a hand. For each pixel, the point on the finite line segment closest to the pixel can be determined analytically. In some embodiments, this may include comparing the pixel to a three-dimensional joint position (which is pulled forward in the Z direction by the estimated radius of the limb at the joint) such that the comparison is a comparison of two surface values rather than one surface value and one internal value.
The pixel may then be assigned to the nearest line segment. In some embodiments, if it is determined that the nearest line segment may be incorrect, the pixel may be assigned to a different line segment. For example, if the target's arm is extended and the model's arm is in a "chicken wing" position, a pixel far enough from the shoulder (e.g., 1.5 times the length of the upper arm) may cause the nearest line segment to be replaced with the lower-arm bone. After determining which bone the pixel is to be associated with, the location of the pixel may be added to the "near" and "far" centroids of that bone, as described in more detail below.
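The closest-point computation used to assign a magnetic pixel to a bone can be sketched as follows; the representation of a bone as a pair of joint positions is an assumption for illustration.

```python
import numpy as np

def closest_point_on_bone(pixel_xyz, joint_a, joint_b):
    # Analytic closest point on the finite line segment (joint_a, joint_b) to a
    # world-space pixel position, plus the distance to it.
    p = np.asarray(pixel_xyz, dtype=float)
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    ab = b - a
    t = np.dot(p - a, ab) / np.dot(ab, ab)
    t = np.clip(t, 0.0, 1.0)          # stay on the finite segment
    closest = a + t * ab
    return closest, float(np.linalg.norm(p - closest))
```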
For each of these magnetic regions, the centroid of the pixels belonging to that region may be calculated. These centroids may be unbiased (all contributing pixels weighted equally) or biased, with some pixels weighted more heavily than others. For example, for the upper arm, three centroids may be tracked: 1) an unbiased centroid; 2) a "near" centroid, whose contributing pixels are weighted more heavily the closer they are to the shoulder; and 3) a "far" centroid, whose contributing pixels are weighted more heavily the closer they are to the elbow. These weights may be linear (e.g., 2X) or non-linear (e.g., X²) or follow any curve.
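The following sketch accumulates the unbiased, near, and far centroids for a set of upper-arm pixels using linear weights along the bone. The weighting scheme and the returned total weights (useful for the visibility test discussed next) are illustrative assumptions.

```python
import numpy as np

def arm_centroids(pixels, shoulder, elbow):
    # Returns, for each centroid, a (position, total_weight) pair; the weight
    # for "near" grows toward the shoulder and for "far" toward the elbow.
    shoulder, elbow = np.asarray(shoulder, float), np.asarray(elbow, float)
    axis = elbow - shoulder
    axis_len2 = float(np.dot(axis, axis))
    sums = {"plain": np.zeros(3), "near": np.zeros(3), "far": np.zeros(3)}
    weights = {"plain": 0.0, "near": 0.0, "far": 0.0}
    for p in pixels:
        p = np.asarray(p, float)
        t = float(np.clip(np.dot(p - shoulder, axis) / axis_len2, 0.0, 1.0))
        for name, w in (("plain", 1.0), ("near", 1.0 - t), ("far", t)):
            sums[name] += w * p
            weights[name] += w
    return {name: (sums[name] / weights[name] if weights[name] > 0 else None,
                   weights[name]) for name in sums}
```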
Once these centroids are computed, various options are available (and may be dynamically selected) for computing the position and orientation of the body part of interest, even if it is partially occluded. For example, when attempting to determine a new location for the elbow, if the centroid of that region is sufficiently visible (i.e., if the sum of the weights of the contributing pixels exceeds a predetermined threshold), then that centroid itself marks the elbow (estimate #1). However, if the elbow region is not visible (perhaps because it is occluded by some other object or body part), the elbow location may often still be determined, as in the following non-limiting example. If the far centroid of the upper arm is visible, a projection can be made from the shoulder through that centroid for the length of the upper arm to obtain a very likely position of the elbow (estimate #2). If the near centroid of the lower arm is visible, a projection can be made from the wrist up through that centroid for the length of the lower arm to obtain a very likely position of the elbow (estimate #3).
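As a sketch of how estimates #2 and #3 might be formed and blended (estimate #1, the elbow-region centroid itself, could be folded in the same way), the following is illustrative only; the visibility threshold and the weighting by total pixel weight are assumptions.

```python
import numpy as np

def estimate_elbow(shoulder, wrist, upper_far, lower_near,
                   upper_arm_len, lower_arm_len,
                   far_weight, near_weight, visibility_threshold=10.0):
    # Blend estimate #2 (project from the shoulder through the upper arm's far
    # centroid) and estimate #3 (project from the wrist through the lower arm's
    # near centroid), using each centroid only if it is sufficiently visible.
    shoulder, wrist = np.asarray(shoulder, float), np.asarray(wrist, float)
    estimates, weights = [], []
    if upper_far is not None and far_weight > visibility_threshold:
        d = np.asarray(upper_far, float) - shoulder
        estimates.append(shoulder + upper_arm_len * d / np.linalg.norm(d))
        weights.append(far_weight)
    if lower_near is not None and near_weight > visibility_threshold:
        d = np.asarray(lower_near, float) - wrist
        estimates.append(wrist + lower_arm_len * d / np.linalg.norm(d))
        weights.append(near_weight)
    if not estimates:
        return None                     # elbow not recoverable this frame
    return np.average(estimates, axis=0, weights=weights)
```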
Giving priority (or higher weight) to estimates with higher visibility, confidence, pixel count, or any number of other metrics, one of the three possible estimates may be selected, or a blend of the three possible estimates may be made. Finally, in this example, a single magnetic force vector may be applied to the model at the location of the elbow; however, it may be weighted more heavily (when accumulated with the pixel force vectors generated from other pixel cases but acting on the same force-receiving location) to represent the fact that many pixels were used to construct it. When applied, the calculated magnetic force vector may move the model so that the model more satisfactorily matches the target shown in the observed image. One advantage of the magnetic pixel case is its ability to perform well on highly flexible body parts such as arms.
In some embodiments, certain joints or body parts of the model may be adjusted using only the magnetic pixel case.
As shown at 128 and 130 of FIG. 7, determining the pixel case may include selecting a push pixel case and/or a pull pixel case. These pixel cases may be invoked at contours where the synthesized depth value and the observed depth value are severely mismatched at the same pixel address. It is noted that the pull pixel case and the push pixel case may also be used when the original player index does not match the (grow) player index. The determination of whether to push or pull is made as follows. If the synthesized depth image at a given pixel address contains a depth value that is greater (farther) than the depth value in the observed depth image, e.g., by more than a threshold amount, the model may be pulled toward the true contour seen in the grown image. In other words, for portions of the observed depth image that are displaced from the contour of the model, the model may be pulled in the XY plane toward the contour of the target in the observed depth image. As an example, a pull force vector applied to one or more force-receiving locations of the model may be used to "pull" the model. An example of such a pull pixel case is shown in FIG. 16 and described in more detail below.
FIG. 16 schematically shows an example observed depth image 220 being compared to an example synthesized depth image 222, as indicated at 224. In this comparison, each pixel address of the synthesized depth image 222 corresponds to the same pixel address of the observed depth image 220. To illustrate this example more clearly, FIG. 16 shows an exaggerated case in which the observed depth image 220 and the synthesized depth image 222 do not closely match. It will be appreciated that in practice the two images may be mismatched by only a relatively small amount, and that a mismatch as severe as the one shown may be problematic.
Observed depth image 220 includes an image of an observed human target (e.g., a game player), namely player image 226, where the player image 226 has a silhouette (i.e., player silhouette 228) such that pixels inside player silhouette 228 are pixels of player image 226 and pixels outside player silhouette 228 are pixels of the observed background 230. Similarly, the synthesized depth image 222 includes a model 232 representing the observed game player, where the model 232 has a silhouette (i.e., model silhouette 234) such that pixels inside model silhouette 234 are pixels of the model 232 and pixels outside model silhouette 234 are pixels of the synthesized background 236.
Upon comparing the synthesized depth image 222 with the observed depth image 220, a mismatch becomes apparent in that pixels at the same pixel address correspond to different parts of each depth image. For example, an example pixel, synthesized pixel of interest 238, is selected for discussion. As shown, the synthesized pixel of interest 238 corresponds to the synthesized background 236 of the synthesized depth image 222. However, the same pixel address in the corresponding observed depth image corresponds to observed pixel 240, which is associated with player image 226. In this particular example, synthesized pixel of interest 238 has a greater depth value than the corresponding observed pixel 240 because the background is at a greater depth than the game player (i.e., farther from the depth camera). As such, model 232 may be pulled toward synthesized pixel of interest 238 (i.e., toward player silhouette 228), as indicated by arrow 240.
Conversely, if the original synthesized image contains a depth value that is smaller (nearer) than the depth value in the observed depth image, e.g., by more than a threshold amount, the model may be pushed out of the space that the player no longer occupies (and toward the true silhouette in the grown image). In other words, for portions of the model that are displaced from the silhouette of the human target in the observed depth image, the model may be pushed in the XY plane toward the silhouette of the human target in the observed depth image. As an example, the model may be "pushed" using a push force vector applied to one or more force-receiving locations of the model. An example of such a push pixel case is shown in FIG. 17 and described in more detail below.
FIG. 17 schematically shows a comparison similar to that shown in FIG. 16, i.e., a comparison of the synthesized depth image 222 and the observed depth image 220. However, for the example shown in FIG. 17, a different synthesized pixel of interest is examined, namely synthesized pixel of interest 250. Synthesized pixel of interest 250 corresponds to model 232 of synthesized depth image 222, while observed pixel 252 at the same pixel address in the corresponding observed depth image 220 is associated with the observed background 230. In this particular example, synthesized pixel of interest 250 has a smaller depth value than the corresponding observed pixel 252 because the model is at a smaller depth than the background (i.e., closer to the depth camera). As such, the model 232 may be pushed away from the synthesized pixel of interest 250 (i.e., toward the player silhouette 228), as indicated by arrow 254.
In either case (e.g., the pull pixel case of FIG. 16 or the push pixel case of FIG. 17), for each of these pixels or pixel regions, a two- or three-dimensional calculated force vector may be applied to the model to correct the silhouette mismatch, pushing or pulling portions of the body model to positions that more accurately match the positions of the target in the observed depth image. The direction of such a push and/or pull is typically primarily in the XY plane, although a Z component may be added to the force in some scenarios. Accordingly, in some examples, the push and/or pull force vectors may be three-dimensional vectors that include a Z component.
For example, for the pull example shown in FIG. 16, a pull force vector may be applied to a force-receiving location of the model 232 to pull the model 232 toward the player silhouette 228 in the observed depth image. The magnitude of the pull force vector may be proportional to the pull offset distance by which the corresponding portion of the observed depth image is displaced from the silhouette of the model. In other words, the pull offset distance D1 may be defined as the distance between the synthesized pixel of interest (e.g., pixel 238) and the nearest qualifying pixel of the model silhouette 234. As such, the magnitude of the pull force vector, D2, may be a function of the pull offset distance D1, as described in more detail below. Further, the direction of the pull force vector may be parallel to the vector extending from the nearest qualifying pixel on the model silhouette 234 to the synthesized pixel of interest 238.
For the push pixel example shown in FIG. 17, a push force vector may be applied to a force-receiving location of the model 232 to push the model 232 toward the player silhouette 228 in the observed depth image 220. The magnitude of the push force vector may be proportional to the push offset distance by which the corresponding portion of the model is displaced from the player silhouette 228. In other words, the push offset distance D1 may be defined as the distance between the synthesized pixel of interest (e.g., pixel 250) and the nearest qualifying pixel of the player silhouette 228. In certain embodiments, the magnitude of the push force vector, D2, may be a function of the push offset distance D1, as described in more detail below. Further, the direction of the push force vector may be parallel to the vector extending from the synthesized pixel of interest 250 to the nearest qualifying pixel on the player silhouette 228.
To generate the appropriate force vector for either the push or the pull case, the nearest qualifying point on the model silhouette in the synthesized depth image (for the pull case) or on the player silhouette in the observed depth image (for the push case) may first be found. For each source pixel (or for each group of source pixels), this point can be found by performing a brute-force, exhaustive 2D search for the closest point (on the desired silhouette) that satisfies the following criteria. In the pull pixel case, the nearest pixel in the original map whose player index (at the sought position) matches the player index in the grown map (at the source pixel or region) is found. In the push pixel case, the nearest pixel in the grown map whose player index (at the sought position) matches the player index in the original map (at the source pixel or region) is found.
However, a brute-force search can be computationally expensive, and optimizations may be used to reduce the computational cost. One non-limiting example optimization for finding this point more efficiently is to follow the gradient of the height map described above, or a blurred version thereof, and to examine pixels in a straight line only along the direction of the gradient. In this height map, the height value is low where the player index is the same in both the original player index map and the grown player index map, and high where the player index (in both maps) is zero. At any given pixel, the gradient may be defined as a vector pointing "downhill" in this height map. Both pull and push pixels may then seek along this gradient (downhill) until they reach their respective stop conditions, as sketched below. As such, a one-dimensional search along the gradient of the blurred height map may be used to find the nearest qualifying pixel on the model silhouette 234 and/or the nearest qualifying pixel on the player silhouette 228. Further, the nearest qualifying pixel on the model silhouette 234 may be found by testing model silhouette pixels near the silhouette pixel found using the one-dimensional search. Likewise, the nearest qualifying pixel on the player silhouette 228 may be found by testing player silhouette pixels near the silhouette pixel found using the one-dimensional search.
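A minimal sketch of such a downhill seek over a blurred height map follows. The stop conditions, step size, and gradient estimate are illustrative assumptions; an implementation could also resample the gradient less often or take larger steps, as noted below.

```python
import numpy as np

def seek_contour(height_map, start_xy, max_steps=64, step=1.0):
    # Walk "downhill" along the gradient of a (blurred) height map until the
    # height no longer decreases or the step budget runs out. The travelled
    # distance from start_xy to the returned position corresponds to D1.
    h, w = height_map.shape
    x, y = float(start_xy[0]), float(start_xy[1])
    for _ in range(max_steps):
        xi, yi = int(round(x)), int(round(y))
        # central-difference gradient, clamped at the image border
        gx = height_map[yi, min(xi + 1, w - 1)] - height_map[yi, max(xi - 1, 0)]
        gy = height_map[min(yi + 1, h - 1), xi] - height_map[max(yi - 1, 0), xi]
        norm = np.hypot(gx, gy)
        if norm < 1e-6:
            break                                   # flat region: stop
        nx = float(np.clip(x - step * gx / norm, 0, w - 1))
        ny = float(np.clip(y - step * gy / norm, 0, h - 1))
        if height_map[int(round(ny)), int(round(nx))] >= height_map[yi, xi]:
            break                                   # no longer descending
        x, y = nx, ny
    return x, y
```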
Other basic optimizations for this seek operation include skipping pixels using interval bisection or a slope-based approximation; re-sampling the gradient at regular intervals as the search progresses; and, once the stopping criterion is met, checking nearby (not directly along the gradient) for a better/closer match. Some search strategies may select the nearest qualifying pixel from a subset of candidate pixels that satisfy one or more selection criteria, such as pixels having a certain body-part index.
Regardless of the technique used to find the closest point on the silhouette of interest, the distance traveled (the distance between the source pixel and the silhouette pixel), D1, may be used to calculate the magnitude (length), D2, of the force vector that will push or pull the model. In certain embodiments, D2 may be linearly or non-linearly related to D1 (e.g., D2 = 2 × D1 or D2 = D1²). As one non-limiting example, the following formula may be used: D2 = (D1 - 0.5 pixels) × 2. As described above, D1 may be a pull offset distance or a push offset distance. Accordingly, D2 may be the magnitude of the pull force vector or the push force vector, respectively. The pull offset distance and/or the push offset distance may be found using a one-dimensional search along the gradient of the blurred height map, as described above.
For example, if there is a 5-pixel gap between the contours in the two depth images, each pixel in the gap may perform a small "seek" and generate a force vector. Pixels near the true contour may seek by only 1 pixel to reach the contour, so that the force magnitude at those pixels is (1 - 0.5) × 2 = 1. Pixels far from the true contour may seek by 5 pixels, so that the force magnitude is (5 - 0.5) × 2 = 9. In general, the seek distances from the pixels closest to the true contour to those farthest away will be D1 = {1, 2, 3, 4, 5}, and the resulting force magnitudes will be D2 = {1, 3, 5, 7, 9}. The average of D2 in this example is 5, as desired: the average magnitude of the resulting force vectors corresponds to the distance between the contours (near each force-receiving location), which is the distance the model should be moved to put it in place.
For each source pixel, the final force vector may then be constructed from this direction and magnitude (i.e., length). For a pull pixel, the direction is determined by the vector from the contour pixel to the source pixel; for a push pixel, it is the reverse vector. The length of the force vector is D2. At each pixel, the force may then be applied to the best-qualifying (e.g., nearest) force-receiving location, or distributed among several force-receiving locations, and these forces may be averaged at each force-receiving location to produce an appropriate local movement of the body model. Although not shown in FIGS. 16-17, in some embodiments the force-receiving locations may be joints of the model.
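The construction and per-location averaging of these push/pull forces might look like the following sketch. The D2 = (D1 - 0.5) × 2 mapping comes from the example above; the data layout (seek results already paired with a receiving joint) is an assumption for illustration.

```python
import numpy as np
from collections import defaultdict

def accumulate_contour_forces(seeks, is_pull):
    # 'seeks' is a list of (source_pixel, contour_pixel, receiving_joint)
    # tuples. Magnitude uses D2 = (D1 - 0.5) * 2; direction is contour-to-source
    # for pull pixels and source-to-contour for push pixels. Forces landing on
    # the same joint are averaged.
    per_joint = defaultdict(list)
    for src, contour, joint in seeks:
        src, contour = np.asarray(src, float), np.asarray(contour, float)
        d1 = float(np.linalg.norm(src - contour))
        d2 = (d1 - 0.5) * 2.0
        direction = (src - contour) if is_pull else (contour - src)
        norm = np.linalg.norm(direction)
        if norm > 1e-6 and d2 > 0:
            per_joint[joint].append(d2 * direction / norm)
    return {j: np.mean(forces, axis=0) for j, forces in per_joint.items()}
```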
As shown at 132 and 134 of FIG. 7, determining the pixel case may include selecting a self-occluding push and/or pull pixel case. Whereas in the push and pull pixel cases described above a body part may be moving in the foreground relative to the background or to another object, the self-occluding push and pull pixel cases address the situation in which one body part is in front of another body part of the same target (e.g., one leg in front of the other leg, an arm in front of the torso, etc.). These cases may be identified when the (snap) player index of a pixel matches its corresponding (grow) player index, but the (snap) body-part index does not match its corresponding (grow) body-part index. In these cases, the seek direction (to find the silhouette) may be derived in several ways. As non-limiting examples, a brute-force 2D search may be performed; a second set of "occlusion" height maps may be tailored to these cases so that their gradient can guide the 1D search; or the direction may be set toward the closest point on the nearest skeletal member. The details of these pixel cases are otherwise similar to the standard pull and push pixel cases.
If the (grow) body-part index of a synthesized pixel corresponds to a body part that has not been designated to receive magnetic forces, then a push, pull, self-occluding push, and/or self-occluding pull pixel case may be selected for the synthesized pixel.
It is to be understood that in some cases a single pixel may be responsible for more than one pixel case. As a non-limiting example, a pixel may be responsible for both a self-occluding push pixel force, applied to a force-receiving location on the occluding body part, and a refine-z pixel force, applied to a force-receiving location on the body part being occluded.
As shown at 136 of FIG. 7, determining the pixel case may include not selecting the pixel case for the synthesized pixel. It will often not be necessary to calculate a force vector for all synthesized pixels of the synthesized depth image. For example, synthesized pixels that are further away from the body model shown in the synthesized depth image and observed pixels that are further away from the target shown in the observed depth image (i.e., background pixels) may not affect any force-receiving locations or body parts. The pixel case need not be determined for these pixels, but may be determined in some scenarios. As another example, the difference between the observed depth value and the synthesized depth value for the synthesized pixel may be below a predetermined threshold (e.g., the model has matched the observed image). Thus, the pixel case need not be determined for these pixels, but may be determined in some scenarios.
At 141, the method 100 of FIG. 7 includes, for each synthesized pixel for which a pixel case has been determined, calculating a force vector based on the pixel case selected for that synthesized pixel. As described above, each pixel case corresponds to a different algorithm and/or methodology for selecting the magnitude, direction, and/or force-receiving location of the force vector. In particular, a force vector (magnitude and direction) may be calculated for each synthesized pixel based on the determined pixel case, and, depending on the type of model, the calculated force vector may be applied to the nearest qualifying joint, a centroid of a body part, a point on a body part, a vertex of a triangle, or another predetermined force-receiving location of the model used to generate the synthesized depth image. In some embodiments, the force attributed to a given pixel may be distributed between two or more force-receiving locations on the model.
The force vectors may be calculated and/or accumulated in any coordinate space, such as world space, screen space (pre-Z division), projection space (post-Z division), model space, and the like. For example, as described above for the push and/or pull pixel example, the magnitude of the push and/or pull force vectors may be proportional to the push offset distance and/or the pull offset distance, respectively. For the refine-z pixel case, the magnitude of the refine-z force vector may be based on an absolute difference between the synthesized depth value and the observed depth value, such that the refine-z force vector increases in proportion to the absolute difference. For the magnetic case, the force vector may depend on the proximity of the synthesized pixel to the bone segment, as well as the centroid of the corresponding limb.
At 142, the method 100 includes mapping each calculated force vector to one or more force-receiving locations of the model. The mapping may include mapping the calculated force vector to a "best match" force-receiving location. The selection of the best matching force-receiving location for the model depends on the pixel case selected for the corresponding pixel. The best matching force-receiving location may be, for example, the closest joint, vertex, or centroid. In some embodiments, a moment (i.e., a rotational force) may be applied to the model.
In some cases, a single pixel may be responsible for two or more different force vectors. As a non-limiting example, a certain pixel may be identified as a limb pixel occluding the torso after the snap operation, while the same pixel may subsequently be identified as a torso pixel after the grow operation (i.e., the limb has moved away from that pixel address). In this case, the pixel may be responsible for a push force on the limb, to push the limb aside, and a refine-z force on the torso, to move the torso to the appropriate depth. As another example, if a single pixel is located between two or more joints, two or more pixel forces may originate from that single pixel. For example, a mid-calf pixel may move both the ankle and the knee.
In general, at each pixel address, one or more pixel instances for the pixel address may be determined using a combination of the original player index, the snap player index, the grow player index, the original body-part index, the snap body-part index, the grow body-part index, the synthesized depth values, the snap depth values, the observed depth values, and/or other observed or synthesized data for the given pixel address.
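One possible way to combine a subset of these per-pixel quantities into a pixel case is sketched below; the precedence of the tests, the numeric thresholds, and the use of a zero player index to denote background are assumptions for illustration, not the disclosed implementation.

```python
def classify_pixel(orig_player, grow_player, snap_player,
                   grow_part, snap_part, synth_z, obs_z,
                   match_tolerance=0.01, refine_z_max=0.10,
                   magnetic_parts=frozenset()):
    # Combine per-pixel player indices, body-part indices, and depth values
    # into a single pixel case label.
    dz = synth_z - obs_z
    if orig_player == 0 and grow_player == 0:
        return "none"                           # background in both maps
    if grow_part in magnetic_parts:
        return "magnetic"                       # e.g. arm parts bp3-bp5, bp7-bp9
    if orig_player != grow_player:
        return "pull" if dz > 0 else "push"     # contour mismatch at the silhouette
    if snap_player == grow_player and snap_part != grow_part:
        return "self-occluding push/pull"       # one body part in front of another
    if abs(dz) <= match_tolerance:
        return "none"                           # already matching closely enough
    if abs(dz) <= refine_z_max:
        return "refine-z"                       # same region, depth slightly off
    return "pull" if dz > 0 else "push"         # severe depth mismatch
```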
FIG. 18 shows a table detailing an example relationship between the pixel cases described above and the joints of the skeletal model 82 of FIG. 5 to which force vectors may be mapped. In the table, pixel cases 1-7 are abbreviated as follows: 1 - pull (regular), 2 - pull (occlusion), 3 - push (regular), 4 - push (occlusion), 5 - refine-z, 6 - magnetic pull, and 7 - occlusion (no action). A "Yes" entry in the "Force applied?" column indicates that the joint of that row may receive force from a force vector. An "X" entry in a pixel case column indicates that the joint of that row may receive a force from a force vector corresponding to the pixel case of that column. It will be appreciated that this table is provided as an example and is not to be considered limiting. Other relationships between the model and the pixel cases may be established without departing from the scope of this disclosure.
In general, translation may be caused by forces having similar directions acting on the force-receiving locations of the model, while rotation may be caused by forces having different directions acting on the force-receiving locations of the model. For deformable objects, some components of the force vector may be used to deform the model within its deformation limits, and the remaining components of the force vector may be used to translate and/or rotate the model.
In some embodiments, the force vectors may be mapped to the best matching rigid or deformable object, sub-object, and/or polygon set of objects. Thus, some components of the force vector may be used to deform the model, while the remaining components of the force vector may be used to perform rigid translation of the model. This technique may result in a "disconnected" model (e.g., the arm may be severed from the body). As discussed in more detail below, a correction step may then be used to transform the translation into a rotation and/or apply constraints to tie the body parts together along a low energy path.
Further, in certain embodiments, 142 of the method 100 includes mapping more than one force vector. For example, a first synthesized pixel having a body-part index corresponding to an arm of the human target may have been classified with a first pixel case, while a second synthesized pixel having a body-part index corresponding to the torso of the human target may have been classified with a second pixel case. In this case, a first force vector for the first synthesized pixel may be calculated according to the first pixel case, and a second force vector for the second synthesized pixel may be calculated according to the second pixel case. Accordingly, the first force vector may be mapped to a first force-receiving location of the model, where the first force-receiving location corresponds to the arm of the human target. Likewise, the second force vector may be mapped to a second force-receiving location of the model, where the second force-receiving location corresponds to the torso of the human target.
FIGS. 11A and 11B show a very simplified example of applying force vectors to a model (in the illustrated example, a skeletal model 180). For simplicity, only two force vectors are shown. Each such force vector may be the result of summing two or more different force vectors resulting from the pixel case determinations and force vector calculations of two or more different pixels. Typically, a model will be adjusted by many different force vectors, each of which is the sum of many different force vectors determined from the pixel cases of, and calculated for, many different pixels.
FIG. 11A shows a skeletal model 180 in which a force vector 182 is applied to joint j18 (i.e., the elbow) and a force vector 184 is applied to joint j20 (i.e., the wrist), for the purpose of straightening one arm of the skeletal model 180 to more closely match the observed depth image. FIG. 11B shows the skeletal model 180 after the forces are applied and illustrates how the applied forces adjust the pose of the model. As shown in FIG. 11B, the lengths of the skeletal members are maintained. As further shown, the position of joint j2 remains on the shoulder of the skeletal model, as expected for a human straightening an arm. In other words, the skeletal model remains intact after the application of the forces. The integrity of the skeletal model is maintained when the forces are applied as a result of one or more applied constraints, as discussed in more detail below. A variety of different constraints can be enforced to maintain the integrity of different possible model types.
At 144, the method 100 of FIG. 7 optionally includes correcting the model to a pose that satisfies one or more constraints. As described above, after the calculated force vectors are collected and mapped to the force-receiving locations of the model, the calculated force vectors may then be applied to the model. If performed without constraints, this may "break" the model, stretch it disproportionately and/or move the body part to a configuration that is not valid for the target's actual body. Iterations of various functions can then be used to "relax" the new model position to a "nearby" legal configuration. During each iteration of the correction model, constraints may be applied to the poses gradually and/or stepwise in order to limit the set of poses to poses that may be physically expressed by one or more actual bodies of one or more targets. In other embodiments, this correction step may be done in a non-iterative manner.
In certain embodiments, the constraints may include one or more of the following: skeletal member length constraints, joint angle constraints, polygon corner constraints, and collision tests, as discussed below.
As an example applicable to a skeletal model, skeletal member (i.e., bone) length constraints may be applied. Force vectors that can be detected (i.e., force vectors at locations where joints and/or body parts are visible and unoccluded) may be propagated along a network of skeletal members of the skeletal model. By applying skeletal member length constraints, the propagated forces may "settle in" once all of the skeletal members are of acceptable length. In certain embodiments, one or more skeletal member lengths are allowed to vary within a predetermined range. For example, the length of the skeletal members making up the sides of the torso may vary to simulate a deformable upper abdomen. As another example, the length of the skeletal members making up the upper arm may vary to simulate a complex shoulder socket.
The skeletal model may additionally or alternatively be constrained by computing the length of each skeletal member based on the target, so that these lengths may be used as constraints during correction. For example, the desired bone lengths are known from the body model; the difference between a current bone length (i.e., the distance between the new joint positions) and the desired bone length may be evaluated. The model may be adjusted to reduce any error between the desired lengths and the current lengths. Priority may be given to certain joints and/or bones that are considered more important, as well as to joints or body parts that are currently more visible than others. High-magnitude changes may also be given priority over low-magnitude changes.
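A minimal sketch of one way to apply such a bone-length constraint to a pair of connected joints is shown below; the per-joint weights, which could reflect the priority given to more important or more visible joints as discussed above, are an assumption for illustration.

```python
import numpy as np

def enforce_bone_length(joint_a, joint_b, desired_length,
                        weight_a=0.5, weight_b=0.5):
    # Move the two joints along the bone axis so that the bone returns to its
    # desired length. A joint given a smaller weight moves less, so a
    # high-confidence joint can be given a smaller weight.
    a, b = np.asarray(joint_a, float), np.asarray(joint_b, float)
    axis = b - a
    current = np.linalg.norm(axis)
    if current < 1e-6:
        return a, b                    # degenerate bone; leave unchanged
    error = current - desired_length
    correction = axis / current * error
    total = weight_a + weight_b
    return a + correction * (weight_a / total), b - correction * (weight_b / total)
```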
By way of example, FIG. 19 illustrates applying one or more constraints to a model representing an object. It will be appreciated that although fig. 19 provides a visual illustration, in practice the application of constraints may be analytical and may include, for example, modifying pixel data such as that shown in fig. 10. For the example depicted in FIG. 19, applying a force vector to the model 260 may result in a "disconnected" model. For example, the target may reposition itself to lift the arm overhead. In tracking the motion, a force vector may be applied to one or more force-receiving locations of the arm to mimic the motion of the target. However, doing so may result in "breaking" the arm, as depicted at 262, and/or changing the scale of the arm, as shown at 264. Since the model represents a human target in this example, these two scenarios are physically impossible for the human arm. Accordingly, constraints may be imposed to ensure that adjustments to the model are physically appropriate. For example, a constraint may be applied (as shown at 266) to ensure that the forearm and upper arm remain attached at the elbow. As another example, a bone length constraint may be applied to the forearm (as shown at 268) to ensure that the forearm remains approximately the same length. After these constraints are applied, the model maintains its physical integrity, as shown at 270.
Joint visibility and/or confidence may be tracked separately in the X, Y, and Z dimensions to allow more accurate application of bone length constraints. For example, if a bone connects the chest to the left shoulder, and the Z position of the chest joint is high-confidence (i.e., many refine-z pixels correspond to that joint) and the Y position of the shoulder is high-confidence (many push/pull pixels correspond to that joint), then any error in bone length can be corrected while partially or completely limiting shoulder movement in the Y direction or chest movement in the Z direction.
In some embodiments, the joint positions before correction may be compared to the joint positions after correction. If it is determined that a consistent set of adjustments is being made to the skeletal model in every frame, the method 100 may use this information to perform a "progressive refinement" of the skeletal and/or body model. For example, by comparing the joint positions before and after correction, it may be determined that the shoulders are being pushed wider apart during correction in every frame. Such a consistent adjustment suggests that the shoulders of the skeletal model are narrower than those of the target being represented, and that the shoulder width is therefore being adjusted in every frame during correction to compensate. In such a case, a progressive refinement, such as increasing the shoulder width of the skeletal model, may be made to correct the skeletal and/or body model so that it better matches the target.
With respect to joint angle constraints, the range of motion of certain limbs and body parts may be limited relative to an adjacent body part. Additionally, this range of motion may change based on the orientation of the adjacent body part. Thus, applying joint angle constraints may allow limb segments to be constrained to possible configurations given the orientation of the parent limb and/or body part. For example, the lower leg may be configured to bend backward (at the knee), but not forward. If illegal angles are detected, the offending body parts and/or their parents (or, in the case of a mesh model, the offending triangles and their neighbors) are adjusted to keep the pose within a range of predetermined possibilities, thereby helping to avoid the case where the model collapses into a pose that is deemed unacceptable. FIG. 20 shows an example of a model 280 in which one or more joint angle constraints have been applied to correct an illegal joint angle, shown at 282, to be within an acceptable range of motion, as shown at 284. In certain cases of extreme angle violations, the pose may be recognized as backwards, i.e., what is being tracked as the chest is actually the player's back, the left hand is actually the right hand, and so on. When such an impossible angle is clearly visible (and sufficiently egregious), this can be interpreted to mean that the pose has been mapped backwards onto the player's body, and the pose can be flipped to correctly model the target.
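A minimal sketch of an in-plane joint angle clamp (e.g., for a knee) follows. A full implementation would also constrain the hinge axis and handle orientation-dependent limits; the angle range and the rotation of the child joint within the plane of the three joints are illustrative choices.

```python
import numpy as np

def clamp_joint_angle(parent, joint, child, min_deg, max_deg):
    # Clamp the angle at 'joint' formed by the parent and child bones to the
    # allowed range, rotating the child joint within the plane of the three
    # joints. Returns the (possibly corrected) child position.
    parent, joint, child = (np.asarray(p, float) for p in (parent, joint, child))
    u, v = parent - joint, child - joint
    u_n, v_n = u / np.linalg.norm(u), v / np.linalg.norm(v)
    angle = np.degrees(np.arccos(np.clip(np.dot(u_n, v_n), -1.0, 1.0)))
    clamped = float(np.clip(angle, min_deg, max_deg))
    if abs(clamped - angle) < 1e-6:
        return child                               # already within the legal range
    axis = np.cross(u_n, v_n)
    if np.linalg.norm(axis) < 1e-6:
        return child                               # collinear: rotation plane is ambiguous
    axis /= np.linalg.norm(axis)
    theta = np.radians(clamped - angle)
    # Rodrigues' rotation of v about 'axis' by theta
    v_rot = (v * np.cos(theta) + np.cross(axis, v) * np.sin(theta)
             + axis * np.dot(axis, v) * (1 - np.cos(theta)))
    return joint + v_rot
```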
Collision tests may be applied to prevent the model from penetrating itself. For example, a collision test may prevent any part of the forearms/hands from penetrating the torso, or prevent the forearms/hands from penetrating each other. In other examples, a collision test may prevent one leg from penetrating the other leg. In some embodiments, collision tests may be applied to the models of two or more players to prevent similar scenarios from occurring between models. It is to be understood that this may be accomplished with many different representations of the model geometry. For example, polygonal shells may be used for the core body, while parametric capsules (cylinders that may have different radii at opposite ends) are used for the limb segments. In some embodiments, collision tests may be applied to the body model and/or the skeletal model. In some embodiments, collision tests may be applied to certain polygons of a mesh model. By way of example, FIG. 21 shows a model 290 in which the hands and forearms of the model 290 have penetrated the torso, as shown at 292. After the collision tests are applied, such penetration can be detected and corrected, as shown at 294.
The collision test may be applied in any suitable manner. One method checks for a collision of one "body line segment" (volumetric line segment) with another, where the body line segment may be a line segment with a radius that extends outward in 3-D. An example of such a collision test may be to examine one forearm versus the other. In some embodiments, the body line segment may have a different radius at each end of the segment.
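A minimal sketch of such a segment-versus-segment check, treating each body line segment as a capsule with a single radius, is shown below; handling a different radius at each end of a segment, as mentioned above, would require a slightly more elaborate distance test.

```python
import numpy as np

def capsules_collide(a0, a1, radius_a, b0, b1, radius_b):
    # Two body line segments collide when the closest distance between the
    # finite segments is less than the sum of their radii. Assumes both
    # segments are non-degenerate (distinct endpoints).
    a0, a1, b0, b1 = (np.asarray(p, float) for p in (a0, a1, b0, b1))
    d1, d2, r = a1 - a0, b1 - b0, a0 - b0
    a, e, f = np.dot(d1, d1), np.dot(d2, d2), np.dot(d2, r)
    b, c = np.dot(d1, d2), np.dot(d1, r)
    denom = a * e - b * b
    s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > 1e-9 else 0.0
    t = (b * s + f) / e
    if t < 0.0:
        t, s = 0.0, np.clip(-c / a, 0.0, 1.0)
    elif t > 1.0:
        t, s = 1.0, np.clip((b - c) / a, 0.0, 1.0)
    closest_a, closest_b = a0 + s * d1, b0 + t * d2
    return float(np.linalg.norm(closest_a - closest_b)) < radius_a + radius_b
```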
Another approach examines the collision of a body segment with a posed polygon object. An example of such a collision test may be to examine a forearm versus torso. In some embodiments, the posed polygonal object may be a deformed polygonal object.
In some embodiments, the knowledge that different portions of the target can move only a limited distance within a time frame (e.g., 1/30 or 1/60 of a second) may be used as a constraint. Such a constraint may be used to rule out certain poses that would otherwise result from applying forces to the force-receiving locations of the model.
After the model has been adjusted and optionally constrained, the process may loop back to begin a new rasterization of the model into a new synthesized depth image, which may then be compared to the observed depth image so that further adjustments can be made to the model, as shown at 145. In this way, the model may be progressively adjusted to more closely represent the modeled target. Virtually any number of iterations may be completed each frame. More iterations may yield more accurate results, but more iterations also require more computational overhead. In many cases, two or three iterations per frame are considered suitable, although one iteration may be sufficient in some embodiments.
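At a high level, the per-frame flow described in this document can be summarized by the following sketch, in which each processing stage is passed in as a callable; every helper name is a placeholder for the corresponding stage discussed above rather than an actual API.

```python
def track_frame(model, observed, rasterize, classify, compute_forces,
                apply_forces, apply_constraints, iterations=3):
    # One frame of tracking: rasterize the model into a synthesized depth
    # image, classify pixels into pixel cases, turn the cases into force
    # vectors, apply them to the model, and then enforce the constraints.
    for _ in range(iterations):                       # two or three usually suffice
        synthesized = rasterize(model)
        cases = classify(synthesized, observed)
        forces = compute_forces(cases, synthesized, observed)
        apply_forces(model, forces)                   # map to force-receiving locations
        apply_constraints(model)                      # bone lengths, angles, collisions
    return model
```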
At 146, the method 100 of FIG. 7 optionally includes changing the visual appearance of an on-screen character (e.g., the player avatar 190 of FIG. 12A) in response to a change to the model, such as the change shown in FIG. 11B. For example, a game console may track users playing electronic games on a game console (e.g., gaming system 12 of fig. 1A and 1B) as described herein. In particular, a body model (e.g., body model 70 of FIG. 3) including a skeletal model (e.g., skeletal model 180 of FIG. 11A) may be used to model the target game player, and an on-screen player avatar may be rendered using the body model. As the game player straightens one arm, the game console may track the motion and then adjust the model 180 as shown in FIG. 11B in response to the tracked motion. The game console may also apply one or more constraints as described above. After making such adjustments and applying such constraints, the game console may display the adjusted player avatar 192, as shown in FIG. 12B. This is also shown as an example in FIG. 1A, where player avatar 24 is shown punching boxing opponent 22 in response to game player 18 throwing a punch in real space.
As described above, visual target recognition may be performed for purposes other than changing the visual appearance of an on-screen character or avatar. Thus, the visual appearance of an on-screen character or avatar need not be changed in all embodiments. As discussed above, target tracking can be used for virtually unlimited different purposes, many of which do not result in changes to the on-screen character. The pose of the target tracking and/or the adjusted model may be used as a parameter to affect virtually any element of an application such as a game.
The above process may be repeated for subsequent frames, as shown at 147.
It will be appreciated that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, or in some cases omitted. Also, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims (10)

1. A method for tracking a human target, the method comprising:
receiving an observed depth image of the human target from a source, the observed depth image comprising a plurality of observed pixels;
obtaining a synthesized depth image representing a model of the human target, the synthesized depth image comprising a plurality of synthesized pixels;
classifying a synthesized pixel of the plurality of synthesized pixels with a pixel case by comparing the synthesized depth image to the observed depth image;
calculating a force vector for the synthesized pixel based on the pixel case of the synthesized pixel; and
mapping the force vector to one or more force-receiving locations of the model representing the human target to adjust the model representing the human target to an adjusted pose.
2. The method of claim 1, wherein the synthesized depth image is obtained at least in part by rasterizing the model representing the human target.
3. The method of claim 1, wherein the synthesized pixel is classified with the pixel case based at least on a body-part index of the synthesized pixel.
4. The method of claim 3, wherein the synthesized pixel is classified with the pixel case based at least in part on a difference between an original body-part index of the synthesized pixel and a snap body-part index of the synthesized pixel or a grow body-part index of the synthesized pixel.
5. The method of claim 1, wherein the synthesized pixel is classified with the pixel case based at least on a player index of the synthesized pixel.
6. The method of claim 5, wherein the synthesized pixel is classified with the pixel case based at least in part on a difference between an original player index of the synthesized pixel and a snap player index of the synthesized pixel or a grow player index of the synthesized pixel.
7. The method of claim 1, wherein the synthesized pixel is classified with the pixel case based at least on a difference in depth value between the synthesized pixel and the corresponding observed pixel.
8. The method of claim 1, wherein the synthesized pixel has a pixel address, and wherein the synthesized pixel at the pixel address is classified with the pixel case based on one or more of an original player index, a snap player index, a grow player index, an original body-part index, a snap body-part index, a grow body-part index, an original synthesized depth value, a snap depth value, and an observed depth value for the pixel address.
9. The method of claim 1, wherein the pixel case is selected from a defined set of pixel cases, the defined pixel cases including a refine-z pixel case, a magnetic pixel case, a push pixel case, a pull pixel case, a self-occluding push pixel case, and a self-occluding pull pixel case.
10. The method of claim 1, wherein the force vector mapped to the force-receiving location is one of a plurality of force vectors mapped to the force-receiving location, each such force vector corresponding to a synthesized pixel and a classified pixel case, and each such force vector acting to adjust the model representing the human target to the adjusted pose.