HK1184266B - Computing pose and/or shape of modifiable entities
Description
Technical Field
The present invention relates to a computer-implemented method and apparatus for computing the pose and/or shape of an articulated or modifiable entity.
Background
Modifiable entities, such as entities having joints or entities that can be partially or wholly deformed, can exist in many different configurations and shapes. For example, articulated entities such as humans, animals, articulated robots, plants, or articulated or deformable parts of such entities (including human organs) can exist in different shapes and in different postures. For example, a human hand may be stretched, curled into a fist, or held in many other different positions. Human hands may also exist in different shapes depending on individual differences between people.
Existing image processing methods for interpreting changeable entities, such as humans, animals or parts thereof, have involved estimating the position or angle between joints given an image or set of images of the entity. Such an estimate from the image sequence may be used to calculate or track a skeletal model of the human body. There is a continuing need to improve the accuracy and speed of computations, especially because many practical applications require real-time processing, such as robotics, computer games, medical image processing, telepresence, health care, sports training, and the like.
The embodiments described below are not limited to implementations for addressing any or all of the shortcomings of known systems for interpreting observation data describing one or more changeable entities.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding to the reader. This summary is not an extensive overview of the disclosure, and it does not identify key/critical elements or delineate the scope of this specification. Its sole purpose is to present a selection of the concepts disclosed herein in a simplified form as a prelude to the more detailed description that is presented later.
Computing the pose and/or shape of a modifiable entity is described. In various embodiments, a model of an entity (e.g., a human hand, a golf player holding a golf club, an animal, a body organ) is fitted to an image that depicts an example of the entity in a particular pose and shape. In an example, an optimization process results in values for pose and/or shape parameters that, when applied to the model, explain the image data well. In an example, the optimization process is influenced by correspondences between image elements and model points obtained from a regression engine, which may be a random decision forest. For example, a random decision forest may take elements of an image and compute candidate correspondences between those image elements and model points. In examples, the models, poses, and correspondences may be used for control of various applications, including computer games, medical systems, augmented reality, robotics, and so forth.
Many of the attendant features will be more readily appreciated as they become better understood by reference to the following detailed description considered in connection with the accompanying drawings.
Drawings
The specification will be better understood from the following detailed description read with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of observed data describing changeable entities and corresponding models of entities;
FIG. 2 is a schematic diagram of an apparatus for pose and/or shape calculation and model fitting;
FIG. 3 is a schematic diagram of a system for training a regression engine;
FIG. 4 is a schematic illustration of an image of an articulated solid object;
FIG. 5 is a schematic diagram of a tree-based regression engine;
FIG. 6 is a flow chart of a method performed by the model fitting engine;
FIG. 7 is a flow diagram of a method of training a tree-based regression engine;
FIG. 8 is a flow diagram of a method of using a trained tree-based regression engine to obtain preliminary correspondences between observation data and model points;
FIG. 9 is a schematic diagram of a camera-based control system for controlling a computer game;
FIG. 10 is a schematic diagram of an image capture device;
FIG. 11 illustrates an exemplary computing-based apparatus in which embodiments of a system for computing pose and/or shape and fitting a model to observed data of a modifiable entity may be implemented.
In the drawings, like reference numerals are used to denote like parts.
Detailed Description
The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present example may be constructed or utilized. The description sets forth the functions of the example and the sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different examples.
Although the present examples are described and illustrated herein as being implemented in a computer gaming system, the described system is provided as an example and not a limitation. As will be appreciated by those skilled in the art, the present examples are suitable for application in a variety of different types of systems for computing pose and/or shape and fitting models to observed data of a modifiable entity. A non-exhaustive list of examples is: a medical imaging system; a robotic system; a satellite imaging system; an augmented reality system; telepresence; health care; physical training, etc.
The modifiable entity may be an articulated entity, or a deformable entity, or a combination thereof. An articulated entity is any entity having joints that connect two or more limbs, parts, or other components together so that they can move relative to one another. An example of a deformable entity is a body organ. The entities may be of any type and may include different types of entities that are connected together, such as a person holding a support (e.g., a golfer holding a golf club).
For many application domains, it is useful to represent a modifiable entity by a canonical model (a model representing a class of objects) that is invariant to the pose and/or shape of the entity; that is, for example, one canonical model of cats may be used to represent instances of cats in different poses and/or shapes, such as a sleeping cat, a cat with a long tail, or a cat with a short tail. The following examples describe how, given observed data of a modifiable entity, that data can be accurately fitted to a canonical model in real time. In general, sensing and control applications can use the output of the examples described herein to improve performance.
FIG. 1 is a schematic diagram of observed data describing a modifiable entity and a corresponding model of the entity. A canonical parameterized model 100 of a class of modifiable entities is stored in memory. The observation data describes an entity that is a member of that class. In some examples, both the shape and the pose are parameterized directly in the model. That is, in some examples, a plurality of parameters of the model specify how the model can potentially be articulated, and at least one parameter of the model specifies a shape of the model. In other examples, only the shape or only the pose is parameterized directly in the model. For example, the model may include pose parameters but not shape parameters. In this case, multiple models of the entity may be used, each model having a different shape (e.g., several common body shapes). The model fitting process described herein may be performed on each model to find a good fit. Conversely, where the models include shape parameters but not pose parameters, multiple models of the entity may be used, each model having a different pose but the same shape.
The model may be a two-dimensional model, a three-dimensional model, or a higher-dimensional model, and the model may include a manifold comprising a set of points located on the surface of the manifold. For example, in the case of a human body, the manifold may be the surface of the human body. A manifold may be represented as a mesh of triangles or other two-dimensional shapes tessellated over a surface. In other examples, the model is a volumetric model comprising 3D or higher-dimensional points, wherein the points form the volume of the represented articulated entity. For example, in the case of a human organ, the model may be a volumetric model. A human organ is deformable in the sense that the volume of the organ deforms according to differences between subjects, or over time as the organ undergoes aging, surgery, disease, or other changes.
Example observation data 104 for a modifiable entity may include image data, which may be two-dimensional, three-dimensional, or higher-dimensional. A non-exhaustive list of examples of image data is: medical images, depth images, color images, satellite images, or other types of images. The observation data 104 may include a single image or a sequence of images. In some examples, the observation data 104 includes stereo images from a stereo camera or from multiple cameras at different viewpoints. In some examples, the observation data 104 includes a contour image. A contour image is a two-dimensional binary image identifying the foreground and background regions of a depth image and/or color (RGB) image captured by an imaging sensor. In some examples, the contour image may be considered a depth image flattened to a fixed depth. The image need not be in the form of a regular grid. For example, a laser range scanner or rolling shutter camera may capture image data that is returned one line scan at a time. In an example, the observation data 108 includes one or more images of at least one person or portion of a person. The model includes a mesh model 106 of a person with arms extended horizontally and legs in an upright standing position.
By fitting the observation 104 to the model 100, the pose and/or shape of the model is obtained that provides a good match to the observation. The term "pose" is used to represent values of parameters of a model that specify how articulated portions of the model are oriented relative to one another to describe observed data; and the term "pose" is also used to refer to the value of one or more parameters that specify the overall orientation (translation, rotation, and scaling) of the model such that it corresponds to the observed data. The term "shape" is used to denote the values of parameters of a model that are used to specify the form and configuration of the model. For example, in the case of a model of a human body, the shape parameters may specify the body form of the human (e.g., male/female/child, tall/short).
In examples disclosed in this document, correspondences between observation data points and model points are also obtained as part of the model fitting process. Model points are locations in the model. For example, the model points may be coordinates of points, patches, or regions on a manifold such as a human body surface. In another example, the model points may be coordinates of a voxel (three-dimensional pixel) or region in the volumetric model. The model points may be defined according to a manifold or coordinate system of the model, or using an embedding in real-world space. The embedding may take into account Euclidean distances or geodesic distances. The translation or mapping between the coordinate systems may be performed in any suitable manner. Since the model is a continuous representation of the observed articulated entity, and the correspondences provide insight as to how the observed data relates to the model, the model fitting output (the values of the pose and/or shape parameters together with the correspondences) is a very powerful result. Computer gaming applications, medical applications, augmented reality applications, robotic applications, human-computer interaction and other applications may acquire the model, pose and correspondence information and use it for control of other processes. In general, sensing and control applications can use the output of the examples described herein.
In some embodiments, human body poses and/or shapes are detected. For example, the model is a mesh model 106 of a human body in a specified pose, such as standing with arms and legs extended horizontally. In some embodiments, the mesh model is a triangular mesh model. For example, the observation data 108 may be an image of a person standing with their arms extended upward. The system described herein derives the pose and/or shape of the model, where the pose and/or shape of the model provides a good match to the observed data, together with a correspondence between data points and points on the model.
In an example, a mesh model of a human body has a structure comprising a plurality of limbs, for example: sternum, abdomen, pelvis, left upper arm, left lower arm, left hand, right upper arm, right lower arm, right hand, left thigh, left calf, left foot, right thigh, right calf, right foot, neck, head. Other numbers and selections of limbs in the model may be used. Each limb is considered to have its own local coordinate system, which is related to the world coordinate system via a transformation that can be represented as a matrix. The transformations for the various limbs may be defined hierarchically according to the arrangement of the limbs in the model. For example, a hand transformation may depend on a lower arm transformation, which in turn depends on an upper arm transformation, and so on.
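The hierarchical composition of limb transformations described above can be sketched in a few lines of Python. This is a minimal illustration only: the three-limb arm chain, the limb names, the joint offsets, and the restriction to rotations about one axis are all hypothetical choices not specified by the description.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot_z(theta, tx=0.0, ty=0.0, tz=0.0):
    """Local limb transform: a rotation about z plus a translation to the joint."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, tx],
            [s,  c, 0, ty],
            [0,  0, 1, tz],
            [0,  0, 0, 1]]

# Hypothetical limb hierarchy: each limb names its parent and its local transform.
hierarchy = {
    "upper_arm": ("root", rot_z(math.pi / 2, tx=0.2)),
    "lower_arm": ("upper_arm", rot_z(-math.pi / 4, tx=0.3)),
    "hand": ("lower_arm", rot_z(0.0, tx=0.25)),
}

def world_transform(limb):
    """Compose local transforms up the hierarchy to reach world coordinates."""
    if limb == "root":
        return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    parent, local = hierarchy[limb]
    return mat_mul(world_transform(parent), local)

# The hand's world transform depends on the lower arm's, which depends on
# the upper arm's, exactly as in the hierarchical arrangement described.
hand_T = world_transform("hand")
```

Changing a single joint angle higher in the chain (e.g. the upper arm) automatically moves every limb below it, which is the point of the hierarchical arrangement.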
In some examples, the observation data includes at least one medical image 112, and the model includes a high-dimensional canonical model 110 of the body organ. In this case, the model may be the above-mentioned volumetric model, and the parameters of the model represent the global rotation, translation and scaling of the model, and how the components/regions of the volume are deformed. For example, a tetrahedral mesh may be used.
In some examples, the observation data includes a color image 116 of the cat, and the model includes a 3D mesh model of the cat 114. The model in this case may have a similar type of structure to that of a human body, but with a different number of limbs and a different hierarchical arrangement of these limbs.
In other examples, a model may have one or more sub-entities. For example, the model may be a person holding a support such as a golf club or a hand holding a dart. The sub-entities may be considered as static parts, or hinged parts, or deformable parts of the entire model.
FIG. 2 is a schematic diagram of an apparatus for pose and/or shape calculation and model fitting. The parameterized model 210 of the articulated or deformable entity is stored at a location accessible to the regression engine 200 and the model fitting engine 202. The model may be of any suitable type as described above with reference to fig. 1. The observation data 208 is also stored at a location accessible to the regression engine 200 and the model fitting engine 202. Observation data 208 may be an image or sequence of images of a changeable entity to be fitted to model 210, or may be any other form of observation data 208 described above with reference to FIG. 1. The regression engine 200 is trained using training data 212, which will be described in more detail below.
Once trained, the regression engine 200 provides functionality for identifying or predicting candidate correspondences 214 between observed data points and model points. The candidate correspondences 214 may also be referred to as preliminary correspondences between data points and model points. The regression engine can also provide a measure of certainty for each of the preliminary correspondences. In some examples, the regression engine is implemented using a tree-based classifier or regressor, such as a random decision forest. A random decision forest includes one or more decision trees, each decision tree having a root node, a plurality of split nodes, and a plurality of leaf nodes. Observations such as image elements of an image may be processed through a tree of a random decision forest from the root node to a leaf node, with a decision made at each of the split nodes. A probability distribution over candidate model points that may correspond to the observed image element is associated with the leaf node that the image element arrives at. In other examples, the regression engine is implemented using nearest neighbor matching, linear regression, Gaussian processes, support vector regression, or relevance vector machines.
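The root-to-leaf routing just described can be sketched as follows. This is a hypothetical minimal implementation, not the patented one: the node classes, the depth-difference test, and the stored (model point, confidence) pairs are invented for illustration.

```python
class Split:
    """A split node: compares the probe image element with an offset element."""
    def __init__(self, offset, threshold, left, right):
        self.offset, self.threshold = offset, threshold
        self.left, self.right = left, right

class Leaf:
    """A leaf node: stores candidate (model_point, confidence) correspondences."""
    def __init__(self, candidates):
        self.candidates = candidates

def evaluate(node, image, x, y):
    """Route image element (x, y) down the tree to a leaf, deciding at each split."""
    while isinstance(node, Split):
        dx, dy = node.offset
        # Clamp the offset probe to the image bounds.
        ox = min(max(x + dx, 0), len(image[0]) - 1)
        oy = min(max(y + dy, 0), len(image) - 1)
        # Compare the element with the spatially offset element, as described.
        if image[y][x] - image[oy][ox] > node.threshold:
            node = node.left
        else:
            node = node.right
    return node.candidates

# Toy tree and 2x2 depth image; the numbers are arbitrary.
tree = Split(offset=(1, 0), threshold=0.5,
             left=Leaf([((0.1, 0.2, 0.3), 0.9)]),
             right=Leaf([((0.7, 0.1, 0.0), 0.6)]))
depth = [[2.0, 1.0], [1.0, 1.0]]
cands = evaluate(tree, depth, 0, 0)
```

Each leaf returns candidate model points together with confidence weights, which is the form of output the model fitting engine consumes.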
The model fitting engine 202 may access the candidate correspondences 214 and certainty information from the regression engine 200. Model fitting engine 202 also accesses observation data 208 and model 210. Model fitting engine 202 is computer-implemented and accesses or stores objective or energy functions 203 regarding the consistency of model 210 with observed data 208. The model fitting engine is arranged to optimize the energy function to fit the model to the observation data; that is, values of the pose and/or shape parameters 206 are derived that enable the model to describe the observation data well, as well as good correspondence between observation data points and model points. This can be a very complex process because the number of possible fits between the model and the observed data to be searched is very large. The model fitting engine includes an optimizer 204, the optimizer 204 configured to optimize the energy function in an efficient manner in real time. The model fitting engine 202 uses the candidate correspondences 214 and the certainty information to influence the optimization process, thereby facilitating its speed and accuracy.
The energy function 203 is a function of pose parameters and/or shape parameters and correspondences (a correspondence being the model point fitted to a data point). Global optimization of the energy function provides values for the pose parameters and/or shape parameters and the correspondences that account for the observation data 208 given the model 210. In an example, the energy function sums, over the image elements, a robust distance measure between each image element and its corresponding model point; the energy function may be made robust, for example, by applying a robust penalty that reduces the influence of outliers. In the case where the image is a depth image, the distance measure may be a 3D Euclidean distance measure. In the case where the image is a contour image, the distance measure may be a 2D re-projection distance measure or a point-to-ray distance.
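A robustified sum-of-distances energy of the kind described can be sketched as below. The Geman-McClure penalty used here is one possible robust kernel chosen for illustration; the description does not fix a specific kernel, and all function names are hypothetical.

```python
def geman_mcclure(r, scale=1.0):
    """One possible robust penalty: saturates for large residuals,
    so outliers contribute a bounded amount to the energy."""
    r2 = r * r
    return r2 / (r2 + scale * scale)

def energy(data_points, model_points, correspondences, scale=1.0):
    """Sum a robust 3D Euclidean distance over image elements.

    correspondences[i] is the index of the model point assigned to
    data point i (the correspondence variables of the energy)."""
    total = 0.0
    for i, p in enumerate(data_points):
        q = model_points[correspondences[i]]
        dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        total += geman_mcclure(dist, scale)
    return total

data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
e_good = energy(data, model, [0, 1])  # correct correspondences
e_bad = energy(data, model, [1, 0])   # swapped correspondences
```

Correct correspondences yield zero energy for perfectly aligned points, while wrong assignments raise the energy; this is the quantity the optimizer drives down.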
The energy function may include a plurality of terms. In an example, at least one term (sometimes referred to as a "work" term) that depends on both the pose parameters (or shape parameters) and the correspondences is included. The energy function optionally includes one or more other terms that depend at least on the pose parameters (or shape parameters) or at least on the correspondences. Careful choice of the energy function terms and careful design of the weights used to combine them enable the model fitting engine 202 to properly handle situations where local minima occur in the energy function that might otherwise lead the optimizer to a poor solution. One example is where a portion of the entity is occluded in the observation data 208. Another example is where some poses of a body are excluded by constraints on the articulation of the body (e.g., limits on the elbow joint angle of a human body or other motion constraints). Yet another example is where observation data points are associated with model points that may not be visible given the pose parameters derived by the optimizer.
The optimizer 204 is also computer-implemented. In some examples, the optimizer 204 includes functionality for finding the minimum of a non-linear function by using an iterative process that takes curvature information into account, such as Newton's method. In other examples, the optimizer includes functionality that approximates Newton's method, such as the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The BFGS optimization method does not directly calculate curvature information, but uses an approximation of the curvature information obtained from gradient estimates. The gradient estimate may be approximate, such as a finite difference approximation, or, for some examples, the derivative of the energy function is calculated to machine accuracy. Any type of optimizer 204 able to obtain the minimum of a non-linear function may be used. For example, one type of optimizer implements a gradient descent scheme, or a particle swarm optimizer. The optimizer 204 may also include functionality for integer optimization. For example, graph cuts with alpha expansion, loopy belief propagation, tree-reweighted message passing, or other integer optimization processes are used.
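To make the continuous-optimization step concrete, here is a sketch using the finite-difference gradient approximation mentioned above, driving a plain gradient-descent scheme (one of the optimizer types listed; a stand-in for the quasi-Newton methods, which additionally maintain a curvature approximation). The toy two-parameter energy is hypothetical.

```python
def finite_diff_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def descend(f, x0, step=0.1, iters=200):
    """Gradient descent on f; a simple stand-in for Newton/BFGS, which
    would additionally use (approximate) curvature information."""
    x = list(x0)
    for _ in range(iters):
        g = finite_diff_grad(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy "pose energy" with its minimum at pose parameters (1, 2).
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
pose = descend(f, [0.0, 0.0])
```

Quasi-Newton methods such as BFGS typically reach the same minimum in far fewer iterations by exploiting the curvature estimate, which matters for the real-time requirement.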
The optimizer 204 may be arranged to perform the optimization in an iterative process by first fixing the correspondence using the output from the regression engine and optimizing the pose and/or shape parameters while the correspondence is fixed. The pose and/or shape parameters may then be fixed and the search for good correspondence between the model and the observed data points continued using another optimization process that may be guided by output from the regression engine. The iterative process is described in more detail below with reference to fig. 6.
In an example, the regression engine 200 includes a trained random decision forest 304 having probability distributions over candidate model points associated with its leaves. More detail regarding this is now given with reference to figs. 3-5. Observation data, such as image elements of an image, may be input to the trained random decision forest. An image element is processed through each tree until it reaches a leaf node. The probability information associated with the destination leaf node of each tree is combined and used to give a number of possible candidate correspondences between observed data points (image elements) and model points, as well as information about the certainty of those candidate correspondences. As described above, model points may be defined using coordinates on a model manifold, or using model coordinates embedded in space. The probability information associated with leaf nodes may describe model points using either of these coordinate systems.
As described above, an image element may be processed through a tree of a random decision forest from the root node to a leaf node, with a decision made at each split node. In an example, the decision is made according to a characteristic of the image element and a characteristic of another image element of the same image that is displaced from it by a spatial offset specified by parameters at the split node. At the split node, the image element proceeds to the next level of the tree along the branch selected according to the outcome of the decision. The random decision forest may use regression or classification, as described in more detail below. During training, parameter values (also referred to as features) are learned for use at the split nodes, and data is accumulated at the leaf nodes. For example, candidate correspondences are accumulated and stored at the leaf nodes. Since there may be a large number of candidate correspondences, a strategy for reducing storage may be used. Candidate correspondences may optionally be filtered to remove outliers using a threshold that may be learned using a validation data set. The accumulated correspondences may be stored as raw data, or samples of the accumulated correspondences may be stored. A histogram of the accumulated correspondences may be stored, or the correspondences may be aggregated by taking a mean, median, mode, or other form of aggregation. In some examples, a multi-modal distribution is fitted to the accumulated correspondences. This gives a good fit in application domains involving modifiable entities, where the data is found to be multi-modal. In an example, correspondences are clustered using a mean shift mode detection process, and a weight is stored for each cluster or mode according to the number of correspondences that reached that mode. It is not necessary to use the mean shift mode detection process; other clustering processes may be used.
Mean shift mode detection is an algorithm that efficiently detects the modes (peaks) in a distribution defined by a kernel density estimator. Applied to the accumulated correspondences, the kernel density estimator is a non-parametric process for estimating the probability density function.
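A one-dimensional sketch of mean shift mode detection is given below: each sample is iteratively shifted to the kernel-weighted mean of its neighbourhood until it settles on a mode, and nearby settled points are merged into one mode. The Gaussian kernel, bandwidth, and toy samples are illustrative assumptions; real correspondences would be clustered in the model's coordinate space.

```python
import math

def mean_shift_modes(points, bandwidth=1.0, iters=50, merge_tol=1e-2):
    """Detect modes of a 1D kernel density estimate by mean shift."""
    def shift(x):
        # Repeatedly move x to the Gaussian-weighted mean of all points.
        for _ in range(iters):
            w = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2)) for p in points]
            x = sum(wi * p for wi, p in zip(w, points)) / sum(w)
        return x
    modes = []
    for p in points:
        m = shift(p)
        # Merge points that converged to (approximately) the same mode.
        if not any(abs(m - q) < merge_tol for q in modes):
            modes.append(m)
    return sorted(modes)

# Two clusters of candidate correspondences (toy 1D coordinates).
samples = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
modes = mean_shift_modes(samples, bandwidth=0.5)
```

The number of samples that reach each mode can be kept as the per-mode weight mentioned above, giving a compact multi-modal summary of the leaf's correspondences.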
To train a random decision forest, training data pairs 300 are used, wherein each pair comprises an image of an articulated solid body (e.g., schematically illustrated at 400 in fig. 4) and a corresponding image of the solid body, wherein the pose and/or shape is known and each image element has a known assigned model point. The training data may be computer-generated from a model and may include many examples with different pose and/or shape parameter values.
The training data pairs 300 may be used to form the structure of the trees in the random decision forest and to learn the parameters of the decisions or tests to be applied at the split nodes. However, this is not essential. A random decision forest that has already been trained on a classification or regression task can be reused by storing new probability distributions over candidate correspondences at the leaf nodes. This reduces training time and cost, and can also potentially provide improved accuracy. Thus, FIG. 3 shows the training data pairs 300 applied to a random decision forest 302 with or without a previously trained structure.
In an example, a random decision forest that has been trained to classify image elements as candidate body parts may be used. The existing data stored in association with the leaf nodes is discarded and replaced with new data. The new data is obtained by passing the training data through the random decision forest and aggregating the candidate correspondences obtained at the leaf nodes, or storing them in any other suitable manner. The result is a trained random decision forest 304 having probability distributions over candidate model points at its leaves. Certainty information for the candidate model points can also be obtained from the probability distributions.
In other examples, the structure of the random decision forest is formed as part of the training process, and this is described in more detail below with reference to fig. 7.
FIG. 5 is a schematic diagram of observation data 500 applied to a trained random decision forest to obtain candidate correspondences. An image element of an image of the modifiable entity is processed starting at the root node, where a test, represented by T in the figure, is performed. For example, the test compares the image element with another image element of the image (specified by parameters learned during training, or parameters reused from another trained random decision forest), and depending on the result, the image element proceeds to the left child node or the right child node. The process is repeated, as indicated by the arrows in FIG. 5, so that the image element passes down the tree to a leaf node. The leaf node reached has an associated probability distribution over candidate correspondences, as represented by 504 in FIG. 5. The candidate correspondences are represented as x in the region corresponding to observation 500. Correspondences are accumulated as all image elements pass through the trees in the decision forest (note that FIG. 5 only shows one image element passing through one tree). In some examples, the candidate correspondences are ranked across all leaves according to confidence. In other examples, the candidate correspondences are optionally filtered and then aggregated to form an overall aggregation of correspondences for each image element. Any suitable method of aggregating correspondences at test time may be used; for example, aggregation methods similar to those described above for training time are used.
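The test-time pooling across trees can be sketched as follows: the leaf candidates from each tree are pooled for one image element and ranked by confidence. The function name, the 2D model-point tuples, and the choice of keeping the top-k candidates are hypothetical; the description permits other aggregation methods.

```python
def aggregate(leaf_outputs, top_k=2):
    """Pool candidate correspondences from every tree's destination leaf
    and rank them by confidence.

    leaf_outputs: one list per tree of (model_point, confidence) pairs."""
    pooled = [c for leaf in leaf_outputs for c in leaf]
    pooled.sort(key=lambda mc: mc[1], reverse=True)
    return pooled[:top_k]

# Leaf outputs for one image element from a hypothetical three-tree forest.
per_tree = [
    [((0.1, 0.2), 0.7), ((0.5, 0.5), 0.2)],
    [((0.1, 0.25), 0.6)],
    [((0.9, 0.9), 0.4)],
]
best = aggregate(per_tree)
```

The top-ranked candidates are what the model fitting engine selects from at the start of its iterative process.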
FIG. 6 is a schematic diagram of a method performed by model fitting engine 202. Candidate correspondences are selected 600 from the regression engine. Various different criteria may be used to select candidate correspondences. In an example, for a given image element of the observation data, the candidate correspondences from the leaf nodes of each tree are ranked according to confidence, and one or more of the highest confidence candidate correspondences are selected as the candidate correspondences for the image element.
In some cases, a candidate correspondence may not lie on the model surface or within the model volume. In this case, the closest point on or in the model is selected as the candidate correspondence. Any suitable method of locating the predicted corresponding point on the surface manifold of the model may be used, such as nearest neighbor search or other ways of obtaining close points on the model. In examples where the regression engine comprises a random decision forest, the probability distribution over correspondences may be stored as modes derived from the mean shift mode detection or other clustering process described above. When clustering is performed, correspondences with model points at different locations on the articulated entity are potentially grouped into the same cluster, even though some of these model points may be on different limbs. Due to the nature of the clustering process, the resulting cluster centers may lie in the space around the model, rather than in or on the model.
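Snapping an off-surface cluster center to the model can be sketched with a brute-force nearest-vertex search (the simplest form of the nearest neighbor search mentioned above; the vertex list and names are hypothetical, and a real system would use a spatial index over many vertices):

```python
def closest_model_point(candidate, model_vertices):
    """Snap a candidate correspondence (e.g. a cluster center lying off the
    model surface) to the nearest model vertex by squared Euclidean distance."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model_vertices, key=lambda v: d2(candidate, v))

# Toy triangle of model vertices and an off-surface cluster center.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
snap = closest_model_point((0.9, 0.2, 0.1), verts)
```

For a dense mesh, a k-d tree or similar acceleration structure would replace the linear scan, but the result is the same: a candidate correspondence guaranteed to lie on the model.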
Various different starting points for the optimization process may be used. A non-exhaustive list of examples is: a random starting point; starting from a fixed specific pose; first rotating, translating, and scaling a fixed specific pose to fit the candidate correspondences; or using the result from the previous frame in examples where a temporal sequence of images of the entity is observed.
As described above, the optimizer may use an iterative process 602 that includes: optimizing the pose and/or shape while the correspondences are fixed in the energy function 604, and then optimizing the correspondences while the pose and/or shape is fixed in the energy function 606. The steps of optimizing for pose and/or shape 604 and optimizing for correspondences 606 may be repeated 608. The output of the process includes the pose parameters and/or shape parameters and the final correspondence values 634.
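The alternation between the two steps can be illustrated with a toy, translation-only sketch in the spirit of iterative closest point: with the pose fixed, each data point is assigned its closest model point; with the correspondences fixed, the optimal translation has a closed form. This is a deliberately reduced example; the actual engine optimizes full pose and/or shape parameters and may constrain the correspondence search using the regression engine's output.

```python
def fit_translation(model, data, iters=10):
    """Alternate correspondence assignment and pose (translation) update."""
    tx, ty = 0.0, 0.0
    for _ in range(iters):
        # Step 1: pose fixed -- assign each data point its closest
        # (translated) model point.
        corr = []
        for dx, dy in data:
            corr.append(min(
                model,
                key=lambda m: (m[0] + tx - dx) ** 2 + (m[1] + ty - dy) ** 2))
        # Step 2: correspondences fixed -- for a squared-distance energy the
        # optimal translation update is the mean residual (closed form).
        tx += sum(d[0] - c[0] - tx for d, c in zip(data, corr)) / len(data)
        ty += sum(d[1] - c[1] - ty for d, c in zip(data, corr)) / len(data)
    return tx, ty

model_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
data_pts = [(2.0, 3.0), (3.0, 3.0), (2.0, 4.0)]  # the model shifted by (2, 3)
t = fit_translation(model_pts, data_pts)
```

Even when the first round of correspondences is partly wrong, the alternation recovers the true offset within a few iterations, which is the behaviour the iterative process 602 relies on.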
For example, in examples where the model is not fully parameterized for shape, the iterative process 602 performed by the optimizer may itself be repeated 636 for other candidate models (e.g., short men, tall women, children).
The process of optimizing 604 the pose and/or shape parameters includes: an energy function regarding the model's consistency with the observed data is accessed 610 and the model points are fixed 612 in the energy function using the values of the correspondences selected from the regression engine. The energy function terms depending on the pose and/or shape are optimized 614. In an example, as described above, a Newton method or an approximation to a Newton method (a quasi-Newton method) is used for the optimization. Optionally, multiple restarts 618 of the optimization are performed to mitigate local minima. The result of the optimization is a candidate pose and/or shape 620 that includes values of the pose and/or shape parameters of the model.
The process of optimizing 606 the correspondences includes: an energy function regarding the model's consistency with the observed data is accessed 622 and the pose and/or shape parameters are fixed 624 in the energy function using the values of the pose and/or shape output by the pose and/or shape optimization process 604. The energy function term that depends on the correspondences is optimized 614, for example, using any suitable type of integer optimization process.
In some examples, the optimization of the energy function term depending on correspondence is unconstrained 626 and includes: for each image element, a search is made over all model points that have been transformed given the pose and/or shape parameters to find the closest model point to the image element. The process outputs candidate correspondences 632.
In other examples, the optimization of the energy function term depending on the correspondences is guided by the output of the regression engine. For example, the optimization may be fully constrained 630 by the output of the regression engine. For example, all preliminary correspondences are accessed 628 from the regression engine and, with the corresponding model points transformed given the current pose and shape estimate, a search is made over these preliminary correspondences for the one closest to each image element. In another example, for each image element, the highest confidence correspondence (or a correspondence selected using other specified criteria) is accessed from the regression engine, and model points near that correspondence are searched for the closest model point to the image element. The assessment of which model points are near the highest confidence correspondence may be performed using a distance measure or other similarity measure.
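The fully constrained search of step 630 can be sketched as follows (2-D points for brevity; the `candidates[i]` layout, holding the forest's preliminary model-point indices for image element i, is an assumed representation for illustration, not one mandated by the method):

```python
import numpy as np

def constrained_correspondences(data, model, candidates, R, t):
    """Step 628/630: for each image element, search only over the
    preliminary correspondences proposed by the regression engine,
    choosing the candidate model point that lies closest to the image
    element once transformed by the current pose estimate (R, t)."""
    transformed = model @ R.T + t
    chosen = np.empty(len(data), dtype=int)
    for i, cand in enumerate(candidates):
        cand = np.asarray(cand)
        d2 = ((transformed[cand] - data[i]) ** 2).sum(axis=-1)
        chosen[i] = cand[d2.argmin()]
    return chosen
```

Restricting the search to the candidate set keeps the per-pixel cost proportional to the (small) number of forest proposals rather than the full model size.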
As described above, the energy function includes a work term that depends on the pose and/or shape parameters and the correspondences. The energy function optionally comprises one or more additional terms, each depending on the pose and/or shape parameters, or on the correspondences.
In an example, the work term of the energy function may be a term that, given the model and the data points, seeks a good pose and/or shape while taking visibility into account. For example, a data point (which is an image element) should not correspond to a model point (transformed by the current pose and shape parameters) that is not visible from the camera or from the viewpoint from which the image was captured. In view of this, when a model point is not visible from the viewpoint of the camera or captured image, the work term may be set to infinity or another specified value. This term of the energy function then depends on both the correspondences and the pose and/or shape. To evaluate whether a model point is visible from the viewpoint of the captured image, the surface normal of the model point may be calculated and its direction compared with the direction of the camera's visual axis. A back-facing surface normal may be identified as having a direction generally parallel to the camera's visual axis. In the case of depth images, the camera's visual axis is taken to be the positive Z direction.
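A minimal sketch of the visibility check just described, assuming outward-facing unit surface normals and a depth camera whose visual axis is the positive Z direction:

```python
import numpy as np

def visible_mask(normals, view_axis=(0.0, 0.0, 1.0)):
    """Return True for model points whose (outward) surface normals face
    the camera.  A normal roughly parallel to the camera's visual axis
    (positive dot product) is back-facing; for a depth camera the visual
    axis is the positive Z direction.  The work term would be set to
    infinity (or another large value) where this mask is False."""
    v = np.asarray(view_axis, dtype=float)
    v = v / np.linalg.norm(v)
    return np.asarray(normals) @ v <= 0.0
```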
In the examples described herein, at least one of the terms of the energy function is designed to take into account the similarity between the data points and the model points. This may be referred to as a shape term and may depend only on the correspondences. Any one or more similarity measures may be used, such as color, curvature, depth, intensity, or other measures. For example, the shape term may be arranged to have a low value when the model point is a plausible counterpart of a given data point according to the similarity measure, and to have a high value otherwise. In the examples described herein, the regression engine is used to provide the similarity measure, in that the candidate correspondences identified by the regression engine are similar to their data points.
In some examples, the energy function includes a term that takes into account prior information about the class of articulated entities being modeled. For example, in the case of the human body, the elbow joint has a limited angular range through which it can move. This prior term may depend only on the pose and/or shape parameters. Other prior knowledge about motion or behavior may also be taken into account using this term. For example, where the observation data includes a sequence of images of an entity that moves or deforms over time, a motion model may be implemented as a constraint in the prior term. The motion model may be specified based on knowledge of the manner in which the entity typically moves or deforms over time, for example by modeling the joint angular velocity as constant, or by modeling the joint angular acceleration as constant.
In some examples, the energy function includes a term that takes into account coherence between neighboring image points and neighboring model points and depends only on the correspondences. For example, the coherence term may encourage neighboring image points to map to neighboring model points. Other image-based weighting terms may be incorporated into the coherence term.
The coherence term enables the model fitting engine to take into account the fact that the correspondences inferred by the regression engine are noisy and uncertain. In the constrained and guided correspondence optimization processes described above with reference to FIG. 6, the coherence term can be used to account for this uncertainty by defining a Markov random field and solving it using graph cuts, loopy belief propagation, or other methods. The coherence term encourages image points that are close to each other to map to model points that are also close to each other. Any suitable metric for measuring how "close" model points or image points are may be used, for example Euclidean distance, or geodesic distance in an embedding space.
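One way to sketch the coherence term, assuming Euclidean distance and a precomputed list of neighbouring image-element pairs (both illustrative assumptions; a geodesic distance or image-based weights could be substituted):

```python
import numpy as np

def coherence_energy(model_points, corr, neighbour_pairs, lam=1.0):
    """Pairwise coherence term: for each pair of neighbouring image
    elements (i, j), penalize the distance between the model points they
    are mapped to, so nearby pixels are encouraged to map to nearby
    model points."""
    e = 0.0
    for i, j in neighbour_pairs:
        e += lam * np.linalg.norm(model_points[corr[i]] - model_points[corr[j]])
    return e
```

Because this term couples the labels (correspondences) of neighbouring pixels, the correspondence optimization becomes a pairwise Markov random field, which is why graph-cut or belief-propagation solvers apply.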
The terms of the energy function may be combined using a weighted summation or other aggregation means. The weights may be learned using validation data. The weights of the shape and work terms may sum to 1 and may be adjusted during the iterative optimization process (while keeping the sum equal to 1). For example, the weight of the visibility term may be set to zero in the first iteration of the optimization and increased as the optimization approaches convergence. Adjusting the weights in this manner enables the optimization process to be performed efficiently and provides accurate results.
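The weighting scheme can be sketched as follows (the linear ramp for the visibility weight is an illustrative choice; the description only requires that it start at zero and grow towards convergence):

```python
def combined_energy(term_values, weights):
    """Weighted aggregation of the (already evaluated) energy terms."""
    return sum(w * v for w, v in zip(weights, term_values))

def visibility_weight(iteration, num_iterations, w_max=1.0):
    """Illustrative schedule: zero on the first iteration, ramping up
    linearly as the optimization approaches convergence."""
    return w_max * iteration / max(1, num_iterations - 1)
```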
In examples where the observation data includes a sequence of images of the entity over time, the model fitting engine may be configured to derive shape parameters that are static over time, such as player height or other player shape parameters. The values of these static parameters may then be fixed throughout the optimization process.
In examples where the observation data comprises a sequence of images of the entity over time, the model fitting engine may be arranged to optimize pose and shape over any subset of the sequence, for example the entire sequence or the most recent K frames.
In examples where the observation data includes multiple images of the entity taken simultaneously from different views, multiple sets of candidate correspondences (one from each of the multiple different views) may be obtained. The optimization process may use these candidate correspondence sets.
FIG. 7 is a flow diagram of a method of training a random decision forest for use as the regression engine 200 of FIG. 2. The training set is first received 700. The number of decision trees to be used in the random decision forest is selected 702. A random decision forest is an ensemble of decision trees. Decision trees may be used in classification or regression algorithms, but can suffer from overfitting, i.e. poor generalization. However, an ensemble of many randomly trained decision trees (a random forest) yields improved generalization. The number of trees is fixed during the training process.
The following notation is used to describe the training process. A picture element in the image I is defined by its coordinates x = (x, y).
The manner in which the parameters used by each of the split nodes are selected, and how the leaf node probabilities may be calculated, is now described. A decision tree is selected 704 from the decision forest and a node is selected 706. At least a subset of the image elements is then selected 708 from each of the training images. For example, the image may be segmented so that image elements in a foreground region are selected.
Then, a random set of test parameters used by the binary test performed at the root node is generated 710 as candidate features. In one example, the binary test has the form: ξ > f(x; θ) > τ, where f(x; θ) is a function applied to image element x with parameters θ, and the output of the function is compared with thresholds ξ and τ. If the result of f(x; θ) is in the range between ξ and τ, the result of the binary test is true. Otherwise, the result of the binary test is false. In other examples, only one of the thresholds ξ and τ may be used, such that the result of the binary test is true if the result of f(x; θ) is greater than (or alternatively less than) that threshold. In the example described here, the parameters θ define a feature of the image.
The candidate function f(x; θ) may use only the image information available at test time. The parameters θ of the function f(x; θ) are randomly generated during training. The process for generating the parameters θ may include generating a random spatial offset value in the form of a two-dimensional or three-dimensional displacement. The value of the function f(x; θ) is then calculated by observing the depth value (or intensity or other value of the observation data) of a probe image element that is displaced from the image element of interest x by the spatial offset. Optionally, the spatial offset is made depth invariant by scaling it by 1/depth of the image element of interest.
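A sketch of the feature and binary test, assuming a depth image indexed as depth[v, u] and a 2-D pixel offset θ = (du, dv) scaled by 1/depth for depth invariance (clamping at the image border is an implementation choice, not specified above):

```python
import numpy as np

def offset_feature(depth_image, x, theta):
    """f(x; theta): probe the depth at a spatial offset from the image
    element of interest, the offset scaled by 1/depth at x so the feature
    is approximately depth invariant."""
    h, w = depth_image.shape
    u, v = x
    d = depth_image[v, u]
    du, dv = theta
    pu = int(np.clip(u + du / d, 0, w - 1))
    pv = int(np.clip(v + dv / d, 0, h - 1))
    return depth_image[pv, pu]

def binary_test(depth_image, x, theta, xi, tau):
    """True when xi > f(x; theta) > tau, the split test used at a node."""
    f = offset_feature(depth_image, x, theta)
    return xi > f > tau
```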
The result of the binary test performed at the root node or a split node determines to which child node the image element is passed. For example, if the result of the binary test is true, the image element is passed to the first child node, and if the result is false, the image element is passed to the second child node.
The generated set of random test parameters includes a plurality of random values of the function parameters θ and of the thresholds ξ and τ. To inject randomness into the decision trees, the function parameters θ of each split node are optimized only over a randomly sampled subset Θ of all possible parameters. This is an effective and simple way of injecting randomness into the trees, and it increases generalization.
In other words, for each image element in each training image, each available value of θ (i.e., each θi ∈ Θ) is tried in turn in combination with the available values of ξ and τ, and a criterion (also referred to as an objective) is calculated for each combination. Note that these objectives are different from the energy functions or objective functions described above for pose and/or shape estimation. In an example, the calculated criterion comprises the information gain (also referred to as relative entropy). The combination of parameters that optimizes the criterion, e.g. that maximizes the information gain (denoted θ*, ξ* and τ*), is selected 714 and stored at the current node for future use. Instead of the information gain, other criteria may be used, such as Gini entropy, the "two-ing" criterion, or the like. In an example, the objective comprises a differential entropy, which selects parameters that reduce, at the split node, the variance of the correspondences (the model points predicted as mapping to the observed image elements) of the subset of image elements.
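For Gaussian leaf models, the differential-entropy objective reduces to choosing the split that most reduces the variance of the correspondences in the two children; a minimal sketch (larger return value means a better split):

```python
import numpy as np

def variance_reduction(correspondences, goes_left):
    """Criterion sketch: the reduction in the sum of squared deviations
    of the correspondences (predicted model points) achieved by a
    candidate split defined by the boolean mask `goes_left`."""
    y = np.asarray(correspondences, dtype=float)
    sse = lambda a: ((a - a.mean(0)) ** 2).sum() if len(a) else 0.0
    return sse(y) - sse(y[goes_left]) - sse(y[~goes_left])
```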
It is then determined 716 whether the value of the calculated criterion is less than (or greater than) a threshold. If the value of the calculated criterion is less than the threshold, this indicates that further expansion of the tree does not provide significant benefit. This results in asymmetric trees that naturally stop growing when no further nodes are beneficial. In such a case, the current node is set 718 to be a leaf node. Similarly, the current depth of the tree (i.e., how many levels of nodes are between the root node and the current node) is determined. If this is greater than a predetermined maximum, the current node is set 718 to be a leaf node. Each leaf node has candidate correspondences that are accumulated at that leaf node during the training process described below.
Other stopping criteria may also be used in combination with those already mentioned. For example, the number of example image elements that reach a leaf is evaluated. If there are few examples (e.g. compared to a threshold), the process may be arranged to stop to avoid overfitting. However, it is not necessary to use this stopping criterion.
If the value of the calculated criterion is greater than or equal to the threshold and the tree depth is less than the maximum value, the current node is set 720 to be a split node. Since the current node is a split node, it has child nodes, and the process moves on to training these child nodes. Each child node is trained using a subset of the training image elements at the current node. The subset of image elements sent to each child node is determined using the parameters that optimized the criterion. These parameters are used in the binary test, and the binary test is performed 722 on all image elements at the current node. Image elements that pass the binary test form a first subset that is sent to the first child node, while image elements that fail the binary test form a second subset that is sent to the second child node.
For each of the child nodes, the processing outlined in blocks 710 through 722 of FIG. 7 is recursively performed 727 on the subset of image elements for the respective child node. In other words, for each child node, new random test parameters are generated 710 and applied 712 to the respective subset of image elements, parameters that optimize the criterion are selected 714, and the node type (split or leaf) is determined 716. If the node is a leaf node, the current branch of the recursion stops. If the node is a split node, the binary test is performed 722 to determine further subsets of image elements, and another branch of the recursion begins. Thus, the process recursively traverses the tree, training each node until a leaf node is reached on every branch. When leaf nodes are reached, the process waits 726 until the nodes in all branches have been trained. Note that in other examples, alternatives to recursion may be used to achieve the same functionality.
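The recursive growth of blocks 710-722 can be sketched compactly. In this sketch, axis-aligned thresholds on a feature table stand in for the image offset tests, variance reduction stands in for the criterion, and leaves store the mean correspondence; `predict` walks an element down to its leaf:

```python
import numpy as np

rng = np.random.default_rng(1)

def grow(features, targets, depth=0, max_depth=8, min_samples=4, n_cand=10):
    """Recursively train a node: try n_cand random (feature, threshold)
    splits, keep the one with the largest variance reduction, recurse
    into the two children, and make a leaf when a stopping criterion
    fires (depth limit, too few samples, or no beneficial split)."""
    def leaf():
        return {"leaf": True, "pred": targets.mean(0)}
    if depth >= max_depth or len(targets) < min_samples:
        return leaf()
    sse = lambda a: ((a - a.mean(0)) ** 2).sum() if len(a) else 0.0
    best = None
    for _ in range(n_cand):
        j = int(rng.integers(features.shape[1]))      # random feature "theta"
        tau = rng.uniform(features[:, j].min(), features[:, j].max())
        mask = features[:, j] > tau                   # binary test 722
        if mask.all() or not mask.any():
            continue
        gain = sse(targets) - sse(targets[mask]) - sse(targets[~mask])
        if best is None or gain > best[0]:
            best = (gain, j, tau, mask)
    if best is None or best[0] <= 1e-12:              # no further benefit: 718
        return leaf()
    _, j, tau, mask = best
    return {"leaf": False, "j": j, "tau": tau,        # split node: 720
            "l": grow(features[mask], targets[mask], depth + 1,
                      max_depth, min_samples, n_cand),
            "r": grow(features[~mask], targets[~mask], depth + 1,
                      max_depth, min_samples, n_cand)}

def predict(node, x):
    """Walk an element down to a leaf and return the stored prediction."""
    while not node["leaf"]:
        node = node["l"] if x[node["j"]] > node["tau"] else node["r"]
    return node["pred"]
```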
Once all of the nodes in the tree have been trained to determine the parameters of the binary test that optimizes the criterion at each split node, and leaf nodes have been selected to terminate each branch, candidate correspondences may be accumulated 728 at the leaf nodes of the tree. This is the training phase, so the particular image elements that reach a given leaf node have specified correspondences known from ground-truth training data. A representation of the accumulated correspondences may be stored 730 using the various different manners described above.
Once the accumulated correspondences have been stored, a determination 732 is made as to whether there are more trees in the forest. If so, the next tree in the decision forest is selected and the process is repeated. If all the trees in the forest have been trained and no others remain, the training process is complete and the process terminates 734.
Therefore, as a result of the training process, one or more decision trees are trained using synthesized or empirical training images. Each tree includes a plurality of split nodes that store optimized test parameters and leaf nodes that store associated candidate correspondences or a representation of aggregated candidate correspondences. Since the parameters are randomly generated from a finite subset used at each node, the trees of the forest are distinct (i.e., different) from each other.
The training process may be performed prior to using the regression engine. The decision forest and optimized test parameters may be stored on a storage device for later use in model fitting.
FIG. 8 illustrates a flow diagram of a process for predicting correspondences (model points corresponding to image elements) in images that have not been seen before using a decision forest that has been trained as described above. First, a depth image that has not been seen is received 800. The image is referred to as "unseen" to distinguish it from a training image having a designated correspondence. Note that the unseen image may be pre-processed to the extent that foreground regions are identified, for example, which reduces the number of image elements to be processed by the decision forest. However, preprocessing to identify foreground regions is not necessary.
Image elements are selected 802 from the unseen image. A trained decision tree is also selected 804 from the decision forest. The selected image element is passed 806 through the selected decision tree (in a manner similar to that described above with reference to fig. 5): it is tested against the trained parameters at each node and passed to the appropriate child node according to the result of the test, and the process is repeated until the image element reaches a leaf node. Once an image element reaches a leaf node, the accumulated correspondences associated with that leaf node (from the training phase) are stored 808 for that image element.
If it is determined 810 that there are more decision trees in the forest, a new decision tree is selected 804, the image element is passed 806 through the tree, and the accumulated correspondences are stored 808. This process is repeated until it has been performed for all the decision trees in the forest. Note that the process of passing an image element through the multiple trees in the decision forest may also be performed in parallel, instead of sequentially as shown in fig. 8.
Then, it is determined 812 whether there are other unanalyzed image elements in the unseen depth image, and if so, another image element is selected and the process is repeated. Once all image elements in the unseen image have been analyzed, correspondences are obtained for all image elements.
As an image element traverses the trees in the decision forest, correspondences are accumulated. These accumulated correspondences are aggregated 814 to form an overall aggregation of correspondences for each image element. The method of aggregating correspondences at test time mentioned above with reference to fig. 5 may be used. Alternatively, sampling of the correspondences may be employed for aggregation. For example, N correspondences may be chosen at random, or the top N weighted correspondences may be taken, and the aggregation process then applied only to those N correspondences. This enables a trade-off to be made between speed and accuracy.
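A sketch of test-time aggregation for one image element, assuming each tree contributes one (model point, confidence weight) pair and using a confidence-weighted mean in place of the mean shift aggregation mentioned earlier (both are assumptions for illustration):

```python
import numpy as np

def aggregate_correspondences(per_tree, top_n=None):
    """Aggregate the correspondences accumulated across the forest for
    one image element.  `per_tree` is a list of (model_point, weight)
    pairs collected from the leaf each tree reached.  Optionally only
    the top_n highest-weighted correspondences are aggregated, trading
    accuracy for speed."""
    pts = np.array([p for p, _ in per_tree], dtype=float)
    w = np.array([wt for _, wt in per_tree], dtype=float)
    if top_n is not None:
        keep = np.argsort(w)[::-1][:top_n]
        pts, w = pts[keep], w[keep]
    return (pts * w[:, None]).sum(0) / w.sum()
```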
At least one correspondence set may then be output 816, where the correspondences may be confidence weighted. More than one set of correspondences may be output; for example, in the presence of uncertainty. Additionally, the correspondence set may include null values for one or more observed data points.
In some examples, the regression engine 200 and the model fitting engine 202 described above are incorporated into the system for controlling a computer game now described.
Fig. 9 shows an example of a camera-based control system 900 for controlling a computer game. In this illustrative example, FIG. 9 shows a user 902 playing a boxing game. In some examples, the camera-based control system 900 may be used, among other things, to determine body pose, and to bind to, recognize, analyze, track and associate to a human target such as the user 902, to provide feedback, to interpret gestures, and/or to adapt to aspects of the human target.
The camera-based control system 900 includes a computing device 904. The computing device 904 may be a general purpose computer, a gaming system or console, or a dedicated image processing device. The computing device 904 may include hardware components and/or software components such that the computing device 904 may be used to execute applications such as gaming applications and/or non-gaming applications. The structure of computing device 904 is described below with reference to FIG. 10.
The camera-based control system 900 also includes a capture device 906. For example, as described in more detail below, the capture device 906 may be an image sensor or detector that may be used to visually monitor one or more users (e.g., user 902) such that gestures performed by the one or more users may be captured, analyzed, processed, and tracked to perform one or more controls or actions within a game or application. The capture device 906 may be arranged to capture the observation data 208 of fig. 2.
The camera-based control system 900 may also include a display device 908 connected to the computing device 904. The display device may be a television, monitor, High Definition Television (HDTV), etc., which may provide game or application images (and optionally audio) to the user 902.
In operation, the user 902 may be tracked using the capture device 906 and the computing device 904 such that the pose and movements of the user 902 (and of any props used by the user) may be interpreted by the computing device 904 (and/or the capture device 906) as controls that may be used to affect the application being executed by the computing device 904. As a result, the user 902 may move his body to control the executed game or application.
In the illustrative example of fig. 9, the application executing on the computing device 904 is a boxing game that the user 902 is playing. In this example, computing device 904 controls display device 908 to provide the user 902 with a visual representation of a boxing opponent. The computing device 904 also controls the display device 908 to provide a visual representation of a user avatar that the user 902 may control with his or her movements. For example, the user 902 may throw a punch in physical space, causing the user avatar to throw a punch in game space. Thus, according to the present example, the computing device 904 and the capture device 906 of the camera-based control system 900 may be used to recognize and analyze the punch of the user 902 in physical space such that the punch may be interpreted as a game control of the user avatar in game space.
In addition, some movements may be interpreted as controls corresponding to actions other than controlling the avatar. For example, the user may use movements to enter, exit, turn the system on or off, pause, save a game, select a level, profile or menu, view high scores, communicate with a friend, and so forth. Additionally, the movements of the user 902 may be used and analyzed in any suitable manner to interact with applications other than games, such as entering text, selecting icons or menu items, controlling media playback, browsing websites, or operating any other controllable aspect of an application or operating system.
Referring now to fig. 10, fig. 10 illustrates a schematic diagram of a capture device 906 that may be used in the camera-based control system 900 of fig. 9. In the example of fig. 10, capture device 906 is configured to capture a video image having depth information. Such a capture device may be referred to as a depth camera. The depth information may be in the form of a depth image that includes a depth value, i.e., a value associated with each image element of the depth image that is related to the distance between the depth camera and the item or object located at that image element.
Depth information may be obtained using any suitable technique including, for example, time-of-flight, structured light, stereo images, and the like. In some examples, the capture device 906 may organize the depth information into "Z layers," i.e., layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.
As shown in fig. 10, the capture device 906 includes at least one imaging sensor 1000. In the example shown in fig. 10, the imaging sensor 1000 includes a depth camera 1002 arranged to capture a depth image of a scene. The captured depth image may include a two-dimensional (2-D) region of the captured scene, where each image element in the 2-D region represents a depth value, such as a length or distance of an object in the captured scene from depth camera 1002.
The capture device may further comprise an emitter 1004 arranged to illuminate the scene in such a way that depth information may be determined by the depth camera 1002. For example, where depth camera 1002 is an infrared (IR) time-of-flight camera, the emitter 1004 emits IR light onto the scene, and depth camera 1002 is arranged to detect backscattered light from the surface of one or more targets and objects in the scene. In some examples, pulsed infrared light may be emitted from the emitter 1004 such that the time between an outgoing light pulse and a corresponding incoming light pulse may be detected and measured by the depth camera and used to determine a physical distance from the capture device 906 to a location on a target or object in the scene. Additionally, in some examples, the phase of the light wave leaving the emitter 1004 may be compared with the phase of the light wave arriving at the depth camera 1002 to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device 906 to the location on the target or object. In other examples, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 906 to the location on the target or object by analyzing the intensity of the reflected light beam over time via various techniques including, for example, shuttered light pulse imaging.
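The two time-of-flight relations in this paragraph can be written down directly (a sketch; real sensors add calibration and phase-unwrapping steps that are omitted here):

```python
import math

C = 299_792_458.0  # speed of light in m/s

def distance_from_round_trip(delta_t):
    """Pulsed time of flight: the light pulse travels out and back, so
    distance = c * delta_t / 2."""
    return C * delta_t / 2.0

def distance_from_phase_shift(delta_phi, f_mod):
    """Phase-based time of flight: a phase shift of delta_phi radians at
    modulation frequency f_mod gives distance = c * delta_phi /
    (4 * pi * f_mod), unambiguous only within half a modulation
    wavelength."""
    return C * delta_phi / (4.0 * math.pi * f_mod)
```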
In another example, the capture device 906 may use structured light to capture depth information. In such techniques, the emitter 1004 may be used to project patterned light (e.g., light displayed as a known pattern such as a grid pattern or a stripe pattern) onto the scene. Upon striking the surface of one or more targets or objects in the scene, the pattern becomes deformed. Such deformation of the pattern may be captured by depth camera 1002 and then analyzed to determine a physical distance from the capture device 906 to a location on a target or object in the scene.
In another example, depth camera 1002 may be in the form of two or more physically separated cameras viewing the scene from different angles, obtaining visual stereo data that may be resolved to generate depth information. In this case, the emitter 1004 may be used to illuminate the scene or may be omitted.
In some examples, in addition to depth camera 1002, the capture device 906 may include a conventional video camera known as an RGB camera. The RGB camera is arranged to capture a sequence of images of the scene at visible frequencies and can therefore provide images that may be used to augment the depth images. In an alternative example, an RGB camera may be used instead of depth camera 1002.
The capture device 906 shown in fig. 10 also includes at least one processor 1008 in communication with the imaging sensor 1000 (i.e., the depth camera 1002 and the RGB camera in the example of fig. 10) and the emitter 1004. Processor 1008 may be a general purpose microprocessor or a dedicated signal/image processor. The processor 1008 is arranged to execute instructions to control the imaging sensor 1000 and the emitter 1004 to capture depth images and/or RGB images. The processor 1008 may also optionally be arranged to perform processing on these images, as outlined in more detail below.
In some examples, the imaging sensor is used to provide a silhouette image, which is a two-dimensional binary image identifying the foreground and background regions of the depth and/or RGB images captured by the imaging sensor. The silhouette image may be formed at the imaging sensor and/or processor 1008 from the captured depth and RGB images. The silhouette image may be processed using the methods described herein to resolve the ambiguities that arise when RGB images are used in the methods described herein. In this case, the silhouette image may be considered a depth image flattened to a fixed depth.
The capture device 906 shown in fig. 10 also includes a memory 1010, the memory 1010 being arranged to store instructions for execution by the processor 1008, images or image frames captured by the depth camera 1002 or the RGB camera, or any other suitable information, images, or the like. In some examples, memory 1010 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. The memory 1010 may be a separate component in communication with the processor 1008, or may be integrated into the processor 1008.
The capture device 906 also includes an output interface 1012 in communication with the processor 1008 and arranged to provide data to the computing device 904 via a communication link. For example, the communication link may be a wired connection (e.g., USB (trademark), Firewire (trademark), Ethernet (trademark), or the like) and/or a wireless connection (e.g., WiFi (trademark), Bluetooth (trademark), or the like). In other examples, output interface 1012 may be connected to one or more communication networks (e.g., the internet) and provide data to computing device 904 via these networks. The computing device 904 may include the regression engine 200 and the model fitting engine 202 described above with reference to fig. 2.
Fig. 11 illustrates various components of an exemplary computing-based device 904, the computing-based device 904 may be implemented as any form of computing and/or electronic device, and in the computing-based device 904, embodiments of a system for computing a pose of an articulated entity from observation data, such as one or more images, may be implemented.
The computing-based device 904 includes one or more processors 1100, and the processors 1100 may be microprocessors, controllers, graphics processing units, parallel processing units, or any other suitable type of processor for processing computer-executable instructions for controlling the operation of the device to calculate a pose of the articulated entity from observed data, such as one or more images. In some examples, such as where a system on a chip architecture is used, processor 1100 may include one or more fixed function blocks (also referred to as accelerators) that implement portions of the methods of model fitting and pose calculation in hardware (rather than software or firmware).
The computing-based device 904 includes one or more input interfaces 1102, the input interfaces 1102 being arranged to receive and process input from one or more devices, such as user input devices (e.g., capture device 906, game controller 1104, keyboard 1106, and/or mouse 1108). The user input may be used to control a software application or game executing on the computing device 904.
The computing-based device 904 also includes an output interface 1110, the output interface 1110 being arranged to output display information to a display device 908, the display device 908 being separate from the computing device 904 or integrated into the computing device 904. The display information may provide a graphical user interface. In an example, if the display device 908 is a touch-sensitive display device, the display device 908 may also serve as a user input device. The output interface may also output data to a device other than the display device, such as a locally connected printing device.
Computer-executable instructions may be provided using any computer-readable media accessible by computing-based device 904. Computer readable media may include, for example, computer storage media 1112, such as memory, and communication media. Computer storage media 1112, such as memory 1112, includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. Therefore, computer storage media should not be construed as propagating signals per se. While computer storage media 1112 (memory) is shown within computing-based device 904, it will be appreciated that the storage may be distributed or located remotely and accessed via a network or other communication link (e.g., using communication interface 1113).
Platform software, including an operating system 1114 or any other suitable platform software, may be provided at the computing device 904 to enable application software 1116 to execute on the device. Other functionality that may be provided on the computing device 904 includes a regression engine 1118 and a model fitting engine 1120 (see, e.g., fig. 6 and the description above). A data store 1122 is provided to store data such as observation data, training data, intermediate function results, tree training parameters, probability distributions, classification labels, regression targets, classification targets, energy function terms, energy function term weights, and other data.
The term "computer" is used herein to refer to any device having processing capabilities such that it can execute instructions. Those skilled in the art will recognize that such processing capabilities are incorporated into many different devices, so the terms "computer" and "computing-based device" each include PCs, servers, mobile phones (including smart phones), tablet computers, set-top boxes, media players, game consoles, personal digital assistants, and many other devices.
The methods described herein may be performed by software in machine-readable form on a tangible storage medium, such as in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer, and the computer program may be embodied in a computer-readable medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory, etc., and do not include propagated signals. The software may be adapted for execution on a parallel processor or a serial processor such that the method steps may be performed in any suitable order, or simultaneously.
This acknowledges that software can be a valuable, separately tradable commodity. It is intended to encompass software which runs on, or controls, "dumb" or standard hardware to carry out the desired functions. It is also intended to encompass software which "describes" or defines the configuration of hardware, such as HDL (hardware description language) software, as is used for designing silicon chips, or for configuring universal programmable chips, to carry out desired functions.
Those skilled in the art will realize that storage devices utilized to store program instructions may be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network). Those skilled in the art will also realize that, by utilizing conventional techniques known to those skilled in the art, all or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.
It will be apparent to those skilled in the art that any of the ranges or device values given herein may be extended or altered without losing the effect sought.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It should be understood that the benefits and advantages described above may relate to one embodiment, or may relate to several embodiments. The embodiments are not limited to those embodiments that solve any or all of the stated problems or those embodiments that have any or all of the stated benefits and advantages. It will also be understood that reference to "an" item refers to one or more of those items.
The steps of the methods described herein may be performed in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the above examples may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
The term "comprising" is used herein to mean including the identified method blocks or elements, but such blocks or elements do not comprise an exclusive list, and a method or apparatus may contain additional blocks or elements.
It should be understood that the above description of the preferred embodiments is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention.
In addition, the present invention may also be configured as follows:
(1) a computer-implemented method of calculating a pose or shape of an articulated or deformable entity, comprising:
receiving at least one image of the entity;
accessing a model of a class of articulated or deformable entities, wherein the imaged entities are members of the class, the model comprising a plurality of parameters for specifying the pose or shape of the model;
accessing a plurality of candidate correspondences between image elements of the received image and model points using the image, wherein the model points are locations on or in the model;
performing an optimization process to obtain values of the parameters specifying the pose or shape of the model consistent with the received image, wherein the optimization is influenced by at least some of the candidate correspondences.
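The interplay between candidate correspondences and the optimization can be illustrated with a small sketch. This is purely illustrative and not part of the claimed subject matter: the two-link toy model, the function names, and the numeric settings are all hypothetical. For each image element the sketch restricts the search for a corresponding model point to that element's candidate set, then descends an energy summing squared distances:

```python
import numpy as np

def model_points(theta):
    """Toy articulated model: two unit-length links; theta = (joint1, joint2).
    Returns four sampled locations on the model (the 'model points')."""
    a, b = theta
    p1 = np.array([np.cos(a), np.sin(a)])               # end of first link
    p2 = p1 + np.array([np.cos(a + b), np.sin(a + b)])  # end of second link
    return np.stack([0.5 * p1, p1, p1 + 0.5 * (p2 - p1), p2])

def fit_pose(image_points, candidates, steps=200, lr=0.05):
    """Optimize pose parameters so each image point lies close to its best
    candidate model point. candidates[i] lists the model-point indices that
    are plausible correspondences for image point i (e.g. supplied by a
    regression engine)."""
    theta = np.zeros(2)
    for _ in range(steps):
        # Choose, per image element, the nearest model point *among its candidates*.
        pts = model_points(theta)
        corr = [c[np.argmin([np.linalg.norm(image_points[i] - pts[j]) for j in c])]
                for i, c in enumerate(candidates)]
        # Descend the summed squared distances via a numerical gradient.
        def energy(t):
            p = model_points(t)
            return sum(np.sum((image_points[i] - p[j]) ** 2)
                       for i, j in enumerate(corr))
        g = np.zeros(2)
        eps = 1e-5
        for k in range(2):
            d = np.zeros(2)
            d[k] = eps
            g[k] = (energy(theta + d) - energy(theta - d)) / (2 * eps)
        theta -= lr * g
    return theta
```

In a full implementation the pose vector would parameterize an articulated or deformable model with many more degrees of freedom, and the candidate sets would come from a trained regression engine rather than being supplied by hand.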
(2) The method of (1), comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to favor correspondences between image elements and similar model points, and wherein the candidate correspondences are used to evaluate similarity.
(3) The method of (1), comprising accessing the candidate correspondences from a random decision forest arranged to acquire image elements of the received image and, for each image element, to calculate a probability distribution over the candidate correspondences using information associated with its leaves.
(4) The method of (1), comprising performing the optimization process by iteratively fixing and optimizing different terms of an energy function, the energy function being defined over the correspondences between the model and the received image.
(5) The method of (4), comprising iteratively fixing and optimizing the parameters and the correspondences for specifying the pose or shape of the model, and wherein an initial correspondence is selected from the candidate correspondences.
(6) The method of (1), comprising receiving a sequence of images of the entity over time; using the sequence to derive at least one temporally static model parameter; and fixing the parameter during the optimization process.
(7) The method of (1), comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to encourage neighboring image points to map to neighboring model points.
(8) The method of (1), comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to omit model points that are not visible from the viewpoint of the image capture device which captured the received image, by taking into account the direction of the surface normals of the model.
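A minimal sketch of such a visibility term (illustrative only; the function and argument names are hypothetical) culls model points whose outward surface normal points away from the camera, so that occluded or back-facing points do not contribute to the energy:

```python
import numpy as np

def visible_model_points(points, normals, camera_pos):
    """Keep only model points whose outward surface normal faces the camera.

    A point is omitted when the dot product between its normal and the
    direction from the point to the camera is non-positive (back-facing).
    """
    to_camera = camera_pos - points                   # vectors from each point to the camera
    dots = np.einsum('ij,ij->i', normals, to_camera)  # per-point dot products
    return points[dots > 0.0]
```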
(9) The method of (1), comprising receiving a plurality of images of the entity taken simultaneously from different viewpoints, and accessing the candidate correspondences using the plurality of images.
(10) The method of (1), comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received images, wherein the energy function comprises a term arranged to penalize pose parameter values known to be impossible, or wherein the method comprises receiving a sequence of images of the entity over time that do not follow a specified motion model.
(11) The method of (1), comprising performing the optimization by accessing a plurality of candidate correspondences for each image element of the received image and searching for model points only among the candidate correspondences.
(12) The method of (1), comprising performing the optimization by accessing at least one candidate correspondence for each image element of the received image and using the candidate correspondences to guide the search for the model points.
(13) A computer-implemented method of calculating a pose or shape of an articulated or deformable entity, comprising:
receiving at least one depth image of the entity;
accessing a model of a class of articulated or deformable entities, wherein the imaged entities are members of the class, the model comprising a plurality of parameters for specifying the pose of the model;
accessing, using the image, a plurality of candidate correspondences between image elements of the received image and model points from a random decision forest, wherein the model points are locations on or in the model, the random decision forest being arranged to acquire image elements of the received image and, for each image element, to calculate a distribution over candidate correspondences using information associated with its leaves;
performing an optimization process to derive values of the parameters specifying the pose or shape of the model consistent with the received image, and also to derive an optimal correspondence between the image elements and the model points, wherein the optimization is influenced by at least some of the candidate correspondences, and wherein the optimization process comprises summing, over the image elements of the depth image, a measure based on the distance between each image element and its corresponding model point.
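The summed distance measure of (13) can be sketched as follows; this is an illustration rather than the claimed formulation, and the pinhole back-projection, the truncated quadratic, and all names are assumptions. Each valid depth pixel is back-projected to a 3D point and a robust distance to its corresponding model point is accumulated:

```python
import numpy as np

def depth_energy(depth_image, intrinsics, model_pts, corr):
    """Toy data term: back-project each depth pixel to a 3D point and sum a
    robust (truncated quadratic) distance to its corresponding model point.

    corr maps pixel coordinates (u, v) to a model-point index; model_pts maps
    that index to a 3D location.
    """
    fx, fy, cx, cy = intrinsics
    total = 0.0
    for (u, v), j in corr.items():
        z = depth_image[v, u]
        if z <= 0:                                   # skip invalid depth readings
            continue
        p = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
        total += min(np.sum((p - model_pts[j]) ** 2), 1.0)  # truncation bounds outliers
    return total
```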
(14) The method of (13), wherein the depth image is a silhouette image comprising a binary image identifying foreground and background regions of the depth image, wherein the depth is flattened to a fixed depth.
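Flattening a depth image into such a binary silhouette can be sketched as follows (illustrative only; the names are hypothetical): pixels with valid depth become foreground at a single fixed depth, everything else becomes background:

```python
import numpy as np

def to_silhouette(depth, fixed_depth=1.0):
    """Flatten a depth image to a silhouette: foreground pixels (valid depth)
    are set to one fixed depth, background pixels to zero. Also returns the
    binary foreground mask."""
    foreground = depth > 0
    return np.where(foreground, fixed_depth, 0.0), foreground
```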
(15) The method of (13), comprising performing the optimization process by iteratively fixing and optimizing different terms of an energy function, the energy function being defined over the correspondences between the model and the received image.
(16) The method of (14), comprising iteratively fixing and optimizing the parameters and the correspondences for specifying the pose or shape of the model, and wherein an initial correspondence is selected from the candidate correspondences.
(17) The method of (13), comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to omit model points that are not visible from the viewpoint of the image capture device which captured the received image, by taking into account the direction of the surface normals of the model.
(18) An apparatus for computing a pose or shape of an articulated or deformable entity, comprising:
an input device arranged to receive at least one image of the entity;
a memory storing a model of a class of articulated or deformable entities, wherein the imaged entity is a member of the class, the model comprising a plurality of parameters for specifying the pose or shape of the model;
a regression engine arranged to compute a plurality of candidate correspondences between image elements of the received image and model points, wherein the model points are locations on or in the model;
an optimizer arranged to derive values of the parameters for specifying the pose or shape of the model in accordance with the received image; the optimizer is influenced by at least some of the candidate correspondences.
(19) The apparatus of (18), the optimizer being arranged to optimize an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to favor correspondences between image elements and similar model points, and wherein similarity is evaluated using the regression engine.
(20) The apparatus of (18), wherein the regression engine comprises a random decision forest arranged to acquire image elements of the received image and to calculate, for each image element, a probability distribution about candidate correspondences using information associated with its leaves.
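How a random decision forest of the kind described in (20) might supply candidate correspondences can be sketched as follows. This is a toy illustration: the tree structure, feature tests, and per-leaf distributions are hand-built assumptions, whereas in practice they would be learned from training data. Each image element's feature vector is routed to one leaf per tree, the stored leaf distributions over model points are averaged across the forest, and the highest-probability model points are returned as candidates:

```python
import numpy as np

class Leaf:
    """Stores a probability distribution over model-point indices."""
    def __init__(self, dist):
        self.dist = dist

class Split:
    """Internal node: tests one feature against a threshold."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def predict(tree, x):
    """Route a feature vector for one image element to a leaf and return the
    stored distribution over candidate model points."""
    node = tree
    while isinstance(node, Split):
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.dist

def forest_candidates(forest, x, top_k=3):
    """Average the leaf distributions over all trees and return the top-k
    model-point indices as candidate correspondences."""
    avg = np.mean([predict(t, x) for t in forest], axis=0)
    return list(np.argsort(avg)[::-1][:top_k])
```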
Claims (10)
1. A computer-implemented method of calculating a pose or shape of an articulated or deformable entity, comprising:
receiving at least one image (104,108,112,116) of the entity;
accessing a model (100,106,110,114) of a class of articulated or deformable entities, wherein the imaged entity is a member of the class, the model comprising a plurality of parameters for specifying the pose or shape of the model;
accessing a plurality of candidate correspondences (214) between image elements of the received image and model points using the image, wherein the model points are locations on or in the model;
performing an optimization process (602) to derive values of the parameters for specifying the pose or shape of the model consistent with the received image, and wherein the optimization is affected by at least some of the candidate correspondences; and
the candidate correspondences are accessed from a random decision forest (304), the random decision forest (304) being arranged to acquire image elements of the received image and to calculate, for each image element, a probability distribution over the candidate correspondences using information associated with its leaves.
2. The method of claim 1, comprising performing the optimization process (602) by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises a term arranged to favor correspondences between image elements and similar model points, and wherein the candidate correspondences are used to evaluate similarity.
3. The method of any preceding claim, comprising performing the optimization process (602) by iteratively fixing and optimizing the parameters and the correspondences for specifying the pose or shape of the model, and wherein an initial correspondence is selected from the candidate correspondences.
4. The method of claim 1 or 2, comprising receiving a sequence of images of the entity over time; obtaining at least one temporally static model parameter using the sequence of images; and fixing the parameter during the optimization process.
5. The method of claim 1 or 2, comprising performing the optimization process by optimizing an energy function defined over the correspondences between the model and the received image, wherein the energy function comprises one or more of: a term arranged to encourage neighboring image points to map to neighboring model points; a term arranged to omit model points that are not visible from the viewpoint of the image capture device which captured the received image, by taking into account the direction of the surface normals of the model; and a term arranged to penalize pose parameter values known to be impossible.
6. A method as claimed in claim 1 or 2, comprising receiving a sequence of images of the entity over time that do not follow a specified motion model.
7. The method of claim 1 or 2, comprising receiving a plurality of images of the entity taken simultaneously from different viewpoints, and accessing the candidate correspondences using the plurality of images.
8. A method as claimed in claim 1 or 2, comprising performing the optimization by accessing a plurality of candidate correspondences for each image element of the received image and searching for model points only among the candidate correspondences.
9. A method as claimed in claim 1 or 2, comprising performing the optimization by accessing at least one candidate correspondence for each image element of the received image and using the candidate correspondence to guide the search for the model points.
10. A computer-implemented apparatus for computing a pose or shape of an articulated or deformable entity, comprising:
means for receiving at least one image (104,108,112,116) of the entity;
means for accessing a model (100,106,110,114) of a class of articulated or deformable entities, wherein the imaged entity is a member of the class, the model comprising a plurality of parameters for specifying the pose or shape of the model;
means for accessing a plurality of candidate correspondences between image elements of the received image and model points using the image, wherein the model points are locations on or in the model;
means for performing an optimization process (602) to derive values for the parameters specifying the pose or shape of the model consistent with the received image, the optimization being affected by at least some of the candidate correspondences; and
means for accessing the candidate correspondences from a random decision forest (304), the random decision forest (304) being arranged to acquire image elements of the received image and, for each image element, to calculate a probability distribution over the candidate correspondences using information associated with its leaves.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/300,542 US8724906B2 (en) | 2011-11-18 | 2011-11-18 | Computing pose and/or shape of modifiable entities |
| US13/300,542 | 2011-11-18 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1184266A1 HK1184266A1 (en) | 2014-01-17 |
| HK1184266B true HK1184266B (en) | 2017-04-21 |