
HK1171543B - Classification of posture states


Info

Publication number
HK1171543B
Authority
HK
Hong Kong
Prior art keywords
body part
hand
user
image
estimating
Prior art date
Application number
HK12112337.3A
Other languages
Chinese (zh)
Other versions
HK1171543A1 (en)
Inventor
A. Balan
M. Siddiqui
R. M. Geiss
A. A-A. Kipman
O. M. C. Williams
J. Shotton
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Priority claimed from US12/979,897 (US8488888B2)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1171543A1
Publication of HK1171543B


Description

Posture state classification
Technical Field
The present invention relates to interactive systems, and more particularly to the classification of posture states.
Background
A controllerless interactive system, such as a gaming system, may be controlled at least in part by natural motion. In some examples, such systems may employ a depth sensor or other suitable sensor to estimate motion of a user and convert the estimated motion into commands to a console of the system. However, when estimating the motion of a user, these systems may estimate only the positions of the user's major joints, for example via skeletal estimation, and thus lack the ability to detect subtle gestures.
Disclosure of Invention
Accordingly, embodiments are disclosed herein that relate to pose estimation of a body part of a user. For example, in one disclosed embodiment, an image is received from a sensor, where the image includes at least a portion of an image of a user, the portion including the body part. Skeletal information of the user is estimated from the image, an image region corresponding to the body part is identified based at least in part on the skeletal information, a shape descriptor is extracted for the region, and the shape descriptor is classified according to training data to estimate a pose of the body part. A response may then be output based on the estimated pose of the body part.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Drawings
FIG. 1 illustrates a schematic diagram of a user interacting with an embodiment of a controllerless gaming system using natural motion captured by a depth camera.
FIG. 2 illustrates an exemplary method of determining a hand state of a user according to an embodiment of the invention.
FIG. 3 schematically illustrates steps of the exemplary method of FIG. 2.
FIG. 4 schematically illustrates a computing system according to an embodiment of the invention.
Detailed Description
A controllerless interactive system, such as the gaming system shown at 10 of FIG. 1, may employ a capture device 12, such as a depth camera or other suitable sensor, to estimate the motion of the user 14. The motion of the user 14 may be estimated in a variety of different ways. In an exemplary approach, skeletal mapping may be employed to estimate one or more joint positions from a user image. The estimated user motion may be translated into commands to the console 16 of the system. In some examples, such commands may allow a user to interact with a game displayed at 18 on display device 20. For example, when user 14 interacts with an object, such as object 26 displayed on display device 20, an image 28 of user 14 may be displayed on display device 20.
However, motion estimation routines such as skeletal mapping may lack the ability to detect subtle gestures of a user. For example, these routines may lack the ability to detect and/or discern subtle hand gestures, such as those shown at 22 and 24 of FIG. 1, where the user opens and closes his hand, respectively.
The systems and methods described below therefore relate to the determination of a user's hand state. For example, the act of closing or opening the hand may be used by such a system to trigger events such as a selection action, an engagement action, or an act of grasping and dragging an object (e.g., object 26) on the screen, corresponding to pressing a button when using a controller. Such refined controllerless interactions can serve as alternatives to hand-waving or hover-based approaches, which can be non-intuitive or cumbersome. By determining the user's hand state as described herein below, the user's interactions with the system can be made richer and simpler, and a more intuitive interface may be presented to the user.
FIG. 2 illustrates an exemplary method 200 of determining a user's hand state according to one embodiment of the invention, and FIG. 3 schematically illustrates various steps of such a method. Since FIG. 3 includes schematic illustrations of the steps of FIG. 2, the two figures are described together below.
At 202, method 200 receives a depth image from a capture device, such as capture device 12 shown in FIG. 1. The capture device may be any suitable device that captures three-dimensional image data, such as a depth camera. The depth image captured by the capture device includes at least a portion of the user image, the portion including the hand. For example, as shown in FIG. 1, a user 14 may interact with computing system 10, and computing system 10 captures an image of the user via capture device 12.
At 302, FIG. 3 illustrates a depth image of a portion of a user. Each pixel in the depth image contains depth information, illustrated in FIG. 3 by a grayscale gradient. For example, at 302 the left hand is closer to the capture device than the right hand, as indicated by the darker shading of the user's left hand. The capture device or depth camera captures the user within an observed scene. As described below, the depth image of the user may be used to determine distance information for various regions of the user, size information for the user, curvature, and skeletal information of the user.
At 204, method 200 includes estimating skeletal information of the user from the depth image obtained at 202 to obtain a virtual skeleton. For example, FIG. 3 shows a virtual skeleton 304 estimated from the user depth image shown at 302. The virtual skeleton 304 may be derived from the depth image to provide a machine-readable representation of the user (e.g., user 14), and may be derived in any suitable manner. In some embodiments, one or more skeletal fitting algorithms may be applied to the depth image. It should be understood that any suitable skeletal modeling technique may be used.
Virtual skeleton 304 may include a plurality of joints, each joint corresponding to a portion of the user. The illustration of fig. 3 is simplified for clarity of understanding. A virtual skeleton according to the present invention may include any suitable number of joints, each of which may be associated with virtually any number of parameters (e.g., three-dimensional joint position, joint rotation, part pose of the corresponding body part, etc.). It should be appreciated that the virtual skeleton may take the form of a data structure that includes one or more parameters for each joint of a plurality of skeletal joints (e.g., a joint matrix that includes x-coordinates, y-coordinates, z-coordinates, and rotations for each joint). In some embodiments, other types of virtual skeletons may be used (e.g., wireframes, shape descriptor sets, etc.).
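As a non-limiting illustration, the following Python sketch shows one way such a joint data structure might be organized; the names, joint set, and values are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Joint:
    """One skeletal joint: a 3-D position plus a rotation."""
    name: str
    x: float  # horizontal position (meters)
    y: float  # vertical position (meters)
    z: float  # depth from the sensor (meters)
    rotation: Tuple[float, float, float, float]  # quaternion (w, x, y, z)

# A virtual skeleton as a simple list of joints; an equivalent "joint matrix"
# would store one row of (x, y, z, rotation) per joint.
virtual_skeleton = [
    Joint("elbow_right", 0.38, 1.10, 2.12, (1.0, 0.0, 0.0, 0.0)),
    Joint("wrist_right", 0.42, 1.31, 2.05, (1.0, 0.0, 0.0, 0.0)),
    Joint("hand_right",  0.45, 1.38, 2.01, (1.0, 0.0, 0.0, 0.0)),
]
```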
As previously mentioned, existing motion estimation from depth images, such as the skeletal estimation described above, may lack the ability to detect subtle gestures of a user. For example, such routines may lack the ability to detect and/or discern subtle hand gestures, such as those shown at 22 and 24 of FIG. 1, where the user opens and closes his hand, respectively. In addition, the limited resolution of depth images at greater depths, combined with variations in hand size between users of different ages and/or builds, variations in the orientation of the hand relative to the capture device, and the like, may increase the difficulty of detecting and classifying refined gestures such as opening and closing the hand.
This estimated skeleton may be used to estimate various other physical characteristics of the user. For example, skeletal data may be used to estimate user body and/or body part dimensions, the orientation of one or more body parts relative to each other and/or the capture device, the depth of one or more body parts relative to the capture device, and so forth. These estimates of the user's physical characteristics may then be used, as described below, to normalize the data and reduce variability in detecting and classifying the user's hand state.
At 206, method 200 includes segmenting one or both of the user's hands. In some examples, the method 200 may additionally include segmenting one or more body regions other than the hands.
Segmenting the user's hand includes identifying a region in the depth image corresponding to the hand, where the identifying is based at least in part on the skeletal information obtained in step 204. Likewise, any region of the user's body may be identified in a similar manner as described below. At 306, FIG. 3 illustrates an example of segmenting a depth image of a user into different regions represented by different shadows according to the estimated skeleton 304. In particular, FIG. 3 shows a hand region 308 positioned to correspond to the right hand of the user being raised.
The hand or body region may be segmented or located in a variety of ways, based on joints selected from the skeletal estimation described above.
As an example, hand detection and localization in the depth image may be based on estimated wrist and/or hand-tip joints in the estimated skeleton. For example, in some embodiments, hand segmentation in the depth image may be performed as follows: a topological search of the depth image is conducted around the hand joints, locating nearby local extrema in the depth image as fingertip candidates, and the rest of the hand is segmented by taking into account a body size scaling factor determined from the estimated skeleton, as well as depth discontinuities identified as boundaries.
As another example, a flood fill approach may be employed to identify the region in the depth image corresponding to the user's hand. In the flood fill approach, the depth image may be searched starting from an initial point, e.g., the wrist joint, and an initial direction, e.g., the direction from the elbow to the wrist joint. Neighboring pixels in the depth image may be iteratively scored based on their projection onto the initial direction, as a way of favoring points moving away from the elbow and toward the hand tip, while a depth consistency constraint, such as a depth discontinuity test, may be used to identify boundaries or extrema of the user's hand. In some examples, the depth map search may be bounded in both the forward and reverse directions along the initial direction using threshold distance values, which may be fixed or scaled in proportion to the user's estimated size.
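A minimal sketch of such a flood fill follows, assuming the depth image is a NumPy array in meters and that the wrist seed pixel and elbow-to-wrist direction come from the estimated skeleton; the thresholds are illustrative assumptions, not values from the patent:

```python
import numpy as np
from collections import deque

def flood_fill_hand(depth, seed, direction, depth_tol=0.03, max_px=60):
    """Grow a hand region from `seed` (row, col), e.g., the wrist joint pixel.

    depth_tol: maximum depth jump (meters) between neighbors, so the fill
               stops at depth discontinuities (hand/background boundaries).
    max_px:    search limit (pixels) along/against `direction`; could instead
               be scaled in proportion to the user's estimated body size.
    """
    h, w = depth.shape
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)                     # unit elbow-to-wrist direction
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < h and 0 <= nx < w) or mask[ny, nx]:
                continue
            if abs(depth[ny, nx] - depth[y, x]) > depth_tol:
                continue                       # depth discontinuity: boundary
            proj = np.array([ny - seed[0], nx - seed[1]], float) @ d
            if abs(proj) > max_px:
                continue                       # beyond the distance threshold
            mask[ny, nx] = True
            queue.append((ny, nx))
    return mask
```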
As another example, a bounding circle or other suitable bounding shape, placed according to a skeletal joint (e.g., a wrist or hand-tip joint), may be used to contain all pixels of the depth image out to a depth discontinuity. For example, a window may be slid over the bounding circle to identify depth discontinuities, which may be used to establish the boundary of the hand region of the depth image.
In some approaches, hand region segmentation may be performed when the user lifts the hand outward or above the torso. In this way, identification of the hand region in the depth image is less ambiguous, since the hand is more easily distinguished from the body.
It should be understood that the above examples of hand segmentation are presented for purposes of illustration and are not intended to limit the scope of the invention. In general, any suitable hand or body part segmentation method may be used alone or in combination with the exemplary methods described above.
Continuing with the method 200 of FIG. 2, at 208 the method 200 includes extracting a shape descriptor for a region, such as the region corresponding to the hand in the depth image identified at 206. The shape descriptor extracted at 208 may be any suitable representation of the hand region that can be used to classify it, e.g., against training data as described below. In some embodiments, the shape descriptor may be a vector or set of numbers used to codify or describe the shape of the hand region.
In some examples, the shape descriptor may be invariant with respect to one or more transformations, e.g., congruency (translation, rotation, mirroring, etc.), isometry, depth changes, and so forth. For example, the shape descriptor may be extracted in such a way that it is unaffected by the hand's orientation or position relative to the capture device or sensor. The shape descriptor may also be made mirror-invariant, in which case left and right hands are not distinguished. If the shape descriptor is not mirror-invariant, the training data can instead be mirrored by flipping each input image left-right, doubling the amount of training data for each hand. Furthermore, the shape descriptor may be normalized based on estimated body size, so that it remains substantially unchanged across body and/or hand size differences between users. Alternatively, a calibration step may be performed in advance, in which the individual's size is pre-estimated; in that case the descriptor need not be size-invariant.
As one example of shape descriptor extraction, a histogram of distances from the centroid of the hand region identified at 206 may be constructed. For example, such a histogram may include 15 bins, where each bin counts the number of points in the hand region whose distance from the centroid falls within the distance range associated with that bin. For example, the first bin may count the points in the hand region between 0 and 0.40 centimeters from the centroid, the second bin the points between 0.40 and 0.80 centimeters, and so on. In this way, a vector can be constructed that codifies the shape of the hand. These vectors may also be normalized, for example, based on the estimated body size.
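A sketch of this descriptor in Python, using the 15-bin, 0.4-centimeter spacing from the example above; the normalization choice is an illustrative assumption:

```python
import numpy as np

def centroid_distance_histogram(points, n_bins=15, bin_width=0.004):
    """Histogram of distances (meters) from the hand region's centroid to
    each hand point; bin 1 covers 0-0.4 cm, bin 2 covers 0.4-0.8 cm, etc."""
    points = np.asarray(points, dtype=float)          # shape (N, 3)
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    edges = np.arange(n_bins + 1) * bin_width
    hist, _ = np.histogram(dists, bins=edges)
    # Normalizing by the point count makes the vector comparable across
    # users; distances could also be divided by an estimated body size.
    return hist / max(hist.sum(), 1)
```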
In another exemplary approach, a histogram may be constructed based on distances and/or angles from points in the hand region to joints of the user's estimated skeleton (e.g., elbow, wrist), to a palm plane, and so forth.
Another example of a shape descriptor is a Fourier descriptor. A Fourier descriptor may be constructed by codifying the outline of the hand region, for example by plotting the distance from each pixel in the hand region to the perimeter of the hand region against the radius of an ellipse fitted to the hand boundary, and then performing a Fourier transform on the plot. Again, such descriptors may be normalized with respect to the estimated body size, and may be made invariant with respect to translation, scaling, and rotation.
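A common way to build such a descriptor, shown here as a sketch rather than the patent's exact construction, treats the contour as a complex-valued signal and keeps normalized magnitudes of its Fourier coefficients:

```python
import numpy as np

def fourier_descriptor(contour_xy, n_coeffs=16):
    """Fourier descriptor of an ordered hand contour (N x 2 pixel coords).

    Dropping the DC coefficient removes translation; dividing by the first
    harmonic's magnitude removes scale; keeping only magnitudes discards the
    starting point and in-plane rotation.
    """
    contour_xy = np.asarray(contour_xy, dtype=float)
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # contour as complex signal
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_coeffs + 1])          # skip DC term
    return mags / max(mags[0], 1e-12)              # scale-normalize
```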
Another example of constructing a shape descriptor includes determining the convexity of the hand region, for example as the ratio of the area enclosed by the hand region's outline to the area of its convex hull.
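A sketch of this convexity measure, assuming an ordered 2-D contour and using SciPy's convex hull:

```python
import numpy as np
from scipy.spatial import ConvexHull

def convexity(contour_xy):
    """Area enclosed by the hand contour divided by the area of its convex
    hull; spread fingers lower the ratio, a closed fist keeps it near 1."""
    contour_xy = np.asarray(contour_xy, dtype=float)   # (N, 2), ordered
    x, y = contour_xy[:, 0], contour_xy[:, 1]
    # Shoelace formula for the polygon's own enclosed area.
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    hull_area = ConvexHull(contour_xy).volume          # in 2-D, .volume = area
    return area / hull_area
```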
It will be appreciated that these descriptors are exemplary in nature and are not intended to limit the scope of the invention. In general, any suitable shape descriptor for the hand region may be used alone, in combination with others, and/or in combination with the exemplary methods described above. For example, shape descriptors such as the histograms or vectors described above may be mixed and matched, combined, and/or concatenated into larger vectors. This may allow for the identification of new patterns that cannot be identified by examining any single descriptor in isolation.
Continuing with method 200, at 210 the method includes classifying the state of the hand. For example, the shape descriptor extracted at 208 may be classified according to training data to estimate the state of the hand. For example, as illustrated at 310 of FIG. 3, a hand may be classified as open or closed. In some examples, the training data may include labeled depth image examples of various hand states. The training data may be real or synthetically generated, depicting full-body or upper-body 3D models of different body sizes and arm orientations, with articulated hand poses derived from motion capture or designed by hand. The quality of synthetic images may be degraded to simulate noisy real images.
In some examples, the training data used in the classification step 210 may be based on a predetermined set of hand examples. The shape descriptor of the hand region may be compared against hand examples that are grouped or labeled according to various hand states.
In some examples, different metadata may be used to partition the training data. For example, the training data may include a plurality of hand state examples partitioned according to one or more of the following: hand side (e.g., left or right), hand orientation (e.g., lower-arm angle or lower-arm orientation), depth, and/or the user's body size. Partitioning the training examples into separate subsets reduces the variability in hand shape within each partition, which can make the overall classification of hand state more accurate.
Additionally, in some examples, the training data may be specific to an individual application. That is, the training data may depend on the desired action in a given application, such as a desired activity in a game, etc. Further, in some examples, the training data may be user-specific. For example, an application or game may include a training module in which a user performs one or more training exercises to calibrate training data. For example, the user may make a series of gestures to open or close the hand to establish a training data set that is used to estimate the user's hand state during subsequent interactions with the system.
The training examples may be used in various ways to perform classification of the user's hand. For example, various machine learning techniques may be employed. Non-limiting examples include: support vector machine training, regression, nearest-neighbor methods, (unsupervised) clustering, and the like.
As described above, these classification techniques may use labeled depth image examples of various hand states to predict the likelihood that an observed hand is in one of a plurality of states. In addition, a confidence may be attached to the classification during or after the classification step. For example, a confidence interval may be assigned to the estimated hand state based on the training data, or by fitting a sigmoid or other suitable error function to the output of the classification step.
As a simple, non-limiting example of classifying hand states, suppose there are two possible hand states, open or closed, as shown at 310 of FIG. 3. In this example, the training data may include two sets of labeled hands: a first set of hand examples representing an open or nearly open hand state, and a second set representing a closed or nearly closed hand state. Given the extracted shape descriptor of an identified hand region, the descriptor may be compared against the first (open) and second (closed) sets to determine the likelihood that the hand region falls into each set. The state of the hand can then be estimated from the higher likelihood.
For example, as shown at 310 of FIG. 3, the identified hand region is determined to have a higher likelihood of being open and is classified accordingly. Further, in some examples, the determined likelihood that the identified hand is in a particular state may be used to establish a confidence interval for the hand state estimate.
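A sketch of this two-class comparison using a k-nearest-neighbor vote, one of the machine learning techniques listed above; the descriptor sets, distance metric, and k are illustrative assumptions:

```python
import numpy as np

def classify_hand(descriptor, open_examples, closed_examples, k=5):
    """Estimate 'open' vs. 'closed' from labeled descriptor examples and
    return the estimate with a crude likelihood usable as a confidence."""
    x = np.asarray(descriptor, dtype=float)
    examples = np.vstack([open_examples, closed_examples])
    labels = np.array([1] * len(open_examples) + [0] * len(closed_examples))
    dists = np.linalg.norm(examples - x, axis=1)   # distance to each example
    votes = labels[np.argsort(dists)[:k]]          # labels of k nearest
    p_open = votes.mean()
    return ("open" if p_open >= 0.5 else "closed"), max(p_open, 1.0 - p_open)
```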
Various post-classification filtering steps may be employed to improve the accuracy of hand state estimation; the method 200 may therefore include a filtering step at 211. For example, temporal consistency filtering, such as a low-pass filter, may be applied to the hand states predicted across successive depth image frames to smooth the predictions and reduce temporal jitter caused, e.g., by spurious hand motion, sensor noise, or occasional classification errors. That is, a plurality of states of the user's hand may be estimated from a plurality of depth images from the capture device or sensor, and the plurality of estimates temporally filtered to estimate the hand state. Further, in some examples, the classification results may be biased toward one state or another (e.g., toward open or closed), as some applications are more sensitive to false positives in one direction than the other.
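A sketch of such temporal smoothing as an exponential moving average, a simple low-pass filter, over per-frame open-hand probabilities, with a threshold that can be biased toward either state; the parameter values are illustrative assumptions:

```python
def smooth_hand_states(p_open_per_frame, alpha=0.3, open_threshold=0.5):
    """Low-pass filter per-frame probabilities to suppress jitter; raising
    open_threshold above 0.5 biases the result toward 'closed' (and vice
    versa) for applications more sensitive to one kind of false positive."""
    states, smoothed = [], None
    for p in p_open_per_frame:
        smoothed = p if smoothed is None else alpha * p + (1 - alpha) * smoothed
        states.append("open" if smoothed >= open_threshold else "closed")
    return states

# e.g., a one-frame classification blip gets smoothed away:
print(smooth_hand_states([0.9, 0.9, 0.2, 0.9, 0.9]))  # all 'open'
```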
Continuing with method 200, at 212 the method includes outputting a response based on the estimated hand state. For example, commands may be output to a console of a computing system, such as console 16 of computing system 10. As another example, the response may be output to a display device, such as display device 20. In this way, the estimated user motion, including the estimated hand state, may be translated into commands to the console 16 of the system 10 so that the user may interact with the system as described above. Further, the above-described methods or processes may be performed to determine a state estimate of any part of the user's body, such as the mouth, eyes, etc. For example, the posture of a user's body part may be estimated using the methods described above.
The methods and processes described herein may be incorporated into a variety of different types of computing systems. The computing system 10 described above is a non-limiting, example system that includes a game console 16, a display device 20, and a capture device 12. As another more general example, FIG. 4 schematically illustrates a computing system 400 that can perform one or more of the methods and processes described herein. Computing system 400 may take a variety of different forms, including, but not limited to, gaming consoles, personal computing systems, and audio/visual theaters, among others.
Computing system 400 may include a logic subsystem 402, a data-holding subsystem 404 operatively connected to the logic subsystem, a display subsystem 406, and/or a capture device 408. The computing system may optionally include components not shown in fig. 4, and/or some of the components shown in fig. 4 may be peripheral components that are not integrated into the computing system. Further, computing system 400 may be part of a network, such as a local area network or a wide area network.
Logic subsystem 402 may include one or more physical devices configured to execute one or more instructions. For example, logic subsystem 402 may be configured to execute one or more instructions that are part of one or more programs, routines, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more devices, or otherwise arrive at a desired result. The logic subsystem may include one or more processors configured to execute software instructions. Additionally or alternatively, logic subsystem 402 may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. Logic subsystem 402 may optionally include individual components that are distributed across two or more devices, which may be remotely located in some embodiments.
Data-holding subsystem 404 may include one or more physical devices configured to hold data and/or instructions executable by the logic subsystem to implement the herein described methods and processes. When such methods and processes are implemented, the state of data-holding subsystem 404 may be transformed (e.g., holding different data). Data-holding subsystem 404 may include removable media and/or built-in devices. Data-holding subsystem 404 may include optical memory devices, semiconductor memory and storage devices (e.g., RAM, EEPROM, flash memory, etc.), and/or magnetic storage devices, among others. Data-holding subsystem 404 may include devices with one or more of the following features: volatile, nonvolatile, dynamic, static, read/write, read-only, random access, sequential access, location addressable, file addressable, and content addressable. In some embodiments, logic subsystem 402 and data-holding subsystem 404 may be integrated into one or more common devices, such as an application specific integrated circuit or a system on a chip.
FIG. 4 also illustrates an aspect of the data-holding subsystem that may be used in the form of a computer-readable removable storage medium 416, such as a DVD, CD, floppy disk, and/or tape drive, to store and/or transfer data and/or instructions executable to implement the herein described methods and processes.
Display subsystem 406 may be used to present a visual representation of data held by data-holding subsystem 404. As the herein described methods and processes change the data held by the data-holding subsystem, and thus transform the state of the data-holding subsystem, the state of display subsystem 406 may similarly be transformed to visually represent changes in the underlying data. Display subsystem 406 may include one or more display devices using virtually any type of technology. Such display devices may be combined in a shared enclosure with logic subsystem 402 and/or data-holding subsystem 404, or such display devices may be peripheral display devices.
Computing system 400 also includes a capture device 408 configured to obtain depth images of one or more targets and/or scenes. Capture device 408 may be configured to capture video with depth information via any suitable technique (e.g., time-of-flight, structured light, stereo image, etc.). As such, capture device 408 may include a depth camera, a video camera, a stereo camera, and/or other suitable capture devices.
For example, in time-of-flight analysis, the capture device 408 may emit infrared light into the scene, and then use a sensor to detect backscattered light from the scene surface. In some cases, pulsed infrared light may be used, where the time difference between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device to a particular location in the scene. In some cases, the phase of the outward light waves and the phase of the inward light waves may be compared to determine a phase offset, which may be used to determine a physical distance from the capture device to a particular location in the scene.
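As a worked sketch of the phase-offset calculation, using the standard continuous-wave time-of-flight relationship rather than any patent-specific values:

```python
import math

C = 299_792_458.0  # speed of light, m/s

def tof_distance(phase_shift_rad, modulation_freq_hz):
    """One-way distance from the phase offset between outgoing and incoming
    modulated light. The light travels out and back, so d = c*phi/(4*pi*f);
    the result is unambiguous only within c / (2*f)."""
    return C * phase_shift_rad / (4.0 * math.pi * modulation_freq_hz)

# e.g., a pi/2 phase shift at 30 MHz modulation corresponds to about 1.25 m
print(tof_distance(math.pi / 2, 30e6))
```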
In another example, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device to a particular location in the scene by analyzing the intensity variation of the reflected beam of light over time via techniques such as shuttered light pulse imaging.
In another example, the capture device 408 may utilize structured light analysis to capture depth information. In this analysis, patterned light (e.g., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the scene. On the surface of the scene, the pattern becomes distorted, and this distortion of the pattern can be studied to determine the physical distance from the capture device to a particular location in the scene.
In another example, a capture device may include two or more physically separated cameras that view a scene from different angles to obtain viewable stereo data. In these cases, the visual stereo data may be decomposed to generate a depth image.
In other embodiments, capture device 408 may utilize other techniques to measure and/or calculate depth values.
In some embodiments, two or more cameras may be integrated into one capture device. For example, a depth camera and a video camera (e.g., an RGB video camera) may be integrated into a common capture device. In some embodiments, two or more separate capture devices may be used in conjunction; for example, a depth camera and a separate video camera. When a video camera is used, it may provide target tracking data, validation data for scene analysis, image capture, face recognition, high-precision finger (or other small feature) tracking, light sensing, and/or error correction, among other functions. In some embodiments, two or more depth and/or RGB cameras may be placed on different sides of a subject to obtain a more complete 3D model of the subject, or to further improve the resolution of observations around the hand. In other embodiments, a single camera may be used, for example, to obtain RGB images, and the images may be segmented based on color, for example, the color of the hands.
It should be understood that at least some of the depth analysis operations may be performed by the logic machine of one or more capture devices. The capture device may include one or more on-board processing units configured to perform one or more depth analysis functions. The capture device may include firmware to help update such on-board processing logic.
For example, computing system 400 may also include various subsystems configured to execute one or more instructions that are part of one or more programs, routines, objects, components, data structures, or other logical constructs. Such subsystems may be operatively connected to logic subsystem 402 and/or data-holding subsystem 404. In some examples, these subsystems may be implemented as software stored on a computer-readable storage medium, which may or may not be removable.
For example, the computing system 400 may include an image segmentation subsystem 410 configured to identify regions in the depth image that correspond to the hand, such identification based at least in part on the skeletal information. The computing system 400 may further include a descriptor extraction subsystem 412 configured to extract shape descriptors for the regions identified by the image segmentation subsystem 410. The computing system 400 may also include a classifier subsystem 414 configured to classify shape descriptors based on training data to estimate hand state.
It is to be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, or in some cases omitted. Also, the order of the above-described processes may be changed.
It should be understood that the examples described herein to detect open and closed hands are exemplary in nature and are not intended to limit the scope of the present invention. The methods and systems described herein may be applied to estimate various refined poses in depth images. For example, various other hand contours may be estimated using the systems and methods described herein. Non-limiting examples include: fist posture, open palm posture, finger pointing, etc.
The subject matter of the inventions includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims (9)

1. A method (200) for estimating a pose of a body part of a user, comprising:
receiving (202) an image (28) from a sensor (12), the image (28) comprising at least a portion of an image of a user (14), the at least a portion containing the body part;
estimating (204) skeletal information of the user (14) from the image (28);
identifying (206) a region (308) in the image (28) corresponding to the body part, the identifying based at least in part on the skeletal information;
extracting (208) a shape descriptor of the region (308);
classifying (210) the shape descriptor based on training data to estimate a pose of the body part;
estimating a plurality of poses of the body part from a plurality of images from the sensor, and temporally filtering the plurality of estimates to estimate the pose of the body part; and
outputting (212) a response according to the estimated pose of the body part.
2. The method of claim 1, further comprising estimating a body size scaling factor based on at least a distance between joints in the skeletal information and normalizing the shape descriptors based on the body size scaling factor.
3. The method of claim 2, wherein identifying the region in the image corresponding to the body part comprises using a topological search based on the body size scaling factor.
4. The method of claim 1, wherein identifying the region in the image corresponding to the body part is based at least in part on a flood filling method.
5. The method of claim 1, wherein the body part is a hand, and estimating the pose of the body part comprises estimating whether the hand is open or closed.
6. The method of claim 1, further comprising assigning a confidence interval to the estimated pose of the body part.
7. The method of claim 1, wherein classifying the shape descriptor based on training data to estimate the pose of the body part is based on at least one machine learning technique.
8. The method of claim 1, wherein the body part is a hand and the training data is partitioned based on metadata, the metadata including at least one of: hand orientation, lower arm angle, lower arm orientation, depth, and user body size.
9. A system for estimating a pose of a body part of a user, comprising:
means for receiving (202) an image (28) from a sensor (12), the image (28) comprising at least a portion of an image of a user (14), the at least a portion containing the body part;
means for estimating (204) skeletal information of the user (14) from the image (28) to obtain a virtual skeleton, the virtual skeleton comprising a plurality of joints;
means for identifying (206) a region (308) in the image (28) corresponding to the body part, the identifying based at least in part on the skeletal information;
means for extracting (208) a shape descriptor of the region (308);
means for classifying (210) the shape descriptor based on training data to estimate a pose of the body part;
means for estimating a plurality of poses of the body part from a plurality of images from the sensor, and for temporally filtering the plurality of estimates to estimate the pose of the body part; and
means for outputting (212) a response based on the estimated pose of the body part.
HK12112337.3A 2010-12-28 2012-11-29 Classification of posture states HK1171543B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/979,897 2010-12-28
US12/979,897 US8488888B2 (en) 2010-12-28 2010-12-28 Classification of posture states

Publications (2)

Publication Number Publication Date
HK1171543A1 (en) 2013-03-28
HK1171543B (en) 2015-07-10
