
HK1192790B - Method and system of inferring spatial object descriptions from spatial gestures - Google Patents


Info

Publication number
HK1192790B
HK1192790B (application HK14105756.7A)
Authority
HK
Hong Kong
Prior art keywords
model
spatial
hand
participant
integrated
Prior art date
Application number
HK14105756.7A
Other languages
Chinese (zh)
Other versions
HK1192790A1 (en)
Inventor
A. D. Wilson
C. Holz
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Priority claimed from US 13/098,387 (US 8,811,719 B2)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1192790A1
Publication of HK1192790B


Description

Method and system for inferring spatial object descriptions from spatial gestures
Background
For many people, it can be challenging to describe the shape and size of an object. For example, in a conversation, many people may use gestures to help describe shapes, especially when it may be cumbersome to use only verbal expressions. For example, the roofline of a new car may be expressed by a sudden downward sweep of an outstretched hand, or a particular chair type may be indicated to a store owner by a series of gestures that describe an arrangement of surfaces unique to that particular chair design. In such cases, the person expressing the information often appears to trace a contour of the three-dimensional (3-D) shape of the object being described. The listener may watch the gestures carefully and attempt to recreate the 3-D shape in his/her own mind.
Stores and warehouses may welcome shoppers with branding and/or customer service representatives who provide assistance with regard to inventory. For example, a customer seeking a chair may request a brand name or type, which the customer service representative may type into a keyboard attached to the warehouse inventory system to receive information about the in-store location of the requested item, or an indication that the requested item is unavailable. If the shopper does not know or remember the brand name or type name/number, the customer may attempt to describe the desired item to a customer service representative to determine whether the representative can recall having seen such an item in the inventory.
Many gaming environments provide players with the option of calling a particular object into a game. For example, a player of a war game may request specific weapons, such as bows and arrows, nunchakus, brass knuckles, or various types of guns and cannons. These items may be programmed into the game prior to distribution to the customer. As another example, a virtual community game may provide players with a selection of items that they may incorporate into their particular desired virtual environment. For example, a user may build a dinosaur park by selecting from a large variety of dinosaurs and cages, as well as food and cleaning supplies, all of which may be preprogrammed into the game prior to distribution. For example, a user may select a desired item by viewing a list of game inventory items and clicking, touching, or pointing to the desired item via an input device.
Disclosure of Invention
According to one general aspect, a spatial object management engine may include a database access engine configured to initiate access to a database including a plurality of database objects, each database object associated with a predefined three-dimensional (3-D) model that simulates an appearance of a predetermined 3-D item. The spatial object management engine may also include an image data receiving engine configured to receive 3-D spatial image data associated with at least one arm motion of the participant based on free-form movement of at least one hand of the participant based on natural gesture motions. The spatial object management engine may also include an integrated model generator configured to generate an integrated 3-D model based on integrating temporally continuous 3-D representations of 3-D locations of at least one hand from the received 3-D spatial image data. The spatial object management engine may further include a matching engine configured to select, by the spatial object handler, at least one of the predetermined 3-D items based on accessing the database access engine and determining at least one database object associated with at least one of the predefined 3-D models that matches the integrated 3-D model.
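For purposes of illustration only, the engine decomposition described in this aspect might be sketched in Python as follows. All class, method, and field names (e.g., SpatialObjectManagementEngine, Model3D, DatabaseObject) and the simple voxel-overlap matching criterion are assumptions made for the sketch and are not taken from any particular embodiment.

```python
# Illustrative sketch only: a minimal decomposition mirroring the engines named
# above. Names and the matching criterion are assumptions of this sketch.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Voxel = Tuple[int, int, int]

@dataclass
class Model3D:
    """A 3-D model represented as a set of occupied voxel coordinates."""
    voxels: Dict[Voxel, float] = field(default_factory=dict)

@dataclass
class DatabaseObject:
    """A database object associated with a predefined 3-D model of an item."""
    item_name: str
    model: Model3D

class SpatialObjectManagementEngine:
    def __init__(self, database: List[DatabaseObject]):
        self.database = database                      # database access engine role

    def generate_integrated_model(self, hand_voxel_frames) -> Model3D:
        """Integrated model generator role: accumulate temporally continuous
        hand locations (one voxel list per time instance) into one model."""
        model = Model3D()
        for frame in hand_voxel_frames:
            for voxel in frame:
                model.voxels[voxel] = model.voxels.get(voxel, 0.0) + 1.0
        return model

    def match(self, integrated: Model3D) -> Optional[DatabaseObject]:
        """Matching engine role: pick the database object whose predefined
        model shares the most occupied voxels with the integrated model."""
        def overlap(obj: DatabaseObject) -> int:
            return len(set(obj.model.voxels) & set(integrated.voxels))
        return max(self.database, key=overlap, default=None)
```

In such a sketch, an engine constructed from a list of DatabaseObject entries could integrate per-frame hand voxel lists and return the stored item that overlaps the integrated model most closely.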
According to another aspect, a computer program product tangibly embodied on a computer-readable medium may include executable code that, when executed, is configured to cause at least one data processing apparatus to receive three-dimensional (3-D) spatial image data associated with at least one arm motion of a participant based on free-form movement of at least one hand of the participant based on natural gesture motion of the at least one hand. Furthermore, the data processing apparatus may determine, based on the received 3-D spatial image data, a plurality of consecutive 3-D spatial representations, each 3-D spatial representation comprising 3-D spatial mapping data corresponding to 3-D poses and positions of the at least one hand at consecutive time instances during the free-form movement. Further, based on progressively integrating the 3-D spatial mapping data included in the determined sequential 3-D spatial representation and comparing a threshold time value to a model time value, the data processing apparatus may, with the spatial object processor, generate an integrated 3-D model, wherein the model time value indicates a number of instances of time spent by at least one hand occupying the plurality of 3-D spatial regions during the free-form movement.
According to another aspect, a computer program product tangibly embodied on a computer-readable medium may include executable code that, when executed, is configured to cause at least one data processing apparatus to receive three-dimensional (3-D) sensory data associated with at least one natural gesture of a participant based on a free-form movement of the participant based on natural gesture actions that emulate an appearance of a predetermined three-dimensional (3-D) item. Furthermore, the data processing apparatus may generate an integrated 3-D model based on integrating the received 3-D sensory data, the 3-D sensory data representing a 3-D positioning of at least one 3-D moving object associated with the participant according to the free-form movement. Further, the data processing apparatus, via the spatial object handler, may determine a predefined 3-D model associated with the database object that matches the integrated 3-D model.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1a is a block diagram of an example system for spatial object management.
FIG. 1b is a block diagram of a view of a portion of the example system of FIG. 1a.
FIGS. 2a-2d are flow diagrams illustrating example operations of the system of FIGS. 1a-1b.
FIGS. 3a-3c are flow diagrams illustrating example operations of the system of FIGS. 1a-1b.
FIG. 4 is a flow chart illustrating example operations of the system of FIGS. 1a-1b.
FIGS. 5a-5e illustrate example gesture movements and example models associated with the gesture movements.
FIG. 6 illustrates an example participant gesturing in close proximity to a camera.
FIG. 7 illustrates an example three-dimensional (3-D) item.
FIGS. 8a-8c illustrate example gestures of example participants.
FIG. 9 shows example hand gestures of a participant.
FIGS. 10a-10b illustrate example hand gestures of a participant.
FIGS. 11a-11d show graphical views of example processing of image data according to the example system of FIGS. 1a-1b.
FIG. 12 illustrates example overlay results of matching the generated 3-D model to a predetermined 3-D model.
Detailed Description
When talking about or describing physical objects, speakers often use gestures. For example, such gestures help speakers convey characteristics of shapes that are difficult to describe verbally. The techniques described herein may be used, for example, to provide gesture-based description functionality for generating three-dimensional (3-D) models that may emulate the appearance of a 3-D item envisioned by a gesturing human. For example, a customer at a retail store may wish to determine whether the store has a particular item in its current inventory. Using the techniques described herein, the customer may, within range of a sensing device (e.g., a depth camera), gesture with his/her hands (or another object), or trace with a hand, a description of a 3-D item (e.g., spatially describing the 3-D item), and an example system may generate a 3-D model based on the customer's gestures. If desired, the generated model may then be matched against predetermined 3-D models, for example, in an inventory database, to determine one or more predefined items that most closely match the generated 3-D model. For example, the system may then present the closest matches, along with their locations, to the customer or a store employee.
If a store employee or manager wants to add another inventory object to the inventory database (e.g., for later selection), he/she may provide a pre-constructed 3-D model to the database, or may present a physical 3-D object to a sensing device (e.g., a depth camera) so that a 3-D model may be generated and added to the inventory database for later retrieval.
As another example, a person participating in an electronic gaming activity may want to have a particular gaming object summoned into the gaming experience. He/she may spatially describe one of the game objects (e.g., by gesture, or by data miming, discussed further below) so that the example system may generate a 3-D model based on the person's gestures. For example, the system may then search a database of predetermined 3-D models associated with predetermined game objects to determine one or more predefined items that most closely match the generated 3-D model. For example, the game may customize the object to the size indicated by the player's gestures.
If the person wishes to add another game object to the game object database (e.g., for later selection), the person may provide a pre-constructed 3-D model to the database, or may present a physical 3-D object to a sensing device (e.g., a depth camera) so that a 3-D model may be generated and added to the game database for later retrieval.
In another example, a person participating in a virtual environment activity may want to summon a 3-D virtual object for use in the virtual environment. Similar to the previous example, he/she may gesture, or trace with his/her hands, a description of a desired 3-D object, and the example system may generate a 3-D model based on his/her gestures. The generated 3-D model may then be used to search a database of predetermined virtual environment objects to determine one or more matches. For example, a child may want to set up a virtual playhouse by calling in predetermined house structures and predetermined household items such as furniture. For example, the child may spatially describe a table (e.g., by gesturing or tracing with a hand), and the system may search the virtual environment database for a matching pre-defined 3-D object.
For example, a user may add a new virtual environment object to the virtual environment database by providing a pre-constructed 3-D model to the database, or presenting a physical 3-D object to a sensing device (e.g., a depth camera), so that a 3-D model may be generated and added to the virtual environment database for later retrieval.
As discussed further herein, FIG. 1a is a block diagram of an example system 100 for spatial object management. FIG. 1b is a block diagram of a more detailed view of portions of the example system of FIG. 1a.
As shown in FIGS. 1a-1b, spatial object management engine 102 can include a sensory data receiving engine 104 that can be configured to receive sensory data 106. For example, sensory data receiving engine 104 may receive sensory data 106 from one or more sensory devices. The memory 108 may be configured to store information including the sensory data 106. For example, sensory data 106 may include image data 110 received from an image data input device 112, audio data 114 received from an audio data input device 116, and/or haptic data 118 received from a haptic data input device 120. For example, image data input device 112 may include a three-dimensional (3-D) image data device that may be configured to obtain 3-D spatial image data. For example, the image data input device 112 may include a depth camera that may be configured to acquire the image data 110 including depth values. As another example, the image data input device 112 may include one or more cameras configured to obtain image data 110 representing stereoscopic images corresponding to 3-D shapes. In this context, "memory" may include a single memory device or multiple memory devices configured to store data and/or instructions. Further, the memory 108 may span multiple distributed storage devices.
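For purposes of illustration only, the sensory data 106 described above might be represented as a simple per-frame container; the field names and types below are assumptions of the sketch, not the structure of any particular embodiment.

```python
# Illustrative container for the sensory data 106 (image 110, audio 114,
# haptic 118); field names and types are assumptions of this sketch.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class SensoryFrame:
    timestamp: float
    depth_image: Optional[np.ndarray] = None    # H x W depth values from a depth camera
    audio_samples: Optional[np.ndarray] = None   # raw audio for verbal indicators
    haptic_points: List[Tuple[float, float, float]] = field(default_factory=list)

class SensoryDataReceivingEngine:
    """Receives frames from sensing devices and stores them (the memory 108 role)."""
    def __init__(self) -> None:
        self.frames: List[SensoryFrame] = []

    def receive(self, frame: SensoryFrame) -> None:
        self.frames.append(frame)
```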
The user interface engine 122 may be configured to manage communications between the user 124 and the spatial object management engine 102. For example, a store employee or system administrator (e.g., user 124) may communicate with the spatial object management engine 102 through the user interface engine 122. The network communication engine 126 may be configured to manage communications between the spatial object management engine 102 and other entities that may communicate with the spatial object management engine 102 over one or more networks.
For example, the display 128 may provide visual, audio, and/or tactile media for the user 124 (e.g., a store employee or system administrator) to monitor his/her inputs to the spatial object management engine 102 and responses from the spatial object management engine 102. For example, user 124 may provide input via a touch pad, touch screen, keyboard or keypad, mouse device, trackball device, or audio input device or other input sensing device. For example, user 124 may speak information that may be converted to character format via speech recognition processing.
According to an example embodiment, sensory data receiving engine 104 may include an image data receiving engine 130 that may be configured to receive 3-D spatial image data associated with at least one arm action of a participant 132 based on free-form movement of at least one hand of the participant based on natural gesture actions. For example, the image data reception engine 130 may receive 3-D spatial image data from the image data input device 112, and the image data input device 112 may include a depth camera focused on 3-D space that may be partially occupied by the participant 132.
In this context, a "natural gesture" may include a gesture made by a participant substantially without advance instructions as to how the gesture is to be made, and substantially without a predetermined order of any particular gesture. Thus, a "natural gesture" may include a gesture determined only by the participant in any timed or sequenced form of participant selection. In addition, "natural gestures" may include elements such as the height, width, and depth, as well as the shape and positioning of various components of an object (e.g., table tops, legs, chair backs, seats, chair legs).
The spatial object management engine 102 may include a sensory data analysis engine 134, the sensory data analysis engine 134 including a spatial representation engine 136 (shown in FIG. 1 b), the spatial representation engine 136 may be configured to determine a plurality of sequential 3-D spatial representations 138 based on the received 3-D spatial image data, each 3-D spatial representation including 3-D spatial mapping data corresponding to 3-D poses and positions of at least one hand at successive time instances during the freeform movement.
The spatial object management engine 102 may include an integrated model generator 140, and the integrated model generator 140 may be configured to generate, via the spatial object processor 142, an integrated 3-D model 144 based on progressively integrating the 3-D spatial mapping data included in the determined ordered 3-D spatial representation 138 and comparing a threshold time value 146 to a model time value, wherein the model time value indicates a number of time instances spent by at least one hand occupying the plurality of 3-D spatial regions during the free-form movement. According to an example embodiment, the integrated model generator 140 may be configured to generate an integrated 3-D model 144 based on integrating temporally continuous 3-D representations of 3-D locations of at least one hand from the received 3-D spatial image data 110.
In this context, a "processor" may include a single processor or multiple processors configured to process instructions associated with a processing system. A processor may thus include multiple processors that process instructions in a parallel manner and/or a distributed manner.
In this context, "integration" may include substantially pure integration or aggregation of the positioning of a hand or other object without a particular order or timing of positioning, and without particular predefined movements associated with particular elements of any predefined 3-D model. For example, there may be substantially no intended advance input or training to be associated with integrated positioning.
According to an example embodiment, the spatial object management engine 102 may include an initialization engine 148 configured to initialize a virtual 3-D mapping space based on discretized 3-D virtual mapping elements represented as volumetric elements, each volumetric element including a weight value initialized to an initial value, wherein the virtual 3-D mapping space represents the 3-D space in close proximity to the participant 132. For example, the integrated model 144 may include a virtual 3-D mapping space.
According to an example embodiment, the integrated model generator 140 may include an element activation engine 150 (as shown in FIG. 1b), the element activation engine 150 configured to proportionally increment a weight value of a selected volumetric element associated with a 3-D region of the 3-D space based on a determination indicating that a portion of at least one hand occupies the 3-D region for a period of time during the free-form movement. According to an example embodiment, the integrated model generator 140 may include a threshold comparison engine 152 configured to compare the threshold 146 to a weight value for each of the volumetric elements.
According to an example embodiment, the integrated model generator 140 may include a position attribute engine 154 and a virtual element locator 156, the position attribute engine 154 configured to determine a depth, position, and orientation of the at least one hand, the virtual element locator 156 configured to determine a position of the volumetric element associated with a virtual 3-D mapping space, the virtual 3-D mapping space corresponding to the depth, position, and orientation of the at least one hand. According to an example embodiment, the element activation engine 150 may be configured to activate a plurality of volumetric elements associated with a region of the virtual 3-D mapping space representing depths, positions, and orientations corresponding to the depth, position, and orientation of the at least one hand based on the position determined by the virtual element locator 156.
According to an example embodiment, the volumetric elements may comprise volume picture elements (voxels). According to an example embodiment, the initialization engine 148 may be configured to initialize voxels to an inactive state with initialized weight values. According to an example embodiment, the element activation engine 150 may be configured to activate the groups of voxels based on increasing a weight value associated with the groups of voxels with each activation of the groups of voxels based on a determination indicating that a portion of at least one hand occupies the 3-D region for a period of time during the free-form movement. In this context, "voxel" may represent the smallest resolvable box-like portion of a 3-D image.
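For purposes of illustration only, the voxel accumulation performed by the initialization engine 148, element activation engine 150, and threshold comparison engine 152 might be sketched as follows; the grid shape, increment rate, and threshold value are assumed for the sketch and are not values from the embodiments.

```python
# Minimal sketch of weighted voxel activation with a threshold test.
# Grid shape, increment rate, and threshold are assumed values.
import numpy as np

class VoxelGrid:
    def __init__(self, shape=(128, 128, 128), threshold=5.0):
        # Initialization: every voxel starts inactive with an initial weight of 0.
        self.weights = np.zeros(shape, dtype=np.float32)
        self.threshold = threshold

    def activate(self, voxel_indices, dwell_time=1.0, rate=1.0):
        """Proportionally increment the weights of the voxels occupied by a
        portion of the hand (or by an enclosed space) at one time instance.

        voxel_indices: (N, 3) integer array of voxel coordinates.
        dwell_time: time spent occupying the region, scaling the increment.
        """
        ix, iy, iz = voxel_indices[:, 0], voxel_indices[:, 1], voxel_indices[:, 2]
        self.weights[ix, iy, iz] += rate * dwell_time

    def integrated_model(self):
        """Threshold comparison: keep only voxels whose accumulated weight
        exceeds the threshold, yielding a boolean occupancy volume."""
        return self.weights >= self.threshold
```

In such a sketch, calling activate() once per time instance with the voxels covered by the hand, and then integrated_model(), yields an occupancy volume in which only regions where the hand dwelled long enough remain active.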
According to an example embodiment, the integrated model generator 140 may include an enclosure detection engine 158 configured to determine an enclosure space indicated by the pose of the at least one hand. For example, the enclosure detection engine 158 may determine an enclosed region within the closed fist of the participant 132.
According to an example embodiment, the integrated model generator 140 may include a depth determination engine 160, the depth determination engine 160 determining a depth of the bounding space based on a determination of a depth of a region surrounding the bounding space. According to an example embodiment, the element activation engine 150 may be configured to activate, for a period of time associated with the gesture indicating the enclosure space, a plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position and orientation corresponding to the enclosure space instead of activating a plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position and orientation corresponding to the at least one hand. For example, the element activation engine 150 may activate a plurality of volumetric elements associated with a region of virtual 3-D space representing a depth, position, and orientation corresponding to the bounding space associated with the closed fist of the participant 132 in place of a region corresponding to the volume occupied by the hand in the closed fist orientation.
According to an example embodiment, the sensory data analysis engine 134 may include a volume determination engine 162, and the volume determination engine 162 may be configured to determine a volume associated with one hand of the participant based on the received 3-D spatial image data 110. According to an example embodiment, the spatial representation engine 136 may be configured to determine a 3-D representation of one hand in the 3-D virtual mapping space based on the determined volume.
According to an example embodiment, the sensory data analysis engine 134 may include a pose determination engine 164, and the pose determination engine 164 may be configured to determine a pose of at least one hand based on the 3-D representation of the one hand.
According to an example embodiment, the enclosure detection engine 158 may be configured to determine that at least one pose of the at least one hand is indicative of a 3-D enclosure space. According to an example embodiment, the integrated model generator 140 may be configured to: if it is determined that an enclosed 3-D space is indicated, for successive time instances associated with the pose indicating the enclosure space, activate a plurality of volumetric elements associated with a portion of the integrated 3-D model representing a depth, position, and orientation corresponding to the enclosure space; otherwise, activate a plurality of volumetric elements associated with the portion of the integrated 3-D model representing the depth, position, and orientation corresponding to the at least one hand. For example, the element activation engine 150 may be configured to activate the volumetric elements corresponding to regions within a clenched fist, as discussed above.
According to an example embodiment, the pose determination engine 164 may be configured to determine that the at least one hand pose includes one or more of a flat hand pose, a curved hand pose, a clenched fist, or a finger-pinch pose. For example, a single hand may be clenched into a fist, or two hands may be juxtaposed to form an enclosed space, as discussed further below.
According to an example embodiment, the enclosure detection engine 158 may be configured to determine an enclosure indicated by at least one pose of the at least one hand, and the depth determination engine 160 may be configured to determine a depth of the enclosure based on a determination of a depth of a region surrounding the enclosure. According to an example embodiment, the spatial representation engine 136 may be configured to determine a plurality of sequential 3-D spatial representations, each 3-D spatial representation including 3-D spatial mapping data corresponding to the depth, position, and orientation of the enclosure, in place of 3-D spatial data corresponding to the pose and position of the at least one hand, during successive time instances associated with gesturing of the at least one hand that indicates the enclosure.
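For purposes of illustration only, the substitution of an enclosure space for the hand volume might be sketched as follows; the pose labels and the use of a median of the surrounding depths are assumptions of the sketch rather than the exact determinations of the embodiments.

```python
# Sketch of choosing which voxels to activate at one time instance, substituting
# the enclosed region when the pose indicates an enclosure (e.g., a clenched fist).
import numpy as np

def enclosure_depth(surrounding_depths):
    """Assign the enclosure the depth observed in the region around it
    (a median is used here as a simple proxy)."""
    return float(np.median(surrounding_depths))

def voxels_to_activate(hand_voxels, enclosure_voxels, pose):
    """hand_voxels / enclosure_voxels: (N, 3) voxel coordinate arrays;
    pose: a label produced upstream by a hand-pose classifier."""
    if pose in ("fist", "pinch"):   # pose indicates an enclosed space
        return enclosure_voxels
    return hand_voxels
```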
According to an example embodiment, the spatial object management engine 102 may include a matching engine 166, and the matching engine 166 may be configured to determine, by the spatial object handler 142, predefined 3-D models 168a, 168b, 168c associated with database objects 170a, 170b, 170c that match the integrated 3-D model 144, wherein the natural gesture actions may emulate the appearance of a predefined three-dimensional (3-D) item. For example, database objects 170a, 170b, 170c may be stored in association with database 172. For example, the predefined models 168a, 168b, 168c may represent physical 3-D objects.
According to an example embodiment, volume determination engine 162 may be configured to determine a volume associated with one hand of participant 132 based on tracking a viewable portion of the one hand over time according to the received 3-D spatial image data 110. For example, the tracking engine 174 may be configured to track the received image data 110 over time. The tracking engine 174 may receive and store tracking data 175 in the memory 108. For example, tracking data 175 may include timing data associated with instances of the received 3-D spatial image data.
According to an example embodiment, the position attribute engine 154 may be configured to determine a yaw angle of a hand based on a rotation of a tracked visual portion of the hand in a top view based on the received 3-D spatial image data.
According to an example embodiment, the position attribute engine 154 may be configured to determine a roll angle and a pitch angle of a hand based on changes in depth values associated with the tracked visual portions.
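For purposes of illustration only, the yaw, roll, and pitch estimates described above might be approximated as follows, using the principal axis of the top-view silhouette for yaw and depth gradients for roll and pitch; these formulas are simple proxies assumed for the sketch, not the exact computation of the embodiments.

```python
# Rough orientation proxies: yaw from the dominant axis of the top-view
# silhouette, roll and pitch from how depth changes across that silhouette.
import numpy as np

def estimate_orientation(points):
    """points: (N, 3) array of (x, y, depth) samples belonging to the hand."""
    xy = points[:, :2] - points[:, :2].mean(axis=0)
    # Yaw: direction of the dominant axis of the top-view silhouette.
    _, _, vt = np.linalg.svd(xy, full_matrices=False)
    yaw = np.arctan2(vt[0, 1], vt[0, 0])

    # Roll and pitch: slopes of depth across and along the hand silhouette.
    depth = points[:, 2]
    roll = np.arctan2(np.polyfit(points[:, 0], depth, 1)[0], 1.0)
    pitch = np.arctan2(np.polyfit(points[:, 1], depth, 1)[0], 1.0)
    return yaw, roll, pitch
```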
According to an example embodiment, the spatial object management engine 102 may include a database access engine 176, the database access engine 176 configured to initiate access to a database 172 including a plurality of database objects 170a, 170b, 170c, each database object 170a, 170b, 170c associated with a predefined 3-D model 168a, 168b, 168c that simulates the appearance of a predetermined 3-D item.
According to an example embodiment, the matching engine 166 may be configured to select, via the spatial object handler 142, at least one predetermined 3-D item based on accessing the database access engine 176 and determining at least one database object 170a, 170b, 170c associated with a predefined 3-D model 168a, 168b, 168c matching the integrated 3-D model 144. For example, the matching engine 166 may select one or more predefined 3-D models 168a, 168b, 168c that most closely match the generated integrated model 144.
According to an example embodiment, the spatial object management engine 102 may include an update item input engine 178, the update item input engine 178 configured to obtain an updated 3-D model 180 that simulates the appearance of a predefined updated 3-D item, and to initiate storage of an updated database object associated with the updated 3-D model 180 in the database 172 via the database access engine 176.
According to an example embodiment, the predefined updated 3-D items may include one or more of 3-D inventory items, 3-D game objects, 3-D real world items, or 3-D virtual reality environment objects.
According to an example embodiment, the update item input engine 178 may be configured to obtain the updated 3-D model 180 based on one or more of receiving, via an input device, image data 110 associated with a predefined picture of the updated 3-D item or receiving the updated 3-D model 180. For example, user 124 may present the physical object to a sensing device (e.g., image data input device 112) for generation of the model, or user 124 may provide an already generated model that simulates the appearance of the physical object for inclusion in database 172 as a predefined 3-D model 168.
According to an example embodiment, spatial object management engine 102 may include an audio data reception engine 182 configured to receive audio data 114 associated with at least one verbal indicator representative of an utterance. According to an example embodiment, the matching engine 166 may be configured to select, based on the verbal indicator, by the spatial object handler 142, at least one predetermined 3-D item based on accessing the database access engine 176 and determining at least one database object 170a, 170b, 170c associated with at least one of the predefined 3-D models 168a, 168b, 168c that matches the integrated 3-D model 144. For example, participant 132 may speak the utterance "chair," so that matching engine 166 may remove items that are not related to "chair" from consideration in the matching operation.
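For purposes of illustration only, the use of a verbal indicator to narrow the matching candidates might be sketched as follows; the category attribute on database objects and the upstream speech-recognition step are assumptions of the sketch.

```python
# Sketch of using a recognized verbal indicator (e.g., the spoken word "chair")
# to remove unrelated database objects before shape matching.
def filter_by_verbal_indicator(database_objects, spoken_word):
    word = spoken_word.strip().lower()
    candidates = [obj for obj in database_objects
                  if word in getattr(obj, "category", "").lower()]
    # If the indicator eliminates every object, fall back to the full set.
    return candidates or list(database_objects)
```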
According to an example embodiment, matching engine 166 may include a preliminary alignment engine 184, the preliminary alignment engine 184 configured to generate a first alignment 186 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144 based on scaling, translating, and rotating the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on matching at least one component included in the one of the predefined 3-D models 168a, 168b, 168c with the integrated 3-D model 144. According to an example embodiment, matching engine 166 may include an iterative alignment engine 188, the iterative alignment engine 188 configured to generate a second alignment 190 of the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on the first alignment 186, based on an iterative closest point algorithm.
According to an example embodiment, the matching engine 166 may include a brute force alignment engine 192, the brute force alignment engine 192 configured to generate the second alignment 190 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on the first alignment 186, based on a brute force alignment including a plurality of scalings, rotations, and translations of the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144.
Based on the alignments 186, 190, the matching engine 166 may select at least one selected model 194 from the predefined 3-D models 168a, 168b, 168c.
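For purposes of illustration only, the two-stage matching described above (a first, coarse alignment followed by an iterative-closest-point refinement) might be sketched as follows; representing the predefined 3-D models as point clouds and scoring by mean residual distance are assumptions of the sketch, and this baseline stands in for only one of many possible matching techniques.

```python
# Illustrative baseline: coarse centroid/scale alignment, a simple ICP loop,
# then selection of the stored model with the smallest residual.
import numpy as np
from scipy.spatial import cKDTree

def coarse_align(src, dst):
    """First alignment: move both point clouds to a common centroid and
    roughly match their overall scale."""
    src = src - src.mean(axis=0)
    dst = dst - dst.mean(axis=0)
    scale = (np.linalg.norm(dst, axis=1).mean() /
             max(np.linalg.norm(src, axis=1).mean(), 1e-9))
    return src * scale, dst

def icp_residual(src, dst, iterations=20):
    """Second alignment: refine rotation/translation with a basic ICP loop and
    return the mean nearest-neighbor distance as a match score."""
    tree = cKDTree(dst)
    for _ in range(iterations):
        _, idx = tree.query(src)
        matched = dst[idx]
        # Best-fit rotation via SVD of the cross-covariance (Kabsch method).
        h = (src - src.mean(axis=0)).T @ (matched - matched.mean(axis=0))
        u, _, vt = np.linalg.svd(h)
        r = vt.T @ u.T
        if np.linalg.det(r) < 0:          # avoid a reflection
            vt[-1] *= -1
            r = vt.T @ u.T
        src = (src - src.mean(axis=0)) @ r.T + matched.mean(axis=0)
    return float(tree.query(src)[0].mean())

def select_best_match(integrated_points, predefined_models):
    """predefined_models: dict mapping an item name to an (N, 3) point cloud."""
    scores = {name: icp_residual(*coarse_align(integrated_points.copy(),
                                               points.copy()))
              for name, points in predefined_models.items()}
    return min(scores, key=scores.get)
```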
Those skilled in the art of data processing will appreciate that there are many techniques for matching a 3-D model against 3-D models stored in a database. For example, partial matching of 3-D objects may be provided by example modeling techniques based on individual portions of the compared objects.
Spatial object management engine 102 can include a haptic data reception engine 196 that can be configured to receive haptic data 118 from haptic data input device 120. For example, sensors may be attached to participant 132 and movement of participant 132 may be sensed as 3-D spatial sensory data. For example, if a sensor is attached to a participant's hand, the 3-D location of the hand may be sensed by the haptic data input device 120, received by the haptic data reception engine 196, and processed by the integrated model generator 140, similarly to the 3-D spatial image data discussed above.
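For purposes of illustration only, sensed 3-D hand positions from a haptic or worn sensor might be quantized into the same voxel coordinates used for depth-camera data; the voxel size below is an assumed value of the sketch.

```python
# Sketch of converting sensed 3-D positions (e.g., from a worn sensor) into
# voxel coordinates; the voxel size is an assumed value.
import numpy as np

def haptic_points_to_voxels(points_3d, voxel_size=0.02):
    """Quantize sensed 3-D positions (in meters) into integer voxel coordinates."""
    return np.floor(np.asarray(points_3d) / voxel_size).astype(int)
```

The resulting coordinates could then be passed to the same accumulation step used for image-derived hand locations.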
FIGS. 2a-2d are a flowchart 200 illustrating example operations of the system of FIGS. 1a-1b, according to an example embodiment. In the example of FIG. 2a, access to a database comprising a plurality of database objects, each database object associated with a predefined three-dimensional (3-D) model that simulates the appearance of a predetermined 3-D item, may be initiated (202). For example, database access engine 176 may initiate access to a database 172 comprising a plurality of database objects 170a, 170b, 170c, each database object 170a, 170b, 170c associated with a predefined 3-D model 168a, 168b, 168c that simulates the appearance of a predetermined 3-D item, as discussed above.
Based on the freeform movement of at least one hand of the participant, three-dimensional spatial image data associated with at least one arm motion of the participant may be received (204) based on a natural gesture motion. For example, as discussed above, based on free-form movement of at least one hand of participant 132, image data reception engine 130 may receive 3-D spatial image data associated with at least one arm motion of participant 132 based on natural gesture motions.
An integrated 3-D model may be generated (206) based on integrating a temporally continuous 3-D representation of the 3-D positioning of the at least one hand from the received 3-D spatial image data. For example, as previously discussed, the integrated model generator 140 may integrate the temporally continuous 3-D representation of the location of at least one hand based on the received 3-D spatial image data 110 to generate an integrated 3-D model 144.
At least one predetermined 3-D item may be selected (208) based on accessing the database and determining at least one database object associated with at least one of the predefined 3-D models that matches the integrated 3-D model.
For example, as discussed above, the matching engine 166 may select at least one predetermined 3-D item based on accessing the database access engine 176 and determining at least one database object 170a, 170b, 170c associated with the predefined 3-D model 168a, 168b, 168c that matches the integrated 3-D model 144.
According to an example embodiment, an updated 3-D model simulating the appearance of a predefined updated 3-D item may be obtained, and storage of an updated database object associated with the updated 3-D model in the database may be initiated (210). For example, as discussed above, the update item input engine 178 may obtain an updated 3-D model 180 that simulates the appearance of a predefined updated 3-D item and initiate storage of an updated database object associated with the updated 3-D model 180 in the database 172 via the database access engine 176.
According to an example embodiment, the predefined updated 3-D items may include one or more of 3-D inventory items, 3-D game objects, 3-D real world items, or 3-D virtual reality environment objects. According to an example embodiment, an updated 3-D model may be obtained based on one or more of receiving image data associated with a predefined picture of an updated 3-D item or receiving an updated 3-D model. For example, as discussed above, update item input engine 178 may obtain update 3-D model 180 based on one or more of receiving image data associated with a picture of a predefined update 3-D item through an input device or receiving update 3-D model 180.
According to an example embodiment, audio data associated with at least one verbal indicator representative of an utterance may be received, and selecting, by the spatial object handler, at least one predetermined 3-D item may be based on determining at least one database object associated with a predefined 3-D model matching the integrated 3-D model, based on the verbal indicator (212). For example, the audio data reception engine 182 may be configured to receive audio data 114 associated with at least one verbal indicator representative of an utterance. According to an example embodiment, the matching engine 166 may select at least one predetermined 3-D item based on the verbal indicator based on accessing the database access engine 176 and determining at least one database object 170a, 170b, 170c associated with the predefined 3-D model 168a, 168b, 168c that matches the integrated 3-D model 144.
According to an example embodiment, based on matching at least one component included in one of the predefined 3-D models with the integrated 3-D model, a first alignment of the one of the predefined 3-D models and the integrated 3-D model may be generated (214) based on scaling, translating, and rotating the one of the predefined 3-D models and the integrated 3-D model. For example, as discussed further herein, the preliminary alignment engine 184 may generate a first alignment 186 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144 based on scaling, translating, and rotating the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on matching at least one component included in the one of the predefined 3-D models 168a, 168b, 168c with the integrated 3-D model 144.
According to an example embodiment, based on an iterative closest point algorithm, a second alignment of the one of the predefined 3-D models and the integrated 3-D model may be generated (216) based on the first alignment. For example, as discussed further herein, the iterative alignment engine 188 may generate the second alignment 190 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on the first alignment 186, based on an iterative closest point algorithm.
According to an example embodiment, based on matching at least one component included in one of the predefined 3-D models with the integrated 3-D model, a first alignment of the one of the predefined 3-D models and the integrated 3-D model may be generated (218) based on scaling, translating, and rotating the one of the predefined 3-D models and the integrated 3-D model. According to an example embodiment, based on a brute force alignment including a plurality of scalings, rotations, and translations of the one of the predefined 3-D models and the integrated 3-D model, a second alignment of the one of the predefined 3-D models and the integrated 3-D model may be generated (220) based on the first alignment. For example, as discussed further herein, the brute force alignment engine 192 may generate the second alignment 190 of the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on the first alignment 186, based on a brute force alignment including a plurality of scalings, rotations, and translations of the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144.
According to an example embodiment, a virtual 3-D mapping space may be initialized based on discretized 3-D virtual mapping elements represented as volumetric elements, where each volumetric element includes a weight value initialized to an initial value, where the virtual 3-D mapping space represents a 3-D space in close proximity to a participant (222). For example, as discussed above, the initialization engine 148 may initialize the virtual 3-D mapping space based on discretized 3-D virtual mapping elements represented as volumetric elements, where each volumetric element includes weight values initialized to initial values, where the virtual 3-D mapping space represents the 3-D space in close proximity to the participant 132.
According to an example embodiment, integrating the temporally continuous 3-D representation may include, based on a determination indicating that a portion of at least one hand has occupied the 3-D region for a period of time during the free-form movement, incrementing (224) a weight value of a selected volumetric element associated with the 3-D region of the 3-D space in proportion and comparing (226) a threshold to the weight value of each volumetric element. For example, as discussed above, the element activation engine 150 may proportionally increment a weight value of the selected volumetric element associated with the 3-D region of the 3-D space based on a determination indicating that a portion of at least one hand has occupied the 3-D region for a period of time during the free-form movement. For example, as discussed further below, the threshold comparison engine 152 may compare the threshold 146 to a weight value for each of the volumetric elements.
According to an example embodiment, a depth, position and orientation of the at least one hand may be determined (228), and a position of a volumetric element associated with the virtual 3-D mapping space corresponding to the depth, position and orientation of the at least one hand may be determined (230). For example, as discussed above, the position attribute engine 154 may determine a depth, position, and orientation of at least one hand, while the virtual element locator 156 may determine a position of a volumetric element associated with the virtual 3-D mapping space that corresponds to the depth, position, and orientation of the at least one hand.
According to an example embodiment, integrating the temporally continuous 3-D representation may include activating, based on the determined position, a plurality of volumetric elements (232) representing depths, positions and orientations corresponding to depths, positions and orientations of the at least one hand associated with the region of the virtual 3-D mapping space. For example, as discussed above, the element activation engine 150 activates a plurality of volumetric elements associated with a region of the virtual 3-D mapping space that represent depths, positions, and orientations corresponding to the depth, position, and orientation of the at least one hand based on the position determined by the virtual element locator 156.
According to an example embodiment, the volume elements may include volume picture elements (voxels), and the voxels are initialized to an inactive state with initialized weight values (234). For example, as discussed above, the initialization engine 148 may initialize voxels to an inactive state with initialized weight values.
According to an example embodiment, activating the plurality of volumetric elements may include, based on a determination indicating that a portion of at least one hand occupies the 3-D region for a period of time during the free-form movement, activating the sets of voxels based on increasing a weight value associated with the sets of voxels with each activation of the sets of voxels (236). For example, as discussed above, the element activation engine 150 may activate the groups of voxels based on increasing a weight value associated with the groups of voxels with each activation of the groups of voxels based on a determination indicating that a portion of at least one hand occupies the 3-D region for a period of time during the free-form movement.
According to an example embodiment, an enclosure space indicated by the pose of the at least one hand may be determined (238). For example, as discussed above, the enclosure detection engine 158 may determine an enclosure space indicated by the pose of the at least one hand. According to an example embodiment, the depth of the enclosure space may be determined (240) based on a determination of the depth of a region surrounding the enclosure space. For example, as discussed above, the depth determination engine 160 may determine the depth of an enclosure space based on a determination of the depth of a region surrounding the enclosure space.
According to an example embodiment, a plurality of volumetric elements representing a depth, position, and orientation corresponding to an enclosure space associated with a region of a virtual 3-D space are activated for a period of time associated with a gesture indicative of the enclosure space in lieu of activating a plurality of volumetric elements representing a depth, position, and orientation corresponding to at least one hand associated with a region of the virtual 3-D space (242). For example, as discussed above, the element activation engine 150 may activate a plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position, and orientation corresponding to the enclosure space in lieu of activating a plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position, and orientation corresponding to at least one hand, for a period of time associated with a gesture indicating the enclosure space.
FIGS. 3a-3c are a flowchart 300 illustrating example operations of the system of FIGS. 1a-1b, according to an example embodiment. In the example of FIG. 3a, three-dimensional (3-D) spatial image data associated with at least one arm motion of a participant may be received (302) based on free-form movement of at least one hand of the participant based on natural gesture motion of the at least one hand of the participant. For example, as discussed above, sensory data receiving engine 104 may include an image data receiving engine 130 that may be configured to receive 3-D spatial image data associated with at least one arm action of a participant 132 based on free-form movement of at least one hand of the participant based on natural gesture motion.
Further, a plurality of consecutive 3-D spatial representations may be determined based on the received 3-D spatial image data, each 3-D spatial representation comprising 3-D spatial mapping data corresponding to the 3-D pose and position of the at least one hand at consecutive time instances during the free-form movement (304). For example, as discussed above, the spatial representation engine 136 may determine a plurality of consecutive 3-D spatial representations 138 based on the received 3-D spatial image data, each 3-D spatial representation 138 including 3-D spatial mapping data corresponding to the 3-D pose and position of the at least one hand at consecutive time instances during the freeform movement.
An integrated 3-D model may be generated (306) by the spatial object processor based on progressively integrating the 3-D spatial mapping data included in the determined continuous 3-D spatial representations and comparing a threshold time value to a model time value, wherein the model time value is indicative of a number of time instances spent by the at least one hand occupying the plurality of 3-D spatial regions during the free-form movement. For example, as discussed above, the integrated model generator 140, via the spatial object processor 142, may generate an integrated 3-D model 144 based on progressively integrating the 3-D spatial mapping data included in the determined continuous 3-D spatial representations 138 and comparing a threshold time value 146 to a model time value, wherein the model time value indicates a number of time instances spent by the at least one hand occupying the plurality of 3-D spatial regions during the free-form movement.
According to an example embodiment, based on the received 3-D spatial image data, a volume associated with a hand of the participant may be determined (308) based on tracking a viewable portion of the hand over time. For example, as discussed above, the volume determination engine 162 may determine a volume associated with one hand of the participant 132 based on tracking a viewable portion of the hand over time based on the received 3-D spatial image data 110.
According to an example embodiment, based on a rotation of a tracked visual portion of a hand in a top view, a yaw angle of the hand may be determined based on the received 3-D spatial image data (310). For example, as discussed above, the position attribute engine 154 may determine a yaw angle of a hand based on received 3-D spatial image data based on a rotation of a tracked visual portion of the hand in a top view.
According to an example embodiment, based on changes in depth values associated with the tracked visual portions, a roll angle and a pitch angle of a hand may be determined (312). For example, as discussed above, the position attribute engine 154 may determine the roll and pitch angles of a hand based on changes in the depth values associated with the tracked visual portions.
According to an example embodiment, a volume associated with one hand of the participant may be determined (314) based on the received 3-D spatial image data. For example, as discussed above, volume determination engine 162 may determine a volume associated with one hand of participant 132 based on the received 3-D spatial image data 110.
A 3-D representation of the one hand in the 3-D virtual mapping space may be determined (316) based on the determined volume, and at least one hand pose may be determined (318) based on the 3-D representation of the one hand. For example, as discussed above, the spatial representation engine 136 may determine a 3-D representation of the one hand in the 3-D virtual mapping space based on the determined volume. For example, as discussed above, the pose determination engine 164 may determine at least one hand pose based on the 3-D representation of the one hand.
According to an example embodiment, it may be determined whether a pose of the at least one hand is indicative of a 3-D bounding space (320). If it is determined that a bounding 3-D space is indicated, generating the integrated 3-D model may include, for successive time instances associated with the pose indicating the bounding space, activating a plurality of volumetric elements associated with the portion of the integrated 3-D model representing the depth, position, and orientation corresponding to the bounding space (322). Otherwise, generating the integrated 3-D model may include activating a plurality of volumetric elements associated with the portion of the integrated 3-D model that represent the depth, position, and orientation corresponding to the at least one hand (324). For example, as discussed above, the integrated model generator 140 may either activate a plurality of volumetric elements associated with a portion of the integrated 3-D model representing the depth, position, and orientation corresponding to the bounding space, or activate a plurality of volumetric elements associated with the portion of the integrated 3-D model representing the depth, position, and orientation corresponding to the at least one hand.
According to an example embodiment, the at least one hand pose may be determined to include one or more of a flat hand pose, a curved hand pose, a clenched fist, or a finger-pinch pose (326). For example, as discussed further herein, the pose determination engine 164 may determine that the at least one hand pose includes one or more of a flat hand pose, a curved hand pose, a clenched fist, or a finger-pinch pose.
According to an example embodiment, access may be initiated to a database comprising a plurality of database objects, each database object associated with a predefined three-dimensional (3-D) model that simulates an appearance of a predetermined 3-D item (328). For example, as discussed above, database access engine 176 may initiate access to a database 172 that includes a plurality of database objects 170a, 170b, 170c, each database object 170a, 170b, 170c associated with a predefined 3-D model 168a, 168b, 168c that simulates the appearance of a predetermined 3-D item.
By the spatial object handler, at least one predefined 3-D model associated with at least one database object matching the integrated 3-D model may be determined, wherein the natural gesture motion mimics an appearance of a predetermined three-dimensional (3-D) item (330). For example, as discussed above, the matching engine 166 may, through the spatial object handler 142, determine predefined 3-D models 168a, 168b, 168c associated with database objects 170a, 170b, 170c that match the integrated 3-D model 144, wherein the natural gesture actions may emulate the appearance of a predetermined three-dimensional (3-D) item.
According to an example embodiment, an updated 3-D model that simulates the appearance of a predefined updated 3-D item may be obtained (332), and storage of an updated database object associated with the updated 3-D model in the database may be initiated (334). For example, as discussed above, the update item input engine 178 may obtain an updated 3-D model 180 that simulates the appearance of a predefined updated 3-D item and initiate storage of an updated database object associated with the updated 3-D model 180 in the database 172 via the database access engine 176.
FIG. 4 is a flowchart 400 illustrating example operations of the system of FIGS. 1a-1b, according to an example embodiment. In the example of FIG. 4, three-dimensional (3-D) sensory data associated with at least one natural gesture of a participant may be received (402) based on a natural gesture motion that emulates an appearance of a predetermined three-dimensional (3-D) item based on free-form movement of the participant. For example, as discussed above, the sensory data receiving engine 104 may be configured to receive 3-D sensory data 106 associated with at least one natural gesture of a participant based on free-form movement of the participant based on natural gesture actions that emulate the appearance of a predefined three-dimensional (3-D) item. For example, as discussed above, sensory data 106 may include one or more of image data 110, audio data 114, or haptic data 118.
An integrated 3-D model is generated based on integrating the received 3-D sensory data, the 3-D sensory data representing a 3-D positioning of at least one 3-D moving object associated with the participant according to free-form movements (404). For example, as discussed above, the integrated model generator 140 may be configured to generate, by the spatial object processor 142, an integrated 3-D model 144 based on integrating the received 3-D sensory data 106, the 3-D sensory data 106 representing a 3-D positioning of at least one 3-D moving object associated with the participant according to free-form movements.
For example, a participant may grab a 3-D object (e.g., book, laptop, mobile phone) and move the 3-D object in a natural gesture motion to describe the shape of a desired 3-D item. As another example, a participant may wear or attach a sensing device (e.g., glove, joystick) and gesture through the sensing device to shape a desired 3-D item. For example, the participant may move to present a perceived shape of the 3-D item.
A predefined 3-D model associated with the database object that matches the integrated 3-D model may be determined (406) by the spatial object handler. For example, as discussed above, the matching engine 166 may, through the spatial object handler 142, determine the predefined 3-D models 168a, 168b, 168c associated with the database objects 170a, 170b, 170c that match the integrated 3-D model 144.
According to an example embodiment, a portion of the received 3-D sensory data may be selected for integration (408) based on comparing a threshold time value to a value indicative of a length of time it takes to move at least one 3-D moving object within the plurality of 3-D regions during free-form movement. For example, as discussed above, the integration model generator 140 may be configured to select portions of the received 3-D sensory data 106 for integration based on comparing the threshold time value 146 to a value indicative of a length of time it takes to move at least one 3-D moving object within the plurality of 3-D regions during free-form movement, as discussed further below.
FIGS. 5a-5e illustrate example gesture movements and example models associated with the gesture movements, according to an example embodiment. As shown in FIG. 5a, a participant 502 may mentally visualize a 3-D object 504. For example, the 3-D object may include a three-legged stool that includes a stool seat 506 and angled legs 508. The participant 502 may indicate various dimensions of the 3-D object 504, for example, by holding his/her hands flat and moving them to locations 510 in which the distance separating the two hands indicates a height, width, and/or depth associated with the 3-D object 504. As shown in FIG. 5a, a participant 502 within range of a sensing device 512 (e.g., an overhead depth camera) may spatially describe, or mime using natural gestures, a depiction of the 3-D object 504.
As shown in FIG. 5b, the participant 502 may hold his/her hand flat and move it in a rotational action 514 to depict the stool seat 506 of the stool (e.g., 3-D object 504) as visualized in the participant's mind. As shown in FIG. 5c, the participant 502 may make a fist and move his/her hand in an angled vertical motion 516 to depict the angled stool legs 508 of the stool (e.g., 3-D object 504) as visualized in the participant's mind.
According to an example embodiment, the example system 100 of FIGS. 1a-1b may receive image data acquired by the sensing device 512 while tracking the movement of the participant. The system 100 may then generate a 3-D model 518, as shown in FIG. 5d, based at least on the image data derived from the tracking of the position 510, the rotational action 514, and the angled vertical action 516 of the hands of participant 502.
According to an example embodiment, as discussed above, the system 100 may determine a predetermined 3-D object that most closely matches the generated 3-D model 518. For example, the generated 3-D model 518 may match a predetermined 3-D model associated with a three-legged stool, as shown in FIG. 5e by the superimposed 3-D model 522 over the generated 3-D model 520.
According to an example embodiment, a "data impersonation" technique may be based on observing and perceiving human gestures, using human senses of spatial reference, and using rich hand shapes and motions in describing objects to infer the described objects, as shown in fig. 5a-5 e. Similar to using gestures when talking to a human observer, data impersonation or gesturing may be observed passively, thus providing little to no feedback during the gesture. The example participant may thus work solely from the image in the brain of the depicted object (e.g., 3-D object 504), and his/her gestures may be used to implicitly create a virtual representation of the image in brain (e.g., 3-D model 518).
The virtual representation may be used to classify a desired object (e.g., "stool") and to extract details of the object that distinguish it from other instances of that class. For example, participant 132 may describe a chair, and may further describe a particular existing chair that has three legs tilting out from the middle and a height of two feet (e.g., the stool discussed above in connection with FIGS. 5a-5e). Without such specific details, a reference to a particular chair may not be clear.
Data mimicking or gesturing may thus passively observe how participant 132 gestures and may provide no feedback while participant 132 is gesturing. Thus, participant 132 may work solely from his/her conceptual model of the 3-D object to determine his/her gesture movements. Because it may provide little or no feedback during gesturing, data mimicking or gesturing may rely on participant 132 to maintain a frame of reference while gesturing. According to an example embodiment, the participant's body may be used as a spatial reference. For example, participant 132 may not rely on visual feedback when bringing the two hands together. For example, humans may also have the ability to know where their two hands are relative to each other in the air, and may keep a space outside their body anchored for a brief period of time (e.g., using short-term visual memory for maintaining spatial relationships).
Speakers have used gestures in conversation since ancient times. For example, a speaker may use beat (stroke) gestures for emphasis, deictic (pointing) gestures for indicating objects, metaphoric gestures for conveying abstract meaning, and iconic gestures. Iconic gestures may depict tangible objects or events, may bear a formal relationship to the content of the utterance, and may be used in attempts to describe the shape or form of an object. As another example, emblematic gestures may include symbols that convey meaning only by convention (e.g., an "OK" sign).
Pantomimic gestures may resemble iconic gestures in that they may depict objects or actions, but they do not accompany speech (and may be distinguished from gestures used in theatrical performance). Furthermore, sign languages may be used as an alternative to spoken language, and may therefore go beyond merely supplementing speech.
According to an example embodiment, data mimicking or gesturing may be used to perform matching of 3-D models against stored predetermined objects. Because these models may be based not on conventions but on the actual shapes of real-world physical objects, data-mimicking gestures may be understood to include iconic and pantomimic gestures, and typically not emblems. For example, while speech may describe a class in detail (e.g., "chair"), gestures and speech may be used together to supplement the technique, as the technique does not involve the complex interdependence of one on the other that is typical of natural gesturing. According to an example embodiment, some example gestures discussed herein may be referred to as "pantomimic" or "mimicking" gestures.
In an example study, human participants were observed in order to determine potential gesturing techniques for describing objects. For example, participant 132 may look at an object and then, from memory, gesture a description of the object. For example, participants 132 may use only their hands to describe the object; they may not use speech or body posture to augment their description of the object. For example, participant 132 may not receive any feedback during the object description. For example, the participant 132 may complete the description of the object by lowering their arms. For example, participant 132 may not receive instructions as to which gestures to use for the object description. For example, the participant 132 may be asked to spatially describe primitive objects, such as boxes, cones, and pyramids, as well as more complex shapes. For example, such study results may involve determining which features participant 132 may include when spatially describing a complex object.
FIG. 6 illustrates an example participant gesturing proximate to a camera. As shown in the example scenario illustrated in FIG. 6, an overhead camera 602 may be mounted on the ceiling, 8 feet above the ground. The example camera 602 may thus capture an overhead volume that is 5 feet wide, 3 feet high, and 5 feet deep. As shown in FIG. 6, a participant 604 (e.g., participant 132) may spatially describe an object within the range of the camera 602 (e.g., image data input device 112). As shown in FIG. 6, the participant 604 may provide gestures with the participant's hands 606, as discussed above.
FIG. 7 illustrates example three-dimensional (3-D) items. For example, the 3-D items may include a three-legged stool 702 and a taller three-legged stool 704. The 3-D items may also include an S-shaped chair 706, a one-legged chair 708, and a four-legged chair 710. The 3-D items may also include a wheeled office chair 712, a monitor 714, a personal computer (PC) 716, a cone 718, a ladder 720, and a table 722. Such 3-D items may be spatially described by human participants (e.g., participant 132).
FIGS. 8a-8c illustrate example gestures of example participants. According to example observations of a group of people in a study, a participant 502 engaged in gesturing or mimicking may naturally maintain the relative proportions of the parts of an object, as well as relative scaling across objects. For example, participant 502 may naturally describe a large table using a large portion of their arm's reach, while describing a chair correspondingly smaller. For example, participant 502 may naturally scale objects non-uniformly, fitting their size in the various directions to the area covered by the arm's reach. Thus, reflecting the characteristics of the human body, an object may be described as wider 802 than it is tall 804, and taller than it is deep 806.
For example, participant 502 may naturally describe an object using a top-down approach; after the larger, more pronounced surfaces, they may describe the smaller portions of the object. For example, participant 502 may distinguish between surfaces (e.g., the flat surfaces of PC 716, monitor 714, and table 722, and curved surfaces such as the frame of ladder 720 or the seat and back of a chair) and smaller components (such as support posts and connectors).
For example, the participant 502 may naturally use both hands in parallel poses, facing each other, to define symmetric elements of an object (e.g., PC 716, monitor 714). Those symmetrical portions may not represent the dimensions of the entire object, but may instead detail specific portions. For example, the participant 502 may naturally use simultaneous and symmetric hand movements to describe smaller parts, such as the legs of the chair 710 or the frame of the ladder 720.
For example, when an object is shaped like a box, participant 502 may define the dimensions of portions of the object (e.g., PC 716, monitor 714). For example, the participant 502 may naturally move both flat hands back and forth, repeatedly and simultaneously, along the bounding dimensions of the object. For example, participant 502 may naturally hold both hands flat in place to define those boundaries. For example, a participant may naturally trace the wireframe of a box-shaped object (e.g., PC 716).
FIG. 9 shows example hand gestures of a participant. For example, the participant 502 may naturally use their hands to "trace" large surfaces, i.e., move their flattened hands along those surfaces as if wiping them (e.g., the top of table 722, a chair seat, cone 718). For example, the participant 502 may naturally wipe the area within the boundary of a surface to "fill it in," as shown by the flat hand gesture 902. As shown in FIG. 9, a curved hand gesture 904 may naturally be used to describe a curved surface, while fist and pinch hand gestures 906 and 908 may naturally be used to describe support posts and legs. Relaxed hand poses 910 and 912 may occur naturally when no surface is being traced.
For example, the participant 502 may naturally trace the outline of a medium-sized surface with their flat hand (e.g., hand gesture 902) and wipe the enclosed area with their hand to "fill it in" (e.g., a monitor, the back of an office chair). For example, the participant 502 may naturally abstract such medium-sized surfaces as a wave of their flat hand (e.g., hand gesture 902), which they may repeat to indicate the surface (e.g., for a chair). For example, the participant 502 may naturally describe a surface (e.g., monitor, chair) simply by repeatedly waving their hands in the general position of the surface. For example, the participant 502 may naturally use their hands to "trace" smaller parts of an object (e.g., the steps of a ladder, the outer frame of a ladder).
For example, for curved shapes, participant 502 may naturally adjust the shape of their hand to match the curved surface and repeatedly "wipe" up and down that surface. The participant 502 may form a closed circle with their thumbs and index fingers (e.g., forming an enclosure, as discussed above) and move their hands downward while drawing them apart, with the fingers maintaining the circular shape (e.g., hand gesture 904).
For example, participant 502 may naturally move both hands symmetrically and simultaneously to describe symmetry across smaller parts. For bars, support posts, and legs, a participant may naturally form a fist (e.g., hand gesture 906) and move it along the bar to represent a straight bar (e.g., a table leg, a tripod leg, a chair or office chair support). For example, the participant 502 may naturally pinch their thumb and forefinger together and move them along the bar (e.g., hand gesture 908).
For larger support posts, the example participant 502 may hold their hands a short distance apart, keeping them parallel, or may connect the fingers and palms of both hands to enclose the space between the hands, and may move both hands to trace the shape of the component (e.g., the struts and legs of a chair, a monitor base). For example, participant 502 may naturally ignore complex shapes such as the feet of a chair, or may abstract them into a single primitive shape.
In general, the example participant 502 may begin describing an object in a top-down manner. They may abstract the appearance of the object, first detailing large components and surfaces, and finally delineating some distinctive but smaller components. For example, participant 502 may naturally indicate the armrests, struts, and feet of an office chair, while ignoring the bars that support the armrests or that connect the backrest to the seat. Similarly, the example participant 502 may describe a ladder by indicating all of its steps and then highlighting its outer frame.
For example, participant 502 may naturally first describe those portions that most clearly represent the function of the object (e.g., the back and seat of a chair, the steps of a ladder, the top of a table). They may then naturally describe the parts that hold the object together. For example, the participant 502 may naturally exploit symmetric appearance whenever possible, and they may use two-handed mirrored gestures to describe the shape. Similarly, they may use both hands to detail dimensions by defining bounding planes or by "drawing" a bounding box. The actual sizes of medium and small surfaces may not appear important to participant 502 in their spatial descriptions, and may therefore be ignored in their natural gestures.
For example, the participant 502 may naturally adjust the shape of their hand to the shape of the object or part being described, stretching the hand flat (e.g., to describe a flat surface, hand gesture 902) or curving it and bringing the fingers together (e.g., for a round surface, hand gesture 904), as appropriate. In contrast, the participant 502 may relax their hands and allow them to assume their natural state (e.g., hand poses 910, 912) when moving the hands to the next portion of the object.
For smaller parts of objects, such as bars and feet, the example participant 502 may either form a fist (e.g., hand gesture 906) or pinch their thumb and forefinger together (e.g., hand gesture 908) to indicate round and square bars, and may then move their hand along the part's shape. They may thereby ignore the actual size of those bars, using the motion of the hand to indicate the shape of such bars.
The example participant 502 may freely change the yaw and roll of the hand but, due to the limited range of hand tilt angles, may change the hand's tilt through a vertical pose only when indicating parallel portions (as shown in FIG. 8c). However, moving the elbow may expand that range when the hand is vertical. Conversely, hand roll and yaw may cover a greater range, and elbow movement may additionally support the range of hand yaw.
In addition to extending their hands to indicate meaningful activity as described above, the example participant 502 may intentionally describe portions of an object more slowly, while moving their hands faster when transitioning to another portion. For smaller surfaces, they may linger in one location for a short period of time. For larger surfaces, the example participant 502 may repeat the description of the surface and may describe it more deliberately than when moving their hands to another portion.
When two components are closely juxtaposed, the example participant 502 may not linger between the components, but may instead treat them as a composite and change the orientation of their hands while moving them (e.g., for the connected back and seat of a chair). Such a composite part may be indicated by repeating the gesture.
The observations discussed above may provide a basis, for example, for translating observed gestures, as they occur, into a virtual representation, in an effort to recreate the image in the participant's mind. In particular, the example techniques discussed herein may not rely on predefined gestures that, when recognized, designate themselves as particular portions of an object. According to an embodiment, the participant's hands may provide the substantial focus when expressing a gesture. According to one example embodiment, the position and pose of the participant's arms and body may be ignored, and the focus may be directed entirely toward the participant's hands.
For example, the participant 502 may naturally trace the surfaces and structural elements of an object, recreating the object from their spatial memory, which suggests that a virtual representation of the participant's description may likewise be built up over time. According to an example embodiment, those portions that participant 502 has spent more time describing may be given more weight than portions that he/she has covered only briefly.
Because the example participant 502 may describe surfaces of different sizes by waving their hands within the respective regions, the participant's hands may create traces in the virtual representation. Since the actual path of the gesture may provide less information about the object, the position and orientation of the participant's hands may serve as the focus for correctly translating the motion. In conjunction with time-aware sensing of the gestures, such tracing may add more meaning to the virtual representation as the participant 502 covers a particular portion of the object repeatedly or more slowly.
According to an example embodiment, the focus may be directed to entities other than the participant's hands. For example, the participant may present a perceived shape of the 3-D item. According to an example embodiment, the description of the 3-D item may be inferred, for example, based on the length of time spent by the participant presenting the perceived shape of the 3-D item.
According to one example embodiment, a participant may spatially describe a 3-D item by a 3-D moving object associated with the participant. For example, a participant may grab a 3-D object (e.g., book, laptop, mobile phone) and move the 3-D object in a natural gesture motion to describe the shape of a desired 3-D item. As another example, a participant may wear or attach a sensing device (e.g., glove, joystick) and gesture the shape of the desired 3-D item.
FIGS. 10a-10b illustrate example hand gestures of a participant. As shown in FIG. 10a, an extended hand gesture 1002 may indicate an intent to indicate a shape (e.g., a flat surface), as may an extended hand with the fingers closed (gesture 1004). The shape of the hand, curved with the fingers closed, may likewise suggest that the motion is meaningful, while a relaxed gesture 1008 may indicate a transition (e.g., similar to that discussed in connection with hand gestures 910, 912).
According to one example embodiment, the example techniques may recognize and translate only the meaningful portions of the participant's gestures, while ignoring motions that serve only to move the hand to the next portion of the object. For example, the example participant 502 may momentarily relax his/her muscles while moving a hand to another portion of the object, whereas he/she may hold the fingers extended, or flex the hand muscles, to signal a meaningful hand gesture. It may be desirable to capture this difference; however, the changes in finger pose and curvature may be quite subtle, as shown in FIGS. 10a-10b.
Given the potential difficulty of sensing muscle relaxation with a camera, example data mimicking or gesturing techniques may forego interpretation of finger bending. At each moment, the participant's hands may leave marks in the virtual representation (e.g., spatial representation 138 and/or integrated model 144) whose positions and orientations correspond to those of the participant's hands in the real world. In other words, the orientation and pose of a hand at each moment may determine the volume of the component added to the virtual representation (i.e., a flat, oblique hand may make a flat, oblique, thin contribution to the virtual representation). According to an example embodiment, these concepts may be extended to 3-D objects other than the participant's hands, as discussed further below.
According to an example embodiment, by replicating the volumes of the participant's hands and representing them in virtual space, the example techniques discussed herein may sense flat and curved hand poses (e.g., for flat surfaces and spherical surfaces) and may also capture references to smaller elements (e.g., the legs of a chair) when the participant 502 forms a fist or pinches their fingers together. Furthermore, the two hands may be considered separately.
According to an example embodiment, the data mimicking or gesturing techniques discussed herein may generate a virtual representation (e.g., the generated integrated model 144) of the participant's description in a discretized 3-D volume of l x m x n voxels (volumetric picture elements). This voxel space may thus represent the "memory" of the object description system. Each voxel may be in an activated or non-activated state. According to an example embodiment, a scene may be initialized with only inactive voxels, and voxels may be appropriately activated as the participant's gestures are observed. According to an example embodiment, each voxel may also be associated with a particular weight, which may increase as the participant repeatedly activates the voxel, as discussed above. Thus, because the participant may trace the most meaningful portions of the object slowly or repeatedly, the set of voxels above a particular weight may be interpreted as representing the meaningful portions of the participant's description, while the rest may be ignored.
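As an illustrative sketch only (and not the claimed implementation), such a weighted voxel space might be modeled as follows in Python; the grid dimensions, voxel size, and method names are assumptions made for illustration.

import numpy as np

class VoxelSpace:
    """Illustrative l x m x n voxel grid in which each voxel carries an occupancy weight."""

    def __init__(self, dims=(64, 64, 64), voxel_size_m=0.01):
        self.weights = np.zeros(dims, dtype=np.float32)  # all voxels start inactive (weight 0)
        self.voxel_size_m = voxel_size_m

    def activate(self, indices, amount=1.0):
        # indices: iterable of (i, j, k) voxel coordinates touched at the current time step
        for i, j, k in indices:
            self.weights[i, j, k] += amount              # repeated or slow passes accumulate weight

    def meaningful(self, threshold):
        # Boolean volume of voxels whose accumulated weight meets or exceeds the threshold
        return self.weights >= threshold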
According to an example embodiment, the example 3-D scene techniques discussed herein may be earth-anchored, such that the location and orientation of the scene do not adapt to the location and orientation of the participant. Thus, although the scene may initially be centered in front of the participant 502, the participant 502 may be able to maintain this spatial anchoring, because a 3-D object description may be completed in as little as a few seconds.
According to an example embodiment, the identification of the participant-described object may be determined based on matching the voxel representation against candidate objects in an example database (e.g., database 172). According to an example embodiment, the data mimicking or gesturing techniques discussed herein may select the closest matching object from the database. As discussed herein, for each candidate object, the generated 3-D model (e.g., the integrated model 144) may be aligned with a predefined database model (e.g., one of the predefined 3-D models 168a, 168b, 168c) for similarity comparison and measurement. Further, differences in scale and rotation relative to the participant's creation may be accommodated.
Because objects may be roughly assembled from characteristic parts, participant 502 may describe such characteristic parts separately. Since humans may often make implicit assumptions about their audience, they may not describe less prominent portions that seem implicit to the structure (e.g., the connecting portions between surfaces, such as between the back and seat of a chair) or features that do not further aid in identifying the object.
According to one example embodiment, such part-wise composition on the part of the participant may be reflected in the matching process by allowing the participant to omit any part of the object, trusting that the participant will naturally describe the object in sufficient detail, given the similarity of some reference objects and the variability of shapes within a class.
Further, according to an example embodiment, example data mimicking or gesturing techniques may incorporate speech recognition to narrow the class of objects under consideration. For example, the participant 502 may say "chair" and then describe a particular chair in detail by gesturing to indicate an identifying feature (or set of features) of the chair's shape.
According to an example embodiment, the example system 100 discussed herein may be implemented on an end-user system with a single depth-sensing camera (e.g., image data input device 112) mounted above the participant 132. For example, a MICROSOFT KINECT camera may provide depth images at 30 Hz and 640 x 480 resolution. For example, the camera may have a diagonal field of view of 70°. According to an example embodiment, the example system 100 may process each camera frame in less than 15 ms, providing real-time processing and translation of the participant's gestures into the voxel representation.
FIGS. 11a-11d show graphical views of example processing of image data by the example system of FIGS. 1a-1b. According to an example embodiment, the example techniques discussed herein may process the original image 1100a as discussed below. As shown in FIG. 11a, a camera 1102, a chair 1104, and a table 1106, along with the arms 1108, 1110 and hands 1112, 1114 of a participant (e.g., participant 132), are visible in the original image 1100a.
According to an example embodiment, each picture element (pixel) in the input image 1100a may be converted to world coordinates. As shown in FIG. 11b, coordinates (i.e., pixels) outside a volume that is 3 feet wide by 2 feet high by 2.5 feet deep may then be cropped, thus removing the floor, walls, and potentially other objects (i.e., the chair 1104 and table 1106 of FIG. 11a) from the depth image 1100b.
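As an illustrative sketch only, such back-projection and cropping might be expressed as follows in Python; the camera intrinsics fx, fy, cx, cy and the numeric volume bounds are assumptions standing in for the configuration described above.

import numpy as np

def crop_to_interaction_volume(depth_m, fx, fy, cx, cy,
                               half_width_m=0.457, half_height_m=0.305,
                               z_near_m=1.0, z_far_m=1.76):
    """Back-project a depth image to 3-D points and discard points outside the volume.

    The half-extents and the z band are assumptions standing in for the 3 ft x 2 ft x
    2.5 ft volume described above; fx, fy, cx, cy are the depth camera's intrinsics.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m                           # distance along the camera's view axis
    x = (u - cx) * z / fx                 # pinhole back-projection to camera/world coordinates
    y = (v - cy) * z / fy
    keep = ((np.abs(x) <= half_width_m) & (np.abs(y) <= half_height_m) &
            (z >= z_near_m) & (z <= z_far_m))
    cropped = depth_m.copy()
    cropped[~keep] = 0.0                  # cropped pixels become empty: floor, walls, furniture drop out
    return cropped, np.stack([x, y, z], axis=-1)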
According to an example embodiment, the arms 1108 and 1110 in the image 1100b may be identified by distinguishing contiguous regions, with depth-value gradients used to account for overlapping arms, and the participant's hands 1112, 1114 may be extracted by removing the arms 1108, 1110, as shown in FIG. 11c.
Based on the assumption that the participant's arms 1108, 1110 enter the volume from the outside, the example technique may determine the farthest point of each hand 1112, 1114, measuring distance as the length of the path within the shape of the arm 1108, 1110 (i.e., as opposed to the Euclidean distance in this example) to account for bent elbows and wrists. According to an example embodiment, to extract the participant's hands 1112, 1114, a constant hand length (depending on the distance to the camera) may be used. According to an example embodiment, a calibration may be used for a particular participant's hand size.
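A minimal sketch of measuring distance as path length within the arm's shape (here via a breadth-first search over the arm mask) might look as follows; the entry point and the constant hand length in pixels are assumptions for illustration.

from collections import deque
import numpy as np

def _path_dist(mask, seed):
    """Breadth-first path distance inside a 2-D boolean mask, starting at seed (row, col)."""
    dist = np.full(mask.shape, np.inf)
    dist[seed] = 0
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                    and mask[nr, nc] and np.isinf(dist[nr, nc])):
                dist[nr, nc] = dist[r, c] + 1
                q.append((nr, nc))
    return dist

def extract_hand(arm_mask, entry_point, hand_len_px):
    """Hand = arm pixels within a constant path length of the arm's farthest point.

    Distances follow paths inside the arm shape rather than straight lines, so bent
    elbows and wrists do not shorten them; hand_len_px is an assumed constant hand length.
    """
    d_entry = _path_dist(arm_mask, entry_point)
    reachable = arm_mask & np.isfinite(d_entry)
    tip = np.unravel_index(np.argmax(np.where(reachable, d_entry, -1)), arm_mask.shape)
    d_tip = _path_dist(arm_mask, tip)
    return arm_mask & (d_tip <= hand_len_px)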
According to an example embodiment, the orientation and volume of the two hands 1112, 1114 may be calculated by tracking the visible area of each hand 1112, 1114 over time. According to an example embodiment, the roll and tilt angle of each hand 1112, 1114 may be calculated from the change in depth values across the visible area. According to an example embodiment, if the visible area is too small, such as for a vertically rolled hand (e.g., only the thumb and index finger are visible from above), the example techniques may estimate, based on previous observations, how much of the hand 1112, 1114 is likely to be occluded, and may thus determine the orientation of the hand.
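One way to derive roll and tilt from the change in depth values, sketched here as an assumption-laden illustration (a least-squares plane fit; the pixel footprint and the axis assignment are assumptions), is:

import numpy as np

def hand_roll_and_tilt(depth_m, hand_mask, pixel_size_m):
    """Estimate a hand's roll and tilt by fitting a plane to its visible depth values.

    pixel_size_m is the assumed metric footprint of one pixel at the hand's distance;
    assigning roll to the image x direction and tilt to the image y direction is an
    assumption that depends on how the hand lies in the overhead view.
    """
    rows, cols = np.nonzero(hand_mask)
    z = depth_m[rows, cols]
    # Least-squares plane z = a*x + b*y + c over the visible hand pixels
    A = np.column_stack([cols * pixel_size_m, rows * pixel_size_m, np.ones_like(z)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    a, b, _ = coeffs
    roll = np.arctan(a)   # depth change across the image x direction
    tilt = np.arctan(b)   # depth change across the image y direction
    return roll, tilt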
According to an example embodiment, the calculation of the yaw angle of each hand 1112, 1114 may be straightforward, given that the camera 1102 may be mounted above the participant's head. According to an example embodiment, from observations over time, the pose of each hand in 3-D space and its precise extension along the z-axis (i.e., the view axis of the camera 1102) may be reconstructed.
According to an example embodiment, after calculating the orientation of the hands 1112, 1114, the example techniques may translate the position and orientation of each hand 1112, 1114 directly into positions of voxels in the voxel space. According to an example embodiment, this may include activating all voxels within a region that shares the same depth, position, and orientation as the participant's hands 1112, 1114, as discussed above.
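Continuing the illustrative VoxelSpace sketch given earlier (all names are assumptions), translating a hand's reconstructed points into voxel activations might reduce to:

import numpy as np

def activate_hand_voxels(space, hand_points_m, origin_m, voxel_size_m=0.01):
    """Activate every voxel that shares a location with the hand at the current frame.

    hand_points_m: (N, 3) array of the hand's points in world coordinates, including its
    reconstructed extension along the view axis; origin_m: world position of the voxel
    grid's corner; space: the illustrative VoxelSpace sketched earlier (an assumption).
    """
    idx = np.floor((hand_points_m - origin_m) / voxel_size_m).astype(int)
    dims = np.array(space.weights.shape)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)          # ignore points outside the grid
    space.activate(map(tuple, np.unique(idx[inside], axis=0)))  # one increment per voxel per frame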
According to one example embodiment, the example techniques may detect an intent of the participant to create finer elements by pinching his/her fingers and thumb or by bringing both hands together. As shown in FIG. 11d, once such an enclosed region 1116 is detected (e.g., by the enclosure detection engine 158), that region (as opposed to the hand 1114) may be processed. According to an example embodiment, the depth values of the region 1116 may be sampled from the surrounding region (i.e., the hand 1114). Voxels that share a location with the enclosed region 1116 may thus be activated. According to an example embodiment, if the hand 1114 encloses an area (e.g., region 1116), its actual shape may no longer be considered. Thus, participant 132 may indicate a thinner element, such as a leg or a tripod support column. A similar technique may be applied if participant 132 connects the thumbs and forefingers of both hands, thereby enclosing a larger area.
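A minimal sketch of processing an enclosed region in place of the hand's own shape (hole filling on the hand mask; the median-depth sampling is an assumption) might be:

import numpy as np
from scipy.ndimage import binary_fill_holes

def detect_enclosure(hand_mask, depth_m):
    """Detect a region enclosed by the hand (e.g., pinched thumb and finger) and assign it a depth.

    The enclosed region is taken to be the hole inside the hand mask; its depth is sampled
    from the surrounding hand pixels, so its voxels can be activated instead of the hand's.
    """
    enclosed = binary_fill_holes(hand_mask) & ~hand_mask
    if not enclosed.any():
        return None, None                          # no enclosure: the hand's own shape is used
    sampled_depth = float(np.median(depth_m[hand_mask]))
    return enclosed, sampled_depth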
According to an example embodiment, the voxel space may be implemented as a three-dimensional array of positive numbers, representing a 3-D histogram. According to an example embodiment, each voxel may have a constant width, height, and depth (e.g., 10 mm). According to an example embodiment, the center of the voxel space may be placed directly in front of the participant 132, approximately at torso level (e.g., as shown in FIG. 11a).
According to an example embodiment, activating a voxel may increment its count in the histogram, so that voxels which the participant 132 passes through repeatedly or slowly (i.e., during meaningful portions of the object description) may accumulate a higher count than voxels the participant 132 passes through when merely moving the arms 1108, 1110 to the next meaningful location. According to an example embodiment, simply thresholding across all voxels in the space may then leave the meaningful and relevant portions of the object description. For example, as discussed above, the integrated model generator 140 may generate the integrated 3-D model 144 based on progressively integrating the 3-D spatial mapping data included in the determined continuous 3-D spatial representations 138 and comparing a threshold time value 146 to a model time value, wherein the model time value indicates a number of time instances spent occupying at least one of the plurality of 3-D spatial regions during the free-form movement.
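Tying the illustrative sketches above together (all names, the per-frame input, and the threshold value are assumptions), the accumulation and thresholding steps might be used as follows:

# Illustrative use of the VoxelSpace sketch from earlier: accumulate per-frame activations,
# then threshold the 3-D histogram to keep only the meaningful portions of the description.
space = VoxelSpace(dims=(64, 64, 64), voxel_size_m=0.01)
for frame_points in hand_points_per_frame:                # assumed per-frame hand points (world coords)
    activate_hand_voxels(space, frame_points, origin_m)   # origin_m assumed: grid corner in front of the participant
meaningful_voxels = space.meaningful(threshold=5.0)       # threshold value chosen only for illustration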
According to an example embodiment, an example iterative alignment technique may employ an example iterative closest point (ICP) algorithm to register the two models (e.g., the generated integrated model 144 and one of the predefined 3-D models 168a, 168b, 168c). For example, the ICP algorithm may be initiated after both models have been pre-aligned (e.g., by scaling, translating, and rotating to at least match selected parts). According to an example embodiment, the preliminary alignment may further uniformly adapt the scale of the two models. For example, as discussed above, the iterative alignment engine 188 may generate the second alignment 190 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144 based on an iterative closest point algorithm.
According to an example embodiment, the ICP algorithm may be based on iteratively matching points in one model with the closest points in the other. For example, statistical techniques based on the distance distribution may be employed to handle outliers, occlusion, appearance, and disappearance, which provides example techniques associated with subset-to-subset matching. An example least-squares technique may be employed to estimate the 3-D motion from the point correspondences, which may reduce the average distance between the points of the two models.
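As an illustrative sketch of the iterative-closest-point idea only (not the specific claimed engine, and without the statistical outlier handling mentioned above), operating on point sets extracted from the two pre-aligned models:

import numpy as np
from scipy.spatial import cKDTree

def icp_align(source, target, iterations=30):
    """Minimal ICP sketch: rigidly aligns source points (N, 3) to target points (M, 3).

    Each iteration pairs every source point with its nearest target point and applies the
    least-squares rigid motion (via SVD, i.e., the Kabsch solution) that reduces the mean
    distance between the paired points.
    """
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(iterations):
        nearest = target[tree.query(src)[1]]          # closest-point correspondences
        mu_s, mu_t = src.mean(axis=0), nearest.mean(axis=0)
        H = (src - mu_s).T @ (nearest - mu_t)         # cross-covariance of centered point sets
        U, _, Vt = np.linalg.svd(H)
        if np.linalg.det(Vt.T @ U.T) < 0:             # avoid reflections
            Vt[-1] *= -1
        R = Vt.T @ U.T                                # best-fit rotation
        src = (src - mu_s) @ R.T + mu_t               # apply the estimated rigid motion
    return src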
Alternatively, an example brute-force technique may test every combination of four quarter rotations around the (vertical) z-axis and translations within a 16 cm x 16 cm area. According to one example embodiment, rotations around the x and y (horizontal) axes may be ignored, as participants 132 may maintain the orientation of the object around those axes, while they may "turn" the object toward themselves during their spatial description (i.e., they may rotate it around the z-axis). The number of z rotations in this example technique may correspond to the number of vertical faces (e.g., four faces) of the object. This example technique may also pre-align the two models and uniformly adapt their scales. For example, as discussed above, the brute force alignment engine 192 may generate the second alignment 190 of one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144, based on the first alignment 186, based on a brute-force alignment including a plurality of scalings, rotations, and translations of the one of the predefined 3-D models 168a, 168b, 168c and the integrated 3-D model 144.
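A minimal sketch of such a brute-force search over discrete voxel grids (grid axis conventions, the voxel size, and the overlap score are assumptions) might be:

import numpy as np

def brute_force_align(model_a, model_b, max_shift=8):
    """Brute-force alignment sketch over boolean voxel grids of equal shape (equal x/y extents).

    Tries the four quarter rotations about the vertical (z) axis combined with integer
    translations of up to max_shift voxels in x and y, returning the rotation/shift whose
    overlap with model_b (count of shared active voxels) is highest. With an assumed 10 mm
    voxel, max_shift=8 roughly corresponds to the 16 cm x 16 cm window mentioned above.
    """
    best = (-1, None)
    for k in range(4):                                   # quarter rotations about z
        rotated = np.rot90(model_a, k=k, axes=(0, 1))    # assumes axes 0 and 1 span the horizontal plane
        for dx in range(-max_shift, max_shift + 1):
            for dy in range(-max_shift, max_shift + 1):
                # np.roll wraps around at the borders; acceptable for this rough sketch
                shifted = np.roll(np.roll(rotated, dx, axis=0), dy, axis=1)
                score = np.count_nonzero(shifted & model_b)
                best = max(best, (score, (k, dx, dy)), key=lambda t: t[0])
    return best   # (overlap score, (rotation index, dx, dy))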
According to one example embodiment, the example ICP technique may be computationally expensive and may involve approximately 8 seconds to compare two models, whereas the example brute-force technique may involve less than 1 second per comparison, as the brute-force technique may operate in the discrete voxel space (i.e., looking up voxels may involve relatively fast operations). However, the example ICP technique may provide more flexibility, in that it may rotate the object about all three axes to determine one or more closest matches.
While at least two different techniques of matching objects represented in voxel space are discussed herein, those skilled in the art of data processing will appreciate that many other matching techniques may be used to match the generated integrated model 144 to one or more predetermined 3-D models 168a, 168b, 168c without departing from the spirit of the discussion herein.
FIG. 12 illustrates an example overlay result of matching a generated 3-D model to a predefined 3-D model. As shown in FIG. 12, a generated 3-D model 1202 is matched to a predefined model 1204 based on the example matching techniques discussed herein. For example, participant 132 may envision a chair having disc-shaped feet (e.g., similar to chair 708), and may spatially describe the chair (e.g., via data-mimicking gestures) within the range of an example image data input device 112 (e.g., a depth camera). The example spatial object management engine 102 may receive the spatial image data 110 from the image data input device 112, and the integrated model generator 140 may generate the integrated model 144 (e.g., the generated 3-D model 1202). The matching engine 166 may then match the integrated model 144 against the predefined 3-D models 168a, 168b, 168c to select the selected model 194 (e.g., the predefined model 1204), as discussed above.
According to an example embodiment, the example system 100 may be implemented by capturing interactions with the participant 132 through a depth camera, such as a MICROSOFT KINECT camera. According to an example embodiment, video of the gestures of the participant 132 may be recorded with 30 Hz depth information at a resolution of 640 x 480 pixels. The spatial object management engine 102 may be implemented, for example, on a computing device running the WINDOWS 7 Ultimate operating system, with an Intel Core 2 Duo 2.13 GHz processor and 6 GB of random access memory (RAM).
According to an example embodiment, the example matching technique may employ a "closest-three" technique, wherein the participant 132 may spatially describe an object by gesturing and the system 100 may provide the three closest matching objects 170a, 170b, 170c from the database 172. Participant 132 or user 124 may then select one of the three options, may start over, or may choose to continue providing more detail by gesturing, for example when it is apparent that they were already describing the object in detail. According to an example embodiment, the closest-three results may also be consumed by a larger system that models the context of the interaction (e.g., a spoken dialog). Such additional information may aid in disambiguating the participant's input (and, conversely, gestures may disambiguate other aspects of an interaction, such as speech).
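A minimal sketch of such a closest-three selection (reusing the brute-force scoring sketch above; the entry format is an assumption) could be:

def closest_three(integrated_voxels, database_entries):
    """Return the three database objects whose predefined voxel models best match the input.

    database_entries: iterable of (database_object, predefined_voxel_model) pairs; scoring
    reuses the brute_force_align sketch above, and any of the matching techniques discussed
    herein could be substituted.
    """
    scored = sorted(database_entries,
                    key=lambda entry: brute_force_align(integrated_voxels, entry[1])[0],
                    reverse=True)
    return [obj for obj, _ in scored[:3]]   # the participant or user then selects, restarts, or refines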
Example techniques discussed herein may involve specific 3-D models stored in a database (e.g., the predefined 3-D models 168a, 168b, 168c stored in database 172). Using a direct shape-matching approach, these models may be matched with similarly represented input (e.g., the generated integrated model 144), without being informed by the nature of human gestures. Thus, adding an object to the database 172 (e.g., via the update model 180) may involve obtaining only a 3-D model of the object.
According to an example embodiment, items other than the hands of the participant 132 may be used to spatially describe objects in order to obtain the generated integrated model 144. For example, in robotic applications, an end effector may include a device at the end of a robotic arm designed to interact with the environment. Thus, for example, such an end effector may be used in place of a human hand for spatial description (e.g., in the context of a robotic appendage rather than a human hand). Further, if a participant's hands are occupied (e.g., holding a book, holding a mobile device), an object held by the hands may be tracked by the example system 100 in place of the hands of the participant 132. For example, if participant 132 does not have a hand available for gesturing, other objects may be used, without departing from the spirit of the discussion herein.
Furthermore, sensing devices other than the image data input device 112 may be used to acquire the sensory data 106, as discussed above. For example, a sensing glove may be used to capture hand gestures while participant 132 is gesturing with the sensing glove.
Example techniques discussed herein may provide example methods for sensing gestures that may be used, for example, to describe a particular physical object. According to an example embodiment, as discussed herein, the example data mimicking or gesturing techniques may be based on volumetric pixel or picture element (voxel) representations of the space traced by the hands of the participant 132 during gesturing. According to an example embodiment, 3-D model matching techniques may be used to match an input voxel representation for selection among a database 172 of known physical objects (e.g., physical objects that may be associated with the predefined 3-D models 168a, 168b, 168c).
Implementations of the various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-usable or machine-readable storage device (e.g., a magnetic or digital medium such as a Universal Serial Bus (USB) storage device, a magnetic tape, a hard disk drive, a compact disk, a Digital Video Disk (DVD), etc.), or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program, such as the computer program described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program that can implement the techniques discussed above can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Method steps can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. One or more programmable processors may execute instructions in parallel and/or may be arranged in a distributed configuration for distributed processing. Method steps can also be performed by, and an apparatus can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program may include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices (e.g., magnetic, magneto-optical disks, or optical disks). Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, implementations may be implemented on a computer that includes a display device (e.g., a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices can also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
An implementation can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a client can interact with an implementation), or any combination of such back-end, middleware, or front-end components. The components may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), e.g., the Internet.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the embodiments.

Claims (12)

1. A system for inferring a spatial object description from a spatial gesture, comprising:
a spatial object management engine, comprising:
a database access engine configured to initiate access to a database comprising a plurality of database objects, each database object associated with a predefined three-dimensional (3-D) model that simulates an appearance of a predetermined 3-D item;
an image data receiving engine configured to receive 3-D spatial image data associated with at least one arm action of a participant based on free-form movement of at least one hand of the participant based on natural gesture motion;
an integrated model generator configured to generate an integrated 3-D model based on integrating a temporally continuous 3-D representation of a 3-D positioning of the at least one hand from the received 3-D spatial image data and comparing a threshold time value to a model time value, wherein the model time value is indicative of a number of instances of time spent by the at least one hand occupying a plurality of 3-D spatial regions during free-form movement; and
a matching engine configured to select, by the spatial object processor, at least one of the predefined 3-D items based on accessing the database access engine and determining at least one database object associated with at least one of the predefined 3-D models, the at least one of the predefined 3-D models matching the integrated 3-D model.
2. The system of claim 1, further comprising:
an update item input engine configured to obtain an update 3-D model that simulates the appearance of a predefined update 3-D item and to initiate, via the database access engine, storage in the database of an update database object associated with the update 3-D model.
3. The system of claim 2, wherein the update item input engine is configured for obtaining an updated 3-D model based on one or more of:
receiving image data associated with a predefined picture of an updated 3-D item; or
Receiving the updated 3-D model via an input device.
4. The system of claim 1, further comprising:
an initialization engine configured to initialize a virtual 3-D mapping space based on discretized 3-D virtual mapping elements represented as volumetric elements, each volumetric element including a weight value initialized to an initial value, wherein
The virtual 3-D mapping space represents a 3-D space in close proximity to the participant, and wherein
The integrated model generator includes:
an element activation engine configured to proportionally increment a weight value of a selected volumetric element associated with a 3-D region of the 3-D space based on a determination indicating that a portion of the at least one hand occupied the 3-D region for a period of time during free-form movement; and
a threshold comparison engine configured to compare a threshold value to a weight value for each of the volumetric elements.
5. The system of claim 4, wherein the integrated model generator comprises:
a location attribute engine configured to determine a depth, a location, and an orientation of the at least one hand,
a virtual element locator configured to determine a location of a volumetric element associated with the virtual 3-D mapping space corresponding to the depth, position and orientation of the at least one hand, and
an element activation engine configured to activate a plurality of volumetric elements associated with a region of the virtual 3-D mapping space representing a depth, position and orientation corresponding to the at least one hand based on the position determined by the virtual element locator.
6. The system of claim 4, wherein the integrated model generator comprises:
an enclosure detection engine configured to determine an enclosure indicated by the pose of the at least one hand,
a depth determination engine configured to determine a depth of the enclosure based on a determination of a depth of a region surrounding the enclosure, an
An element activation engine configured to activate, for a period of time associated with the gesture indicating the enclosure, a plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position and orientation corresponding to the enclosure in lieu of the plurality of volumetric elements associated with a region of the virtual 3-D space representing a depth, position and orientation corresponding to the at least one hand.
7. The system of claim 1, wherein the matching engine comprises:
a preliminary alignment engine configured to generate a first alignment of one of the predefined 3-D models and the integrated 3-D model based on scaling, converting, rotating the one of the predefined 3-D models and the integrated 3-D model based on matching at least one component included in the one of the predefined 3-D models with the integrated 3-D model; and
an iterative alignment engine configured to generate one of the predefined 3-D models and a second alignment of the integrated 3-D model based on the first alignment based on an iterative closest point algorithm.
8. The system of claim 1, wherein the matching engine comprises:
a preliminary alignment engine configured to generate a first alignment of one of the predefined 3-D models and the integrated 3-D model based on scaling, converting, rotating the one of the predefined 3-D models and the integrated 3-D model based on matching at least one component included in the one of the predefined 3-D models with the integrated 3-D model; and
a brute force alignment engine configured to generate a second alignment of one of the predefined 3-D models and the integrated 3-D model based on the first alignment based on a brute force alignment comprising a plurality of zooms, rotations, and translations of the one of the predefined 3-D models and the integrated 3-D model.
9. A method for inferring a spatial object description from a spatial gesture, comprising:
receiving three-dimensional (3-D) spatial image data associated with at least one arm motion of a participant based on free-form movement of at least one hand of the participant based on natural gesture motion of the at least one hand;
determining a plurality of successive 3-D spatial representations based on the received 3-D spatial image data, each 3-D spatial representation comprising 3-D spatial mapping data, the 3-D spatial mapping data corresponding to a 3-D pose and position of the at least one hand at successive instances in time during the freeform movement; and
generating, by the spatial object processor, an integrated 3-D model based on progressively integrating the 3-D spatial mapping data included in the determined continuous 3-D spatial representation and comparing a threshold time value to a model time value, wherein the model time value is indicative of a number of time instances spent by at least one hand occupying a plurality of 3-D spatial regions during the free-form movement.
10. A method for inferring a spatial object description from a spatial gesture, comprising:
receiving three-dimensional (3-D) sensory data associated with at least one natural gesture of a participant based on free-form movement of the participant based on natural gesture actions that emulate an appearance of a predetermined three-dimensional (3-D) item;
generating an integrated 3-D model based on integrating the received 3-D sensory data and comparing a threshold time value to a model time value in accordance with the free-form movement, wherein the model time value indicates a number of time instances spent by at least one hand occupying a plurality of 3-D spatial regions during the free-form movement, and the 3-D sensory data represents a 3-D location of at least one 3-D moving object associated with the participant; and
determining, by a spatial object processor, a predefined 3-D model associated with a database object that matches the integrated 3-D model.
11. A system for inferring a spatial object description from a spatial gesture, comprising:
means for receiving three-dimensional (3-D) spatial image data associated with at least one arm motion of a participant based on free-form movement of at least one hand of the participant based on natural gesture motion of the at least one hand;
means for determining a plurality of successive 3-D spatial representations based on the received 3-D spatial image data, each 3-D spatial representation comprising 3-D spatial mapping data, the 3-D spatial mapping data corresponding to 3-D poses and positions of the at least one hand at successive time instances during the free-form movement; and
means for generating, by the spatial object processor, an integrated 3-D model based on progressively integrating 3-D spatial mapping data included in the determined continuous 3-D spatial representation and comparing a threshold time value to a model time value, wherein the model time value is indicative of a number of time instances spent by at least one hand occupying a plurality of 3-D spatial regions during free-form movement.
12. A system for inferring a spatial object description from a spatial gesture, comprising:
means for receiving three-dimensional (3-D) sensory data associated with at least one natural gesture of a participant based on free-form movement of the participant based on natural gesture actions that emulate an appearance of a predetermined three-dimensional (3-D) item;
means for generating an integrated 3-D model based on integrating the received 3-D sensory data and comparing a threshold time value to a model time value in accordance with the free-form movement, wherein the model time value indicates a number of time instances spent by at least one hand occupying a plurality of 3-D spatial regions during the free-form movement, and the 3-D sensory data represents a 3-D positioning of at least one 3-D moving object associated with the participant; and
means for determining, by a spatial object processor, a predefined 3-D model associated with a database object that matches the integrated 3-D model.
HK14105756.7A 2011-04-29 2012-04-28 Method and system of inferring spatial object descriptions from spatial gestures HK1192790B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13/098,387 US8811719B2 (en) 2011-04-29 2011-04-29 Inferring spatial object descriptions from spatial gestures
US13/098,387 2011-04-29
PCT/US2012/035702 WO2012149501A2 (en) 2011-04-29 2012-04-28 Inferring spatial object descriptions from spatial gestures

Publications (2)

Publication Number Publication Date
HK1192790A1 HK1192790A1 (en) 2014-08-29
HK1192790B true HK1192790B (en) 2017-08-11
