HK1171281B - Real-time camera tracking using depth maps
- Publication number: HK1171281B (application HK12112062.4A)
- Authority: HK (Hong Kong)
- Prior art keywords: depth, camera, depth map, mobile, frame
Description
Technical Field
The invention relates to image processing technology, and in particular to camera tracking technology.
Background
For many applications it is valuable to be able to track the orientation and position of a camera as it moves through an environment, for example in robotics, vehicle navigation, computer gaming, medical applications and other problem domains. Previous approaches have involved using color images captured by the moving camera, identifying features such as lines and edges in those images, and tracking that information across the sequence of color images to try to estimate relative camera positions. Existing solutions are limited in accuracy, robustness and speed. For many applications, however, accurate camera tracking needs to be performed in real time, for example so that a robot can successfully move about in its environment.
The embodiments described below are not limited to implementations that address any or all of the disadvantages of known camera tracking processes.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding to the reader. This summary is not an extensive overview of the invention, and it does not identify key/critical elements of the invention or delineate the scope of the invention. Its sole purpose is to present some concepts disclosed herein in a simplified form as a prelude to the more detailed description that is presented later.
Real-time camera tracking using depth maps is described. In an embodiment, depth map frames are captured by a mobile depth camera at more than 20 frames per second and are used to dynamically update, in real time, a set of registration parameters that specify how the mobile depth camera has moved. In examples, the real-time camera tracking output is used for computer gaming applications and robotics. In an example, an iterative closest point process with projective data association and a point-to-plane error metric is used to compute the updated registration parameters. In an example, a Graphics Processing Unit (GPU) implementation is used to optimize the error metric in real time. In some embodiments, a dense 3D model of the mobile depth camera's environment is used.
Many of the attendant features will be more readily appreciated as the same becomes better understood by reference to the following detailed description considered in connection with the accompanying drawings.
Drawings
The invention will be better understood from a reading of the following detailed description in light of the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a person holding a mobile depth camera in a room that can be used for real-time camera tracking and optionally also to generate a dense 3D model or map of the room;
FIG. 2 is a plan view of a floor of a building being explored by a person holding a mobile depth camera;
FIG. 3 is a schematic diagram of a mobile depth camera connected to a real-time camera tracking system, a dense 3D model formation system, and a gaming system;
FIG. 4 is a schematic diagram of an example frame alignment engine;
FIG. 5 is a flow chart of an iterative process for camera tracking;
FIG. 6 is a flow diagram of more detail of a portion of the iterative process of FIG. 5 for computing pairs of corresponding points;
FIG. 7 is a flow chart of a process for computing pairs of corresponding points using predictions from a dense 3D model;
FIG. 8 is a flow diagram of a process for calculating and minimizing a point-to-plane error metric for use in the iterative process of FIG. 5;
FIG. 9 is a flow diagram of a process at a parallel computing unit, such as a Graphics Processing Unit (GPU);
FIG. 10 illustrates an exemplary computing-based device in which embodiments of the real-time camera tracking system may be implemented.
Like reference numerals are used to refer to like parts throughout the various drawings.
Detailed Description
The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the example of the invention, as well as the sequence of steps for constructing and operating the example of the invention. However, the same or equivalent functions and sequences may be accomplished by different examples.
While examples of the present invention are described and illustrated herein as being implemented in a real-time camera tracking system using depth images obtained from a mobile depth camera that emits and captures infrared light, the described system is provided as an example and not a limitation. Those skilled in the art will appreciate that the examples of the present invention are suitable for application in a variety of different types of real-time camera tracking systems, including but not limited to systems using depth information obtained from stereo cameras and systems using depth information obtained by emitting and capturing other types of electromagnetic radiation.
The term "image element" is used herein to refer to a pixel, group of pixels, voxel, group of voxels, or other higher-level component of an image.
The term "dense 3D model" is used herein to refer to a representation of a three-dimensional scene including objects and surfaces, where the representation includes details about image elements of the scene. In contrast, a non-dense 3D model may include a frame-based representation of an object. In an example, all or many points from the incoming depth map may be used to describe the surface in the environment, and this description forms a dense 3D model. Sparse models would take only a subset of these points to speed up the computation and reduce the memory footprint.
Fig. 1 is a schematic diagram of a person 100 standing in a room and holding a mobile depth camera 102, which in this example also incorporates a projector that projects an image of a cat 108 into the room. The room contains various objects 106 such as a chair, a door, a window, a plant, a light and another person 104. Many of these objects 106 are static, although some (such as person 104) may move. As the person moves around the room, the mobile depth camera captures images that are used by the real-time camera tracking system 112 to monitor the position and orientation of the camera in the room. The real-time camera tracking system 112 may be integrated with the mobile depth camera 102 or may be at another location, provided it is able to receive communications (directly or indirectly) from the mobile depth camera 102. For example, the real-time camera tracking system 112 may be provided at a personal computer, dedicated computer gaming device, or other computing device in the room that is in wireless communication with the mobile depth camera 102. In other embodiments, the real-time camera tracking system 112 may be elsewhere in the building or at another remote location and communicate with the mobile depth camera 102 over any suitable type of communications network. The mobile depth camera 102 is also in communication with a dense 3D model 110 of the environment (in this case a 3D model of the room) or another type of map of the environment. For example, as the person moves around the room, images captured by the mobile depth camera 102 are used to form and build up the dense 3D model of the environment. The real-time camera tracking system 112 may track the position of the camera relative to the 3D model or map of the environment 110. The outputs of the real-time camera tracking system 112 and the dense 3D model or map 110 may be used by a gaming system or other application, although this is not essential. For example, a projector at the mobile depth camera 102 may be arranged to project images depending on the output of the real-time camera tracking system 112 and the 3D model 110.
Fig. 2 is a plan view of a floor 200 of a building. A person 202 holding a mobile depth camera 204 is moving around the floor as indicated by the dashed arrow 208. The person walks along a corridor 206, past various rooms and furniture 210. The real-time camera tracking system 112 is able to track the position of the mobile depth camera 204 as it moves, and a 3D model or map of the floor is formed. It is not essential for a person 202 to carry the mobile depth camera 204: in other examples the mobile depth camera 204 is mounted on a robot or vehicle. This also applies to the example of fig. 1.
FIG. 3 is a schematic diagram of a mobile environment sensor 300 for use with a real-time camera tracker 316, a dense model formation system 324, and an optional gaming system 332. The mobile environment sensor 300 comprises a depth camera 302 arranged to capture sequences of depth images of a scene. Each depth image or depth map frame 314 comprises a two-dimensional image in which each image element holds a depth value, such as the length or distance from the camera to the object in the captured scene that gave rise to that image element. The depth value may be an absolute value provided in specified units of measurement (e.g., meters or centimeters), or it may be a relative depth value. In some cases the depth value may be a disparity value, for example where stereoscopic depth information is available. Each captured depth image contains approximately 300,000 or more image elements, each having a depth value. The frame rate is high enough, for example at least 20 frames per second, for the depth images to be used in working robotics, computer gaming, or other applications.
Depth information may be obtained using any suitable technique, including but not limited to time-of-flight, structured light, stereo images. In some examples, the depth camera can organize the depth information into Z layers perpendicular to a Z axis extending along a line of sight of the depth camera.
The mobile environment sensor 300 may further comprise a light emitter 304 arranged to illuminate the scene in such a way that depth information can be ascertained by the depth camera 302. For example, where the depth camera 302 is an Infrared (IR) time-of-flight camera, the light emitter 304 emits IR light onto the scene, and the depth camera 302 is arranged to detect backscattered light from the surface of one or more objects in the scene. In some examples, pulsed infrared light may be emitted from the light emitter 304 such that the time between an outgoing light pulse and a corresponding incoming light pulse may be detected and measured by a depth camera and used to determine a physical distance from the environmental sensor 300 to a location on an object in the scene. Additionally, in some examples, the phase of the outgoing light wave from the light emitter 304 may be compared to the phase of the incoming light wave at the depth camera 302 to determine a phase shift. The intensity of the reflected beam over time is then analyzed via various techniques including, for example, shuttered light pulse imaging to use the phase shift to determine the physical distance from the moving environmental sensor 300 to the location on each object.
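For illustration only, the distance relationships implied by the pulse-timing and phase-shift measurements described above can be sketched as follows; the constants and function names are assumptions for this sketch and not part of the described sensor.

```python
import math

# Illustrative sketch only: recovering distance for a time-of-flight depth camera.
C = 299_792_458.0  # speed of light in m/s

def distance_from_pulse(round_trip_s: float) -> float:
    """Distance from the time between an outgoing and incoming light pulse."""
    return C * round_trip_s / 2.0  # light travels to the surface and back

def distance_from_phase(phase_shift_rad: float, modulation_hz: float) -> float:
    """Distance from the phase shift of a continuously modulated wave.
    Unambiguous only up to half the modulation wavelength."""
    wavelength = C / modulation_hz
    return (phase_shift_rad / (2.0 * math.pi)) * wavelength / 2.0

# Example: a 6.67 ns round trip corresponds to roughly 1 m.
print(distance_from_pulse(6.67e-9))      # ~1.0
print(distance_from_phase(0.84, 20e6))   # ~1.0 at 20 MHz modulation
```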
In another example, the mobile environment sensor 300 may use structured light to capture depth information. In such techniques, patterned light (e.g., light displayed as a known pattern such as a grid or stripe pattern) may be projected onto the scene using the light emitter 304. Upon striking the surfaces of objects in the scene, the pattern becomes deformed. Depth camera 302 captures this deformation of the pattern and analyzes it to determine an absolute or relative distance from depth camera 302 to the objects in the scene. In some cases, the mobile environment sensor 300 emits a spatially and/or temporally varying pattern of electromagnetic radiation, and the pattern is calibrated so that, when an image is received by the depth camera 302, it can perform pattern matching against a database of patterns and thereby calculate depth information. This can be thought of as a 3D pattern of dots projected into the environment: wherever the pattern falls on a surface, depth camera 302 can detect that surface and calculate its distance from depth camera 302.
In another example, depth camera 302 includes a pair of stereo cameras to obtain and resolve visual stereo data to generate relative depth information. In this case, the light emitter 304 may be used to illuminate the scene or may be omitted.
In some examples, in addition to depth camera 302, mobile environment sensor 300 includes a color video camera referred to as RGB camera 306. The RGB camera 306 is arranged to capture a sequence of images of a scene at visible light frequencies.
The mobile environment sensor 300 may include an orientation sensor 308, such as an Inertial Measurement Unit (IMU), accelerometer, gyroscope, compass, or other orientation sensor 308. However, the use of an orientation sensor is not essential. The mobile environment sensor 300 may include a location tracking device such as a GPS, but this is not required.
The mobile environment sensor may include the projector 312 mentioned above with reference to fig. 1, but this is not required.
The mobile environment sensor also includes one or more processors, memory, and a communication infrastructure, as described in more detail below.
The mobile environmental sensor may be provided in a housing shaped and sized to be held by a user or worn by a user. In other examples, the mobile environment sensor is sized and shaped to be included or mounted on a vehicle, toy, or other movable device.
The mobile environment sensor 300 is connected to a real-time tracker 316. This connection may be a physical wired connection or wireless communication may be used. In some examples, the mobile environmental sensor 300 is indirectly connected to the real-time tracker through one or more communication networks, such as the internet.
The real-time tracker 316 is computer-implemented using a general-purpose microprocessor controlling one or more Graphics Processing Units (GPUs). It comprises a frame alignment engine 318 and may optionally comprise a loop closure engine 320 and a relocalization engine 322. The real-time tracker 316 takes depth map frames 314 from the depth camera 302, and may also optionally take input from other sensors of the mobile environment sensor 300, optional map data 334, and optional data from the gaming system 332. The real-time tracker aligns the depth map frames in order to produce a real-time series 328 of six degree-of-freedom pose estimates for the depth camera 302. It may also produce transformation parameters (also called registration parameters) for the transformations between pairs of depth map frames. In some examples, the real-time tracker operates on pairs of depth map frames 314 from the depth camera. In other examples, the real-time tracker 316 takes a single depth map frame 314 and aligns it with a dense 3D model 326 of the scene rather than with another depth map frame 314. In some examples, the real-time tracker also uses color video input from the RGB camera 306, although this is not essential.
For example, in some embodiments, the real-time tracker 316 provides an output to the dense 3D model forming system 324, which dense 3D model forming system 324 uses this information along with the depth map frames 314 to form and store a dense 3D model of the scene or environment in which the mobile environment sensor 300 is moving. For example, in the case of fig. 1, the 3D model will be a 3D model of the surfaces and objects in the room. In the case of fig. 2, the 3D model will be a 3D model of that floor of the building. Dense 3D model 326 may be stored in GPU memory, or otherwise stored.
The mobile environment sensor 300 may be used in conjunction with a gaming system 332, the gaming system 332 being connected to a display 330. For example, the game may be a golf game, a boxing game, a racing game, or other type of computer game. Data from the gaming system 332 (e.g., game state or metadata associated with the game) may also be provided to the real-time tracker 316. Moreover, information from the real-time tracker may be used by the gaming system 332 to influence the progress of the game. Information from the 3D model may also be used by the gaming system 332 to influence the course of the game.
Optionally, map data 334 is available to the real-time tracker 316. This may be, for example, an architect's drawing of the environment (e.g., the room or floor of the building), the locations of landmarks known in the environment, or a map of the environment available from another source.
The frame alignment engine 318 of the real-time tracker is arranged to align pairs of depth map frames, or to align a depth map frame with an estimate of a depth map frame obtained from the dense 3D model. It uses an iterative process implemented with one or more graphics processing units so that the frame alignment engine operates in real time. More detail about the frame alignment engine is given below with reference to fig. 4. The loop closure engine 320 is arranged to detect when the mobile environment sensor has moved in a loop, so that the scene depicted in the current depth map frame at least partially overlaps with the scene of a previous depth map frame that is not the immediately preceding one. This may occur, for example, when a user walks around the entire floor of the building in fig. 2 and arrives back at the starting point. It may also occur when a user moves around a room, behind a piece of furniture, and out again to at or near the original starting position.
The relocalization engine 322 is arranged to handle the case in which the real-time tracker loses the current position of the mobile environment sensor 300 and must relocalize, that is, find the current position again.
In one example, the processing performed by the real-time tracker 316 and/or the dense 3D model formation system 324 may be performed remotely from the location of the mobile environment capture device 300. For example, the mobile environment capture device 300 may be connected to (or include) a computing device that has relatively low processing power and streams depth images to a server over a communication network. The server has a relatively high processing power and performs the computationally complex tasks of the real-time tracker 316 and/or the dense 3D model forming system 324. The server may return the densely reconstructed rendered images on a frame-by-frame basis to provide the user with an interactive experience, and return the final dense 3D reconstruction for subsequent local use (e.g., in a game) when the model is complete. Such an arrangement avoids the need for the user to have a high-power local computing device.
In one example, input from an RGB camera at a mobile environment sensor may be used to supplement information from a depth camera. This is useful in situations where depth does not provide enough information for tracking, such as when the camera is moving in an environment with too few depth features. If visual features are present in the environment, these visual features may be detected by the RGB camera and may be used to enable simultaneous localization and mapping to be provided.
Fig. 4 is a more detailed schematic diagram of the frame alignment engine 318 of fig. 3. The frame alignment engine 408 is computer-implemented at a computing device having one or more GPUs 416 or other parallel computing units. For example, the parallel computing unit may be a vector processor, a Single Instruction Multiple Data (SIMD) architecture, a graphics processing unit, or another parallel computing device. It includes an iterative closest point process 410 and an optional plane extraction component 412. The iterative closest point process uses projective data association and a point-to-plane error metric, as described in more detail below. The frame alignment engine receives the current depth map 400 from the depth camera; this is also referred to as the destination depth map. In some examples, it also receives a source depth map 402 from the depth camera, the source depth map 402 being the previous depth map frame. In other examples, the frame alignment engine obtains a dense surface model estimate 406 of the source depth map. The output of the frame alignment engine is a set of registration parameters of a transformation that aligns the current frame with the source frame (or frame estimate). In some examples, the registration parameters are provided as an SE3 matrix giving a six degree-of-freedom (6DOF) pose estimate, the SE3 matrix describing the rotation and translation of the depth camera 302 relative to real-world coordinates. This transformation matrix can be represented more formally as:
$$T_k = \begin{bmatrix} R_k & t_k \\ 0^{\mathsf{T}} & 1 \end{bmatrix} \in \mathbb{SE}_3$$

where $T_k$ is the transformation matrix for depth image frame $k$, $R_k$ is the camera rotation for frame $k$, $t_k$ is the camera translation at frame $k$, and $\mathbb{SE}_3$ is the special Euclidean group. Coordinates in camera space (i.e., from the camera's point of view) can be mapped to real-world coordinates by multiplying by this transformation matrix.
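As a rough illustration of how such a transformation matrix is applied, the sketch below builds a 4x4 SE3 matrix from a rotation and translation and maps a camera-space point to world coordinates; the helper names are illustrative, not part of the described system.

```python
import numpy as np

def make_transform(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build the 4x4 SE3 matrix T_k = [[R_k, t_k], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def camera_to_world(T_k: np.ndarray, point_camera: np.ndarray) -> np.ndarray:
    """Map a camera-space point to real-world coordinates by multiplying by T_k."""
    homogeneous = np.append(point_camera, 1.0)   # (x, y, z, 1)
    return (T_k @ homogeneous)[:3]

# Example: a 90 degree rotation about the z axis plus a 1 m translation in x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = make_transform(Rz, np.array([1.0, 0.0, 0.0]))
print(camera_to_world(T, np.array([0.5, 0.0, 2.0])))   # -> [1.0, 0.5, 2.0]
```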
However, the registration parameters may be provided in any suitable form. These registration parameters are used by the real-time tracker 316 to produce a real-time series of 6-degree-of-freedom pose estimates for the depth camera.
Fig. 5 is a flow diagram of an example iterative process at a frame alignment engine. An initial estimate of the registration parameters is formed (500). These are the registration parameters of the transformation used to align the current frame and the source frame. This initial estimate is formed in any suitable manner. For example, one or more of the following information sources may be used to form the initial estimate: game state, game metadata, map data, RGB camera output, orientation sensor output, and GPS data. In another example, the initial estimate is formed by predicting where the camera is using information about the camera's previous motion path. For example, it may be assumed that the camera has a constant speed or a constant acceleration. The motion path of the camera from time 0 to time t-1 can be used to estimate where the camera will be at time t and thus obtain an estimate of the registration parameters.
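A minimal sketch of the constant-velocity prediction mentioned above, assuming poses are held as 4x4 SE3 matrices; this is one illustrative way to form the initial estimate, not the only one.

```python
import numpy as np

def predict_pose(pose_prev: np.ndarray, pose_prev2: np.ndarray) -> np.ndarray:
    """Constant-velocity prediction: apply the last inter-frame motion again.

    pose_prev  : camera pose at time t-1 (4x4 SE3 matrix)
    pose_prev2 : camera pose at time t-2 (4x4 SE3 matrix)
    returns    : predicted pose at time t, usable as the initial estimate of the
                 registration parameters before iterative refinement.
    """
    delta = pose_prev @ np.linalg.inv(pose_prev2)   # motion from t-2 to t-1
    return delta @ pose_prev                        # assume the same motion repeats
```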
Using the initial estimate, pairs of corresponding points between the current frame and the source frame (a depth map or an estimated depth map) are computed (502). A pair of corresponding points is a point from one depth map and a point from the other depth map, where those points are estimated to have arisen from the same real-world point in the scene. The term "point" is used here to mean a pixel, or a group or patch of neighbouring pixels. This correspondence problem is very difficult because of the huge number of possible combinations of points. Previous approaches using color or grayscale images have addressed this problem by identifying shapes such as lines, edges and corners in each image and then attempting to match those shapes between pairs of images. In contrast, the embodiments described herein identify corresponding points without requiring shapes to be found in the depth maps. More detail about how the corresponding points are computed is given below with reference to fig. 6. An updated estimate of the registration parameters is then calculated (504) that optimizes an error metric applied to the computed corresponding points.
A check is made to assess whether convergence has been reached (506). If so, there is little or no change in the updated estimate, and the registration parameters are output (508). If not, the iterative process repeats, as indicated in FIG. 5.
Referring to fig. 6, more detail is now given about how pairs of corresponding points are calculated. In some embodiments, sample points are taken (600) from either or both of the current and source depth maps, and those sample points are used as candidates from which to find pairs of corresponding points. Sampling may be achieved by randomly selecting a specified proportion of the points. In another embodiment, the sampling takes into account the surface normals of the points. For example, a surface normal is calculated for each point (as described in more detail below) and a histogram is created with a number of bins for different ranges of surface normal values; sampling is then carried out so that uniform sampling across the bins is achieved.
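A possible sketch of this normal-aware sampling, assuming unit surface normals are already available for every point; the bin layout and per-bin sample count are illustrative choices.

```python
import numpy as np

def sample_uniform_over_normal_bins(normals, bins_per_axis=8, per_bin=50, seed=0):
    """Bucket points by the azimuth/elevation of their unit surface normal and
    draw roughly the same number of samples from every non-empty bin, so that
    all surface orientations are represented in the sample.
    normals: N x 3 array of unit normals. Returns indices of sampled points."""
    rng = np.random.default_rng(seed)
    azimuth = np.arctan2(normals[:, 1], normals[:, 0])          # range [-pi, pi]
    elevation = np.arcsin(np.clip(normals[:, 2], -1.0, 1.0))    # range [-pi/2, pi/2]
    a_bin = np.digitize(azimuth, np.linspace(-np.pi, np.pi, bins_per_axis + 1)) - 1
    e_bin = np.digitize(elevation, np.linspace(-np.pi / 2, np.pi / 2, bins_per_axis + 1)) - 1
    bin_id = a_bin * (bins_per_axis + 2) + e_bin                # unique id per (azimuth, elevation) bin

    chosen = []
    for b in np.unique(bin_id):
        members = np.flatnonzero(bin_id == b)
        take = min(per_bin, members.size)
        chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen)
```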
Using sampling brings the benefit of reduced computational cost. However, sampling risks reducing the accuracy and robustness of the process, because the sampled points may not provide a good indication of the depth map from which they were taken. For example, the samples may lead the process to a set of correspondences that it identifies as a solution but that in fact represents a locally optimal solution rather than a globally optimal one.
As mentioned above, it is not necessary to use any sampling. This procedure is also feasible and gives good results when all available points are used. In this case, the parallel processing implementation described herein allows the process to operate in real time for all points in each depth map, which may be as many as 300000 or more. In the example described below with reference to fig. 6, the process is described using sampling. However, the process of fig. 6 can also be applied when no sampling is performed.
As indicated in fig. 6, a surface normal is calculated (602) for the sample point (or each available point if no sampling is performed). For example, for a given point, this is achieved by finding the two (or more) nearest neighbor points in the depth map and calculating a surface patch that includes these neighbor points and the point itself. The normal of the surface patch at the location of the point is then calculated.
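A minimal sketch of this neighbour-based normal estimation, assuming the depth map has already been re-projected into an H x W x 3 vertex map; it uses the cross product of finite differences to adjacent points.

```python
import numpy as np

def compute_normals(vertex_map: np.ndarray) -> np.ndarray:
    """vertex_map: H x W x 3 array of 3D points re-projected from the depth map.
    Returns an H x W x 3 array of unit surface normals (zero where undefined)."""
    normals = np.zeros_like(vertex_map)
    # Finite differences to the right-hand and lower neighbours of each point.
    dx = vertex_map[:-1, 1:, :] - vertex_map[:-1, :-1, :]
    dy = vertex_map[1:, :-1, :] - vertex_map[:-1, :-1, :]
    n = np.cross(dx, dy)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    normals[:-1, :-1, :] = np.where(length > 0, n / np.maximum(length, 1e-12), 0.0)
    return normals
```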
A process of finding (604) pairs of corresponding points is then followed. This is now described for the case in which a source depth map and a current depth map are available, without using a dense 3D model. For each source point sampled from the source depth map, a ray is projected (606) from the camera position associated with the source depth map, through the sampled source point, and onto a destination point in the destination depth map. In some cases, the destination point may lie in front of the sampled source point along the projected ray. This projection process may be referred to as "projective data association". Candidate corresponding points are then searched for (608) around, and including, the destination point. For example, the search is for points that have surface normals compatible with the surface normal of the sampled source point and that lie within a specified Euclidean distance of the destination point. Surface normals are considered compatible if they are within a specified range of one another. For example, this specified range and the Euclidean distance may be user configurable and/or set using empirical data relating to the particular application conditions concerned.
As a result of this search, one or more candidate corresponding points are found. A single point is selected (610) from those candidate corresponding points to form a pair with the source point. This selection is made based on a distance metric. For example, the euclidean distance between the source point and each of the candidate corresponding points is calculated. The pair of points giving the smallest euclidean distance is then selected. The process of block 604 is then repeated for each of the sampled source points, or for each of the available points in the source depth map without sampling.
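Blocks 606 to 610 can be sketched roughly as follows, assuming a simple pinhole projection model; the intrinsics, thresholds and function names here are assumptions made for illustration, not part of the described system.

```python
import numpy as np

def find_corresponding_point(source_pt, source_n, current_vertices, current_normals,
                             T_estimate, fx, fy, cx, cy,
                             max_dist=0.1, min_normal_dot=0.8, window=2):
    """Project a source point into the current (destination) depth frame using the
    current pose estimate, then pick the nearest compatible point in a small
    search window around the destination pixel."""
    p = (T_estimate @ np.append(source_pt, 1.0))[:3]      # source point in destination camera frame
    n = T_estimate[:3, :3] @ source_n                     # rotate the source normal the same way
    if p[2] <= 0:
        return None                                       # behind the camera: no destination pixel
    u = int(round(fx * p[0] / p[2] + cx))                 # projected (destination) pixel column
    v = int(round(fy * p[1] / p[2] + cy))                 # projected (destination) pixel row

    H, W, _ = current_vertices.shape
    best, best_dist = None, max_dist
    for dv in range(-window, window + 1):                 # search around and including the destination point
        for du in range(-window, window + 1):
            r, c = v + dv, u + du
            if not (0 <= r < H and 0 <= c < W):
                continue
            q, nq = current_vertices[r, c], current_normals[r, c]
            dist = np.linalg.norm(q - p)                  # Euclidean distance test
            if dist < best_dist and float(np.dot(n, nq)) > min_normal_dot:
                best, best_dist = (r, c), dist            # keep the nearest compatible candidate
    return best                                           # pixel of the chosen corresponding point, or None
```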
In some embodiments, a weight is assigned (612) to each pair of corresponding points. For example, information from other sources (such as an RGB camera or other sensors) may be used to assign the weights. In an example, for each pair of corresponding points, a weight related to measured characteristics of the depth camera, such as radial lens distortion and/or depth-dependent error, is calculated and stored. In another example, pairs that include a point lying on an edge detected in the depth map using an edge detection algorithm are weighted more highly than other pairs. These weights may be used during the process of applying the error metric in order to improve the quality of the results. For example, weights related to depth-dependent error allow high depth values, which may fluctuate considerably owing to a lack of precision, to be taken into account.
In some embodiments, pairs of points that lie on or near the boundary of a depth map are rejected (614). This helps to avoid errors where there is only partial overlap between the two depth maps. Other criteria may also be used to reject pairs. For example, in some embodiments plane extraction is performed, as mentioned above with reference to component 412 of FIG. 4. In that case, pairs that lie on a plane may be rejected in order to prevent the tracker from being biased by a large plane and thus ignoring smaller but distinctive regions of the depth map.
In some embodiments, the source depth map is estimated or predicted from a dense 3D model of the scene being captured by the depth camera. In this case the method of FIG. 7 is followed. The dense 3D model of the scene comprises a 3D surface representation of the scene stored in memory at the parallel computing unit. For example, the dense 3D model may be stored as a linear array in slice-row-column order (more detail on this is given below), optionally with some padding so that slices and rows align with certain memory block sizes. Other ways of storing the 3D model may be used, such as octrees, coarse-to-fine representations, and mesh-based representations such as polygon meshes.
More detail is now given for the case in which the dense 3D model is stored in "slice-row-column" order on a parallel computing unit such as a GPU. In this case, the model may be stored as a linear array of memory locations representing a 3D volume. This is achieved by mapping each voxel to a memory array index using linear pitched memory, which provides fast parallel access to the data stored in the parallel computing unit's memory.
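The "slice-row-column" layout can be illustrated with a simple index computation; the volume dimensions used here are assumptions for illustration.

```python
def voxel_index(x: int, y: int, z: int, width: int, height: int) -> int:
    """Map voxel coordinates to an offset in a linear array stored in
    slice-row-column order: complete z-slices, each laid out row by row.
    Threads on a parallel computing unit can compute this independently,
    giving coalesced access when adjacent threads handle adjacent x values."""
    return (z * height + y) * width + x

# Example: a 512^3 volume flattened into one array of 512*512*512 entries.
assert voxel_index(0, 0, 0, 512, 512) == 0
assert voxel_index(1, 0, 0, 512, 512) == 1          # neighbouring x values are adjacent
assert voxel_index(0, 1, 0, 512, 512) == 512        # next row
assert voxel_index(0, 0, 1, 512, 512) == 512 * 512  # next slice
```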
As described above, the surface normal of a sample point in the current depth map is calculated by evaluating (700) neighbouring points of the sample point. For the predicted source depth map, a surface normal prediction and a surface position prediction are computed (702) from the dense 3D model for each predicted sample point. A predicted sample point is a point from the dense 3D model at the same pixel position as a sample point from the current depth map. This is done by casting a ray into the volume of the dense surface model. The ray is cast from the estimated camera position and orientation associated with the current depth map and enters the 3D model through a point on its face corresponding to the sample point in the current depth map. This applies where the 3D model is stored as a volumetric representation. Where the 3D model is stored using a mesh-based representation, the surface is first projected to form a virtual depth image representation, and the ray may then be cast into that virtual depth image representation. A first visible surface is found along the ray by stepping along the ray and evaluating the surface density function to find the first zero crossing from positive to negative. The associated sub-pixel world point is found along the ray by estimating where the surface density function crosses zero. In one example, a simple linear interpolation between the trilinearly sampled points on either side of the detected zero crossing may be used to find the sub-pixel world point at which the zero occurs. This sub-pixel world point is taken as the predicted surface position. To find the predicted surface normal at this position, the gradient of the surface density function is found by finite differences using trilinear interpolation. The process of computing the surface normal prediction and surface position prediction (702) may be implemented at the parallel computing unit, with each ray handled in parallel.
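A rough sketch of the ray-stepping and zero-crossing interpolation described above, assuming a callable that returns the (e.g. trilinearly interpolated) surface density value at any 3D point; the step size and bounds are illustrative.

```python
import numpy as np

def find_surface_along_ray(origin, direction, sample_density, step=0.01, max_depth=5.0):
    """sample_density(point) returns the surface density value at a 3D point:
    positive in front of the surface, negative behind it.
    Returns the predicted surface position along the ray, or None."""
    direction = direction / np.linalg.norm(direction)
    prev_t, prev_val = 0.0, sample_density(origin)
    t = step
    while t < max_depth:
        val = sample_density(origin + t * direction)
        if prev_val > 0.0 and val < 0.0:             # first positive-to-negative zero crossing
            # Linear interpolation between the two samples bracketing the crossing.
            t_hit = prev_t + (t - prev_t) * prev_val / (prev_val - val)
            return origin + t_hit * direction        # predicted surface location (sub-step accuracy)
        prev_t, prev_val = t, val
        t += step
    return None
```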
For each predicted sample point (obtained from the dense 3D model), a process (704) is followed to identify a corresponding point in the current depth map. This is similar to process 604 of FIG. 6. The predicted sample point is projected (706) onto a destination point in the destination depth map (the current depth map). Candidate corresponding points with surface normals compatible with that of the destination point are then searched for (708) around the destination point. A point is selected (610) from those candidate corresponding points according to a distance metric. For example, a pair of points is compatible if the points are within a specified Euclidean distance e1 of each other and the dot product between their surface normals is greater than a specified threshold e2. The parameters e1 and e2 may be user configurable, or may be set during a manufacturing stage in which the device is empirically calibrated for use in a particular setting.
In some cases, weights are assigned (712) to respective pairs of corresponding points. In some embodiments, a pair is rejected 714 if the pair includes at least one point on or near the depth map boundary. In an example, for each of the respective pairs of points, weights are stored relating to measured features of the depth camera. These weights may be used during the process of applying the error metric to improve the quality of the result.
Once pairs of corresponding points are identified, for example using the process of fig. 6 or 7, an error metric is calculated and minimized, and the iterative process of fig. 5 is repeated.
In an example, a point-to-plane error metric is calculated (800) for each respective pair of points, and this metric is optimized to obtain updated registration parameters. An example of this process is now described with reference to fig. 8. This process is designed to be implemented using at least one GPU to achieve real-time processing as now described.
Calculating a point-to-plane error metric may be considered as calculating 802 a sum of squared distances from each source point to a plane containing the destination point and oriented approximately perpendicular to a surface normal of the destination point. The process seeks to optimize this metric to find an updated set of registration parameters. Solving this type of optimization problem is not straightforward, but typically requires a large amount of computational resources, making this type of process difficult to implement for real-time applications. An example implementation that enables real-time processing and uses at least one GPU is now described.
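The quantity described in block 802 can be sketched as the following residual computation over the pairs of corresponding points; the array names are illustrative.

```python
import numpy as np

def point_to_plane_error(T, source_points, dest_points, dest_normals):
    """Sum of squared distances from each transformed source point to the plane
    that contains its corresponding destination point and is perpendicular to
    the destination point's surface normal.

    T             : 4x4 candidate transformation (registration parameters)
    source_points : N x 3 points from the source frame
    dest_points   : N x 3 corresponding points in the destination frame
    dest_normals  : N x 3 unit surface normals at the destination points
    """
    src_h = np.hstack([source_points, np.ones((len(source_points), 1))])
    transformed = (T @ src_h.T).T[:, :3]
    residuals = np.einsum('ij,ij->i', transformed - dest_points, dest_normals)
    return float(np.sum(residuals ** 2))
```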
Each respective pair of points may be scaled and translated (804). This may improve the stability of the optimization process, but is not necessary.
For each pair of corresponding points, a linear system comprising a plurality of simultaneous equations is formed (806) on a parallel computing unit, such as a GPU, in order to optimize the error metric using numerical least squares optimization. Each pair of points yields a 6 by 6 matrix, and these matrices are reduced to a single 6 by 6 matrix on the parallel computing unit. Because the frame rate is high (e.g., 20 to 40 frames per second), it is possible to make a small-angle approximation for the angle (change in camera orientation) between any two successive frames. That is, because the frame rate is so high, the camera will have moved only a small amount between frames. Making this approximation facilitates real-time operation of the system.
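A serial, CPU-style sketch of the per-pair linearization under the small-angle approximation, accumulating the 6 by 6 normal equations; in the implementation described here each pair would be handled in parallel on the GPU before reduction, and the names below are assumptions.

```python
import numpy as np

def build_normal_equations(source_points, dest_points, dest_normals):
    """Linearize the point-to-plane metric with the small-angle approximation
    (incremental rotation [alpha, beta, gamma] and translation t) and accumulate
    the 6x6 normal matrix and 6-vector right-hand side over all pairs."""
    AtA = np.zeros((6, 6))
    Atb = np.zeros(6)
    for p, q, n in zip(source_points, dest_points, dest_normals):
        J = np.concatenate([np.cross(p, n), n])   # derivative of n.((I+[w]x)p + t - q) w.r.t. (w, t)
        r = np.dot(n, q - p)                      # negated residual at the current estimate
        AtA += np.outer(J, J)                     # accumulate the 6x6 normal matrix
        Atb += J * r                              # accumulate the right-hand side
    return AtA, Atb

# Solving AtA x = Atb (e.g. with np.linalg.solve) yields the 6-vector of
# incremental rotation and translation used to update the registration parameters.
```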
The single 6 by 6 matrix is passed to the CPU 808 and solved for updated registration parameters. The solution is scaled and translated 810 back to reverse the scaling and translation steps of 804. The stability of the solution is checked 812 and the process outputs updated registration parameters 814.
FIG. 9 gives a more detailed description of how a linear system can be formed on a parallel computing unit such as a GPU and how it can be reduced to a single 6 by 6 matrix. In this example, the following point-to-plane error metric is used, although this is not essential and other error metrics may be used:

$$E(T_k) \;=\; \sum_{u} \Big( \big( T_k\, v_k(u) - \hat v^{\,g}_{\rho_k(u)} \big) \cdot \hat n^{\,g}_{\rho_k(u)} \Big)^{2}$$

where the sum is over the image pixels u for which a correspondence is available. This error metric can be used to obtain a new transformation $T_k$. More detail about the notation used is now given. The depth camera provides calibrated depth measurements $d = D_k(u)$ at each image pixel $u = (x, y)$ in the image domain $u \in U$. These measurements may be re-projected into the camera's world space as $v_k(u) = (xd, yd, d, 1)$ (using homogeneous coordinates). Since each frame from the depth sensor is a surface measurement on a regular grid, the system can also compute the corresponding normal vectors $n_k(u)$, which are estimated by finite differences between neighbouring re-projected grid points. The $\mathbb{SE}_3$ transformation matrix maps the camera coordinate frame at time $k$ into the global frame $g$ (the equivalent mapping for normal vectors uses the rotation part only). The estimate of the 3D model at time $k$ in the global coordinate system is denoted $M_k$, and $M_k$ may be stored in the volumetric representation described herein. The incoming depth frame $D_k$ is registered against the estimate of the previous frame, obtained by ray casting the full 3D reconstructed model $M_{k-1}$ into the previous frame's camera pose $T_{k-1}$. This results in a predicted image, or equivalently a set of global model points $\hat v^{\,g}_i$ and model normals $\hat n^{\,g}_i$, where $i \in S$ is the corresponding index set. The symbol $\rho_k$ in the error metric above denotes the projective data association mapping between camera points and model points at time $k$.
Thus, the method of FIG. 9 is an example implementation of the process shown in FIG. 8, which forms a linear system for each pair of corresponding points on the GPU and reduces (806) these linear systems to a single 6 by 6 matrix (908). In this example, each pair of corresponding points identified by the frame alignment engine 408 may be processed (902) in parallel on the GPU. Thus, for each pair of corresponding points, a 6 by 6 matrix expression (904) of the linear system is computed, giving an arithmetic expression of the point-to-plane constraint system. By making the small-angle assumption, the transformation T can be parameterized using an incremental rotation expressed as a 3-vector via the skew-symmetric matrix $R \approx [\alpha, \beta, \gamma]_{\times}$, together with a three-element translation vector t. A linear system is obtained by setting the first derivative of the linearized error metric to zero. The point-to-plane constraint system expresses the optimization of the point-to-plane error metric mentioned above. This computation is performed in parallel on the GPU for each pair of corresponding points; in this way, the error metric is applied to each of the identified corresponding pairs in parallel. The arithmetic expression for each pair of points is evaluated (906) using a tree reduction process or another suitable method of evaluating arithmetic expressions. A tree reduction process is an evaluation strategy in which an arithmetic expression is represented as a tree structure, with internal nodes representing arithmetic operations and leaf nodes representing values; the expression is evaluated in an order determined by the structure of the tree, passing the results of evaluation along the branches of the tree. The results of the error metric optimization from the parallel processes together provide the 6 by 6 matrix output (908) by reducing the 6 by 6 matrix of each pair of corresponding points to a single 6 by 6 matrix.
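The tree reduction can be sketched as pairwise summation of the per-pair systems; on a GPU this would run across threads in shared memory, but the sketch below shows the evaluation order.

```python
import numpy as np

def tree_reduce(per_pair_systems):
    """per_pair_systems: list of 6x6 (or 6x7) arrays, one per corresponding pair.
    Repeatedly sum adjacent elements so the result is produced in O(log n)
    parallel steps rather than one long serial accumulation."""
    systems = list(per_pair_systems)
    while len(systems) > 1:
        paired = []
        for i in range(0, len(systems) - 1, 2):
            paired.append(systems[i] + systems[i + 1])   # one level of the tree
        if len(systems) % 2 == 1:
            paired.append(systems[-1])                   # odd element carried forward
        systems = paired
    return systems[0]                                    # single reduced matrix
```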
Fig. 10 illustrates components of an exemplary computing-based device 1004, which computing-based device 1004 may be implemented in any form of computing device and/or electronic device, and in which embodiments of a real-time camera tracker may be implemented.
Computing-based device 1004 includes one or more input interfaces 1002 arranged to receive and process input from one or more devices, such as user input devices (e.g., capture device 1008, game controller 1005, keyboard 1006, mouse 1007). This user input may be used to control a software application or real-time camera tracking. For example, the capture device 1008 may be a mobile depth camera arranged to capture a depth map of a scene. The computing-based device 1004 may be arranged to provide real-time tracking of the capture device 1008.
The computing-based device 1004 further comprises an output interface 1010 arranged to output display information to a display device 1009, which may be separate from or integrated with the computing device 1004. The display information may provide a graphical user interface. In an example, if the display device 1009 is a touch-sensitive display device, it may also act as a user input device. The output interface 1010 also outputs data to a device other than the display device (e.g., a locally connected printing device).
Computer-executable instructions may be provided using any computer-readable media that is accessible by computing-based device 1004. Computer-readable media may include, for example, computer storage media such as memory 1012 and communication media. Computer storage media, such as memory 1012, includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. While a computer storage medium (memory 1012) is shown in computing-based device 1004, it is to be appreciated that the storage can be distributed or remotely located and accessed via a network or other communication link (e.g., using communication interface 1013).
The computing-based device 1004 also includes one or more processors 1000, which may be microprocessors, controllers, or any other suitable type of processor for processing computer-executable instructions to control the operation of the device in order to provide real-time camera tracking. In some examples, such as where a system-on-chip architecture is used, the processors 1000 may include one or more fixed-function blocks (also known as accelerators) that implement a part of the real-time camera tracking method in hardware rather than software or firmware.
Platform software, including an operating system 1014 or any other suitable platform software, may be provided at the computing-based device to enable execution of application software 1016 on the device. Other software that may be executed on computing device 1004 includes a frame alignment engine 1018 (see, e.g., fig. 4-8 and the description above), a loop closure engine 1020, and a relocalization engine 1022. A data store 1024 is provided to store data such as previously received depth maps, registration parameters, user-configurable parameters, other parameters, dense 3D models of scenes, game state information, game metadata, map data, and other data.
The term "computer" as used herein refers to any device with processing capability such that it can execute instructions. Those skilled in the art will recognize that such processing capabilities are integrated into many different devices, and thus, the term computer includes PCs, servers, mobile phones, personal digital assistants, and many other devices.
The methods described herein may be performed by software in machine-readable form on a tangible storage medium, for example in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer, and where the computer program may be embodied on a computer-readable medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory, and the like, and do not include propagated signals. The software may be adapted for execution on a parallel processor or a serial processor, such that the method steps may be carried out in any suitable order, or simultaneously.
This acknowledges that software can be a valuable, separately tradable commodity. It is intended to encompass software that runs on, or controls, "dumb" or standard hardware to carry out the desired functions. It is also intended to encompass software that "describes" or defines the configuration of hardware, such as HDL (hardware description language) software used to design silicon chips, or to configure general purpose programmable chips, to carry out desired functions.
Those skilled in the art will realize that storage devices utilized to store program instructions may be distributed across a network. For example, the remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software, as needed, or execute some software instructions at the local terminal and other software instructions at the remote computer (or computer network). Those skilled in the art will also realize that by utilizing conventional techniques known to those skilled in the art that all, or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.
As will be clear to those skilled in the art, any of the ranges or device values given herein may be extended or altered without losing the effect sought.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It is to be appreciated that the advantages described above may relate to one embodiment or may relate to multiple embodiments. The embodiments are not limited to embodiments that solve any or all of the problems or embodiments having any or all of the benefits and advantages described. It will further be understood that reference to "an" item refers to one or more of those items.
The steps of the methods described herein may be performed in any suitable order, or simultaneously, where appropriate. In addition, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
The term "comprises/comprising" when used herein is intended to cover the identified blocks or elements of the method, but does not constitute an exclusive list and a method or apparatus may contain additional blocks or elements.
It will be understood that the above description of a preferred embodiment is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the present invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention.
Claims (10)
1. A method of real-time camera tracking, comprising:
receiving a sequence of depth map frames (314) from a mobile depth camera (302) that is in motion, each depth map frame including a depth value at each image element that is related to a distance from the mobile depth camera to a surface in a scene captured by the mobile depth camera;
tracking the position and orientation of the mobile depth camera by calculating registration parameters (328, 414) for each depth map frame, these registration parameters being parameters of a transformation used to align each depth map frame with a preceding depth map frame;
wherein calculating the registration parameters comprises using an iterative process (412) for:
respective points of a pair of depth map frames are identified (502) without computing shapes depicted within the depth map frames, and an error metric applied to the identified respective points is optimized using parallel computing units such that the error metric is applied to each of the identified respective points in parallel.
2. The method of claim 1, further comprising receiving input from a second sensor (306, 308, 310) associated with the mobile depth camera (302) and using the input to form the initial estimate of the registration parameter, the second sensor being selected from one of: an orientation sensor (308), an RGB video camera (306), a gaming system (332), a map (334) of an environment in which the depth camera is moving, a movement sensor, a position sensor.
3. The method of claim 1 or 2, comprising receiving the sequence of depth map frames (314) at a frame rate of at least 30 frames per second.
4. The method of claim 1, wherein optimizing an error metric using the parallel processing unit comprises: for each pair of corresponding points, a linear system of numerical least squares optimization is formed (806) to produce 6 by 6 matrices, and these matrices are reduced to a single 6 by 6 matrix at the parallel computing unit.
5. The method of claim 1, comprising estimating a prior depth map frame from a dense 3D model (326) of a scene captured by the mobile depth camera.
6. The method of claim 1, wherein identifying respective points (502) of pairs of depth map frames comprises using a projective data association process whereby the estimated position of the mobile camera is used to project (606) points from a source depth map frame onto destination points of a current depth map frame, and the projective data association process comprises searching for candidate respective points around the destination points.
7. The method of claim 1, wherein optimizing the error metric comprises optimizing a point-to-plane error metric comprising a sum of squared distances from a source point to a plane containing a destination point and oriented approximately perpendicular to a surface normal of the destination point (802).
8. The method of claim 1, wherein calculating the registration parameters comprises, for each depth map frame, calculating a surface normal for each point and forming a histogram with a plurality of bins for different ranges of normal values, and uniformly sampling points across the bins; and calculating the registration parameters using only those points from the uniform sampling points.
9. A real-time camera tracker (316) comprising:
an input (1002), the input (1002) being arranged to receive a sequence of depth map frames (314) from a moving mobile depth camera (302), each depth map frame comprising a depth value at each image element, the depth value being related to a distance from the mobile depth camera to a surface in a scene captured by the mobile depth camera;
a frame alignment engine (318), the frame alignment engine (318) tracking the position and orientation of the mobile depth camera by calculating registration parameters for each depth map frame, the registration parameters being parameters of a transformation used to align each depth map frame with a previous depth map frame;
the frame alignment engine is arranged to calculate the registration parameters using an iterative process for: identifying respective points of a pair of depth map frames without computing a shape depicted within the depth map frames;
the frame alignment engine comprises a parallel computing unit arranged to optimize an error metric applied to the identified respective points as part of the iterative process such that the error metric is applied to each of the identified respective points in parallel at the parallel computing unit.
10. A game system (332), the game system (332) comprising a mobile depth camera that uses infrared time-of-flight or structured light and a real-time tracker (316) for tracking the mobile depth camera as claimed in claim 9, the mobile depth camera and the real-time tracker being arranged to operate at at least 30 frames per second, the game system being arranged to influence the course of a game in relation to the tracking of the mobile depth camera.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/017,587 US8401242B2 (en) | 2011-01-31 | 2011-01-31 | Real-time camera tracking using depth maps |
| US13/017,587 | 2011-01-31 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1171281A1 HK1171281A1 (en) | 2013-03-22 |
| HK1171281B true HK1171281B (en) | 2014-12-12 |