
HK1140569A - Frame and pixel based matching of model-generated graphics images to camera frames - Google Patents


Info

Publication number
HK1140569A
HK1140569A (application HK10106736.4A)
Authority
HK
Hong Kong
Prior art keywords
frame
rendered image
camera
region
phase
Prior art date
Application number
HK10106736.4A
Other languages
Chinese (zh)
Inventor
Carlos Tapang
Original Assignee
Carlos Tapang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Carlos Tapang
Publication of HK1140569A publication Critical patent/HK1140569A/en


Description

Frame-and-pixel based matching of model-generated graphics images to camera frames
Technical Field
The present invention uses state-of-the-art computer graphics to advance the field of computer vision.
Graphics engines, particularly those used in real time, such as in first-person shooter games, have become very realistic. The basic concept of the invention is to use a graphics engine in image processing: the image frames generated by the real-time graphics engine are matched against those from the camera.
Background
There are two distinct tasks in vision or image processing. On the one hand there are the difficult tasks of image analysis and feature recognition, and on the other hand the less difficult task of calculating the 3D world position of the camera given the input images.
In biological vision, these two tasks are so interleaved that it is difficult to distinguish them. We perceive our position in world coordinates by identifying and triangulating from features around us. It seems that we cannot triangulate without first identifying the features from which we triangulate, and that we cannot really identify unless we can place a feature somewhere in the 3D world we live in.
Most, if not all, vision systems in the prior art attempt to accomplish both tasks in the same system. For example, the system of reference patent US5801970 comprises both tasks; that of reference patent US6704621 appears to comprise triangulation only, but it actually requires the identification of roads.
Disclosure of Invention
If the triangulation task could be performed separately and independently from the analysis and feature-recognition tasks, then a system that does not perform the latter tasks would need only half the computing resources. Taking advantage of current advances in graphics processing, the present invention allows triangulation of the camera position without typical scene analysis and feature recognition. The invention utilizes an a priori, correct model of the world within the field of vision. The latest graphics processing units are used to render the 3D model onto graphical surfaces. Each frame from the camera is then matched against several candidate renderings on those graphical surfaces. The number of rendered images to be compared is kept small by calculating the change in camera position and view angle from one frame to the next, and then using the result of that calculation to limit the possible next positions and view angles at which the a priori world model is rendered.
The main advantage of the present invention over the prior art is its mapping of the real world onto the world model. One of the most suitable applications of the invention is robotic programming. Robots that are guided by a priori maps, and that know their location within those maps, are far superior to robots that are not so guided: they excel at navigation, guidance, path finding, obstacle avoidance, aiming at points of interest, and other robotic tasks.
Drawings
FIG. 1 is a diagram of an embodiment of the present invention showing how camera activity in the real world is tracked in the world of a 3D model.
FIG. 2 is an example of a rendered surface or camera frame divided into regions.
FIG. 3 is a high level flow chart of the algorithm described below.
Detailed Description
A diagram of a preferred embodiment of the present invention is shown in FIG. 1. An a priori model of the world 100 is rendered into images 102, 103, and 104 using currently available advanced graphics processors 101. The model is a correct, though not necessarily complete, model of the real world 110. The invention aims to track the position and view angle of camera 109, which produces frames 107 and 108 at times t and t+1, respectively. Frames 107 and 108 serve as the primary real-time input to the device. Optical flow vectors are calculated from frames 107 and 108 using state-of-the-art methods. From these optical flow vectors, the current orientation and camera view can be derived, according to the prior art, in a way that is robust against noise and outliers. The next possible positions are then assumed to lie around a point on the line defined by the current orientation, at a distance from the current position determined by the current velocity (105). The N candidate positions are rendered into the N candidate images 102, 103, and 104 by the graphics processor or processors 101. Each rendered image is then compared with the current camera frame, and the best-matching image is selected (106). From the selected image, the most accurate camera position is obtained from among the candidate positions, and with it the instantaneous velocity, view angle, and angular velocity.
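The hypothesize-render-match loop just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: `render` and `score` are hypothetical callables standing in for the graphics engine 101 and the comparison step 106.

```python
import numpy as np

def track_step(frame_t1, candidate_poses, render, score):
    """One hypothesize-render-match iteration: render each candidate pose
    of the a priori model, compare each rendering with the camera frame,
    and return the index of the best-matching pose (selection step 106).
    `render` and `score` are placeholders for the graphics engine and the
    image-comparison metric (phase correlation or sum of squares)."""
    renderings = [render(pose) for pose in candidate_poses]  # images 102-104
    scores = [score(img, frame_t1) for img in renderings]
    return int(np.argmin(scores))
```

With a sum-of-squares `score`, the pose whose rendering is closest to the camera frame wins; the patent's point is that only a small number N of candidate poses ever needs to be rendered.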
Dynamic, frame-to-frame triangulation (tracking) is achieved in the present invention using the steps in the flow chart shown in FIG. 3. In the following step descriptions, for each video frame from the camera there is a hypothetical set of possible frames drawn by the graphics processor for comparison. In the present invention this comparison is the most computationally expensive operation. The video frames are equal to the rendered images in both vertical and horizontal resolution. As shown in FIG. 2, each frame and each rendered image is divided into a plurality of rectangular regions, which may overlap each other by a number of pixels.
1. Start with a frame from the camera. When the frame is obtained, the absolute world position P(t), the view angle V(t), the zero velocity u(t) = 0, and the zero angular velocity w(t) = 0 of the camera at the instant t are known. A discrete Fast Fourier Transform (FFT) is computed for every region 'a' of this frame, and the phase component of each transform is extracted, giving PFC(a, t) for region 'a' at time t.
2. Fetch the next frame. Compute all PFC(a, t+1), the phase components of the FFTs of the regions at time t+1.
3. Calculate the phase difference between PFC(a, t) and PFC(a, t+1), and perform an inverse FFT on the phase-difference matrix to obtain the phase correlation surface. If the camera neither moved nor panned from t to t+1, the phase correlation surface for each region shows a maximum at the center of region 'a'. If the camera moved or panned, the maximum occurs somewhere other than the center of the region. For each region, calculate the optical flow vector OP(a, t+1), defined as the offset from the center to the maximum point of the phase correlation surface. (If there are moving objects in the scene, each moving object causes an additional peak on the phase correlation surface; but as long as the two compared regions are dominated by static objects such as buildings, walls, or the ground, those other peaks are lower than the peak corresponding to the change in camera position and/or view angle.)
4. From all such OP(a, t+1), and using the absolute world position P(t), the view angle V(t), the current velocity u(t), and the current angular velocity w(t), calculate the range of all possible absolute camera positions (vectors Pi(t+1)) and view angles (unit vectors Vi(t+1)) at time t+1. The Pi can be chosen to lie on the line of instantaneous orientation, which can be readily determined from the OP(a, t+1), as detailed in chapter 17 of the reference "Robot Vision" by B. K. P. Horn, published by The MIT Press in 1986.
5. Assume a small number N of possible camera positions Pi(t+1) and view angles Vi(t+1), and render each using the a priori model. This results in N image renderings Mi(a, t+1). The FFT of each Mi(a, t+1) is computed and the phase component, PFMi(a, t+1), is extracted.
6. The best match Mi for the camera frame at time t+1 is the rendering for which the phase differences between PFMi(a, t+1) and PFC(a, t+1), inverse-FFT-transformed for all regions, yield 2D surfaces whose maxima lie closest to the center. From this best match, the best possible position P(t+1) and view angle V(t+1) are selected. The instantaneous velocity is then u(t+1) = P(t+1) - P(t), and the instantaneous angular velocity is w(t+1) = V(t+1) - V(t).
7. Discard the time-t calculations and frame by copying P(t+1) to P(t), V(t+1) to V(t), u(t+1) to u(t), w(t+1) to w(t), and PFC(a, t+1) to PFC(a, t), and make t+1 the current time. Return to step 2.
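The phase-correlation core of steps 1 to 3 can be sketched in NumPy as follows. This is a minimal sketch under one stated convention: in the unshifted FFT ordering used here, zero motion puts the peak at index (0, 0) rather than at the geometric center of the region (the patent's "center" corresponds to the fftshifted view), and offsets past the midpoint wrap around to negative shifts. The function names are illustrative, not from the patent.

```python
import numpy as np

def phase_component(region):
    """Step 1: pure phase component of the region's 2D FFT, i.e. PFC(a, t)."""
    f = np.fft.fft2(region)
    return f / np.maximum(np.abs(f), 1e-12)  # unit-magnitude spectrum

def optical_flow_vector(region_t, region_t1):
    """Steps 2-3: the inverse FFT of the phase difference is the phase
    correlation surface; the offset of its peak is the optical flow
    vector OP(a, t+1).  Peaks past the midpoint wrap to negative shifts."""
    corr = np.real(np.fft.ifft2(
        phase_component(region_t1) * np.conj(phase_component(region_t))))
    py, px = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    dy = py if py <= h // 2 else py - h  # unwrap vertical shift
    dx = px if px <= w // 2 else px - w  # unwrap horizontal shift
    return int(dx), int(dy)
```

For a region that is an exact circular shift of another, the peak is a clean delta and the recovered shift is exact; real camera regions overlap only partially and contain moving objects, which is why the patent divides each frame into many regions and relies on the dominant static-scene peak.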
Dynamic triangulation or tracking is possible as long as the field of view of the camera is dominated by static entities (entities that are static in world coordinates, with moving entities occupying a smaller area of the image). As long as the camera frame, and hence each region, is dominated by static entities, the peak on the phase correlation surface corresponds to the camera motion. This is well known in the art, as detailed in the article "Television Motion Measurement for DATV and other Applications" by G. A. Thomas, published in 1987 by the British Broadcasting Corporation (BBC).
Alternative embodiments
In an alternative embodiment of the invention, the computational expense of steps 5 and 6 is amortized over K frames, and the resulting correction is applied to future frames. For example, with a reference selected every five camera frames (K = 5), the first frame is the reference frame, and steps 5 and 6 are performed over the interval from the first frame sample to the fifth (t+1 to t+5). Meanwhile, for all samples, all other steps (steps 1 to 4 and 7) are performed using the uncorrected values of P and V. When the best match for the first frame is finally selected at the fifth frame, the error correction is applied. The same error correction could be applied to all five values of P and V, but only P(t+5) and V(t+5) need be corrected, since by t+5 all the earlier values of P and V have been discarded.
In another embodiment of the present invention, the computational expense of steps 5 and 6 is addressed by using multiple low cost game graphics processors, one for each hypothetical camera position.
In another embodiment of the present invention, instead of calculating the phase correlation surface between the camera frame and the rendered image in steps 5 and 6, the sum of the squares of the differences in luminance values may be calculated (known in the art as the "direct method"). The best match is the rendered image with the least sum of squares.
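A minimal sketch of this "direct method" comparison, assuming grayscale images held as NumPy arrays (the function names are illustrative):

```python
import numpy as np

def direct_method_score(rendered, frame):
    """Sum of squared luminance differences ("direct method"); lower is better."""
    diff = rendered.astype(np.float64) - frame.astype(np.float64)
    return float(np.sum(diff * diff))

def best_match(rendered_candidates, frame):
    """Return the index of the candidate rendering with the least sum of squares."""
    return int(np.argmin([direct_method_score(c, frame)
                          for c in rendered_candidates]))
```

Unlike the phase-correlation comparison, this avoids FFTs entirely, at the cost of being more sensitive to global luminance differences between the rendering and the camera image.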
What has been described above is a preferred embodiment of the present invention. However, the invention may also be embodied in specific forms other than the preferred embodiment described above. For example, instead of square or rectangular regions 'a', circular regions may be used.
An example application of the invention is tracking the position and view angle of a camera. However, those of ordinary skill in the art will understand and appreciate that the apparatus and methods of the present invention may be applied to any scenario in which determination of object positions, navigation, or guidance is necessary. The preferred embodiments are merely exemplary and should not be considered limiting in any way. The scope of the invention is given by the appended claims, rather than by the description above, and all variations and equivalents falling within the spirit of the claims are intended to be embraced therein.

Claims (23)

1. A method for tracking in real time the position and view angle (ego-motion) of a calibrated camera, comprising the steps of:
creating a prior model of the world in which the cameras exist;
acquiring each raw, unprocessed video frame from the camera;
for each video frame, assuming a small set of possible positions and view angles from which the video frame could have been acquired;
for each video frame, rendering an image using a graphics processor and vertex data from a prior model, one image for each assumed position and view angle;
for each video frame, selecting the best position and view angle by finding the rendered image that best matches the video frame.
2. The method of claim 1, wherein the prior model of the world is drawn using a low cost graphics processor of the kind implemented in the prior art and used in actual graphics computer games.
3. The method of claim 2, wherein the first video frame is from a known position and view.
4. The method of claim 3, wherein the video frame and the rendered image have the same resolution; and both are divided into rectangular or square regions that overlap by zero up to one hundred percent of their pixels.
5. The method of claim 4, wherein the count of the hypothetical set of positions and view angles is limited by calculating the maximum possible motion vector and view angle of the camera from two frames, one preceding the other;
the calculation comprises the following sub-steps:
calculating a fast fourier transform of each region of the current frame, each region being processed independently of the other regions;
obtaining the phase component of the obtained fast Fourier transform matrix and providing a pure phase component matrix; storing the phase component matrix in a memory;
acquiring a phase difference between each region of the current camera frame and a corresponding region of the previous camera frame using the phase component matrices from the current and previous frames;
calculating the inverse fast Fourier transform of the phase difference matrix to obtain a phase correlation surface;
determining the 2D position of the maximum of the phase correlation surface in each region; the 2D positions form a 2D optical flow vector for each area;
the most likely 3D motion vector and perspective of the camera are calculated from the optical flow vectors of all areas.
6. The method according to claim 5, wherein the calculation of the maximum possible 3D motion vector and view angle from the 2D optical flow vectors comprises the sub-steps of:
determining the orientation or direction of motion in the reference frame of the world, and then defining a line along which the most likely next positions are distributed;
determining a candidate next position along the orientation line using a previous velocity calculation;
selecting a plurality of most likely positions from points within a cube around the calculated candidate position;
using gradient descent to select the best next position among the points within the cube.
7. The method according to claim 5, wherein the method of selecting the best matching rendered image for each video frame comprises the sub-steps of:
calculating a fast fourier transform of each region in the rendered image;
obtaining a phase component of a fast fourier transform matrix for each region;
acquiring a phase difference between each region of the current camera frame and a corresponding region of the rendered image using the phase component matrix from the current frame and the phase component matrix from the rendered image; the phase differences form a phase correlation matrix;
calculating the inverse fast Fourier transform of a phase correlation matrix between the camera frame area and the drawn image area to obtain a phase correlation surface of each area;
the best matching rendered image is the rendered image with the minimum sum of squared optical flow vectors (dot products of each vector with itself), summed over all regions.
8. The method of claim 5, wherein the method of selecting the best matching rendered image for each video frame, known in the art as the "direct method", comprises the sub-steps of:
for each rendered image, obtaining a difference in gray level for each pixel between the rendered image and the video frame;
calculating a simple sum of squares of all said differences for each region;
the selected rendered image is the rendered image having the smallest sum of squared differences with the video frame.
9. The method of claim 5, wherein the prior model is built using currently available tools such as AutoCAD.
10. The method of claim 5, wherein the prior model is constructed by image processing of pre-acquired video frames from the world in which the cameras are present, using prior art methods.
11. The method of claim 5, wherein constructing the prior model in real-time occurs concurrently with, but separate from, motion estimation using prior art methods.
12. An apparatus for tracking the position and view angle (ego-motion) of a camera in real time, comprising:
a video camera and a frame buffer thereof, updating the content of the frame buffer at a fixed frame rate;
a digital processing device for calculating optical flow from one video frame to another and assuming a plurality of trial camera positions and viewing angles from such optical flow analysis;
a prior model of the world;
a graphics processor or a plurality of graphics processors capable of rendering the world model multiple times within the time slice in which the camera updates its frame buffer;
a plurality of graphical surfaces or image buffers storing rendered surfaces, each rendered surface corresponding to a trial position and perspective in the world model;
digital processing means for comparing each rendered image with the video frame buffer and then selecting the best matching rendered image, thereby also determining the most accurate instantaneous position and viewing angle of the camera.
13. The apparatus of claim 12, wherein the prior model for rendering the world using a low cost graphics processor has been implemented in the prior art and has been used in actual graphics computer games.
14. The device of claim 13, wherein initializing the device causes calculations to be started from known positions, perspectives, velocities, and angular velocities.
15. The apparatus of claim 14, wherein the video frame and the rendered image have the same resolution; and both are divided into rectangular or square regions that overlap by zero up to one hundred percent of their pixels.
16. The apparatus of claim 15, wherein the count of a hypothetical set of positions and perspectives is limited by computing a maximum possible motion vector and perspective of the camera from two frames, one frame before the other; the calculation comprises the following steps:
calculating a fast fourier transform of each region of the current frame, each region being processed independently of the other regions;
obtaining the phase component of the obtained fast Fourier transform matrix and providing a pure phase component matrix; storing the phase component matrix in a memory;
acquiring a phase difference between each region of the current camera frame and a corresponding region of the previous camera frame using the phase component matrices from the current and previous frames, the phase differences forming a phase correlation matrix;
calculating the inverse fast Fourier transform of the phase correlation matrix to obtain a phase correlation surface;
determining the 2D position of the maximum of the phase correlation surface in each region; the 2D positions form a 2D optical flow vector for each area;
the most likely 3D motion vector and perspective of the camera are calculated from the optical flow vectors of all areas.
17. The device of claim 16, configured for computing a maximum possible 3D motion vector and a view angle from optical flow vectors using the following computation:
determining the orientation or direction of motion in the reference frame of the world, and then defining a line along which the most likely next positions are distributed;
determining a candidate next position along the orientation line using a previous velocity calculation;
selecting a plurality of most likely positions from points within a cube around the calculated candidate position;
using gradient descent to select the best next position among the points within the cube.
18. The apparatus of claim 16, configured to select the best matching rendered image for each video frame, using the following calculation:
calculating a fast fourier transform of each region in the rendered image;
obtaining a phase component of a fast fourier transform matrix for each region;
acquiring a phase difference between each region of the current camera frame and a corresponding region of the rendered image using the phase component matrix from the current frame and the phase component matrix from the rendered image; the phase differences form a phase correlation matrix;
calculating the inverse fast Fourier transform of a phase correlation matrix between the camera frame area and the drawn image area to obtain a phase correlation surface of each area;
the best matching rendered image is the rendered image with the minimum sum of squared optical flow vectors (dot products of each vector with itself), summed over all regions.
19. The apparatus of claim 16, configured to select the best matching rendered image for each video frame using the following calculation, known in the art as the "direct method":
for each rendered image, obtaining a difference in gray level for each pixel between the rendered image and the video frame;
calculating a simple sum of squares of all said differences for each region;
the selected rendered image is the rendered image having the smallest sum of squared differences from the video frames.
20. The apparatus of claim 16, wherein the prior model is built using currently available tools such as AutoCAD.
21. The apparatus of claim 16, wherein the prior model is constructed by image processing of pre-acquired video frames from the world in which the cameras are present, using prior art methods.
22. The apparatus of claim 16, wherein the real-time construction of the prior model occurs simultaneously with, but separate from, the motion estimation using prior art methods.
23. A computer program product embodying the method of any one of claims 6, 7, 8, 9, 10 and 11.
HK10106736.4A 2005-09-12 2006-09-12 Frame and pixel based matching of model-generated graphics images to camera frames HK1140569A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US60/716,139 2005-09-12

Publications (1)

Publication Number Publication Date
HK1140569A true HK1140569A (en) 2010-10-15

