
CN116977408B - Visual SLAM method based on dynamic target tracking and feature point filtering - Google Patents


Info

Publication number: CN116977408B (application CN202310970050.1A)
Authority: CN (China)
Prior art keywords: frame, dynamic, image, points, feature points
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN116977408A
Inventors: 俞成浦, 刘一萱
Original and current assignee: Chongqing Innovation Center of Beijing University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by: Chongqing Innovation Center of Beijing University of Technology
Priority application: CN202310970050.1A


Classifications

    • G06T 7/70 — Determining position or orientation of objects or cameras
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Learning methods
    • G06T 7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/761 — Proximity, similarity or dissimilarity measures
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06V 2201/07 — Target detection
    • Y02T 10/40 — Engine management systems


Abstract

The invention discloses a visual SLAM method based on dynamic target tracking and feature point filtering, belonging to the field of visual SLAM. During pose estimation, potential dynamic objects are identified by target detection, the influence of false detections is reduced by target tracking, and the feature points of potential dynamic objects are filtered according to geometric and depth constraints, eliminating the influence of dynamic objects on positioning while retaining valid information. The method largely avoids the effect of false detections on positioning and optimizes the screening and utilization of dynamic information. By retaining partial information from previous frames, dynamic objects are associated across image frames, which aids inter-frame matching of object feature points and allows object states to be determined more robustly. Accurate, real-time pose estimation that ordinary visual SLAM cannot achieve can thus be obtained in indoor and outdoor highly dynamic scenes.

Description

Visual SLAM method based on dynamic target tracking and feature point filtering
Technical Field
The invention relates to the field of visual SLAM, in particular to a visual SLAM method based on dynamic target tracking and feature point filtering.
Background
There is a technical solution that recovers the position and attitude of the camera itself by performing feature extraction on the visual input and establishes a point cloud map; this is called visual SLAM.
This technical framework processes and optimizes the input video, so the pose-estimation result depends strongly on the input environmental conditions. Most current visual positioning research targets static environments and is difficult to apply to dynamic environments with changing scenes: under highly dynamic conditions, the features extracted between frames differ enormously and the positioning effect is poor, which limits applicability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a visual SLAM method based on dynamic target tracking and feature point filtering, which can obtain accurate and real-time pose estimation which cannot be achieved by common visual SLAM in indoor and outdoor high-dynamic scenes.
The invention aims at realizing the following scheme:
a visual SLAM method based on dynamic target tracking and feature point filtering, comprising the steps of:
when pose estimation is carried out, potential dynamic objects are identified using target detection, and the influence caused by false detections is then reduced using target tracking;
the feature points of the potential dynamic objects are filtered according to geometric and depth constraints, eliminating the influence of dynamic objects on positioning while retaining valid information, thereby realizing pose estimation that satisfies both real-time performance and accuracy.
Further, the pose estimation is performed in a high dynamic environment.
Further, the method uses target detection to identify potential dynamic objects, and then uses target tracking to reduce the influence caused by false detection, and specifically comprises the following steps:
the YOLO object detection algorithm is used to identify potential dynamic objects, and a standard Kalman filter with a constant-velocity motion model and a linear observation model is designed to predict the position and state of each detected object in the next frame. This comprises defining a state space x = (u, v, γ, h, u', v', γ', h')^T describing the object state and motion information in the image coordinate system, where (u, v) are the bounding-box center coordinates in image coordinates, γ the aspect ratio, h the height, and the primed terms their velocities; the bounding-box coordinates (u, v, γ, h) are taken as the direct observation of the object state. The intersection-over-union (IoU) of the current-frame bounding boxes and the predicted bounding boxes is then calculated to match bounding boxes between the two frames, update velocity information, and assign numbers.
Further, the method uses target detection to identify potential dynamic objects, and then uses target tracking to reduce the influence caused by false detection, and specifically comprises the following steps:
Inter-frame prediction is optimized using the following method: define (y_i, S_i) as the projection of the i-th track distribution into the measurement space, comprising the observation y_i and the covariance matrix S_i, and define d_j as the j-th bounding-box detection. Between the Kalman prediction and the observation, the formula

d(Mahalanobis)(i, j) = (d_j − y_i)^T S_i^{-1} (d_j − y_i)

is used for optimization, improving observation reliability;
in the formula above, the Mahalanobis distance d(Mahalanobis) computes the length of the residual vector and is used to judge the difference between the observed value and the predicted value; T denotes matrix transposition, and S_i^{-1} denotes the inverse matrix of S_i.
Further, the method uses target detection to identify potential dynamic objects, and then uses target tracking to reduce the influence caused by false detection, and specifically comprises the following steps:
Inter-frame prediction is also optimized using the following method: when an object is occluded, or camera shake prevents the Mahalanobis distance from functioning, cosine similarity is used to process the bounding-box information. The bounding-box content is input into a VGG16 network, and L2 normalization is performed after feature extraction to obtain a descriptor a_j with |a_j| = 1. The cosine similarity between the i-th track and the j-th detection result is then measured in appearance space:

s_{i,j} = a_i^T a_j

and the nearest-neighbor cosine distance, i.e. the best match, is found according to d(cosine)(i, j) = min(1 − s_{i,j}).
Further, the pose estimation is specifically based on the ORB_SLAM2 framework, and the method further comprises: when detection and tracking of the target are completed, feature points are extracted and processed synchronously. ORB_SLAM2 divides each layer of the image into a number of regions using an image pyramid and extracts and detects ORB feature points in each region; the scaling scale of each pyramid layer is

S_i = α^i (i = 0, 1, 2, ..., n − 1)

where n is the number of layers of the image pyramid, α is the scale factor of each pyramid layer, and S_i is the scaling scale of the i-th layer image;
if the total number of feature points to be extracted from the image is N, the number of feature points to be extracted from each pyramid layer is

N_i = N (1 − s) s^i / (1 − s^n)

where N_i is the number of feature points required by the i-th layer and s is the reciprocal of the scale factor;
each pyramid layer is divided into regions; defining R_i and C_i as the numbers of row and column divisions of the i-th layer image, the pre-allocated number per region is set as

N'_i = N_i / (R_i · C_i);

within each divided region an initialization threshold T_init for extracting feature points is set. If the number N_t of extracted feature points is smaller than the expected number N'_i, the extraction in that region is considered empty; the threshold is relaxed to T_min and extraction in the region is completed, so that the feature points are distributed uniformly. This counters the loss of tracking accuracy caused by the remaining feature points lying in distant or weakly textured regions after dynamic objects are removed.
Further, the pose estimation is specifically based on the ORB_SLAM2 framework; when detection and tracking of the target are completed, extraction and processing of feature points are performed synchronously, i.e. the field of view for feature-point extraction is adjusted and the range of extracted feature points is limited.
Further, the pose estimation method specifically comprises the following sub-steps:
let the set of all feature points on the i-th frame image be

P_i = {p_i^1, p_i^2, ..., p_i^m}

where p_i^m is the m-th feature point on the frame image, and define the set of static points of the current frame as S_i;
define the set of feature points in the reference frame as P_r; matching points are then found between P_r and S_i to calculate the relative pose;
a threshold is defined to filter improper matches: with the minimum Hamming distance among the matches denoted d_min, any match whose calculated Hamming distance exceeds twice d_min is considered a mismatch; at the same time, an empirical threshold T_emp is set to bound the value of d_min from below;
the velocity information of the reference frame is obtained and used to limit the search range of the feature points, improving matching robustness: the keypoint p_r of the reference frame is projected into the current frame, a region Δ of a set pixel size is placed around the projected feature point, and searching is carried out only among the current-frame feature points distributed in Δ, which reduces time complexity and the probability of mismatches;
after the matching points are obtained, a fundamental matrix F is used to describe the inter-frame pose relation, and a homography matrix H is used when the camera only rotates or the feature points are coplanar; the choice between the two is made according to the number of matching points satisfying the constraints. Specifically, the F matrix of the current frame and the reference frame is first calculated using the random sample consensus algorithm RANSAC and the normalized eight-point method, and the H matrix is calculated using RANSAC and the direct linear transformation method DLT; then the point-to-model distances and the inlier counts n_F and n_H are computed; finally, n_F and n_H are compared and the matrix corresponding to the larger count is selected for pose estimation.
Further, filtering the feature points of the potential dynamic object according to the geometric constraint and the depth constraint comprises the following substeps:
1) The fundamental matrix F of the two frames of images is calculated using the RANSAC algorithm;
2) For each feature point p_1 of the first frame image, the parameters of the straight line containing the projected epipolar line e_2p_2 are calculated using the fundamental matrix;
3) The position of p_2 does not lie exactly on the epipolar line, so the distance from the corresponding feature point p_2 to the epipolar line is calculated; if the distance is too large, p_2 is treated as an outlier.
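The three substeps above can be sketched in plain numpy. For brevity a bare normalized eight-point solve stands in for the full RANSAC loop, and the function names, point arrays, and threshold are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def normalize(pts):
    # Hartley normalization: centroid at origin, mean distance sqrt(2)
    c = pts.mean(axis=0)
    d = np.sqrt(((pts - c) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2) / d
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ ph.T).T, T

def eight_point_F(pts1, pts2):
    # normalized eight-point algorithm for the fundamental matrix (no RANSAC)
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)            # enforce rank 2
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    F = T2.T @ F @ T1                      # undo the normalization
    return F / np.linalg.norm(F)

def epipolar_distances(F, pts1, pts2):
    # distance from each p2 to its epipolar line l2 = F p1
    p1h = np.hstack([pts1, np.ones((len(pts1), 1))])
    p2h = np.hstack([pts2, np.ones((len(pts2), 1))])
    l = (F @ p1h.T).T
    return np.abs((l * p2h).sum(axis=1)) / np.hypot(l[:, 0], l[:, 1])
```

Points whose distance exceeds a chosen threshold would be marked as outliers (dynamic points) per substep 3).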
When RGB-D pictures are used as input, the method comprises the following substeps: the depth of map points is obtained by triangulation; the matched feature-point pairs of the reference frame and the current frame are transformed according to the estimated inter-frame pose matrix, giving the depth projected from the reference frame into the current frame; the difference from the ground-truth value is calculated, and a floating threshold th_depth is defined for depth screening. Finally, a program lock is designed to ensure that pose estimation is suspended at the moment the static-point map is updated and restarted after the update completes, guaranteeing more accurate positioning.
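A minimal sketch of the depth screening described above; the pose matrix, point arrays, and the th_depth value are illustrative assumptions:

```python
import numpy as np

def depth_outliers(pts_ref_3d, depth_cur, T_cr, th_depth=0.4):
    """Flag points whose projected depth disagrees with the measured depth.

    pts_ref_3d : (N, 3) points triangulated in the reference frame
    depth_cur  : (N,) depth measured by the RGB-D sensor in the current frame
    T_cr       : 4x4 pose transforming reference-frame coords to current frame
    th_depth   : floating depth threshold (value here is illustrative)
    """
    ph = np.hstack([pts_ref_3d, np.ones((len(pts_ref_3d), 1))])
    proj = (T_cr @ ph.T).T          # points expressed in the current frame
    z_proj = proj[:, 2]             # depth predicted by the pose estimate
    return np.abs(z_proj - depth_cur) > th_depth
```

Points flagged True would be removed from the static set before the next pose update.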
The beneficial effects of the invention include:
according to the embodiment of the invention, under the pose estimation framework based on ORB_SLAM2, the influence of false detection on positioning is avoided to a great extent by combining algorithms of potential dynamic target detection, priori semantic information, dynamic target tracking, epipolar geometry and depth constraint, and the screening and utilization of dynamic information are optimized. In addition, the retained partial information of the previous frame gives the correlation of the dynamic object between image frames. The technical scheme of the invention is not only beneficial to matching the characteristic points of the object between frames, but also can determine the state of the object more robustly. The accurate and real-time pose estimation which cannot be achieved by the common visual SLAM can be obtained in indoor and outdoor high-dynamic scenes.
The scheme of the embodiment of the invention reduces the influence of error detection on system optimization by utilizing the YOLO target detection and tracking prediction technology, reserves priori motion information and improves the positioning robustness.
The embodiment of the invention can be applied to indoor and outdoor places in various dynamic environments, and achieves the real-time requirement that the example segmentation is difficult to meet.
The embodiment of the invention provides a novel visual SLAM scheme based on semantic information, which does not reject a dynamic target as a whole, but can keep effective information as much as possible through a feature filtering mechanism.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a schematic diagram of a frame;
FIG. 2 is a comparison of YOLO conventional detection and potential dynamic target detection;
FIG. 3 is a schematic diagram of dynamic target prediction and tracking;
FIG. 4 is a diagrammatic representation of epipolar geometry;
FIG. 5 is a schematic diagram satisfying the epipolar geometry but belonging to a mismatch;
FIG. 6 is a false positive of an outdoor scenario using only geometric constraints;
FIG. 7 is a depth constraint correction misjudgment;
FIG. 8 shows that a conventional approach cannot operate normally in a highly dynamic indoor environment;
FIG. 9 is an outdoor scene application on KITTI;
Note that the portraits of the persons in figs. 2, 3 and 8 are all persons in public data sets, and that the use of colors in figs. 6, 7, 8 and 9 expresses and distinguishes content more clearly, which is useful for understanding the present invention.
Detailed Description
All features disclosed in the embodiments of this specification, and all steps of any method or process disclosed herein, may be combined, expanded, and substituted in any way, except for mutually exclusive features and/or steps.
The embodiment of the invention relates to visual SLAM (simultaneous localization and mapping) and pose-estimation technology for highly dynamic environments, and particularly provides a visual SLAM method based on dynamic target tracking and feature point filtering. In the invention, when pose estimation is carried out in a highly dynamic environment, potential dynamic objects are identified using YOLO target detection, and the influence of false detections is then reduced using a target-tracking algorithm, which addresses the problem that most existing dynamic SLAM schemes cannot run in real time. The feature points of potential dynamic objects are filtered according to geometric and depth constraints, eliminating the influence of dynamic objects on positioning while keeping as much valid information as possible, thereby realizing pose estimation that satisfies both real-time performance and accuracy in a highly dynamic environment.
In a further inventive concept, aiming at the inability of traditional visual SLAM algorithms to handle dynamic scenes and the shortcomings of existing algorithms, the scheme of the embodiment combines prior semantic information with epipolar-geometry and depth-constraint algorithms under the ORB-SLAM2 framework and optimizes the utilization of dynamic information. The system framework is shown in fig. 1: the invention works synchronously in two threads, with the conventional SLAM module below. In specific implementation, the technical scheme designs the following four parts:
1. Target detection and tracking
In the embodiment of the invention, YOLO detectors with excellent precision and real-time performance are used to detect objects with potential mobility, such as people and vehicles. The display effect of this detection stage on the public dataset TUM is shown in fig. 2, where the left image is ordinary object detection, comprising person, tv, keyboard and mouse, and the right image is detection of potential dynamic targets.
The embodiment of the invention designs a standard Kalman filter with a constant-velocity motion model and a linear observation model, used to predict the position and state of each detected object in the next frame. Specifically, a state space x = (u, v, γ, h, u', v', γ', h')^T is defined to describe the object state and motion information in the image coordinate system, comprising the bounding-box center coordinates (u, v), aspect ratio γ, height h, and their velocities; the bounding-box coordinates (u, v, γ, h) are taken as the direct observation of the object state.
Then the intersection-over-union (IoU) between the current-frame bounding boxes and the predicted bounding boxes is calculated, so that bounding boxes are matched between the two frames, velocity information is updated, and numbers are assigned.
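A SORT-style sketch of the constant-velocity Kalman filter and the IoU score described above; the noise covariances and the corner-box IoU convention are illustrative assumptions:

```python
import numpy as np

# Constant-velocity Kalman filter over the box state (u, v, gamma, h) and its
# velocities -- a sketch of the tracker design described in the text.
class BoxKalman:
    def __init__(self, box):
        self.x = np.zeros(8)           # state: u, v, gamma, h and velocities
        self.x[:4] = box
        self.P = np.eye(8) * 10.0      # state covariance (illustrative value)
        self.F = np.eye(8)             # constant-velocity transition
        self.F[:4, 4:] = np.eye(4)     # position += velocity each frame
        self.H = np.eye(4, 8)          # linear observation: the box only
        self.Q = np.eye(8) * 1e-2      # process noise (illustrative)
        self.R = np.eye(4) * 1e-1      # observation noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, box):
        y = box - self.H @ self.x                    # residual
        S = self.H @ self.P @ self.H.T + self.R      # residual covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P

def iou(a, b):
    # a, b: (x1, y1, x2, y2) corner boxes, used to match detections to predictions
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / area if area > 0 else 0.0
```

Matching then pairs each prediction with the detection of highest IoU, after which `update` refines the track state.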
The scheme of the embodiment uses two algorithms to optimize inter-frame prediction. In the first, (y_i, S_i) is defined as the projection of the i-th track distribution into the measurement space, comprising the observation y_i and the covariance matrix S_i, and d_j is defined as the j-th bounding-box detection result. Between the Kalman prediction and the observation, the formula

d(Mahalanobis)(i, j) = (d_j − y_i)^T S_i^{-1} (d_j − y_i)

is used for optimization, improving observation reliability. The Mahalanobis distance d(Mahalanobis) computes the length of the residual vector and is used to determine the difference between the observed value and the predicted value.
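The gate can be sketched directly; using the chi-square 95% quantile for 4 degrees of freedom as the gate value is an assumption borrowed from Deep SORT-style trackers, not stated in the text:

```python
import numpy as np

def mahalanobis_gate(y, S, detections, gate=9.4877):
    """Squared Mahalanobis distance between a track's predicted measurement
    (y, S) and each detection d_j; gate = chi-square 95% quantile, 4 dof
    (an assumption). Returns the distances and a pass/fail mask."""
    S_inv = np.linalg.inv(S)
    d = np.array([(dj - y) @ S_inv @ (dj - y) for dj in detections])
    return d, d <= gate
```

Detections failing the gate are excluded from association with that track.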
The second algorithm uses cosine similarity to process the bounding-box information when the object is occluded or camera shake makes the Mahalanobis distance unreliable. In this algorithm, the bounding-box content is input into the VGG16 network, and L2 normalization is performed after feature extraction to obtain a descriptor a_j where |a_j| = 1. Then, the cosine similarity between the i-th track and the j-th detection result is measured in appearance space:

s_{i,j} = a_i^T a_j

and the nearest-neighbor cosine distance, i.e. the best match, is found according to d(cosine)(i, j) = min(1 − s_{i,j}). The prediction situation is shown in fig. 3, where the left image is the current-frame detection (Detection: the detection box) and the right image is the next frame (Detection: the detection box, Prediction: the prediction box of the algorithm of the invention).
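With the descriptors already L2-normalized, the appearance matching reduces to a dot product. A sketch with illustrative descriptor contents:

```python
import numpy as np

def cosine_match(track_descs, det_descs):
    """track_descs: (T, D) L2-normalized appearance descriptors, one per track;
    det_descs: (N, D) normalized descriptors of the current detections (e.g.
    VGG16 features after L2 normalization). Returns the cosine-distance matrix
    d(i, j) = 1 - s_ij and the nearest detection per track."""
    s = track_descs @ det_descs.T       # cosine similarity, since |a| = 1
    d = 1.0 - s
    return d, d.argmin(axis=1)          # nearest-neighbour cosine distance
```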
2. Visual front-end processing
Upon completion of detection and tracking of the target, the SLAM system performs feature-point extraction and processing synchronously. Two design points are adopted in the method, yielding better selection and extraction of feature points and further improving the robustness of the algorithm.
1) ORB_SLAM2 divides each layer of the image into a number of regions using an image pyramid and extracts and detects ORB feature points in each region; the scaling scale of each pyramid layer is

S_i = α^i (i = 0, 1, 2, ..., n − 1)

where n is the number of layers of the image pyramid, α is the scale factor of each pyramid layer, and S_i is the scaling scale of the i-th layer image.
If the total number of feature points to be extracted from the image is N, the number of feature points to be extracted from each pyramid layer is

N_i = N (1 − s) s^i / (1 − s^n)

where N_i is the number of feature points required by the i-th layer and s is the reciprocal of the scale factor.
We divide each pyramid layer into regions, defining R_i and C_i as the numbers of row and column divisions of the i-th layer image; the pre-allocated number per region is set as

N'_i = N_i / (R_i · C_i).

In each divided region, an initialization threshold T_init for extracting feature points is set. If the number N_t of extracted feature points is smaller than the expected number N'_i, the extraction in that region is considered empty; the threshold is relaxed to T_min and the extraction in the region is completed, so that the feature points are distributed uniformly. This counters the loss of tracking accuracy caused by the remaining feature points lying in distant or weakly textured regions after dynamic objects are removed.
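The per-layer allocation formula can be checked numerically; the parameter values below (e.g. a scale factor of 1.2 and 8 levels, as in ORB_SLAM2's common defaults) are illustrative assumptions:

```python
import numpy as np

def features_per_level(N, scale_factor, n):
    """Geometric allocation of N features over n pyramid levels, with
    s = 1/scale_factor, following N_i = N(1 - s)s^i / (1 - s^n)."""
    s = 1.0 / scale_factor
    Ni = N * (1 - s) / (1 - s ** n) * s ** np.arange(n)
    return np.round(Ni).astype(int)
```

The counts decrease geometrically with the level index and sum to approximately N (up to rounding).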
2) The field of view for feature-point extraction is adjusted. As the camera moves past an object, the change of objects or scenery on either side becomes more pronounced as the field of view widens, and the pixel change at the edges is the most pronounced. The input image size of the KITTI dataset is 1241 × 376; such a wide viewing angle makes the system easily affected by large pixel variations at the edges when passing a vehicle. Therefore, when an image with an aspect ratio greater than 2:1 is processed, the range of extracted feature points is limited by excluding a band of 15% from each long-side edge.
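As a sketch, this restriction might be expressed as a region of interest; the function name and (x0, y0, x1, y1) tuple convention are assumptions:

```python
def feature_roi(width, height, margin=0.15):
    """For wide images (aspect ratio > 2:1), restrict feature extraction by
    discarding a margin (15% per the text) from each long-side edge.
    Returns an (x0, y0, x1, y1) region of interest."""
    if width / height > 2.0:
        x0 = int(width * margin)
        return (x0, 0, width - x0, height)
    return (0, 0, width, height)
```

For the 1241 × 376 KITTI images this strips roughly 186 pixels from each side; narrower images are left untouched.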
3. Estimating an inter-frame initial pose
This part handles the impact of dynamic objects on pose estimation, as dynamic objects may be mistaken for stable features in the environment, resulting in errors in positioning and mapping.
First, dynamic and static points are classified and the dynamic points are removed. The set of all feature points of the i-th frame image is

P_i = {p_i^1, p_i^2, ..., p_i^m}

where p_i^m is the m-th feature point on the frame image. The set of static points of the current frame is defined as S_i and is used for estimating the pose.
The specific method comprises defining the set of feature points in the reference frame as P_r, and then finding matching points between P_r and S_i to calculate the relative pose. This step exploits the idea, used in the original ORB-SLAM2, of measuring the similarity of two BRIEF descriptors by Hamming distance. But a brute-force search is involved, which easily introduces false matches. To solve this problem, the embodiment defines reasonable thresholds to filter improper matches: with the minimum Hamming distance among the matches denoted d_min, any match whose calculated Hamming distance is greater than twice d_min is considered a mismatch. When d_min is very small, this would filter out a large number of correct matches, so an empirical threshold T_emp is set to bound it.
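A sketch of this two-threshold filter; the tuple layout and the value of the empirical bound are assumptions:

```python
def filter_matches(matches, t_emp=30):
    """matches: list of (idx_ref, idx_cur, hamming_distance). Keep a match
    only if its distance is at most twice the minimum distance seen, with an
    empirical lower bound t_emp guarding against a near-zero minimum
    (the value 30 is illustrative)."""
    if not matches:
        return []
    d_min = min(m[2] for m in matches)
    thr = max(2 * d_min, t_emp)
    return [m for m in matches if m[2] <= thr]
```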
The search range of the feature points is also limited using the obtained velocity information of the reference frame (i.e. the pose transformation from the reference frame to the previous frame), improving matching robustness. The keypoint p_r of the reference frame is projected into the current frame, and a 9×5 or 7×7 pixel region Δ (adjusted according to the camera motion pattern) is set around the projected feature point. Searching only among the current-frame feature points distributed in Δ reduces the time complexity and the probability of mismatch.
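The windowed search can be sketched as follows; the half-sizes correspond to the 9×5 window mentioned in the text and are otherwise assumptions:

```python
import numpy as np

def search_in_window(p_proj, cur_keypoints, half_w=4, half_h=2):
    """Indices of current-frame keypoints lying inside a small window around
    the projected reference keypoint p_proj (a 9x5-pixel window, per the
    text; the half-sizes encode that assumption)."""
    kp = np.asarray(cur_keypoints, dtype=float)
    du = np.abs(kp[:, 0] - p_proj[0])
    dv = np.abs(kp[:, 1] - p_proj[1])
    return np.nonzero((du <= half_w) & (dv <= half_h))[0]
```

Only the returned candidates are compared by descriptor, which is what reduces both time complexity and mismatch probability.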
After the matching points are acquired through the scheme set out above, the fundamental matrix F can be used to describe the pose relation between frames; the homography matrix H is used when the camera only rotates or the feature points are coplanar. In this scheme, the choice between the two is made according to the number of matching points satisfying the constraints.
Specifically, the F matrix of the current frame and the reference frame is first calculated using the random sample consensus algorithm (RANSAC) and the normalized eight-point method, and the H matrix of the current frame and the reference frame is calculated using RANSAC and the direct linear transformation method (DLT); then the point-to-model distances and the inlier counts n_F and n_H are computed; finally, n_F and n_H are compared, and the matrix corresponding to the larger count (n_F for the F matrix, n_H for the H matrix) is selected for pose estimation. The principle of this step is further described below.
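Scoring the two candidate models by inlier count can be sketched as follows; the chi-square-style thresholds and helper names are illustrative assumptions, and the candidate F and H are taken as already estimated:

```python
import numpy as np

def count_inliers_H(H, pts1, pts2, th=5.99):
    # symmetric transfer error under homography H (threshold illustrative)
    def transfer(Hm, a, b):
        ah = np.hstack([a, np.ones((len(a), 1))])
        p = (Hm @ ah.T).T
        p = p[:, :2] / p[:, 2:]
        return ((p - b) ** 2).sum(axis=1)
    e = transfer(H, pts1, pts2) + transfer(np.linalg.inv(H), pts2, pts1)
    return int((e < 2 * th).sum())

def count_inliers_F(F, pts1, pts2, th=3.84):
    # squared epipolar point-line distance (threshold illustrative)
    p1h = np.hstack([pts1, np.ones((len(pts1), 1))])
    p2h = np.hstack([pts2, np.ones((len(pts2), 1))])
    l2 = (F @ p1h.T).T
    d2 = ((l2 * p2h).sum(axis=1) ** 2) / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
    return int((d2 < th).sum())

def choose_model(F, H, pts1, pts2):
    # pick whichever model explains more matches, as the text describes
    n_F = count_inliers_F(F, pts1, pts2)
    n_H = count_inliers_H(H, pts1, pts2)
    return ('F', n_F) if n_F > n_H else ('H', n_H)
```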
4. Filtering of feature points
This part of the algorithm filters the feature points contained in the bounding box where a potential dynamic target is located, separating the dynamic and static points inside the box. This reduces cases where a whole object is deleted because a small part of it moves, achieving the aim of eliminating dynamic feature points while keeping as much static information as possible.
In conventional SLAM algorithms, a sliding-window search is widely used for feature point matching; it is a method of finding feature-point matches in successive frames. A window of fixed size is selected in the current frame, containing some feature points and related information such as image coordinates and descriptors; the previous frame image is then searched for points matching the feature points in the window, the camera motion is computed from the matches, and the result is used to update the current pose estimate and the map. These steps are repeated until the sliding window has passed over the whole image.
However, when dynamic objects are present, their motion may differ from the camera motion, so window search can cause a large number of feature-point mismatches. When establishing matches, therefore, only brute-force matching can be used, and the threshold designed in the earlier part of this scheme mitigates some of the resulting defects.
We apply the epipolar constraint in the process of computing inliers; its principle is demonstrated in this section. The camera centers of the two frames of images are located at O 1 and O 2; e 1 and e 2 are the epipoles (the projections of the other camera center onto each image), and l 1, l 2 are the epipolar lines, i.e., the intersections of the epipolar plane O 1PO 2 with the two image planes. The specific position of the spatial point P cannot be known from the first frame image alone: every spatial point on the ray O 1p 1 is projected onto the pixel p 1, so P lies somewhere on the ray O 1p 1, and when projected into the second frame, p 2 must lie on the straight line e 2p 2. That is, a pixel on the image of the first frame is projected as a straight line on the image of the second frame. Writing the spatial coordinates of P as P = [x, y, z], the positions of the pixels p 1, p 2 satisfy:
s1p1=KP
s2p2=K(RP+t)
K is the camera intrinsic matrix, and R, t are the camera rotation matrix and translation vector. Eliminating the scale factors s 1, s 2 and the spatial point P, the epipolar constraint follows:
p 2^T K^{-T} t^∧ R K^{-1} p 1 = 0
wherein t^∧ denotes the skew-symmetric matrix of t.
In a dynamic scene, the epipolar constraint is then used to detect outliers so that points on moving objects can be culled; the flow is:
1) Calculating the fundamental matrix F of the two frames of images by using the RANSAC algorithm;
2) For each feature point p 1 of the first frame image, calculating, by using the fundamental matrix, the parameters of the straight line on which the projected epipolar line e 2p 2 lies;
3) Because of errors in the actual process, the position of p 2 does not lie exactly on the epipolar line; the distance from the corresponding feature point p 2 to the epipolar line is therefore calculated, and if the distance is too large, the point is treated as an outlier.
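The flow above can be illustrated with a minimal numeric sketch. For clarity, F is built directly from a known pose via F = K^{-T} t^∧ R K^{-1} rather than estimated by RANSAC, and the function names are our own; a static point satisfies the constraint while a point displaced off its epipolar line is flagged as an outlier:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix t^ such that skew(t) @ x == cross(t, x)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_pose(K, R, t):
    """F = K^-T [t]x R K^-1, relating pixels of frame 1 and frame 2."""
    Kinv = np.linalg.inv(K)
    return Kinv.T @ skew(t) @ R @ Kinv

def epipolar_outliers(F, pts1, pts2, th=1.0):
    """Flag matches whose point-to-epipolar-line distance exceeds th pixels."""
    flags = []
    for p1, p2 in zip(pts1, pts2):
        a, b, c = F @ np.append(p1, 1.0)          # epipolar line in frame 2
        d = abs(a * p2[0] + b * p2[1] + c) / np.hypot(a, b)
        flags.append(d > th)
    return flags
```

With a pure x-translation, the epipolar lines are horizontal, so a vertical displacement of p 2 is detected while a displacement along the line would not be, which is exactly the limitation discussed next.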
Fig. 4 is a diagrammatical representation of the epipolar geometry. The epipolar constraint also has a limitation: when an object moves along the epipolar line, that is, when the motion lies exactly in the epipolar plane (when P moves to the P' or P'' position, as shown in fig. 5), the distance from p 2 to the epipolar line is always 0, which cannot reflect the true error. This disadvantage is especially remarkable when detecting cars driving on an expressway; the same problem arose during dataset testing, see fig. 6: with only the epipolar constraint, the car in the yellow box in the middle of the road ahead is misjudged as static and participates in pose estimation, causing errors.
Using the property that RGBD and Stereo data contain, or allow computing, depth, embodiments of the present invention incorporate depth-information filtering. With Stereo input, the depth of the current frame's map points can be obtained easily; with RGBD pictures as input, the depth Ground Truth is known from the dataset, and the depth of the map points is obtained through triangulation. For the feature point pairs matched between the reference frame and the current frame, the depth projected from the reference frame into the current frame is obtained by applying the frame pose matrix estimated in Part 2; the difference between this depth and the Ground Truth is then calculated. After this, a floatable threshold th depth is specified for depth screening.
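A minimal sketch of this depth check, assuming a 4x4 homogeneous pose matrix `T_cr` from the reference frame to the current frame; the helper names and the example threshold value are illustrative assumptions, not values from the patent:

```python
import numpy as np

def projected_depth(T_cr, X_r):
    """Depth of a reference-frame map point X_r after transforming it into the
    current frame with the 4x4 pose matrix T_cr (reference -> current)."""
    X_c = T_cr @ np.append(X_r, 1.0)
    return X_c[2]

def depth_consistent(T_cr, X_r, depth_meas, th_depth=0.4):
    """Keep a match only if the projected depth agrees with the measured
    (ground-truth or triangulated) depth within th_depth."""
    return abs(projected_depth(T_cr, X_r) - depth_meas) < th_depth
```

Points failing this check are treated as dynamic even when they happen to lie near their epipolar line.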
Through the above algorithm, the epipolar geometry constraint and the depth constraint are designed and combined, the feature point screening is completed, and part of the dynamic and static points are separated. A program lock is designed so that pose estimation is suspended at the moment the static point map is updated and restarted after the update is finished, ensuring more accurate localization. In the dataset test, the effect of adding the depth constraint is shown in fig. 7: the misjudgment made when only the epipolar constraint is set is corrected, proving the effectiveness of the algorithm. In addition, the calculated information is reused in back-end optimization, so no redundant computational load is introduced.
Under the ORB_SLAM2-based pose estimation framework, the combination of potential dynamic target detection, prior semantic information, dynamic target tracking, epipolar geometry and depth constraints largely avoids the influence of false detections on localization and optimizes the screening and use of dynamic information. In addition, the retained partial information of the previous frame provides the correlation of dynamic objects between image frames, which not only helps to match an object's feature points between frames, but also determines the object's state more robustly. Accurate, real-time pose estimation unattainable by ordinary visual SLAM can thus be obtained in indoor and outdoor highly dynamic scenes.
The invention is used for pose estimation in indoor highly dynamic environments. An indoor scene contains limited environmental information and covers a small area, and the most common highly dynamic targets are moving people. A period in which a person passes in front of the camera often fatally disturbs pose tracking. Handling dynamic targets with highly real-time object detection effectively resists this situation.
To obtain the most accurate comparison of test accuracy, this example uses the TUM dataset, which contains relatively accurate Ground Truth and is widely used in the SLAM testing field.
FIG. 8 shows the conventional visual scheme ORB_SLAM2 in the application scene: the left plot compares the ORB_SLAM2 trajectory with the ground-truth trajectory (XYZ axes in meters), and the right image is a screenshot of the run. In such a highly dynamic indoor environment, the localization result has huge errors and the conventional scheme cannot be used normally, especially when applied to unmanned-robot navigation and planning.
In this embodiment, the moving person is detected and tracked, then the feature points belonging to dynamic boxes are filtered by geometry and depth, and quantifiable error measures such as the root mean square error (RMSE) and standard deviation (STD) of the absolute trajectory error (ATE) between the estimated motion trajectory and Ground Truth are calculated. The actual pose estimates in this environment are obtained and tabulated in the table shown in fig. 9. In the first column, xyz, static, rpy and halfsphere denote four types of camera self-motion, e.g., xyz means the camera moves along the x-y-z axes; W and S indicate that the dynamic subject is walking or sitting. RMSE and STD characterize the camera motion trajectories well.
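The RMSE and STD of the ATE mentioned here can be computed as below, assuming the two trajectories are already time-associated and aligned (the alignment step itself is omitted):

```python
import numpy as np

def ate_stats(est, gt):
    """RMSE and STD of the absolute trajectory error between an estimated
    trajectory and ground truth (both N x 3 position arrays, already aligned)."""
    err = np.linalg.norm(est - gt, axis=1)      # per-pose translation error
    rmse = float(np.sqrt(np.mean(err ** 2)))
    std = float(np.std(err))
    return rmse, std
```

A constant offset between the trajectories, for example, yields an RMSE equal to that offset and an STD of zero.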
The pose estimation results of this embodiment of the invention are markedly improved, generally by more than 80%, showing robustness and excellent practical applicability.
Another aspect of the invention is pose estimation in outdoor highly dynamic environments. Outdoor scenes contain much useful information; relatively many kinds of dynamic objects may appear, such as various vehicles and pedestrians, and these objects occupy a relatively small proportion of the whole image.
Likewise, we tested outdoor performance using the KITTI dataset, the most widely used and highly accurate dataset in autonomous driving. It is recorded mainly by cameras mounted on vehicles driving through villages, city streets, urban highways, etc. The example selects the segments that contain more dynamic objects with dense motion, meeting the verification requirements.
The scene is shown in fig. 9; the test results of this embodiment of the invention are given in Table 1 below, with the descriptors RMSE and STD of the absolute trajectory error ATE improved in the KITTI outdoor scenes.
TABLE 1
When errors are considered over the entire dataset, the absolute-trajectory RMSE is comparable outside the periods containing more dynamic objects, since during those times no dynamic points occur that affect pose estimation. Over a complete sequence, the scene is low-dynamic or even static for long stretches, where the proposed algorithm hardly differs from ORB_SLAM2; the improvement is therefore smaller once the error is averaged, whereas a significant improvement can be seen in dynamically dense segments.
In addition, existing excellent dynamic SLAM algorithms such as DynaSLAM and DS-SLAM take at least 200-350 ms to process each frame and thus cannot meet real-time requirements in practical use. By contrast, after fast YOLO detection, this scheme uses feature filtering and reaches a real-time operating rate of at least 15 frames per second.
Quantitative tests in indoor and outdoor scenes show that, compared with the conventional ORB_SLAM2, the pose estimation scheme based on dynamic target tracking and feature point filtering provided by this embodiment obtains accurate and reliable estimates in highly dynamic indoor and outdoor environments with good real-time performance. Running the algorithm on unmanned equipment satisfies the requirements of safe, real-time and accurate positioning, and has excellent practical significance.
It should be noted that, within the scope of protection defined in the claims of the present invention, the following embodiments may be combined and/or expanded, and replaced in any manner that is logical from the above specific embodiments, such as the disclosed technical principles, the disclosed technical features or the implicitly disclosed technical features, etc.
Example 1
A visual SLAM method based on dynamic target tracking and feature point filtering, comprising the steps of:
when pose estimation is carried out, potential dynamic objects are identified by using target detection, and then the influence caused by false detection is reduced by using target tracking;
And filtering the feature points of the potential dynamic object according to the geometric constraint and the depth constraint, eliminating the influence of the dynamic object on positioning while retaining effective information, thereby realizing pose estimation that satisfies both real-time performance and accuracy.
Example 2
On the basis of the embodiment 1, the pose estimation is performed in a high dynamic environment.
Example 3
Based on embodiment 1, the method for identifying potential dynamic objects by using target detection and then reducing the influence caused by false detection by using target tracking specifically comprises the following steps:
using the YOLO object detection algorithm to identify potential dynamic objects, and designing a standard Kalman filter with a constant-velocity model and a linear observation model for predicting the position and state of the detected object in the next frame, comprising defining a state space describing the state of the object and its motion information in the image coordinate system, wherein the motion information comprises the bounding-box center coordinates (u, v), aspect ratio (γ), height (h) and their velocities in the image coordinates; the bounding-box coordinates (u, v, γ, h) are taken as direct observations of the object state, and the intersection over union (IOU) between the bounding boxes of the current frame and the predicted bounding boxes is calculated to match the bounding boxes between the two frames, update the velocity information and assign numbers.
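The IOU-based association step can be sketched as follows. The greedy matching strategy and the threshold value are illustrative assumptions, since the patent only states that boxes are matched by IOU:

```python
def iou(boxA, boxB):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    return inter / (areaA + areaB - inter)

def match_by_iou(pred_boxes, det_boxes, th=0.3):
    """Greedily match Kalman-predicted boxes to current detections by IOU."""
    pairs = []
    used = set()
    for i, pb in enumerate(pred_boxes):
        best_j, best = -1, th
        for j, db in enumerate(det_boxes):
            if j not in used and iou(pb, db) > best:
                best_j, best = j, iou(pb, db)
        if best_j >= 0:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs
```

Matched pairs keep the track ID of the prediction; unmatched detections start new tracks.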
Example 4
Based on embodiment 1, the method for identifying potential dynamic objects by using target detection and then reducing the influence caused by false detection by using target tracking specifically comprises the following steps:
Optimization of inter-frame prediction is performed using a method in which (y i, S i) is defined as the projection of the i-th tracking distribution into the measurement space, comprising the observation y i and the covariance matrix S i, and d j is defined as the j-th bounding-box detection; the following formula is used between the Kalman prediction and the observation:
d(i, j) = (d j − y i)^T S i^{−1} (d j − y i)
optimizing and improving the observation reliability;
In the above formula, the Mahalanobis distance d (Mahalanobis) is used for calculating the distance of the residual vector and for judging the difference between the observed value and the predicted value; T is the matrix transposition, and S i^{−1} denotes the inverse matrix of S i.
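The gating formula above reduces to a few lines; `mahalanobis_sq` is an illustrative name:

```python
import numpy as np

def mahalanobis_sq(d_j, y_i, S_i):
    """Squared Mahalanobis distance between a detection d_j and the i-th
    track's projected measurement (y_i, S_i):
    d(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)."""
    r = np.asarray(d_j) - np.asarray(y_i)
    return float(r @ np.linalg.inv(S_i) @ r)
```

A large covariance S i (an uncertain prediction) shrinks the distance, so uncertain tracks tolerate larger residuals before an association is rejected.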
Example 5
Based on embodiment 1, the method for identifying potential dynamic objects by using target detection and then reducing the influence caused by false detection by using target tracking specifically comprises the following steps:
Inter-frame prediction is optimized using a method in which, when an object is occluded or camera shake prevents the Mahalanobis distance from functioning, cosine similarity is used to process the bounding-box information: the bounding-box contents are input into a VGG16 network, L2 normalization is performed after feature extraction to obtain a j, where |a j| = 1, and the cosine similarity between the i-th track and the j-th detection result is then measured in the appearance space: s i,j = a i^T a j.
Then, according to d (cosine)(i,j) = min(1 − s i,j), the nearest-neighbor cosine distance, i.e., the best match, is calculated.
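A sketch of this appearance-based distance, assuming the VGG16 features have already been extracted and L2-normalized (the network itself is omitted, and the function name is ours):

```python
import numpy as np

def cosine_distance(track_feats, det_feat):
    """Minimum cosine distance d = min(1 - a_i . a_j) between a track's stored
    L2-normalised appearance features and a detection's feature."""
    a_j = det_feat / np.linalg.norm(det_feat)   # L2-normalise the detection
    sims = [f @ a_j for f in track_feats]       # features f assumed unit-norm
    return 1.0 - max(sims)
```

Because the features are unit vectors, the dot product is exactly the cosine similarity, and a distance near 0 means the detection looks like some stored appearance of the track.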
Example 6
Based on embodiments 1-5, the pose estimation is specifically based on the ORB_SLAM2 framework, and further includes synchronously extracting and processing feature points while target detection and tracking are completed. That is, ORB_SLAM2 uses an image pyramid to divide each layer of the image into a plurality of regions, and ORB feature point extraction and detection are performed in each region, where the scaling scale of each layer of the image pyramid is expressed as:
S i = α^i (i = 0, 1, 2, ..., n-1)
Wherein n is the number of layers of the image pyramid, α is the scale factor of each pyramid layer, and S i is the scaling scale of the i-th layer image;
If the total number of feature points to be extracted from the image is N, the number of feature points to be extracted from each layer of the pyramid image is as follows:
Wherein N i is the number of feature points required by the i-th layer, and s is the reciprocal of the scale factor;
Dividing each layer of the pyramid image into regions, defining the numbers of row and column divisions of the i-th layer image, and setting the expected number of extractions per region as:
If the number N t of extracted feature points in a region is smaller than the expected number N i, the feature points extracted in that region at this time are considered insufficient, and the threshold T min is adjusted until the extraction of feature points in the region is completed; this keeps the feature points uniformly distributed and mitigates the reduced tracking accuracy caused by feature points lying only in distant or weakly textured regions after dynamic objects are removed.
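The per-layer allocation formula itself is elided in the extracted text; the standard ORB_SLAM2-style geometric allocation, consistent with the variables defined above (total count N, n layers, scale factor α, s = 1/α), is sketched below as one plausible reconstruction rather than the patent's exact formula:

```python
def features_per_level(N, n_levels=8, alpha=1.2):
    """Distribute N features over pyramid levels so that the count shrinks
    geometrically with scale: N_i = N * (1 - s) / (1 - s**n) * s**i, s = 1/alpha.
    (ORB_SLAM2-style allocation; parameter defaults are illustrative.)"""
    s = 1.0 / alpha
    base = N * (1 - s) / (1 - s ** n_levels)
    counts = [round(base * s ** i) for i in range(n_levels)]
    counts[0] += N - sum(counts)   # absorb rounding remainder in level 0
    return counts
```

The counts sum exactly to N, with the finest level receiving the most feature points.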
Example 7
On the basis of embodiment 1, the pose estimation is specifically based on the ORB_SLAM2 framework; when target detection and tracking are completed, feature point extraction and processing are also performed synchronously, i.e., the field of view for feature point extraction is adjusted and the range of the extracted feature points is limited.
Example 8
Based on embodiment 1, the pose estimation specifically includes the following sub-steps:
Let the set of all feature points of the i-th frame image be:
Wherein the element shown is the m-th feature point on the frame image, and the set of static points of the current frame is defined as
Defining the set of feature points in the reference frame, then finding matching points between the two sets to calculate the relative pose;
Defining a threshold to filter improper matches: the minimum Hamming distance among the matches is taken as a reference; for any match, if its calculated Hamming distance is greater than the set value, the match is considered a mismatch; at the same time, an empirical threshold is set to bound this reference value;
The velocity information of the reference frame is obtained and used to limit the search range of the feature points, improving matching robustness: the key point p r of the reference frame is projected into the current frame, a region delta of a set pixel size is placed around the projected feature point, and the search is performed only among the current-frame feature points distributed within delta, which reduces the time complexity and the probability of mismatches;
After the matching points are obtained, the fundamental matrix F is used to describe the pose relationship between frames, and the homography matrix H is used when the camera only rotates or the feature points are coplanar; one of the two models is selected according to the number of matching points satisfying the constraint conditions. Specifically, the F matrix of the current frame and the reference frame is first calculated using the random sample consensus algorithm RANSAC and the normalized eight-point method, and the H matrix using RANSAC and the direct linear transformation method DLT; then the point-to-model distances and the inlier counts n F and n H are calculated; finally, n F and n H are compared and the matrix corresponding to the larger count is selected for pose estimation.
Example 9
On the basis of embodiment 1, the filtering of feature points of the potential dynamic object according to geometric constraints and depth constraints comprises the following substeps:
1) Calculating the fundamental matrix F of the two frames of images by using the RANSAC algorithm;
2) For each feature point p 1 of the first frame image, calculating, by using the fundamental matrix, the parameters of the straight line on which the projected epipolar line e 2p 2 lies;
3) Because the position of p 2 does not lie exactly on the epipolar line, calculating the distance from the corresponding feature point p 2 to the epipolar line; if the distance is too large, the point is treated as an outlier.
Example 10
On the basis of embodiment 1, the filtering of feature points of potential dynamic objects according to geometric constraints and depth constraints comprises the following sub-steps: when RGBD pictures are used as input, the depth of map points is obtained through triangulation; for the feature point pairs matched between the reference frame and the current frame, the estimated frame pose matrix is used to obtain the depth projected from the reference frame into the current frame, and the difference from the ground truth is calculated; a floating threshold th depth is then defined for depth screening; finally, a program lock is designed to suspend pose estimation at the moment the static point map is updated and restart it after the update is finished, ensuring more accurate positioning.
The units involved in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
According to an aspect of embodiments of the present invention, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternative implementations described above.
As another aspect, the embodiment of the present invention further provides a computer readable medium, which may be included in the electronic device described in the above embodiment, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
In addition to the foregoing examples, those skilled in the art will recognize from the foregoing disclosure that other embodiments can be made and in which various features of the embodiments can be interchanged or substituted, and that such modifications and changes can be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A visual SLAM method based on dynamic target tracking and feature point filtering, comprising the steps of:
when pose estimation is carried out, potential dynamic objects are identified by using target detection, and then the influence caused by false detection is reduced by using target tracking;
filtering feature points of potential dynamic objects according to geometric constraints and depth constraints, eliminating the influence of the dynamic objects on positioning while maintaining effective information, and realizing pose estimation which simultaneously meets real-time performance and accuracy;
The pose estimation is specifically based on an ORB_SLAM2 framework, and further comprises the steps of synchronously extracting and processing characteristic points when the detection and tracking of the target are completed, namely, the ORB_SLAM2 divides each layer of image into a plurality of areas by utilizing an image pyramid, and ORB characteristic points are extracted and detected in each area, wherein the scaling scale expression of each layer of the image pyramid is as follows:
Wherein: n is the number of layers of the image pyramid; α is the scale factor of each pyramid layer; S i is the scaling scale of the i-th layer image;
If the total number of feature points to be extracted from the image is N, the number of feature points to be extracted for each layer of the pyramid image is as follows:
Wherein: N i is the number of feature points required by the i-th layer; s is the reciprocal of the scale factor;
Dividing each layer of the pyramid image into regions, defining the numbers of row and column divisions of the i-th layer image, and setting the expected number of extractions per region as:
In the divided regions, an initialization threshold for extracting feature points is set, and it is judged whether the number N t of extracted feature points is less than the expected number N i; if so, the feature points extracted in the region at this time are considered insufficient, and the threshold T min is adjusted until the extraction of feature points in the region is completed, ensuring that the feature points are uniformly distributed and mitigating the reduced tracking accuracy caused by feature points lying in distant or weakly textured regions after dynamic objects are removed;
the filtering of the feature points of the potential dynamic object according to the geometric constraint and the depth constraint comprises the following substeps:
1) Calculating the fundamental matrix F of the two frames of images by using the RANSAC algorithm;
2) For each feature point p 1 of the first frame image, calculating, by using the fundamental matrix, the parameters of the straight line on which the projected epipolar line lies;
3) Because the position of p 2 does not lie exactly on the epipolar line, calculating the distance from the corresponding feature point p 2 to the epipolar line; if the distance is too large, the point is an outlier;
The filtering of feature points of potential dynamic objects according to geometric constraints and depth constraints further comprises the following sub-steps: when an RGBD picture is used as input, obtaining the depth of map points through triangulation; converting the feature point pairs matched between the reference frame and the current frame according to the estimated frame pose matrix to obtain the depth projected from the reference frame into the current frame; calculating the difference from the ground truth; defining a floating threshold for depth screening; and finally designing a program lock to suspend pose estimation at the moment the static point map is updated and restart it after the update is finished, ensuring more accurate positioning.
2. The visual SLAM method of claim 1, wherein the pose estimation is performed in a highly dynamic environment.
3. The visual SLAM method of claim 1, wherein the potential dynamic object is identified using object detection and then the effect of false detection is reduced using object tracking, comprising the steps of:
using the YOLO object detection algorithm to identify potential dynamic objects, and designing a standard Kalman filter with a constant-velocity model and a linear observation model for predicting the position and state of the detected object in the next frame, comprising defining a state space describing the state of the object and its motion information in the image coordinate system, wherein the motion information comprises the bounding-box center coordinates, aspect ratio, height and their velocities in the image coordinates; taking the bounding-box coordinates as direct observations of the object state; and then calculating the intersection over union IOU between the bounding boxes of the current frame and the predicted bounding boxes, so as to match the bounding boxes between the two frames, update the velocity information and assign numbers.
4. The visual SLAM method of claim 1, wherein the potential dynamic object is identified using object detection and then the effect of false detection is reduced using object tracking, comprising the steps of:
optimization of inter-frame prediction is performed using a method in which (y i, S i) is defined as the projection of the i-th tracking distribution into the measurement space, comprising the observation y i and the covariance matrix S i, and d j is defined as the j-th bounding-box detection result; the following formula is used between the Kalman predicted value and the observed value:
d(i, j) = (d j − y i)^T S i^{−1} (d j − y i)
optimizing and improving the observation reliability;
In the above formula, the Mahalanobis distance d is used for calculating the distance of the residual vector and for determining the difference between the observed value and the predicted value; T is the matrix transposition, and S i^{−1} represents the inverse matrix of S i.
5. The visual SLAM method of claim 1, wherein the potential dynamic object is identified using object detection and then the effect of false detection is reduced using object tracking, comprising the steps of:
The inter-frame prediction is optimized by using cosine similarity to process the bounding-box information when the object is occluded or the Mahalanobis distance cannot function due to camera shake, namely inputting the bounding-box information into a VGG16 network, performing L2 normalization after feature extraction to obtain a descriptor a j, wherein |a j| = 1, and then measuring the cosine similarity between the i-th track and the j-th detection result in the appearance space: s i,j = a i^T a j;
Then, according to d (cosine)(i,j) = min(1 − s i,j), calculating and finding the nearest-neighbor cosine distance, i.e., the best match.
6. The visual SLAM method based on dynamic object tracking and feature point filtering according to claim 1, wherein the pose estimation is specifically based on an orb_slam2 frame, and when the object detection and tracking are completed, feature point extraction and processing are also performed synchronously, i.e. the view of feature point extraction is adjusted, and the range of the extracted feature points is limited.
7. The visual SLAM method based on dynamic target tracking and feature point filtering of claim 1, wherein the pose estimation comprises the following steps:
Let the set of all feature points of the i-th frame image be:
Wherein the element shown is the m-th feature point on the frame image, and the set of static points of the current frame is defined as;
Defining the set of feature points in the reference frame, then finding matching points between the two sets to calculate the relative pose;
Defining a threshold to filter improper matches: the minimum Hamming distance among the matches is taken as a reference; for any match, if its calculated Hamming distance is greater than the set value, the match is considered a mismatch; at the same time, an empirical threshold is set to bound this reference value;
Limiting the search range of the feature points by using the obtained velocity information of the reference frame to improve matching robustness: the key point of the reference frame is projected into the current frame, a region delta of a set pixel size is set around the projected feature point, and the search is then performed among the current-frame feature points distributed within delta, which reduces the time complexity and the probability of mismatches;
After the matching points are obtained, the fundamental matrix F is used to describe the pose relationship between frames, and the homography matrix H is used in the case of pure camera rotation or coplanar feature points; one of the two models is selected according to the number of matching points satisfying the constraint conditions. Specifically, the F matrix of the current frame and the reference frame is first calculated using the random sample consensus algorithm RANSAC and the normalized eight-point method, and the H matrix using RANSAC and the direct linear transformation method DLT; then the point-to-model distances and the inlier counts are calculated; finally, the two inlier counts are compared and the matrix corresponding to the larger count is selected for pose estimation.
CN202310970050.1A 2023-08-02 2023-08-02 Visual SLAM method based on dynamic target tracking and feature point filtering Active CN116977408B (en)

Publications (2)

Publication Number Publication Date
CN116977408A CN116977408A (en) 2023-10-31
CN116977408B true CN116977408B (en) 2025-09-12

Family

ID=88472916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310970050.1A Active CN116977408B (en) 2023-08-02 2023-08-02 Visual SLAM method based on dynamic target tracking and feature point filtering

Country Status (1)

Country Link
CN (1) CN116977408B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120635146A (en) * 2025-05-30 2025-09-12 中北大学 A dynamic scene SLAM method focusing on fine targets

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132893B (en) * 2020-08-31 2024-01-09 同济人工智能研究院(苏州)有限公司 Visual SLAM method suitable for indoor dynamic environment
CN114677323A (en) * 2021-12-31 2022-06-28 北京工业大学 A Semantic Visual SLAM Localization Method Based on Object Detection in Indoor Dynamic Scenes
CN116067374A (en) * 2023-01-05 2023-05-05 南京航空航天大学 Dynamic Scene SLAM Positioning Method Based on Target Detection Algorithm YOLOv4 and Geometric Constraints

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DOTF-SLAM: Real-Time Dynamic SLAM Using Dynamic Object Tracking and Key-Point Filtering; Liu Yixuan et al.; 2023 IEEE International Conference on Unmanned Systems (ICUS); 2023-11-21; full text *

Also Published As

Publication number Publication date
CN116977408A (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Dai et al. RGB-D SLAM in dynamic environments using point correlations
CN113168717B (en) A point cloud matching method and device, navigation method and equipment, positioning method, laser radar
CN111563442B (en) A slam method and system for fusion of point cloud and camera image data based on lidar
CN108564616B (en) Fast robust RGB-D indoor three-dimensional scene reconstruction method
Liao et al. SE-calib: Semantic edge-based LiDAR–camera boresight online calibration in urban scenes
WO2021196294A1 (en) Cross-video person location tracking method and system, and device
CN112132897A (en) A visual SLAM method for semantic segmentation based on deep learning
WO2019057179A1 (en) Visual slam method and apparatus based on point and line characteristic
Sahili et al. A survey of visual SLAM methods
CN114782499A (en) Image static area extraction method and device based on optical flow and view geometric constraint
CN117949942A (en) Target tracking method and system based on radar data and video data fusion
Ardeshir et al. Geo-semantic segmentation
Chen et al. A stereo visual-inertial SLAM approach for indoor mobile robots in unknown environments without occlusions
CN109974743A (en) A RGB-D visual odometry based on GMS feature matching and sliding window pose graph optimization
CN119863578B (en) Dynamic environment semantic SLAM method based on monocular vision and LiDAR fusion
CN118097265A (en) Visual SLAM optimization method in dynamic scenes based on deep learning and GPU acceleration
CN117576653A (en) Target tracking methods, devices, computer equipment and storage media
CN119206203A (en) A semantic visual SLAM method based on deep mask segmentation in dynamic environment
CN115345932A (en) A Laser SLAM Loop Closure Detection Method Based on Semantic Information
CN117824616A (en) A visual laser inertial SLAM positioning method integrating point features and line features
Zhang et al. A robust visual odometry based on RGB-D camera in dynamic indoor environments
Jia et al. DOA-SLAM: An efficient stereo visual slam system in dynamic environment
CN116977408B (en) Visual SLAM method based on dynamic target tracking and feature point filtering
Zhang et al. YGC-SLAM: A visual SLAM based on improved YOLOv5 and geometric constraints for dynamic indoor environments
Xu et al. 3D pose estimation for planes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant