Disclosure of Invention
In order to overcome the technical defects of the prior art, the invention provides a hierarchical path planning method for a topology-grid-metric hybrid map. The method divides path planning into a plurality of sub-processes and, by applying different path planning methods on the topological, grid and metric maps, solves the prior-art problems of a large search space, time-consuming planning and poor real-time obstacle avoidance when facing a partially unknown, complex, large-scale scene environment.
The invention is realized by the following technical scheme:
The hierarchical path planning method for the topology-grid-metric hybrid map comprises the following steps:
Step 1, creating a hybrid map of the mobile robot working environment according to a topology-grid-metric hybrid map construction method, the hybrid map comprising an abstract topological map, a concrete grid map and a fine metric map;
Step 2, before the mobile robot operates, generating an overall optimized global preset path according to a topology-grid hierarchical planning method;
Step 3, detecting the surrounding environment in real time while the mobile robot runs along the global preset path, and judging whether the distance from the robot to the nearest obstacle is smaller than an obstacle avoidance threshold; if so, entering step 4, and if not, entering step 5;
Step 4, carrying out local path planning on the metric map by using an improved deep deterministic policy gradient (Deep Deterministic Policy Gradient, DDPG) algorithm: first, a value-classified experience replay pool is constructed, and high-value samples, ordinary samples and recent samples are stored in its three sub replay pools respectively; a hierarchical sampling strategy randomly extracts an appropriate number of samples from the three sub replay pools for the learning and training of the DDPG-based local path planning; a continuous reward function comprising an arrival reward, a collision penalty and a process reward is introduced to reward or penalize the local path planned by the DDPG algorithm, training the DDPG algorithm to plan a better local path; after obstacle avoidance is completed through the local path planning, the robot returns to the global preset path;
Step 5, judging whether the mobile robot has reached the target position; if so, ending the algorithm, and if not, returning to step 3. A minimal sketch of this top-level loop follows.
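The following Python skeleton illustrates how steps 1 to 5 fit together. The robot, map and planner interfaces (hierarchical_planning_loop, plan_global, plan_local_ddpg, robot.track, robot.rejoin, the threshold values, and so on) are illustrative assumptions, since the text above does not disclose concrete programming interfaces; the sketch only mirrors the control flow of the steps.

```python
# Skeleton of the top-level loop of steps 1-5. The robot/map/planner
# interfaces are passed in as objects and callables because the text does not
# disclose concrete APIs; every name and numeric value below is an
# illustrative assumption.

def hierarchical_planning_loop(robot, hybrid_map, plan_global, plan_local_ddpg,
                               avoid_threshold=0.5, goal_tolerance=0.2):
    """hybrid_map is assumed to come from the step-1 construction method;
    avoid_threshold / goal_tolerance are example values in metres."""
    # Step 2: globally optimized preset path, generated before operation.
    global_path = plan_global(hybrid_map, robot.pose(), robot.goal())
    robot.track(global_path)

    while True:
        # Step 5: stop once the target position is reached.
        if robot.distance_to_goal() < goal_tolerance:
            return

        # Step 3: real-time check of the nearest obstacle distance.
        if min(robot.lidar_scan()) < avoid_threshold:
            # Step 4: local planning with the improved DDPG policy on the
            # metric map, then return to the global preset path.
            local_path = plan_local_ddpg(hybrid_map.metric_map,
                                         robot.pose(), global_path)
            robot.track(local_path)
            robot.rejoin(global_path)
```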
Further, the topology-grid-metric hybrid map construction method includes the steps of:
Step 1.1, rasterizing the whole working environment to construct a global grid map;
Step 1.2, dividing the global grid map into a plurality of sub-grid maps, each provided with key nodes, and using a hierarchical map to describe the working environment as a layered map representation in which each sub-map is a tree node; in the k-th layer map, each (k+1)-th layer sub-map is represented by its key node, and key nodes of the same layer are connected through offline prior paths;
Step 1.3, generating offline prior paths with a higher degree of safety on the grid map by a thinning (refinement) algorithm, and constructing the topological map corresponding to each level of sub-grid map by a topological feature extraction method;
Step 1.4, maintaining and updating a local metric map by using the on-board sensor information of the mobile robot during operation.
Further, the sub-map granularity should not be too fine; sub-maps are generally divided at the scale of individual rooms.
Further, when the thinning algorithm processes the two-dimensional grid map, a free grid with the value 0 is regarded as '1' in the image and an obstacle grid with the value 1 is regarded as '0', so that offline paths of the free area are finally generated.
Further, the topological feature extraction method takes the bifurcation points of the offline prior paths as common nodes of the topological graph, manually selects region entrance grids or other representative grids as the key nodes of each sub-map, and counts the offline prior path length between key nodes as the weight of the corresponding edge, the path length being the number of grids contained in the path.
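As an illustration of this feature extraction, the following self-contained Python sketch detects bifurcation points on a tiny, assumed skeleton grid and computes the grid-count path length between two manually chosen key nodes; the example grid, the use of 4-connectivity and the choice of key nodes are assumptions made only for the example.

```python
from collections import deque

# 1 = cell on the thinned offline prior path (skeleton), 0 = background.
# The tiny cross-shaped skeleton, the 4-connectivity and the manually chosen
# key nodes are illustrative assumptions.
skeleton = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

def neighbours(r, c):
    """4-connected skeleton neighbours of cell (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(skeleton) and 0 <= nc < len(skeleton[0]) \
                and skeleton[nr][nc] == 1:
            yield nr, nc

# Bifurcation points (three or more skeleton neighbours) become common nodes.
common_nodes = [(r, c)
                for r in range(len(skeleton))
                for c in range(len(skeleton[0]))
                if skeleton[r][c] == 1
                and sum(1 for _ in neighbours(r, c)) >= 3]

# Key nodes (e.g. region entrance grids) are selected manually; here the two
# ends of the horizontal corridor are assumed.
key_nodes = [(2, 0), (2, 4)]

def path_length(a, b):
    """Offline prior path length between a and b, counted in grid cells."""
    dist, queue = {a: 1}, deque([a])     # the path includes the start cell
    while queue:
        cur = queue.popleft()
        if cur == b:
            return dist[cur]
        for nb in neighbours(*cur):
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return None

# Edge weight between the key nodes = number of grids on the connecting path.
edges = {(key_nodes[0], key_nodes[1]): path_length(*key_nodes)}
print(common_nodes, edges)   # [(2, 2)] {((2, 0), (2, 4)): 5}
```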
Further, the topology-grid hierarchical planning method comprises the following steps:
Step 2.1, judging whether the initial grid S and the target grid D are in the same sub-map, if so, entering the step 2.2, and if not, entering the step 2.3;
Step 2.2, planning an overall optimized global preset path from the starting point S to the end point D by using the improved A* algorithm, and ending the algorithm;
Step 2.3, respectively obtaining the depths L_S and L_D of the starting point S and the end point D in the hierarchical map architecture and the key nodes C_S and C_D of the sub-grid maps where S and D are located; if L_S < L_D, searching upward layer by layer in the hierarchical map architecture from C_D until C_S is found in the topological map corresponding to layer L_S - 1; otherwise, searching upward from C_S until C_D is found in the topological map corresponding to layer L_D - 1; during the search, planning the optimal node sequence on the topological map of each layer by the Floyd algorithm, and saving the offline prior paths between the nodes as the middle-section preset path;
Step 2.4, taking the starting point S as the start and the key node C_S as the goal, searching a local path on the sub-grid map where S is located by the improved A* algorithm and saving it as the front-section preset path; then taking the key node C_D as the start and the end point D as the goal, searching another local path on the sub-grid map where D is located by the improved A* algorithm and saving it as the rear-section preset path;
Step 2.5, merging the middle-section preset path generated in step 2.3 with the front-section and rear-section preset paths searched in step 2.4 to obtain the overall optimized global preset path, and ending the algorithm. The Floyd search of step 2.3 is sketched below.
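Step 2.3 relies on the Floyd algorithm to plan the optimal node sequence on each layer's topological map. The following self-contained Floyd-Warshall sketch, with path reconstruction, shows how such a sequence can be obtained; the example graph, its node names and its weights are illustrative and not taken from the invention.

```python
INF = float("inf")

def floyd(nodes, edges):
    """edges: dict {(u, v): weight} for an undirected topological graph.
    Returns (dist, nxt) for shortest distances and node-sequence recovery."""
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = dist[v][u] = w
        nxt[u][v], nxt[v][u] = v, u
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def node_sequence(nxt, u, v):
    """Recover the optimal node sequence from u to v."""
    if nxt[u][v] is None:
        return None
    seq = [u]
    while u != v:
        u = nxt[u][v]
        seq.append(u)
    return seq

# Key nodes of one topological layer; weights = offline prior path lengths.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): 12, ("B", "C"): 7, ("A", "D"): 30, ("C", "D"): 9}
dist, nxt = floyd(nodes, edges)
print(node_sequence(nxt, "A", "D"), dist["A"]["D"])  # ['A', 'B', 'C', 'D'] 28
```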
Further, the improved A* algorithm introduces an expansion point screening strategy, a bidirectional search mechanism and a path redundant point elimination technique into the standard A* algorithm.
The expansion point screening strategy ignores nodes with weak directivity with a certain probability during expansion, as follows:
The rectangular area S_m enclosed by the current expansion point m, the starting point S and the end point D reflects the degree to which m complies with the expected path, as given by formula (1): the smaller S_m is, the higher the compliance of node m with the expected path, and the more opportunity it has to join the Open table for further expansion.
In formula (1), (x_S, y_S), (x_D, y_D) and (x_m, y_m) are the grid coordinates of the starting point S, the end point D and the current expansion point m, respectively;
An area threshold S_0 is set. When S_m ≤ S_0, m directly joins the Open table, which stores nodes that have been detected but not yet visited; when S_m > S_0, a random number p(m) between 0 and 1 is generated and compared with a trust threshold probability p_0: if p(m) > p_0, node m is ignored, otherwise it is added to the Open table.
The bidirectional search mechanism performs forward search and backward search simultaneously from the starting point and the end point; the two searches alternately expand, each taking the latest node obtained by the other as its target node, and continuously approach each other until their target nodes meet; finally, the paths generated by the forward and backward searches are spliced. The mechanism comprises the following steps:
Step 3.1, initializing the Open1, Open2, Close1 and Close2 lists, adding the starting point S into Open1, and adding the end point D into Open2;
Step 3.2, respectively taking out nodes n1 and n2 with the minimum cost value from the Open1 and Open2 lists, and respectively moving into Close1 and Close2;
Step 3.3, judging whether n1 or n2 has become the current target node of the opposite search direction, i.e., whether the two searches have met; if so, ending the algorithm and splicing the paths generated by the forward and backward searches into the final optimized path; if not, entering step 3.4;
Step 3.4, in the forward search, taking S as the start and n2 as the goal, expanding from n1 to its surrounding 8-neighborhood, and updating Open1 according to the expansion point screening rule;
Step 3.5, in the backward search, taking D as the start and n1 as the goal, expanding from n2 to its surrounding 8-neighborhood, and updating Open2 according to the expansion point screening rule;
Step 3.6, judging whether Open1 and Open2 are empty; if so, ending the algorithm with no valid path found, and if not, returning to step 3.2.
The path redundant point elimination technique specifically comprises the following steps:
Step 4.1, inserting a new node equidistant from each pair of adjacent nodes of the original path (i.e., at their midpoint);
Step 4.2, denoting the initial node as P_1 and the next node as P_2, numbering sequentially up to the end point P_k, with each node taking the node with the preceding number as its parent node;
Step 4.3, setting P_n as the optimization reference node and P_c as the node under exploration, initializing n to k and c to k-2;
Step 4.4, judging whether the line L_nc connecting P_n and P_c passes through an obstacle; if so, P_{c+1} cannot be deleted and P_n jumps to P_{c+1}; otherwise, P_{c+1} is deleted, P_c is set as the parent node of P_n, and P_c jumps to the parent node P_{c-1};
Step 4.5, repeating step 4.4 until P_n = P_1 or P_c = P_1;
Step 4.6, connecting the remaining nodes according to their parent node pointers; this pruning procedure is sketched below.
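The following self-contained Python sketch follows steps 4.1 to 4.6 on a small assumed grid; the grid, the demo path and the sampling-based obstacle check for the line L_nc are illustrative assumptions, while the pruning logic mirrors the numbered steps.

```python
GRID = [  # 0 = free, 1 = obstacle; illustrative environment
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

def is_blocked(p, q, samples=20):
    """Crude check: sample points along segment p-q and test their cells."""
    for i in range(samples + 1):
        t = i / samples
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        if GRID[int(round(x))][int(round(y))] == 1:
            return True
    return False

def prune(path):
    # Step 4.1: insert a new node at the midpoint of each adjacent pair.
    dense = []
    for a, b in zip(path, path[1:]):
        dense += [a, ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)]
    dense.append(path[-1])

    # Step 4.2: number the nodes P_1..P_k; each node's parent is its predecessor.
    k = len(dense)
    parent = list(range(-1, k - 1))

    # Step 4.3: P_n is the optimisation reference node (index n), P_c the node
    # under exploration (index c); n starts at k, c at k-2 (0-based: k-1, k-3).
    n, c = k - 1, k - 3

    # Steps 4.4-4.5: repeat until P_n or P_c reaches P_1.
    while n > 0 and c > 0:
        if is_blocked(dense[n], dense[c]):
            n = c + 1            # P_{c+1} cannot be deleted; P_n jumps to it
        else:
            parent[n] = c        # delete P_{c+1}: P_c becomes P_n's parent
            c -= 1               # P_c jumps to P_{c-1}

    # Step 4.6: connect the remaining nodes through the parent pointers.
    out, i = [], k - 1
    while i >= 0:
        out.append(dense[i])
        i = parent[i]
    return out[::-1]

raw = [(0, 0), (3, 0), (3, 4), (0, 4)]   # detour around the obstacle column
print(prune(raw))
```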
Further, the improved DDPG algorithm introduces a value-classified experience replay pool and a hierarchical sampling strategy, and designs a continuous reward function that fuses the target distance, safety, motion and number of steps.
The value-classified experience replay pool consists of three queues: a sub replay pool B1 storing high-value samples, a sub replay pool B2 storing recent samples and a sub replay pool B3 storing ordinary samples; each sub replay pool stores experience samples in the form of the five-tuple:
(S_t, a_t, r_t, S_{t+1}, done) (2)
Formula (2) indicates that the agent reaches a new state S_{t+1} after performing action a_t in state S_t while obtaining the environmental reward r_t, where done indicates whether S_{t+1} is a termination state.
The specific construction steps of the value-classified experience replay pool are as follows:
At initialization, the average value V_m of all sample values in the two sub replay pools B1 and B3 is set to 0, and the weight coefficient α of the reward value is set to 1. During training, an experience sample is generated at each time step and appended to the tail of the B2 queue; once B2 reaches its capacity limit, a sample is popped from the head of the queue and its value is calculated according to the sample value metric: if the value is higher than V_m, the sample is added to B1, otherwise it is stored in B3. After storage, V_m and α are updated, and the above process is repeated until training ends.
Taking the sample generated at time step t as an example, the sample value metric is calculated as:
V_t = α*|r_t| + (1-α)*|δ_t| (3)
δ_t = r_t + γQ′(S_{t+1}, a_{t+1}; θ_Q′) − Q(S_t, a_t; θ_Q) (4)
In formula (3), V_t is the value of the sample, |r_t| is the absolute value of the sample reward, α is the weight coefficient of |r_t|, |δ_t| is the absolute value of the sample temporal-difference error, and (1-α) is the weight of |δ_t|;
In formula (4), γ is the reward decay coefficient, Q′ is the state-action value of the target value network, θ_Q′ is the target value network parameter, Q is the state-action value of the online value network, θ_Q is the online value network parameter, S_t is the state at time step t, a_t is the action decided in that state, S_{t+1} is the new state reached by the agent after executing a_t in S_t, and a_{t+1} is the action in the new state;
The hierarchical sampling strategy randomly extracts an appropriate number of samples from the sub replay pools B1, B2 and B3 respectively for training the network.
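A minimal Python sketch of this replay pool and its hierarchical sampling is given below. The pool capacities, the running update of V_m and the decay schedule of α are illustrative assumptions, since the text above does not specify them; only the three-pool structure and the value metric of formulas (3)-(4) are taken from the description.

```python
import random
from collections import deque

class ValueClassifiedReplay:
    """Sketch of the value-classified experience replay pool with hierarchical
    sampling. Capacities, the running update of V_m and the decay of alpha are
    illustrative assumptions; the three-pool structure (B1 high value, B2
    recent, B3 ordinary) and the metric of formulas (3)-(4) follow the text."""

    def __init__(self, capacity=10000, recent_capacity=1000):
        self.b1 = deque(maxlen=capacity)         # high-value samples
        self.b2 = deque(maxlen=recent_capacity)  # recent samples
        self.b3 = deque(maxlen=capacity)         # ordinary samples
        self.v_mean = 0.0                        # V_m, initialised to 0
        self.alpha = 1.0                         # weight of |r_t|, initialised to 1
        self._count = 0

    def add(self, sample, td_error):
        """sample = (s, a, r, s_next, done); td_error = delta_t of formula (4),
        assumed to be supplied by the critic."""
        if len(self.b2) == self.b2.maxlen:       # B2 full: classify its oldest entry
            self._classify(self.b2.popleft())
        self.b2.append((sample, td_error))

    def _classify(self, entry):
        sample, td_error = entry
        value = self.alpha * abs(sample[2]) + (1 - self.alpha) * abs(td_error)
        (self.b1 if value > self.v_mean else self.b3).append(entry)
        # Update V_m as a running mean and slowly shift the weight from |r_t|
        # to |delta_t|; both update rules are assumptions.
        self._count += 1
        self.v_mean += (value - self.v_mean) / self._count
        self.alpha = max(0.1, self.alpha * 0.999)

    def sample(self, n1, n2, n3):
        """Hierarchical sampling: draw n1, n2 and n3 samples from B1, B2, B3."""
        batch = []
        for pool, n in ((self.b1, n1), (self.b2, n2), (self.b3, n3)):
            picked = random.sample(list(pool), min(n, len(pool)))
            batch.extend(s for s, _ in picked)
        return batch
```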
The continuous reward function comprises three parts: an arrival reward r_a, a collision penalty r_c and a process reward r_n. As shown in formula (5), the reward r_t takes the value r_a when the target is reached, r_c when a collision occurs, and r_n otherwise; the process reward is defined as:
r_n = r_d + r_s + r_m + r_p (6)
During navigation, the process reward r_n obtained by the robot for executing an action is divided into four parts, namely the distance reward r_d, the safety penalty r_s, the motion penalty r_m and the step number penalty r_p, as shown in formula (6).
r_d is the most important part of the process reward during navigation; it guides the robot toward the target point and is defined as:
r_d = k_1*(D_g(S_{t-1}) − D_g(S_t)) (7)
In formula (7), k_1 is the distance reward coefficient, D_g(S_{t-1}) represents the distance from the target point in the robot's previous state, and D_g(S_t) represents the distance from the target point in the robot's current state.
r_s warns the robot to stay as far away from obstacles as possible and to select a path with a higher degree of safety; it is defined by formula (8).
In formula (8), k_2 is the safety penalty coefficient, D_o is the distance between the robot and the nearest surrounding obstacle, d_s is the safety distance threshold, and s_r represents the safety degree of the path, defined by formula (9).
r_m penalizes frequent swinging of the robot and simplifies its motion control; it is defined by formula (10).
In formula (10), k_3 is the motion penalty coefficient, w_t represents the angular velocity output in the current state, and w_{t-1} represents the angular velocity output in the previous state.
In order to prevent the robot from blindly pursuing the maximum cumulative reward instead of heading directly toward the target, r_p is introduced to discourage blind exploration and indirectly stimulate the robot to find a better path; it is defined as:
r_p = k_4 (11)
In formula (11), k_4 is the step number penalty coefficient.
Compared with the prior art, the invention has at least the following beneficial effects or advantages:
The hierarchical path planning method for the topology-grid-metric hybrid map divides the whole working environment into sub-maps of different levels, applies different path planning methods on the respective sub-maps to generate paths, and finally splices them. The layering idea avoids a detailed description of the whole environment and only expands key-point information when needed; it limits the search space of the path planning algorithm to a single sub-map, reduces the scale of the path planning problem in large-scale scenes, and improves search efficiency.
Meanwhile, in terms of global path planning, the globally optimized initial path generated on the basis of the topological map and the grid map combines safety and smoothness, improving the quality of path planning. In terms of local path planning, aiming at the long training time and low utilization of experience samples of the DDPG algorithm, a value-classified experience replay pool, a hierarchical sampling strategy and a continuous reward function are designed, which accelerate the training process, guide the network to converge to a more reasonable obstacle avoidance strategy, help the robot better cope with emergencies, and make the method more suitable for dynamic unknown environments.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and specific embodiments, but the invention is not limited to these embodiments; all similar methods and similar variations using the present invention shall fall within the protection scope of the present invention.
The invention relates to a hierarchical path planning method for a topology-grid-metric hybrid map, which, as shown in fig. 1, specifically comprises the following steps:
Step 1, creating a hybrid map of the mobile robot working environment according to the topology-grid-metric hybrid map construction method.
The topology-grid-metric hybrid map completes the abstract representation, concrete representation and refined representation of the working environment from multiple dimensions through the topological map, the grid map and the metric map respectively, as shown in fig. 3. The construction method comprises the following steps:
Step 1.1, rasterizing the whole working environment to construct a global grid map.
A grid map is easy to obtain: a grid representation of the global environment can be acquired by completely sweeping the surrounding environment with sonar sensors or with tools such as the gmapping package in the Robot Operating System (ROS).
Step 1.2, dividing the global grid map into a plurality of sub-grid maps, setting key nodes in each sub-grid map, and using a hierarchical map to describe the working environment as a layered map representation taking the sub-maps as tree nodes; in the k-th layer map, each (k+1)-th layer sub-map is represented by its key node, and key nodes of the same level are connected through offline paths.
The hierarchical map is expressed as a multi-way tree containing several abstraction levels, with the level of abstraction decreasing layer by layer from top to bottom, as shown in fig. 4. The sub-map granularity should not be too fine; sub-maps are generally divided at the scale of rooms of 10-30 m².
Step 1.3, generating offline prior paths with a higher degree of safety on the grid map by the thinning algorithm, and constructing the topological map corresponding to each level by the topological feature extraction method.
Since the thinning algorithm only processes regions whose pixel value is 1, when the two-dimensional grid map is processed, a free grid with the value 0 is regarded as '1' in the image and an obstacle grid with the value 1 is regarded as '0', so that offline paths of the free area are finally generated. On this basis, topological features are extracted: the bifurcation points of the offline prior paths are taken as common nodes of the topological graph, region entrance grids or other representative grids are manually selected as the key nodes of each sub-map, and the offline prior path length between key nodes, counted as the number of grids contained in the path, is used as the weight of the corresponding edge; finally, the topological map corresponding to each level is constructed.
Step 1.4, maintaining and updating a local metric map by using the on-board sensor information of the mobile robot during operation.
The refined position representation is realized by the inertial navigation system of the mobile robot and by sensors that detect the surrounding environment, such as lidar, and a local metric map is maintained.
Step 2, before the mobile robot operates, generating an overall optimized global preset path according to the topology-grid hierarchical planning method.
The topology-grid hierarchical planning method comprises the following steps:
Step 2.1, judging whether the initial grid S and the target grid D are in the same sub-map, if so, entering the step 2.2, and if not, entering the step 2.3;
Step 2.2, planning an overall optimized global preset path from the starting point S to the end point D by using the improved A* algorithm, and ending the algorithm;
Step 2.3, respectively obtaining the depths L_S and L_D of the starting point S and the end point D in the hierarchical map architecture and the key nodes C_S and C_D of the sub-grid maps where they are located; if L_S < L_D, searching upward layer by layer in the hierarchical map architecture from C_D until C_S is found in the topological map corresponding to layer L_S - 1 (and conversely from C_S when L_S > L_D). During the search, the optimal node sequence is planned by the Floyd algorithm on the topological map of each level, and the offline prior paths between the nodes are saved as the middle-section preset path;
Step 2.4, taking the starting point S as the start and the key node C_S as the goal, a local path is searched on the sub-grid map where S is located by the improved A* algorithm and saved as the front-section preset path. Similarly, taking the key node C_D as the start and the end point D as the goal, another local path is searched on the sub-grid map where D is located by the improved A* algorithm and saved as the rear-section preset path. If no feasible solution can be found, path planning fails and the algorithm ends;
Step 2.5, merging the middle-section preset path generated in step 2.3 with the front-section and rear-section preset paths searched in step 2.4 to obtain the overall optimized global preset path, and ending the algorithm.
Step 3, detecting the surrounding environment in real time while the mobile robot runs along the global preset path, and judging whether the distance from the robot to the nearest obstacle is smaller than the obstacle avoidance threshold; if so, entering step 4, and if not, entering step 5.
The lidar mounted on the robot scans a full revolution to obtain a set of point cloud data that reflects the distances between the surrounding obstacles and the robot; when the minimum value of the point cloud data is smaller than the obstacle avoidance threshold, the robot is close to an obstacle and obstacle avoidance is required.
Step 4, carrying out local path planning on the metric map by using the improved deep deterministic policy gradient algorithm, and returning to the global preset path after obstacle avoidance is completed.
The metric map makes it easy to obtain the relative pose between two points, which is beneficial to the state-space representation. The distance point cloud data of the lidar, the current pose of the robot and the target pose are spliced into a state vector and input into the trained obstacle avoidance policy network, which outputs velocity commands end to end to control the robot's motion. When the robot detects an obstacle, a sub-target point behind the obstacle is selected on the initial path according to the direction of the obstacle, and an obstacle avoidance route is planned autonomously.
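The obstacle trigger of step 3 and the state splicing just described can be sketched as follows; the scan size, the pose layouts and the threshold value are illustrative assumptions.

```python
import numpy as np

AVOID_THRESHOLD = 0.5  # metres; assumed value of the obstacle avoidance threshold

def obstacle_detected(scan):
    """Step 3 trigger: the closest lidar return is inside the threshold."""
    return float(np.min(scan)) < AVOID_THRESHOLD

def build_state(scan, robot_pose, target_pose):
    """Splice the lidar ranges, current pose and target pose into one state
    vector, as described above; the pose layouts are assumptions."""
    return np.concatenate([
        np.asarray(scan, dtype=np.float32),         # distance point cloud
        np.asarray(robot_pose, dtype=np.float32),   # assumed (x, y, yaw)
        np.asarray(target_pose, dtype=np.float32),  # assumed (x, y)
    ])

# Example with a fake 24-beam scan in which one beam is closer than 0.5 m.
scan = np.full(24, 2.0)
scan[3] = 0.4
state = build_state(scan, (1.0, 2.0, 0.1), (5.0, 6.0))
print(obstacle_detected(scan), state.shape)   # True (29,)
```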
Step 5, judging whether the mobile robot has reached the target position; if so, ending the algorithm, and if not, returning to step 3.
The improved A* algorithm introduces an expansion point screening strategy, a bidirectional search mechanism and a path redundant point elimination technique into the standard A* algorithm.
(1) Expansion point screening strategy
The expansion point screening strategy ignores nodes with weak directivity with a certain probability during expansion, as follows:
Path planning requires the path length to be as short as possible, i.e., the path direction should be as close as possible to the direction from the starting point to the end point. The rectangular area S_m enclosed by the current expansion point m, the starting point S and the end point D reflects the degree to which m complies with the expected path, as indicated by the green rectangle in fig. 2 and given by formula (1): the smaller S_m is, the higher the compliance of node m with the expected path, and the more opportunity it has to join the Open table for further expansion.
In formula (1), (x_S, y_S), (x_D, y_D) and (x_m, y_m) are the grid coordinates of the starting point S, the end point D and the current expansion point m, respectively;
An area threshold S_0 is set. When S_m ≤ S_0, m directly joins the Open table; when S_m > S_0, a random number p(m) is generated and compared with the trust threshold probability p_0: if p(m) > p_0, node m is ignored, otherwise it is added to the Open table.
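A small Python sketch of this screening rule is given below. Because formula (1) is not reproduced above, the area S_m is approximated here by the parallelogram area spanned by the vectors from S to D and from S to m, which is likewise small when m complies with the S-to-D direction; this stand-in, together with the values of S_0 and p_0, is an assumption.

```python
import random

def area_measure(s, d, m):
    """Assumed stand-in for S_m of formula (1): the parallelogram area spanned
    by the vectors S->D and S->m. It is small when m lies close to the
    straight line from S to D."""
    return abs((d[0] - s[0]) * (m[1] - s[1]) - (d[1] - s[1]) * (m[0] - s[0]))

def keep_expansion_point(s, d, m, s0=4.0, p0=0.3):
    """Return True if expansion point m should be added to the Open table.
    s0 (area threshold) and p0 (trust threshold probability) are example
    values."""
    if area_measure(s, d, m) <= s0:
        return True                 # strong directivity: always kept
    p_m = random.random()           # random number p(m) in [0, 1)
    return p_m <= p0                # kept only with probability p0, else ignored

# Example: a point near the S-D line is kept, a far-off point usually is not.
S, D = (0, 0), (10, 10)
print(keep_expansion_point(S, D, (5, 5)), area_measure(S, D, (5, 9)))
```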
(2) Bidirectional search mechanism
The bidirectional search mechanism performs forward search and backward search simultaneously from the starting point and the end point; the two searches alternately expand, each taking the latest node obtained by the other as its target node, and continuously approach each other until their target nodes meet; finally, the paths generated by the forward and backward searches are spliced. The mechanism comprises the following steps:
Step 3.1, initializing the Open1, Open2, Close1 and Close2 lists, adding the starting point S into Open1, and adding the end point D into Open2;
Step 3.2, respectively taking out nodes n1 and n2 with minimum cost value F from the Open1 and Open2 lists, and respectively moving into Close1 and Close2;
further, the cost value F is calculated as follows:
F(n) = g(n) + h(n) (2)
In formula (2), F(n) is the estimated cost of reaching the target point d from the starting point s via node n, g(n) is the actual cost from the starting point s to node n, as given by formula (3), and h(n) is the estimated cost from node n to the target point d, as given by formula (4).
In formula (3), (x_n, y_n) and (x_s, y_s) are the coordinates of node n and the starting point s, respectively;
In formula (4), (x_d, y_d) are the coordinates of the end point d;
Step 3.3, judging whether n1 or n2 has become the current target node of the opposite search direction, i.e., whether the two searches have met; if so, ending the algorithm and splicing the paths generated by the forward and backward searches into the final optimized path; if not, entering step 3.4;
Step 3.4, in the forward search, taking S as the start and n2 as the goal, expanding from n1 to its surrounding 8-neighborhood, and updating Open1 according to the expansion point screening rule;
Step 3.5, in the backward search, taking D as the start and n1 as the goal, expanding from n2 to its surrounding 8-neighborhood, and updating Open2 according to the expansion point screening rule;
Step 3.6, judging whether Open1 and Open2 are empty; if so, ending the algorithm with no valid path found, and if not, returning to step 3.2.
It should be noted that the expansion point screening strategy is still followed when the bidirectional search expands toward the middle. A compact sketch of the bidirectional search is given below.
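The following self-contained sketch mirrors steps 3.1 to 3.6 on a small grid. The Euclidean heuristic and the 8-neighborhood follow the text, while the priority bookkeeping, the meeting test and the omission of the screening strategy are simplifying assumptions.

```python
import heapq
import math

def bidirectional_search(grid, start, goal):
    """Minimal sketch of the bidirectional search of steps 3.1-3.6 on a grid
    (0 = free, 1 = obstacle), using F = g + h with a Euclidean h."""
    rows, cols = len(grid), len(grid[0])

    def h(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def neighbours(p):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                r, c = p[0] + dr, p[1] + dc
                if (dr or dc) and 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
                    yield (r, c)

    def splice(meet, parent_fwd, parent_bwd):
        # Join the forward and backward partial paths at the meeting node.
        left, p = [], meet
        while p is not None:
            left.append(p)
            p = parent_fwd.get(p)
        right, p = [], parent_bwd.get(meet)
        while p is not None:
            right.append(p)
            p = parent_bwd.get(p)
        return left[::-1] + right

    # Step 3.1: initialise Open1/Open2 (priority queues) and Close1/Close2.
    open1, open2 = [(h(start, goal), start)], [(h(goal, start), goal)]
    g1, g2 = {start: 0.0}, {goal: 0.0}
    parent1, parent2 = {start: None}, {goal: None}
    close1, close2 = set(), set()

    while open1 and open2:                      # step 3.6: fail if a list empties
        # Step 3.2: take the minimum-F node from each Open list.
        _, n1 = heapq.heappop(open1)
        _, n2 = heapq.heappop(open2)
        close1.add(n1)
        close2.add(n2)

        # Step 3.3: the searches meet when one side reaches a node already
        # expanded by the other side.
        if n1 in close2:
            return splice(n1, parent1, parent2)
        if n2 in close1:
            return splice(n2, parent1, parent2)

        # Steps 3.4-3.5: expand both sides over the 8-neighbourhood, each
        # aiming at the other side's latest node.
        for n, g, parent, open_list, target in (
                (n1, g1, parent1, open1, n2),
                (n2, g2, parent2, open2, n1)):
            for nb in neighbours(n):
                cost = g[n] + h(n, nb)
                if nb not in g or cost < g[nb]:
                    g[nb] = cost
                    parent[nb] = n
                    heapq.heappush(open_list, (cost + h(nb, target), nb))
    return None                                 # no valid path found

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(bidirectional_search(grid, (0, 0), (2, 3)))
```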
(3) Path redundant point elimination technique
On the basis of the original path generated by the bidirectional search, the method further applies the path redundant point elimination technique, which specifically comprises the following steps:
Step 4.1, inserting a new node equidistant from each pair of adjacent nodes of the original path (i.e., at their midpoint);
Step 4.2, denoting the initial node as P_1 and the next node as P_2, numbering sequentially up to the end point P_k, with each node taking the node with the preceding number as its parent node;
Step 4.3, setting P_n as the currently optimized reference node and P_c as the node under exploration, initializing n to k and c to k-2;
Step 4.4, judging whether the line L_nc connecting P_n and P_c passes through an obstacle; if so, P_{c+1} cannot be deleted and P_n jumps to P_{c+1}; otherwise, P_{c+1} is deleted, P_c is set as the parent node of P_n, and P_c jumps to the parent node P_{c-1};
Step 4.5, repeating step 4.4 until P_n = P_1 or P_c = P_1;
Step 4.6, connecting the remaining nodes according to their parent node pointers.
The improved DDPG algorithm introduces a value-classified experience replay pool and a hierarchical sampling strategy, and designs a continuous reward function that fuses the target distance, safety, motion and number of steps.
(1) Value-classified experience replay pool
The value-classified experience replay pool consists of a sub replay pool B1 storing high-value samples, a sub replay pool B2 storing recent samples and a sub replay pool B3 storing ordinary samples; each sub replay pool stores experience samples in the form of the five-tuple:
(S_t, a_t, r_t, S_{t+1}, done) (5)
Formula (5) indicates that the agent reaches a new state S_{t+1} after performing action a_t in state S_t while obtaining the environmental reward r_t, where done indicates whether S_{t+1} is a termination state.
The specific construction steps of the value-classified experience replay pool are as follows:
At initialization, the average value V_m of all sample values in the two sub replay pools B1 and B3 is set to 0, and the weight coefficient α of the reward value is set to 1. During training, an experience sample is generated at each time step and appended to the tail of the B2 queue; once B2 reaches its capacity limit, a sample is popped from the head of the queue and its value is calculated according to the sample value metric: if the value is higher than V_m, the sample is added to B1, otherwise it is stored in B3. After storage, V_m and α are updated, and the above process is repeated until training ends.
Taking the sample generated at time step t as an example, the sample value metric is calculated as:
V_t = α*|r_t| + (1-α)*|δ_t| (6)
δ_t = r_t + γQ′(S_{t+1}, a_{t+1}; θ_Q′) − Q(S_t, a_t; θ_Q) (7)
In formula (6), V_t is the value of the sample, |r_t| is the absolute value of the sample reward, α is the weight coefficient of |r_t|, |δ_t| is the absolute value of the sample temporal-difference error, and (1-α) is the weight of |δ_t|;
In formula (7), γ is the reward decay coefficient, Q′ is the state-action value of the target value network, θ_Q′ is the target value network parameter, Q is the state-action value of the online value network, θ_Q is the online value network parameter, S_t is the state at time step t, a_t is the action decided in that state, S_{t+1} is the new state reached by the agent after executing a_t in S_t, and a_{t+1} is the action in the new state;
(2) Hierarchical sampling strategy
The hierarchical sampling strategy randomly extracts an appropriate number of samples from the sub replay pools B1, B2 and B3 respectively for training the network. The total number of samples in a mini-batch is denoted N, with N1 samples drawn from sub replay pool B1, N2 from B2 and N3 from B3.
N = N1 + N2 + N3 (8)
N2 is a fixed value, so that every mini-batch contains recent samples; this keeps the samples timely and helps adjust the convergence direction of the network in time. In the initial stage of training, N1 takes a larger value, and selecting a large number of high-quality samples pushes the network to converge in the correct direction. As training progresses, the robot performs better and high-value samples appear more and more frequently; to avoid overfitting, N1 is appropriately reduced and N3 is increased.
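A possible schedule for the mini-batch composition is sketched below; the linear decay of N1 and all numeric values are illustrative assumptions, since the text only states that N2 is fixed and that N1 shrinks while N3 grows.

```python
def batch_split(step, total_steps, n_total=64, n2=8, n1_start=40, n1_end=8):
    """Sketch of the mini-batch composition N = N1 + N2 + N3: N2 is fixed,
    N1 starts large and shrinks as training progresses while N3 grows. The
    linear schedule and all numbers are illustrative assumptions."""
    progress = min(1.0, step / float(total_steps))
    n1 = int(round(n1_start + (n1_end - n1_start) * progress))
    n3 = n_total - n2 - n1
    return n1, n2, n3

print(batch_split(0, 100_000))        # early training, e.g. (40, 8, 16)
print(batch_split(100_000, 100_000))  # late training,  e.g. (8, 8, 48)
```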
(3) Continuous reward function
The continuous reward function comprises three parts: an arrival reward, a collision penalty and a process reward, as shown in formula (9). When the distance between the robot center and the target position is smaller than the arrival threshold, the robot is judged to have reached the target point and receives a reward value of 300; when the minimum value of the lidar point cloud data of the robot is smaller than the collision threshold, the robot is judged to have collided and receives a reward value of -500; otherwise, the process reward r_n, which fuses the target distance, safety, motion and number of steps, is given, as defined in formula (10):
r_n = r_d + r_s + r_m + r_p (10)
During navigation, the process reward r_n obtained by the robot for executing an action is divided into four parts, namely the distance reward r_d, the safety penalty r_s, the motion penalty r_m and the step number penalty r_p.
r_d is the most important part of the process reward during navigation; it guides the robot toward the target point and is defined as:
r_d = k_1*(D_g(S_{t-1}) − D_g(S_t)) (11)
In formula (11), k_1 is the distance reward coefficient; in this example, k_1 = 20. D_g(S_{t-1}) represents the distance from the target point in the robot's previous state, and D_g(S_t) represents the distance from the target point in the robot's current state.
r_s warns the robot to stay as far away from obstacles as possible and to select a path with a higher degree of safety; it is defined by formula (12).
In formula (12), k_2 is the safety penalty coefficient; in this example, k_2 = -1. D_o is the distance between the robot and the nearest obstacle, d_s is the safety distance threshold, and s_r represents the safety degree of the path, defined by formula (13).
r_m penalizes frequent swinging of the robot and simplifies its motion control; it is defined by formula (14).
In formula (14), k_3 is the motion penalty coefficient; in this example, k_3 = -0.2. w_t represents the angular velocity output in the current state, and w_{t-1} represents the angular velocity output in the previous state.
In order to prevent the robot from blindly pursuing the maximum cumulative reward instead of heading directly toward the target, r_p is introduced to discourage blind exploration and indirectly stimulate the robot to find a better path; it is defined as:
r_p = k_4 (15)
In formula (15), k_4 is the step number penalty coefficient; in this example, k_4 = -0.05.
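Putting the pieces of formulas (9)-(15) together, a minimal sketch with the example coefficients is given below. The arrival and collision thresholds and the exact forms of the safety degree s_r and of the motion penalty are assumptions, since formulas (12)-(14) are not reproduced above.

```python
K1, K2, K3, K4 = 20.0, -1.0, -0.2, -0.05      # coefficients given in the text
ARRIVE_REWARD, COLLIDE_REWARD = 300.0, -500.0
D_ARRIVE, D_COLLIDE, D_SAFE = 0.2, 0.15, 0.6  # thresholds in metres, assumed

def reward(d_goal_prev, d_goal, d_obstacle, w, w_prev):
    """Continuous reward of formulas (9)-(15). The forms of the safety degree
    s_r and of the motion penalty are assumptions, since formulas (12)-(14)
    are not reproduced above."""
    if d_goal < D_ARRIVE:
        return ARRIVE_REWARD                        # arrival reward
    if d_obstacle < D_COLLIDE:
        return COLLIDE_REWARD                       # collision penalty
    r_d = K1 * (d_goal_prev - d_goal)               # distance reward, formula (11)
    s_r = max(0.0, (D_SAFE - d_obstacle) / D_SAFE)  # assumed safety degree s_r
    r_s = K2 * s_r                                  # safety penalty (assumed form)
    r_m = K3 * abs(w - w_prev)                      # motion penalty (assumed form)
    r_p = K4                                        # step number penalty, formula (15)
    return r_d + r_s + r_m + r_p                    # process reward, formula (10)

# One ordinary navigation step: the robot moved 0.1 m closer to the goal.
print(reward(d_goal_prev=2.0, d_goal=1.9, d_obstacle=1.0, w=0.3, w_prev=0.1))
```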
It should be noted that DDPG is a deep reinforcement learning algorithm; the value-classified experience replay pool and the continuous reward function are designed to accelerate the training process and to ensure that a high-quality obstacle avoidance strategy is finally learned. Training can be carried out in a simulation environment, and after training is completed the learned network parameters can be loaded for transfer to a real environment.
The foregoing is merely a preferred embodiment of the invention, and it should be noted that modifications could be made by those skilled in the art without departing from the principles of the invention, which modifications would also be considered to be within the scope of the invention.