Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a robot path planning and control method combining reinforcement learning with a recurrent network.
The aim of the invention can be achieved by the following technical scheme:
a robot path planning method combining reinforcement learning with a recurrent network, the method comprising:
constructing a recurrent network for generating a robot path, wherein the recurrent network generates the path points in the robot path sequentially;
training the recurrent network by a reinforcement learning method;
and performing robot path planning using the trained recurrent network.
Preferably, the recurrent network comprises a plurality of cascaded path recurrent network models, each outputting a path point; the input of each path recurrent network model comprises radar information, robot target point information and the previous path point.
Preferably, the specific method for generating the robot path by the recurrent network comprises the following steps:
establishing a robot local polar coordinate system, in which the robot's own coordinates are Q_0(0, 0) and the path point set formed by the path points in the robot path is Lp = {Q_1(ρ_1, α_1), Q_2(ρ_2, α_2), …, Q_n(ρ_n, α_n)}, where Q_i is the i-th path point, ρ_i and α_i are the displacement distance and rotation angle of the i-th path point Q_i relative to the (i-1)-th path point Q_{i-1}, and n is the total number of path points constituting the robot path;
acquiring the current radar information O_s, where O_s remains unchanged until path generation is completed;
determining the coordinates T(ρ_t, θ_t) of the robot target point T in the robot local polar coordinate system, where T remains unchanged until path generation is completed;
for the k-th path point to be generated, the radar information O_s, the robot target point T(ρ_t, θ_t) and the (k-1)-th path point Q_{k-1}(ρ_{k-1}, α_{k-1}) are input to the path recurrent network model, which outputs the k-th path point Q_k(ρ_k, α_k), k = 1, 2, …, n.
Preferably, training the recurrent network by reinforcement learning comprises:
starting a plurality of processes and training a plurality of robots simultaneously in a simulation map, each generating robot paths with the recurrent network;
and constructing a return function for each robot path, and updating and optimizing the path-generating recurrent network with a reinforcement learning algorithm.
Preferably, the return function of the robot path is expressed as r:
r = r_c + r_n + r_s
where r_c is the collision feedback between the generated robot path and obstacles, r_n is the feedback for approaching the target point, and r_s is the path smoothness feedback.
Preferably, the return function of the robot path is determined as follows:
establishing the robot local polar coordinate system, in which the robot's own coordinates are Q_0(0, 0) and the path point set is Lp = {Q_1(ρ_1, α_1), Q_2(ρ_2, α_2), …, Q_n(ρ_n, α_n)}, where Q_i is the i-th path point, ρ_i and α_i are the displacement distance and rotation angle of Q_i relative to the (i-1)-th path point Q_{i-1}, and n is the total number of path points constituting the robot path;
calculating the collision feedback r_c: r_c = -a if the generated path collides with an obstacle, r_c = +a if the generated path avoids the obstacles and reaches the target point, and r_c = 0 otherwise, where a is a constant;
calculating the target-approach feedback r_n = (d - s_i)/i, where d is the distance from the robot's current position to the target point and s_i is the distance from the generated i-th path point to the target point;
calculating the smoothness feedback r_s of the generated path, which penalizes the deviation of the rotation angles α_i, scaled by a constant b;
calculating r = r_c + r_n + r_s.
Preferably, the reinforcement learning algorithm comprises the proximal policy optimization (PPO) algorithm.
Preferably, performing robot path planning using the trained recurrent network comprises:
deploying the trained path recurrent network model on the mobile robot;
determining the current position of the robot, selecting a target point for the robot, starting the robot's laser radar, and acquiring radar information;
the robot generates path points sequentially from its current position, the radar information, the target point information and the path recurrent network model, and the generated path points are arranged in order to form the robot path point set.
A robot control method combining reinforcement learning with a recurrent network, the method comprising:
performing path planning for the robot with the above path planning method and determining a robot path from the current position to the target point, the path comprising a plurality of path points;
and controlling the robot to move sequentially along the planned path points.
Preferably, after the robot has moved along the planned path points for a period of time, path planning is performed again with the above method, the robot is controlled to move along the new robot path, and this process is repeated until the robot reaches the target point.
Compared with the prior art, the invention has the following advantages:
(1) Conventional path planning methods require global information in order to plan a global path. The path planning method of the invention combines reinforcement learning with a recurrent network and can generate a well-optimized local path from limited information; it therefore needs neither the global information required by traditional methods, nor a path prediction step, nor post-optimization of the generated path.
(2) Path generation is a rare task in reinforcement learning, and the invention develops a recurrent network model for it. In most cases the earlier part of a path influences the part generated later, and the causal dependence of each inference step on the preceding result in a recurrent network matches the process of generating path points sequentially. The invention trains the recurrent network model with the proximal policy optimization (PPO) method of reinforcement learning to plan the robot path. The planned path is optimized for the real-time environment; at the same time the method consumes little time, requires few inference steps, and can infer much about the unknown environment from limited local information, saving resources and improving efficiency.
(3) The control method of the invention shifts the focus of reinforcement learning to robot path planning: reinforcement learning generates a locally feasible path from the limited information available to the robot, and the robot is then controlled to move along that feasible path.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The following description of the embodiments is merely an example; the invention is not limited to the applications and uses described, nor to the following embodiments.
Examples
This embodiment provides a robot path planning method combining reinforcement learning with a recurrent network, comprising the following steps:
constructing a recurrent network for generating a robot path, the recurrent network generating the path points in the robot path sequentially;
training the recurrent network by a reinforcement learning method;
and performing robot path planning using the trained recurrent network.
The recurrent network comprises a plurality of cascaded path recurrent network models, each outputting a path point; the input of each path recurrent network model comprises radar information, robot target point information and the previous path point.
According to the invention, reinforcement learning is combined with a recurrent network to generate, from environment information, an optimal path that the wheeled robot can safely follow; the surrounding environment is explored through limited sensor information so that the robot can complete its moving task along the path.
The method specifically comprises the following steps:
s1, defining a path:
S1-1, a path is represented as a set of points, called the path point set; all points in the path point set together constitute the path. The path point set is Lp = {Q_1, Q_2, Q_3, …, Q_n}, where n is the maximum number of path points and Q_1, Q_2, Q_3, …, Q_n are the 1st, 2nd, 3rd, …, n-th path points, respectively.
S1-2, the robot local polar coordinate system is established with the robot's position as the origin and the robot's facing direction as the x axis. The robot's own coordinates are Q_0(0, 0) in this system.
S1-3, for i = 1, 2, …, n (n being the maximum number of path points, and also the length of the path point set), the displacement distance and rotation angle (ρ_i, α_i) of the i-th point Q_i relative to the (i-1)-th point Q_{i-1} are generated sequentially.
S1-4, the path point set can therefore be expressed as Lp = {Q_1(ρ_1, α_1), Q_2(ρ_2, α_2), …, Q_n(ρ_n, α_n)}, where Q_i denotes the i-th path point, ρ_i and α_i are the displacement distance and rotation angle of the i-th path point Q_i relative to the (i-1)-th path point Q_{i-1}, and n is the total number of path points constituting the robot path.
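As an illustration (not part of the patent text), the incremental polar representation of S1-4 can be converted to Cartesian coordinates in the robot's local frame; the function name and the interpretation of each α_i as a heading increment applied before travelling ρ_i are our assumptions:

```python
import math

def polar_path_to_xy(lp):
    """Convert incremental polar waypoints (rho_i, alpha_i) into Cartesian
    coordinates in the robot's local frame (robot at origin, facing +x).

    Each alpha_i rotates the heading relative to the previous point's
    heading; each rho_i is the displacement travelled along the new heading."""
    x, y, heading = 0.0, 0.0, 0.0
    points = []
    for rho, alpha in lp:
        heading += alpha              # rotate relative to previous heading
        x += rho * math.cos(heading)  # advance along the new heading
        y += rho * math.sin(heading)
        points.append((x, y))
    return points

# Example: two unit steps straight ahead, then a unit step after a 90-degree left turn
path = [(1.0, 0.0), (1.0, 0.0), (1.0, math.pi / 2)]
xy = polar_path_to_xy(path)
```

Under this reading, the third waypoint lands at roughly (2, 1): two units forward, then one unit to the left.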
S2, building the recurrent network:
S2-1, as shown in FIG. 1 and FIG. 2, the robot radar information O_s in this embodiment is obtained as follows:
the range of 90 degrees to either side of the robot's facing direction is taken as the effective area scanned by the robot radar and divided into 180 sectors, one distance value being recorded per 1-degree sector. The three most recent frames of sensor data scanned by the robot's radar sensor are taken together as O_s.
s2-2, acquiring robot target point information T:
T((ρt,θt)
S2-3, the path information Lp, where the path consists of n points:
Lp = [Q_1, Q_2, Q_3, …, Q_n]
S2-4, constructing the path recurrent network model; the input of the model comprises the radar information, the robot target point information and the previous path point. As shown in FIG. 3, the recurrent network is formed by cascading n path recurrent network models, each outputting one path point.
S3, generating the path Lp with the recurrent network:
As shown in FIG. 3, all path points are generated sequentially by the recurrent network, and together they form the path point set, producing the path Lp.
S3-1, the input data required by the recurrent network are as follows:
S3-1-1, the radar information O_s at the current moment. With the robot's facing direction as 0 degrees, the range of 90 degrees to either side, i.e. (-90°, 90°), spans 180 degrees and is divided into 180 dimensions; each dimension holds the distance from the obstacle to the radar sensor within that 1-degree sector, so the closer an obstacle is to the sensor, the smaller the value in that dimension. Each full rotation of the radar sensor updates the 180-dimensional radar data once; after three rotations the radar data has been updated three times, giving 3 × 180 dimensions of data (3 frames × 180 dimensions). O_s is therefore 540-dimensional. O_s is unchanged until path generation is completed.
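The frame stacking of S3-1-1 can be sketched as follows; the function name is illustrative, and the distance values are dummy data:

```python
import numpy as np

def build_radar_input(frames):
    """Stack the three most recent 180-dimensional radar frames into the
    540-dimensional observation O_s of S3-1-1. Each frame holds one obstacle
    distance per 1-degree sector over the (-90, 90) degree range in front
    of the robot."""
    assert len(frames) == 3 and all(f.shape == (180,) for f in frames)
    return np.concatenate(frames)   # shape (540,): 3 frames x 180 sectors

# dummy frames: an obstacle uniformly approaching over three rotations
frames = [np.full(180, 5.0), np.full(180, 4.5), np.full(180, 4.0)]
o_s = build_radar_input(frames)
```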
S3-1-2, the coordinates T(ρ_t, θ_t) of the robot target point T in the robot local polar coordinate system. T is unchanged until path generation is completed.
S3-1-3, the previous point Q_{k-1}(ρ_{k-1}, α_{k-1}). Before path generation is completed, when the i-th point Q_i is being generated, this input is the previous point Q_{i-1}.
S3-2, the path recurrent network model:
The structure of the path recurrent network model is shown in FIG. 4. The radar information O_s from step S3-1-1 is passed through two convolution layers to produce 256-dimensional data; this is concatenated with the two-dimensional target data T(ρ_t, θ_t) of S3-1-2 and the two-dimensional previous point Q_{k-1}(ρ_{k-1}, α_{k-1}) of S3-1-3 into 260-dimensional data, which is fed to the fully connected layer for processing, outputting the point Q_k(ρ_k, α_k).
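The dimensional flow of S3-2 (540 → 256, concatenate to 260, → 2) can be sketched as follows. A random linear map stands in for the two convolution layers, since only the dimensions are specified here; all names, weight shapes and the tanh nonlinearity are our assumptions, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(0)

class PathStepModel:
    """Dimensional sketch of the path recurrent network model (S3-2)."""

    def __init__(self):
        # stand-in for the two convolution layers encoding O_s to 256 dims
        self.w_radar = rng.normal(0.0, 0.01, (540, 256))
        # fully connected head: 260 -> 2 (rho_k, alpha_k)
        self.w_fc = rng.normal(0.0, 0.01, (260, 2))

    def forward(self, o_s, target, q_prev):
        feat = np.tanh(o_s @ self.w_radar)          # 540 -> 256
        x = np.concatenate([feat, target, q_prev])  # 256 + 2 + 2 = 260
        return x @ self.w_fc                        # 260 -> 2: Q_k

model = PathStepModel()
q_k = model.forward(np.ones(540), np.array([3.0, 0.1]), np.zeros(2))
```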
S3-3, the output of the recurrent network:
S3-3-1, at the k-th step, the output point Q_k(ρ_k, α_k). After generating path point k, the robot updates its own state, moving its virtual predicted position and orientation (its state information) to the newly generated path point k; it then estimates the next point Q_{k+1}(ρ_{k+1}, α_{k+1}) by combining this virtual predicted pose with the radar information O_s and the target point information T, which do not change during path generation.
S3-3-2, after all n points of the path point set have been generated, the generation of the path point set and the path Lp is complete; the radar data O_s and the target point data T are then updated.
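The unrolling of S3-3 can be sketched as follows. Only the structure is taken from the text: O_s and T stay fixed for the whole unroll while each generated waypoint Q_k is fed back as the "previous point" input of step k+1. The trivial stand-in model (fixed step length, turning gently toward the target bearing) is entirely illustrative:

```python
import numpy as np

def step_model(o_s, target, q_prev):
    """Illustrative stand-in for the trained path recurrent network model:
    proposes a fixed-length step turned slightly toward the target bearing."""
    rho = 0.5
    alpha = 0.2 * (target[1] - q_prev[1])
    return np.array([rho, alpha])

def generate_path(o_s, target, n):
    """S3-3: unroll n cascaded steps; O_s and T are held fixed, and each
    output Q_k becomes the previous-point input of the next step."""
    q_prev = np.zeros(2)   # Q_0 = the robot's own pose (0, 0)
    lp = []
    for _ in range(n):
        q_prev = step_model(o_s, target, q_prev)
        lp.append((float(q_prev[0]), float(q_prev[1])))
    return lp

lp = generate_path(np.zeros(540), np.array([3.0, 0.8]), n=5)
```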
S4, training phase:
Training the recurrent network by reinforcement learning comprises:
starting a plurality of processes and training a plurality of robots simultaneously in the simulation map, each generating robot paths with the recurrent network;
and constructing a return function for each robot path, and updating and optimizing the path-generating recurrent network with a reinforcement learning algorithm.
The method specifically comprises the following steps:
S4-1, after a path is generated, it is evaluated with the return function, expressed as r:
r = r_c + r_n + r_s
where r_c is the collision feedback between the generated robot path and obstacles, r_n is the feedback for approaching the target point, and r_s is the path smoothness feedback.
The collision feedback r_c is determined as follows, where a is a constant taking the value 10 in this embodiment: if the generated path collides with a scanned obstacle in the robot local polar coordinate system, r_c = -10; if the generated path avoids the obstacles and finally reaches the target point, r_c = +10; in all other cases r_c = 0. The collision here is a collision of the generated path with an obstacle, not a collision with an obstacle during the robot's actual travel.
The target-approach feedback is r_n = (d - s_i)/i, where d is the distance from the robot's current position to the target point, s_i is the distance from the generated i-th path point to the target point, and d - s_i is how much closer the generated i-th path point is to the target point than the robot's current position. If d - s_i > 0, the generated path point is closer to the target point than the robot, so the feedback is positive. Since the generated path point tends to be closer to the target as i grows, dividing by i balances the feedback over all points, so that the contribution of each point to the overall path is measured more fairly.
The smoothness feedback r_s of the generated path reflects the smoothness of the robot's path through the magnitude of the deviation of α_i from the current direction, scaled by a constant b, which takes the value 0.0005 in this embodiment.
Then r = r_c + r_n + r_s is calculated.
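The return function of S4-1 can be sketched as follows. The r_c cases and r_n = (d - s_i)/i come from the text; for r_s the patent only states that it penalizes the deviation of the α_i scaled by b, so the summed squared-turn penalty below is an assumed form:

```python
def path_reward(collided, reached_goal, d, s, alphas, a=10.0, b=0.0005):
    """Return function r = r_c + r_n + r_s (S4-1).
    d: distance from the robot's current position to the target point;
    s: list of waypoint-to-target distances s_i;
    alphas: list of rotation angles alpha_i.
    r_s uses an ASSUMED squared-turn form; a=10 and b=0.0005 are the
    embodiment's constants."""
    if collided:
        r_c = -a            # generated path crosses a scanned obstacle
    elif reached_goal:
        r_c = a             # path avoids obstacles and reaches the target
    else:
        r_c = 0.0
    # (d - s_i)/i balances the feedback over all points
    r_n = sum((d - s_i) / (i + 1) for i, s_i in enumerate(s))
    r_s = -b * sum(alpha ** 2 for alpha in alphas)   # assumed smoothness form
    return r_c + r_n + r_s

r = path_reward(False, True, 5.0, [4.0, 3.0, 2.0], [0.1, 0.0, -0.1])
```

With the sample values above, r_n = 1 + 1 + 1 = 3 and r_c = 10, so r is just under 13.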
S4-2, a plurality of processes are started and a plurality of robots are trained simultaneously in the simulation map; using the paths generated by these robots and the return function r defined in S4-1, the path-generating recurrent network is updated and optimized with the reinforcement learning proximal policy optimization (PPO) algorithm.
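The core of the PPO update used in S4-2 is the clipped surrogate objective, which can be written out as follows. The clipping threshold eps = 0.2 is the usual default, not a value the patent specifies:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective: mean over samples of
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), where ratio is
    pi_new(a|s) / pi_old(a|s) and A is the estimated advantage."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))

# A large ratio with positive advantage is clipped at 1 + eps = 1.2
obj = ppo_clip_objective(np.array([2.0]), np.array([1.0]))
```

Maximizing this objective keeps each policy update close to the policy that generated the paths, which is what makes the multi-process data collection of S4-2 stable.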
S5, execution stage:
S5-1, deploying the trained path recurrent network model on the mobile robot.
S5-2, selecting a target point for the robot, starting the robot's laser radar, and acquiring radar information.
S5-3, the robot generates a followable path Lp at the current moment from the radar information O_s, the target point information T and the path recurrent network model M.
Based on the above path planning method, as shown in FIG. 5, this embodiment also provides a robot control method combining reinforcement learning with a recurrent network, comprising:
performing path planning for the robot with the above path planning method and determining a robot path from the current position to the target point, the path comprising a plurality of path points;
the robot controller then drives the robot, issuing information such as linear velocity and angular velocity, and controls the robot to move sequentially along the planned path points.
After the robot has moved along the planned path points for a period of time Δt, a path Lp' at the new position is generated from the new radar data O_s', the target point data T', the starting position Q_0' and the network model M, and the motion controller issues new commands according to the new path; this repeats until the robot reaches the target point.
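The execute-replan loop above can be sketched as follows; `plan`, `step` and `at_goal` are caller-supplied callables, and all names here are illustrative rather than taken from the patent:

```python
def follow_with_replanning(plan, step, at_goal, max_cycles=50):
    """Sketch of the execute-replan loop: plan a path from the current
    pose, follow it for a period delta t, then replan from the new pose,
    until the goal is reached (or a safety cap on cycles is hit)."""
    cycles = 0
    while not at_goal() and cycles < max_cycles:
        path = plan()    # new Lp' from the new O_s', T', Q_0' and model M
        step(path)       # controller issues linear/angular velocity
        cycles += 1      # commands for a period delta t
    return cycles

# toy stand-in environment: the goal is reached after 3 replanning cycles
state = {"pos": 0}
cycles = follow_with_replanning(
    plan=lambda: [state["pos"] + 1],
    step=lambda path: state.update(pos=path[0]),
    at_goal=lambda: state["pos"] >= 3,
)
```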
The robot is controlled on the basis of this path planning, so the control method of the invention shifts the focus of reinforcement learning to robot path planning: reinforcement learning generates a locally feasible path from the limited information available to the robot, and the robot is then controlled to move along that path. Unlike end-to-end methods that output robot motion commands directly, the invention adds a controller that moves the robot along the generated locally feasible path, so that the robot can find the target point in complex scenes and its motion control is realized.
The above embodiments are merely examples, and do not limit the scope of the present invention. These embodiments may be implemented in various other ways, and various omissions, substitutions, and changes may be made without departing from the scope of the technical idea of the present invention.